### Tukey, not TuRkey (that's for tomorrow), on outliers and multiple comparisons

John Tukey contributed many things to statistics and data visualizations (box plots!).  I also cannot help thinking "TuRkey" every time I read or verbalize his name.  In my (biased, failable human) mind, the frequency with which I think to use Tukey methods in my own work, and the potential for an embarrassing Tukey-TuRkey switch during regular communication, increases across the year to look something like the figure to the right.

 See - it's not just me who mixes these up.
In this post, I discuss a few ways to use some of Tukey's contributions to help with some analyses I've been recently running in Stata. I use (a simulated version of) data from student survey responses, discipline involvement, and exam performance to do Tukey-related things like assess outliers and test for differences across multiple groups. This post includes some basic code run on fake data but in addition to what's presented below you could also consider using Stata programs (some from SSC) such as:  -extremes-, -winsor2-, -rvfplot-, -lvr2dplot-, -rreg-/-qreg- to help examine outliers.   (note: I'm not advocating any sort of outlier/extreme value deletion. Influential observations (especially those with large residuals) can be a pain, but they are a part of the data (unless they are the result of data entry error) and so you should include them. Though it's important to understand their influence, source, and consider ways to incorporate them with the least bias.)

Let's just get to the code.

To run the code below, copy and paste into a data editor (sans the annotations and commentary).

//First create some fake data to work with:
clear
set obs 1000
for X in num 1/5: g surveyX = int(1+runiform()*4)
lab def survey 1 "S.Disagree" 2 "Disagree" 3 "Agree" 4 "S.Agree", modify
lab val survey? survey
g id = _n
g group = int(1+runiform()*4)
g discipline_incidents = cond(inrange(_n, 700, 800) & group !=3, rpoisson(9), rpoisson(12), .)
bys group: replace discipline = 31  if inlist(group, 3, 4) & _n>=`=_N-3' //creates  strong outliers
g score = cond(_n<900 | group ==3, 25+runiform()*70, 75+runiform()*20 , .)
sort score

//Next, take a look at the data:
hist score, scheme(plottig) normal xlab(20(20)100) percent //get the plottig scheme from SSC!
hist discipline_incidents, scheme(plottig) normal xlab(0(5)30) percent by(group)

codebook, c

Here's what our data look like including the histograms of score and discipline_incidents from the code above:

Variable Obs Unique Mean Min Max Label
-------------------------------------------------------------------------------------------------------
survey1  1000 4     2.595 1 4
survey2  1000 4     2.5   1 4
survey3  1000 4     2.551 1 4
survey4  1000 4     2.448 1 4
survey5  1000 4     2.44  1 4
id       1000 1000  500.5 1 1000
group    1000       4     2.447 1 4
discipline~s 1000   22    11.773 3 25
score    1000 1000  62.4  25.0  94.92088
-------------------------------------------------------------------------------------------------------

Distribution of the exam score:

Distribution of the frequency of discipline incidents by group:

Before we get to Tukey methods of examining outliers and making multiple group comparisons, here's a quick reminder of some ways to examine outliers via residuals and leverage.

How influential are the discipline incident outliers?

. tostring group, g(group2)

. bys group2: extremes discipline //get extremes from SSC

-------------------------------------------------------------------------------------------------------
-> group2 = 1

+-----------------+
| obs:   discip~s |
|-----------------|
|   1.          4 |
|  32.          4 |
|  40.          4 |
| 155.          4 |
| 189.          4 |
+-----------------+

+-----------------+
| 109.         20 |
| 224.         20 |
| 244.         20 |
|  17.         23 |
|   7.         25 |
+-----------------+

note: 6 values of 4
note: 5 values of 20

-------------------------------------------------------------------------------------------------------
-> group2 = 2

+-----------------+
| obs:   discip~s |
|-----------------|
| 330.          4 |
| 486.          4 |
| 259.          5 |
| 375.          5 |
| 457.          5 |
+-----------------+

+-----------------+
| 351.         19 |
| 459.         19 |
| 296.         20 |
| 485.         21 |
| 518.         22 |
+-----------------+

note: 5 values of 19

-------------------------------------------------------------------------------------------------------
-> group2 = 3

+-----------------+
| obs:   discip~s |
|-----------------|
| 624.          3 |
| 742.          4 |
| 528.          5 |
| 580.          5 |
| 563.          6 |
+-----------------+

+-----------------+
| 723.         20 |
| 738.         20 |
| 652.         21 |
| 728.         21 |
| 708.         25 |
+-----------------+

note: 3 values of 6
note: 3 values of 20

-------------------------------------------------------------------------------------------------------
-> group2 = 4

+-----------------+
| obs:   discip~s |
|-----------------|
| 837.          3 |
| 980.          3 |
| 931.          4 |
| 805.          5 |
| 971.          5 |
+-----------------+

+-----------------+
| 945.         19 |
| 886.         20 |
| 918.         20 |
| 964.         21 |
| 800.         22 |
+-----------------+

. **examine residuals and leverage (no figures included)
. regress discipline_incidents  i.group survey*

Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(8, 991)       =      2.53
Model |  241.119939         8  30.1399924   Prob > F        =    0.0100
Residual |  11814.3511       991  11.9216459   R-squared       =    0.0200
Total |   12055.471       999  12.0675385   Root MSE        =    3.4528

------------------------------------------------------------------------------
discipline~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
group |
2  |   .5011374     .30258     1.66   0.098    -.0926336    1.094908
3  |   1.102353   .3084697     3.57   0.000     .4970245    1.707682
4  |   .8773734   .3156235     2.78   0.006     .2580062    1.496741
|
survey1 |  -.2206665   .0975013    -2.26   0.024    -.4119992   -.0293337
survey2 |  -.0614341   .0977131    -0.63   0.530    -.2531824    .1303143
survey3 |   .0163709   .0979074     0.17   0.867    -.1757588    .2085006
survey4 |  -.0209857   .0983404    -0.21   0.831    -.2139651    .1719936
survey5 |   .0232027   .0981865     0.24   0.813    -.1694747    .2158801
_cons |   11.84436   .6080077    19.48   0.000     10.65123    13.03749
------------------------------------------------------------------------------

. predict p,
(option xb assumed; fitted values)

. egen sd = sd(p)

. egen mean = mean(p)

. replace p = (p-mean/sd)^2

. predict cook, cooksd

. tw (scatter discipline_incidents p [fweight=cook]) ///
>         (scatter discipline_incidents p, mlabel(group2) msym(none))

.

. **Tukey Method
John Tukey’s IQR method for finding outliers is applicable to most ranges (since it isn’t dependent on distributional assumptions); it ignores the mean and standard deviation, making it resistant to being influenced by the extreme values in the range.  Tukey’s IQR method is just a guideline or set of
cut points where outliers are generally values that are 1.5xIQR lower than the first Quartile and higher than the third Quartile, or for the 'discipline_incidents' var we can :

. su discipline_incidents, d

. g q1 = `r(p25)'

. g q3 = `r(p75)'

. g iqr = q3-q1

. **let's look at the boxplot to see the IQR

.
.  replace q1 = (q1 - (1.5*iqr))

. replace q3 = (q3 + (1.5*iqr))

. cap drop outlier

. g outlier = 1 if discipline_incidents<q1
(1,000 missing values generated)

. replace outlier = 2 if discipline_incidents>q3

. sort  group outlier

. groups discipline  group outlier  if !mi(outlier), sepby(outlier) //-groups- from ssc

+----------------------------------------------+
| discip~s   group   outlier   Freq.   Percent |
|----------------------------------------------|
|       22       2         2       1     20.00 |
|       22       4         2       1     20.00 |
|       23       1         2       1     20.00 |
|       25       1         2       1     20.00 |
|       25       3         2       1     20.00 |
+----------------------------------------------+

. //Now, Tukey Pairwise comparisons in Stata

. tabstat discipline, by(group) stat(n mean sd)

Summary for variables: discipline_incidents
by categories of: group

group |         N      mean        sd
---------+------------------------------
1 |       255  11.16078  3.500511
2 |       270  11.68148  3.343199
3 |       248   12.2621  3.520883
4 |       227  12.03524  3.457529
---------+------------------------------
Total |      1000    11.773  3.473836
----------------------------------------

. oneway discipline group, sidak bonferroni scheffe

Analysis of Variance
Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      172.773978      3    57.591326      4.83     0.0024
Within groups       11882.697    996   11.9304187
------------------------------------------------------------------------
Total            12055.471    999   12.0675385

Bartlett's test for equal variances:  chi2(3) =   0.8345  Prob>chi2 = 0.841

Comparison of discipline~s by group
(Bonferroni)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
2 |    .520697
|      0.508
|
3 |    1.10131    .580615
|      0.002      0.338
|
4 |    .874458    .353761   -.226854
|      0.034      1.000      1.000

Comparison of discipline~s by group
(Scheffe)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
2 |    .520697
|      0.395
|
3 |    1.10131    .580615
|      0.005      0.302
|
4 |    .874458    .353761   -.226854
|      0.053      0.731      0.916

Comparison of discipline~s by group
(Sidak)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
2 |    .520697
|      0.412
|
3 |    1.10131    .580615
|      0.002      0.294
|
4 |    .874458    .353761   -.226854
|      0.033      0.830      0.979

. anova discipline group

Number of obs =      1,000    R-squared     =  0.0143
Root MSE      =    3.45404    Adj R-squared =  0.0114

Source | Partial SS         df         MS        F    Prob>F
-----------+----------------------------------------------------
Model |  172.77398          3   57.591326      4.83  0.0024
|
group |  172.77398          3   57.591326      4.83  0.0024
|
Residual |  11882.697        996   11.930419
-----------+----------------------------------------------------
Total |  12055.471        999   12.067539

.         *get these using -findit-:
. tukeyhsd group

Tukey HSD pairwise comparisons for variable group
studentized range critical value(.05, 4, 996) = 3.6394906
uses harmonic mean sample size =  249.022

mean
grp vs grp       group means           dif    HSD-test
-------------------------------------------------------
1 vs   2    11.1608    11.6815      0.5207   2.3789
1 vs   3    11.1608    12.2621      1.1013   5.0316*
1 vs   4    11.1608    12.0352      0.8745   3.9951*
2 vs   3    11.6815    12.2621      0.5806   2.6526
2 vs   4    11.6815    12.0352      0.3538   1.6162
3 vs   4    12.2621    12.0352      0.2269   1.0364

. tkcomp group

Tukey-Kramer pairwise comparisons for variable group
studentized range critical value(.05, 4, 996) = 3.6394906

mean
grp vs grp       group means          dif     TK-test
-------------------------------------------------------
1 vs   2    11.1608    11.6815      0.5207   2.4414
1 vs   3    11.1608    12.2621      1.1013   5.0560*
1 vs   4    11.1608    12.0352      0.8745   3.9236*
2 vs   3    11.6815    12.2621      0.5806   2.7028
2 vs   4    11.6815    12.0352      0.3538   1.6085
3 vs   4    12.2621    12.0352      0.2269   1.0112

. fhcomp group

Fisher-Hayter pairwise comparisons for variable group
studentized range critical value(.05, 3, 996) = 3.3194964

mean
grp vs grp       group means          dif     FH-test
-------------------------------------------------------
1 vs   2    11.1608    11.6815      0.5207   2.4414
1 vs   3    11.1608    12.2621      1.1013   5.0560*
1 vs   4    11.1608    12.0352      0.8745   3.9236*
2 vs   3    11.6815    12.2621      0.5806   2.7028
2 vs   4    11.6815    12.0352      0.3538   1.6085
3 vs   4    12.2621    12.0352      0.2269   1.0112

What does all this mean?
The Tukey method for multiple group comparisons is useful even when cell sizes are unequal. In the output for the simulated student data, Comparisons 1 versus 3, and 1 versus 4 were significant (except for Scheffe) (n.b., the Tukey and the FH tests are the same but they use different critical values of the Studentized range distribution).

Here's the results for  -pwcompare- using the Tukey multiple comparisons method (which are calculated differently than the FH/HSD tests above -- see the Stata User's Manual for -pwcompare- for a detailed description of this difference) and finding significant contrasts between the same groups as above:

. regress discipline i.group

Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(3, 996)       =      4.83
Model |  172.773978         3   57.591326   Prob > F        =    0.0024
Residual |   11882.697       996  11.9304187   R-squared       =    0.0143
Total |   12055.471       999  12.0675385   Root MSE        =     3.454

------------------------------------------------------------------------------
discipline~s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
group |
2  |   .5206972   .3016168     1.73   0.085    -.0711801    1.112574
3  |   1.101312   .3080462     3.58   0.000     .4968184    1.705807
4  |    .874458    .315187     2.77   0.006     .2559511    1.492965
|
_cons |   11.16078   .2163006    51.60   0.000     10.73633    11.58524
------------------------------------------------------------------------------

. pwcompare group, effects mcompare(tukey) //uses the Studentized range distribution instead of the t d
> istribution.

Pairwise comparisons of marginal linear predictions

Margins      : asbalanced

---------------------------
|    Number of
|  Comparisons
-------------+-------------
group |            6
---------------------------

------------------------------------------------------------------------------
|                              Tukey                Tukey
|   Contrast   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
group |
2 vs 1  |   .5206972   .3016168     1.73   0.310    -.2554755     1.29687
3 vs 1  |   1.101312   .3080462     3.58   0.002     .3085944    1.894031
4 vs 1  |    .874458    .315187     2.77   0.029      .063364    1.685552
3 vs 2  |   .5806153   .3037981     1.91   0.224    -.2011706    1.362401
4 vs 2  |   .3537608   .3110364     1.14   0.666    -.4466521    1.154174
4 vs 3  |  -.2268545    .317275    -0.72   0.891    -1.043322    .5896127
------------------------------------------------------------------------------

Happy TuRkey, not Tukey, Day!