428 Conagin et al.
MODIFICATIONS FOR THE TUKEY TEST PROCEDURE
AND EVALUATION OF THE POWER AND EFFICIENCY OF
MULTIPLE COMPARISON PROCEDURES
Armando Conagin¹; Décio Barbin²*; Clarice Garcia Borges Demétrio²
¹ IAC - C.P. 28 - 13001-970 - Campinas, SP - Brasil.
² USP/ESALQ, Depto. de Ciências Exatas, C.P. 09 - 13418-900 - Piracicaba, SP - Brasil.
*Corresponding author <debarbin@esalq.usp.br>
ABSTRACT: Multiple pairwise comparison tests of treatment means are of great interest in applied
research. Two modifications of the Tukey test are proposed. The power of the unilateral and bilateral
Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett
tests, and of the modified Sidak, Bonferroni 1 and 2, and Tukey 1 and 2 tests, was compared
using the Monte Carlo method. Data were generated for 600 experiments with eight treatments in a
randomized block design, of which 400 had four blocks and 200 had eight blocks. The differences between
the treatment means and the control mean were 30%, 20%, 15%, 10% and 5%; two extra treatments did not
differ from the control. A coefficient of variation of 10% and a Type I error probability of α = 0.05 were
adopted. The power of all tests decreased as the differences from the control decreased. The
unilateral and bilateral Student t, Waller-Duncan and Duncan tests showed the greatest number of
significant differences, followed by the unilateral Dunnett, modified Sidak, modified Bonferroni 1 and 2,
modified Tukey 1, SNK, REGWF, REGWQ, modified Tukey 2, Tukey, Sidak and Bonferroni tests. All tests
lose considerable efficiency relative to the unilateral Student t test, applied to each difference between
a treatment and the control, as the differences between means decrease. The modified tests were
always more efficient than their original versions.
Key words: multiple comparison statistical tests, type I errors, Monte Carlo method, power of tests
MODIFICATIONS TO THE TUKEY TEST PROCEDURE
AND POWER AND EFFICIENCY OF
MULTIPLE COMPARISON TESTS
RESUMO: Multiple comparison tests between treatment means are of great interest in applied
research. Two proposed modifications of the Tukey test are presented and, using Monte Carlo
simulation, the power of the following statistical tests was compared: unilateral and bilateral
Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett,
together with the modified Sidak, Bonferroni 1 and 2, and Tukey 1 and 2 tests. Data were generated for
600 experiments in a randomized block design with eight treatments, 400 with four
replicates and 200 with eight replicates. A coefficient of variation of 10% and a Type I error
probability of α = 0.05 were adopted. The differences between the treatment means and the control were
30%, 20%, 15%, 10% and 5%; two treatments that, parametrically, did not differ from the
control mean were also included. For all tests, power decreased as the differences between the means
and the control mean decreased; in order, the unilateral Student t, bilateral Student t and
Waller-Duncan tests showed the greatest number of significant differences, followed by the Duncan,
unilateral Dunnett, modified Sidak, modified Bonferroni 1 and 2, modified Tukey 1, SNK, REGWF,
REGWQ, modified Tukey 2, Tukey, Sidak and Bonferroni tests. There was a great loss of
efficiency for all tests relative to the unilateral Student t test, used to compare each
treatment with the control, as the difference between means decreased. The modified tests
were always more efficient than the respective originally proposed tests.
Key words: multiple comparison statistical tests, type I error, Monte Carlo method, power of tests
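The simulation design summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' program. The control mean of 100, normally distributed errors, treatment means lying above the control, a standard deviation equal to the coefficient of variation times the control mean, the absence of block effects, and all function names are assumptions made here for illustration only.

```python
import random

# Relative differences from the control for the eight treatments: five real
# differences (30%, 20%, 15%, 10%, 5%), two treatments equal to the control,
# and the control itself (last entry).
RELATIVE_DIFFERENCES = [0.30, 0.20, 0.15, 0.10, 0.05, 0.0, 0.0, 0.0]


def simulate_experiment(n_blocks, rng, control_mean=100.0, cv=0.10):
    """Return data[block][treatment] for one randomized block experiment.

    Assumptions (not stated in the paper): normal errors, treatment means
    above the control, sigma = cv * control_mean, and no block effects.
    """
    sigma = cv * control_mean
    return [
        [rng.gauss(control_mean * (1.0 + d), sigma) for d in RELATIVE_DIFFERENCES]
        for _ in range(n_blocks)
    ]


def residual_mean_square(data):
    """Residual Mean Square of a randomized block design, (b-1)(t-1) d.f."""
    b, t = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (b * t)
    block_means = [sum(row) / t for row in data]
    treat_means = [sum(data[i][j] for i in range(b)) / b for j in range(t)]
    ss_res = sum(
        (data[i][j] - block_means[i] - treat_means[j] + grand) ** 2
        for i in range(b)
        for j in range(t)
    )
    return ss_res / ((b - 1) * (t - 1))


def simulate_study(seed=2008):
    """600 experiments: 400 with four blocks and 200 with eight blocks."""
    rng = random.Random(seed)
    return ([simulate_experiment(4, rng) for _ in range(400)]
            + [simulate_experiment(8, rng) for _ in range(200)])


if __name__ == "__main__":
    study = simulate_study()
    s2 = residual_mean_square(study[0])
    print(len(study), "experiments; residual mean square of the first:", round(s2, 2))
```

Each simulated experiment would then be passed to every test under comparison, and the power of a test for a given difference would be the percentage of experiments in which that difference is declared significant.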
INTRODUCTION

In applied research, the hypothesis under investigation can be evaluated by conducting experiments that include different treatments. Results are generally submitted to analysis of variance, testing a global null hypothesis H0 with the F test and comparing the means by multiple comparison procedures (Hochberg & Tamhane, 1987; Hsu, 1996). A common practice is to compare new treatments to a control. In corn or wheat breeding, for example, new cultivars have to be compared to the main cultivar. In animal husbandry, new feeding treatments have to be compared to the main treatment in use. In medical research, promising new medicines have to be compared to the one currently adopted before the FDA in the USA or ANVISA in Brazil gives permission for their commercialization.

The rejection region for the global null hypothesis H0 is generally chosen so that the probability of a Type II error (acceptance of a false hypothesis) is as small as possible, while the Type I error rate may or may not be fixed in advance. For the comparison of means, the Type I error rate may be comparisonwise or experimentwise. The latter can be defined under the global null hypothesis, under a partial null hypothesis, or as the maximum experimentwise error rate (MEER), which is the preferred one.

The behavior of certain statistical tests and their performance in terms of Type I error rate have been evaluated, for example, by Gabriel (1964); Boardman & Moffitt (1971); O’Neill & Wetherill (1971); Bernhardson (1975); Hsu (1996) and many others, but there are still many questions to be answered in this research field (Hocking, 1985).

Boardman & Moffitt (1971) studied the Type I error rate per comparison in experiments with two to eleven identical treatments, that is, under a true global null hypothesis H0. The Student t test maintained a rejection frequency very near the adopted value of α = 0.05; the Duncan test had values varying from near 0.05 for t = 2 to near 0.025 for t = 11; the SNK, Tukey and Scheffé tests showed gradually smaller values, from 0.05 for t = 2 to near 0.01 for t = 11, below the adopted Type I error of 0.05.

For the experimentwise Type I error, adopting α = 0.05, the rejection frequency of the t test increased from 0.05 for t = 2 to near 0.55 for t = 11; the Duncan test had values varying from near 0.05 for t = 2 to 0.25 for t = 11; the other three tests maintained frequencies near the nominal value α or gave smaller values. Similar results were obtained by Bernhardson (1975) and Perecin & Barbosa (1988). Conagin (1998); Conagin et al. (1999); Conagin (1999) and Conagin & Gomes (2004) compared a large number of tests using different combinations of size, number of treatments, replications and coefficients of variation. Conagin & Barbin (2006a, 2006b) evaluated the behavior of various tests and introduced the modified Sidak and modified Bonferroni 1 and 2 tests.

The aim of this study is to propose two modifications of the Tukey test and to evaluate the power and the efficiency of the 11 classical and five modified multiple comparison tests.

MATERIAL AND METHODS

Two modifications of the Tukey test are suggested, and the power of the unilateral and bilateral Student t, Waller-Duncan, Duncan, SNK, REGWF, REGWQ, Tukey, Bonferroni, Sidak and unilateral Dunnett tests and of the modified Sidak, Bonferroni 1 and 2, and Tukey 1 and 2 tests was compared using Monte Carlo simulation. All classical tests were calculated using the SAS (2003) software.

Data were generated for 600 experiments with eight treatments in a randomized block design, of which 400 had four blocks and 200 had eight blocks. The differences between the treatment means and the control mean were 30%, 20%, 15%, 10% and 5%; two extra treatments did not differ from the control. A coefficient of variation of 10% and a Type I error probability of α = 0.05 were adopted. The power of each test was evaluated as the percentage of significant differences obtained relative to the number of experiments performed. A brief description of the modifications of the Tukey test follows.

Modified Tukey Test 1, TuM1

If the global null hypothesis $H_0$: $\tau_1 = \tau_2 = \dots = \tau_t = 0$, where $\tau_i$, $i = 1, \dots, t$, is the i-th treatment effect, is rejected, the researcher's main interest is to know how the t treatment means differ.

The Tukey test determines, for every pair of means, whether they are significantly different, and is based on a familywise error rate for $k = t(t-1)/2$ comparisons. The procedure is to test the hypotheses $H_0: \mu_i = \mu_{i'}$ versus $H_1: \mu_i \neq \mu_{i'}$, $i \neq i' = 1, \dots, t$, and $H_0$ is rejected at significance level $\alpha$ if

$$|\hat{\mu}_i - \hat{\mu}_{i'}| \geq q\, s \sqrt{1/r} \quad \text{or} \quad |\hat{\mu}_i - \hat{\mu}_{i'}| \geq q\, s \sqrt{\tfrac{1}{2}\left(1/r_i + 1/r_{i'}\right)},$$

where $\hat{\mu}_i$ and $\hat{\mu}_{i'}$ are the estimates of the means, $r_i$ and $r_{i'}$ are the numbers of replicates of treatments $i$ and $i'$, and $q = q_{t,\nu,\alpha}$ is the value of the studentized range for $t$ means and $\nu$ degrees of freedom associated with $s^2$, the Residual Mean Square.

One problem with the Tukey test is that it can be conservative (Carmer & Swanson, 1973) because it is based on the studentized range. A procedure similar to that employed for the BM2 and SiM tests (Conagin & Barbin, 2006a, 2006b) can be used here. The first modification proposed here for the Tukey test, called TuM1, is to carry out all the preliminary phases made
Sci. Agric. (Piracicaba, Braz.), v.65, n.4, p.428-432, July/August 2008
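The rejection rule of the Tukey test stated in the Modified Tukey Test 1 subsection reduces to a short calculation once q and s are known. A minimal Python sketch follows; the function names are hypothetical, the quantile q must be taken from a studentized range table or from statistical software, and the numbers in the usage example are placeholders rather than results from the paper.

```python
import math
from itertools import combinations


def tukey_critical_difference(q, s, r_i, r_j=None):
    """Smallest absolute difference between two means declared significant.

    q        : studentized range quantile for t means, nu d.f. and level alpha
               (supplied by the user from a table or software)
    s        : square root of the Residual Mean Square
    r_i, r_j : numbers of replicates of the two treatments (r_j defaults to r_i)
    """
    if r_j is None:
        r_j = r_i
    return q * s * math.sqrt(0.5 * (1.0 / r_i + 1.0 / r_j))


def tukey_significant_pairs(means, reps, q, s):
    """Return index pairs (i, j) whose means differ by at least the critical value."""
    significant = []
    for i, j in combinations(range(len(means)), 2):
        delta = tukey_critical_difference(q, s, reps[i], reps[j])
        if abs(means[i] - means[j]) >= delta:
            significant.append((i, j))
    return significant


if __name__ == "__main__":
    # Illustrative values only: q = 4.0 is a placeholder, not a tabulated quantile.
    means = [130.0, 100.0, 105.0]
    print(tukey_critical_difference(q=4.0, s=10.0, r_i=4))
    print(tukey_significant_pairs(means, reps=[4, 4, 4], q=4.0, s=10.0))
```

With equal replication the second form of the criterion collapses to q s sqrt(1/r), which is why a single function covers both cases.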