Shaffer (1995) Multiple hypothesis testing

Annual Reviews www.annualreviews.org/aronline
Annu. Rev. Psychol. 1995. 46:561-84
Copyright © 1995 by Annual Reviews Inc. All rights reserved

MULTIPLE HYPOTHESIS TESTING

Juliet Popper Shaffer

Department of Statistics, University of California, Berkeley, California 94720

KEYWORDS: multiple comparisons, simultaneous testing, p-values, closed test procedures, pairwise comparisons

CONTENTS
INTRODUCTION ......................................................................... 561
ORGANIZING CONCEPTS ................................................................ 564
  Primary Hypotheses, Closure, Hierarchical Sets, and Minimal Hypotheses ............. 564
  Families ........................................................................... 565
  Type I Error Control ............................................................... 566
  Power .............................................................................. 567
  P-Values and Adjusted P-Values ..................................................... 568
  Closed Test Procedures ............................................................. 569
METHODS BASED ON ORDERED P-VALUES ................................................... 569
  Methods Based on the First-Order Bonferroni Inequality ............................. 569
  Methods Based on the Simes Equality ................................................ 570
  Modifications for Logically Related Hypotheses ..................................... 571
  Methods Controlling the False Discovery Rate ....................................... 572
COMPARING NORMALLY DISTRIBUTED MEANS ................................................ 573
OTHER ISSUES ........................................................................ 575
  Tests vs Confidence Intervals ...................................................... 575
  Directional vs Nondirectional Inference ............................................ 576
  Robustness ......................................................................... 577
  Others ............................................................................. 578
CONCLUSION .......................................................................... 580

INTRODUCTION

Multiple testing refers to the testing of more than one hypothesis at a time. It is a subfield of the broader field of multiple inference, or simultaneous inference, which includes multiple estimation as well as testing. This review concentrates on testing and deals with the special problems arising from the multiple aspect. The term "multiple comparisons" has come to be used synonymously with

0066-4308/95/0201-0561$05.00


"simultaneous inference," even when the inferences do not deal with comparisons. It is used in this broader sense throughout this review.

In general, in testing any single hypothesis, conclusions based on statistical evidence are uncertain. We typically specify an acceptable maximum probability of rejecting the null hypothesis when it is true, thus committing a Type I error, and base the conclusion on the value of a statistic meeting this specification, preferably one with high power. When many hypotheses are tested, and each test has a specified Type I error probability, the probability that at least some Type I errors are committed increases, often sharply, with the number of hypotheses. This may have serious consequences if the set of conclusions must be evaluated as a whole. Numerous methods have been proposed for dealing with this problem, but no one solution will be acceptable for all situations. Three examples are given below to illustrate different types of multiple testing problems.

SUBPOPULATIONS: A HISTORICAL EXAMPLE  Cournot (1843) described vividly the multiple testing problem resulting from the exploration of effects within different subpopulations of an overall population. In his words, as translated from the French, "...it is clear that nothing limits...the number of features according to which one can distribute [natural events or social facts] into several groups or distinct categories." As an example he mentions investigating the chance of a male birth: "One could distinguish first of all legitimate births from those occurring out of wedlock.... one can also classify births according to birth order, according to the age, profession, wealth, or religion of the parents...usually these attempts through which the experimenter passed don't leave any traces; the public will only know the result that has been found worth pointing out; and as a consequence, someone unfamiliar with the attempts which have led to this result completely lacks a clear rule for deciding whether the result can or can not be attributed to chance." (See Stigler 1986 for further discussion of the historical context; see also Shafer & Olkin 1983, Nowak 1994.)

LARGE SURVEYS AND OBSERVATIONAL STUDIES  In large social science surveys, thousands of variables are investigated, and participants are grouped in myriad ways. The results of these surveys are often widely publicized and have potentially large effects on legislation, monetary disbursements, public behavior, etc. Thus, it is important to analyze results in a way that minimizes misleading conclusions. Some type of multiple error control is needed, but it is clearly impractical, if not impossible, to control errors at a small level over the entire set of potential comparisons.

FACTORIAL DESIGNS  The standard textbook presentation of multiple comparison issues is in the context of a one-factor investigation, where there is evidence


from an overall test that the means of the dependent variable for the different levels of a factor are not all equal, and more specific inferences are desired to delineate which means are different from which others. Here, in contrast to many of the examples above, the family of inferences for which error control is desired is usually clearly specified and is often relatively small. On the other hand, in multifactorial studies, the situation is less clear. The typical approach is to treat the main effects of each factor as a separate family for purposes of error control, although both Tukey (1953) and Hartley (1955) gave examples of 2 x 2 factorial designs in which they treated all seven main effect and interaction tests as a single family. The probability of finding some significances may be very large if each of many main effect and interaction tests is carried out at a conventional level in a multifactor design. Furthermore, it is important in many studies to assess the effects of a particular factor separately at each level of other factors, thus bringing in another layer of multiplicity (see Shaffer 1991).

As noted above, Cournot clearly recognized the problems involved in multiple inference, but he considered them insoluble. Although there were a few isolated earlier relevant publications, sustained statistical attacks on the problems did not begin until the late 1940s. Mosteller (1948) and Nair (1948) dealt with extreme value problems; Tukey (1949) presented a more comprehensive approach. Duncan (1951) treated multiple range tests. Related work on ranking and selection was published by Paulson (1949) and Bechhofer (1952). Scheffé (1953) introduced his well-known procedures, and work by Roy & Bose (1953) developed another simultaneous confidence interval approach. Also in 1953, a book-length unpublished manuscript by Tukey presented a general framework covering a number of aspects of multiple inference. This manuscript remained unpublished until recently, when it was reprinted in full (Braun 1994).
Later, Lehmann (1957a,b) developed a decision-theoretic approach, and Duncan (1961) developed a Bayesian decision-theoretic approach shortly afterward. For additional historical material, see Tukey (1953), Harter (1980), Miller (1981), Hochberg & Tamhane (1987), and Shaffer (1988). The first published book on multiple inference was Miller (1966), which was reissued in 1981 with the addition of a review article (Miller 1977). Except in the ranking and selection area, there were no other book-length treatments until 1986, when a series of book-length publications began to appear: 1. Multiple Comparisons (Klockars & Sax 1986); 2. Multiple Comparison Procedures (Hochberg & Tamhane 1987; for reviews, see Littell 1989, Peritz 1989); 3. Multiple Hypothesenprüfung (Multiple Hypotheses Testing) (Bauer et al 1988; for reviews, see Läuter 1990, Holm 1990); 4. Multiple Comparisons for Researchers (Toothaker 1991; for reviews, see Gaffan 1992, Tatsuoka 1992) and Multiple Comparison Procedures (Toothaker 1993); 5. Multiple Comparisons, Selection, and Applications in Biometry (Hoppe 1993b; for a review, see Ziegel 1994); 6. Resampling-based Multiple Testing

(Westfall & Young 1993; for reviews, see Chaubey 1993, Booth 1994); 7. The Collected Works of John W. Tukey, Volume VIII: Multiple Comparisons: 1948-1983 (Braun 1994); and 8. Multiple Comparisons: Theory and Methods (Hsu 1996).

This review emphasizes conceptual issues and general approaches. In particular, two types of methods are discussed in detail: (a) methods based on ordered p-values and (b) comparisons among normally distributed means. The literature cited offers many examples of the application of techniques discussed here.

ORGANIZING CONCEPTS

Primary Hypotheses, Closure, Hierarchical Sets, and Minimal Hypotheses
Assume some set of null hypotheses of primary interest to be tested. Sometimes the number of hypotheses in the set is infinite (e.g. hypothesized values of all linear contrasts among a set of population means), although in most practical applications it is finite (e.g. values of all pairwise contrasts among a set of population means). It is assumed that there is a set of observations with joint distribution depending on some parameters and that the hypotheses specify limits on the values of those parameters. The following examples use a primary set based on differences among the means μ1, μ2, ..., μm of m populations, although the concepts apply in general. Let δij be the difference μi - μj; let δijk be the set of differences among the means μi, μj, and μk; etc. The hypotheses are of the form Hijk...: δijk... = 0, indicating that all subscripted means are equal; e.g. H1234 is the hypothesis μ1 = μ2 = μ3 = μ4. The primary set need not consist of the individual pairwise hypotheses Hij. If m = 4, it may, for example, be the set H12, H123, H1234, etc, which would signify a lack of interest in including inference concerning some of the pairwise differences (e.g. H23) and therefore no need to control errors with respect to those differences. The closure of the set is the collection of the original set together with all distinct hypotheses formed by intersections of hypotheses in the set; such a collection is called a closed set. For example, an intersection of the hypotheses Hij and Hik is the hypothesis Hijk: μi = μj = μk. The hypotheses included in an intersection are called components of the intersection hypothesis. Technically, a hypothesis is a component of itself; any other component is called a proper component. In the example above, the proper components of Hijk are Hij, Hik, and, if it is included in the set of primary interest, Hjk, because its intersection with either Hij or Hik also gives Hijk. Note that the truth of a hypothesis implies the truth of all its proper components.
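The closure operation is easy to mechanize. The sketch below is an illustration of my own, not code from the article: a hypothesis is represented as a frozenset of "equality components" (sets of mean indices asserted equal), so H12 is frozenset({frozenset({1, 2})}), and the intersection of H12 and H34 is a hypothesis with two components.

```python
from itertools import combinations

def normalize(components):
    """Coalesce overlapping equality components: {1,2} and {2,3}
    merge into {1,2,3}, since equality of both pairs of means
    forces all three means to be equal."""
    comps = [set(c) for c in components]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(comps)), 2):
            if comps[a] & comps[b]:
                comps[a] |= comps.pop(b)
                merged = True
                break
    return frozenset(frozenset(c) for c in comps)

def closure(primary):
    """Form the closed set: the primary hypotheses together with all
    distinct intersections.  Intersecting two hypotheses pools their
    equality components and coalesces any that overlap."""
    closed = set(primary)
    grew = True
    while grew:
        grew = False
        for h1, h2 in combinations(list(closed), 2):
            inter = normalize(h1 | h2)
            if inter not in closed:
                closed.add(inter)
                grew = True
    return closed

# The six pairwise hypotheses Hij among four means:
primary = [frozenset([frozenset(p)]) for p in combinations(range(1, 5), 2)]
closed = closure(primary)
# The closed set has 14 hypotheses: the 6 pairwise Hij, the 4 of the
# form Hijk, H1234, and 3 intersections such as H12-and-H34.
```

With the pairwise hypotheses as the primary set, the closure adds the hypotheses Hijk, H1234, and the disjoint intersections such as "μ1 = μ2 and μ3 = μ4" described in the text.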


Any set of hypotheses in which some are proper components of others will be called a hierarchical set. (That term is sometimes used in a more limited way, but this definition is adopted here.) A closed set (with more than one hypothesis) is therefore a hierarchical set. In a closed set, the top of the hierarchy is the intersection of all hypotheses: in the examples above, it is the hypothesis H12...m, or μ1 = μ2 = ... = μm. The hypotheses that have no proper components represent the lowest level of the hierarchy; these are called the minimal hypotheses (Gabriel 1969). Equivalently, a minimal hypothesis is one that does not imply the truth of any other hypothesis in the set. For example, if all the hypotheses state that there are no differences among sets of means, and the set of primary interest includes all hypotheses Hij for all i ≠ j = 1, ..., m, these pairwise equality hypotheses are the minimal hypotheses.

Families

The first and perhaps most crucial decision is what set of hypotheses to treat as a family, that is, as the set for which significance statements will be considered and errors controlled jointly. In some of the early multiple comparisons literature (e.g. Ryan 1959, 1960), the term "experiment" rather than "family" was used in referring to error control. Implicitly, attention was directed to relatively small and limited experiments. As a dramatic contrast, consider the example of large surveys and observational studies described above. Here, because of the inverse relationship between control of Type I errors and power, it is unreasonable if not impossible to consider methods controlling the error rate at a conventional level, or indeed any level, over all potential inferences from such surveys. An intermediate case is a multifactorial study (see above example), in which it frequently seems unwise from the point of view of power to control error over all inferences. The term "family" was introduced by Tukey (1952, 1953).
Miller (1981), Diaconis (1985), Hochberg & Tamhane (1987), and others discuss the issues involved in deciding on a family. Westfall & Young (1993) give explicit advice on methods for approaching complex experimental studies. Because a study can be used for different purposes, the results may have to be considered under several different family configurations. This issue came up in reporting state and other geographical comparisons in the National Assessment of Educational Progress (see Ahmed 1991). In a recent national report, each of the 780 pairwise differences among the 40 jurisdictions involved (states, territories, and the District of Columbia) was tested for significance at level .05/780 in order to control Type I errors for that family. However, from the point of view of a single jurisdiction, the family of interest is the 39 comparisons of itself with each of the others, so it would be reasonable to test those differences each at level .05/39, in which case some differences would be declared significant that were not so designated in the national

report. See Ahmed (1991) for a discussion of this example and other issues in the context of large surveys.

Type I Error Control

In testing a single hypothesis, the probability of a Type I error, i.e. of rejecting the null hypothesis when it is true, is usually controlled at some designated level α. The choice of α should be governed by considerations of the costs of rejecting a true hypothesis as compared with those of accepting a false hypothesis. Because of the difficulty of quantifying these costs and the subjectivity involved, α is usually set at some conventional level, often .05. A variety of generalizations to the multiple testing situation are possible.

Some multiple comparison methods control the Type I error rate only when all null hypotheses in the family are true. Others control this error rate for any combination of true and false hypotheses. Hochberg & Tamhane (1987) refer to these as weak control and strong control, respectively. Examples of methods with only weak error control are the Fisher protected least significant difference (LSD) procedure, the Newman-Keuls procedure, and some nonparametric procedures (see Fligner 1984, Keselman et al 1991a). The multiple comparison literature has been confusing because the distinction between weak and strong control is often ignored. In fact, weak error rate control without other safeguards is unsatisfactory. This review concentrates on procedures with strong control of the error rate.

Several different error rates have been considered in the multiple testing literature. The major ones are the error rate per hypothesis, the error rate per family, and the error rate familywise, or familywise error rate.
The error rate per hypothesis (usually called PCE, for per-comparison error rate, although the hypotheses need not be restricted to comparisons) is defined for each hypothesis as the probability of Type I error; when the number of hypotheses is finite, the average PCE can be defined as the expected value of (number of false rejections/number of hypotheses), where a false rejection means the rejection of a true hypothesis. The error rate per family (PFE) is defined as the expected number of false rejections in the family. This error rate does not apply if the family size is infinite. The familywise error rate (FWE) is defined as the probability of at least one error in the family. A fourth type of error rate, the false discovery rate, is described below.

To make the three definitions above clearer, consider what they imply in a simple example in which each of n hypotheses H1, ..., Hn is tested individually at a level αi, and the decision on each is based solely on that test. (Procedures of this type are called single-stage; other procedures have a more complicated structure.) If all the hypotheses are true, the average PCE equals the average of the αi, the PFE equals the sum of the αi, and the FWE is a function not of the


αi alone, but involves the joint distribution of the test statistics; it is smaller than or equal to the PFE, and larger than or equal to the largest αi.

A common misconception of the meaning of an overall error rate α applied to a family of tests is that on the average, only a proportion α of the rejected hypotheses are true ones, i.e. are falsely rejected. To see why this is not so, consider the case in which all the hypotheses are true; then 100% of rejected hypotheses are true, i.e. are rejected in error, in those situations in which any rejections occur. This misconception, however, suggests considering the proportion of rejected hypotheses that are falsely rejected and trying to control this proportion in some way. Letting V equal the number of false rejections (i.e. rejections of true hypotheses) and R equal the total number of rejections, the proportion of false rejections is Q = V/R. Some interesting early work related to this ratio is described by Seeger (1968), who credits the initial investigation to unpublished papers of Eklund. Sorić (1989) describes a different approach to this ratio. These papers (Seeger, Eklund, and Sorić) advocated informal consideration of the ratio; the following new approach is more formal. The false discovery rate (FDR) is the expected value of Q = (number of false significances/number of significances) (Benjamini & Hochberg 1994).
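The contrast among these error rates can be seen in a small simulation. The following sketch is illustrative only (the function name and defaults are my own): it tests n true null hypotheses at level α with independent test statistics, so each null p-value is Uniform(0, 1), and estimates the average PCE, the PFE, and the FWE.

```python
import random

def error_rates(n_hyp=10, alpha=0.05, n_sim=20000, seed=1):
    """Monte Carlo sketch of the three error rates when n_hyp true
    null hypotheses are each tested at level alpha and the tests are
    independent.  Returns (average PCE, PFE, FWE) estimates."""
    rng = random.Random(seed)
    total_false = 0   # accumulates false rejections over all replications
    any_false = 0     # counts replications with at least one false rejection
    for _ in range(n_sim):
        false_rejections = sum(rng.random() <= alpha for _ in range(n_hyp))
        total_false += false_rejections
        any_false += false_rejections > 0
    pfe = total_false / n_sim   # expected number of false rejections
    fwe = any_false / n_sim     # P(at least one false rejection)
    pce = pfe / n_hyp           # average per-hypothesis error rate
    return pce, pfe, fwe

# With 10 independent tests each at alpha = .05, the PFE is 0.5 and
# the FWE is 1 - 0.95**10, roughly 0.40 -- far above the per-test level.
```

As the text states, the estimated FWE falls between the largest αi and the PFE, while the average PCE stays at the per-test level.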

Power

As shown above, the error rate can be generalized in different ways when moving from single to multiple hypothesis testing. The same is true of power. Three definitions of power have been common: the probability of rejecting at least one false hypothesis, the average probability of rejecting the false hypotheses, and the probability of rejecting all false hypotheses. When the family consists of pairwise mean comparisons, these have been called, respectively, any-pair power (Ramsey 1978), per-pair power (Einot & Gabriel 1975), and all-pairs power (Ramsey 1978). Ramsey (1978) showed that the difference in power between single-stage and multistage methods is much greater for all-pairs power than for any-pair or per-pair power (see also Gabriel 1978, Hochberg & Tamhane 1987).

P-Values and Adjusted P-Values

In testing a single hypothesis, investigators have moved away from simply accepting or rejecting the hypothesis, giving instead the p-value connected with the test, i.e. the probability of observing a test statistic as extreme or more extreme in the direction of rejection as the observed value. This can be conceptualized as the level at which the hypothesis would just be rejected, and therefore both allows individuals to apply their own criteria and gives more information than merely acceptance or rejection. Extension of this concept in its full meaning to the multiple testing context is not necessarily straightforward. A concept that allows generalization from the test of a single hypothesis

to the multiple context is the adjusted p-value (Rosenthal & Rubin 1983). Given any test procedure, the adjusted p-value corresponding to the test of a single hypothesis Hi can be defined as the level of the entire test procedure at which Hi would just be rejected, given the values of all test statistics involved. Application of this definition in complex multiple comparison procedures is discussed by Wright (1992) and by Westfall & Young (1993), who base their methodology on the use of such values. These values are interpretable on the same scale as those for tests of individual hypotheses, making comparison with single hypothesis testing easier.

Closed Test Procedures

Most of the multiple comparison methods in use are designed to control the FWE. The most powerful of these methods are in the class of closed test procedures, described in Marcus et al (1976). To define this general class, assume a set of hypotheses of primary interest, add hypotheses as necessary to form the closure of this set, and recall that the closed set consists of a hierarchy of hypotheses. The closure principle is as follows: A hypothesis is rejected at level α if and only if it and every hypothesis directly above it in the hierarchy (i.e. every hypothesis that includes it in an intersection and thus implies it) is rejected at level α. For example, given four means, with the six hypotheses Hij, i ≠ j = 1, ..., 4 as the minimal hypotheses, the highest hypothesis in the hierarchy is H1234, and no hypothesis below H1234 can be rejected unless H1234 is rejected at level α. Assuming it is rejected, the hypothesis H12 cannot be rejected unless the three other hypotheses above it in the hierarchy, H123, H124, and the intersection of H12 and H34 (i.e. the single hypothesis μ1 = μ2 and μ3 = μ4), are rejected at level α, and then H12 is rejected if its associated test statistic is significant at that level.
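Given a level-α p-value for every hypothesis in the closed set, the closure principle itself reduces to a few lines. This is an illustrative sketch with a representation of my own choosing (a hypothesis as a frozenset of equality components, each component a frozenset of mean indices); the p-values in the example are hypothetical.

```python
def closed_test(pvalues, alpha=0.05):
    """Closure principle sketch.  `pvalues` maps every hypothesis in a
    closed set to the p-value of a level-alpha test of it.  A
    hypothesis is rejected iff its own test and the tests of all
    hypotheses implying it are significant at level alpha."""
    def implies(big, small):
        # `big` implies `small` when every equality component of
        # `small` sits inside some component of `big` (H123 implies H12).
        return all(any(c <= b for b in big) for c in small)

    return {h for h in pvalues
            if all(p <= alpha for g, p in pvalues.items() if implies(g, h))}

# Hypothetical example with three means: H123 and its pairwise components.
H12, H13, H23 = (frozenset({frozenset(p)}) for p in [(1, 2), (1, 3), (2, 3)])
H123 = frozenset({frozenset({1, 2, 3})})
rejected = closed_test({H12: 0.02, H13: 0.5, H23: 0.03, H123: 0.01})
# H12 and H23 are rejected (their own tests and H123's are significant
# at .05); H13 is not, since its own p-value exceeds .05.
```

Note that the sketch takes the intersection tests as given; as the text goes on to say, any valid level-α tests may be used at each level, provided their choice does not depend on the observed configuration of the means.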
Any tests can be used at each of these levels, provided the choice of tests does not depend on the observed configuration of the means.

The proof that closed test procedures control the FWE involves a simple logical argument. Consider every possible true situation, each of which can be represented as an intersection of null and alternative hypotheses. Only one of these situations can be the true one, and under a closed testing procedure the probability of rejecting that one true configuration is ≤ α. All true null hypotheses in the primary set are contained in the intersection corresponding to the true configuration, and none of them can be rejected unless that configuration is rejected. Therefore, the probability of one or more of these true primary hypotheses being rejected is ≤ α.

METHODS BASED ON ORDERED P-VALUES

The methods discussed in this section are defined in terms of a finite family of hypotheses Hi, i = 1, ..., n, consisting of minimal hypotheses only. It is assumed that for each hypothesis Hi there is a corresponding test statistic Ti with a distribution that depends only on the truth or falsity of Hi. It is further assumed that Hi is to be rejected for large values of Ti. (The Ti are absolute values for two-sided tests.) Then the (unadjusted) p-value pi of Hi is defined as the probability that Ti is larger than or equal to ti, where T refers to the random variable and t to its observed value. For simplicity of notation, assume the hypotheses are numbered in the order of their p-values so that p1 ≤ p2 ≤ ... ≤ pn, with arbitrary ordering in case of ties. With the exception of the subsection on Methods Controlling the FDR, all methods in this section are intended to provide strong control of the FWE.

Methods Based on the First-Order Bonferroni Inequality

The first-order Bonferroni inequality states that, given any set of events A1, A2, ..., An, the probability of their union (i.e. of the event A1 or A2 or...or An) is smaller than or equal to the sum of their probabilities. Letting Ai stand for the rejection of Hi, i = 1, ..., n, this inequality is the basis of the Bonferroni methods discussed in this section.

THE SIMPLE BONFERRONI METHOD  This method takes the form: Reject Hi if pi ≤ αi, where the αi are chosen so that their sum equals α. Usually, the αi are chosen to be equal (all equal to α/n), and the method is then called the unweighted Bonferroni method. This procedure controls the PFE to be ≤ α and to be exactly α if all hypotheses are true. The FWE is usually < α. This simple Bonferroni method is an example of a single-stage testing procedure. In single-stage procedures, control of the FWE has the consequence that the larger the number of hypotheses in the family, the smaller the average power for testing the individual hypotheses. Multistage testing procedures can partially overcome this disadvantage. Some multistage modifications of the Bonferroni method are discussed below.

HOLM'S SEQUENTIALLY-REJECTIVE BONFERRONI METHOD  The unweighted method is described here; for the weighted method, see Holm (1979). This method is applied in stages as follows: At the first stage, H1 is rejected if p1 ≤ α/n. If H1 is accepted, all hypotheses are accepted without further test; otherwise, H2 is rejected if p2 ≤ α/(n - 1). Continuing in this fashion, at any stage Hj is rejected if and only if all Hi with i < j have been rejected and pj ≤ α/(n - j + 1). When the test statistics are independent, the critical levels α/k can be replaced by the slightly larger values 1 - (1 - α)^(1/k) > α/k, although the difference is small for small values of α. These somewhat higher levels can also be used when the test statistics are positive orthant dependent, a class that includes the two-sided t statistics for pairwise comparisons of normally distributed means in a one-way layout. Holland & Copenhaver (1988) note this fact and give examples of other positive orthant dependent statistics.
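Holm's procedure can be sketched in a few lines. This is an illustrative implementation of the unweighted method (the function name and input convention are my own; input is a list of p-values in the original hypothesis order).

```python
def holm(pvalues, alpha=0.05):
    """Holm's sequentially rejective Bonferroni method (unweighted).
    Compare the ordered p-values p(1) <= ... <= p(n) against
    alpha/n, alpha/(n-1), ..., alpha; stop at the first failure and
    accept that hypothesis and all remaining ones."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    reject = [False] * n
    for step, i in enumerate(order):        # step = 0, 1, ..., n-1
        if pvalues[i] <= alpha / (n - step):
            reject[i] = True
        else:
            break                           # accept this and all later ones
    return reject

# With p-values (0.011, 0.02, 0.5) at alpha = .05, single-stage
# Bonferroni (threshold .05/3) rejects only the first hypothesis,
# while Holm's multistage procedure also rejects the second.
```

Replacing the denominators n - step with Shaffer's tj values (discussed below under logically related hypotheses) would give the modified sequentially-rejective procedure.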

Methods Based on the Simes Equality

Simes (1986) proved that if a set of hypotheses H1, H2, ..., Hn are all true, and the associated test statistics are independent, then with probability 1 - α, pi > iα/n for i = 1, ..., n, where the pi are the ordered p-values, and α is any number between 0 and 1. Furthermore, although Simes noted that the probability of this joint event could be smaller than 1 - α for dependent test statistics, this appeared to be true only in rather pathological cases. Simes and others (Hommel 1988, Holland 1991, Klockars & Hancock 1992) have provided simulation results suggesting that the probability of the joint event is larger than 1 - α for many types of dependence found in typical testing situations, including the usual two-sided t test statistics for all pairwise comparisons among normally distributed treatment means.

Simes suggested that this result could be used in multiple testing but did not provide a formal procedure. As Hochberg (1988) and Hommel (1988) pointed out, on the assumption that the inequality applies in a testing situation, more powerful procedures than the sequentially rejective Bonferroni can be obtained by invoking the Simes result in combination with the closure principle. Because carrying out a full Simes-based closure procedure testing all possible hypotheses would be tedious with a large closed set, Hochberg (1988) and Hommel (1988) each give simplified, conservative methods of utilizing the Simes result.

HOCHBERG'S MULTIPLE TEST PROCEDURE  Hochberg's (1988) procedure can be described as a "step-up" modification of Holm's procedure. Consider the set of primary hypotheses H1, ..., Hn. If pj ≤ α/(n - j + 1) for any j = 1, ..., n, reject


all hypotheses Hi for i ≤ j. In other words, if pn ≤ α, reject all Hi; otherwise, if pn-1 ≤ α/2, reject H1, ..., Hn-1; etc.

HOMMEL'S MULTIPLE TEST PROCEDURE  Hommel's (1988) procedure is more powerful than Hochberg's but is more difficult to understand and apply. Let j be the largest integer for which pn-j+k > kα/j for all k = 1, ..., j. If no such j exists, reject all hypotheses; otherwise, reject all Hi with pi ≤ α/j.

ROM'S MODIFICATION OF HOCHBERG'S PROCEDURE  Rom (1990) gave slightly higher critical p-value levels that can be used with Hochberg's procedure, making it somewhat more powerful. The values must be calculated; see Rom (1990) for details and a table of values for small n.
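Hochberg's step-up procedure, as stated above, might be sketched as follows (an illustrative implementation of my own, not from the article).

```python
def hochberg(pvalues, alpha=0.05):
    """Hochberg's step-up procedure.  Scan the ordered p-values from
    the largest down; as soon as p(j) <= alpha/(n - j + 1) for some j,
    reject the hypotheses with the j smallest p-values.  Valid when
    the Simes inequality holds for the test statistics' joint
    distribution."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    reject = [False] * n
    for j in range(n, 0, -1):               # j = n, n-1, ..., 1
        if pvalues[order[j - 1]] <= alpha / (n - j + 1):
            for i in order[:j]:
                reject[i] = True
            break
    return reject

# With two p-values (0.04, 0.045) at alpha = .05, Holm rejects neither
# (0.04 > .05/2), but Hochberg rejects both, since the larger p-value
# already satisfies 0.045 <= .05.
```

The comparison in the comment illustrates why the step-up procedure is never less powerful than Holm's step-down procedure at the same critical levels.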

Modifications for Logically Related Hypotheses

Shaffer (1986) pointed out that Holm's sequentially-rejective multiple test procedure can be improved when hypotheses are logically related; the same considerations apply to multistage methods based on Simes' equality. In many testing situations, it is not possible to get all combinations of true and false hypotheses. For example, if the hypotheses refer to pairwise differences among treatment means, it is impossible to have μ1 = μ2 and μ2 = μ3 but μ1 ≠ μ3. Using this reasoning, with four means and six possible pairwise equality null hypotheses, if all six are not true, then at most three are true. Therefore, it is not necessary to protect against error in the event that five hypotheses are true and one is false, because this combination is impossible. Let tj be the maximum number of hypotheses that are true given that at least j - 1 hypotheses are false. Shaffer (1986) gives recursive methods for finding the values tj for several types of testing situations (see also Holland & Copenhaver 1987, Westfall & Young 1993). The methods discussed above can be modified to increase power when the hypotheses are logically related; all methods in this section are intended to control the FWE at a level α.

MODIFIED METHODS  As is clear from the proof that it maintains FWE control, the Holm procedure can be modified as follows: At stage j, instead of rejecting Hj only if pj ≤ α/(n - j + 1), Hj can be rejected if pj ≤ α/tj. Thus, when the hypotheses of primary interest are logically related, as in the example above, the modified sequentially-rejective Bonferroni method is more powerful than the unmodified method. For some simple applications, see Levin et al (1994). Hochberg & Rom (1994) and Hommel (1988) describe modifications of their Simes-based procedures for logically related hypotheses. The simpler of the two modifications the former describes is to proceed from i = n, n - 1, n - 2, etc until for the first time pi ≤ α/(n - i + 1). Then reject all Hi for

Annual Reviews www.annualreviews.org/aronline 572 SHAFFER which pi ~ oJti + 1. [The Rom(1990) modification of the Hochbergprocedure can be improvedin a similar way.] In the Hommel modification, let j be the largest integerin the set n, t2 ..... tn, and proceed as in the unmodifiedHommel procedure. Still further modifications at the expense of greater complexity can be achieved, since it can also be shown(Shaffer 1986) that for FWE control it necessary to consider only the numberof hypotheses that can be true given that the specific hypothesesthat have been rejected are false. Hommel (1986), Conforti & Hochberg(1987), Rasmussen(1993), Rom& Holland (1994), Hochberg& Rom(1994) consider more general procedures. COMPARISON OF PROCEDURES Amongthe unmodified procedures, Hommel’s and Rom’sare more powerful than Hochberg’s, which is more powerful than Holm’s;the latter two, however,are the easiest to apply (Hommel 1988, 1989; Hochberg1988; Hochberg&Rom1994). Simulation results using the unmodified methodssuggest that the differences are usually small (Holland 1991). Comparisonsamongthe modified procedures are more complex (see Hochberg & Rom1994). A CAUTION All methodsbased on Simes’s results rest on the assumption that the equality he provedfor independenttests results in a conservative multiple comparisonprocedure for dependenttests. Thus, the use of these methodsin atypical multiple test situations should be backedup by simulation or further theoretical results (see Hochberg& Rom1994). Methods Controlling

the False Discovery Rate

The ordered p-value methods described above provide strong control of the FWE. When the test statistics are independent, the following less conservative step-up procedure controls the FDR (Benjamini & Hochberg 1994): Letting p1 ≤ p2 ≤ ... ≤ pn denote the ordered p-values, find the largest j for which pj ≤ jα/n, and reject all Hi for i ≤ j. A recent simulation study (Y Benjamini, Y Hochberg, & Y Kling, manuscript in preparation) suggests that the FDR is also controlled at this level for the dependent tests involved in pairwise comparisons. VS Williams, LV Jones, & JW Tukey (manuscript in preparation) show in a number of real data examples that the Benjamini-Hochberg FDR-controlling procedure may result in substantially more rejections than other multiple comparison methods. However, to obtain an expected proportion of false rejections, Benjamini & Hochberg have to define a value when the denominator, i.e. the number of rejections, equals zero; they define the ratio then as zero. As a result, the expected proportion, given that some rejections actually occur, is greater than α in some situations (it necessarily equals one when all hypotheses are true), so more investigation of the error properties of this procedure is needed.
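As a concrete sketch of these ordered p-value procedures, the following implements the Holm step-down test (with Shaffer's tj sequence as an option) and the Benjamini-Hochberg step-up FDR test. The function names and the illustrative p-values are mine, not from the sources; for four means and six pairwise hypotheses, Shaffer's sequence is t = (6, 3, 3, 3, 2, 1), since with at least one false hypothesis at most three can still be true.

```python
def holm(pvals, alpha=0.05, t=None):
    """Holm's sequentially rejective Bonferroni procedure.

    At stage j (p-values sorted ascending, j = 1, ..., n) the j-th
    smallest p-value is compared with alpha/(n - j + 1).  If t is
    supplied, t[j-1] is Shaffer's t_j -- the maximum number of
    hypotheses that can be true when at least j - 1 are false -- and
    the comparison is with alpha/t_j instead.  Testing stops at the
    first non-rejection.  Returns the set of rejected indices.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    denoms = t if t is not None else [n - j for j in range(n)]
    rejected = set()
    for j, idx in enumerate(order):
        if pvals[idx] <= alpha / denoms[j]:
            rejected.add(idx)
        else:
            break
    return rejected

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up FDR procedure: reject the hypotheses
    with the j smallest p-values, where j is the largest index such
    that p_(j) <= j * alpha / n."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    jstar = -1
    for j, idx in enumerate(order):
        if pvals[idx] <= (j + 1) * alpha / n:
            jstar = j
    return set(order[:jstar + 1])

# Six pairwise hypotheses on four means (illustrative p-values).
p = [0.001, 0.011, 0.012, 0.015, 0.02, 0.30]
print(len(holm(p)))                         # 1 rejection
print(len(holm(p, t=[6, 3, 3, 3, 2, 1])))   # 5: Shaffer modification
print(len(benjamini_hochberg(p)))           # 5: FDR control at 0.05
```

The example makes the power ordering in the text visible: the unmodified Holm procedure stops at the second stage (0.011 > 0.05/5), while the Shaffer-modified thresholds α/3 admit it and four further rejections follow.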

COMPARING NORMALLY DISTRIBUTED MEANS

The methods in this section differ from those of the last in three respects: They deal specifically with comparisons of means, they are derived assuming normally distributed observations, and they are based on the joint distribution of all observations. In contrast, the methods considered in the previous section are completely general, both with respect to the types of hypotheses and the distributions of test statistics, and except for some results related to independence of statistics, they utilize only the individual marginal distributions of those statistics. Contrasts among treatment means are linear functions of the form Σci μi, where Σci = 0. The pairwise differences among means are called simple contrasts; a general contrast can be thought of as a weighted average of some subset of means minus a weighted average of another subset. The reader is presumably familiar with the most commonly used methods for testing the hypotheses that sets of linear contrasts equal zero with FWE control in a one-way analysis of variance layout under standard assumptions. They are described briefly below. Assume m treatments with N observations per treatment and a total of T observations over all treatments, let x̄i be the sample mean for treatment i, and let MSW be the within-treatment mean square. If the primary hypotheses consist of all linear contrasts among treatment means, the Scheffé method (1953) controls the FWE. Using the Scheffé method, a contrast hypothesis Σci μi = 0 is rejected if |Σci x̄i| > √[Σci² (MSW/N)(m − 1) Fm−1,T−m;α], where Fm−1,T−m;α is the α-level critical value of the F distribution with m − 1 and T − m degrees of freedom. If the primary hypotheses consist of the pairwise differences, i.e. the simple contrasts, the Tukey method (1953) controls the FWE over this set.
Using this method, any simple contrast hypothesis δij = 0 is rejected if |x̄i − x̄j| > √(MSW/N) qm,T−m;α, where qm,T−m;α is the α-level critical value of the studentized range statistic for m means and T − m error degrees of freedom. If the primary hypotheses consist of comparisons of each of the first m − 1 means with the mth mean (e.g. of m − 1 treatments with a control), the Dunnett method (1955) controls the FWE over this set. Using this method, any hypothesis δim = 0 is rejected if |x̄i − x̄m| > √(2MSW/N) dm−1,T−m;α, where dm−1,T−m;α is the α-level critical value of the appropriate distribution for this test. Both the Tukey and Dunnett methods can be generalized to test the hypotheses that all linear contrasts among the means equal zero, so that the three procedures can be compared in power on this whole set of tests (for discussion of these extended methods and specific comparisons, see Shaffer 1977). Richmond (1982) provides a more general treatment of the extension of confidence intervals for a finite set to intervals for all linear functions of the set. All three methods can be modified to multistage methods that give more power for hypothesis testing. In the Scheffé method, if the F test is significant, the FWE is preserved if m − 1 is replaced by m − 2 everywhere in the expression for Scheffé significance tests (Scheffé 1970). The Tukey method can be improved by a multiple range test using significance levels described by Tukey (1953) and sometimes referred to as Tukey-Welsch-Ryan levels (see also Einot & Gabriel 1975, Lehmann & Shaffer 1979). Begun & Gabriel (1981) describe an improved but more complex multiple range procedure based on a suggestion by E Peritz [unpublished manuscript (1970)] using closure principles, and denoted the Peritz-Begun-Gabriel method by Grechanovsky (1993). Welsch (1977) and Dunnett & Tamhane (1992) proposed step-up methods (looking first at adjacent differences) as opposed to the step-down methods in the multiple range procedures just described. The step-up methods have some desirable properties (see Ramsey 1981, Dunnett & Tamhane 1992, Keselman & Lix 1994) but require heavy computation or special tables for application. The Dunnett test can be treated in a sequentially-rejective fashion, where at stage j the smaller value dm−j,T−m;α can be substituted for dm−1,T−m;α. Because the hypotheses in a closed set may each be tested at level α by a variety of procedures, there are many other possible multistage procedures. For example, results of Ramsey (1978), Shaffer (1981), and Kunert (1990) suggest that for most configurations of means, a multiple F-test multistage procedure is more powerful than the multiple range procedures described above for testing pairwise differences, although the opposite is true with single-stage procedures. Other approaches to comparing means based on ranges have been investigated by Braun & Tukey (1983), Finner (1988), and Royen (1989, 1990).
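To make the single-stage Scheffé and Tukey rules concrete, here is a small sketch for the balanced one-way layout. The function names and the toy data are mine; the critical values come from scipy's F and studentized-range distributions (the latter requires scipy ≥ 1.7).

```python
import numpy as np
from scipy.stats import f, studentized_range

def _msw(groups, means):
    """Pooled within-treatment mean square, balanced layout."""
    m, N = len(groups), len(groups[0])
    ssw = sum(np.sum((np.asarray(g) - mu) ** 2)
              for g, mu in zip(groups, means))
    return ssw / (m * N - m)

def scheffe_reject(groups, c, alpha=0.05):
    """Scheffé test of the contrast sum(c_i * mu_i) = 0: reject when
    |sum c_i xbar_i| > sqrt(sum(c_i^2) * (MSW/N) * (m-1) * F_crit)."""
    m, N = len(groups), len(groups[0])
    means = np.array([np.mean(g) for g in groups])
    c = np.asarray(c, dtype=float)
    crit = np.sqrt(np.sum(c ** 2) * (_msw(groups, means) / N)
                   * (m - 1) * f.ppf(1 - alpha, m - 1, m * N - m))
    return abs(c @ means) > crit

def tukey_reject(groups, i, j, alpha=0.05):
    """Tukey studentized-range test of mu_i = mu_j: reject when
    |xbar_i - xbar_j| > q_{m, T-m; alpha} * sqrt(MSW/N)."""
    m, N = len(groups), len(groups[0])
    means = np.array([np.mean(g) for g in groups])
    cutoff = (studentized_range.ppf(1 - alpha, m, m * N - m)
              * np.sqrt(_msw(groups, means) / N))
    return abs(means[i] - means[j]) > cutoff

groups = [[10.1, 9.9, 10.0, 10.2, 9.8],     # mean 10
          [10.0, 10.1, 9.9, 10.05, 9.95],   # mean 10
          [15.0, 15.2, 14.8, 15.1, 14.9]]   # mean 15
print(scheffe_reject(groups, [1, 0, -1]))   # True: mu_1 vs mu_3 differ
print(tukey_reject(groups, 0, 1))           # False: groups 1, 2 agree
```

Note the trade-off stated in the text: the Scheffé cutoff protects all linear contrasts simultaneously, so for the simple contrasts alone the Tukey cutoff is the sharper of the two.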
The Scheffé method and its multistage version are easy to apply when sample sizes are unequal; simply substitute Ni for N in the Scheffé formula given above, where Ni is the number of observations for treatment i. Exact solutions for the Tukey and Dunnett procedures are possible in principle but involve evaluation of multidimensional integrals. More practical approximate methods are based on replacing MSW/N, which is half the estimated variance of x̄i − x̄j in the equal-sample-size case, with (1/2)MSW(1/Ni + 1/Nj), which is half its estimated variance in the unequal-sample-size case. The common value MSW/N is thus replaced by a different value for each pair of subscripts i and j. The Tukey-Kramer method (Tukey 1953, Kramer 1956) uses the single-stage Tukey studentized range procedure with these half-variance estimates substituted for MSW/N. Kramer (1956) proposed a similar multistage method; a preferred, somewhat less conservative method proposed by Duncan (1957) modifies the Tukey multiple range method to allow for the fact that a small difference may be more significant than a large difference if it is based on larger sample sizes. Hochberg & Tamhane (1987) discuss the implementation of the Duncan modification and show that it is conservative in the unbalanced one-way layout. For modifications of the Dunnett procedure for unequal sample sizes, see Hochberg & Tamhane (1987). The methods must be modified when it cannot be assumed that within-treatment variances are equal. If variance heterogeneity is suspected, it is important to use a separate variance estimate for each sample mean difference or other contrast. The multiple comparison procedure should be based on the set of values of each mean difference or contrast divided by the square root of its estimated variance. The distribution of each can be approximated by a t distribution with estimated degrees of freedom (Welch 1938, Satterthwaite 1946). Tamhane (1979) and Dunnett (1980) compared a number of single-stage procedures based on these approximate t statistics; several of the procedures provided satisfactory error control. In one-way repeated measures designs (one factor within-subjects or subjects-by-treatments designs), the standard mixed model assumes sphericity of the treatment covariance matrix, equivalent to the assumption of equality of the variance of each difference between sample treatment means. Standard models for between-subjects-within-subjects designs have the added assumption of equality of the covariance matrices among the levels of the between-subjects factor(s). Keselman et al (1991b) give a detailed account of the calculation of appropriate test statistics when both these assumptions are violated and show in a simulation study that simple multiple comparison procedures based on these statistics have satisfactory properties (see also Keselman & Lix 1994).
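A minimal sketch of the separate-variance approach under heterogeneity (the function name and the toy data are mine): scipy's `ttest_ind` with `equal_var=False` computes exactly the Welch statistic with Satterthwaite-estimated degrees of freedom, and the resulting p-values can then be fed to any of the ordered p-value procedures discussed earlier.

```python
from itertools import combinations
from scipy.stats import ttest_ind

def welch_pairwise(groups):
    """Two-sided p-value for each pairwise mean difference, using a
    separate (Welch-Satterthwaite) variance estimate per pair rather
    than a common pooled MSW."""
    return {(i, j): ttest_ind(groups[i], groups[j], equal_var=False).pvalue
            for i, j in combinations(range(len(groups)), 2)}

groups = [[1.0, 2.0, 3.0, 4.0, 5.0],       # mean 3
          [1.1, 2.1, 2.9, 4.0, 5.0],       # nearly identical to group 0
          [11.0, 12.0, 13.0, 14.0, 15.0]]  # shifted far upward
pvals = welch_pairwise(groups)
print(pvals[(0, 1)] > 0.5, pvals[(0, 2)] < 0.001)   # True True
```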

OTHER ISSUES

Tests vs Confidence Intervals

The simple Bonferroni and the basic Scheffé, Tukey, and Dunnett methods described above are single-stage methods, and all have associated simultaneous confidence interval interpretations. When a confidence interval for a difference does not include zero, the hypothesis that the difference is zero is rejected, but the confidence interval gives more information by indicating the direction and something about the magnitude of the difference or, if the hypothesis is not rejected, the power of the procedure can be gauged by the width of the interval. In contrast, the multistage or stepwise procedures have no such straightforward confidence-interval interpretations, but more complicated intervals can sometimes be constructed. The first confidence-interval interpretation of a multistage procedure was given by Kim et al (1988), and Hayter & Hsu (1994) have described a general method for obtaining these intervals. The intervals are complicated in structure, and more assumptions are required for them to be valid than for conventional confidence intervals. Furthermore, although as a testing method a multistage procedure might be uniformly more powerful than a single-stage procedure, the confidence intervals corresponding to the former are sometimes less informative than those corresponding to the latter. Nonetheless, these are interesting results, and more along this line are to be expected.

Directional vs Nondirectional Inference
In the examples discussed above, most attention has been focused on simple contrasts, testing hypotheses H0: δij = 0 vs HA: δij ≠ 0. However, in most cases, if H0 is rejected, it is crucial to conclude either μi > μj or μi < μj. Different types of testing problems arise when direction of difference is considered: 1. Sometimes the interest is in testing one-sided hypotheses of the form μi ≤ μj vs μi > μj, e.g. if a new treatment is being tested to see whether it is better than a standard treatment, and there is no interest in pursuing the matter further if it is inferior. 2. In a two-sided hypothesis test, as formulated above, rejection of the hypothesis is equivalent to the decision μi ≠ μj. Is it appropriate to further conclude μi > μj if x̄i > x̄j and the opposite otherwise? 3. Sometimes there is an a priori ordering assumption μ1 ≤ μ2 ≤ ... ≤ μm, or some subset of these means are considered ordered, and the interest is in deciding whether some of these inequalities are strict. Each of these situations is different, and different considerations arise. An important issue in connection with the second and third problems mentioned above is whether it makes sense to even consider the possibility that the means under two different experimental conditions are equal. Some writers contend that a priori no difference is ever zero (for a recent defense of this position, see Tukey 1991, 1993). Others, including this author, believe that it is not necessary to assume that every variation in conditions must have an effect. In any case, even if one believes that a mean difference of zero is impossible, an intervention can have an effect so minute that it is essentially undetectable and unimportant, in which case the null hypothesis is reasonable as a practical way of framing the question. Whatever the views on this issue, the hypotheses in the second case described above are not correctly specified if directional decisions are desired.
One must consider, in addition to Type I and Type II errors, the probably more severe error of concluding a difference exists but making the wrong choice of direction. This has sometimes been called a Type III error and may be the most important or even the only concern in the second testing situation.


For methods with corresponding simultaneous confidence intervals, inspection of the intervals yields a directional answer immediately. For many multistage methods, the situation is less clear. Shaffer (1980) showed that an additional decision on direction in the second testing situation does not control the FWE of Type III for all test statistic distributions. Hochberg & Tamhane (1987) describe these results and others found by S Holm [unpublished manuscript (1979)] (for newer results, see Finner 1990). Other less powerful methods with guaranteed Type I and/or Type III FWE control have been developed by Spjøtvoll (1972), Holm [1979; improved and extended by Bauer et al (1986)], Bohrer (1979), Bofinger (1985), and Hochberg (1987). Some writers have considered methods for testing one-sided hypotheses of the third type discussed above (e.g. Marcus et al 1976, Spjøtvoll 1977, Berenson 1982). Budde & Bauer (1989) compare a number of such procedures both theoretically and via simulation. In another type of one-sided situation, Hsu (1981, 1984) introduced a method that can be used to test the set of primary hypotheses of the form Hi: μi is the largest mean. The tests are closely related to a one-sided version of the Dunnett method described above. They also relate the multiple testing literature to the ranking and selection literature.

Robustness

This is a necessarily brief look at robustness of methods based on the homogeneity of variance and normality assumptions of standard analysis of variance. Chapter 10 of Scheffé (1959) is a good source for basic theoretical results concerning these violations. As Tukey (1993) has pointed out, an amount of variance heterogeneity that affects an overall F test only slightly becomes a more serious concern when multiple comparison methods are used, because the variance of a particular comparison may be badly biased by use of a common estimated value.
Hochberg & Tamhane (1987) discuss the effects of variance heterogeneity on the error properties of tests based on the assumption of homogeneity. With respect to nonnormality, asymptotic theory ensures that with sufficiently large samples, results on Type I error and power in comparisons of means based on normally distributed observations are approximately valid under a wide variety of nonnormal distributions. (Results assuming normally distributed observations often are not even approximately valid under nonnormality, however, for inference on variances, covariances, and correlations.) This leaves the question of how large is large? In addition, alternative methods are more powerful than normal theory-based methods under many nonnormal distributions. Hochberg & Tamhane (1987, Chap. 9) discuss distribution-free and robust procedures and give references to many studies of the robustness of normal theory-based methods and of possible alternative methods for multiple comparisons. In addition, Westfall & Young (1993) give detailed guidance for using robust resampling methods to obtain appropriate error control.

Others

FREQUENTIST METHODS, BAYESIAN METHODS, AND META-ANALYSIS Frequentist methods control error without any assumptions about possible alternative values of parameters except for those that may be implied logically. Meta-analysis in its simplest form assumes that all hypotheses refer to the same parameter, and it combines results into a single statement. Bayes and empirical Bayes procedures are intermediate in that they assume some connection among parameters and base error control on that assumption. A major contributor to the Bayesian methods is Duncan (see e.g. Duncan 1961, 1965; Duncan & Dixon 1983). Hochberg & Tamhane (1987) describe Bayesian approaches (see Berry 1988). Westfall & Young (1993) discuss the relations among these three approaches.

DECISION-THEORETIC OPTIMALITY Lehmann (1957a,b), Bohrer (1979), and Spjøtvoll (1972) defined optimal multiple comparison methods based on frequentist decision-theoretic principles, and Duncan (1961, 1965) and coworkers developed optimal procedures from the Bayesian decision-theoretic point of view. Hochberg & Tamhane (1987) discuss these and other results.

RANKING AND SELECTION The methods of Dunnett (1955) and Hsu (1981, 1984), discussed above, form a bridge between the selection and multiple testing literature, and are discussed in relation to that literature in Hochberg & Tamhane (1987). Bechhofer et al (1989) describe another method that incorporates aspects of both approaches.

GRAPHS AND DIAGRAMS As with all statistical results, the results of multiple comparison procedures are often most clearly and comprehensively conveyed through graphs and diagrams, especially when a large number of tests is involved. Hochberg & Tamhane (1987) discuss a number of procedures. Duncan (1955) includes several illuminating geometric diagrams of acceptance regions, as do Tukey (1953) and Bohrer & Schervish (1980). Tukey (1953, 1991) describes a number of graphical methods for displaying differences among means (see also Hochberg et al 1982, Gabriel & Gheva 1982, Hsu & Peruggia 1994). Tukey (1993) suggests graphical methods for displaying interactions. Schweder & Spjøtvoll (1982) illustrate a graphical method for plotting large numbers of ordered p-values that can be used to help decide on the number of true hypotheses; this approach is used by Y Benjamini & Y Hochberg (manuscript submitted for publication) to develop a more powerful FDR-controlling method. See Hochberg & Tamhane (1987) for further references.

HIGHER-ORDER BONFERRONI AND OTHER INEQUALITIES One way to use partial knowledge of joint distributions is to consider higher-order Bonferroni inequalities in testing some of the intersection hypotheses, thus potentially increasing the power of FWE-controlling multiple comparison methods. The Bonferroni inequalities are derived from a general expression for the probability of the union of a number of events. The simple Bonferroni methods using individual p-values are based on the upper bound given by the first-order inequality. Second-order approximations use joint distributions of pairs of test statistics, third-order approximations use joint distributions of triples of test statistics, etc, thus forming a bridge between methods requiring only univariate distributions and those requiring the full multivariate distribution (see Hochberg & Tamhane 1987 for further references to methods based on second-order approximations; see also Bauer & Hackl 1985). Hoover (1990) gives results using third-order or higher approximations, and Glaz (1993) includes an extensive discussion of these inequalities (see also Naiman & Wynn 1992, Hoppe 1993a, Seneta 1993). Some approaches are based on the distribution of combinations of p-values (see Cameron & Eagleson 1985, Buckley & Eagleson 1986, Maurer & Mellein 1988, Rom & Connell 1994). Other types of inequalities are also useful in obtaining improved approximate methods (see Hochberg & Tamhane 1987, Appendix 2).
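The first- and second-order Bonferroni bounds can be illustrated numerically. The sketch below (function name mine, not from the sources) evaluates them for independent events, where the pairwise intersection probabilities and the exact union probability are simple products; for dependent tests the pairwise terms would instead come from the joint bivariate distributions.

```python
from itertools import combinations
from math import prod

def bonferroni_bounds(probs):
    """First- and second-order Bonferroni bounds on P(at least one of
    the events occurs), for independent events with the given marginal
    probabilities.  The first-order sum is an upper bound; subtracting
    the sum of pairwise intersection probabilities gives a lower bound.
    Returns (lower, exact, upper)."""
    first = sum(probs)                                     # upper bound
    pairs = sum(p * q for p, q in combinations(probs, 2))  # 2nd-order term
    exact = 1 - prod(1 - p for p in probs)
    return first - pairs, exact, first

lo, exact, hi = bonferroni_bounds([0.01] * 5)
print(lo <= exact <= hi)   # True: 0.049 <= 0.04901... <= 0.05
```

The narrowing of the interval from (0.049, 0.05) around the exact value shows why even the second-order correction can noticeably sharpen critical values when many tests are involved.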

WEIGHTS In the description of the simple Bonferroni method it was noted that each hypothesis Hi can be tested at any level αi with the FWE controlled at α = Σαi. In most applications, the αi are equal, but there may be reasons to prefer unequal allocation of error protection. For methods controlling FWE, see Holm (1979), Rosenthal & Rubin (1983), DeCani (1984), and Hochberg & Liberman (1994). Y Benjamini & Y Hochberg (manuscript submitted for publication) extend the FDR method to allow for unequal weights and discuss various purposes for differential weighting and alternative methods of achieving it.

OTHER AREAS OF APPLICATION Hypotheses specifying values of linear combinations of independent normal means other than contrasts can be tested jointly using the distribution of either the maximum modulus or the augmented range (for details, see Scheffé 1959). Hochberg & Tamhane (1987) discuss methods in analysis of covariance, methods for categorical data, methods for comparing variances, and experimental design issues in various areas. Cameron & Eagleson (1985) and Buckley & Eagleson (1986) consider multiple tests for significance of correlations. Gabriel (1968) and Morrison (1990) deal with methods
for multivariate multiple comparisons. Westfall & Young (1993, Chap. 4) discuss resampling methods in a variety of situations. The large literature on model selection in regression includes many papers focusing on the multiple testing aspects of this area.

CONCLUSION

The field of multiple hypothesis testing is too broad to be covered entirely in a review of this length; apologies are due to many researchers whose contributions have not been acknowledged. The problem of multiplicity is gaining increasing recognition, and research in the area is proliferating. The major challenge is to devise methods that incorporate some kind of overall control of Type I error while retaining reasonable power for tests of the individual hypotheses. This review, while sketching a number of issues and approaches, has emphasized recent research on relatively simple and general multistage testing methods that are providing progress in this direction.

ACKNOWLEDGMENTS

Research supported in part through the National Institute of Statistical Sciences by NSF Grant RED-9350005. Thanks to Yosef Hochberg, Lyle V. Jones, Erich L. Lehmann, Barbara A. Mellers, Seth D. Roberts, and Valerie S. L. Williams for helpful comments and suggestions.

Literature Cited

Ahmed SW. 1991. Issues arising in the application of Bonferroni procedures in federal surveys. 1991 ASA Proc. Surv. Res. Methods Sect., pp. 344-49
Bauer P, Hackl P. 1985. The application of Hunter's inequality to simultaneous testing. Biometr. J. 27:25-38
Bauer P, Hackl P, Hommel G, Sonnemann E. 1986. Multiple testing of pairs of one-sided hypotheses. Metrika 33:121-27
Bauer P, Hommel G, Sonnemann E, eds. 1988. Multiple Hypothesenprüfung. (Multiple Hypotheses Testing.) Berlin: Springer-Verlag (In German and English)
Bechhofer RE. 1952. The probability of a correct ranking. Ann. Math. Stat. 23:139-40

Bechhofer RE, Dunnett CW, Tamhane AC. 1989. Two-stage procedures for comparing treatments with a control: elimination at the first stage and estimation at the second stage. Biometr. J. 31:545-61
Begun J, Gabriel KR. 1981. Closure of the Newman-Keuls multiple comparison procedure. J. Am. Stat. Assoc. 76:241-45
Benjamini Y, Hochberg Y. 1994. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. In press
Berenson ML. 1982. A comparison of several k sample tests for ordered alternatives in completely randomized designs. Psychometrika 47:265-80 (Corr. 535-39)

Berry DA. 1988. Multiple comparisons, multiple tests, and data dredging: a Bayesian perspective (with discussion). In Bayesian Statistics, ed. JM Bernardo, MH DeGroot, DV Lindley, AFM Smith, 3:79-94. London: Oxford Univ. Press
Bofinger E. 1985. Multiple comparisons and Type III errors. J. Am. Stat. Assoc. 80:433-37
Bohrer R. 1979. Multiple three-decision rules for parametric signs. J. Am. Stat. Assoc. 74:432-37
Bohrer R, Schervish MJ. 1980. An optimal multiple decision rule for signs of parameters. Proc. Natl. Acad. Sci. USA 77:52-56
Booth JG. 1994. Review of "Resampling Based Multiple Testing." J. Am. Stat. Assoc. 89:354-55
Braun HI, ed. 1994. The Collected Works of John W. Tukey. Vol. VIII: Multiple Comparisons: 1948-1983. New York: Chapman & Hall
Braun HI, Tukey JW. 1983. Multiple comparisons through orderly partitions: the maximum subrange procedure. In Principals of Modern Psychological Measurement: A Festschrift for Frederic M. Lord, ed. H Wainer, S Messick, pp. 55-65. Hillsdale, NJ: Erlbaum
Buckley MJ, Eagleson GK. 1986. Assessing large sets of rank correlations. Biometrika 73:151-57
Budde M, Bauer P. 1989. Multiple test procedures in clinical dose finding studies. J. Am. Stat. Assoc. 84:792-96
Cameron MA, Eagleson GK. 1985. A new procedure for assessing large sets of correlations. Aust. J. Stat. 27:84-95
Chaubey YP. 1993. Review of "Resampling Based Multiple Testing." Technometrics 35:450-51
Conforti M, Hochberg Y. 1987. Sequentially rejective pairwise testing procedures. J. Stat. Plan. Infer. 17:193-208
Cournot AA. 1843. Exposition de la Théorie des Chances et des Probabilités. Paris: Hachette. Reprinted 1984 as Vol. 1 of Cournot's Oeuvres Complètes, ed. B Bru. Paris: Vrin
DeCani JS. 1984. Balancing Type I risk and loss of power in ordered Bonferroni procedures. J. Educ. Psychol. 76:1035-37
Diaconis P. 1985. Theories of data analysis: from magical thinking through classical statistics.
In Exploring Data Tables, Trends, and Shapes, ed. DC Hoaglin, F Mosteller, JW Tukey, pp. 1-36. New York: Wiley
Duncan DB. 1951. A significance test for differences between ranked treatments in an analysis of variance. Va. J. Sci. 2:172-89
Duncan DB. 1955. Multiple range and multiple F tests. Biometrics 11:1-42
Duncan DB. 1957. Multiple range tests for cor-




related and heteroscedastic means.Biometrics 13:164-76 Duncan DB. 1961. Bayes rules for a common multiple comparisons problem and related Student-t problems. Ann. Math. Stat. 32: 1013-33 Duncan DB. 1965. A Bayesian approach to multiple comparisons. Technometrics 7: 171-222 DuncanDB,DixonDO.1983. k-ratio t tests, t intervals, and point estimates for multiple comparisons.In Encyclopediaof Statistical Sciences, ed. S Kotz, NLJohnson, 4: 40310. NewYork: Wiley Dunnett CW.1955. A multiple comparison procedure for comparingseveral treatments with a control. J. Am. Stat. Assoc. 50: 1096-1121 Dunnett CW.1980. Pairwise multiple comparisons in the unequal variance case. J. Am. Stat. Assoc. 75:796-800 Dunaett CW,Tamhane AC. 1992. A step-up multiple test procedure.J. Am.Stat. Assoc. 87:162-70 Einot I, Gabriel KR.1975. Astudy of the powers of several methodsin multiple comparisons. J. Am. Stat. Assoc. 70:574-83 Finner H. 1988. Abgeschlossene Spannweitentests (Closed multiple range tests). See Bauer et al 1988, pp. 10-32 (In German) Finner H. 1990. Onthe modified S-methodand directional errors. Commun. Stat. Part A: Theory Methods 19:41-53 Fligner MA.1984. Anote on two-sided distribution-free treatment versus control multiple comparisons. J. Am. Stat. Assoc. 79: 208-11 Gabriel KR. 1968. Simultaneous test procedures in multivariate analysis of variance. Biometrika 55:489-504 Gabriel KR. 1969. Simultaneous test procedures-some theory of multiple comparisons. Ann. Math. Stat. 40:224-50 Gabriel KR. 1978. Commenton the paper by Ramsey.J. Am. Stat. Assoc. 73:485-87 Gabriel KR, GhevaD. 1982, Somenew simultaneous confidence intervals in MANOVA and their geometric representation and graphical display. In ExperimentalDesign, Statistical Models,and Genetic Statistics, ed. K Hinkelmann, pp. 239-75. NewYork: Dekker Gaffan EA. 1992. Review of "Multiple Comparisons for Researchers." Br. J. Math. Stat. PsychoL 45:334-35 Glaz J. 1993. Approximatesimultaneous confidence intervals. 
See Hoppe 1993b, pp. 149-66 Grechanovsky E. 1993. Comparing stepdown multiple comparisonprocedures. Presented at Annu.Jt. Stat. Meet., 153rd, San Francisco Harter HL. 1980. Early history of multiple comparison tests. In Handbookof Statis-



tics, ed. PRKrishnaiah, 1:617-22. Amsterdam: North-Holland Hartley HO. 1955. Somerecent developments in analysis of variance. Commun.Pure Appl. Math. 8:47-72 Hayter AJ, HsuJC. 1994. On the relationship between stepwise decision procedures and confidence sets. J. Am. Stat. Assoc. 89: 128-36 Hochberg Y. 1987. Multiple classification rules for signs of parameters.J. Stat. Plan. Infer. 15:177-88 HochbergY. 1988. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800-3 Hochberg Y, Liberman U. 1994. An extended Simestest. Stat. Prob.Lett. In press Hochberg Y, RomD. 1994. Extensions of multiple testing procedures based on Simes’ test. J. Stat. Plan.Infer. In press Hochberg Y, Tamhane AC. 1987. Multiple Comparison Procedures. New York: Wiley Hochberg Y, Weiss G, Hart S. 1982. On graphical procedures for multiple comparisons. J. Am. Stat. Assoc. 77:767-72 blolland B. 1991. Onthe application of three modified Bonferroni procedures to pairwise multiple comparisons in balanced repeated measuresdesigns. Comput.Stat. Q. 6:21%31.(Corr. 7:223) Holland BS, Copenhaver MD. 1987. An improved sequentially rejective Bonferroni test procedure. Biometrics 43:417-23. (Corr:43:737) Holland BS, Copenhaver MD.1988. Improved Bonferroni-type multiple testing procedures. Psychol. Bull 104:145-49 HolmS. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6: 65-70 HolmS. 1990. Review of "Multiple Hypothesis Testing." Metrika 37:206 Hommel G. 1986. Multiple test procedures for arbitrary dependence structures. Metrika 33:321-36 Hommel G. 1988. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75:383-86 HommelG. 1989. A comparison of two modified Bonferroni procedures. Biometrika 76: 624-25 Hoover DR. 1990. Subset complememaddition upper bounds--an improved inclusion-exclusion method.J. Stat. Plan. Infer. 24:195-202 Hoppe FM. 1993a. 
Hoppe FM. 1993a. Beyond inclusion-and-exclusion: natural identities for P[exactly t events] and P[at least t events] and resulting inequalities. Int. Stat. Rev. 61:435-46
Hoppe FM, ed. 1993b. Multiple Comparisons, Selection, and Applications in Biometry. New York: Dekker
Hsu JC. 1981. Simultaneous confidence intervals for all distances from the 'best'. Ann. Stat. 9:1026-34
Hsu JC. 1984. Constrained simultaneous confidence intervals for multiple comparisons with the best. Ann. Stat. 12:1136-44
Hsu JC. 1996. Multiple Comparisons: Theory and Methods. New York: Chapman & Hall. In press
Hsu JC, Peruggia M. 1994. Graphical representations of Tukey's multiple comparison method. J. Comput. Graph. Stat. 3:143-61
Keselman HJ, Keselman JC, Games PA. 1991a. Maximum familywise Type I error rate: the least significant difference, Newman-Keuls, and other multiple comparison procedures. Psychol. Bull. 110:155-61
Keselman HJ, Keselman JC, Shaffer JP. 1991b. Multiple pairwise comparisons of repeated measures means under violation of multisample sphericity. Psychol. Bull. 110:162-70
Keselman HJ, Lix LM. 1994. Improved repeated-measures stepwise multiple comparison procedures. J. Educ. Stat. In press
Kim WC, Stefansson G, Hsu JC. 1988. On confidence sets in multiple comparisons. In Statistical Decision Theory and Related Topics IV, ed. SS Gupta, JO Berger, 2:89-104. New York: Academic
Klockars AJ, Hancock GR. 1992. Power of recent multiple comparison procedures as applied to a complete set of planned orthogonal contrasts. Psychol. Bull. 111:505-10
Klockars AJ, Sax G. 1986. Multiple Comparisons. Newbury Park, CA: Sage
Kramer CY. 1956. Extension of multiple range tests to group means with unequal numbers of replications. Biometrics 12:307-10
Kunert J. 1990. On the power of tests for multiple comparison of three normal means. J. Am. Stat. Assoc. 85:808-12
Läuter J. 1990. Review of "Multiple Hypotheses Testing." Comput. Stat. Q. 5:333
Lehmann EL. 1957a. A theory of some multiple decision problems. I. Ann. Math. Stat. 28:1-25
Lehmann EL. 1957b. A theory of some multiple decision problems. II. Ann. Math. Stat. 28:547-72
Lehmann EL, Shaffer JP. 1979. Optimum significance levels for multistage comparison procedures. Ann. Stat. 7:27-45
Levin JR, Serlin RC, Seaman MA. 1994. A controlled, powerful multiple-comparison strategy for several situations. Psychol. Bull. 115:153-59
Littell RC. 1989. Review of "Multiple Comparison Procedures." Technometrics 31:261-62
Marcus R, Peritz E, Gabriel KR. 1976. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 63:655-60
Maurer W, Mellein B. 1988. On new multiple tests based on independent p-values and the assessment of their power. See Bauer et al 1988, pp. 48-66
Miller RG. 1966. Simultaneous Statistical Inference. New York: Wiley
Miller RG. 1977. Developments in multiple comparisons 1966-1976. J. Am. Stat. Assoc. 72:779-88
Miller RG. 1981. Simultaneous Statistical Inference. New York: Wiley. 2nd ed.
Morrison DF. 1990. Multivariate Statistical Methods. New York: McGraw-Hill. 3rd ed.
Mosteller F. 1948. A k-sample slippage test for an extreme population. Ann. Math. Stat. 19:58-65
Naiman DQ, Wynn HP. 1992. Inclusion-exclusion-Bonferroni identities and inequalities for discrete tube-like problems via Euler characteristics. Ann. Stat. 20:43-76
Nair KR. 1948. Distribution of the extreme deviate from the sample mean. Biometrika 35:118-44
Nowak R. 1994. Problems in clinical trials go far beyond misconduct. Science 264:1538-41
Paulson E. 1949. A multiple decision procedure for certain problems in the analysis of variance. Ann. Math. Stat. 20:95-98
Peritz E. 1989. Review of "Multiple Comparison Procedures." J. Educ. Stat. 14:103-6
Ramsey PH. 1978. Power differences between pairwise multiple comparisons. J. Am. Stat. Assoc. 73:479-85
Ramsey PH. 1981. Power of univariate pairwise multiple comparison procedures. Psychol. Bull. 90:352-66
Rasmussen JL. 1993. Algorithm for Shaffer's multiple comparison tests. Educ. Psychol. Meas. 53:329-35
Richmond J. 1982. A general method for constructing simultaneous confidence intervals. J. Am. Stat. Assoc. 77:455-60
Rom DM. 1990. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika 77:663-65
Rom DM, Connell L. 1994. A generalized family of multiple test procedures. Commun. Stat. Part A: Theory Methods, 23. In press
Rom DM, Holland B. 1994. A new closed multiple testing procedure for hierarchical families of hypotheses. J. Stat. Plan. Infer. In press
Rosenthal R, Rubin DB. 1983. Ensemble-adjusted p values. Psychol. Bull. 94:540-41
Roy SN, Bose RC. 1953. Simultaneous confidence interval estimation. Ann. Math. Stat. 24:513-36
Royen T. 1989. Generalized maximum range tests for pairwise comparisons of several populations. Biometr. J. 31:905-29
Royen T. 1990. A probability inequality for ranges and its application to maximum range test procedures. Metrika 37:145-54
Ryan TA. 1959. Multiple comparisons in psychological research. Psychol. Bull. 56:26-47
Ryan TA. 1960. Significance tests for multiple comparison of proportions, variances, and other statistics. Psychol. Bull. 57:318-28
Satterthwaite FE. 1946. An approximate distribution of estimates of variance components. Biometrics 2:110-14
Scheffé H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40:87-104
Scheffé H. 1959. The Analysis of Variance. New York: Wiley
Scheffé H. 1970. Multiple testing versus multiple estimation. Improper confidence sets. Estimation of directions and ratios. Ann. Math. Stat. 41:1-19
Schweder T, Spjøtvoll E. 1982. Plots of P-values to evaluate many tests simultaneously. Biometrika 69:493-502
Seeger P. 1968. A note on a method for the analysis of significances en masse. Technometrics 10:586-93
Seneta E. 1993. Probability inequalities and Dunnett's test. See Hoppe 1993b, pp. 29-45
Shafer G, Olkin I. 1983. Adjusting p values to account for selection over dichotomies. J. Am. Stat. Assoc. 78:674-78
Shaffer JP. 1977. Multiple comparisons emphasizing selected contrasts: an extension and generalization of Dunnett's procedure. Biometrics 33:293-303
Shaffer JP. 1980. Control of directional errors with stagewise multiple test procedures. Ann. Stat. 8:1342-48
Shaffer JP. 1981. Complexity: an interpretability criterion for multiple comparisons. J. Am. Stat. Assoc. 76:395-401
Shaffer JP. 1986. Modified sequentially rejective multiple test procedures. J. Am. Stat. Assoc. 81:826-31
Shaffer JP. 1988. Simultaneous testing. In Encyclopedia of Statistical Sciences, ed. S Kotz, NL Johnson, 8:484-90. New York: Wiley
Shaffer JP. 1991. Probability of directional errors with disordinal (qualitative) interaction. Psychometrika 56:29-38
Simes RJ. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-54
Sorić B. 1989. Statistical "discoveries" and effect-size estimation. J. Am. Stat. Assoc. 84:608-10
Spjøtvoll E. 1972. On the optimality of some multiple comparison procedures. Ann. Math. Stat. 43:398-411
Spjøtvoll E. 1977. Ordering ordered parameters. Biometrika 64:327-34
Stigler SM. 1986. The History of Statistics. Cambridge: Harvard Univ. Press
Tamhane AC. 1979. A comparison of procedures for multiple comparisons of means with unequal variances. J. Am. Stat. Assoc. 74:471-80
Tatsuoka MM. 1992. Review of "Multiple Comparisons for Researchers." Contemp. Psychol. 37:775-76
Toothaker LE. 1991. Multiple Comparisons for Researchers. Newbury Park, CA: Sage
Toothaker LE. 1993. Multiple Comparison Procedures. Newbury Park, CA: Sage
Tukey JW. 1949. Comparing individual means in the analysis of variance. Biometrics 5:99-114
Tukey JW. 1952. Reminder sheets for "Multiple Comparisons." See Braun 1994, pp. 341-45
Tukey JW. 1953. The problem of multiple comparisons. See Braun 1994, pp. 1-300
Tukey JW. 1991. The philosophy of multiple comparisons. Stat. Sci. 6:100-16
Tukey JW. 1993. Where should multiple comparisons go next? See Hoppe 1993b, pp. 187-207
Welch BL. 1938. The significance of the difference between two means when the population variances are unequal. Biometrika 29:350-62
Welsch RE. 1977. Stepwise multiple comparison procedures. J. Am. Stat. Assoc. 72:566-75
Westfall PH, Young SS. 1993. Resampling-based Multiple Testing. New York: Wiley
Wright SP. 1992. Adjusted p-values for simultaneous inference. Biometrics 48:1005-13
Ziegel ER. 1994. Review of "Multiple Comparisons, Selection, and Applications in Biometry." Technometrics 36:230-31