Comparison of false discovery rate methods in identifying genes with differential expression


Method

Hui-Rong Qian, Shuguang Huang*

Statistics and Information Science, Lilly Corporate Center, Eli Lilly and Company, Indianapolis, IN 46285, USA

Received 1 February 2005; accepted 11 June 2005. Available online 27 July 2005.

* Corresponding author. Fax: +1 317 277 3220. E-mail address: [email protected] (S. Huang).

doi:10.1016/j.ygeno.2005.06.007

Abstract

Current high-throughput techniques such as microarrays in genomics or mass spectrometry in proteomics usually generate thousands of hypotheses to be tested simultaneously. The usual purpose of these techniques is to identify a subset of interesting cases that deserve further investigation. As a consequence, the control of false positives among the tests called "significant" becomes a critical issue for researchers. Over the past few years, several false discovery rate (FDR)-controlling methods have been proposed; each method favors certain scenarios and is introduced with the purpose of improving the control of FDR at the targeted level. In this paper, we compare the performance of five FDR-controlling methods proposed by Benjamini et al., the qvalue method proposed by Storey, and the traditional Bonferroni method. The purpose is to investigate the "observed" sensitivity of each method on typical microarray experiments in which the majority (or all) of the truth is unknown. Based on two well-studied microarray datasets, it is found that, in terms of the "apparent" test power, the ranking of the FDR methods is Step-down < Step-up dependent < Step-up one-stage (BH95) < Step-up adaptive < qvalue. The BH95 method shows the best control of FDR at the target level. It is our hope that the observed results can provide some insight into the application of different FDR methods in microarray data analysis.

© 2005 Elsevier Inc. All rights reserved.

Keywords: False discovery rate; Microarray

Contents

Methods for false discovery rate
Material and results
  Example 1
  Example 2
Discussion
Acknowledgments
References

Current high-throughput techniques such as microarrays in genomics or mass spectrometry in proteomics usually generate thousands or even more hypotheses to be tested at the same time. For example, Affymetrix's new version of the human genome GeneChip (U133 Plus 2.0) allows the monitoring of the expression of about 55,000 probe sets (representing ~40,000 genes) simultaneously. Most of these high-throughput techniques are used for screening purposes, in that the goal of these applications is to identify a subset of interesting cases that deserve further investigation. Generally, the findings from such studies need to be followed up by other technologies (oftentimes of higher accuracy but on a smaller scale due to cost); the validated findings are then further investigated. For instance, the interesting genes identified by microarray technology are deemed putative biomarkers and are generally confirmed by RT-PCR and/or followed up by other technologies. Another example is that the potential biomarkers discovered by mass spectrometry may be followed up by assays such as enzyme-linked immunosorbent assay (ELISA) or Western blot for validation purposes.

Due to the nature of these high-throughput technologies, the concern over multiplicity in this type of study is no longer the control of the traditional family-wise error rate (FWER), which controls the false positive rate under all possible configurations of the true and false hypotheses. Instead, the focus is on controlling the proportion of false positives among the tests called "significant": the false among the "discoveries" or the "interesting cases". This challenge prompts a fresh view of the conventional multiplicity problem. Since the seminal paper by Benjamini and Hochberg [1], the concept of false discovery rate (FDR) has been widely accepted in large-scale data analysis, and it is now generally used as the criterion for controlling the proportion of false positives among the identified interesting cases. This is especially important in screening studies that are preliminary to more detailed confirmatory investigations, where the purpose is to identify a subset of cases that can later be validated with a high success rate.

In the paper of Benjamini and Hochberg [1], FDR is defined as the expected rate of erroneously rejected hypotheses among the total rejected hypotheses. In this paper, we follow the approach of Genovese and Wasserman [9] and focus on the realized FDR. Suppose that m hypotheses are tested, and let R denote the total number of rejected null hypotheses. The layout of a typical multiple testing problem is given in Table 1, where $m_0$ is the unknown number of true null cases and $m_1$ is the unknown number of nontrue null cases. The realized FDR is defined as

$$\mathrm{FDR} = \begin{cases} N_{1|0}/R & \text{if } R > 0, \\ 0 & \text{if } R = 0. \end{cases}$$

In comparison, the conventional per-comparison error rate (PCER) is given as

$$\mathrm{PCER} = E\!\left(\frac{N_{1|0}}{m}\right),$$

and FWER is defined as $\mathrm{FWER} = P(N_{1|0} \ge 1)$.

Table 1. The setting for multiple tests

Hypothesis       Accept      Reject      Total
Null true        N_{0|0}     N_{1|0}     m_0
Null not true    N_{0|1}     N_{1|1}     m_1
Total            m - R       R           m

As is easily seen from the definition above, the realized FDR is the rate of false positives among the rejected hypotheses. Therefore, FDR-controlling methods control the rate of false rejections among the cases called significant, rather than controlling the FWER over all tested hypotheses. This is very appealing in most large-scale data analysis due to the nature of the investigation explained earlier. Under the complete null, i.e., when all the null hypotheses are true, the control of FDR is equivalent to the control of FWER. When the null hypotheses are not all true, FDR is less stringent than FWER; thus FDR procedures are more powerful than procedures controlling FWER. It should be emphasized that the main point is not the higher testing power of the FDR methods; the true novelty and appeal of FDR is that it controls the error rate relevant to high-throughput studies, namely the false among the discoveries.

The FDR-controlling approach developed by Benjamini and Hochberg [1] (called the Step-up one-stage or BH95 method below) guarantees that the expected value of the FDR is controlled under the targeted value. Mathematically, one immediate concern with the BH95 method is its conservativeness. It has been proved that the expected value of the realized FDR (and even its upper bound) is actually lower than the targeted value when there are nonnull hypotheses, i.e.,

$$E(\mathrm{FDR}) \le \frac{m_0}{m}\, q = \pi_0 q,$$

where $\pi_0$ is the proportion of true null cases and q is the targeted false discovery rate. This result holds regardless of how many null hypotheses are true and regardless of the distribution of the p values under the alternatives. As a trade-off for this robustness, the approach is underpowered when the hypotheses are not all true. In view of this property, several FDR methods have been introduced to improve the control of the FDR at a level closer to the targeted value. For example, several methods have been proposed to "differentiate" the observed p values (or test statistics) into those that come from true nulls and those that come from nontrue nulls. Several methods consider the observed p values (or test statistics) to arise from a mixture model consisting of a component $f_0$ corresponding to the null cases and a component $f_1$ corresponding to the nonnull cases [7,9,14,15]. The density of the p values can then be expressed as

$$f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_1(p).$$
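To make the quantities in Table 1 and the mixture density concrete, the following sketch (in Python, our illustration language, not the authors') simulates p values from a two-component mixture and tallies $N_{1|0}$, R, and the realized FDR for a naive fixed cutoff. The values of $\pi_0$, the Beta alternative, and the cutoff are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, pi0, cutoff = 7129, 0.8, 0.05          # assumed values for illustration
null = rng.random(m) < pi0                # which hypotheses are true nulls
p = np.where(null,
             rng.random(m),               # f0: Uniform(0,1) under the null
             rng.beta(0.5, 10.0, m))      # f1: an assumed Beta alternative

reject = p <= cutoff
R = reject.sum()                          # total rejections
N10 = (reject & null).sum()               # false rejections, N_{1|0}
fdr_realized = N10 / R if R > 0 else 0.0  # realized FDR as defined above
print(R, N10, fdr_realized)
```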

Biologically, it is well accepted that there exist interactions and coregulation among genes. In terms of mRNA expression, this coregulation can be reflected in the correlation (positive or negative, with varying strength) of the expression values. In accordance with the complexity of the data structure caused by the complicated biology, several FDR-controlling methods have been developed to cope with special situations. In particular, methods have been developed to handle situations in which the test statistics might be positively correlated, or in which only a small portion of the cases is expected to be nonnull. Moreover, methods have been developed for the even more complicated scenarios in which the nonnull cases have varying effect sizes (composite alternatives). However, the true underlying distributional structure of the data almost always remains unknown to us. Even though there are some vague and general beliefs, such as the existence of association and coregulation among some genes, and we may entertain an educated conjecture regarding the percentage of genes regulated by a certain drug, at best we have only very limited knowledge of the true structure. Therefore, in practice the methods used tend to be "empirical" and data driven. Based on the data observed, we have to rely on our judgment and choose the FDR method we deem most appropriate. It is thus our research interest to investigate the stringency and sensitivity of several currently available FDR methods on typical microarray experiments in which the majority (or all) of the truth is unknown. In this paper, we adopt the concept initiated by Pavlidis et al. [13] of using the observed number of significance calls as the measure of "apparent power", defined as the proportion (or number) of tests called significant at a chosen significance level. In particular, we consider five methods proposed by Benjamini et al. and the qvalue method by Storey. The Bonferroni method is also included for comparison purposes. It is our hope that these findings can provide some general insight into the application of different FDR methods in microarray data analysis.

Methods for false discovery rate

In this section we describe several FDR methods proposed by Benjamini et al. and the qvalue method proposed by Storey. A brief description of the decision rule is given for each FDR-controlling procedure, followed by the algorithm for computing adjusted p values from the observed p values. Suppose that m hypotheses $H_i$ with corresponding p values $P_i$, i = 1, 2, ..., m, are being tested. Let $P_{(i)}$ be the ith p value ranked from smallest to largest, with the corresponding hypothesis denoted $H_{(i)}$, and let the adjusted p value be denoted $\tilde P$. The following methods control FDR at level q.

1. Step-up one-stage (BH95) [1,6].
Decision rule: Let $k = \max\{i : P_{(i)} \le \frac{i}{m}\, q\}$. Reject all $H_{(j)}$, j = 1, 2, ..., k.
Algorithm:
(i) $\tilde P_{(m)} = P_{(m)}$;
(ii) $\tilde P_{(i)} = \min\left(\frac{m}{i} P_{(i)},\ \tilde P_{(i+1)}\right)$, for i = 1, 2, ..., m − 1.
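A minimal sketch of the BH95 adjustment as given in algorithm (i)-(ii): scale the sorted p values by m/i, then enforce monotonicity from the largest p value downward. The function name is our assumption; rejecting every hypothesis whose adjusted p value is at most q reproduces the decision rule.

```python
import numpy as np

def bh95_adjust(p):
    """Step-up one-stage (BH95) adjusted p values, per algorithm (i)-(ii)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)                            # ranks P(1) <= ... <= P(m)
    scaled = p[order] * m / np.arange(1, m + 1)      # (m/i) * P(i)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # P~(i) = min(scaled_i, P~(i+1))
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)                # return in the original order
    return out
```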


2. Step-up dependent (updpd) [6].
Decision rule: Let $k = \max\left\{i : P_{(i)} \le \frac{i}{m \sum_{j=1}^{m} 1/j}\, q\right\}$. Reject all $H_{(j)}$, j = 1, 2, ..., k.
Algorithm:
(i) $\tilde P_{(m)} = \min\left(P_{(m)} \sum_{j=1}^{m} \frac{1}{j},\ 1\right)$;
(ii) $\tilde P_{(i)} = \min\left(\frac{m}{i} P_{(i)} \sum_{j=1}^{m} \frac{1}{j},\ \tilde P_{(i+1)}\right)$, for i = 1, 2, ..., m − 1.
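The dependent adjustment differs from BH95 only by the harmonic-sum factor, so a sketch can reuse `bh95_adjust` from the previous block (a convenience of this illustration, not part of the paper):

```python
import numpy as np

def updpd_adjust(p):
    """Step-up dependent (updpd) adjusted p values: BH95 times sum_{j=1}^m 1/j."""
    m = np.asarray(p).size
    c_m = np.sum(1.0 / np.arange(1, m + 1))   # harmonic correction factor
    return np.minimum(bh95_adjust(p) * c_m, 1.0)
```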

3. Step-up adaptive (adapt) [2].
Decision rule: If $P_{(i)} \ge \frac{i}{m}\, q$ for all i, then accept all $H_i$. Otherwise, let $S_i = \frac{1 - P_{(i)}}{m + 1 - i}$, $l = \min\{i : S_i < S_{i-1}\}$, $\hat m_0 = \min\left(\frac{1}{S_l} + 1,\ m\right)$, and $k = \max\left\{i : P_{(i)} \le \frac{i}{\hat m_0}\, q\right\}$. Reject all $H_{(j)}$, j = 1, 2, ..., k.
Algorithm:
(i) $\tilde P_{(i)} = P_{(i)} \cdot \frac{m}{i}$.
(ii) If $\min_i \tilde P_{(i)} \ge q$, then use the step-up one-stage approach to get the adjusted p values: $\tilde P_{(m)} = P_{(m)}$, $\tilde P_{(i)} = \min\left(\frac{m}{i} P_{(i)},\ \tilde P_{(i+1)}\right)$, for i = 1, 2, ..., m − 1. Stop.
(iii) Otherwise, $S_i = \frac{1 - P_{(i)}}{m + 1 - i}$, $l = \min\{i : S_i < S_{i-1}\}$, and $\hat m_0 = \min\left(\frac{1}{S_l} + 1,\ m\right)$.
(iv) $\tilde P_{(m)} = \min\left(\frac{\hat m_0}{m} P_{(m)},\ 1\right)$.
(v) $\tilde P_{(i)} = \min\left(\frac{\hat m_0}{i} P_{(i)},\ \tilde P_{(i+1)}\right)$, for i = 1, 2, ..., m − 1.

4. Step-down independent (dwnind) [4].
Decision rule: Let $c_i = 1 - \left[1 - \min\left(1,\ \frac{m}{m-i+1}\, q\right)\right]^{1/(m-i+1)}$ and $k = \min\{i : P_{(i)} > c_i\}$. Reject all $H_{(j)}$, j = 1, 2, ..., k − 1.
Algorithm:
(i) $\tilde P_{(1)} = 1 - \left(1 - P_{(1)}\right)^m$;
(ii) $\tilde P_{(i)} = \max\left(\frac{m-i+1}{m}\left[1 - \left(1 - P_{(i)}\right)^{m-i+1}\right],\ \tilde P_{(i-1)}\right)$, for i = 2, ..., m.

5. Step-down distribution free (dwnfree) [5].
Decision rule: Let $c_i = \min\left(1,\ \frac{m}{(m-i+1)^2}\, q\right)$ and $k = \min\{i : P_{(i)} > c_i\}$. Reject all $H_{(j)}$, j = 1, 2, ..., k − 1.
Algorithm:
(i) $\tilde P_{(1)} = m\, P_{(1)}$;
(ii) $\tilde P_{(i)} = \max\left(\frac{(m-i+1)^2}{m} P_{(i)},\ \tilde P_{(i-1)}\right)$, for i = 2, ..., m;
(iii) $\tilde P_{(i)} = \min\left(\tilde P_{(i)},\ 1\right)$.
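A hedged sketch of the adaptive procedure above. The integer rounding of $1/S_l + 1$, the fallback to $\hat m_0 = m$ when the slopes never decrease, and the function name are our assumptions; the printed algorithm does not specify them.

```python
import numpy as np

def adapt_adjust(p, q=0.05):
    """Step-up adaptive (adapt) adjusted p values, per the algorithm above."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ps = p[order]
    prelim = ps * m / np.arange(1, m + 1)          # step (i): P(i) * m / i
    if prelim.min() >= q:                          # step (ii): fall back to BH95
        m0 = m
    else:                                          # step (iii): estimate m0 from slopes
        s = (1.0 - ps) / (m + 1 - np.arange(1, m + 1))
        drop = np.nonzero(s[1:] < s[:-1])[0]       # first i with S_i < S_{i-1}
        m0 = m if drop.size == 0 else min(int(1.0 / s[drop[0] + 1]) + 1, m)
    scaled = ps * m0 / np.arange(1, m + 1)         # steps (iv)-(v): scale by m0/i
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out
```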


6. qvalue (qvalue) [15].
Decision rule: Reject $H_{(j)}$, j = 1, ..., k, with $k = \max\{j : \hat q(P_{(j)}) \le q\}$.
Algorithm:
(i) Choose a value for $\lambda$ (see Storey [15] for optimizing the choice of $\lambda$). Estimate $\pi_0$ by
$$\hat\pi_0(\lambda) = \frac{\#\{P_i > \lambda\}}{(1-\lambda)\, m}.$$
(ii) For any rejection region of interest $[0, \gamma]$, estimate $\mathrm{pFDR}(\gamma)$ by
$$\widehat{\mathrm{pFDR}}_\lambda(\gamma) = \frac{\hat\pi_0(\lambda)\,\gamma}{\widehat{\Pr}(P \le \gamma)\,\{1 - (1-\gamma)^m\}}, \qquad \text{where } \widehat{\Pr}(P \le \gamma) = \frac{\max(\#\{P_i \le \gamma\},\ 1)}{m}.$$
(iii) Set $\hat q(P_{(m)}) = \widehat{\mathrm{pFDR}}(P_{(m)})$.
(iv) Set $\hat q(P_{(i)}) = \min\left\{\widehat{\mathrm{pFDR}}(P_{(i)}),\ \hat q(P_{(i+1)})\right\}$, for i = m − 1, m − 2, ..., 1.
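A compact sketch of the q-value algorithm just listed, with $\lambda$ fixed at 0.5 rather than optimized as in Storey [15]; the stable evaluation of $1 - (1-\gamma)^m$ and the treatment of ties are our choices, not the paper's.

```python
import numpy as np

def qvalues(p, lam=0.5):
    """Storey q values per steps (i)-(iv) above, evaluated at gamma = P(i)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    pi0 = np.sum(p > lam) / ((1.0 - lam) * m)         # step (i): pi0 estimate
    order = np.argsort(p)
    ps = np.clip(p[order], 1e-300, 1.0)               # guard against gamma = 0
    pr = np.arange(1, m + 1) / m                      # Pr^(P <= P(i)), ties ignored
    denom = -np.expm1(m * np.log1p(-ps))              # 1 - (1 - gamma)^m, stably
    pfdr = pi0 * ps / (pr * denom)                    # step (ii)
    q = np.minimum.accumulate(pfdr[::-1])[::-1]       # steps (iii)-(iv)
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out
```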

Material and results

Example 1

We apply the methods to the leukemia data of Golub et al. [16], which consist of 27 acute lymphoblastic leukemia (ALL) samples and 11 acute myeloid leukemia (AML) samples. The goal of the research is to find genes with differential expression between ALL and AML. In this experiment, RNA prepared from bone marrow mononuclear cells was hybridized to Affymetrix HuGeneFL (HU6800) arrays. Each array contains 7129 probe sets representing 6817 human genes. A simple two-sample t test (assuming homogeneity of variance) is performed on each probe set separately, and the resulting p values are pooled and adjusted using the FDR procedures discussed above. The numbers of significance calls, using cutoffs 0.01 and 0.05, are given in Table 2. It can be seen that there is large variation across the methods. Quite obviously the two step-down methods have very small apparent test power, i.e., fewer cases were called interesting by these two procedures. Both methods identified the same number of cases as the conventional Bonferroni method, which controls the FWER.

Table 2. Number of interesting genes at cutoff levels 0.01 and 0.05

Method          Cutoff = 0.01   Cutoff = 0.05   Permutation FDR (%) for cutoff = 0.05
0. p value      799             1651            20.490
1. BH95         122             488             4.507
2. updpd        34              82              0.451
3. adapt        133             524             5.073
4. dwnind       22              37              0.138
5. dwnfree      22              37              0.138
6. qvalue       179             746             7.680
7. Bonferroni   22              37              0.138

Two cutoff significance levels, 0.01 and 0.05, are applied to each of the FDR methods (as well as the raw p value), and the count of interesting cases under each cutoff is given. The last column gives the estimate of the actual FDR (for targeted FDR = 0.05) based on 1000 permutations.

The updpd method shows a dramatic difference from the other two step-up procedures, identifying many fewer significant cases. As expected, BH95 is slightly more conservative than the adapt procedure. To assess the actual FDR controlled by each method, the last column of Table 2 provides the permutation result based on 1000 runs. Let $T = (t_0, t_1, \ldots, t_7)$ denote the corresponding cutoff t value (in absolute value) used by each FDR method. For each permutation, the proportion of cases with |t| statistics larger than the cutoff t value is recorded for every method. The actual FDR for each method is estimated by the average of these proportions over the 1000 runs. The permutation results show that BH95, adapt, and qvalue control the FDR roughly at the target level; all the other methods are too conservative. Fig. 1 gives a graphical view of the comparison among the methods. Each version of adjusted p values (and the unadjusted p values) is plotted against the ranks. Fig. 1A plots the data on the raw scale; Fig. 1B gives a close-up view of the interesting cases with small p values. Both show that, in terms of the apparent power, the ranking of the methods is Step-down distribution-free/independent < Step-up dependent < BH95 < Step-up adaptive < qvalue.
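The permutation estimate described above can be sketched as follows; `x` is a genes-by-samples matrix, `labels` a 0/1 vector marking the 27 ALL and 11 AML samples, `cutoff_t` one method's |t| threshold, and `n_reject` that method's observed number of significance calls. The function name, the use of `scipy.stats.ttest_ind`, and the reading of the recorded "proportion" as (permuted count)/(observed calls) are our assumptions for illustration.

```python
import numpy as np
from scipy import stats

def permutation_fdr(x, labels, cutoff_t, n_reject, n_perm=1000, seed=0):
    """Permutation estimate of the actual FDR for one method: permute the
    sample labels, count genes with |t| beyond the method's cutoff, divide
    by the observed number of significance calls, and average over runs."""
    rng = np.random.default_rng(seed)
    props = np.empty(n_perm)
    for b in range(n_perm):
        lab = rng.permutation(labels)
        t, _ = stats.ttest_ind(x[:, lab == 0], x[:, lab == 1], axis=1)
        props[b] = np.sum(np.abs(t) > cutoff_t) / n_reject
    return props.mean()
```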

Fig. 1. (A) Comparison of different FDR methods for the p values from the Golub data (raw scale). (B) Comparison of different FDR methods for the p values from the Golub data (log scale, a close-up view of the interesting cases). Each version of FDR as well as the unadjusted p value is plotted against the fractional rank (defined as i/m, where i is the rank of the current p value and m = 7129, the total number of tests). The straight line is the 45° diagonal line. Fracrank, fractional rank.

Example 2

The dataset is provided by Lemon et al. [12] and can be downloaded from http://www.bios.unc.edu/~fwright/fbss. In this experiment, human fibroblast cells were treated under two conditions: stimulated and starved. A third RNA sample, mixed, was produced from an equimolar (50:50) mixture of the two samples. One goal of the study was to identify the genes differentially expressed between the two conditions. The Affymetrix HuGeneFL (HU6800) array was used, and there were six technical replicates for each of the three samples (18 arrays in total). Some bacterial genes were spiked into the prepared cRNA: the stimulated samples received Lys and Phe RNAs at 0.08 ng/8 μg total RNA, the starved samples received the same amount of Dap and Thr, and the 50:50 mixture samples received all four at 0.04 ng/8 μg. The hybridization controls BioB, BioC, BioD, and Cre were added at final concentrations of 1.5, 5, 25, and 100 pM, respectively. The concentration of each type of control was constant across the three types of samples. More detailed information about the experiment is given at the above Web site maintained by Wright.

The expression signal was extracted from the CEL files using the SUM method [10]. The basic idea of the SUM method is that both PM and MM probes have specific binding, so PM and MM together ought to catch most of the targeted RNA (some RNA may bind to probes outside the pair); the sum of the PM and MM intensities can thus be a more exhaustive measure of the RNA's abundance. The nonspecific component of the sum is then handled in the same manner as in the RMA method [11].

Simply put, the model is

$$\mathrm{SUM}_{ij} = \mathrm{PM}_{ij} + \mathrm{MM}_{ij} = s_{ij} + b_{ij},$$

where i is the index for the array and j is the index for probes within the probe set, $s_{ij}$ is the signal component following an exponential distribution, and $b_{ij}$ is the background noise component following a normal distribution. Using the same idea as the RMA method, the background-adjusted SUM is given by the conditional expectation

$$\mathrm{BSUM}_{ij} = E\left(s_{ij} \mid \mathrm{SUM}_{ij}\right),$$

and the log-additive model for the expression value is

$$\log\left(\mathrm{BSUM}_{ij}\right) = \mu_i + \alpha_j + \varepsilon_{ij}.$$

The reason for using SUM is that this method has been demonstrated to perform well in precision, accuracy, and the ability to detect differential expression, among other good properties. In particular, based on the research by Huang et al. [10], the SUM method outperforms RMA in terms of the ROC profile of the control genes. For greater detail on this method, see Huang et al. [10].

First, the interest is in identifying genes that are differentially expressed between the stimulated cells and the starved cells. A simple ANOVA model is applied to the signal values of each individual gene, and the resulting p values from each pairwise comparison are adjusted using the different FDR procedures. Similar to the findings from the Golub data, the two step-down methods have very small apparent test power; each identified a number of cases similar to that from the Bonferroni method. qvalue makes the most significance calls, followed, in the given order, by the Step-up adaptive method, the Step-up one-stage method, and then the Step-up dependent method. The last column of Table 3 provides the estimate of the actual false discovery rate for each method based on 300 permutations. The permutation results show that BH95 controls FDR roughly at the target level; the FDR of the Step-up adaptive method is a bit higher than the targeted level but acceptable; the false discovery rate of the qvalue method is too high; all the other methods are very conservative. Fig. 2 plots the FDRs against the corresponding ranks. From Fig. 2A we can see that several methods can have FDR smaller than the unadjusted p value (in theory, this happens for those p values with rank greater than the estimated $m_0$). Similar to the results in Example 1, the order of the apparent power of the methods is Step-down distribution-free/independent < Step-up dependent < BH95 < Step-up adaptive < qvalue.
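The per-gene testing step described above (one equal-variance test per probe set for the stimulated vs starved comparison, followed by FDR adjustment of the pooled p values) can be sketched as below; the matrix names, shapes, and reuse of `bh95_adjust` from the earlier sketch are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def stimulated_vs_starved(stim, starv, q=0.05):
    """One t test per probe set for the stimulated vs starved comparison;
    stim and starv are probe-set-by-replicate matrices (e.g., 7129 x 6)."""
    _, p = stats.ttest_ind(stim, starv, axis=1)  # pairwise comparison p values
    adjusted = bh95_adjust(p)                    # or updpd_adjust, adapt_adjust, qvalues
    return np.sum(adjusted <= q), adjusted       # count of "interesting" genes
```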

Table 3. Number of interesting genes at cutoff levels 0.01 and 0.05

Method          Cutoff = 0.01   Cutoff = 0.05   Permutation FDR (%) for cutoff = 0.05
0. p value      2805            3838            9.1493
1. BH95         2248            3322            5.0148
2. updpd        1346            1936            0.3824
3. adapt        2623            3838            9.1493
4. dwnind       539             755             0.0022
5. dwnfree      538             752             0.0022
6. qvalue       3122            4621            18.4937
7. Bonferroni   522             722             0.0023

The number of interesting genes identified by the FDR methods from the Wright data. Two cutoff significance levels, 0.01 and 0.05, are applied to each of the FDR methods (as well as the raw p value), and the count of interesting cases under each cutoff is given. The last column gives the estimate of the actual FDR (for targeted FDR = 0.05) based on 300 permutations.

Next, the different FDR methods are applied to the test p values of the spike-in control genes exclusively. There are 12 probe sets, corresponding to Phe, Lys, Thr, and Dap, spiked in at different concentrations across the three groups (36 nonnulls), and 18 probe sets, corresponding to BioB, BioC, BioD, and Cre, spiked in at constant concentration (54 nulls). Table 4 gives the numbers of true positives and false positives at two cutoff thresholds. In terms of the number of significance calls, the same ranking of the methods is observed on the spiked-in genes; qvalue and the Step-up adaptive method show poor specificity. Note that the p values of the nulls are overall unusually small, for unknown reasons; this might be caused by intensity data preprocessing (signal extraction from the probe-level data, normalization, etc.). Due to the magnitude of the values, the lines of the standard ROC curve are hard to distinguish (graph not shown here). Fig. 3 gives a modified ROC curve on the control genes based on the different p value adjustment methods, in which the ratio of the number of true positives to the number of false positives is plotted against varying cutoff values. The plot shows the same trend revealed by Table 4. Note that no FDR method controls the FDR at the nominal level here, so the table should be read in a relative sense, comparing the FDR methods to each other rather than to the true FDR.

Fig. 3. The modified receiver operating characteristic (ROC) curve on the control genes from the Wright data.
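The modified ROC of Fig. 3 plots the ratio of true positive to false positive counts as the cutoff on the adjusted p values varies; a sketch, where `is_nonnull` is an assumed Boolean vector marking the 36 known nonnull spike-in comparisons.

```python
import numpy as np

def modified_roc(adjusted_p, is_nonnull, cutoffs):
    """Ratio of true positives to false positives at each cutoff (Fig. 3 style)."""
    ratios = []
    for c in cutoffs:
        called = adjusted_p <= c
        tp = np.sum(called & is_nonnull)
        fp = np.sum(called & ~is_nonnull)
        ratios.append(tp / fp if fp > 0 else np.inf)  # infinite ratio: no false calls
    return np.array(ratios)
```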

Discussion

We investigated six FDR methods by comparing their apparent test power on public microarray datasets. Of the six approaches investigated, the qvalue method has the highest apparent test power, followed by the Step-up adaptive method and then BH95 and the Step-up dependent method. The step-down approaches are the most conservative, giving numbers of findings similar to the Bonferroni method. In theory, the Step-up adaptive method and the Step-down independent method are suitable when the test statistics are independent; the Step-up one-stage method has been shown to be applicable in situations in which the test statistics are independent or positively correlated (and presumably the Step-up adaptive method as well); the Step-up dependent method and the Step-down distribution-free method are distribution free. Of all the methods, the Step-up one-stage method (BH95) appears to control the FDR best, close to the nominal level; the Step-up dependent, Step-down independent, and Step-down distribution-free methods are very conservative; the qvalue and Step-up adaptive methods show high apparent power, at the price of tending to inflate the false discovery rate.

Fig. 2. (A) Comparison of different FDR methods for the p values from the Wright data (raw scale). (B) Comparison of different FDR methods for the p values from the Wright data (log scale, a close-up view of the interesting cases). Each version of FDR as well as the unadjusted p value is plotted against the fractional rank (defined as i/m, where i is the rank of the current p value and m = 7129, the total number of tests). The straight line is the 45° diagonal line.

Similar to the Step-up two-stage and Step-up multiple-stage methods [3], the Step-up adaptive method is meant to be more powerful by further adjusting the p values by a factor of $\hat\pi_0 = \hat m_0/m$. These three methods are closely related to the One-stage step-up approach, but control the FDR more accurately (closer to the target FDR level). The gain in power is large when the nonnull cases are far away from the nulls. When the nonnulls are close to the nulls, the p values of the two kinds mix together, leading to a conservative estimate of $m_0$ and thus lower test power. As expected, the Step-up adaptive procedure has more findings than the classic One-stage step-up procedure, giving results similar to those of the Step-up two-stage and Step-up multiple-stage methods on these two datasets (results not shown). However, it can happen that the Step-up adaptive method identifies fewer interesting cases than the One-stage step-up approach. This phenomenon can be explained by the critical values defined in the original papers for the one-stage, two-stage, and multiple-stage approaches. For the m hypotheses $H_i$, i = 1, 2, ..., m, with corresponding p values $P_{(i)}$, these algorithms compare $P_{(i)}$ with the following critical values: the one-stage approach uses $\frac{i}{m}\, q$; the two-stage approach uses $\frac{i}{m-K}\, q^*$, where K is the estimated number of rejections. Here q and q* are the targeted FDR levels. Since the two-stage approach controls the FDR at $q^*/(1 - q^*)$, to control the FDR at the same level q as the one-stage approach we need to replace q* with $q/(1 + q)$. Thus it is clear that the two-stage and multiple-stage approaches have more test power than the one-stage approach only if

$$\frac{i}{m}\, q < \frac{i}{m-K} \cdot \frac{q}{1+q}, \qquad \text{i.e.,} \qquad K > \frac{mq}{1+q}.$$

Table 4. The number of significant calls on the spiked-in genes

FDR method      True_pos/false_pos   True_pos/false_pos
                Cutoff = 0.0001      Cutoff = 0.001
0. p value      26/3                 27/11
1. BH95         25/1                 26/5
2. updpd        25/0                 26/3
3. adapt        26/3                 27/11
4. dwnind       24/0                 25/0
5. dwnfree      24/0                 25/0
6. qvalue       26/4                 29/14
7. Bonferroni   24/0                 25/0

The specificity and sensitivity of the FDR methods on the control genes from the Wright data. The numbers of true positives and false positives among the significance calls using cutoffs 0.0001 and 0.001 are shown.

In this paper, all of the methods discussed focus on the simple case in which the p values corresponding to the nonnull hypotheses are assumed to follow the same distribution. In reality, this assumption rarely holds. For instance, differentially expressed genes may have different effect sizes, and thus their p values do not have the same distribution. More complicated modeling techniques are needed to capture the distribution of the p values when the alternative hypotheses are composite. Pounds and Morris [14] introduced the Beta-Uniform mixture model on the observed p values; it has been shown to be useful in some situations. Genovese and Wasserman [9] performed an analytical investigation of the asymptotic properties of the "deciding point" that determines the critical p value and briefly discussed two special cases with composite alternatives.


The local FDR method of Efron [7] takes an empirical Bayesian approach and obtains a robust density estimate of the observed p values (transformed as $z_i = \Phi^{-1}(p_i)$, where $\Phi$ is the CDF of a standard normal variable). The local FDR is defined as $\mathrm{fdr}(z) = f_0(z)/f(z)$. It is a more flexible approach in the sense that it does not require a parametric form for the density, and the FDR is calculated locally (as opposed to a tail-area FDR). Efron [8] gives more discussion on how to "select" the nonnull cases and how to estimate the effect sizes in situations with composite alternatives.

When t tests (or ANOVA) are used to identify differentially expressed genes, each probe set is tested separately. The hypotheses and their test statistics are treated as independent, even though this does not agree with reality. However, it is typical in many analyses of microarray data to have certain assumptions apparently "violated" (normality, equal variance, independence, etc.). Part of the interest in this research is to explore the performance of the methods without knowing the truth. It is true that different FDR methods are introduced to favor different situations. However, as discussed earlier, the true structure of microarray data is mostly unknown. From the comparison on the two chosen datasets, it is found that, in terms of the apparent power, the ranking of the six FDR methods is Step-down methods (dwnfree, dwnind) < Step-up dependent < Step-up one-stage (BH95) < Step-up adaptive < qvalue. It is our hope that the two chosen datasets share some common features of typical microarray experiments, so that the observations based on them can provide some general insight into the sensitivity of different FDR methods in typical microarray studies.

Acknowledgments

We thank Kerry Bemis for several insightful discussions. We also thank Rick Higgs and Nicholas Lewin Koh for their valuable input, and Faming Zhang for his support and review of the manuscript.

References

[1] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. B 57 (1995) 289–300.


[2] Y. Benjamini, Y. Hochberg, On the adaptive control of the false discovery rate in multiple testing with independent statistics, J. Educ. Behav. Stat. 25 (2000) 60–83.
[3] Y. Benjamini, A. Krieger, D. Yekutieli, Two staged linear step up FDR controlling procedure, Technical report, Department of Statistics and Operations Research, Tel Aviv University, and Department of Statistics, Wharton School, University of Pennsylvania, 2001, http://www.math.tau.ac.il/~ybenja/.
[4] Y. Benjamini, W. Liu, A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence, J. Stat. Plann. Inference 82 (1999) 163–170.
[5] Y. Benjamini, W. Liu, A distribution-free multiple-test procedure that controls the false discovery rate, Research paper 99-3, Department of Statistics and Operations Research, Tel Aviv University, 1999, http://www.math.tau.ac.il/~ybenja/.
[6] Y. Benjamini, D. Yekutieli, The control of the false discovery rate in multiple testing under dependency, Ann. Stat. 29 (2001) 1165–1188.
[7] B. Efron, Large-scale simultaneous hypothesis testing: the choice of a null hypothesis, JASA 99 (2004) 96–104.
[8] B. Efron, Selection and estimation for large-scale simultaneous inference, 2005 (online: http://www-stat.stanford.edu/~brad/papers/).
[9] C. Genovese, L. Wasserman, Operating characteristics and extensions of the false discovery rate procedure, J. R. Stat. Soc. B 64 (2002) 499–517.
[10] S. Huang, Y. Wang, H. Qian, P. Chen, K.G. Bemis, SUM: a new way to incorporate mismatch probe measurements, Genomics 84 (2004) 767–777.
[11] R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, T.P. Speed, Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Res. 31 (2003) e15.
[12] W.J. Lemon, J.J.T. Palatini, R. Krahe, F.A. Wright, Theoretical and experimental comparisons of gene expression indexes for oligonucleotide arrays, Bioinformatics 18 (2002) 1470–1476.
[13] P. Pavlidis, Q. Li, W.S. Noble, The effect of replication on gene expression microarray experiments, Bioinformatics 19 (2003) 1620–1627.
[14] S. Pounds, S.W. Morris, Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values, Bioinformatics 19 (2003) 1236–1242.
[15] J.D. Storey, A direct approach to false discovery rates, J. R. Stat. Soc. B 64 (2002) 479–498.
[16] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.