BIOINFORMATICS

ORIGINAL PAPER

Vol. 21 no. 15 2005, pages 3264–3272 doi:10.1093/bioinformatics/bti519

Gene expression

Practical FDR-based sample size calculations in microarray experiments

Jianhua Hu1,∗, Fei Zou2 and Fred A. Wright2

1Department of Biostatistics and Applied Mathematics, University of Texas M.D. Anderson Cancer Center, TX 77030-4009, USA and 2Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599-3260, USA

∗To whom correspondence should be addressed.

Received on January 3, 2005; revised on May 22, 2005; accepted on May 25, 2005
Advance Access publication June 2, 2005

ABSTRACT

Motivation: Owing to the experimental cost and difficulty in obtaining biological materials, it is essential to consider appropriate sample sizes in microarray studies. With the growing use of the false discovery rate (FDR) in microarray analysis, an FDR-based sample size calculation is essential.
Method: We describe an approach that explicitly connects the sample size to the FDR and to the number of differentially expressed genes to be detected. The method fits parametric models for the degree of differential expression using the Expectation–Maximization algorithm.
Results: The applicability of the method is illustrated with simulations and studies of a lung microarray dataset. We propose to use a small training set or published data from relevant biological settings to calculate the sample size of an experiment.
Availability: Code to implement the method in the statistical package R is available from the authors.
Contact: [email protected]

INTRODUCTION

cDNA and oligonucleotide microarrays have become powerful tools for the global estimation and comparison of gene expression. A main application of microarrays is the detection of genes that are differentially expressed under two (or more) conditions. This problem is more difficult than one might expect, owing to the multiplicity of tests and the attendant need to increase power by employing sensitive modeling. Early approaches used simple thresholds for ratios of expression estimates under the two conditions (Chen et al., 1997), whereas ordinary t-tests and Wilcoxon tests (Dudoit et al., 2002; Troyanskaya et al., 2002) have been used in a manner that controls the family-wise error rate (FWER) or the false discovery rate (FDR). The tests may be improved by explicitly utilizing the relationship between the mean and the variance of estimated expression [Hu and Wright, 2004 (http://www.bios.unc.edu/~fwright/TechReport/); Chen et al., 1997; Ideker et al., 2000]. Similar ideas are employed in regularized t-tests; one version involves adding a constant to the variance estimate in the t denominator (Tusher et al., 2001; Efron et al., 2001). Once a suitable test statistic is chosen, permutation can be used to obtain a suitable null distribution
for empirical testing or estimating the FDR (Tusher et al., 2001; Efron et al., 2001; Pan et al., 2001).

Many array studies have demonstrated biologically plausible results with very few arrays (e.g. Yoon et al., 2002), leading to a perception that a researcher might casually hybridize a handful of arrays in the hope of finding something meaningful. To statisticians concerned with multiple-testing issues, such a sample size might seem inherently insufficient, and permutation-based approaches may not be possible. Our view is that in some cases surprisingly few arrays may be sufficient, but the current dominance of cancer microarray research has produced an optimistic view of differential expression that may be unwarranted in other settings. Examples with greater biological subtlety will probably include the search for the downstream effects of a single gene mutation, or examining expression differences among closely related rodent strains.

As the microarray field moves beyond casual hypothesis-generating efforts, it becomes increasingly important to prospectively estimate required sample sizes prior to undertaking an experiment. Unfortunately, the literature on sample size estimation for microarrays is sparse. Using ANOVA analyses of gene expression, Black and Doerge (2002) propose a parametric approach assuming lognormal or gamma distributions for gene expression intensities. Lee and Whitmore (2002) also describe sample size calculations for the ANOVA model, under several different experimental designs. Pan et al. (2002) discuss sample size calculations using a combination of parametric and non-parametric approaches. All these methods relate power to sample size while controlling the Type I error, although Lee and Whitmore (2002) also discuss some limited connections to the FDR. While the methods in these papers are useful, it is more natural to base sample size calculations directly on the FDR, as this criterion is often used as an error bound in accounting for multiple tests.

Our approach is applicable to any array platform, assuming that a single expression estimate is available for each gene and each sample after any necessary platform-dependent pre-processing and normalization. For two-color arrays, log ratios (where one sample serves as a common reference) or estimates based on separate modeling of the color channels (Jin et al., 2001) are often used. Although we highlight the application to microarrays, the approach described here can be used for any high-throughput data involving two-sample comparisons that employ the FDR. Our method is described in the next section, followed by simulation studies and the analysis of real datasets. We conclude with some remarks and discussion.

METHODS

Notation and assumptions

Let x_{1i} (i = 1, ..., n1) and x_{2j} (j = 1, ..., n2) denote the expression levels for a single gene under two conditions, where n1 and n2 are the total numbers of arrays under conditions 1 and 2, respectively. For simplicity, we assume n1 = n2 = n. Extensions to more general experimental designs are relatively straightforward and appear in the discussion. For a single gene, we assume that the gene expression estimates (perhaps after suitable transformation) are normally distributed within each condition. A growing literature supports this assumption for estimates derived from two-color arrays (Chen et al., 1997) and oligonucleotide arrays (Giles and Kipling, 2003). We assume there is a total of m genes represented on the arrays. We wish to identify the genes whose expression differs under the two conditions.

Let μ1 and μ2 denote the true mean expression for the gene under conditions 1 and 2, and σ1² and σ2² the corresponding variances. For testing purposes, the hypotheses are H0: μ1 = μ2 versus H1: μ1 ≠ μ2, and we use the statistic

    T = \frac{\sqrt{n}\,(\bar{x}_1 - \bar{x}_2)}{\sqrt{s_1^2 + s_2^2}},

where \bar{x}_1 = \sum_{i=1}^{n} x_{1i}/n, \bar{x}_2 = \sum_{j=1}^{n} x_{2j}/n, s_1^2 = \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2/(n-1) and s_2^2 = \sum_{j=1}^{n} (x_{2j} - \bar{x}_2)^2/(n-1). The quantity

    \delta = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}

governs the power to detect departures from H0 (Neter et al., 1985; Cohen, 1988), and has a biological interpretation as a scaled expression difference. We can now restate the hypotheses as H0: δ = 0 versus H1: δ ≠ 0. T then has an approximate t density with 2n − 2 degrees of freedom and non-centrality parameter \sqrt{n}\,\delta (Welch, 1947). T is exactly t-distributed if σ1² = σ2², and we implicitly use this assumption in model-fitting. Numerical illustrations given further below examine the effect of departures from the assumption. We denote the non-central t density as g(t; \sqrt{n}\,\delta), with cumulative distribution function (CDF) G(t; \sqrt{n}\,\delta) (Johnson et al., 1994).

With thousands of genes assayed in a typical microarray experiment, we treat the δs as realizations from a common CDF, F, for a random variable Δ. The distributions of T and the derived P-values depend entirely on F. Because the sign of δ has biological importance, we decompose F as the mixture

    F(\delta) = \pi_0 F_0(\delta) + \pi_1 F_1(\delta) + \pi_2 F_2(\delta).    (1)

Here π0, π1 and π2 are the probabilities that Δ is 0, positive or negative, respectively, with π0 + π1 + π2 = 1. F0(δ) = I(δ ≥ 0) is the CDF of the random variable with a point mass at 0, whereas F1 and F2 are the conditional CDFs for positive and negative Δ. Equation (1) encompasses a large number of situations of interest, where F1 and F2 can be discrete or continuous. Finally, we use f1 and f2 to denote the probability density functions corresponding to F1 and F2.

We consider three situations for the remainder of the paper: (1) the discrete model assumes that F is discrete with point masses at 0 and at constants a1 (positive) and a2 (negative); (2) the exponential mixture model assumes a point mass at 0 and exponential densities for positive and negative δ, i.e. f1(δ) = λ1 exp(−λ1 δ) and f2(δ) = λ2 exp(λ2 δ); (3) the normal mixture model assumes a point mass at 0 and two normal densities truncated at 0:

    f_1(\delta) = \frac{(1/\sqrt{2\pi\xi_1^2})\,\exp\{-(\delta-\upsilon_1)^2/2\xi_1^2\}}{\int_0^{+\infty} (1/\sqrt{2\pi\xi_1^2})\,\exp\{-(\delta-\upsilon_1)^2/2\xi_1^2\}\,\mathrm{d}\delta}

and

    f_2(\delta) = \frac{(1/\sqrt{2\pi\xi_2^2})\,\exp\{-(\delta-\upsilon_2)^2/2\xi_2^2\}}{\int_{-\infty}^{0} (1/\sqrt{2\pi\xi_2^2})\,\exp\{-(\delta-\upsilon_2)^2/2\xi_2^2\}\,\mathrm{d}\delta}.

We believe these simple models cover many situations of interest to biological researchers, and in contrast to other descriptions of differential expression (Parmigiani et al., 2002), models (2) and (3) expand the specification of H1 to include varying degrees of alternatives. As might be expected for situations in which only those genes with extreme statistics are rejected, the tail behavior of F is important in determining the appropriate sample size. Thus one may view the discrete, exponential and normal models as representing distributions with comparatively short, heavy and medium tails, respectively.
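
For concreteness, the component densities of the normal mixture model (3), and the exponential model (2) for comparison, can be written as short R functions. This is an illustrative sketch with argument names of our own choosing (nu1, xi1sq, lambda1, ...), not the authors' released code:

    # Truncated normal component densities for mixture model (3).
    # f1: density of positive delta, N(nu1, xi1sq) truncated to (0, Inf)
    # f2: density of negative delta, N(nu2, xi2sq) truncated to (-Inf, 0)
    f1 <- function(delta, nu1, xi1sq) {
      ifelse(delta > 0,
             dnorm(delta, mean = nu1, sd = sqrt(xi1sq)) /
               pnorm(0, mean = nu1, sd = sqrt(xi1sq), lower.tail = FALSE),
             0)
    }
    f2 <- function(delta, nu2, xi2sq) {
      ifelse(delta < 0,
             dnorm(delta, mean = nu2, sd = sqrt(xi2sq)) /
               pnorm(0, mean = nu2, sd = sqrt(xi2sq)),
             0)
    }
    # Exponential components of model (2), for comparison
    f1.exp <- function(delta, lambda1) ifelse(delta > 0, lambda1 * exp(-lambda1 * delta), 0)
    f2.exp <- function(delta, lambda2) ifelse(delta < 0, lambda2 * exp( lambda2 * delta), 0)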

Statistical models

We let P be the random P-value for a gene. The required number of arrays is related to the distribution of P as shown below; thus the key to our method is to estimate the distribution of P, from which the sample size can be computed. Suppose Δ has the CDF given in Equation (1). Conditional on a given δ, the CDF of P at a specific p0 is

    \Pr(P < p_0 \mid \delta) = \Pr(T > t_{p_0/2} \mid \delta) + \Pr(T < -t_{p_0/2} \mid \delta)
                             = 1 - G(t_{p_0/2}; \sqrt{n}\,\delta) + G(-t_{p_0/2}; \sqrt{n}\,\delta),    (2)
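
As an illustration, Equation (2) can be evaluated directly with R's non-central t distribution functions. The following is a minimal sketch (the function name rejprob is ours, not from the authors' released code) and is reused in later sketches:

    # Pr(P < p0 | delta): probability that a gene with effect size delta is
    # rejected at two-sided per-test threshold p0, with n arrays per condition.
    rejprob <- function(delta, n, p0) {
      df   <- 2 * n - 2
      tcut <- qt(1 - p0 / 2, df)              # t_{p0/2}, the (1 - p0/2) quantile of the central t
      ncp  <- sqrt(n) * delta                 # non-centrality parameter
      pt(tcut, df, ncp = ncp, lower.tail = FALSE) + pt(-tcut, df, ncp = ncp)
    }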

where t_α is the 1 − α quantile of the central t density with 2n − 2 degrees of freedom. We will use p0 as a threshold for rejecting H0, and thus it is the Type I error rate for a single test. The marginal distribution of P, in contrast, reflects the mixture of varying alternatives:

    \Pr(P < p_0) = \int_{\delta} \Pr(P < p_0 \mid \delta)\, dF(\delta)
                 = \pi_0 p_0 + \pi_1 \int_{\delta > 0} \Pr(P < p_0 \mid \delta)\, dF_1(\delta) + \pi_2 \int_{\delta < 0} \Pr(P < p_0 \mid \delta)\, dF_2(\delta).    (3)
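
Combining the sketches above, the marginal probability in Equation (3), and hence the expected number of rejections E(R) = m Pr(P < p0), can be evaluated by numerical integration. A minimal sketch under the fitted normal mixture, reusing our rejprob(), f1() and f2() functions (all names are ours):

    # Marginal Pr(P < p0) under the normal mixture model; E(R) = m * Pr(P < p0).
    marg.rejprob <- function(p0, n, pi0, pi1, pi2, nu1, xi1sq, nu2, xi2sq) {
      pos <- integrate(function(d) rejprob(d, n, p0) * f1(d, nu1, xi1sq),  0, Inf)$value
      neg <- integrate(function(d) rejprob(d, n, p0) * f2(d, nu2, xi2sq), -Inf, 0)$value
      pi0 * p0 + pi1 * pos + pi2 * neg
    }
    # Example: expected rejections among m genes at threshold p0 with n arrays per condition
    # m * marg.rejprob(p0, n, pi0, pi1, pi2, nu1, xi1sq, nu2, xi2sq)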

Under the mixture model, the positive FDR (pFDR; Storey, 2003) associated with the rejection threshold p0 is

    \mathrm{pFDR} = \frac{\pi_0 p_0}{\Pr(P < p_0)},    (4)

and the expected number of rejected genes among the m tests is E(R) = m Pr(P < p0).

A specific choice of p0 is used for the illustration in Table 1; the derivations apply to any p0. Later examples demonstrate sample size calculations when a fixed number of genes is rejected and a pFDR is specified. With the number of arrays per condition set to n = 10, the expected number of rejected genes E(R) and the pFDR are exhibited in Table 1. For much of the range of a, the continuous model has a lower pFDR because of a small but often-rejected portion of genes with extreme δ under the exponential mixture model. One interesting observation is that the two E(R) functions cross, so that the discrete model rejects fewer genes than the exponential mixture model for small a, but more genes for large a. It is difficult to draw general lessons about the conservativeness [in terms of pFDR for a given E(R)] of the competing models, so simulation or numeric integration is essential to evaluate the relationship among the relevant quantities.

By combining Equations (3) and (4), we can find the relationship among n, E(R) and the pFDR by solving

    \frac{E(R)}{m}(1 - \mathrm{pFDR}) - \pi_1 \int_{\delta > 0} \Pr\!\left(P < \frac{E(R)\,\mathrm{pFDR}}{m\pi_0} \,\Big|\, \delta\right) dF_1(\delta) - \pi_2 \int_{\delta < 0} \Pr\!\left(P < \frac{E(R)\,\mathrm{pFDR}}{m\pi_0} \,\Big|\, \delta\right) dF_2(\delta) = 0.

The mixture F itself is estimated from the observed t-statistics by fitting each parametric model with the Expectation–Maximization (EM) algorithm (Dempster et al., 1977), treating the δi as unobserved. The full log-likelihood for the complete data is

    \log L(\theta \mid t, \delta) = \sum_{i=1}^{m} \Big[ I(\delta_i = 0)\log\{\pi_0 g(t_i; 0)\} + I(\delta_i > 0)\log\{\pi_1 f_1(\delta_i)\, g(t_i; \sqrt{n}\,\delta_i)\} + I(\delta_i < 0)\log\{\pi_2 f_2(\delta_i)\, g(t_i; \sqrt{n}\,\delta_i)\} \Big].    (8)

Superscript k is used to signify the parameter estimates at the current EM step, and F1^k, F2^k the corresponding CDFs. The E-step updates the expected values of unobserved quantities in Equation (8), and we let

    h^{k+1}(t_i) = \pi_0^k g(t_i; 0) + \pi_1^k \int_0^{\infty} g(t_i; \sqrt{n}\,\delta_i)\, dF_1^k(\delta_i) + \pi_2^k \int_{-\infty}^{0} g(t_i; \sqrt{n}\,\delta_i)\, dF_2^k(\delta_i).

The following expectations will be required:

    E\{I(\delta_i = 0) \mid t_i, \theta^k\} = \frac{\pi_0^k g(t_i; 0)}{h^{k+1}(t_i)}

    E\{I(\delta_i > 0) \mid t_i, \theta^k\} = \frac{\pi_1^k \int_0^{\infty} g(t_i; \sqrt{n}\,\delta_i)\, dF_1^k(\delta_i)}{h^{k+1}(t_i)}

    E\{I(\delta_i < 0) \mid t_i, \theta^k\} = 1 - E\{I(\delta_i = 0) \mid t_i, \theta^k\} - E\{I(\delta_i > 0) \mid t_i, \theta^k\}

    E\{\delta_i I(\delta_i > 0) \mid t_i, \theta^k\} = \frac{\pi_1^k \int_0^{\infty} \delta_i\, g(t_i; \sqrt{n}\,\delta_i)\, dF_1^k(\delta_i)}{h^{k+1}(t_i)}

    E\{\delta_i I(\delta_i < 0) \mid t_i, \theta^k\} = \frac{\pi_2^k \int_{-\infty}^{0} \delta_i\, g(t_i; \sqrt{n}\,\delta_i)\, dF_2^k(\delta_i)}{h^{k+1}(t_i)}

    E\{\delta_i^2 I(\delta_i > 0) \mid t_i, \theta^k\} = \frac{\pi_1^k \int_0^{\infty} \delta_i^2\, g(t_i; \sqrt{n}\,\delta_i)\, dF_1^k(\delta_i)}{h^{k+1}(t_i)}

    E\{\delta_i^2 I(\delta_i < 0) \mid t_i, \theta^k\} = \frac{\pi_2^k \int_{-\infty}^{0} \delta_i^2\, g(t_i; \sqrt{n}\,\delta_i)\, dF_2^k(\delta_i)}{h^{k+1}(t_i)}.
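
For the normal mixture model these conditional expectations have no closed form. The following minimal sketch evaluates them for a single statistic ti by numerical integration, reusing the f1() and f2() sketches above (all helper names are ours, not the authors' released code):

    # E-step quantities for one observed statistic ti under the normal mixture,
    # at current estimates theta = list(pi0, pi1, pi2, nu1, xi1sq, nu2, xi2sq).
    estep.one <- function(ti, n, theta) {
      df <- 2 * n - 2
      g  <- function(d) dt(ti, df, ncp = sqrt(n) * d)   # non-central t density g(ti; sqrt(n) delta)
      I1 <- function(w) integrate(function(d) w(d) * g(d) * f1(d, theta$nu1, theta$xi1sq),  0, Inf)$value
      I2 <- function(w) integrate(function(d) w(d) * g(d) * f2(d, theta$nu2, theta$xi2sq), -Inf, 0)$value
      a0 <- theta$pi0 * dt(ti, df)                      # pi0^k g(ti; 0)
      a1 <- theta$pi1 * I1(function(d) 1)
      a2 <- theta$pi2 * I2(function(d) 1)
      h  <- a0 + a1 + a2                                # h^{k+1}(ti)
      list(p.null = a0 / h,                             # E{I(delta_i = 0) | ti}
           p.pos  = a1 / h,                             # E{I(delta_i > 0) | ti}
           p.neg  = a2 / h,                             # E{I(delta_i < 0) | ti}
           m1.pos = theta$pi1 * I1(function(d) d)   / h,  # E{delta_i I(delta_i > 0) | ti}
           m1.neg = theta$pi2 * I2(function(d) d)   / h,  # E{delta_i I(delta_i < 0) | ti}
           m2.pos = theta$pi1 * I1(function(d) d^2) / h,  # E{delta_i^2 I(delta_i > 0) | ti}
           m2.neg = theta$pi2 * I2(function(d) d^2) / h)  # E{delta_i^2 I(delta_i < 0) | ti}
    }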

The maximization (M) step provides updated estimates θ^{k+1}. For the probability mass estimates, we have

    \pi_0^{k+1} = \frac{1}{m}\sum_{i=1}^{m} E\{I(\delta_i = 0) \mid t_i, \theta^k\}

    \pi_1^{k+1} = \frac{1}{m}\sum_{i=1}^{m} E\{I(\delta_i > 0) \mid t_i, \theta^k\}

    \pi_2^{k+1} = 1 - \pi_0^{k+1} - \pi_1^{k+1}.

Solutions for the remaining parameters η1, η2 are specific to the model. Despite the simplicity of the discrete model, no closed form is available to update a1, a2. For this and the continuous models, numerical integration and maximization of the expected log-likelihood were performed using functions in R. For the exponential mixture model, updates for the F1 and F2 parameters can be represented by

    \lambda_1^{k+1} = \frac{\sum_{i=1}^{m} E\{I(\delta_i > 0) \mid t_i, \theta^k\}}{\sum_{i=1}^{m} E\{\delta_i I(\delta_i > 0) \mid t_i, \theta^k\}}

    \lambda_2^{k+1} = -\,\frac{\sum_{i=1}^{m} E\{I(\delta_i < 0) \mid t_i, \theta^k\}}{\sum_{i=1}^{m} E\{\delta_i I(\delta_i < 0) \mid t_i, \theta^k\}}.
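
Once a model for F has been fitted, the sample size calculation reduces to a search over n. The following sketch (ours, not the authors' released code) finds the smallest n for which at least R0 genes are expected to be rejected at a pFDR of at most q, using the relation p0 = E(R) pFDR/(mπ0) implied by Equations (3) and (4) and the marg.rejprob() sketch above:

    # Smallest n (arrays per condition) for which >= R0 rejections are expected
    # while keeping the pFDR at or below q, under the fitted normal mixture.
    sample.size <- function(R0, q, m, pi0, pi1, pi2, nu1, xi1sq, nu2, xi2sq, nmax = 100) {
      p0 <- R0 * q / (m * pi0)           # threshold implied by the target E(R) and pFDR
      for (n in 3:nmax) {
        ER <- m * marg.rejprob(p0, n, pi0, pi1, pi2, nu1, xi1sq, nu2, xi2sq)
        # If E(R) >= R0 at this p0, then pFDR = pi0 * p0 * m / E(R) = R0 * q / E(R) <= q
        if (ER >= R0) return(n)
      }
      NA                                  # not achievable within nmax arrays per condition
    }
    # e.g. sample.size(R0 = 30, q = 0.05, m = 12488, pi0 = 0.848, pi1 = 0.119,
    #                  pi2 = 0.033, nu1 = 0.346, xi1sq = 0.048, nu2 = -0.344, xi2sq = 0.068)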

EM ALGORITHM RESULTS FOR SIMULATED AND REAL DATASETS

Simulation studies

We implemented the normal mixture model in simulations to test the EM algorithm and to highlight any difficulties in handling the greater number of parameters in the model, because the real dataset (discussed later) favored this model. The number of genes per array was set to m = 10 000, with π0 = 0.8 and π1 = π2 = 0.1. There were n = 5 arrays under each condition. The distribution of Δ was assumed to follow the normal mixture model with υ1 = 1, ξ1² = 1.25 and υ2 = −1, ξ2² = 1.25.

To generate data with the appropriate characteristics, for each gene i we needed to simulate a δi from Δ and then generate an expression profile for the gene consistent with that effect size. We implemented the approach described by Hu and Wright (2004), based on the analysis and modeling of four Affymetrix datasets. The vector of μ1 values was generated as independent χ²_5 observations multiplied by 1000. For fixed β0 = −4.6 and β1 = 1.7, σ1² was obtained from the mean–variance model log(σ1²) = β0 + β1 log(μ1) for each gene. Once μ1 and σ1² were chosen, the remaining vectors were obtained in one of two ways: (1) σ2² was set equal to σ1², and then μ2 was chosen to accord with the choice of δ for each gene (the equal variance scenario); (2) μ2 and σ2² were chosen to satisfy both the choice of δ for the gene and the mean–variance model for condition 2, log(σ2²) = β0 + β1 log(μ2) (the unequal variance scenario). The unequal variance scenario is more realistic, and the simulations allow us to examine the effect of this modest departure from the assumed model when estimating the pFDR and sample sizes.

We also explored the effect of correlation among sets of genes on the array by introducing correlation among 'blocks' of genes. Within a block of B genes, we wanted the correlation coefficient between all pairs of genes to be ρ. We assumed that all genes within the block had the same μ1, μ2, σ1² and σ2², and therefore the same δ. Within condition k (k = 1, 2), we generated a prototypical gene expression 'profile' w_{k1}, w_{k2}, ..., w_{kn} as iid N(μk, ρσk²). Then for gene i, i = 1, ..., B, we let x_{kij} = w_{kj} + ε_{kij}, where the ε_{kij} are iid N(0, σ_ε²) with σ_ε² = (1 − ρ)σk². Thus w served as a latent expression profile used to produce the x values. It is easy to confirm that, within a block and within each condition, the expression values for different genes have correlation ρ and the appropriate means and variances.
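
A minimal R sketch of this latent-profile construction, for a single block of B genes in one condition (the function and argument names are ours):

    # Generate one block of B correlated genes (pairwise correlation rho) for one
    # condition with mean mu and variance sigma2, across n arrays.
    sim.block <- function(B, n, mu, sigma2, rho) {
      w   <- rnorm(n, mean = mu, sd = sqrt(rho * sigma2))          # latent profile w_1, ..., w_n
      eps <- matrix(rnorm(B * n, sd = sqrt((1 - rho) * sigma2)), nrow = B)
      sweep(eps, 2, w, "+")                                        # x_ij = w_j + eps_ij, a B x n matrix
    }
    # Each row is one gene; within the block, genes have pairwise correlation rho,
    # and each entry has mean mu and variance rho*sigma2 + (1 - rho)*sigma2 = sigma2.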


Fig. 2. pFDR results in the simulation depicted in Figure 1. Use of the correctly identified normal mixture model gives pFDR estimates that are closest to the true values.

Fig. 1. Model-fit results (Q–Q plots) in a simulation study (n = 5, m = 10 000), in which the truncated normal model is correctly identified (right panels). The labels (discrete, exponential, normal) refer to the assumed F used in model-fitting.

Figure 1 shows the model fits of the three different distributions when the true distribution is truncated normal as described above. The top row shows the results for the equal variance, no correlation scenario. This scenario corresponds to the situation where the model assumptions hold exactly, and the fit is indicated by quantile–quantile (Q–Q) plots of the observed 10 000 t-statistics versus 100 000 simulations of T using the fitted model. The bottom row shows the fits for the situation where two aspects of the model assumptions do not hold: the variances are unequal and genes are correlated (ρ = 0.5) in blocks of size B = 50. Under both scenarios, the correct truncated normal model fits best, especially in the extreme tails that typically form rejection regions. This result is confirmed by Pearson correlation coefficients of the Q–Q plots and Kolmogorov–Smirnov statistics for the simulated versus observed values. For the two other intermediate scenarios (equal variance with correlation, unequal variance without correlation), the truncated normal model is also correctly identified (data not shown).

For the two scenarios considered in Figure 1, we used the fitted F distributions for each of the three model types to plot the theoretical pFDRs at various E(R) values (Fig. 2). These fitted values can be compared with the true pFDR computed using the known parameters (bold curve). Not surprisingly, the curve under the estimated normal mixture model is closest to the true model under both scenarios.

Another comparison is provided by an empirical estimate of the pFDR, created by comparing the observed t-statistics with those generated under 252 exhaustively permuted (therefore null) assignments of arrays to the two conditions. Essentially this is the approach implemented in the popular SAM software (Tusher et al., 2001; Storey and Tibshirani, 2003). We consider the t-statistics arising under different permutations to be exchangeable (Reiner et al., 2003), which leads to the use of the mean number of rejected genes across permutations. This differs slightly from the SAM procedure, which uses the median number of rejected genes across null permutations (Chu et al., 2001). The resulting empirical pFDR estimate is shown as the thick dashed line (Fig. 2). The jagged appearance of the empirical pFDR is due to the finite number of genes in the observed dataset. Note that the empirical pFDR is quite far from the pFDR obtained under the true model. This phenomenon appears to result largely from an overestimate of π0, ∼0.95 in both cases versus the true value 0.8, which has a direct effect on the estimated pFDR and leads to a less-extreme estimated rejection threshold, further increasing the apparent pFDR. We expect that this may often occur in situations where Δ follows a continuous distribution, because many genes with small δ may appear to be effectively null. From Figure 2 and the other intermediate scenarios (equal variance with correlation, unequal variance without correlation; data not shown), we note that neither the variance structure nor the block correlation structure among genes has a great impact on the pFDR estimates.
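
For reference, a minimal sketch of such a permutation-based empirical pFDR estimate (using random relabellings rather than the exhaustive 252 permutations, and with helper names of our own choosing):

    # Two-sample statistic T for each row (gene) of an m x 2n expression matrix x;
    # grp is a logical vector, TRUE for condition 1 (n1 = n2 = n assumed).
    tstat <- function(x, grp) {
      n1 <- sum(grp)
      m1 <- rowMeans(x[, grp]);  m2 <- rowMeans(x[, !grp])
      s1 <- apply(x[, grp], 1, var); s2 <- apply(x[, !grp], 1, var)
      sqrt(n1) * (m1 - m2) / sqrt(s1 + s2)
    }
    # Empirical pFDR at cutoff tcut: pi0 times the mean number of null rejections
    # across permutations, divided by the observed number of rejections.
    perm.pfdr <- function(x, grp, tcut, nperm = 252, pi0 = 1) {
      obs <- sum(abs(tstat(x, grp)) >= tcut)
      nullcounts <- replicate(nperm, {
        g <- sample(grp)                   # permuted (null) assignment of arrays
        sum(abs(tstat(x, g)) >= tcut)
      })
      pi0 * mean(nullcounts) / max(obs, 1)
    }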

A real dataset

We applied the EM algorithm for the three parametric models to a murine lung microarray dataset submitted to Gene Expression Omnibus (E. Hoffman, Children's National Medical Center, GDS251). The study used the Affymetrix U74Av2 array (12 488 probesets) and compared samples from the C57BL6/J (n = 12) and Balb/c (n = 12) strains, which differ in sensitivity to pulmonary fibrosis. The study had additional factors balanced over the strains, but we applied the simple two-sample comparison to illustrate our approach. The two-sample t-statistics for all the genes were computed, and the three models discussed above were fitted separately. The parameter estimates for the discrete model were π̂0 = 0.955, π̂1 = 0.040, π̂2 = 0.005, â1 = 0.972 and â2 = −1.092. The exponential mixture model estimates were π̂0 = 0.681, π̂1 = 0.264, π̂2 = 0.055, λ̂1 = 3.516 and λ̂2 = 3.122. For the truncated normal mixture model, π̂0 = 0.848, π̂1 = 0.119, π̂2 = 0.033, υ̂1 = 0.346, υ̂2 = −0.344, ξ̂1² = 0.048 and ξ̂2² = 0.068.

The π0 estimates varied among the models. However, we note that the continuous F1 and F2 models allow mass near zero, for which the genes are 'effectively' null. Accordingly, and as suggested earlier, the means and standard deviations of the corresponding Δ distributions are fairly comparable. For example, the fit to the discrete model gives E(Δ) = −0.033 and SD(Δ) = 0.207, while for the exponential mixture model the corresponding values are 0.058 and 0.227, and for the normal mixture model 0.031 and 0.157.

Although the true form of F is unknown, an indication of model fit is given by Q–Q plots of the observed 12 488 t-statistics versus 100 000 simulations of T (first three panels in Fig. 3). It is clear that the normal mixture F model offers the best fit, especially in the extreme tails (with the possible exception of the two most extreme genes). As a simple illustration of the effectiveness of the model-fitting procedure, we performed 1000 simulations of 12 488 t-statistics using the normal mixture model, fixing the parameters at the values estimated from the murine pulmonary dataset. The true values were compared with the observed means ± SD as follows: π0 = 0.848 (0.851 ± 0.005); π1 = 0.119 (0.117 ± 0.002); υ1 = 0.346 (0.367 ± 0.012); υ2 = −0.344 (−0.398 ± 0.025); ξ1² = 0.048 (0.045 ± 0.002); and ξ2² = 0.068 (0.064 ± 0.003). This amount of variation in the estimates has only a minimal effect on the sample size computation.

Figure 3 (lower right) shows the sample sizes needed to reject varying numbers of genes at a variety of pFDR values, computed using numeric integration. pFDRs for given sample sizes along a range of expected numbers of rejected genes are exhibited in Table 3, along with the expected number of rejected genes for given sample sizes at a variety of pFDR values. For this dataset, rejection of many genes requires many more arrays than were run in the analyzed dataset. For example, while n = 14 arrays in each condition would be required to reject 30 genes while controlling the pFDR at 0.05, n = 26 would be required to control the pFDR at 0.05 for 300 rejected genes (Fig. 3 and Table 3). Similar tables and curves will be of use to practicing researchers who wish to balance sample size against experimental cost.


Fig. 3. Model-fit and sample size results for the murine pulmonary data. Q–Q plots show that the normal mixture model fits the best among the models considered. Sample size requirements, along with the number of genes rejected, are shown in the lower right panel.

R computer code to fit the three parametric models, and to estimate sample sizes and pFDR values, is available from the authors.
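
Curves and tables such as those in Figure 3 and Table 3 can be traced by applying a sample-size search over a grid of E(R) and pFDR values. A brief sketch using the functions outlined earlier and the normal mixture estimates reported above (a slow but direct calculation; all function names are ours):

    # Required n for each combination of target E(R) and pFDR, murine pulmonary fit.
    fit <- list(pi0 = 0.848, pi1 = 0.119, pi2 = 0.033,
                nu1 = 0.346, xi1sq = 0.048, nu2 = -0.344, xi2sq = 0.068)
    ER.grid   <- c(20, 30, 40, 50, 100, 150, 200, 250, 300)
    pfdr.grid <- c(0.01, 0.025, 0.05, 0.1)
    nreq <- outer(ER.grid, pfdr.grid, Vectorize(function(R0, q)
      do.call(sample.size, c(list(R0 = R0, q = q, m = 12488), fit))))
    dimnames(nreq) <- list(E.R = as.character(ER.grid), pFDR = as.character(pfdr.grid))
    nreq   # matrix of required arrays per condition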

CONCLUDING REMARKS

We have described an approach to estimate the sample sizes necessary to control the FDR in microarray experiments. By explicitly modeling the degree of differential expression under the alternative hypothesis, we expand the framework of the FDR. This extension is necessary because the specific tail behavior of Δ under the alternative is important in determining the pFDR. Although we illustrated our method using a few simple parametric models for Δ, a larger and more flexible range of parameterizations might be explored. Alternatively, empirical approaches might be used, in which the t-statistics are appropriately shrunk toward a common mean to provide an estimate of F. Other extensions to our approach will account for more powerful statistics and a wider range of experimental designs. The essence of the approach, however, will remain unchanged, as the FDR will depend on the distribution F of the unknown Δ, which can be estimated from the observed T values.


Table 3. pFDR values and numbers of rejected genes for the murine pulmonary data

          pFDR at E(R) =                                                          E(R) at pFDR =
 n     20     30     40     50    100    150    200    250    300       0.01  0.025   0.05    0.1
 8  0.245  0.269  0.288  0.304  0.361  0.399  0.428  0.452  0.472          0      0      0      0
10  0.134  0.156  0.175  0.191  0.251  0.294  0.329  0.357  0.382          0      0      0      8
12  0.070  0.087  0.101  0.115  0.169  0.212  0.247  0.278  0.306          0      2     10     39
14  0.036  0.047  0.057  0.067  0.111  0.149  0.183  0.213  0.241          2     11     33     87
16  0.018  0.025  0.032  0.038  0.071  0.102  0.133  0.161  0.189          9     30     67    145
18  0.009  0.013  0.017  0.022  0.045  0.070  0.095  0.121  0.146         22     57    110    209
20  0.004  0.007  0.009  0.012  0.028  0.047  0.067  0.089  0.111         42     91    158    275
22  0.002  0.004  0.005  0.007  0.017  0.031  0.047  0.065  0.084         66    129    209    339
24  0.001  0.002  0.003  0.004  0.011  0.020  0.033  0.047  0.063         95    170    260    401
26  0.001  0.001  0.002  0.002  0.007  0.013  0.022  0.034  0.047        127    212    310    461
28  0      0.001  0.001  0.001  0.004  0.009  0.015  0.024  0.035        160    255    360    518
30  0      0      0      0.001  0.002  0.006  0.011  0.017  0.025        195    297    408    571

The left columns give the pFDR as a function of the sample size n and E(R); the right columns give E(R) as a function of n and the pFDR.

Extensions of our two-sample approach to more general linear model situations (e.g. ANOVA and regression) require a more general specification of δ, such as the ratio of explained to residual variance in the linear model. In principle, such an approach is not difficult, but it again requires that the sample size be the same under each condition, or at least that the sample size allocation across conditions be preserved from the training set to the anticipated larger study. We do not view the allocation restriction as a major drawback, as designs with (nearly) equal numbers of arrays in each condition are common, and sensitivity analysis can be used for reassurance in cases of modest departure from equal allocation.

In practice we may wish to develop sample size estimates by using a similar pilot study, or published data from biological settings that are different but considered similar enough to the proposed experiment to provide meaningful predictions. Here our parametric estimate of F from the published data provides a compact summary of the distribution of effect sizes that might easily be carried over to different settings, or possibly to different array platforms. By computing the pFDR under a number of parametric models, it is also straightforward to be conservative in the sample size calculation. For example, we might base our sample size on conservative pFDR estimates from Figure 2, using the greatest pFDR among the three models for each E(R).

Finally, we note that a limitation on the sample size choice can occur when the proportion of truly differentially expressed genes is very small (π0 very close to 1), so that for a fixed p0, E(R) reaches a limit with increasing n. Essentially this results from an inequality implicit in Equation (4):

    \mathrm{pFDR} \ge \frac{\pi_0 p_0}{\pi_0 p_0 + (1 - \pi_0)}.    (9)
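
Assuming, as above, that the pFDR at threshold p0 equals π0 p0 / Pr(P < p0), the bound follows in one line from Equation (3), because Pr(P < p0 | δ) ≤ 1 for every δ:

    \mathrm{pFDR}(p_0) = \frac{\pi_0 p_0}{\Pr(P < p_0)}
                     \ge \frac{\pi_0 p_0}{\pi_0 p_0 + \pi_1 + \pi_2}
                       = \frac{\pi_0 p_0}{\pi_0 p_0 + (1 - \pi_0)}.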

This phenomenon did not occur in any of our examples, and it presents a problem only when π̂0 is very close to 1, because for fixed p0 some null genes are always expected to be rejected.

ACKNOWLEDGEMENTS

The authors would like to thank the editors and reviewers for helpful comments that strengthened the manuscript. This work is supported in part by NIH grant 3 P30 HD003110.

Conflict of Interest: none declared.

REFERENCES

Black,M.A. and Doerge,R.W. (2002) Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics, 18, 1609–1616.
Chen,Y. et al. (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 2, 364–367.
Chu,G., Narasimhan,B., Tibshirani,R. and Tusher,V. (2001) SAM, "Significance Analysis of Microarrays": Users Guide and Technical Document.
Cohen,J. (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Erlbaum, Hillsdale, NJ.
Dempster,A.P. et al. (1977) Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B, 39, 1–38.
Dudoit,S. et al. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sinica, 12, 111–139.
Efron,B. et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151–1160.
Giles,P. and Kipling,D. (2003) Normality of oligonucleotide microarray data and implications for parametric statistical analyses. Bioinformatics, 19, 2254–2262.
Hu,J. and Wright,F.A. (2004) Assessing differential gene expression with small sample sizes in oligonucleotide arrays using a mean–variance model. Submitted.
Ideker,T. et al. (2000) Testing for differentially expressed genes by maximum likelihood analysis of microarray data. J. Comput. Biol., 7, 805–817.
Jin,W. et al. (2001) The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster. Nat. Genet., 29, 389–395.
Johnson,N.L., Kotz,S. and Balakrishnan,N. (1994) Continuous Univariate Distributions, 2nd edn. Wiley, NY.
Lee,M.T. and Whitmore,G.A. (2002) Power and sample size for DNA microarray studies. Stat. Med., 21, 3543–3570.
Liao,J.G. et al. (2004) A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics, 20, 2694–2701.
Neter,J., Wasserman,W. and Kutner,M.H. (1985) Applied Linear Statistical Models, 2nd edn. Irwin, Homewood, IL.
Pan,W. et al. (2001) A mixture model approach to detecting differentially expressed genes in replicated microarray experiments. Funct. Integr. Genomics, 3, 117–124.
Pan,W. et al. (2002) How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol., 3, research0022.1–0022.10.
Parmigiani,G. et al. (2002) A statistical framework for expression-based molecular classification in cancer. J. R. Stat. Soc. Ser. B, 64, 717–736.
Reiner,A. et al. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics, 19, 368–375.
Storey,J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Stat., 31, 2013–2035.
Storey,J.D. and Tibshirani,R. (2003) The Analysis of Gene Expression Data: Methods and Software. Springer, NY.
Troyanskaya,O.G. et al. (2002) Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics, 18, 1454–1461.
Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121.
Welch,B.L. (1947) The generalization of 'Student's' problem when several different population variances are involved. Biometrika, 34, 28–35.
Yoon,H. et al. (2002) Gene expression profiling of isogenic cells with different TP53 gene dosage reveals numerous genes that are affected by TP53 dosage and identifies CSPG2 as a direct target of p53. Proc. Natl Acad. Sci. USA, 99, 15632–15637.