Supplement - Biometric Research Program

Supplementary Material for: Sample Size Determination in Microarray Experiments for Class Comparison and Prognostic Classification

1 Sample size calculation for single-channel array

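For simplicity, we assume the samples come from two classes. To simplify the presentation, we switch notation, so that $Y_{g1fs} = z_{fs}$ and $Y_{g2fs} = w_{fs}$. The normalized, background-adjusted log-intensities, $z_{ij}$ and $w_{kl}$, are described by the model:

\[
\begin{aligned}
z_{ij} &= \mu_x + x_i + \epsilon_{ij}, \quad i = 1 \ldots n,\ j = 1 \ldots m \\
x_i &\sim \mathrm{Normal}(0, \tau_x^2) \\
\epsilon_{ij} &\sim \mathrm{Normal}(0, \sigma^2) \\
w_{kl} &= \mu_y + y_k + \epsilon_{kl}, \quad k = 1 \ldots n,\ l = 1 \ldots m \\
y_k &\sim \mathrm{Normal}(0, \tau_y^2) \\
\epsilon_{kl} &\sim \mathrm{Normal}(0, \sigma^2)
\end{aligned}
\]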

where all the $x_i$, $y_k$, $\epsilon_{ij}$ and $\epsilon_{kl}$ are independent. We wish to test $\mu_x = \mu_y$ against the alternative $\mu_x \neq \mu_y$. Up to an additive constant, the log-likelihood is

\[
\begin{aligned}
ll = {} & -\frac{n}{2}\log\left|\Sigma_x\right| - \frac{1}{2}\sum_i (Z_i - \mu_x J_{m,1})^T \Sigma_x^{-1} (Z_i - \mu_x J_{m,1}) \\
& -\frac{n}{2}\log\left|\Sigma_y\right| - \frac{1}{2}\sum_k (W_k - \mu_y J_{m,1})^T \Sigma_y^{-1} (W_k - \mu_y J_{m,1})
\end{aligned}
\]

where $J_{m,1}$ is a vector of 1's of length $m$, $Z_i$ is the vector $(z_{i1}, \ldots, z_{im})^T$, and similarly $W_k$, and $\Sigma_x$ and $\Sigma_y$ are the covariance matrices corresponding to the $Z$'s and $W$'s. Assuming that the variance parameters are known, the covariance matrices are known and can be shown to be compound symmetric. Then the transformations

\[
Z_i^* = \Sigma_x^{-1/2} Z_i, \qquad W_k^* = \Sigma_y^{-1/2} W_k
\]

result in covariance matrices of the form

\[
\mathrm{cov}(Z_i^*) = I, \qquad \mathrm{cov}(W_k^*) = I
\]

where $I$ is the identity matrix. Now if we can write the hypothesis test in a simple form in this transformed space, then approximate sample size calculations will be relatively simple functions of the variance parameters, which we can then transform back into the space of the original problem to compare with the sample size formulas for the fixed-effects model. The following seven theorems comprise the formula derivation.

Theorem 1 The covariance matrices of the $Z_i$ and the $W_k$ are compound symmetric.

Proof:

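\[
\begin{aligned}
\mathrm{var}(z_{ij}) &= \tau_x^2 + \sigma^2 \\
\mathrm{cov}(z_{i1}, z_{i2}) &= \mathrm{cov}(\mu_x + x_i + \epsilon_{i1},\ \mu_x + x_i + \epsilon_{i2}) = \mathrm{var}(x_i) = \tau_x^2 \\
\mathrm{var}(w_{kl}) &= \tau_y^2 + \sigma^2 \\
\mathrm{cov}(w_{k1}, w_{k2}) &= \tau_y^2
\end{aligned}
\]

By symmetry, what holds for $j = 1$ and $j = 2$ holds for all $j_1 \neq j_2$ and all $i$; similarly, what holds for $l = 1$ and $l = 2$ holds for all $l_1 \neq l_2$ and all $k$. Therefore, the covariance matrices are

\[
\mathrm{cov}(Z_i) = I\sigma^2 + J_{m,m}\tau_x^2, \qquad \mathrm{cov}(W_k) = I\sigma^2 + J_{m,m}\tau_y^2
\]

which are compound symmetric. Q.E.D.

Theorem 2 The inverse of a matrix of the form $I_m\alpha^2 + J_{m,m}\beta$, if it exists, is $I_m\frac{1}{\alpha^2} - J_{m,m}\frac{\beta}{\alpha^2(\alpha^2 + m\beta)}$.

Proof:

\[
\begin{aligned}
(I_m\alpha^2 + J_{m,m}\beta)\cdot\left(I_m\frac{1}{\alpha^2} - J_{m,m}\frac{\beta}{\alpha^2(\alpha^2 + m\beta)}\right)
&= I_m + J_{m,m}\frac{\beta}{\alpha^2} - J_{m,m}\frac{\beta}{\alpha^2 + m\beta} - J_{m,m}\frac{m\beta^2}{\alpha^2(\alpha^2 + m\beta)} \\
&= I_m + J_{m,m}\frac{\beta(\alpha^2 + m\beta) - \beta\alpha^2 - m\beta^2}{\alpha^2(\alpha^2 + m\beta)} \\
&= I_m
\end{aligned}
\]

Q.E.D.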
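As a quick numerical sanity check of Theorem 2, the following sketch builds a small matrix of the form $I_m\alpha^2 + J_{m,m}\beta$, multiplies it by the claimed inverse, and measures how far the product is from the identity. The values of m, alpha_sq and beta, and all function names, are illustrative assumptions and do not come from the text.

```python
def compound_symmetric(m, diag, off):
    """Return the m x m matrix I*diag + J*off as a list of rows."""
    return [[(diag if i == j else 0.0) + off for j in range(m)] for i in range(m)]


def matmul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_deviation_from_identity(a):
    """Largest absolute entrywise difference between a and the identity."""
    n = len(a)
    return max(abs(a[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


# Illustrative values (assumptions, not taken from the text).
m, alpha_sq, beta = 4, 0.5, 0.3

# Matrix of the form I*alpha^2 + J*beta.
sigma = compound_symmetric(m, alpha_sq, beta)

# Inverse claimed by Theorem 2: I/alpha^2 - J*beta/(alpha^2*(alpha^2 + m*beta)).
sigma_inv = compound_symmetric(
    m, 1.0 / alpha_sq, -beta / (alpha_sq * (alpha_sq + m * beta)))

deviation = max_abs_deviation_from_identity(matmul(sigma, sigma_inv))
print(deviation)  # should be at the level of floating-point rounding error
```

Pure-Python lists are used here only to keep the sketch dependency-free; a linear algebra library would do the same check in a couple of lines.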

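Theorem 3 If $\Sigma_z = \mathrm{cov}(Z_i)$ and $\Sigma_w = \mathrm{cov}(W_k)$ (the matrices written $\Sigma_x$ and $\Sigma_y$ above), then

\[
\begin{aligned}
\Sigma_z^{-1} &= I_m\frac{1}{\sigma^2} - J_{m,m}\frac{\tau_x^2}{\sigma^2(\sigma^2 + m\tau_x^2)} \\
\Sigma_w^{-1} &= I_m\frac{1}{\sigma^2} - J_{m,m}\frac{\tau_y^2}{\sigma^2(\sigma^2 + m\tau_y^2)}
\end{aligned}
\]

if the inverses exist.

Proof: This follows directly from the previous two theorems. Q.E.D.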

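Theorem 4 If $\Sigma = I_m\alpha^2 + J_{m,m}\beta$, then $\Sigma^{1/2} = I_m\alpha + J_{m,m}\frac{\sqrt{m\beta + \alpha^2} - \alpha}{m}$.

Proof:

\[
\begin{aligned}
&\left(I_m\alpha + J_{m,m}\frac{\sqrt{m\beta + \alpha^2} - \alpha}{m}\right)\cdot\left(I_m\alpha + J_{m,m}\frac{\sqrt{m\beta + \alpha^2} - \alpha}{m}\right) \\
&\quad = I_m\alpha^2 + 2\alpha J_{m,m}\frac{\sqrt{m\beta + \alpha^2} - \alpha}{m} + mJ_{m,m}\left[\frac{\sqrt{m\beta + \alpha^2} - \alpha}{m}\right]^2 \\
&\quad = I_m\alpha^2 + J_{m,m}\frac{2\alpha\sqrt{m\beta + \alpha^2} - 2\alpha^2}{m} + J_{m,m}\frac{m\beta + \alpha^2 - 2\alpha\sqrt{m\beta + \alpha^2} + \alpha^2}{m} \\
&\quad = I_m\alpha^2 + J_{m,m}\beta
\end{aligned}
\]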

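Q.E.D.

Theorem 5 If the inverses exist, then

\[
\begin{aligned}
\Sigma_z^{-1/2} &= I\frac{1}{\sigma} + J_{m,m}\frac{1}{m}\left(\frac{1}{\sqrt{\sigma^2 + m\tau_x^2}} - \frac{1}{\sigma}\right) \\
\Sigma_w^{-1/2} &= I\frac{1}{\sigma} + J_{m,m}\frac{1}{m}\left(\frac{1}{\sqrt{\sigma^2 + m\tau_y^2}} - \frac{1}{\sigma}\right)
\end{aligned}
\]

Proof: Let

\[
\alpha^2 = \frac{1}{\sigma^2}, \qquad \beta = \frac{-\tau_x^2}{\sigma^2(\sigma^2 + m\tau_x^2)}
\]

so that $I_m\alpha^2 + J_{m,m}\beta = \Sigma_z^{-1}$ by Theorem 3. Then

\[
\begin{aligned}
\alpha &= \frac{1}{\sigma} \\
\frac{1}{m}\left(\sqrt{\alpha^2 + m\beta} - \alpha\right)
&= \frac{1}{m}\left(\sqrt{\frac{1}{\sigma^2} - \frac{m\tau_x^2}{\sigma^2(\sigma^2 + m\tau_x^2)}} - \frac{1}{\sigma}\right) \\
&= \frac{1}{m}\left(\frac{1}{\sqrt{\sigma^2 + m\tau_x^2}} - \frac{1}{\sigma}\right)
\end{aligned}
\]

and the expression for $\Sigma_z^{-1/2}$ follows from Theorem 4. The proof for $\Sigma_w^{-1/2}$ is similar. Q.E.D.

Theorem 6 The distributions of $Z_i^*$ and $W_k^*$ are $\mathrm{Normal}\left(\frac{\mu_x}{\sqrt{\sigma^2 + m\tau_x^2}}J_{m,1},\ I\right)$ and $\mathrm{Normal}\left(\frac{\mu_y}{\sqrt{\sigma^2 + m\tau_y^2}}J_{m,1},\ I\right)$, respectively.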

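Proof: Recall that $Z_i^* = \Sigma_x^{-1/2} Z_i$ and $W_k^* = \Sigma_y^{-1/2} W_k$. Then

\[
\begin{aligned}
E\left[\Sigma_x^{-1/2} Z_i\right] &= \Sigma_x^{-1/2} J_{m,1}\,\mu_x \\
&= J_{m,1}\frac{\mu_x}{\sigma} + J_{m,1}\,\mu_x\left(\frac{1}{\sqrt{\sigma^2 + m\tau_x^2}} - \frac{1}{\sigma}\right) \\
&= J_{m,1}\frac{\mu_x}{\sqrt{\sigma^2 + m\tau_x^2}}
\end{aligned}
\]

The mean vector for the $W_k^*$ follows from a similar calculation. The covariance matrices in both cases are the identity matrix by construction. Q.E.D.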
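Theorems 5 and 6 can also be checked numerically. The sketch below forms the matrix given in Theorem 5, confirms that it whitens the compound symmetric covariance $I\sigma^2 + J_{m,m}\tau^2$, and confirms that it maps the mean vector $\mu J_{m,1}$ to $\mu/\sqrt{\sigma^2 + m\tau^2}\,J_{m,1}$. The parameter values and helper names are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def whitening_matrix(m, sigma_sq, tau_sq):
    """Theorem 5: I/sigma + J*(1/m)*(1/sqrt(sigma^2 + m*tau^2) - 1/sigma)."""
    sigma = math.sqrt(sigma_sq)
    off = (1.0 / math.sqrt(sigma_sq + m * tau_sq) - 1.0 / sigma) / m
    return [[(1.0 / sigma if i == j else 0.0) + off for j in range(m)]
            for i in range(m)]


# Illustrative values (assumptions, not taken from the text).
m, sigma_sq, tau_sq, mu = 3, 0.4, 0.9, 2.0

a = whitening_matrix(m, sigma_sq, tau_sq)
cov = [[(sigma_sq if i == j else 0.0) + tau_sq for j in range(m)]
       for i in range(m)]

# The whitening matrix is symmetric, so A * cov * A is the transformed covariance.
whitened = matmul(matmul(a, cov), a)
cov_error = max(abs(whitened[i][j] - (1.0 if i == j else 0.0))
                for i in range(m) for j in range(m))

# Applying A to the constant vector mu*J_{m,1} multiplies mu by each row sum.
transformed_mean = [sum(row) * mu for row in a]
expected_mean = mu / math.sqrt(sigma_sq + m * tau_sq)
mean_error = max(abs(x - expected_mean) for x in transformed_mean)

print(cov_error, mean_error)  # both should be at rounding-error level
```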

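Theorem 7 If $\tau = \tau_x = \tau_y$, then the sample size for achieving power $1 - \beta$ at distance $\delta$ for a two-sided test at level $\alpha$ of the null hypothesis $H_0: \mu_x = \mu_y$ versus $H_1: \mu_x \neq \mu_y$ is

\[
n = 4\left[\frac{z_{\alpha/2} + z_\beta}{\delta}\right]^2\left(\tau^2 + \frac{\sigma^2}{m}\right).
\]

Proof: First, reduce by sufficiency.

\[
\begin{aligned}
\bar{Z}_i^* &\sim \mathrm{Normal}\left(\frac{\mu_x}{\sqrt{\sigma^2 + m\tau^2}},\ \frac{1}{m}\right) \\
\bar{W}_k^* &\sim \mathrm{Normal}\left(\frac{\mu_y}{\sqrt{\sigma^2 + m\tau^2}},\ \frac{1}{m}\right) \\
\sqrt{\sigma^2 + m\tau^2}\,\bar{Z}_i^* &\sim \mathrm{Normal}\left(\mu_x,\ \tau^2 + \frac{\sigma^2}{m}\right) \\
\sqrt{\sigma^2 + m\tau^2}\,\bar{W}_k^* &\sim \mathrm{Normal}\left(\mu_y,\ \tau^2 + \frac{\sigma^2}{m}\right)
\end{aligned}
\]

The required number of distinct samples $n$, assuming the variance parameters are known, is then

\[
n = 4\left[\frac{z_{\alpha/2} + z_\beta}{\delta}\right]^2\left(\tau^2 + \frac{\sigma^2}{m}\right)
\]

as desired. Q.E.D.

This sample size formula will only be approximate, because in reality the variance parameters are not known exactly and the distribution of the test statistic under the alternative hypothesis is non-central t, not normal.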
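The formula of Theorem 7 is straightforward to evaluate. The following is a minimal sketch, assuming the variance parameters are known and using the normal approximation described above; the function name, argument names and default values are hypothetical choices made for illustration.

```python
from statistics import NormalDist


def required_samples(delta, sigma_sq, tau_sq, m, alpha=0.05, power=0.90):
    """Evaluate n = 4*((z_{alpha/2} + z_beta)/delta)^2 * (tau^2 + sigma^2/m).

    delta    : difference in class means to be detected (log scale)
    sigma_sq : error variance sigma^2
    tau_sq   : common sample-to-sample variance tau^2 (tau_x = tau_y)
    m        : number of replicate measurements per sample
    alpha    : two-sided significance level
    power    : desired power, 1 - beta
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    z_beta = NormalDist().inv_cdf(power)               # upper beta quantile
    return 4.0 * ((z_alpha + z_beta) / delta) ** 2 * (tau_sq + sigma_sq / m)


# Example call with made-up inputs; the result is not rounded, so in practice
# one would round up to a whole number of samples.
n = required_samples(delta=1.0, sigma_sq=0.25, tau_sq=0.25, m=2,
                     alpha=0.001, power=0.95)
print(n)
```

Because $\sigma^2$ enters only through $\sigma^2/m$, increasing $m$ reduces $n$ only down to the floor set by $\tau^2$, which is visible by calling the function with increasing values of m.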

2 Impact of correlation among genes on false-positive rate

Suppose that there are a total of $G$ genes, and that a hypothesis test is carried out at the $\alpha$ significance level for each gene. For example, if there are two varieties of interest, numbered variety 1 and variety 2, then the test of the null hypothesis $H_0: VG_{1g} = VG_{2g}$ versus the alternative $H_1: VG_{1g} \neq VG_{2g}$ may be carried out for each gene $g$ to identify those that are differentially expressed in the varieties. Define a collection of indicator variables $d_g$ such that $d_g = 1$ if the hypothesis test for gene $g$ is significant and $d_g = 0$ otherwise. The number of significant hypothesis tests is then $\sum_g d_g$. When the null hypothesis holds for every gene, the expected number of significant tests is, by linearity of expectation,

\[
E\left[\sum_g d_g\right] = \sum_g E[d_g] = G\alpha
\]

where $G$ is the number of genes. Recall that linearity of expectation holds even when the variables are not independent. While the expectation of the number of false positives will remain unaffected by the correlation among the genes, the distribution of the number of false positives will be affected by the correlation structure.
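To make the last point concrete, consider a deliberately simple toy model, which is an assumption introduced here and not a model from the text: all genes are null, genes are grouped into blocks of equal size, tests within a block are perfectly correlated (all significant or none), and blocks are independent. The number of significant tests is then the block size times a binomial count of significant blocks, so its mean and variance can be written down exactly.

```python
def false_positive_moments(num_genes, alpha, block_size):
    """Mean and variance of the number of significant tests in the toy model.

    Toy model (assumption): every gene is null, genes form blocks of
    `block_size` whose tests are perfectly correlated within a block and
    independent across blocks. block_size=1 is the independent case.
    """
    if num_genes % block_size != 0:
        raise ValueError("num_genes must be a multiple of block_size")
    num_blocks = num_genes // block_size
    # Count = block_size * Binomial(num_blocks, alpha).
    mean = block_size * num_blocks * alpha
    variance = block_size ** 2 * num_blocks * alpha * (1.0 - alpha)
    return mean, variance


# Illustrative values (assumptions): 1000 genes tested at alpha = 0.001.
independent = false_positive_moments(1000, 0.001, block_size=1)
correlated = false_positive_moments(1000, 0.001, block_size=50)
print(independent)
print(correlated)
```

In this toy model the mean is $G\alpha$ whatever the block size, while the variance grows in proportion to the block size, which is one simple way the correlation structure changes the distribution without changing the expectation.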

3 Sample size comparisons for technical replicates in a reference design

Fix the total number of arrays available at n. We compare a design that uses no technical replicates to a design that uses r technical replicates for each sample.

\[
\begin{aligned}
y_{gadvf} &= G_g + GA_{ga} + GD_{gd} + GV_{gv} + (GF)_{gf(v)} + \epsilon_{gadvf} \\
GF_{gf} &\sim N(0, \kappa^2) \\
\epsilon &\sim N(0, \lambda^2)
\end{aligned}
\]

Under this model, the variance of the log-ratios in a reference design is $\kappa^2 + 2\lambda^2$. The variance of an average of $r$ technical replicate log-ratios is $\kappa^2 + \frac{2\lambda^2}{r}$. Technical replicates can be handled by averaging the replicate measurement log-ratios, then using the averages as if they were the original log-ratios. If there are $r$ replicates per sample, this results in $n/r$ total averages, one for each non-reference sample. Hence the variance of the class mean estimate, which determines the efficiency of a technical replicate design, is

\[
\frac{\kappa^2 + \frac{2\lambda^2}{r}}{n/r} = \frac{r\kappa^2 + 2\lambda^2}{n}.
\]

For example, suppose $n_1$ arrays are required to achieve a specified power or efficiency when no technical replicates are used. Then let $n_r$ be the number of arrays required to achieve that same power or efficiency when $r$ technical replicates per sample are performed. Then the two required sample sizes are related by the equation:

\[
\frac{n_r}{n_1} = \frac{r\kappa^2 + 2\lambda^2}{\kappa^2 + 2\lambda^2} = \frac{r(\kappa/\lambda)^2 + 2}{(\kappa/\lambda)^2 + 2}
\]

For instance, we have observed $(\kappa/\lambda)^2$ with median approximately 7.3 in one dataset. Hence, if $n_1 = 20$ arrays are required with a single replicate on each array, it would require $n_2 = 36$ arrays and 18 samples with two technical replicates per sample, and $n_3 = 51$ arrays and 17 samples with three technical replicates per sample, to achieve the same power and efficiency.
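The arithmetic behind this example can be reproduced with a few lines of code. The sketch below evaluates the ratio $n_r/n_1$ and converts it to whole samples and arrays; rounding to the nearest whole number of samples is an assumption made here, chosen because it reproduces the figures quoted above, and the function names are hypothetical.

```python
def array_ratio(r, variance_ratio):
    """n_r / n_1, where variance_ratio is (kappa/lambda)^2."""
    return (r * variance_ratio + 2.0) / (variance_ratio + 2.0)


def arrays_needed(n1, r, variance_ratio):
    """Return (arrays, samples) for r technical replicates per sample.

    Rounds to the nearest whole number of samples (an assumed convention),
    then uses r arrays for each sample.
    """
    samples = round(n1 * array_ratio(r, variance_ratio) / r)
    return samples * r, samples


# Values from the example in the text: (kappa/lambda)^2 = 7.3 and n_1 = 20.
print(arrays_needed(20, 2, 7.3))  # (36, 18)
print(arrays_needed(20, 3, 7.3))  # (51, 17)
```

The unrounded array counts are about 35.7 and 51.4, so the total number of arrays rises quickly with $r$ while the number of distinct samples falls only slightly.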

4 Sample size comparison formula for a paired samples design with technical replicates

The analysis of variance model for paired data requires a conceptual shift in the parameter interpretations (reference to our dye swap paper and the appendix thereto). In this paper one sees that several models can be fit to paired data, and the most appropriate one will depend on the level and extent of the replication. For the model we think most likely to be appropriate in practice, the variance of the contrast of interest for a design that is balanced with respect to the dyes and has no technical replicates is

\[
\frac{\tau_g^2 + 2\sigma_g^2}{n_{balanced}}
\]

where $\tau_g^2$ is the variance in the effect of the cancer on the expression level of gene $g$ in the population of interest, and $\sigma_g^2$ is the variance of the error term associated with gene $g$. The variance for a model with one technical replicate or one dye swap technical replicate for each array is

\[
\frac{2\tau_g^2 + 2\sigma_g^2}{n_{dyeswap}}.
\]
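Setting the two variances equal gives the number of arrays a dye swap design needs to match a balanced design, namely $n_{dyeswap}/n_{balanced} = (2\tau_g^2 + 2\sigma_g^2)/(\tau_g^2 + 2\sigma_g^2)$. This ratio is not stated in the text; it is derived here directly from the two formulas above, and the function names and input values in the sketch are illustrative assumptions.

```python
def balanced_variance(tau_sq, sigma_sq, n_arrays):
    """Contrast variance for a dye-balanced design with no technical replicates."""
    return (tau_sq + 2.0 * sigma_sq) / n_arrays


def dye_swap_variance(tau_sq, sigma_sq, n_arrays):
    """Contrast variance when each array has one (dye swap) technical replicate."""
    return (2.0 * tau_sq + 2.0 * sigma_sq) / n_arrays


def dye_swap_array_ratio(tau_sq, sigma_sq):
    """n_dyeswap / n_balanced that makes the two variances equal."""
    return (2.0 * tau_sq + 2.0 * sigma_sq) / (tau_sq + 2.0 * sigma_sq)


# Illustrative values (assumptions, not taken from the text).
tau_sq, sigma_sq, n_balanced = 0.5, 0.1, 20
ratio = dye_swap_array_ratio(tau_sq, sigma_sq)
print(ratio, ratio * n_balanced)
```

The ratio runs from 1 when $\tau_g^2 = 0$ up to 2 when $\sigma_g^2 = 0$, so under these formulas the dye swap design never needs fewer arrays than the balanced design.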

5 Sample size for pooling

For the reference design, recall the model for non-pooled dual-label data was

\[
\log\left(Y^{unpooled}_{gadvf}\right) = G_g + GA_{ga} + GD_{gd} + GV_{gv} + (GF)_{gf(v)} + \epsilon_{gadvf}.
\]

If we pool $k$ samples, and make the simplifying assumptions that each sample is equally represented in the pool and that the effect of pooling is additive on the log scale, then for pool $p$ (for example), which is made up of samples $1 \ldots k$, we will have

\[
\log\left(Y^{pooled}_{gadvp}\right) = G^{\{2\}}_g + GA^{\{2\}}_{ga} + GD^{\{2\}}_{gd} + GV^{\{2\}}_{gv} + \frac{1}{k}\sum_{f=1}^{k}(GF)^{\{2\}}_{gf(v)} + \epsilon^{\{2\}}_{gadvp}.
\]

Hence, what will change in the model is the $GF$ terms. Noting that

\[
\mathrm{var}\left[\frac{1}{k}\sum_{f=1}^{k}(GF)_{gf(v)}\right] = \frac{\tau^2}{k},
\]

where $\tau^2$ is the variance of a single $(GF)_{gf(v)}$ term, we see that a reduction in the biological variation associated with the pooled sample is the only substantive change in the model. Therefore, sample size calculations will only be affected through this change.
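The $\tau^2/k$ result can be illustrated with a small Monte Carlo sketch. It assumes, as in the text, that the pool is an equal-weight average of $k$ independent biological effects on the log scale; normality of those effects, the seed, the number of simulated pools and the function name are additional assumptions made only for this illustration.

```python
import random
import statistics


def simulate_pooled_variance(k, tau_sq, num_pools=20000, seed=0):
    """Empirical variance of the average of k independent N(0, tau^2) effects."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    tau = tau_sq ** 0.5
    pooled_effects = [
        sum(rng.gauss(0.0, tau) for _ in range(k)) / k
        for _ in range(num_pools)
    ]
    return statistics.pvariance(pooled_effects)


# Illustrative values (assumptions): tau^2 = 1 and pools of 1, 2 and 4 samples.
for k in (1, 2, 4):
    print(k, simulate_pooled_variance(k, 1.0), 1.0 / k)
```

Each printed line shows the pool size, the simulated variance and the theoretical value $\tau^2/k$, which should agree up to Monte Carlo error.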

6 A bijective function relating the terms in a single-label model to the terms in a dual-label model

Recall that the single-label model is

\[
\log\left(Y^{\{1\}}_{gvf}\right) = G^{\{1\}}_g + GV^{\{1\}}_{gv} + (GF)^{\{1\}}_{gf(v)} + \epsilon^{\{1\}}_{gvf}.
\]

Also recall that the dual-label model is

\[
\log\left(Y^{\{2\}}_{gadvf}\right) = G^{\{2\}}_g + GA^{\{2\}}_{ga} + GD^{\{2\}}_{gd} + GV^{\{2\}}_{gv} + (GF)^{\{2\}}_{gf(v)} + \epsilon^{\{2\}}_{gadvf},
\]

which, in terms of the log-ratio for a single array containing, say, sample 1 from variety 1, is

\[
\log\left(Y^{\{2\}}_{gadv1} / Y^{\{2\}}_{gadv0}\right) = GD^{\{2\}}_{g1} - GD^{\{2\}}_{g2} + GV^{\{2\}}_{g1} - GV^{\{2\}}_{g0} + (GF)^{\{2\}}_{g1(v)} - (GF)^{\{2\}}_{g0(v)} + \epsilon^{\{2\}}_{gadv1} - \epsilon^{\{2\}}_{gadv0}
\]

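All log-ratios in a dual-label reference design will have the $GD^{\{2\}}_{g1} - GD^{\{2\}}_{g2}$ term. This represents the difference between the dyes with respect to dye incorporation. Since the single-dye array has only one dye, the impact of this dye is contained in the $G^{\{1\}}_g$ term of that model. Note that the correspondences

\[
\begin{aligned}
G^{\{1\}}_g &\Leftrightarrow GD^{\{2\}}_{g1} - GD^{\{2\}}_{g2} \\
GV^{\{1\}}_{gv} &\Leftrightarrow GV^{\{2\}}_{g1} - GV^{\{2\}}_{g0} \\
GF^{\{1\}}_{gf} &\Leftrightarrow GF^{\{2\}}_{g1} - GF^{\{2\}}_{g0} \\
\epsilon^{\{1\}}_{gvf} &\Leftrightarrow \epsilon^{\{2\}}_{gadv1} - \epsilon^{\{2\}}_{gadv0}
\end{aligned}
\]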

form a bijective map from the single-dye probability model to the dual-dye probability model. Finally, recall that the correlation structure among the log-ratios in the reference design was shown to have no impact on the statistical inference procedures we considered. Therefore, we have shown that the two models are equivalent.

7 Table showing performance of FDR estimate

π      α      1−β    MC FDR: Mean(SD)    Ê[FDR]
.005   .001   .95    .166(.045)          .17
.005   .001   .90    .175(.047)          .18
.005   .001   .80    .188(.049)          .20
.05    .001   .95    .020(.0058)         .02
.05    .001   .90    .021(.0072)         .02
.05    .001   .80    .022(.0064)         .02
.20    .001   .95    .0042(.0014)        .004
.20    .001   .90    .0045(.0014)        .004
.20    .001   .80    .0051(.0016)        .005

.005   .01    .95    .673(.023)          .68
.005   .01    .90    .679(.025)          .69
.005   .01    .80    .705(.025)          .71
.05    .01    .95    .165(.014)          .17
.05    .01    .90    .168(.013)          .17
.05    .01    .80    .189(.017)          .19
.20    .01    .95    .040(.0050)         .04
.20    .01    .90    .041(.0045)         .04
.20    .01    .80    .047(.0048)         .05

.005   .005   .95    .512(.040)          .51
.005   .005   .90    .510(.034)          .53
.005   .005   .80    .549(.039)          .55
.05    .005   .95    .090(.011)          .09
.05    .005   .90    .094(.012)          .10
.05    .005   .80    .101(.012)          .11
.20    .005   .95    .020(.0035)         .02
.20    .005   .90    .021(.0031)         .02
.20    .005   .80    .023(.0034)         .02

Table 1: Monte Carlo (MC) simulation of the false discovery rate (FDR). Each line represents 100 MC simulations. Ê[FDR] is the estimated FDR using the estimation formula in the text. α is the significance level, 1 − β is the power, and π is the true proportion of differentially expressed genes.
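The estimation formula itself is given in the main text and is not reproduced in this supplement. The sketch below assumes it has the form $(1-\pi)\alpha / \left((1-\pi)\alpha + \pi(1-\beta)\right)$, with π read as the proportion of differentially expressed genes; this assumed form matches the estimated-FDR column of Table 1 for the rows checked. The function name is hypothetical.

```python
def expected_fdr(pi, alpha, power):
    """Assumed FDR estimate: expected false positives over expected positives.

    pi    : proportion of genes that are truly differentially expressed
    alpha : per-gene significance level
    power : per-gene power, 1 - beta
    """
    false_positives = (1.0 - pi) * alpha  # null genes declared significant
    true_positives = pi * power           # non-null genes declared significant
    return false_positives / (false_positives + true_positives)


# A few (pi, alpha, power) rows from Table 1, compared to two decimals.
for pi, alpha, power in [(0.005, 0.001, 0.95),
                         (0.005, 0.01, 0.95),
                         (0.05, 0.005, 0.90)]:
    print(pi, alpha, power, round(expected_fdr(pi, alpha, power), 2))
```

For these three rows the rounded values are 0.17, 0.68 and 0.10, in agreement with the corresponding table entries.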

8 Review of other microarray sample size papers

Most of the papers addressing sample size issues have focused on determining an adequate sample size for comparing individual specimens, without making statistical inference to larger populations. For instance:

• Lee et al. (2000) and Black and Doerge (2002) discuss methods for determining the number of replicate spots needed on an array when the goal of the experiment is to compare two individual labeled cDNA samples;

• Pan et al. (2002) discuss computational algorithms to determine the number of (technical) replicate arrays required for nonparametric analyses when the goal is to compare two individual RNA samples adequately;

• Wolfinger et al. (2001) present power plots for comparing individual samples in a mixed effects analysis of variance model;

• Lee and Whitmore (2002) present a Bayesian approach to sample size and power for comparing individual specimens.

Other investigators have suggested methods for sample size determination for linear discrimination analysis (Hwang et al., 2002), where the goal is to test the global hypothesis that the classes do not differ with regard to expression of any genes. We have presented some sample size formulas for statistical inference in microarray studies for particular experimental situations (Simon et al., 2002; Dobbin et al., 2003a; Dobbin et al., 2003b), where the goal is to make statistical inference to populations of specimens. Zien et al. (2003) used a model similar to the one we proposed (Dobbin and Simon, 2002) to develop a simulation program that provides estimates of sensitivity and specificity in response to a given set of user inputs, one of which is the number of samples in each group.

9 Design choice discussions

A simple reference design places sub-samples from a single reference RNA sample on each array, tagged with the same dye. Suppose we have a simple reference design experiment comparing two groups. In general, if the goal is to compare two groups, we recommend a balanced block design, in which no reference RNA is used and each pair of classes appears together the same number of times over the arrays (Scheffé, p. 161). But there are reasons one might prefer to use a reference design. Because spot effects, which are generally fairly large, are confounded with gene-specific sample effects (i.e., biological variation), comparing samples on different arrays directly becomes problematic in a balanced block design. This confounding has a number of implications: 1) cluster analysis of samples may be impossible unless one ignores the spot effects or introduces additional model assumptions, neither of which seems advisable (Dobbin and Simon, 2002); 2) similarly, it may be impossible to construct a class predictor out of a subset of genes; 3) for the same reason, it may be impossible to develop a multi-gene prognostic marker; 4) normalizing data gathered at different times or in different experiments will also be very problematic with a balanced block design. Another problem with a balanced block design is that if there is more than one way to classify the samples, then designing the experiment so that each potential classification can be made efficiently may be difficult or impossible. In sum, if one is considering an analysis that goes beyond just identifying differentially expressed genes, then a reference design will often be preferable. Hence, a sample size formula for comparing classes in a reference design may be of some interest.

In general, we (Dobbin et al., 2003), and others (Liang et al., 2003), do not recommend using dye-swaps (i.e., running the same samples on two different arrays, one with the dye labeling reversed) in a reference design unless a goal is comparison of experimental samples to the reference. For comparisons of classes of non-reference samples, dye-swap arrays do not remove any bias from the comparisons and generally represent replication at the wrong level, and hence result in a loss of efficiency for the comparisons.
On the other hand, some technical replicates may be informative to assure that the assay is reproducible.

The motivation for having more than one label in a spotted microarray system is that variability in log intensity corresponding to spots on different arrays is greater than variability within a spot, from one labeled sample to the other. This is because of variability in spot size for printed arrays and because the labeled sample is not distributed uniformly across the surface of the array. This situation has an analogue in agricultural experiments, where often the variability between different tracts of land, due to location, weather, soil conditions, etc., is greater than the variability within a tract of land. The tracts of land, which are analogous to the spots on the microarray, are referred to as blocks, and the resulting experimental design is called a block design. One essentially "blocks out" the effect of the spots/tracts in such an experiment by basing all inference on within-spot/tract comparisons. When comparing several classes with this type of block experiment, it is a well-established fact in the agricultural model that a balanced block design, in which samples from each pair of varieties appear together on an array the same number of times, produces the greatest efficiency and power. The model for microarray data is somewhat different from the agricultural analogue, but we have shown that it is still true that the balanced block design is the most efficient and powerful, and can be a significant improvement in these respects over other designs (Dobbin and Simon, 2002; Dobbin et al., 2003).

The goal of a paired design is typically to identify genes expressed differently within each pair, e.g., genes whose expression level is affected by the treatment, or the presence of the tumor. For paired samples and dual-label microarrays, putting one member from each pair together on each array is a natural design that is also most efficient. In this type of design, each individual serves as their own control, which can effectively eliminate much biological variation from the comparison of interest, increasing precision. We have recommended balancing the pairs with respect to the dyes in order to eliminate the dye bias from the comparisons; that is, if the sample pairs consist of normal and tumor tissue from the same individual, then half the normal tissue would be tagged with Cy3 and the other half with Cy5. Researchers have also run each sample pair twice, once with each dye labeling, to eliminate the dye bias (Boer et al., 2001; Lossos et al., 2002). Such a design is sometimes called a complete dye-swap design.

Kendziorski et al. (2003) study the efficiency of pooling as a function of the ratio of the biological variation to the technical measurement error variation, similar to our investigation into the effects of this ratio on experimental design efficiency (Dobbin and Simon, 2002). There are practical limits to the applicability of the sample size formula for pooled samples.
For instance, it is only correct under the assumption that each sample is equally represented in the pool and the pool is homogeneous; but as the number of samples increases, the validity of these assumptions intuitively becomes more questionable. And, in general, the more samples are pooled together, the more things can go wrong. For instance, there is a potential quality control issue because one must be sure that all the contributing RNA to the pool is of good quality, since problems with a particular RNA sample may not be detectable after pooling, and this could lead to erroneous inference. Also, if one is combining a large number of samples, then the different RNA may interact in unexpected ways that would result in non-additive effects on the expression level of individual genes.

Sometimes investigators with a limited number of samples and arrays ask whether, given a fixed number of arrays, pooling some of the samples together and replicating the pooled samples over the arrays may be preferable to assigning each sample to a different array. For instance, if one has 24 samples and plans to run 24 arrays, one can either run one sample per array, or pool pairs of samples together and run each pool twice (technical replication), or pool triples of samples together and run each triple three times, etc. In fact, the one sample per array strategy is best: one gains no efficiency by pooling samples rather than assigning each to its own array, but one does lose degrees of freedom for error, and therefore power for detecting differential expression.

In conclusion, if one is not forced to pool for technical reasons (e.g., insufficient RNA from each individual is available for the microarray), then it appears that pooling will rarely be a good idea. The reduction in arrays required under a pooled design will rarely be worth the increase in the number of samples required, particularly when one factors in the potential loss of robustness and power that comes with the pooled design.
