Biostatistics
Yves Desdevises
Observatoire Océanologique de Banyuls/Mer
Université Pierre et Marie Curie - Paris 6, France

Outline 

1. Session 1 - Basics



2. Session 2 - Small samples



Timing for each session: 

Lecture (45-60 min)

Computer demo (30 min)

Computer lab + discussion of your data, ... (90 min)

Session I - Basics

Question 

Null hypothesis



Type of data



Number of data



Number of groups/variables

A few definitions 

 



Object = observation = sample unit: item on which characteristics (variables) are measured

Sample: all objects

Target population: the group of objects on which the study focuses

Statistical population: objects represented by the sample. Inference is made on this population.



Variable = descriptor = factor = trait: feature measured or observed on objects. e.g. length, temperature, ... 

Dependent variable (Y) = response



Independent variable (X) = explanatory



Random variable: value unknown before measure



Fixed variable: set by the experimenter; the only error is measurement error (NB: error means random (or residual) variation, not a sign of bad work!)





Parameter: quantitative value giving a condensed representation of the information contained in a dataset. e.g. mean, slope, ...

Variance = inertia = mean square: sum of squared deviations from the mean, divided by the number of objects (a dispersion parameter)



Standard deviation: square root of the variance (same unit as the variable): Sx = √(Sx²)





Standard error: standard deviation of the sampling distribution of the mean (dispersion of the means of several samples drawn from the same population); see the sketch after these definitions

Dispersion = information (≠ noise). Must be assessed via replication.
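As a concrete illustration of these dispersion parameters, here is a minimal Python/NumPy sketch (the sample values are invented for the example, and NumPy is assumed; it is not part of the original course material) computing the mean, variance, standard deviation and standard error of a small sample.

```python
import numpy as np

heights = np.array([172.0, 168.5, 181.2, 175.3, 169.9, 177.6])  # hypothetical sample

n = heights.size
mean = heights.mean()
# Note: the definition above divides by the number of objects; the usual
# sample estimate divides by n - 1 (ddof=1), as done here.
var = heights.var(ddof=1)
sd = np.sqrt(var)            # standard deviation, same unit as the variable
se = sd / np.sqrt(n)         # standard error: dispersion of the mean over repeated samples

print(f"n = {n}, mean = {mean:.2f}, variance = {var:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
```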

Data 

Binary: 2 states. e.g. presence/absence



Multiple: more than 2 states 

Non-ordered = nominal. e.g. color



Ordered 

Semi-quantitative = ordinal = ranked. e.g. classes



Quantitative 

Discrete. e.g. number of individuals



Continuous. e.g. length

Descriptive statistics  

To be computed for each sample before any test. Standard: n, mean, variance, histogram or normal quantile plot (distribution), ... (see the sketch after this list)



Identify outliers



Often forgotten...
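A minimal sketch of this pre-test exploration, assuming Python with NumPy, SciPy and Matplotlib (the data are simulated for the example): sample size, mean, variance, a histogram, a normal quantile plot, and a crude outlier screen.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.random.default_rng(1).normal(loc=50, scale=8, size=40)  # stand-in for a real sample

print(f"n = {x.size}, mean = {x.mean():.2f}, variance = {x.var(ddof=1):.2f}")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=10)                       # distribution: roughly bell-shaped?
stats.probplot(x, dist="norm", plot=ax2)   # normal quantile plot: roughly a straight line?
plt.show()

# Crude outlier screen: flag points more than 1.5 interquartile ranges outside the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print("potential outliers:", outliers)
```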

[Figures: Statdisk screenshots - Explore Data summary and scatterplot of the sample (X value on the horizontal axis)]

Planning experiments 

Work on a simplified system: study the response to the variation of a few factors



Important: be rigorous (there is no test for this: it is an assumption)



Construction: null and alternative hypotheses



Specifically answer a question





Minimise Type I error: do not erroneously conclude that the factor has an effect

Maximise power (linked to Type II error): ability to detect small differences







Importance of a control group 

Untreated group



Special treatment (placebo, manipulation, ...)

Block 

A priori group of sample units



Allows a balanced allocation of treatments

Replication 



Assess natural variability

Avoid pseudoreplication

[Figure: spatial arrangements of replicates, after Hurlbert (1984): randomized, randomized block, systematic, simple segregation, clumped segregation, isolative segregation, randomized but interdependent replicates, no replication]

Methods 

Depends on question, variable type, number of groups



1 variable: difference between groups



2 or more variables: links, dependence



... but statistical principles remain the same

Statistical tests  



Define the null hypothesis (H0): the only hypothesis tested, with a known distribution. Assumed to be true, then rejected or not. Rejection depends on the Type I error (α = fixed threshold)

Alternative hypothesis (H1): what we are often interested in, the biological hypothesis. "Accepting H1" (rejecting H0) depends on power (hence on sample size)

H1 cannot be "proven", but, if the experiment is correctly designed, it is highly likely if H0 is rejected

Conditions 

Independence  

Depends on the experimental design. If independence does not hold, you might generate a Type I error (finding a relationship/effect when there is none)



Check variances: must generally be homogeneous (~ equal)



Distribution: normality or not



The only question: reject or not H0



Error from natural variability: accounted for by repetitions



A test refers to a distribution in which the test statistic (a value linked to what you are testing) falls; see the sketch at the end of this list

Theoretical distribution: e.g. normal curve (parametric test, many software packages and tables). Requires some conditions



No distribution: non parametric test (ranked values)



Generate distribution from data: permutation tests 

Power ≥ parametric test, better for small samples... requires a computer



Possibility of confidence intervals for the test statistic and parameters (depends on sample size and strength of relationship)
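As a small illustration of this decision rule (hypothetical numbers; Python with SciPy assumed, not part of the original course material): compute the test statistic, locate it in its reference distribution, and reject H0 when the resulting p-value falls below α.

```python
from scipy import stats

alpha = 0.05          # fixed Type I error threshold
t_obs = 2.3           # hypothetical observed test statistic
df = 18               # degrees of freedom of its theoretical (Student t) distribution

# Two-tailed p-value: probability, under H0, of a statistic at least this extreme
p_value = 2 * stats.t.sf(abs(t_obs), df)

print(f"p = {p_value:.3f}", "-> reject H0" if p_value < alpha else "-> do not reject H0")
```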

Permutation testing: example 

20 individuals in 2 groups of 10; variable = height (H)



Question: is mean height different in each group?



H0: same mean heights (H1 = H2, where H1 and H2 are the mean heights of groups 1 and 2)



Test statistic: difference H1 - H2



If H0 is true, H1 - H2 is about 0











If we mix all 20 individuals and randomly select 2 groups of 10 individuals, we do not expect a difference in heights (no more than "random noise")

Any random combination into two groups is a realization of H0 (H1 - H2 ~ 0)

Most of these random differences are close to 0, a few are large (> 0 or < 0): this generates a distribution

These random realizations are produced by random permutations of the data: permutation test

You only need to assess where your original observed H1 - H2 falls in this distribution: if it lies in the extreme 5 % of values, the difference is significant (H1 ≠ H2); see the sketch below
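Below is a minimal permutation-test sketch of this height example in Python/NumPy (the 20 height values are invented): the pooled data are reshuffled many times, each shuffle gives one realization of H0, and the observed difference is compared to the resulting distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
group1 = np.array([175, 180, 169, 172, 178, 181, 174, 177, 170, 176], dtype=float)
group2 = np.array([168, 171, 165, 173, 166, 170, 169, 172, 167, 174], dtype=float)

observed = group1.mean() - group2.mean()          # test statistic: H1 - H2
pooled = np.concatenate([group1, group2])
n1 = group1.size

n_perm = 9999
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)            # one random split = one realization of H0
    perm_diffs[i] = shuffled[:n1].mean() - shuffled[n1:].mean()

# Two-tailed p-value: proportion of permutations at least as extreme as the observation
p_value = (np.sum(np.abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)
print(f"observed difference = {observed:.2f}, permutation p-value = {p_value:.4f}")
```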

[Figure: permutation distribution of the statistic t*; the counts of permutations with t* < -t, t* = -t, -t < t* < t and t* beyond t are used to locate the observed statistic and compute the p-value]

Known for a long time, but computers only recently available

Normality tests 





Used before parametric testing (many methods, easy to implement in common software packages); the data may need to be transformed first. Examples:

Kolmogorov-Smirnov test



Shapiro-Wilk test

Quick and dirty 

Make a histogram and assess whether it is bell-shaped

Check whether a straight line roughly fits the normal quantile plot (see the sketch below)
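A short sketch of these checks, assuming Python with SciPy and Matplotlib (data simulated for the example): the Shapiro-Wilk and Kolmogorov-Smirnov tests, a log transformation, and the quick-and-dirty histogram and normal quantile plot.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.random.default_rng(2).lognormal(mean=0, sigma=0.5, size=50)  # skewed example data

print("Shapiro-Wilk:", stats.shapiro(x))
# KS test against a normal with parameters estimated from the data
# (strictly, estimating parameters from the same data biases this test)
print("Kolmogorov-Smirnov:", stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))))

x_log = np.log(x)                     # a transformation may restore normality
print("Shapiro-Wilk (log):", stats.shapiro(x_log))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x_log, bins=10)                      # bell-shaped?
stats.probplot(x_log, dist="norm", plot=ax2)  # roughly a straight line?
plt.show()
```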


1 variable: Comparing groups 

Samples can be 

Independent (no corresponding pairs among samples)



Matched (e.g. before/after)



But data must be independent



H0: equal means



Two steps 

Variance comparison



Mean comparison





If variances are unequal: Behrens-Fisher problem (two null hypotheses are tested at once: equal means and equal variances)

If the distribution is non-normal, some variance tests exist, e.g.

Fligner-Killeen (2 groups)



Permutational Bartlett test (> 2 groups)



Or... visually assess the difference between variances (rule of thumb: largest variance < 9 × smallest variance); a sketch of the two-step comparison follows
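A minimal sketch of the two-step comparison for two independent groups, assuming Python with SciPy (data invented for the example): first compare variances, then compare means with the appropriate test.

```python
import numpy as np
from scipy import stats

a = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.4, 5.2])
b = np.array([4.4, 4.7, 4.2, 4.9, 4.5, 4.3, 4.8, 4.6])

# Step 1: variance comparison (Fligner-Killeen is robust to non-normality)
print("Fligner-Killeen:", stats.fligner(a, b))
print("Bartlett:", stats.bartlett(a, b))          # assumes normality

# Step 2: mean comparison
print("Student t (equal variances):", stats.ttest_ind(a, b, equal_var=True))
print("Welch t (unequal variances):", stats.ttest_ind(a, b, equal_var=False))
print("Mann-Whitney (non-parametric):", stats.mannwhitneyu(a, b))
```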

Compare matched samples 

Only for 2 groups



Matched pairs (before/after, husband/wife, top/bottom, ...) 

Supplementary information: use the right test!



No variance testing



Parametric test = t test (for matched pairs)



Non-parametric test = Wilcoxon signed-rank test (useful for small samples)

Permutation test = permutation version of the t test (see the sketch below)
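A minimal sketch for matched pairs, assuming Python with SciPy (before/after values invented for the example): the paired t test and its non-parametric counterpart, the Wilcoxon signed-rank test.

```python
import numpy as np
from scipy import stats

before = np.array([62.0, 58.5, 71.2, 65.3, 69.9, 60.4, 66.1, 63.7])
after  = np.array([60.1, 57.0, 69.8, 63.9, 68.2, 59.5, 64.8, 62.9])

# Paired (matched-pairs) t test: works on the differences, no variance comparison step
print("paired t test:", stats.ttest_rel(before, after))

# Non-parametric equivalent, useful for small or non-normal samples
print("Wilcoxon signed-rank:", stats.wilcoxon(before, after))
```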

Tests 

Analysis of variance (ANOVA - works even for 2 groups)



Groups separated according to factor/treatment



Parametric or permutational



Assess if means are equal or not 





If there is no treatment effect, the group means should lie in the same place: their variance is only random error (which is known). If not, the treatment has an effect

Mean comparison without computing the means (!), only through variances

Avoids multiple testing





A posteriori/post-hoc tests to identify which means differ (LSD, HSD, Scheffé, Tukey, ...)

Some non-parametric equivalents (one for each type)



Unequal variances (if non-normality, use the permutational version)

Small samples (while permutations may be used too); a one-way ANOVA sketch follows
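A minimal one-way ANOVA sketch, assuming Python with SciPy (three invented treatment groups), with the Kruskal-Wallis test as a non-parametric equivalent.

```python
import numpy as np
from scipy import stats

g1 = np.array([12.1, 13.4, 11.8, 12.9, 13.0])   # treatment 1
g2 = np.array([14.2, 15.1, 13.8, 14.9, 15.3])   # treatment 2
g3 = np.array([12.5, 12.0, 13.1, 12.8, 12.3])   # treatment 3

# One-way ANOVA: compares between-group to within-group variance (F statistic)
print("ANOVA:", stats.f_oneway(g1, g2, g3))

# Non-parametric equivalent, based on ranks
print("Kruskal-Wallis:", stats.kruskal(g1, g2, g3))

# If H0 is rejected, a post-hoc test (e.g. Tukey HSD) identifies which means differ
```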

[Figures: example experimental layouts - one-way ANOVA (one factor with several groups), two-way ANOVA (two crossed factors), and nested ANOVA (levels of one factor nested within another)]

≥ 2 variables: studying links 

Links: correlation (between -1 and 1)

Parametric or not, depending on the test and the variable type (quantitative or ranked)

Chi-square for qualitative variables

Modelling: regression (effect on response variable) 

Simple or multiple: number of variables



Linear or not: shape of relationship

Same principles for testing

Connection between ANOVA and regression (variance decomposition); a short sketch of these link analyses follows this section



Correlation ≠ causality (and vice-versa)
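A minimal sketch of these link analyses, assuming Python with SciPy (data simulated for the example): Pearson and Spearman correlation, a chi-square test on a small contingency table, and a simple linear regression.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=30)                 # explanatory variable (X)
y = 2.0 * x + rng.normal(0, 2, size=30)         # response variable (Y)

print("Pearson:", stats.pearsonr(x, y))          # linear correlation, between -1 and 1
print("Spearman:", stats.spearmanr(x, y))        # rank-based (semi-quantitative data)

# Chi-square for two qualitative variables (counts in a contingency table)
table = np.array([[12, 8], [5, 15]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")

# Simple linear regression: effect of X on the response Y
reg = stats.linregress(x, y)
print(f"slope = {reg.slope:.2f}, intercept = {reg.intercept:.2f}, R^2 = {reg.rvalue**2:.2f}")
```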