Biostatistics
Yves Desdevises
Observatoire Océanologique de Banyuls/Mer
Université Pierre et Marie Curie - Paris 6
France
Outline
1. Session 1 - Basics
2. Session 2 - Small samples
Timing for each session:
Lecture (45-60 min)
Demo on computer (30 min)
Computer lab + discussion of your data, ... (90 min)
Session 1 - Basics
Question
Null hypothesis
Type of data
Number of data
Number of groups/variables
A few definitions
Object = observation = sample unit: item on which characteristics (variables) are measured
Sample: the set of all objects measured
Target population: group of objects on which the study focuses
Statistical population: objects represented by the sample. Inference is made on this population.
Variable = descriptor = factor = trait: feature measured or observed on objects. e.g. length, temperature, ...
Dependent variable (Y) = response
Independent variable (X) = explanatory
Random variable: value unknown before measurement
Fixed variable: set by the experimenter, error = measurement error only (NB: error = random (or residual) variation, not a bad job!)
Parameter: quantitative value giving a condensed representation of the information contained in a dataset. e.g. mean, slope, ...
Variance = inertia = mean square: sum of squared deviations from the mean / number of objects (dispersion parameter)
Standard deviation: square root of the variance (same unit as the variable): $s_x = \sqrt{s_x^2}$
Standard error: standard deviation of the sampling distribution of the mean (dispersion of the means of several samples from a same population)
Dispersion = information (≠ noise). Must be assessed via replication.
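As a minimal sketch of the dispersion parameters defined above, the Python snippet below computes the variance, standard deviation and standard error of a small sample. The values in `lengths` are hypothetical. The variance is first computed as defined above (sum of squares divided by the number of objects); dividing by n - 1 instead, as many software packages do, is shown for comparison and is used here for the standard deviation and standard error.

```python
import math
import statistics

# Hypothetical sample: lengths (cm) measured on 8 objects
lengths = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 12.8, 11.7]
n = len(lengths)
mean = statistics.fmean(lengths)

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in lengths)

var_n = ss / n               # definition used above: divide by the number of objects
var_unbiased = ss / (n - 1)  # common alternative: divide by n - 1

sd = math.sqrt(var_unbiased)  # same unit as the variable
se = sd / math.sqrt(n)        # standard error of the mean

print(f"n={n} mean={mean:.3f} var={var_unbiased:.3f} sd={sd:.3f} se={se:.3f}")
```

The standard error shrinks as n grows, while the standard deviation does not: the first describes the precision of the mean, the second the dispersion of the objects.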
Data
Binary: 2 states. e.g. presence/absence
Multiple: more than 2 states
Non-ordered = nominal. e.g. color
Ordered
Semi-quantitative = ordinal = ranked. e.g. classes
Quantitative
Discrete. e.g. number of individuals
Continuous. e.g. length
Descriptive statistics
To be computed for each sample before any test
Standard: n, mean, variance, histogram or normal quantile plot (distribution), ...
Identify outliers
Often forgotten...
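A possible way to get these descriptive statistics in Python is sketched below. The function name `describe` and the data are hypothetical, and the outlier rule (values beyond 1.5 times the interquartile range from the quartiles) is an assumption chosen for illustration; the slides do not prescribe a particular rule.

```python
import statistics

def describe(sample):
    """Return basic descriptive statistics and flag potential outliers."""
    q1, median, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier rule
    return {
        "n": len(sample),
        "mean": statistics.fmean(sample),
        "variance": statistics.variance(sample),
        "median": median,
        "outliers": [x for x in sample if x < low or x > high],
    }

# Hypothetical data with one suspicious value
sample = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8, 4.3, 9.7]
summary = describe(sample)
print(summary)
```

A flagged value is only a candidate: whether it is a measurement mistake or genuine variability has to be decided from knowledge of the data.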
[Figure: Statdisk "Explore Data" output and scatterplot (x-axis: X Value), printed on Wed 10 Sep 2008 at 9:08]
Planning experiments
Work on a simplified system: study the response to the variation of few factors
Important: be rigorous (no test can check this: it is an assumption)
Construction: null and alternative hypotheses
Specifically answer a question
Minimise Type I error: do not erroneously conclude that the factor has an effect
Maximise power (i.e. minimise Type II error): ability to detect small differences
Importance of a control group
Untreated group
Special treatment (placebo, manipulation, ...)
Block
A priori group of sample units
Allows a good distribution of treatments among sample units
Replication
Assess natural variability
Avoid pseudoreplication
[Figure: layouts of experimental designs (Hurlbert, 1984): randomized, randomized block, systematic, simple segregation, clumped segregation, isolative segregation, randomized but interdependent replicates, no replication]
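The combination of blocking, replication and randomization described above can be sketched as follows. This is only an illustration: the function `randomized_block_design`, the block and unit names, and the treatment labels are all hypothetical.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Assign every treatment once, in random order, within each block."""
    rng = random.Random(seed)
    design = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one unit per treatment")
        shuffled = list(treatments)
        rng.shuffle(shuffled)
        design[block] = dict(zip(units, shuffled))
    return design

# Hypothetical layout: 3 blocks of 3 sample units each
blocks = {
    "block_1": ["u1", "u2", "u3"],
    "block_2": ["u4", "u5", "u6"],
    "block_3": ["u7", "u8", "u9"],
}
treatments = ["control", "placebo", "treated"]
design = randomized_block_design(blocks, treatments, seed=1)
for block, assignment in design.items():
    print(block, assignment)
```

Each treatment appears once per block, so every treatment is replicated across blocks and no treatment is confined to a single block.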
Methods
Depends on question, variable type, number of groups
1 variable: difference between groups
2 or more variables: links, dependence
... but statistical principles remain the same
Statistical tests
Define the null hypothesis (H0)
The only hypothesis tested; its distribution is known. Assumed to be true, then rejected or not.
Rejection depends on the Type I error rate (α = fixed threshold)
Alternative hypothesis (H1): what we are often interested in, the biological hypothesis.
Rejecting H0 (“accepting H1”) depends on power (hence sample size)
H1 cannot be “proven” but, if the experiment is correctly designed, it is highly likely if H0 is rejected
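To make Type I error and power more concrete, the simulation below repeats a simple experiment many times and counts how often H0 is rejected. It is a sketch under strong assumptions that are not in the slides: normal data, a known standard deviation `sigma`, and a two-sample z test; the function names and all numerical settings are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def z_test_rejects(x, y, sigma, alpha=0.05):
    """Two-sided two-sample z test of equal means, assuming a known sigma."""
    se = sigma * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (fmean(x) - fmean(y)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(true_difference, n=10, sigma=1.0, n_sim=2000, seed=0):
    """Proportion of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        y = [rng.gauss(true_difference, sigma) for _ in range(n)]
        rejections += z_test_rejects(x, y, sigma)
    return rejections / n_sim

type_i_rate = rejection_rate(true_difference=0.0)  # H0 true: should be close to alpha
power = rejection_rate(true_difference=1.0)        # H0 false: estimated power
print(f"Type I error rate ~ {type_i_rate:.3f}, power ~ {power:.3f}")
```

Increasing `n` in the second call raises the estimated power, while the rejection rate under H0 stays near α.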
Conditions
Independence
Depends on the experimental design
Without independence, you risk a Type I error (finding a relationship/effect when there is none)
Check variances: must generally be homogeneous (~ equal)
Distribution: normality or not
The only question: reject or not H0
Error from natural variability: accounted for by repetitions
The test refers to a distribution, in which the test statistic (a value linked to what you are testing) falls
Theoretical distribution: e.g. Normal curve (parametric test, available in many software packages and tables). Requires some conditions
No distribution: non parametric test (ranked values)
Generate distribution from data: permutation tests
Power ≥ parametric test, better for small samples... requires a computer
Possibility of confidence intervals for the test statistic and parameters (depends on sample size and strength of relationship)
Permutation testing: example
20 individuals in 2 groups of 10; variable = height (H)
Question: is mean height different in each group?
H0: same mean heights, H1 = H2 (here H1 and H2 denote the mean heights of groups 1 and 2, not hypotheses)
Test statistic: difference H1 - H2
If H0 is true, H1 - H2 is about 0
If we mix all 20 individuals and randomly select 2 groups of 10 individuals, we do not expect a difference in heights (no more than “random noise”)
Any random combination in two groups is a realization of H0 (H1 - H2 ~ 0)
Most of these random differences are close to 0, a few are large (> 0 or < 0): this generates a distribution
These random realizations are generated via random permutations of the data: permutation test
You only need to assess where your original observation of H1 - H2 falls in the distribution: if it is among the 5 % most extreme values, the difference is significant (H1 ≠ H2)
Statistic t*:  t* < –t | t* = –t | –t < t* < t | t* = t | t* > t
Count:         8       | 0       | 974         | 1†     | 17
Permutation tests have been known for a long time, but the computers needed to run them became available only recently
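The permutation example above (2 groups of 10 individuals, test statistic H1 - H2) can be sketched in Python as follows. The heights are made up, and the function name `permutation_test`, the number of permutations and the seed are illustrative choices. The observed value is counted as one of the realizations of H0, which is one common convention.

```python
import random
from statistics import fmean

def permutation_test(group1, group2, n_perm=9999, seed=0):
    """Two-sided permutation test for a difference between two means."""
    rng = random.Random(seed)
    observed = fmean(group1) - fmean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 1  # the observed value counts as one realization of H0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # mix all individuals, then split into 2 groups
        diff = fmean(pooled[:n1]) - fmean(pooled[n1:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / (n_perm + 1)

# Hypothetical heights (cm), 2 groups of 10 individuals
heights_1 = [172, 168, 181, 175, 169, 178, 174, 183, 170, 177]
heights_2 = [165, 171, 162, 168, 159, 166, 170, 163, 167, 161]
observed, p_value = permutation_test(heights_1, heights_2)
print(f"H1 - H2 = {observed:.2f} cm, p = {p_value:.4f}")
```

The p-value is the proportion of permuted differences at least as extreme as the observed one; H0 is rejected at the 5 % level when it is below 0.05.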
Normality tests
Before parametric testing (many methods, easy to run in common software packages); data may need to be transformed first
Examples
Kolmogorov-Smirnov test
Shapiro-Wilk test
Quick and dirty
Make a histogram and assess if it is bell-shaped
Find if a straight line roughly fits a normal quantile plot
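The "quick and dirty" normal quantile plot check can be turned into a number by correlating the sorted data with the quantiles expected under a normal distribution. The sketch below is a rough check only, not one of the formal tests named above; the function name, the plotting positions (i - 0.5)/n and the two samples are assumptions made for illustration.

```python
from statistics import NormalDist, fmean

def normal_quantile_correlation(sample):
    """Correlation between sorted data and theoretical normal quantiles.

    Values close to 1 mean the normal quantile plot is close to a
    straight line; this is a rough check, not a formal test.
    """
    n = len(sample)
    observed = sorted(sample)
    # Plotting positions (i - 0.5) / n are one common convention (assumption)
    expected = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, my = fmean(expected), fmean(observed)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, observed))
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in observed)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical samples: one roughly symmetric, one strongly right-skewed
symmetric = [4.8, 5.1, 5.0, 4.6, 5.3, 4.9, 5.2, 5.0, 4.7, 5.4]
skewed = [1, 1, 2, 2, 3, 3, 4, 6, 15, 40]
r_symmetric = normal_quantile_correlation(symmetric)
r_skewed = normal_quantile_correlation(skewed)
print(f"symmetric: r = {r_symmetric:.3f}, skewed: r = {r_skewed:.3f}")
```

A markedly lower correlation, as for the skewed sample, suggests trying a data transformation or a non parametric or permutational test.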
1 variable: Comparing groups
Samples can be
Independent (no corresponding pairs among samples)
Matched (e.g. before/after)
But data must be independent
H0: equal means
Two steps
Variance comparison
Mean comparison
Otherwise (unequal variances): Behrens-Fisher problem (test of two null hypotheses: mean and variance)
If the distribution is non-normal, some tests of variance homogeneity exist, e.g.
Fligner-Killeen (2 groups)
Permutational Bartlett test (> 2 groups)
Or... visually assess the difference between variances (ideally largest variance < 9 × smallest variance)
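The rule of thumb on variances can be checked in a couple of lines. This is a minimal sketch with hypothetical data and a hypothetical helper `variance_ratio`; it is a quick screen, not a substitute for the tests listed above.

```python
from statistics import variance

def variance_ratio(*groups):
    """Ratio of the largest to the smallest sample variance."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

# Hypothetical measurements in three groups
group_a = [10.2, 11.5, 9.8, 10.9, 10.4]
group_b = [12.1, 14.8, 11.0, 13.5, 12.6]
group_c = [9.0, 9.4, 8.7, 9.9, 9.2]

ratio = variance_ratio(group_a, group_b, group_c)
# Rule of thumb from the text: largest variance < 9 x smallest variance
print(f"ratio = {ratio:.2f}, roughly homogeneous: {ratio < 9}")
```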
Compare matched samples
Only for 2 groups
Matched pairs (before/after, husband/wife, top/bottom, ...)
Supplementary information: use the right test!
No variance testing
Parametric test = t test (for matched pairs)
Non parametric test = Wilcoxon signed-rank test
Small samples
Permutation test = permutational t test
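For matched pairs, both the parametric statistic and a permutational version work on the pairwise differences. The sketch below uses hypothetical before/after values and hypothetical function names. The permutation scheme shown (flipping the sign of each difference and enumerating all combinations) is one standard way to permute paired data, named here as an assumption since the slides do not detail it; it is feasible only for small samples.

```python
import itertools
import math
from statistics import fmean, stdev

def paired_t(before, after):
    """t statistic for matched pairs, computed on the pairwise differences."""
    d = [a - b for b, a in zip(before, after)]
    return fmean(d) / (stdev(d) / math.sqrt(len(d)))

def paired_permutation_p(before, after):
    """Exact two-sided p-value obtained by flipping the sign of each difference.

    Under H0 the sign of each difference is arbitrary, so all 2**n sign
    combinations are equally likely (feasible only for small n).
    """
    d = [a - b for b, a in zip(before, after)]
    observed = abs(sum(d))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(d)):
        total += 1
        # Small tolerance guards against floating-point rounding
        if abs(sum(s * x for s, x in zip(signs, d))) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical before/after measurements on 8 individuals
before = [12.0, 14.5, 11.2, 13.8, 12.9, 15.1, 13.3, 12.4]
after = [13.1, 15.0, 12.5, 14.9, 12.7, 16.4, 14.0, 13.6]
t_stat = paired_t(before, after)
p_perm = paired_permutation_p(before, after)
print(f"t = {t_stat:.2f}, permutation p = {p_perm:.4f}")
```

Because the pairing is used, no variance comparison between the two samples is needed.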
Tests
Analysis of variance (ANOVA - works even for 2 groups)
Groups separated according to factor/treatment
Parametric or permutational
Assess if means are equal or not
Means should stand at the same place if there is no treatment effect: their variance is then random error (known). If not, the treatment has an effect
Mean comparison without computing means (!), only through variances
Avoids multiple testing
A posteriori/post-hoc tests to identify different means (LSD, HSD, Scheffé, Tukey, ...)
Some non parametric equivalents (one for each type)
Unequal variances (if non normality, use permutational version)
Small samples (while permutations may be used too)
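The principle of ANOVA, comparing means through variances, can be sketched for the one-way case as below. The data and the function name `one_way_anova_f` are hypothetical; only the F statistic is computed, and its significance would still have to be assessed against the F distribution or by permutation.

```python
from statistics import fmean

def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA, from the variance decomposition."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    k, n = len(groups), len(all_values)

    # Among-group sum of squares: dispersion of group means around the grand mean
    ss_among = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: random (residual) variation
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    ms_among = ss_among / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_among / ms_within

# Hypothetical response measured under three treatments
control = [4.2, 4.8, 4.5, 4.1, 4.6]
treat_a = [5.1, 5.6, 5.3, 5.8, 5.0]
treat_b = [4.4, 4.9, 4.3, 4.7, 4.5]
f_stat = one_way_anova_f(control, treat_a, treat_b)
print(f"F = {f_stat:.2f} with {3 - 1} and {15 - 3} degrees of freedom")
```

When the group means coincide, the among-group mean square is no larger than expected from random error and F stays near or below 1; a large F points to a treatment effect.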
ANOVA
[Figure: schematic data layouts. One-way ANOVA: groups 1 to 5 with unequal numbers of observations. Two-way ANOVA: levels 1 to 4 of one factor crossed with levels A to C of another. Nested ANOVA.]
≥ 2 variables: studying links
Links: correlation (-1, 1)
Parametric or not, depends on test and variable type (quantitative or ranked)
Chi-square for qualitative variables
Modelling: regression (effect on response variable)
Simple or multiple: number of variables
Linear or not: shape of relationship
Same principles for testing
Connection between ANOVA and regression (variance decomposition)
Correlation ≠ causality (and vice-versa)
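As an illustration of correlation, simple linear regression and their link through variance decomposition, the sketch below computes the slope, intercept, Pearson correlation r and R² by hand. The data (temperature as X, length as Y) and the function name are hypothetical.

```python
from statistics import fmean

def simple_linear_regression(x, y):
    """Least-squares slope and intercept, Pearson r, and R-squared."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)  # total sum of squares
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5

    # Variance decomposition: total SS = regression SS + residual SS
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_residual / syy
    return slope, intercept, r, r_squared

# Hypothetical data: temperature (X, explanatory) and length (Y, response)
temperature = [10, 12, 14, 16, 18, 20, 22]
length = [3.1, 3.6, 3.8, 4.5, 4.6, 5.2, 5.5]
slope, intercept, r, r_squared = simple_linear_regression(temperature, length)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f} R2={r_squared:.3f}")
```

In simple linear regression R² equals r², the share of the total variance of Y explained by X; this is the same partition of variance that ANOVA relies on.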