Selected excerpts
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data tidying: structuring datasets to facilitate analysis. This paper [...] provides a comprehensive "philosophy of data". Since most real world datasets are not tidy... Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).
http://hadley.nz/
https://www.youtube.com/results?search_query=hadley+wicham
http://ggplot2.org/ http://ggplot2.org/resources/2007-past-present-future.pdf http://ggplot2.org/resources/2007-vanderbilt.pdf http://docs.ggplot2.org/current/
Data cleanup
tidyr
Data handling
dplyr
Data Visualization
ggplot2
Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
"Happy families are all alike; every unhappy family is unhappy in its own way."
The Anna Karenina principle: in other words, success requires several conditions to be met together. Missing a single one is enough to cause failure. https://deselection.wordpress.com/2010/11/12/leprincipedannakarenine/
Aristotle's version https://en.wikipedia.org/wiki/Anna_Karenina_principle
Much earlier, Aristotle states the same principle in the Nicomachean Ethics (Book 2): Again, it is possible to fail in many ways (for evil belongs to the class of the unlimited, as the Pythagoreans conjectured, and good to that of the limited), while to succeed is possible only in one way (for which reason also one is easy and the other difficult – to miss the mark easy, to hit it difficult); for these reasons also, then, excess and defect are characteristic of vice, and the mean of virtue; For men are good in but one way, but bad in many.
Logic: universal and existential quantifiers https://fr.wikipedia.org/wiki/Quantificateur_(logique)
● ∀x P(x) reads "for all x, P(x)" and means "every object in the domain under consideration has property P".
● ∃x P(x) means "there exists at least one x such that P(x)" (at least one object in the domain under consideration has property P).
Negation of quantifiers
The negation of ∃x P(x) is ¬∃x P(x), i.e. ∀x ¬P(x).
The negation of ∀x P(x) is ¬∀x P(x), i.e. ∃x ¬P(x).
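The negation rule for ∀ can be checked on a small example. This is only a sketch: the vector `values` and the property P ("x equals the first value") are made up for illustration.

```r
# Hypothetical values; P(x) is "x equals the first value"
values <- c(5.1, 5.1, 6.3)
P <- values == values[1]

# not (for all x, P(x))  is equivalent to  (there exists x such that not P(x))
not_all <- !all(P)
exists_not <- any(!P)
stopifnot(identical(not_all, exists_not))

# "not all equal" does not imply "all different": two values still coincide
all_different <- !any(duplicated(values))
print(c(not_all_equal = not_all, all_different = all_different))
```

Here `all()` plays the role of ∀ and `any()` the role of ∃ over a finite domain.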
Classical logic https://fr.wikipedia.org/wiki/Logique_classique
● The law of excluded middle states that, for any mathematical proposition, either it or its negation is true: A ∨ ¬A
● Proof by contradiction: ¬¬A ⇒ A
● Contraposition: (¬B ⇒ ¬A) ⇒ (A ⇒ B)
● Material implication: (A ⇒ B) ⇔ (¬A ∨ B)
ANOVA https://fr.wikipedia.org/wiki/Analyse_de_la_variance
H0: all means are equal
H1: not H0
If H0 is rejected, we know that at least one mean differs from the others, but which one? → post-hoc test
They are all equal!
They are not all equal! → They are all different
They are not all equal! → Only one differs from the others
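A minimal sketch of the ANOVA followed by a post-hoc test, using only base R. The choice of the built-in `PlantGrowth` dataset and of Tukey's HSD as the post-hoc procedure is an assumption made for illustration; the slides do not name a specific test.

```r
# PlantGrowth ships with R: plant weight for 3 groups (ctrl, trt1, trt2)
fit <- aov(weight ~ group, data = PlantGrowth)
print(summary(fit))   # global F test of H0: all group means are equal

# Post-hoc pairwise comparisons (Tukey's honest significant differences)
# show which pairs of groups differ, which the global test cannot tell
posthoc <- TukeyHSD(fit)
print(posthoc)
```

Rejecting H0 only establishes ¬∀ (the means are not all equal); the pairwise table is what separates "all different" from "only one differs".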
messy
tidy
Tidy data
The term tidy refers to an optimal (?) way of laying out data for statistical analysis. A messy version may be preferable when the goal is human readability of the data.
[Figure: the same data in messy and tidy layouts]
In a publication, the more compact (messy) version ↑ may be preferable. However, it gives the impression of a contingency table, which invites a chi-squared test...
Tidy data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Messy data is any other arrangement of the data.
Messy data Real datasets can, and often do, violate the three precepts of tidy data in almost every way imaginable. While occasionally you do get a dataset that you can start analyzing immediately, this is the exception, not the rule. This section describes the five most common problems with messy datasets, along with their remedies:
● Column headers are values, not variable names.
● Multiple variables are stored in one column.
● Variables are stored in both rows and columns.
● Multiple types of observational units are stored in the same table.
● A single observational unit is stored in multiple tables.
Surprisingly, most messy datasets, including types of messiness not explicitly described above, can be tidied with a small set of tools: melting, string splitting, and casting.
Column headers are values, not variable names ...
3 variables: religion, income, count
Each column represents a variable; each row, an observation
Tidying when column headers are values: melting
Columns correspond to RNA-seq data for different conditions (B, C, D), each with 3 biological replicates ⇒ Wide dataset = natural initial format; convenient for summarising the data but less convenient for modelling or plotting
The melt() function turns columns into rows ⇒ The molten dataset is a convenient format for fitting models, across time points for example
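A small sketch of melting followed by string splitting, assuming the reshape2 package is installed. The gene names, counts, and the `condition_replicate` column naming scheme are invented for illustration (two conditions, two replicates, to keep it short).

```r
library(reshape2)

# Hypothetical wide table: one column per condition/replicate
wide <- data.frame(
  gene = c("g1", "g2"),
  B_1 = c(10, 5), B_2 = c(12, 6),
  C_1 = c(20, 7), C_2 = c(22, 8)
)

# Melting: every non-id column becomes a (sample, count) pair of rows
long <- melt(wide, id.vars = "gene",
             variable.name = "sample", value.name = "count")

# String splitting: the column header carried two variables
long$condition <- sub("_.*$", "", long$sample)
long$replicate <- as.integer(sub("^.*_", "", long$sample))
long$sample <- NULL
print(long)
```

tidyr's `pivot_longer()` covers the same operation.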
Variables are stored in both rows and columns
This column contains variable names!
One variable per column, one observation per row
Tidying when multiple variables are stored in one column: casting Casting changes rows into columns (inverse of melting)
Values of the 2 variables tmax and tmin are recorded in the same column but on 2 rows
After casting the 2 variables are recorded in 2 columns
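A sketch of casting with `dcast()` from reshape2 (assumed to be installed). The station name, dates, and temperatures are hypothetical.

```r
library(reshape2)

# Hypothetical molten weather data: tmax and tmin share the "value" column
long <- data.frame(
  station = "S1",
  date    = c("2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"),
  element = c("tmax", "tmin", "tmax", "tmin"),
  value   = c(27.8, 14.5, 27.3, 14.4)
)

# Casting: each distinct value of `element` becomes its own column
tidy <- dcast(long, station + date ~ element, value.var = "value")
print(tidy)
```

The left-hand side of the formula lists the columns that identify an observation; the right-hand side names the column whose values become the new headers.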
Tidying when …
● Variables are stored in both rows and columns: combination of melting and casting
● Multiple types in one table (e.g. values collected at multiple levels but needed in the same table): merging
● One type in multiple tables: the plyr package helps to read a list of files (ldply)
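A sketch of the "one type in multiple tables" case with `ldply()` from plyr (assumed to be installed). The two CSV files are created on the fly in a temporary directory; their names and contents are hypothetical.

```r
library(plyr)

# Write two small hypothetical CSV files to a temporary directory
dir <- tempfile("tidy_demo_")
dir.create(dir)
write.csv(data.frame(x = 1:2, y = c("a", "b")),
          file.path(dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4, y = c("c", "d")),
          file.path(dir, "part2.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
names(files) <- basename(files)   # keep track of each row's source file

# ldply applies read.csv to each file and row-binds the resulting data frames
combined <- ldply(files, read.csv)
print(combined)
```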
Tidy tools
1) Manipulation 2) Visualisation 3) Modelling
Manipulation
● Filter: subsetting or removing observations based on some condition.
● Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).
● Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).
● Sort: changing the order of observations.
All these operations are made easier when there is a consistent way to refer to variables. Tidy data provides this because each variable resides in its own column.
Ensure input- and output-tidiness: plyr, dplyr packages
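The four verbs map directly onto dplyr functions. A minimal sketch on the built-in `mtcars` data, assuming dplyr (version 1.0 or later, for the `.groups` argument) is installed; the derived column `power_to_weight` is made up for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%               # Filter: keep 4- and 6-cylinder cars
  mutate(power_to_weight = hp / wt) %>%      # Transform: new variable from two others
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg),            # Aggregate: one row per group
            mean_ptw = mean(power_to_weight),
            n = n(),
            .groups = "drop") %>%
  arrange(desc(mean_mpg))                    # Sort: best mileage first
print(result)
```

Every step takes a tidy data frame and returns a tidy data frame, which is what lets the verbs chain.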
Visualisation
Tidy visualization tools only need to be input-tidy as their output is visual. It provides a comprehensive "philosophy of data": one that underlies my work in the plyr (Wickham 2011) and ggplot2 (Wickham 2009) packages.
ggplot2 logic: a syntax designed for tidy input.
ggplot2 package
In Hadley Wickham's words:
Source: http://ggplot2.org/resources/2007-past-present-future.pdf
str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
plot(mtcars$mpg, mtcars$hp, col=mtcars$cyl-3, pch=mtcars$gear+15, cex=mtcars$carb/3)
ggplot(mtcars, aes(x=mpg,y=hp)) + geom_point() + geom_point(aes(colour = cyl, size=carb, shape=factor(gear)))
toto=mtcars
colnames(toto)=NULL
plot(toto[,1], toto[,4], col=toto[,2]-3, pch=toto[,10]+15, cex=toto[,11]/3)
ggplot(toto, aes(x=toto[,1],y=toto[,4])) + geom_point() + geom_point(aes(colour = toto[,2], size=toto[,11], shape=factor(toto[,10])))
Error in geom_point() geom_point(aes(colour = toto[, 2], size = toto[, : non-numeric argument to binary operator
scaled