An introduction to data visualisation
Christophe
19 February 2013
1 Theory of data graphics - Edward R. Tufte
1.1 Graphical excellence
Graphical excellence is nearly always multivariate and requires telling the truth about the data. Graphical displays should:
1. Show the data
2. Induce the viewer to think
3. Avoid distorting what the data have to say
4. Present many numbers in a small space
5. Make large data sets coherent
6. Encourage the eye to compare different pieces of data
7. Reveal the data at several levels of detail, from a broad overview to the fine structure
8. Serve a reasonably clear purpose
9. Be closely integrated with the statistical description of the data set
1.2 Telling a story in pictures
Figure 1: Figurative map of the successive losses in men of the French army in the Russian campaign of 1812-1813, Charles Minard (1869)
Figure 2: Train schedule between Paris and Lyon, E. J. Marey (1885). This method is attributed to the French engineer Ibry, but new evidence suggests that Lt. Sergeev had developed it approximately 30 years earlier in Russia. Source: E. R. Tufte
Figure 3: Evolution of the consumption of energy resources; Source
Figure 4: Airline routes in the USA. Source: Aaron Koblin
Figure 5: Snow levels and precipitation. Source: MétéoFrance
Figure 6: Statistical Breviary by William Playfair (1801). Source: E. R. Tufte
1.3 Rules
1. Above all else show the data
2. Maximize the data-ink ratio
3. Erase non-data-ink
4. Erase redundant data-ink
5. Revise and edit
2 Graphical Integrity
Graphical excellence begins with telling the truth about the data, so a lie factor can be constructed to measure the misrepresentation:
\[ \text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}} \qquad (1) \]
2.1 Examples
Figure 7: Government spending "Skyrocketing". Source: E. R. Tufte, from Playfair (1786)
Figure 8: Government spending "Skyrocketing". Source: E. R. Tufte, from Playfair (1786). In a note, Playfair says that spending is now expressed in real rather than nominal millions!
Another example of a big lie. The real change in car fuel consumption is from 18 mpg in 1978 to 27.5 mpg in 1985, an increase of 53% in 7 years. On the graph, the horizontal line is 1.5 cm long in 1978 and 13 cm long in 1985, so the visual change is (13 - 1.5)/1.5, around 767%, which gives a lie factor of 767/53, about 14.5!
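The arithmetic behind a lie factor can be checked in a few lines. The following is a minimal sketch using the fuel-economy measurements quoted above (line lengths in cm, standards in mpg); the function names are illustrative and do not come from Tufte.

```python
def relative_change(first, last):
    """Relative change from first to last, as a fraction of first."""
    return (last - first) / first


def lie_factor(shown_first, shown_last, data_first, data_last):
    """Size of the effect shown in the graphic over the size of the effect in the data."""
    shown = relative_change(shown_first, shown_last)
    actual = relative_change(data_first, data_last)
    return shown / actual


# Fuel economy example: 1.5 cm -> 13 cm on the graph, 18 mpg -> 27.5 mpg in the data.
shown = relative_change(1.5, 13.0)
actual = relative_change(18.0, 27.5)
factor = lie_factor(1.5, 13.0, 18.0, 27.5)
print(f"shown {shown:.0%}, actual {actual:.0%}, lie factor {factor:.1f}")
```

A lie factor close to 1 means the graphic represents the effect faithfully; values far above 1 mean that the graphic overstates it.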
Figure 9: Fuel economy standards. Source : E. R. Tufte (from NY Times 1978)
Figure 10: Fuel economy standards, another view. Source : E. R. Tufte from NY Times 1978
2.2 Examples with MS Excel
Figure 11: Public administration debt, State vs. total. Source: INSEE. State spending (green) appears to grow more strongly.
Figure 12: Public administration debt, State vs. total. Source: INSEE
Figure 13: Public administration debt, State vs. total. Source: INSEE
2.3 Data-ink ratio
Ink should present data-information. Data-ink is the non-erasable core of a graphic. Tufte defines the data-ink ratio as:
\[ \text{Data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}} \qquad (2) \]
In the following, we will analyse how much of the information could be erased...
Figure 14: Public administration debt, State vs. total. Source: INSEE
3 Boxplots and Co
The boxplot is surely the simplest and most widely used plot for comparing distributions, for example between groups of individuals. Nothing forbids the use of colours and of horizontal and vertical axes...
[Figure: boxplots of Response (98 to 110) for groups g1 to g5; axes: Group, Response]
3.1 Let's erase stuff...
[Figure: the same boxplots of Response by Group (g1 to g5), redrawn in simplified versions]
3.2 Let's change the shape...
Box-percentile plots are similar to boxplots, except that they supply more information about the univariate distributions. At any height, the width of the irregular "box" is proportional to the percentile of that height up to the 50th percentile; above the 50th percentile, the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percentage of observations that are more extreme in that direction. As in boxplots, the median and the 25th and 75th percentiles are marked with line segments across the box. See http://had.co.nz/stat645/project-03/boxplots.pdf.
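The width rule described above can be sketched as follows. This is only an illustration: it assumes the percentile of a height is the share of observations at or below it, whereas plotting packages may interpolate differently, and the function name is hypothetical.

```python
from bisect import bisect_right


def box_percentile_width(sample, height):
    """Relative width of a box-percentile plot at a given height.

    Returns the empirical percentile of `height` when it is at most 50,
    and 100 minus that percentile above the 50th percentile.
    Assumption: percentile = share of observations <= height.
    """
    ordered = sorted(sample)
    percentile = 100.0 * bisect_right(ordered, height) / len(ordered)
    return percentile if percentile <= 50.0 else 100.0 - percentile


data = list(range(1, 101))  # toy sample: 1, 2, ..., 100
for h in (10, 25, 50, 75, 90):
    print(h, box_percentile_width(data, h))
```

The width grows up to the median and shrinks symmetrically in percentile terms above it, which is what gives the plot its irregular outline.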
[Figure: two panels, "Box-Percentile Plot" and "Boxplot", each comparing a Normal and a Uniform sample]
The difference can be seen on 2 groups drawn at random from the same distribution:
[Figure: two panels, "Box-Percentile Plot" and "Boxplot", each comparing Group 1 and Group 2]
The boxplot has friends... The first figure shows the underlying density of the randomly generated data: a normal mixture of two components. Then, from left to right, variations on the idea of a boxplot are plotted.
1. Underlying bimodal density
2. The boxplot itself, which concentrates on the central bulk of the data
3. The HDR boxplot, which looks at the zone of highest density
4. The violin plot, which uses a kernel estimator of the density
5. The box-percentile plot, the same as a boxplot but showing more information about the density
On a "classic", unimodal distribution, the four boxplots cannot be told apart:
[Figure: panels "Underlying density", "standard boxplot", "violin plot", "HDR boxplot" and "Box-Percentile Plot" for the unimodal sample]
But if we change the distribution to make it bimodal, only the violin plot and the HDR boxplot capture the bimodality in that dataset. Given that the dataset is truly bimodal, they are, in that case, better than the standard boxplot and the box-percentile plot.
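To see why a density-based display can reveal two modes, here is a minimal sketch of a Gaussian kernel density estimate with a simple search for local maxima. The sample, the bandwidth of 0.5 and the grid are made-up assumptions for illustration; they are not the data behind the figures.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density with Gaussian kernels."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density


def find_modes(density, grid):
    """Grid points where the estimated density has a strict local maximum."""
    values = [density(x) for x in grid]
    return [grid[i] for i in range(1, len(grid) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


# Toy bimodal sample (made up): two clusters, around -1 and around 3.
sample = [-1.5, -1.2, -1.0, -0.8, -0.5, 2.5, 2.8, 3.0, 3.2, 3.5]
grid = [-4.0 + 0.1 * i for i in range(101)]  # from -4 to 6 in steps of 0.1
modes = find_modes(gaussian_kde(sample, bandwidth=0.5), grid)
print([round(m, 1) for m in modes])
```

A standard boxplot of the same sample would only report its quartiles and extremes, which carry no trace of the gap between the two clusters. The number of modes found does depend on the bandwidth, so this choice deserves care in practice.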
[Figure: panels "Underlying density", "standard boxplot", "violin plot", "HDR boxplot" and "Box-Percentile Plot" for the bimodal sample]
Source: http://gallery.r-enthusiasts.com/graph/The_boxplot_friends_102.
McGill, Tukey and Larsen (1978) introduced the variable-width boxplot, where the width of each box is used to represent the number of observations in the group, and this is believed to prevent misinterpretation of certain characteristics of the data, in particular the median. In the same paper they introduced the notched boxplot, which adds yet another element to the original boxplot by displaying confidence intervals around the medians. Doing so allows one to judge visually whether or not the medians differ significantly between groups.
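The notch comparison can be sketched numerically. The code below assumes the commonly quoted form of the notch, median +/- 1.57 x IQR / sqrt(n); the exact constant and the quartile convention vary between software packages, and the function names and toy groups are illustrative only.

```python
import math
import statistics


def notch_interval(sample, constant=1.57):
    """Approximate interval around the median: median +/- constant * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    half_width = constant * (q3 - q1) / math.sqrt(len(sample))
    return median - half_width, median + half_width


def notches_overlap(a, b):
    """True when the notch intervals of the two groups overlap."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Toy groups (made up for illustration).
group_a = list(range(1, 21))    # 1..20
group_b = list(range(21, 41))   # 21..40, far from group_a
group_c = list(range(5, 25))    # 5..24, close to group_a

print(notches_overlap(group_a, group_b))
print(notches_overlap(group_a, group_c))
```

Non-overlapping notches are usually read as evidence that the medians differ; overlapping notches do not show that they are equal.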
[Figure: three panels for Group 1, Group 2 and Group 3: "Boxplot", "with variable width", "with var. width and Notches"]
3.3 Context is important!
[Figure: Passengers on Air China. Total Passengers by Year, 2001 to 2004]
[Figure: Passengers on Air China. Total Passengers by Year, 1990 to 2010]
[Figure: Passengers. Total Passengers by Year, 1990 to 2010, for Air China and British Airways]
4 Visualising relationships
4.1 On the value of visualising - F. J. Anscombe
Consider the four data sets proposed by F. J. Anscombe: (X1, Y1), (X2, Y2), (X3, Y3) and (X4, Y4).

Variable   n    Min    q1      Median  Mean   q3      Max
X1         11   4      6.50    9       9      11.50   14
X2         11   4      6.50    9       9      11.50   14
X3         11   4      6.50    9       9      11.50   14
X4         11   8      8.00    8       9      8.00    19
Table 1: Summary of the four data sets: Xs

Variable   n    Min    q1      Median  Mean   q3      Max
Y1         11   4.26   6.31    7.58    7.50   8.57    10.84
Y2         11   3.10   6.70    8.14    7.50   8.95    9.26
Y3         11   5.39   6.25    7.11    7.50   7.98    12.74
Y4         11   5.25   6.17    7.04    7.50   8.19    12.50
Table 2: Summary of the four data sets: Ys

Note that the correlations are cor(X1, Y1) = 0.8164, cor(X2, Y2) = 0.8162, cor(X3, Y3) = 0.8163 and finally cor(X4, Y4) = 0.8165. Now, let us really look at these data:
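The summaries and correlations above can be reproduced with a short script. This sketch uses the published values of Anscombe's quartet typed in by hand and a hand-written Pearson correlation; the names `ANSCOMBE` and `pearson` are illustrative.

```python
import math
from statistics import mean, median

# Anscombe's quartet (standard published values); X1, X2 and X3 are identical.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


for k, (xs, ys) in ANSCOMBE.items():
    print(f"set {k}: mean x = {mean(xs):.2f}, mean y = {mean(ys):.2f}, "
          f"median y = {median(ys):.2f}, cor = {pearson(xs, ys):.4f}")
```

The four sets are nearly indistinguishable on these numbers, which is exactly why plotting them is instructive.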
[Figure: four scatterplot panels, X1-Y1, X2-Y2, X3-Y3 and X4-Y4]
4.2 Scatterplot with Tufte axes
[Figure: Old Faithful Eruptions (271 samples). Scatterplot of Duration (sec) against Time till next eruption (min), with axes drawn only over the data range (43 to 96 min, 96 to 306 sec); legend: Previous duration]
The 3D version of the density (estimated with the R package np):
[Figure: 3D surface of the Joint Density of eruptions and waiting]
5 Visualising "other things"
5.1 Visualising networks
Figure 15: Relationships between Mark Twain's characters. Source: Pajek http://pajek.imfm.si/doku.php?id=links
Figure 16: Relationships between the different brands and groups in the agri-food industries (IAA) http://www.convergencealimentaire.info/?attachment_id=238
Figure 17: Relationships between different co-authors http://www.bordalierinstitute.com/target11.html
5.2 Visualising in space and time
Figure 18: Example of information at two levels, the route and the altitude of the migration path of geese. Source: Hawkes et al. (2012). http://sciencythoughts.blogspot.fr/2012/11/how-bar-headed-geese-cross-himalayas.html
Figure 19: The IGN offers many tools for visualising in space, for example to visualise the changes between 2 dates. http://logiciels.ign.fr/?Presentation,47/
5.3 Visualising texts
Words of a paper. Source: http://www.wordle.net
6 Tools
6.1 Gapminder
Figure 20: Snapshot of Gapminder World http://www.gapminder.org/world/
6.2 R
See Thibault's presentation (December 2012), Sébastien's presentation on ggplot, and the R enthusiasts site http://gallery.r-enthusiasts.com/thumbs.php?sort=time.
6.3 Mathematica
Figure 21: Dynamic tool for representing the demand for 2 goods. Source: Mathematica http://demonstrations.wolfram.com/ConsumerDemand/
6.4 Cortext
An important step when producing a network map is to carefully define the filtering steps. The first parameter one is asked to choose is the total number of top nodes in each field that should be mapped. The nodes are selected according to their frequency in each time period. Thus, if one is mapping a co-authorship network, choosing 50 top items will produce the collaboration network between the 50 most productive authors (in terms of number of articles) in each time period. For a map of research labs vs keywords, the 50 most productive research labs will be mapped along with the 50 most frequent keywords.
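The filtering step described above amounts to a per-period "top N by frequency" selection. The sketch below illustrates the idea on made-up records; it is not Cortext's implementation, and the record format, function name and author labels are all hypothetical.

```python
from collections import Counter


def top_nodes(records, n_top):
    """For each time period, keep the n_top most frequent nodes.

    `records` is an iterable of (period, node) pairs, one pair per
    occurrence, e.g. one (period, author) pair for each authored article.
    """
    counts = {}
    for period, node in records:
        counts.setdefault(period, Counter())[node] += 1
    return {period: [node for node, _ in counter.most_common(n_top)]
            for period, counter in counts.items()}


# Made-up records: authors A, B, C over two periods.
records = (
    [("1996-2000", "A")] * 3 + [("1996-2000", "B")] * 2 + [("1996-2000", "C")]
    + [("2001-2005", "C")] * 3 + [("2001-2005", "A")] * 2 + [("2001-2005", "B")]
)
selected = top_nodes(records, n_top=2)
print(selected)
```

Because the selection is made separately in each period, the set of mapped nodes can change from one period to the next, as author C entering the second period shows.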
Figure 22: Snapshot of a Cortext project http://manager.cortext.net/projects/webmaster_cortext_fr/agroecologie-extended/data/aeext-1996-1-aeext-db~3438/1/tubes/index.htm
6.5 Tableau
Commercial software offering some unique features: on-the-fly visualisation and dynamic presentation. One drawback: it is expensive and not commercially friendly!
Figure 23: Hurricane representation using Tableau http://www.tableausoftware.com/learn/gallery/storm-tracking
6.6 Circos
Free software for the circular representation of data (genomics and others). These days many people are dumping their SUVs in favour of smaller cars. How do customers "flow" between brands and car segments? The figures below illustrate such data sets.
Figure 24: Circular representation of data http://circos.ca/
6.7 Many-Eyes
An experiment by IBM Research and the IBM Cognos software group: http://www-958.ibm.com/software/analytics/manyeyes/
Figure 25: A dataset can be represented using various visualisation tools http://www-958.ibm.com/software/analytics/manyeyes/datasets
7 Visualising mathematics
Federico Amodeo, and later René Taton, once drew attention to the presence of descriptive-geometry drawings in the Underweysung der Messung, even though this discipline was only developed by Gaspard Monge nearly three centuries later. In Figure 38 of Book I, a parabola appears as the envelope of its tangents. Dürer generates the figure point by point by placing one end of a ruler of fixed length ab successively on the points of the horizontal axis (part of which is divided by 16 points into 16 equal intervals) and making it pass through the points of the same name on the vertical axis rising from point 13. The other end marks the successive points of the curve.
Figure 26: Construction of a parabola illustrated by Gaspard Monge. Source: Images des maths 2006 http://images.math.cnrs.fr/Roles-des-figures-dans-la.html
8 Reference and example sites
R enthusiasts: http://gallery.r-enthusiasts.com/thumbs.php?sort=time
DataVisualization.ch: http://selection.datavisualization.ch/
Nathan Yau's FlowingData: http://flowingdata.com/category/tutorials/
Places & Spaces: http://scimaps.org/maps/browse/
La fonderie: http://outils.expoviz.fr/
We love dataviz: http://datavis.tumblr.com/
DataVisualization: http://www.datavisualization.fr/
Logiciels de l'IGN: http://logiciels.ign.fr/?Presentation,47
Theresa Vanderbilt R Clinic: http://biostat.mc.vanderbilt.edu/wiki/Main/RClinic
Circos: http://circos.ca/