An introduction to data visualisation - Laurent Thibault

An introduction to data visualisation, Christophe, 19 February 2013

1 Theory of data graphics - Edward R. Tufte

1.1 Graphical excellence

Graphical excellence is nearly always multivariate and requires telling the truth about the data. Graphical displays should:
1. Show the data
2. Induce the viewer to think
3. Avoid distorting what the data have to say
4. Present many numbers in a small space
5. Make large data sets coherent
6. Encourage the eye to compare different pieces of data
7. Reveal the data at several levels of detail, from broad overview to fine structure
8. Serve a reasonably clear purpose
9. Be closely integrated with the statistical description of the dataset

1.2 Telling a story in pictures

Figure 1: Figurative map of the successive losses in men of the French army in the Russian campaign of 1812-1813, Charles Minard (1869)

Figure 2: Train timetable between Paris and Lyon, E.J. Marey (1885). This method is attributed to the French engineer Ibry, but new evidence suggests that Lt. Sergeev had developed it approximately 30 years earlier in Russia. Source: E. R. Tufte

Figure 3: Evolution of the consumption of energy resources; Source

Figure 4: Airline routes in the USA. Source: Aaron Koblin

Figure 5: Snow levels and precipitation. Source: MétéoFrance

Figure 6: Statistical Breviary by William Playfair (1801) Source : E. R. Tufte

1.3 Rules

1. Above all else show the data
2. Maximize the data-ink ratio
3. Erase non-data-ink
4. Erase redundant data-ink
5. Revise and edit

2 Graphical Integrity

Graphical excellence begins with telling the truth about the data, so a lie factor can be constructed to measure the misrepresentation:

\[
\text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}} \tag{1}
\]

2.1 Examples

Figure 7: Government spending "skyrocketing". Source: E. R. Tufte, from Playfair (1786)

Figure 8: Government spending "skyrocketing". Source: E. R. Tufte, from Playfair (1786). In a note, Playfair says that the spending is now in real and not nominal millions!

Another example of a big lie. The real change in the fuel economy of cars is from 18 mpg in 1978 to 27.5 mpg in 1985, an increase of 53% in 7 years. On the graph, the horizontal line is 1.5 cm long in 1978 and 13 cm long in 1985, so the visual change is around 767%, which gives a lie factor of about 14.5!
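The arithmetic behind the fuel economy example can be checked with a few lines of Python. This is only a sketch of equation (1): the function names `relative_change` and `lie_factor` are illustrative choices, and the "size of effect" is assumed to mean the relative change between the first and the last value.

```python
def relative_change(first: float, last: float) -> float:
    """Relative change from first to last, as a fraction (0.5 means +50%)."""
    return (last - first) / first


def lie_factor(shown_first: float, shown_last: float,
               data_first: float, data_last: float) -> float:
    """Equation (1): effect shown in the graphic divided by effect in the data."""
    return (relative_change(shown_first, shown_last)
            / relative_change(data_first, data_last))


# Fuel economy standards: 18 mpg (1978) to 27.5 mpg (1985),
# drawn as lines of 1.5 cm and 13 cm.
data_change = relative_change(18.0, 27.5)
visual_change = relative_change(1.5, 13.0)
factor = lie_factor(1.5, 13.0, 18.0, 27.5)

print(f"change in the data:    {data_change:.0%}")
print(f"change in the graphic: {visual_change:.0%}")
print(f"lie factor:            {factor:.1f}")
```

A lie factor close to 1 means the graphic is faithful to the data; here the graphic exaggerates the change by more than an order of magnitude.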

Figure 9: Fuel economy standards. Source : E. R. Tufte (from NY Times 1978)

Figure 10: Fuel economy standards, another view. Source : E. R. Tufte from NY Times 1978

2.2 Examples with MS Excel

Figure 11: Debt of the public administrations. State vs. all administrations. Source: INSEE. State spending (green) seems to grow more strongly.

Figure 12: Debt of the public administrations. State vs. all administrations. Source: INSEE

Figure 13: Debt of the public administrations. State vs. all administrations. Source: INSEE

2.3 Data-ink ratio

Ink should present data-information. Data-ink is the non-erasable core of a graphic. E. Tufte defines the data-ink ratio as:

\[
\text{Data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}} \tag{2}
\]

In the following, we will analyse how much of a graphic could be erased...

Figure 14: Debt of the public administrations. State vs. all administrations. Source: INSEE

3 Boxplots and Co

The boxplot is surely the simplest and most widely used plot for comparing distributions between groups of individuals, for example. Nothing forbids the use of colours and of horizontal and vertical axes...

[Figure: boxplots of Response (98 to 110) by Groupe (g1 to g5)]

3.1 Let's erase stuff...

[Figures: boxplots of Response by Groupe (g1 to g5)]

3.2 Let's change the shape...

Box-percentile plots are similar to boxplots, except that they supply more information about the univariate distributions. At any height, the width of the irregular "box" is proportional to the percentile of that height up to the 50th percentile; above the 50th percentile, the width is proportional to 100 minus the percentile. Thus, the width at any given height is proportional to the percentage of observations that are more extreme in that direction. As in boxplots, the median and the 25th and 75th percentiles are marked with line segments across the box. See http://had.co.nz/stat645/project-03/boxplots.pdf.
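The width rule described above can be sketched in Python. This is a minimal illustration, not the reference implementation of the plot: the function name `box_percentile_width` is hypothetical, the sample is made up, and the percentile of a height is assumed to be the share of observations less than or equal to it (other percentile conventions exist).

```python
from bisect import bisect_right


def box_percentile_width(ordered_sample, height):
    """Relative width of the box-percentile shape at a given height.

    The percentile of `height` is taken as the percentage of observations
    that are <= height (an assumed convention). Up to the 50th percentile
    the width is the percentile itself; above it, 100 minus the percentile.
    `ordered_sample` must already be sorted in increasing order.
    """
    percentile = 100.0 * bisect_right(ordered_sample, height) / len(ordered_sample)
    return percentile if percentile <= 50.0 else 100.0 - percentile


# Hypothetical sample, already sorted.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
heights = (1, 3, 5, 8, 10)
widths = [box_percentile_width(sample, h) for h in heights]

for h, w in zip(heights, widths):
    print(f"height {h:>2}: relative width {w:.0f}")
```

The shape is widest at the median and tapers towards both extremes, which is what makes the tails of the distribution visible.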

[Figure: two panels, "Box-Percentile Plot" and "Boxplot", each comparing a Normal and a Uniform sample]

The difference can be seen on 2 groups drawn at random from the same distribution:

[Figure: two panels, "Box-Percentile Plot" and "Boxplot", each comparing Group 1 and Group 2]

The boxplot has friends... The first figure shows the underlying density of the randomly generated data: a normal mixture of two components. Then, from left to right, variations on the idea of a boxplot are plotted.
1. Underlying bimodal density
2. The boxplot itself, which concentrates on the central bulk of the data
3. The HDR boxplot, which looks at the zone of highest density
4. The violin plot, which uses a kernel estimator of the density
5. The box-percentile plot, same as the boxplot, but showing more information about the density

On a "classic", unimodal distribution, the 4 boxplots cannot be told apart:

[Figure: five panels, "Underlying density" (against x), "standard boxplot", "HDR boxplot", "violin plot" and "Box-Percentile Plot", for a unimodal sample ranging from 0 to 5]

But if we change the distribution to make it bimodal, only the violin plot and the HDR boxplot capture the bimodality in the dataset. Given that the dataset is truly bimodal, they are, in that case, better than the standard boxplot and the box-percentile plot.
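To see why a density-based plot picks up what the boxplot misses, here is a minimal sketch of a Gaussian kernel density estimator, the kind of estimator a violin plot relies on. The sample and the bandwidth are made up for illustration, and `gaussian_kde` is a hand-written helper, not a library call.

```python
import math
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` at any point x,
    using a Gaussian kernel with the given bandwidth."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# Hypothetical bimodal sample: one cluster around 0, another around 4.
sample = [-0.6, -0.3, -0.1, 0.0, 0.2, 0.5, 3.5, 3.8, 4.0, 4.1, 4.3, 4.6]
density = gaussian_kde(sample, bandwidth=0.5)
median = statistics.median(sample)

for x in (0.0, median, 4.0):
    print(f"estimated density at x = {x:.1f}: {density(x):.3f}")
```

In this made-up sample the median falls in the gap between the two clusters, so the boxplot draws its central line where there is almost no data, while the estimated density clearly dips there.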

[Figure: five panels, "Underlying density" (against x), "standard boxplot", "HDR boxplot", "violin plot" and "Box-Percentile Plot", for a bimodal sample ranging from -2 to 4]

Source: http://gallery.r-enthusiasts.com/graph/The_boxplot_friends_102.

McGill, Tukey and Larsen (1978) introduced the variable-width boxplot, in which the width of each box reflects the size of its group (it is proportional to the square root of the number of observations); this is believed to prevent misinterpretation of certain characteristics of the data, in particular the median. In the same paper they introduced the notched boxplot, which adds yet another element to the original boxplot by displaying confidence intervals around the medians. This makes it possible to judge visually whether or not the medians differ significantly between groups.
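The notch comparison can be sketched as follows. The interval used here, median ± 1.58 × IQR / √n, is the convention applied by R's `boxplot.stats`; the groups are made up, the helper names are hypothetical, and the quartiles depend on the interpolation method, so other software may give slightly different notches.

```python
import math
import statistics


def notch_interval(sample):
    """Approximate confidence interval around the median, as drawn by a
    notched boxplot: median +/- 1.58 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(sample))
    return median - half_width, median + half_width


def notches_overlap(sample_a, sample_b):
    """True when the two notch intervals share at least one point."""
    low_a, high_a = notch_interval(sample_a)
    low_b, high_b = notch_interval(sample_b)
    return low_a <= high_b and low_b <= high_a


# Hypothetical groups: group_2 is shifted far from group_1, group_3 only slightly.
group_1 = list(range(1, 12))
group_2 = [x + 6 for x in group_1]
group_3 = [x + 1 for x in group_1]

print("group 1 notch:", notch_interval(group_1))
print("groups 1 and 2 overlap:", notches_overlap(group_1, group_2))
print("groups 1 and 3 overlap:", notches_overlap(group_1, group_3))
```

Non-overlapping notches are read as evidence that the two medians differ; overlapping notches give no such evidence.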

[Figure: three panels, "Boxplot", "with variable width" and "with var. width and Notches", each comparing Group 1, Group 2 and Group 3]

3.3 Context is important!

[Figure: "Air China passengers", Total Passengers by Year, 2001 to 2004]

[Figure: "Air China passengers", Total Passengers by Year, 1990 to 2010]

[Figure: "Passengers", Total Passengers by Year, 1990 to 2010, Air China and British Airways]

4 Visualising relationships

4.1 Why visualising matters - F.J. Anscombe

Consider the four data sets proposed by F. J. Anscombe: (X1, Y1), (X2, Y2), (X3, Y3) and (X4, Y4).

Variable   n    Min    q1     Median  Mean   q3     Max
X1         11   4      6.50   9       9      11.50  14
X2         11   4      6.50   9       9      11.50  14
X3         11   4      6.50   9       9      11.50  14
X4         11   8      8.00   8       9      8.00   19

Table 1: Summary of the four data sets: Xs

Variable   n    Min    q1     Median  Mean   q3     Max
Y1         11   4.26   6.31   7.58    7.50   8.57   10.84
Y2         11   3.10   6.70   8.14    7.50   8.95   9.26
Y3         11   5.39   6.25   7.11    7.50   7.98   12.74
Y4         11   5.25   6.17   7.04    7.50   8.19   12.50

Table 2: Summary of the four data sets: Ys

Note that the correlations are cor(X1, Y1) = 0.8164, cor(X2, Y2) = 0.8162, cor(X3, Y3) = 0.8163 and finally cor(X4, Y4) = 0.8165. Now, let us really look at these data:

[Figure: four scatterplots, panels "X1-Y1", "X2-Y2", "X3-Y3" and "X4-Y4", with x from 5 to 20 and y from 0 to 14]
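The near-identical means and correlations of the four data sets can be reproduced with a short script. The values below are the standard published Anscombe quartet (as shipped, for instance, in R's `anscombe` data set); they are typed in here for illustration and are not taken from this document. The `pearson` helper is hand-written.

```python
import math
import statistics

# Anscombe's quartet; the first three data sets share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "X1-Y1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "X2-Y2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "X3-Y3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "X4-Y4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


summary = {
    name: (statistics.fmean(xs), statistics.fmean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mean_x, mean_y, r) in summary.items():
    print(f"{name}: mean x = {mean_x:.2f}, mean y = {mean_y:.2f}, cor = {r:.4f}")
```

The summaries are almost indistinguishable, which is precisely why the scatterplots are needed to tell the four data sets apart.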

4.2 Scatterplot with Tufte axes

[Figure: "Old Faithful Eruptions (271 samples)". Duration (sec) against Time till next eruption (min); additional label: Previous duration]

The 3D version of the density (estimated with the R package np)

[Figure: 3D surface of the Joint Density of eruptions and waiting]

5 Visualising "other things"

5.1 Visualising networks

Figure 15: Relationships between Mark Twain's characters. Source: Pajek http://pajek.imfm.si/doku.php?id=links

Figure 16: Relationships between the various brands and groups in the agri-food industries (IAA) http://www.convergencealimentaire.info/?attachment_id=238

Figure 17: Relationships between various co-authors http://www.bordalierinstitute.com/target11.html

5.2 Visualising in space and time

Figure 18: Example of information at two levels: the path and the altitude of the migration route of the geese. Source: Hawkes et al. (2012). http://sciencythoughts.blogspot.fr/2012/11/how-bar-headed-geese-cross-himalayas.html

Figure 19: The IGN offers many tools for visualisation in space, for example to visualise the changes between 2 dates. http://logiciels.ign.fr/?Presentation,47/

5.3 Visualising texts

Words of a paper. Source: http://www.wordle.net

6 Tools

6.1 Gapminder

Figure 20: Snapshot of Gapminder World http://www.gapminder.org/world/

6.2 R

See Thibault's presentation (December 2012), Sébastien's presentation on ggplot, and the R enthusiasts site http://gallery.r-enthusiasts.com/thumbs.php?sort=time.

6.3 Mathematica

Figure 21: Dynamic tool representing the demand for 2 goods. Source: Mathematica http://demonstrations.wolfram.com/ConsumerDemand/

6.4 Cortext

An important step when producing a network map is to carefully define the filtering steps. The first parameter one is asked to choose is the total number of top nodes of each field that should be mapped. The nodes are selected according to their frequency in each time period. Thus, when mapping a co-authorship network, choosing the 50 top items will produce the collaboration network between the 50 most productive authors (in terms of number of articles) in each time period. For a map of research labs vs. keywords, the 50 most productive research labs will be mapped along with the 50 most frequent keywords.
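The filtering rule described above (keep the most frequent nodes of a field within each time period) can be sketched as follows. This is not Cortext's code: the function `top_nodes_per_period`, the record layout and the author names are all hypothetical.

```python
from collections import Counter


def top_nodes_per_period(records, top_n):
    """Select the `top_n` most frequent nodes within each time period.

    `records` is an iterable of (period, node) pairs, one pair per
    occurrence (for example one pair per article signed by an author).
    """
    counts = {}
    for period, node in records:
        counts.setdefault(period, Counter())[node] += 1
    return {
        period: [node for node, _ in counter.most_common(top_n)]
        for period, counter in counts.items()
    }


# Hypothetical authorship records: (time period, author of one article).
records = [
    ("1996-2000", "Author A"), ("1996-2000", "Author A"), ("1996-2000", "Author B"),
    ("1996-2000", "Author C"), ("1996-2000", "Author B"), ("1996-2000", "Author A"),
    ("2001-2005", "Author C"), ("2001-2005", "Author C"), ("2001-2005", "Author B"),
]
selected = top_nodes_per_period(records, top_n=2)

for period, nodes in selected.items():
    print(period, nodes)
```

Because the selection is redone for each period, an author who is prominent in one period can drop out of the map in the next, so the choice of the threshold directly shapes what the network shows.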

Figure 22: Snapshot of a Cortext project http://manager.cortext.net/projects/webmaster_cortext_fr/agroecologie-extended/data/aeext-1996-1-aeext-db~3438/1/tubes/index.htm

6.5 Tableau

Commercial software offering some unique features: on-the-fly visualisation and dynamic presentation. One drawback: it is expensive and not commercially friendly!

Figure 23: Hurricane representation using Tableau http://www.tableausoftware.com/learn/gallery/storm-tracking

6.6 Circos

Free software for the circular representation of data (genomics and others). These days many people are dumping their SUVs in favour of smaller cars. How do customers "flow" between brands and car segments? The figure below illustrates such data sets.

Figure 24: Circular representation of data http://circos.ca/

6.7 Many-Eyes

An experiment by IBM Research and the IBM Cognos software group http://www-958.ibm.com/software/analytics/manyeyes/

Figure 25: Datasets can be represented using various visualisation tools http://www-958.ibm.com/software/analytics/manyeyes/datasets

7 Visualising mathematics

Federico Amodeo, and later René Taton, drew attention some time ago to the presence of descriptive-geometry drawings in the Underweysung der Messung, even though this discipline was only developed by Gaspard Monge nearly three centuries later. In figure 38 of Book I, a parabola appears as the envelope of its tangents. Dürer generates the figure point by point by placing one end of a ruler of fixed length ab successively on the points of the horizontal axis (part of which is divided by 16 points into 16 equal intervals) and making it pass through the points of the same name on the vertical axis rising from point 13. The other end marks the successive points of the curve.

Figure 26: Construction of a parabola illustrated by Gaspard Monge. Source: Images des maths 2006 http://images.math.cnrs.fr/Roles-des-figures-dans-la.html

8 Reference and example sites

R enthusiasts: http://gallery.r-enthusiasts.com/thumbs.php?sort=time
DataVisualization.ch: http://selection.datavisualization.ch/
Nathan Yau's Flowchart: http://flowingdata.com/category/tutorials/
Places & Spaces: http://scimaps.org/maps/browse/
La fonderie: http://outils.expoviz.fr/
We love dataviz: http://datavis.tumblr.com/
DataVisualization: http://www.datavisualization.fr/
IGN software: http://logiciels.ign.fr/?Presentation,47
Theresa Vanderbilt R Clinic: http://biostat.mc.vanderbilt.edu/wiki/Main/RClinic
Circos: http://circos.ca/