An R package for exploratory data analysis for teaching and research
François Husson, Julie Josse & Sébastien Lê
Why?
• To perform exploratory multivariate data analysis with free software
• To make it possible to propose new methods (taking into account different structures in the data)
• To provide a user-friendly package aimed at practitioners (a very easy GUI)
1 – The classical methods
The implemented methods share the same main objective: to summarize and simplify the data by reducing the dimensionality of the dataset.
• Continuous variables: Principal Components Analysis
• Contingency table: Correspondence Analysis
• Categorical variables: Multiple Correspondence Analysis
• Continuous and categorical variables: Mixed Data Analysis
PCA example
Data: performances of 41 athletes during two decathlon meetings (excerpt: athletes ranked 1 to 5 in each meeting)

Athlete  100m   Long.jump  Shot.put  High.jump  400m   110m.hurdle  Discus  Pole.vault  Javeline  1500m   Rank  Points  Competition
SEBRLE   11.04  7.58       14.83     2.07       49.81  14.69        43.75   5.02        63.19     291.70  1     8217    Decastar
CLAY     10.76  7.40       14.26     1.86       49.37  14.05        50.72   4.92        60.15     301.50  2     8122    Decastar
KARPOV   11.02  7.30       14.77     2.04       48.37  14.09        48.95   4.92        50.31     300.20  3     8099    Decastar
BERNARD  11.02  7.23       14.25     1.92       48.93  14.99        40.87   5.32        62.77     280.10  4     8067    Decastar
YURKOV   11.34  7.09       15.19     2.10       50.42  15.31        46.26   4.72        63.44     276.40  5     8036    Decastar
Sebrle   10.85  7.84       16.36     2.12       48.36  14.05        48.72   5.00        70.52     280.01  1     8893    OlympicG
Clay     10.44  7.96       15.23     2.06       49.19  14.13        50.11   4.90        69.71     282.00  2     8820    OlympicG
Karpov   10.50  7.81       15.93     2.09       46.81  13.97        51.65   4.60        55.54     278.11  3     8725    OlympicG
Macey    10.89  7.47       15.73     2.15       48.97  14.56        48.34   4.40        58.46     265.42  4     8414    OlympicG
Warners  10.62  7.74       14.48     1.97       47.97  14.01        43.73   4.90        55.39     278.05  5     8343    OlympicG
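As an illustration, the analysis of this table could be run along the following lines. This is a minimal sketch, assuming that FactoMineR is installed from CRAN and that the `decathlon` dataset shipped with the package is the table shown here; the choice of columns 11 and 12 (Rank, Points) as supplementary continuous variables and column 13 (Competition) as a supplementary categorical variable follows the column order above.

```r
library(FactoMineR)  # assumes the package is installed from CRAN

data(decathlon)      # assumed to be the 41 x 13 table described on the slide

# Columns 1-10: the ten events (active variables).
# Columns 11-12 (Rank, Points): supplementary continuous variables.
# Column 13 (Competition): supplementary categorical variable.
res <- PCA(decathlon, scale.unit = TRUE,
           quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

# Eigenvalues and percentages of variance of the first two dimensions
res$eig[1:2, ]

# Individuals factor map coloured by competition, then variables factor map
plot(res, choix = "ind", habillage = 13)
plot(res, choix = "var")
```

Supplementary elements are projected onto the dimensions but do not take part in building them, which is why Rank, Points and Competition are declared separately from the ten events.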
PCA example
Introduction of supplementary information:
• supplementary continuous variables
Graphs enriched by:
• representing the variables according to their quality of representation
Indicators:
• contribution
• quality of representation
[Figure: Variables factor map (PCA). Dimension 1 (32.72%) vs. Dimension 2 (17.37%). Variables shown: 100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Javeline, 1500m, Rank, Points]
PCA example
Introduction of supplementary information:
• supplementary individuals
• supplementary categorical variables
Graphs enriched by:
• coloring according to supplementary information
• confidence ellipses around the categories
Indicators:
• contribution
• quality of representation
[Figure: Individuals factor map (PCA). Dimension 1 (32.72%) vs. Dimension 2 (17.37%). The 41 athletes are plotted together with the two categories Decastar and OlympicG]
PCA example
[Figure: Individuals factor map labelled "Nb points". Dimension 1 (32.71 %) vs. Dimension 2 (17.37 %). Athletes and the categories Decastar and OlympicG are shown]
PCA example
[Figure: Individuals factor map labelled "Pole.vault". Dimension 1 (32.71 %) vs. Dimension 2 (17.37 %). Athletes and the categories Decastar and OlympicG are shown]
Description of the dimensions
By the quantitative variables:
• The correlation between each variable and the coordinates of the individuals on axis s is calculated
• The correlation coefficients are sorted
• Only the significant correlations are given
Description of the dimensions
By the qualitative variables:
• Perform a one-way analysis of variance with the coordinates of the individuals on the axis explained by the qualitative variable
• An F-test by variable
• For each category, a Student t-test to compare the average of the category with the general mean
Example output: $Dim.1$quali for the variable Competition, with significance level = 0.2
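The two descriptions above can be sketched in base R. This is only an illustration of the logic, not the package's implementation: it uses five of the ten events for the ten athletes listed in the data excerpt, `prcomp` instead of the package's PCA, and a sum-to-zero contrast as one possible way to compare each category with the general mean. In FactoMineR itself, `dimdesc()` with its `proba` argument reports this kind of description.

```r
# Excerpt of the decathlon data (athletes ranked 1 to 5 in each meeting)
dec <- data.frame(
  X100m     = c(11.04, 10.76, 11.02, 11.02, 11.34, 10.85, 10.44, 10.50, 10.89, 10.62),
  Long.jump = c(7.58, 7.40, 7.30, 7.23, 7.09, 7.84, 7.96, 7.81, 7.47, 7.74),
  Shot.put  = c(14.83, 14.26, 14.77, 14.25, 15.19, 16.36, 15.23, 15.93, 15.73, 14.48),
  X400m     = c(49.81, 49.37, 48.37, 48.93, 50.42, 48.36, 49.19, 46.81, 48.97, 47.97),
  X1500m    = c(291.70, 301.50, 300.20, 280.10, 276.40, 280.01, 282.00, 278.11, 265.42, 278.05),
  Competition = factor(rep(c("Decastar", "OlympicG"), each = 5))
)
events <- c("X100m", "Long.jump", "Shot.put", "X400m", "X1500m")

# Standardised PCA; coordinates of the individuals on the first axis
pca <- prcomp(dec[, events], center = TRUE, scale. = TRUE)
coord1 <- pca$x[, 1]

proba <- 0.2  # significance threshold, as on the slide

# Quantitative variables: correlate each variable with the axis,
# keep the significant correlations, then sort them
quanti <- t(sapply(dec[, events], function(v) {
  ct <- cor.test(v, coord1)
  c(correlation = unname(ct$estimate), p.value = ct$p.value)
}))
quanti <- quanti[quanti[, "p.value"] < proba, , drop = FALSE]
quanti <- quanti[order(quanti[, "correlation"], decreasing = TRUE), , drop = FALSE]
print(quanti)

# Qualitative variable: one-way ANOVA of the coordinates (global F-test)
fit <- aov(coord1 ~ Competition, data = dec)
f.p.value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Per category: t-test of the category effect against the mean,
# obtained here through a sum-to-zero contrast (illustrative choice)
fit.sum <- lm(coord1 ~ Competition, data = dec,
              contrasts = list(Competition = "contr.sum"))
print(summary(fit.sum)$coefficients)
```

With only ten athletes and five events, the numbers differ from those of the full analysis; the point is the sequence of steps, not the values.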
2 – Structure on the data
Different structures on the data are proposed:
• a partition on the variables: several sets of variables are studied simultaneously: Multiple Factor Analysis, Generalized Procrustes Analysis
• a hierarchy on the variables: variables are grouped and subgrouped (as in questionnaires structured in topics and subtopics): Hierarchical Multiple Factor Analysis
• a partition on the individuals: several sets of individuals described by the same variables: Dual Multiple Factor Analysis
Groups of variables (MFA)
Groups of variables are quantitative and/or qualitative.
Objectives:
- study the link between the sets of variables
- balance the influence of each group of variables
- give the classical graphs but also specific graphs: groups of variables, partial representation
Examples:
- Genomic: DNA, protein
- Sensory analysis: sensorial, physico-chemical
- Comparison of coding (quantitative / qualitative)
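A call of this kind might look as follows for a sensory dataset. This is a sketch under assumptions: it relies on the `wine` dataset distributed with FactoMineR (21 wines, 31 variables), and the group sizes, group names and the choice of supplementary groups are taken from that dataset's usual example, not from these slides.

```r
library(FactoMineR)  # assumes the package is installed from CRAN

data(wine)  # assumed layout: 2 categorical columns, then 29 sensory scores

res.mfa <- MFA(wine,
               group = c(2, 5, 3, 10, 9, 2),      # number of variables per group
               type = c("n", rep("s", 5)),        # "n": categorical, "s": scaled continuous
               name.group = c("orig", "olf", "vis", "olfag", "gust", "ens"),
               num.group.sup = c(1, 6),           # groups 1 and 6 are supplementary
               graph = FALSE)

# Specific graphs: the groups of variables, then the partial representation
plot(res.mfa, choix = "group")
plot(res.mfa, choix = "ind", partial = "all")
```

Declaring the groups lets the method balance their influence, so that a group with many variables does not dominate the first dimensions.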
Hierarchy on the variables (HMFA)
Two levels for the hierarchy: the first one contains L groups, each group l contains Jl subgroups, and each subgroup j has Kj variables
Objective: to balance the groups and the subgroups of variables
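Partition on the individuals (DMFA)
[Diagram: data table with the variables 1, ..., k, ..., K in columns and the individuals in rows, split into Group 1 (individuals 1 to I1) through Group J (individuals 1 to IJ); xik denotes the value of variable k for individual i]
Objective: to compare the covariance matrices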
3 – Graphical User Interface
• Menu of the FactoMineR GUI
• Main window of the PCA
• Graphical options
4 – Conclusion
• For researchers, practitioners and students: with classical and advanced methods
• The FactoMineR package is available on CRAN
• The GUI can be loaded simply with: source("http://factominer.free.fr/install-facto.r")
• A website is dedicated to this package: http://factominer.free.fr
• Future: dynamic graphs
• Perspective: UseR!2008 (2 tutorials), UseR!2009 at Rennes