Machine Learning: Statistical Learning Methods when n ≪ p
Ernest Fokoué, Associate Professor of Statistics
Rochester Institute of Technology, Rochester, New York, USA
@ErnestFokoue
Invited Presentation, Deuxièmes Journées de Statistiques
Université de Bretagne Sud, France
Friday, 21 November 2014
Acknowledgements and Expression of Gratitude
I wish to express my sincere and heartfelt thanks to Professor Evans Gouno, Université de Bretagne Sud.
Statistical Speaker Accent Recognition
Sentence read: Humanity as a whole is at a threshold of a monumental shift in consciousness. You can see it everywhere. It is breathtaking!
[Waveform plots (amplitude vs. time) omitted.]
Figure: (L) Loud Non-US Female (R) Normal Non-US Female.
Statistical Recognition of Dialect
Sentence read: Humanity as a whole is at a threshold of a monumental shift in consciousness. You can see it everywhere. It is breathtaking!
[Waveform plots (amplitude vs. time) omitted.]
Figure: (L) Normal US Female (R) Low US Male.
Statistical Speaker Accent Recognition
Sentence read: Humanity as a whole is at a threshold of a monumental shift in consciousness. You can see it everywhere. It is breathtaking!
[Waveform and spectrogram plots omitted.]
Figure: US Male Normal: (L) Time Domain (R) Spectrogram.
Statistical Recognition of Dialect

Consider $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{-1, +1\}$, and the set $\mathscr{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$, where
\[
Y_i = \begin{cases} +1 & \text{if person } i \text{ is a native US speaker} \\ -1 & \text{if person } i \text{ is not a native US speaker} \end{cases}
\]
and $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ is the time-domain representation of his/her reading of an English sentence. The design matrix is
\[
\mathbf{X} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
\]
Example 1: Statistical Recognition of Dialect

Consider the design matrix $\mathbf{X} = (x_{ij}) \in \mathbb{R}^{n \times p}$ above. At RIT, we recently collected voices from n = 117 people. Each sentence required about 11 seconds to be read. At a sampling rate of 44100 Hz, each sentence requires a vector of dimension roughly p = 540000 in the time domain. We therefore have a gravely underdetermined system with $\mathbf{X} \in \mathbb{R}^{n \times p}$ where n ≪ p. Here, n = 117 and p = 540000.
Design Matrix for an Overdetermined System

Consider the design matrix
\[
\mathbf{X} = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
\]
We therefore have a gravely overdetermined system with $\mathbf{X} \in \mathbb{R}^{n \times p}$ where n ≫ p.
Design Matrix for an Underdetermined System

Consider the design matrix
\[
\mathbf{X} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
\]
With $\mathbf{X} \in \mathbb{R}^{n \times p}$ where n ≪ p, we are in the presence of an underdetermined system. Data sets of this type are very challenging and often require both regularization and parallelization.
Basic Formulation of Binary Pattern Recognition

Consider $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{-1, +1\}$, and the set $\mathscr{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$, where
\[
Y_i = \begin{cases} +1 & \text{if observation } i \text{ belongs to the first group} \\ -1 & \text{if observation } i \text{ does not belong to the first group} \end{cases}
\]
and $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ is the explanatory vector for observation $i$. The design matrix is
\[
\mathbf{X} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}
\]
Examples of Datasets and Their Characteristics

Dataset               n      p        n/p        log(|cov(X)|)   c
crabs                 200    5        40.0       -10.04          2
pima                  532    7        76.0       -1.60           2
spam                  4601   57       80.7       -15.24          2
musk                  476    166      2.9        -541.15         2
lymphoma              180    661      3 x 10^-1  -inf            3
lung cancer           197    1000     2 x 10^-1  -inf            4
breast cancer (A)     97     1213     8 x 10^-2  -inf            3
colon cancer          62     2000     3 x 10^-2  -inf            2
leukemia              72     3571     2 x 10^-2  -inf            2
brain cancer          42     5597     7 x 10^-3  -inf            5
breast cancer (W)     49     7129     7 x 10^-3  -inf            2
accent recognition    117    5 x 10^5 2 x 10^-4  -inf            2

Table: The last column c is the number of classes in the pattern recognition task. Normally we need to compute the class-conditional covariance matrices.
Brief Reminder of Logistic Regression

We have $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{0, 1\}$, and data set $\mathscr{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$.

Logistic Regression: assumes that the response variable $Y_i$ is related to the explanatory vector $x_i$ through the model
\[
\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta(x_i; \beta), \tag{1}
\]
where
\[
\eta(x_i; \beta) = x_i^\top \beta = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \tag{2}
\]
and
\[
\pi_i = \Pr[Y_i = 1 \mid x_i] = \frac{e^{\eta(x_i;\beta)}}{1 + e^{\eta(x_i;\beta)}} = \frac{1}{1 + e^{-\eta(x_i;\beta)}} = \pi(x_i; \beta).
\]
The Loss Function, or Negative Log-Likelihood

Using the traditional {0, 1} indicator labelling, the empirical risk for the binary linear logistic regression model is given by
\[
\hat{R}(\beta_0, \beta) = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i(\beta_0 + \beta^\top x_i) - \log\left(1 + \exp(\beta_0 + \beta^\top x_i)\right) \right],
\]
which is simply $\hat{R}(\beta_0, \beta) = -\frac{1}{n}\log L(\beta_0, \beta)$. However, if we use the labelling {−1, +1}, as with the SVM, the empirical risk for the linear logistic regression model is given by
\[
\hat{R}(\beta_0, \beta) = \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp\left(-y_i(\beta_0 + \beta^\top x_i)\right)\right].
\]
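As a sanity check on the equivalence of the two labellings, here is a small numerical sketch (Python/NumPy, not from the original slides) that evaluates both empirical risk expressions on the same simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta0, beta = 0.5, np.array([1.0, -2.0, 0.5])
eta = beta0 + X @ beta
y01 = (rng.uniform(size=n) < 1 / (1 + np.exp(-eta))).astype(float)
ypm = 2 * y01 - 1  # same labels, coded as {-1, +1}

# Empirical risk under the {0,1} coding: negative average log-likelihood
risk_01 = -np.mean(y01 * eta - np.log1p(np.exp(eta)))
# Empirical risk under the {-1,+1} coding
risk_pm = np.mean(np.log1p(np.exp(-ypm * eta)))

print(abs(risk_01 - risk_pm))  # essentially zero: the two formulations coincide
```

The agreement is exact term by term: for $y_i = 1$ the {0,1} summand reduces to $\log(1 + e^{-\eta_i})$, and for $y_i = 0$ it reduces to $\log(1 + e^{+\eta_i})$.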
Most Common Link Functions in Binary Regression

We can write $\eta(x_i) = F^{-1}(\pi(x_i)) = g(\pi(x_i)) = g(\mathrm{E}(Y_i \mid x_i))$. The table below gives the most commonly used link functions in binary regression, along with their corresponding cdfs.

Model     Link function g(v)        cdf F(u)
Probit    $\Phi^{-1}(v)$            $\Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz$
Compit    $\log[-\log(1-v)]$        $1 - e^{-e^{u}}$
Cauchit   $\tan(\pi v - \pi/2)$     $\frac{1}{\pi}\left[\tan^{-1}(u) + \frac{\pi}{2}\right]$
Logit     $\log\left[\frac{v}{1-v}\right]$   $\Lambda(u) = \frac{1}{1 + e^{-u}}$

Table: Source: Gunduz and Fokoue (2013).

Note: The R function glm() used for performing logistic regression also supports the link functions above.
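The link functions above are simply the inverses of their cdfs. A quick Python sketch (not from the slides; the probit is omitted since its inverse cdf needs a special function) makes this explicit and checks the inversion numerically:

```python
import numpy as np

def logit_cdf(u):    return 1 / (1 + np.exp(-u))
def cauchit_cdf(u):  return np.arctan(u) / np.pi + 0.5
def compit_cdf(u):   return 1 - np.exp(-np.exp(u))  # complementary log-log

def logit_link(v):   return np.log(v / (1 - v))
def cauchit_link(v): return np.tan(np.pi * v - np.pi / 2)
def compit_link(v):  return np.log(-np.log(1 - v))

u = 0.7
for cdf, link in [(logit_cdf, logit_link),
                  (cauchit_cdf, cauchit_link),
                  (compit_cdf, compit_link)]:
    # g(F(u)) = u: each link inverts its cdf
    assert abs(link(cdf(u)) - u) < 1e-10
print("all links invert their cdfs")
```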
Most Common Link Functions in Binary Regression

[Plots of the pdfs (left) and cdfs (right) of the probit, compit, cauchit, and logit links omitted.]

Figure: Source: Gunduz and Fokoue (2013). (Left) Densities corresponding to the link functions. (Right) cdfs corresponding to the link functions. The similarities are clear around the center of the distributions, but some differences can be seen at the tails.
Classification and Regression Trees

Given $\mathscr{D} = \{(x_1, Y_1), \cdots, (x_n, Y_n)\}$, with $x_i \in \mathcal{X}$, $Y_i \in \{1, \cdots, k\}$. If $T$ denotes the tree represented by the partitioning of $\mathcal{X}$ into $q$ regions $R_1, R_2, \cdots, R_q$ such that $\mathcal{X} = \cup_{\ell=1}^{q} R_\ell$, then all the observations in a given terminal node (region) are assigned the same label, namely
\[
c_\ell = \underset{j \in \{1,\cdots,k\}}{\operatorname{argmax}} \; \frac{1}{|R_\ell|} \sum_{x_i \in R_\ell} \mathrm{I}(Y_i = j).
\]
As a result, for a new point $x$, its predicted class is given by
\[
\hat{Y}_{\mathrm{Tree}} = \hat{f}_{\mathrm{Tree}}(x) = \sum_{\ell=1}^{q} c_\ell \, \mathrm{I}_\ell(x),
\]
where $\mathrm{I}_\ell(\cdot)$ is the indicator function of $R_\ell$, i.e. $\mathrm{I}_\ell(x) = 1$ if $x \in R_\ell$ and $\mathrm{I}_\ell(x) = 0$ otherwise. Trees are known to be notoriously unstable.
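The terminal-node rule above, majority vote within each region, can be sketched in a few lines of Python (a toy one-dimensional stump with two regions, not the full CART algorithm):

```python
import numpy as np

# Toy 1-D tree with two regions: R1 = (-inf, 0], R2 = (0, inf)
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([1, 1, 2, 2, 2, 2])
split = 0.0

def region_label(mask):
    # c_l = majority class among training points falling in the region
    labels, counts = np.unique(y[mask], return_counts=True)
    return labels[np.argmax(counts)]

c1 = region_label(x <= split)
c2 = region_label(x > split)

def f_tree(x_new):
    # predicted class = label of the region containing x_new
    return c1 if x_new <= split else c2

print(f_tree(-3.0), f_tree(4.0))  # -> 1 2
```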
Typical Aspects of Big/Complex/Massive Data
1. Large p, small n: p ≫ n
2. Large n, small p: n ≫ p
3. Large p and large n
4. Presence of multicollinearity
5. Different measurement scales
6. Heterogeneous input space
7. Non-local storage
8. Dynamic/sequential data (streams)
9. Unstructured data (text)
10. Potential nonlinear underlying pattern
Common Challenges with Big Data
1. Statistical
2. Computational
3. Mathematical
4. Epistemological
Aspects of Big/Complex/Massive Data
1. Number of variables observed for each entity (p)
2. Number of entities observed (n)
3. Number of atoms (basis functions) in the model (k)

Different scenarios in linear model situations:

            small p                   large p
small n     Traditional               Regularization (constraints)
large n     Sequential estimation     Regularized sequential

Questions:
1. How small is small? How large is large?
2. The ratio n/p is crucial.
[Taxonomy table omitted: scenarios cross-classified by sample size (n > 1000 vs. n ≤ 1000) and dimensionality relative to n (e.g. information abundance, n ≫ p, vs. much smaller n), with cells labelled A through F; the cell contents did not survive extraction.]

Table: In this taxonomy, A and D pose a lot of challenges.
Introduction to Regression Analysis

We have $x_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$, and data set $\mathscr{D} = \{(x_1, Y_1), (x_2, Y_2), \cdots, (x_n, Y_n)\}$. We assume that the response variable $Y_i$ is related to the explanatory vector $x_i$ through a function $f$ via the model
\[
Y_i = f(x_i) + \xi_i, \qquad i = 1, \cdots, n. \tag{3}
\]
- The explanatory vectors $x_i$ are fixed (non-random).
- The regression function $f : \mathbb{R}^p \to \mathbb{R}$ is unknown.
- The error terms $\xi_i$ are iid Gaussian, i.e. $\xi_i \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma^2)$.

Goal: We seek to estimate the function $f$ using the data in $\mathscr{D}$.
Formulation of the Regression Problem

Let $X$ and $Y$ be two random variables such that $\mathrm{E}[Y] = \mu$ and $\mathrm{E}[Y^2] < \infty$.

Goal: Find the best predictor $f(X)$ of $Y$ given $X$.

Important questions:
- How does one define "best"?
- Is the very best attainable in practice?
- What does the function $f$ look like? (Function class)
- How do we select a candidate from the chosen class of functions?
- How hard is it computationally to find the desired function?
Loss Functions
1. When $f(X)$ is used to predict $Y$, a loss is incurred. Question: How is such a loss quantified? Answer: Define a suitable loss function.
2. Common loss functions in regression:
   - Squared error ($\ell_2$) loss: $\ell(Y, f(X)) = (Y - f(X))^2$. The $\ell_2$ loss is by far the most prevalent because of its differentiability; unfortunately, it is not very robust to outliers.
   - Absolute error ($\ell_1$) loss: $\ell(Y, f(X)) = |Y - f(X)|$. The $\ell_1$ loss is more robust to outliers, but not differentiable at zero.
3. Note that $\ell(Y, f(X))$ is a random variable.
Risk Functionals and Cost Functions
1. Definition of a risk functional:
\[
R(f) = \mathrm{E}[\ell(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(x)) \, p_{XY}(x, y) \, dx \, dy.
\]
$R(f)$ is the expected loss over all pairs of the cross space $\mathcal{X} \times \mathcal{Y}$.
2. Ideally, one seeks the best out of all possible functions, i.e.,
\[
f^*(X) = \underset{f}{\operatorname{argmin}} \; R(f) = \underset{f}{\operatorname{argmin}} \; \mathrm{E}[\ell(Y, f(X))],
\]
so that $f^*(\cdot)$ satisfies $R^* = R(f^*) = \min_f R(f)$.
3. This ideal function cannot be found in practice: because the distributions are unknown, it is impossible to form an expression for $R(f)$.
Cost Functions and Risk Functionals

Theorem: Under regularity conditions,
\[
f^*(X) = \mathrm{E}[Y \mid X] = \underset{f}{\operatorname{argmin}} \; \mathrm{E}[(Y - f(X))^2].
\]
Under the squared error loss, the optimal function $f^*$ that yields the best prediction of $Y$ given $X$ is none other than the conditional expectation of $Y$ given $X$. Since we know neither $p_{XY}(x, y)$ nor $p_X(x)$, the conditional expectation
\[
\mathrm{E}[Y \mid X = x] = \int_{\mathcal{Y}} y \, p_{Y|X}(y \mid x) \, dy = \int_{\mathcal{Y}} y \, \frac{p_{XY}(x, y)}{p_X(x)} \, dy
\]
cannot be directly computed.
Empirical Risk Minimization

Let $\mathscr{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$ represent an iid sample. The empirical version of the risk functional is
\[
\hat{R}(f) = \widehat{\mathrm{E}}[(Y - f(X))^2] = \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2.
\]
It turns out that $\hat{R}(f)$ provides an unbiased estimator of $R(f)$. We therefore seek the best by the empirical standard,
\[
\hat{f}^*(X) = \underset{f}{\operatorname{argmin}} \; \hat{R}(f) = \underset{f}{\operatorname{argmin}} \; \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2.
\]
Since it is impossible to search all possible functions, it is usually crucial to choose the "right" function space.
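Empirical risk minimization over a nested family is easy to see numerically. The Python sketch below (an illustration, not from the slides) fits polynomials of increasing degree by least squares and evaluates the empirical risk; since each $\mathcal{P}_k$ contains $\mathcal{P}_{k-1}$, the in-sample risk can only decrease with the degree:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=n)

def empirical_risk(f, x, y):
    # R_hat(f) = (1/n) * sum_i (Y_i - f(X_i))^2
    return np.mean((y - f(x)) ** 2)

risks = []
for k in (1, 3, 5):
    coefs = np.polyfit(x, y, deg=k)  # least squares fit within P_k
    f_hat = np.poly1d(coefs)
    risks.append(empirical_risk(f_hat, x, y))

print(risks)  # non-increasing as the function space grows
```

Of course, the decreasing in-sample risk is exactly why a larger space is not automatically better: the bias-variance trade-off discussed later governs out-of-sample performance.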
Function Spaces

For the function estimation task, one could assume that the input space $\mathcal{X}$ is a closed and bounded interval of $\mathbb{R}$, i.e. $\mathcal{X} = [a, b]$, and then consider estimating the dependencies between $x$ and $y$ from within the space $\mathcal{F}$ of all bounded functions on $\mathcal{X} = [a, b]$, i.e.,
\[
\mathcal{F} = \{f : \mathcal{X} \to \mathbb{R} \mid \text{there exists } B > 0 \text{ such that } |f(x)| \leq B \text{ for all } x \in \mathcal{X}\}.
\]
One could be even more specific and require the functions of $\mathcal{F}$ to be continuous, so that the space to search becomes
\[
\mathcal{F} = \{f : [a, b] \to \mathbb{R} \mid f \text{ is continuous}\} = C([a, b]),
\]
which is the well-known space of all continuous functions on a closed and bounded interval $[a, b]$. This is indeed a very important function space.
Space of Univariate Polynomials

In fact, polynomial regression consists of searching a function space that is a subspace of $C([a, b])$. In other words, when we perform the very common polynomial regression, we are searching the space
\[
\mathcal{P}([a, b]) = \{f \in C([a, b]) \mid f \text{ is a polynomial with real coefficients}\}.
\]
It is interesting to note that Weierstrass proved that $\mathcal{P}([a, b])$ is dense in $C([a, b])$. One considers the space of all polynomials of some degree $k$, i.e.,
\[
\mathcal{F} = \mathcal{P}_k([a, b]) = \left\{ f \in C([a, b]) \;\middle|\; \exists \alpha_0, \alpha_1, \cdots, \alpha_k \in \mathbb{R} \text{ such that } f(x) = \sum_{j=0}^{k} \alpha_j x^j, \; \forall x \in [a, b] \right\}.
\]
Empirical Risk Minimization in F

Having chosen a class $\mathcal{F}$ of functions, we can now seek
\[
\hat{f}(X) = \underset{f \in \mathcal{F}}{\operatorname{argmin}} \; \hat{R}(f) = \underset{f \in \mathcal{F}}{\operatorname{argmin}} \; \frac{1}{n}\sum_{i=1}^{n} (Y_i - f(X_i))^2.
\]
We are seeking the best function in the chosen function space. For instance, if the function space is the space of all polynomials of degree $k$ on some interval $[a, b]$, finding $\hat{f}$ boils down to estimating the coefficients of the polynomial using the data. The least squares estimation of the polynomial coefficients corresponds to exactly this.
Effect of the Bias-Variance Dilemma on Prediction

[Bias-variance trade-off figure omitted.]

Optimal prediction is achieved at the point of the bias-variance trade-off.
Ridge Regression Estimator

Given $\{(x_i, y_i), i = 1, \cdots, n\}$, and assuming the multiple linear regression model
\[
y_i = x_i^\top \beta + \epsilon_i, \tag{4}
\]
where $x_i^\top = (x_{i1}, x_{i2}, \cdots, x_{ip})$ and $\beta = (\beta_1, \beta_2, \cdots, \beta_p)^\top$, the ridge estimator of $\beta$ is the solution to the minimization problem
\[
\hat{\beta}^{(\mathrm{ridge})} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \left\{ \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + g \sum_{j=1}^{p} \beta_j^2 \right\},
\]
which yields what we intuitively derived earlier, namely
\[
\hat{\beta}^{(\mathrm{ridge})} = (\mathbf{X}^\top \mathbf{X} + g \mathbf{I}_p)^{-1} \mathbf{X}^\top \mathbf{y}.
\]
Note: $g$ controls the trade-off between bias and variance; $g$ is often estimated by cross-validation.
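The closed form can be checked numerically: at the ridge solution, the gradient of the penalized least squares objective vanishes. A short Python/NumPy sketch (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, g = 30, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Closed-form ridge estimator: (X'X + g I_p)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + g * np.eye(p), X.T @ y)

def objective(beta):
    # penalized residual sum of squares
    return np.sum((y - X @ beta) ** 2) + g * np.sum(beta ** 2)

# Gradient 2(X'X + gI)beta - 2X'y vanishes at the ridge solution
grad = 2 * (X.T @ X + g * np.eye(p)) @ beta_ridge - 2 * X.T @ y
print(np.allclose(grad, 0))  # -> True
```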
Bayesian Formulation of Ridge Regression

If we assume a multivariate Gaussian prior on $\beta$, with mean vector $\beta_0$ and variance-covariance matrix $\mathbf{W}_0$, then we may write $\beta \sim \mathrm{N}(\beta_0, \mathbf{W}_0)$. Combining the prior with the likelihood $\mathbf{y} \sim \mathrm{N}(\mathbf{X}\beta, \sigma^2 \mathbf{I})$ yields
\[
\hat{\beta}^{(\mathrm{Bayes})} = (\sigma^{-2}\mathbf{X}^\top \mathbf{X} + \mathbf{W}_0^{-1})^{-1}(\sigma^{-2}\mathbf{X}^\top \mathbf{y} + \mathbf{W}_0^{-1}\beta_0).
\]
Relationship between ridge regression and Bayesian estimation: if we choose $\beta_0 = (0, 0, \cdots, 0)^\top$ and $\mathbf{W}_0 = \sigma_0^2 \mathbf{I}$, then
\[
\hat{\beta}^{(\mathrm{Bayes})} = (\mathbf{X}^\top \mathbf{X} + g\mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y} = \hat{\beta}^{(\mathrm{ridge})},
\]
where $g = \sigma^2/\sigma_0^2$.
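The ridge-as-posterior-mean identity is easy to confirm numerically (Python/NumPy sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
sigma2, sigma02 = 1.5, 0.5
g = sigma2 / sigma02

# Posterior mean with beta_0 = 0 and W_0 = sigma0^2 I
A = X.T @ X / sigma2 + np.eye(p) / sigma02
beta_bayes = np.linalg.solve(A, X.T @ y / sigma2)

# Ridge estimator with g = sigma^2 / sigma0^2
beta_ridge = np.linalg.solve(X.T @ X + g * np.eye(p), X.T @ y)

print(np.allclose(beta_bayes, beta_ridge))  # -> True
```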
LASSO as Constrained Optimization
1. LASSO: Least Absolute Shrinkage and Selection Operator.
\[
\hat{\beta}^{(\mathrm{lasso})} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq s, \tag{5}
\]
which is also formulated as
\[
\hat{\beta}^{(\mathrm{lasso})} = \underset{\beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + 2\gamma \|\beta\|_1, \tag{6}
\]
where $\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the $\ell_1$-norm of $\beta$.
- Among the most popular techniques in current scientific circles.
- Computationally harder, but performs both shrinkage and selection.
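In the simplest one-dimensional (or orthonormal-design) case, the lasso solution of (6) is the soft-thresholded least squares estimate; this special case, verified below against a brute-force grid minimization (a Python sketch, not from the slides), is the building block of coordinate descent solvers:

```python
import numpy as np

# 1-D lasso: minimize (b_ols - b)^2 + 2*gamma*|b|  ->  soft-thresholding
def soft_threshold(b_ols, gamma):
    return np.sign(b_ols) * max(abs(b_ols) - gamma, 0.0)

b_ols, gamma = 1.3, 0.4
grid = np.linspace(-3, 3, 200001)
obj = (b_ols - grid) ** 2 + 2 * gamma * np.abs(grid)
b_grid = grid[np.argmin(obj)]  # brute-force minimizer

print(abs(soft_threshold(b_ols, gamma) - b_grid) < 1e-3)  # -> True
```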
Regularized Kernel Logistic Regression
\[
\hat{R}_{\mathrm{reg}}(\beta_0, \beta) = \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i(\beta_0 + \beta^\top x_i)\right\}\right] + \lambda \sum_{j=0}^{p} |\beta_j|.
\]
Using kernels, the regularized empirical risk for kernel logistic regression is given by
\[
\hat{R}_{\mathrm{reg}}(g) = \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i g(x_i)\right\}\right] + \frac{\lambda}{2}\|g\|^2_{\mathcal{H}_K},
\]
where
\[
g(x_i) = v + \sum_{j=1}^{n} w_j K(x_i, x_j),
\]
with $g = v + h$, $v \in \mathbb{R}$ and $h \in \mathcal{H}_K$. Here, $\mathcal{H}_K$ is the Reproducing Kernel Hilbert Space (RKHS) engendered by the kernel $K$.
Elastic Net and the Generalized Linear Model

One can combine, for instance, the ridge and the lasso penalties into one (Zou and Hastie), yielding the so-called Elastic Net:
\[
\hat{R}_{\mathrm{reg}}(\beta) = \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i g(x_i)\right\}\right] + \lambda \, \mathrm{Pen}_\alpha(\beta),
\]
where
\[
\mathrm{Pen}_\alpha(\beta) = \frac{1}{2}(1 - \alpha)\|\beta\|_2^2 + \alpha\|\beta\|_1,
\]
so that:
- $\alpha = 1$: LASSO
- $\alpha = 0$: Ridge
- $0 < \alpha < 1$: grouping/shrinkage and selection
In R: library(glmnet)
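The mixing behaviour of the elastic-net penalty is immediate from its definition; the Python sketch below (illustrative, not from the slides) checks that the two endpoints recover the lasso and ridge penalties exactly:

```python
import numpy as np

def pen_alpha(beta, alpha):
    # Elastic-net penalty: (1/2)(1-alpha)*||beta||_2^2 + alpha*||beta||_1
    return 0.5 * (1 - alpha) * np.sum(beta ** 2) + alpha * np.sum(np.abs(beta))

beta = np.array([1.0, -2.0, 0.5])
l1 = np.sum(np.abs(beta))    # ||beta||_1 = 3.5
l2sq = np.sum(beta ** 2)     # ||beta||_2^2 = 5.25

assert pen_alpha(beta, 1.0) == l1          # alpha = 1 recovers the lasso penalty
assert pen_alpha(beta, 0.0) == 0.5 * l2sq  # alpha = 0 recovers the ridge penalty
print(pen_alpha(beta, 0.5))  # intermediate alpha blends the two
```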
Theoretical Aspects of Statistical Learning

With labels taken from {−1, +1}, we have
\[
\hat{R}_{\mathrm{emp}}(f) = \frac{1}{2n}\sum_{i=1}^{n} |y_i - f(x_i)| = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{\{y_i \neq f(x_i)\}}.
\]
For every $f \in \mathcal{F}$, and $n > h$, with probability at least $1 - \eta$, we have
\[
R(f) \leq \hat{R}_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\log\frac{2n}{h} + 1\right) - \log\frac{\eta}{4}}{n}}.
\]
In the above formula, $h$ is the VC (Vapnik-Chervonenkis) dimension of the space $\mathcal{F}$ of functions from which $f$ is taken.
Fundamental Challenges of Massive Data

Ill-posedness and the inevitability of regularization: with massive data, it is often the case that either the number of parameters to be estimated is larger than the number of observations used to estimate them, or the intrinsic dimensionality of the solution space is far less than the observed (apparent) dimensionality. As a result, the problem is ill-posed in Hadamard's sense: it violates at least one of the three well-posedness conditions:
1. The solution exists.
2. The solution is unique.
3. The solution is stable.
Extra conditions must be imposed to achieve well-posedness, and extensions of old tools are required to obtain at least some feasible solutions.
Computational Challenges with Massive Data
1. Large n or large p can make storage in RAM literally impossible.
2. Large n or large p leads to instability of iterative numerical methods.
3. Large n or large p increases the computational complexity of operations like estimation, model space search, and Bayesian posterior simulation.
4. Non-homogeneity of the input space creates representation challenges and requires hybridization.
Mathematical Challenges with Massive Data
1. Hadamard ill-posedness forces the need for regularization or Bayesianization to achieve the bias-variance trade-off.
2. Large p leads to the curse of dimensionality, e.g. for:
   - k-Nearest Neighbors
   - Nonparametric density estimation
3. Large n or large p causes difficulty in approximation, as in finding a good hypothesis space.
Approaches to Dealing with Massiveness
1. Deletion and imputation
2. Vectorization
3. Standardization
4. Reduction/Projection
5. Selection/Extraction
6. Randomization
7. Regularization
8. Kernelization
9. Aggregation/Ensemblization
10. Parallelization
11. Sequentialization
Regularization
Basics of Regularized Discriminant Analysis

Let $z_{ij} = 1$ if observation $i$ is from class $j$ and $0$ otherwise, and let $n_j = \sum_{i=1}^{n} z_{ij}$ denote the class count.
1. Step 1, basic estimation:
\[
\hat{\Sigma}_j = \frac{1}{n_j}\sum_{i=1}^{n} z_{ij}(x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^\top, \qquad \text{where} \qquad \hat{\mu}_j = \frac{1}{n_j}\sum_{i=1}^{n} z_{ij} x_i.
\]
2. Step 2, regularization:
\[
\hat{\Sigma}_j(\lambda) = \lambda \hat{\Sigma}_j + (1 - \lambda)\mathbf{I}_p.
\]
3. Step 3, further shrinkage/regularization:
\[
\hat{\Sigma}_j(\lambda, \gamma) = (1 - \gamma)\hat{\Sigma}_j(\lambda) + \frac{\gamma}{p}\,\mathrm{tr}\!\left(\hat{\Sigma}_j(\lambda)\right)\mathbf{I}_p.
\]
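The three steps are a few lines of NumPy. A useful invariant to check: step 3 shrinks toward a spherical matrix with the same trace, so the total variance $\mathrm{tr}(\hat{\Sigma}_j(\lambda,\gamma))$ is preserved. (Python sketch, not from the slides; the class indicator is simulated.)

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = rng.normal(size=(n, p))
z = rng.integers(0, 2, size=n)  # membership indicator z_ij for one class j

# Step 1: class-conditional mean and covariance (normalized by class count)
nj = z.sum()
mu = (z[:, None] * X).sum(axis=0) / nj
Xc = X[z == 1] - mu
Sigma = Xc.T @ Xc / nj

# Step 2: shrink toward the identity
lam = 0.7
Sigma_lam = lam * Sigma + (1 - lam) * np.eye(p)

# Step 3: shrink toward a sphere carrying the same trace
gam = 0.3
Sigma_lam_gam = (1 - gam) * Sigma_lam + gam * (np.trace(Sigma_lam) / p) * np.eye(p)

print(np.isclose(np.trace(Sigma_lam_gam), np.trace(Sigma_lam)))  # -> True
```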
Ensemblization (Aggregation / Combination / Averaging)
Random Forest via Random Subspace Learning
- Choose a base learner class, with generic learner denoted $\hat{g}(\cdot)$.
- Choose an estimation technique/method.
- For b = 1 to B:
  - Draw with replacement from $\mathscr{D}$ a bootstrap sample $\mathscr{D}^{(b)} = \{z_1^{(b)}, \cdots, z_n^{(b)}\}$.
  - Draw without replacement from $\{1, 2, \cdots, p\}$ a subset $\{i_1^{(b)}, \cdots, i_d^{(b)}\}$ of $d$ variables.
  - Drop the unselected variables from the bootstrap sample $\mathscr{D}^{(b)}$, so that $\mathscr{D}^{(b)}_{\mathrm{sub}}$ is $d$-dimensional.
  - Build the $b$-th base learner $\hat{g}^{(b)}$ based on $\mathscr{D}^{(b)}_{\mathrm{sub}}$.
- End for.
- Use the ensemble $\{\hat{g}^{(b)}, b = 1, \cdots, B\}$ to form the estimator
\[
\hat{g}^{(\mathrm{bagging})}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{g}^{(b)}(x).
\]
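The loop above can be sketched in a few lines. The toy version below (Python/NumPy, not from the slides) uses ordinary least squares on each bootstrap-plus-random-subspace sample as the base learner, rather than a tree, to keep the sketch short; the bootstrap/subspace/average structure is the point:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, d, B = 80, 6, 3, 25
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(scale=0.3, size=n)

learners = []
for b in range(B):
    boot = rng.integers(0, n, size=n)             # bootstrap rows (with replacement)
    feats = rng.choice(p, size=d, replace=False)  # random subspace (without replacement)
    beta, *_ = np.linalg.lstsq(X[boot][:, feats], y[boot], rcond=None)
    learners.append((feats, beta))

def g_bagging(Xnew):
    # average the B base predictions
    return np.mean([Xnew[:, f] @ b for f, b in learners], axis=0)

pred = g_bagging(X)
print(np.mean((y - pred) ** 2))  # in-sample MSE of the ensemble
```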
Sequentialization
Sequential Estimation of the Sample Mean

Given a sample $X_1, X_2, \cdots, X_n$, the mean of the sample can be calculated in two modes:
1. Batch mode: all the points in the sample are used together, all at once:
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}.
\]
2. Sequential mode: update the average as the points arrive. Starting with $\bar{X}_1 = X_1$, the update uses only the present observation $X_t$ and the previous mean $\bar{X}_{t-1}$:
\[
\bar{X}_t = \frac{t-1}{t}\bar{X}_{t-1} + \frac{1}{t}X_t.
\]
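The sequential update is easy to verify against the batch computation (Python sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=200)

# Sequential mode: xbar_t = ((t-1)/t) * xbar_{t-1} + (1/t) * x_t
xbar = x[0]
for t in range(2, len(x) + 1):
    xbar = (t - 1) / t * xbar + x[t - 1] / t

print(np.isclose(xbar, x.mean()))  # -> True
```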
Sequential Estimation of the Sample Variance

Given a random sample $X_1, X_2, \cdots, X_n$, the sample variance can be calculated in two different modes:
1. Batch mode: all the points in the sample are used together, all at once:
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 = \frac{(X_1 - \bar{X}_n)^2 + \cdots + (X_n - \bar{X}_n)^2}{n-1}.
\]
2. Sequential mode: update the sample variance as the points arrive. The update uses only the present observation $X_t$, the previous sample mean $\bar{X}_{t-1}$, and the previous sample variance $S^2_{t-1}$:
\[
(t-1)S_t^2 = (t-2)S_{t-1}^2 + \frac{t-1}{t}(X_t - \bar{X}_{t-1})^2.
\]
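This recurrence (a Welford-style update) can be verified against the batch sample variance (Python sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)

# Recurrence: (t-1) S_t^2 = (t-2) S_{t-1}^2 + ((t-1)/t) * (X_t - xbar_{t-1})^2
xbar, M = x[0], 0.0   # M accumulates (t-1) * S_t^2
for t in range(2, len(x) + 1):
    M += (t - 1) / t * (x[t - 1] - xbar) ** 2   # variance update uses the OLD mean
    xbar = (t - 1) / t * xbar + x[t - 1] / t    # then the mean is updated

S2 = M / (len(x) - 1)
print(np.isclose(S2, x.var(ddof=1)))  # -> True
```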
Sequential Learning for the Perceptron

Given a random sample $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, where $y_i \in \{-1, +1\}$ and $x_i \in \mathbb{R}^p$, the Perceptron learning algorithm is simply:
1. Initialization: set an initial value of $w$, say $w_0 = 0$, and normalize all the $x_i$ so that they have unit length.
2. Update/Learning: repeat until convergence:
   a. Receive observation $t$, i.e. the pair $(x_t, y_t)$.
   b. Make the prediction $h(x_t) = w_t^\top x_t$.
   c. Update: if $y_t h(x_t) < 0$, then $w_{t+1} = w_t + y_t x_t$.
   d. Iterate: $t = t + 1$.
The predicted class of $x$ by the Perceptron is $\hat{f}_{\mathrm{perc}}(x) = \mathrm{sign}(w_t^\top x)$; that is, given $x$, predict class $+1$ if $w_t^\top x > 0$.
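The algorithm fits in a dozen lines of Python. The sketch below (illustrative, not from the slides) runs it on well-separated simulated clusters, with the unit-length normalization mentioned in the initialization step; on linearly separable data the mistake-driven updates stop after finitely many passes:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 40
X = np.vstack([rng.normal(loc=(3, 3), size=(n // 2, 2)),
               rng.normal(loc=(-3, -3), size=(n // 2, 2))])
y = np.array([1] * (n // 2) + [-1] * (n // 2))
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length inputs

w = np.zeros(2)
for _ in range(100):  # passes over the data until convergence
    mistakes = 0
    for xt, yt in zip(X, y):
        if yt * (w @ xt) <= 0:        # misclassified (or on the boundary)
            w = w + yt * xt           # perceptron update
            mistakes += 1
    if mistakes == 0:
        break

pred = np.sign(X @ w)
print((pred == y).mean())  # training accuracy
```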
Kernelization
Simulated Example of Binary Classification

To gain deeper insights, consider the following cloud of points:

[Scatter plot "Simple 2D binary classification" omitted.]

The logistic regression model of (1) should do well on the above.
Homogeneous Doughnut Classification Problem

However, if we have the following cloud of points:

[Scatter plot "Original Homogeneous 2-dimensional Data" omitted: a ring of class A points surrounding a central cluster of class B points.]

How well will logistic regression, under its underlying linearity assumption, fare on this data? Uhmmmm!
Heterogeneous Doughnut Classification Problem

Better yet, if the cloud of points is:

[Scatter plot "Original Heterogeneous 2-dimensional Data" omitted: a ring of class A points surrounding a central region where classes A and B are mixed.]

How well will logistic regression, under its underlying linearity assumption, fare on this data? Pretty pitiful!
Comparison of Performances of Classifiers

                 SVM    QDA-Tr   QDA    Logit   Logit-Tr
Homogeneous      0.98   0.98     0.95   0.52    1.00
Heterogeneous    0.95   0.83     0.75   0.54    0.77

Table: Average accuracy of classifiers over 50 replications of the same task.

- Failure of linear logistic regression.
- Success with a nonlinear mapping.
Question: How is the nonlinearity modeled so that such tremendous improvement is achieved?
Kernel Logistic Regression for Classification

Nonlinear decision boundary: What to do if the linearity assumption implied in equation (2) does not hold? How do we model situations where $\eta(x; \beta)$ is heavily nonlinear?

Kernel Logistic Regression: assumes that the response variable $Y_i$ is related to the explanatory vector $x_i$ through the model
\[
\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta(x_i; v, w), \tag{7}
\]
where
\[
\eta(x_i; v, w) = v + w_1 K(x_i, x_1) + w_2 K(x_i, x_2) + \cdots + w_n K(x_i, x_n) \tag{8}
\]
and
\[
\pi_i = \Pr[Y_i = 1 \mid x_i] = \frac{e^{\eta(x_i;v,w)}}{1 + e^{\eta(x_i;v,w)}} = \frac{1}{1 + e^{-\eta(x_i;v,w)}} = \pi(x_i; v, w).
\]
Regularized Kernel Logistic Regression
\[
\hat{R}_{\mathrm{reg}}(\beta_0, \beta) = \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i(\beta_0 + \beta^\top x_i)\right\}\right] + \lambda \sum_{j=0}^{p} |\beta_j|.
\]
Using kernels, the regularized empirical risk for kernel logistic regression is given by
\[
\hat{R}_{\mathrm{reg}}(g) = \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i g(x_i)\right\}\right] + \frac{\lambda}{2}\|g\|^2_{\mathcal{H}_K},
\]
where
\[
g(x_i) = v + \sum_{j=1}^{n} w_j K(x_i, x_j),
\]
with $g = v + h$, $v \in \mathbb{R}$ and $h \in \mathcal{H}_K$. Here, $\mathcal{H}_K$ is the Reproducing Kernel Hilbert Space (RKHS) engendered by the kernel $K$.
Support Vector Machines and the Hinge Loss

The hinge loss is defined as
\[
\ell(y, f(x)) = (1 - y f(x))_+ = \max(0,\, 1 - y f(x)) = \begin{cases} 0 & \text{if } y f(x) \geq 1 \\ 1 - y f(x) & \text{otherwise.} \end{cases}
\]
[Plot of the hinge loss as a function of $y f(x)$ omitted.]

Notice that the hinge loss is simply a functional formulation of the constraint in the original SVM paradigm.
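The three regimes of the hinge loss (outside the margin, inside the margin, wrong side) are worth seeing on concrete numbers (Python sketch, not from the slides):

```python
import numpy as np

def hinge(y, fx):
    # l(y, f(x)) = max(0, 1 - y * f(x))
    return np.maximum(0.0, 1.0 - y * fx)

assert hinge(1, 2.0) == 0.0    # correct side, outside the margin: no loss
assert hinge(1, 0.5) == 0.5    # correct side, but inside the margin: small loss
assert hinge(-1, 0.5) == 1.5   # wrong side: loss grows linearly
print(hinge(np.array([1, -1]), np.array([0.0, 0.0])))  # -> [1. 1.]
```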
Support Vector Machines and the Hinge Loss

With the hinge loss, the support vector machine classifier can be formulated as
\[
\text{Minimize } E(w) = \frac{1}{n}\sum_{i=1}^{n} \left(1 - y_i(w^\top \Phi(x_i) + b)\right)_+ \quad \text{subject to } \|w\|_2^2 < \tau,
\]
which is equivalent, in regularized (Lagrangian) form, to
\[
w = \underset{w \in \mathbb{R}^q}{\operatorname{argmin}} \; \frac{1}{n}\sum_{i=1}^{n} \left(1 - y_i(w^\top \Phi(x_i) + b)\right)_+ + \lambda\|w\|_2^2.
\]
It is important in this formulation to maintain the responses coded as {−1, +1}.
Most Popular Kernels

The polynomial kernel of Vladimir Vapnik (Professor Vladimir Vapnik is the co-inventor of the support vector machine paradigm):
\[
K(x_i, x_j) = (b\, x_i^\top x_j + a)^d,
\]
where $d$ is the degree of the polynomial, $b$ is the scale parameter, and $a$ is the offset parameter.

The Gaussian radial basis function (RBF) kernel:
\[
K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\omega^2}\right).
\]
This is arguably the most popular kernel because of its flexibility, but also because it is found to provide a decent approximation in many real-life problems.
Most Popular Kernels

The Laplace kernel:
\[
K(x_i, x_j) = \exp\left(-\gamma\|x_i - x_j\|\right).
\]
This is the $\ell_1$-norm counterpart of the RBF kernel. Although apparently similar, the two sometimes produce drastically different results when applied to practical situations.

The hyperbolic tangent kernel:
\[
K(x_i, x_j) = \tanh\left(b\, x_i^\top x_j + a\right).
\]
Other kernels: (a) string kernels used in text categorization, (b) ANOVA kernels, (c) spline kernels, (d) Bessel kernels.

Note: A kernel in this context is essentially a measure of similarity.
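The similarity-measure interpretation is concrete in code: the RBF and Laplace kernels equal 1 for identical points and decay toward 0 with distance. A small Python sketch (illustrative, not from the slides):

```python
import numpy as np

def rbf(xi, xj, omega=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * omega ** 2))

def laplace(xi, xj, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(xi - xj))

def poly(xi, xj, b=1.0, a=1.0, d=2):
    return (b * (xi @ xj) + a) ** d

x = np.array([1.0, 2.0])
z = np.array([1.0, 2.0])

# Identical points have maximal similarity...
assert rbf(x, z) == 1.0 and laplace(x, z) == 1.0
# ...and similarity decays with distance
far = np.array([10.0, -10.0])
assert rbf(x, far) < 1e-6 and laplace(x, far) < 1e-3

print(poly(x, z))  # -> 36.0
```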
SVM on Doughnut Data

With the Laplace kernel, SVM delivers:

[SVM classification plot omitted.]

Performance: Separation is clear, and accuracy is 98.25%, with 174 support vectors out of the n = 400 observations. Pretty good!
SVM on Doughnut Data

With the Gaussian radial basis function kernel, SVM delivers:

[SVM classification plot omitted.]

Performance: Separation is clear, and accuracy is 96.5%, with 165 support vectors out of the n = 400 observations. Pretty good!
SVM on Doughnut Data

With the polynomial kernel, SVM delivers:

[SVM classification plot omitted.]

Performance: Nothing works (confusion), and accuracy is 50.25%, with 387 support vectors out of the n = 400 observations. Pretty pitiful, a total failure!
Appeal of Kernels

Consider the Gaussian RBF kernel for illustration:
\[
K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\omega^2}\right).
\]
Modeling of nonlinearity in extremely high dimensions: the choice of $\omega$ corresponds to the selection of an entire class of functions, which can be very large and very sophisticated.

Interesting intuition: the similarity-measure view provides some insight into why it works:
\[
\lim_{\|x_i - x_j\| \to 0} K(x_i, x_j) = 1 \qquad \text{and} \qquad \lim_{\|x_i - x_j\| \to \infty} K(x_i, x_j) = 0.
\]
Therefore, we have local approximation in the right neighborhood, which is a good thing!
Support Vector Machine (SVM) Classifier

The support vector machine classifier $\hat{f}$ is obtained by solving a constrained minimization problem with objective function
\[
E(w, v, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i
\]
subject to
\[
y_i(w^\top \Phi(x_i) + v) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0, \qquad i = 1, \cdots, n.
\]
Specifically, we have
\[
\hat{f}(x) = \mathrm{sign}\left(\sum_{j=1}^{|s|} \hat{\alpha}_{s_j} y_{s_j} K(x_{s_j}, x) + \hat{v}\right),
\]
where $s_j \in \{1, 2, \cdots, n\}$, $s = \{s_1, s_2, \cdots, s_{|s|}\}$, and $|s| \ll n$. The vectors $x_{s_1}, x_{s_2}, \cdots, x_{s_{|s|}}$ are special and are referred to as support vectors, hence the name support vector machine (SVM).
SVM Classification with Linear Boundary

[Plot omitted: SVM boundary 3x + 2y + 1 = 0, margins 3x + 2y + 1 = ±1.]

Figure: Linear SVM classifier with a relatively small margin.
SVM Classification with Linear Boundary

[Plot omitted: SVM boundary 3x + 2y + 1 = 0, margins 3x + 2y + 1 = ±1.]

Figure: Linear SVM classifier with a relatively large margin.
SVM Classification with Nonlinear Boundary SVM Optimal Separating and Margin Hyperplanes
Figure: Nonlinear SVM classifier with a relatively small margin Ernest Fokoué (RIT)
Kernel Logistic Regression for Classification

Nonlinear decision boundary: what to do if the linearity assumption implied in equation (2) does not hold? How to model situations where η(x; β) is heavily nonlinear?

Kernel logistic regression assumes that the response variable Y_i is related to the explanatory vector x_i through the model

log( π_i / (1 − π_i) ) = η(x_i; v, w)   (9)

where

η(x_i; v, w) = v + w_1 K(x_i, x_1) + w_2 K(x_i, x_2) + · · · + w_n K(x_i, x_n)   (10)

and

π_i = Pr[Y_i = 1 | x_i] = e^{η(x_i; v, w)} / (1 + e^{η(x_i; v, w)}) = 1 / (1 + e^{−η(x_i; v, w)}) = π(x_i; v, w).
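A hypothetical numpy sketch of model (9)–(10), fit by simple gradient ascent on the log-likelihood (the small ridge term and all names are illustrative assumptions, not part of the slide):

```python
import numpy as np

def fit_klr(K, y01, lr=0.1, epochs=2000, lam=1e-3):
    """Gradient ascent for kernel logistic regression.
    K is the n x n kernel matrix, y01 the 0/1 labels; lam is a small
    ridge penalty added purely for numerical stability."""
    n = K.shape[0]
    w, v = np.zeros(n), 0.0
    for _ in range(epochs):
        eta = v + K @ w                        # eta(x_i; v, w), eq. (10)
        pi = 1.0 / (1.0 + np.exp(-eta))        # pi_i = P(Y_i = 1 | x_i)
        resid = y01 - pi
        w += lr * (K @ resid / n - lam * w)    # d loglik / d w = K (y - pi)
        v += lr * resid.mean()
    return w, v

# Toy 1-D data with a Gaussian RBF kernel (omega = 1)
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1.5, 0.5, 40), rng.normal(1.5, 0.5, 40)])[:, None]
y01 = np.array([0] * 40 + [1] * 40)
K = np.exp(-(X - X.T) ** 2 / 2.0)
w, v = fit_klr(K, y01)
pi = 1.0 / (1.0 + np.exp(-(v + K @ w)))
acc = np.mean((pi > 0.5) == y01)
```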
Ridge Regression with Model Search [Fokoue, 2008]

Assume that each w_i has prior density p(w_i | λ) ∝ exp(−λ⁻¹ w_i²). Then direct empirical Bayes yields

E_λ(v, w) = (1/n)‖y − Kw‖² + (λ/2) wᵀw   (11)

This prior is inherently non-sparsity-inducing. However, Grandvalet and Canu (2002) use a sequential frequentist algorithm to find sparse solutions with it. Fokoue (2008) uses simple conjugate hyperpriors

λ ∼ Ga(a, b)   and   1/σ² ∼ Ga(c₀, d₀)

and gets sparse representations via model-space search.
Coordinate-wise Model Description

Define the coordinate-wise model index γ = (γ₁, γ₂, · · · , γ_q), with γ_i ∈ {0, 1}. Consider selecting from among submodels of the form

M_γ : y = K_γ w_γ + ε

where

γ_i = 1 if K(·, x_i) is used by model M_γ, and 0 otherwise.

The model space M, with 2^q − 1 models, is defined as

M = { M_γ : γ ∈ {0, 1}^q and γ ≠ (0, 0, · · · , 0) }.
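For small q the model space M can be enumerated exhaustively; the sketch below scores each M_γ by BIC as an illustrative stand-in for the Bayesian model-space search of the talk (the scoring rule, names, and toy data are assumptions).

```python
import itertools
import numpy as np

def search_submodels(K, y):
    """Exhaustive walk over the 2^q - 1 submodels M_gamma: y = K_gamma w_gamma + eps,
    each scored by BIC; feasible only for small q (the talk uses stochastic search)."""
    n, q = K.shape
    best = (np.inf, None)
    for gamma in itertools.product([0, 1], repeat=q):
        if not any(gamma):
            continue                          # gamma = (0,...,0) is excluded from M
        cols = [i for i, g in enumerate(gamma) if g]
        Kg = K[:, cols]
        w, *_ = np.linalg.lstsq(Kg, y, rcond=None)
        rss = np.sum((y - Kg @ w) ** 2)
        bic = n * np.log(rss / n + 1e-12) + len(cols) * np.log(n)
        if bic < best[0]:
            best = (bic, gamma)
    return best[1]

# Toy design: only columns 1 and 4 actually generate y
rng = np.random.default_rng(3)
K = rng.normal(size=(50, 6))
y = 2.0 * K[:, 1] - 1.5 * K[:, 4] + 0.05 * rng.normal(size=50)
gamma_hat = search_submodels(K, y)
```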
Relevance Vector Machine [Tipping, 2001]

Let the prior for each w_i be

(w_i | α_i) ∼ N(w_i | 0, α_i⁻¹).   (12)

Interestingly, with

(α_i | a, b) ∼ Ga(α_i | a, b),   (13)

the marginal prior for w_i is

p(w_i) = ∫ p(w_i | α_i) p(α_i) dα_i = [ b^a Γ(a + 1/2) / ( (2π)^{1/2} Γ(a) ) ] (b + w_i²/2)^{−(a + 1/2)}.   (14)
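Equation (14) can be sanity-checked numerically: integrate the Gaussian–Gamma product over α on a fine grid and compare with the closed form (the grid bounds and the values of a, b, w are ad hoc assumptions for the check).

```python
import math
import numpy as np

def marginal_closed_form(w, a, b):
    """Closed form (14): the Gaussian-Gamma mixture marginal, a Student-t type density."""
    const = b**a * math.gamma(a + 0.5) / (math.sqrt(2 * math.pi) * math.gamma(a))
    return const * (b + w**2 / 2.0) ** (-(a + 0.5))

def marginal_numeric(w, a, b):
    """Riemann-sum approximation of the integral of p(w|alpha) p(alpha) over alpha."""
    grid = np.linspace(1e-6, 200.0, 400_000)
    dx = grid[1] - grid[0]
    gauss = np.sqrt(grid / (2 * np.pi)) * np.exp(-grid * w**2 / 2.0)   # N(w | 0, 1/alpha)
    gamma_pdf = b**a / math.gamma(a) * grid**(a - 1) * np.exp(-b * grid)
    return float(np.sum(gauss * gamma_pdf) * dx)

a, b, w = 2.0, 1.0, 0.7
closed = marginal_closed_form(w, a, b)
numeric = marginal_numeric(w, a, b)
```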
Relevance Vector Machine [Tipping, 2001]

Tipping (2001) solves the RVM problem:

Maximize
  −(1/2) log det( σ²I + Σ_{j=1}^{n} α_j⁻¹ h_j h_jᵀ )
  −(1/2) yᵀ ( σ²I + Σ_{j=1}^{n} α_j⁻¹ h_j h_jᵀ )⁻¹ y
  + (a − 1) Σ_{j=1}^{n} log α_j − b Σ_{j=1}^{n} α_j   (15)

subject to α_j > 0, j = 1, · · · , n.

RVM is sparser than the LASSO, but RVM optimization is trickier. Fokoue and Goel (2008) link RVM to D-optimality.
Comparison of the Three Prior Structures

Figure: (left) Gaussian; (center) Lasso; (right) RVM: (a, b) = 1, 14

RVM is the most sparse of all. However, it suffers from (a) optimization difficulties and (b) inconsistent estimates. The basic isotropic Gaussian can still help achieve sparse representations. The Lasso is good but not straightforward for Bayesians.
Empirical Framework for Predictive Analytics

For r = 1 to R:
  Draw ℓ items without replacement from D to form T_r
  Train f̂^(r)(·) based on the ℓ items in T_r
  Predict f̂^(r)(x_i) for the m items in V_r = D \ T_r
  Calculate ÊPMSE(f̂^(r)) = (1/m) Σ_{z_i ∈ V_r} ℓ(y_i, f̂^(r)(x_i))
End

Compute the average EPMSE for f̂, namely

average{EPMSE(f̂)} = (1/R) Σ_{r=1}^{R} ÊPMSE(f̂^(r)).
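The loop above translates directly into code. In this sketch the function names and the toy mean-only "learner" are hypothetical; any fitting routine and loss can be plugged in.

```python
import numpy as np

def average_epmse(X, y, fit, loss, ell, R=100, seed=0):
    """Random hold-out estimate of prediction error: R times, train on
    ell points (T_r) and average the loss on the held-out rest (V_r = D \\ T_r)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(R):
        idx = rng.permutation(n)
        tr, va = idx[:ell], idx[ell:]
        model = fit(X[tr], y[tr])
        scores.append(np.mean(loss(y[va], model(X[va]))))
    return float(np.mean(scores))

# Toy check: a learner that always predicts the training mean, squared-error loss
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)
fit_mean = lambda Xt, yt: (lambda Xv: np.full(len(Xv), yt.mean()))
sq_loss = lambda yt, yp: (yt - yp) ** 2
err = average_epmse(X, y, fit_mean, sq_loss, ell=150, R=20)
# err should be close to Var(y), since the mean is the best constant predictor
```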
Comparison of Learning Methods on Accent Detection

Table: Accent data from research work by Zichen Ma

Accent     Female   Male   Total
US             90     75     165
Non-US         90     75     165
Total         180    150     330
Accent Detection in Time Domain

Table: Number of Principal Components

# PC   Eigenvalue   Percentage (%)
206       75.3270            89.94
207       74.9464            90.08
...           ...              ...
249       52.0964            94.97
250       51.8996            95.07
Accent Detection in Time Domain

Table: Training Errors on Time Domain

Method   Error (%)   Accuracy (%)
LDA           7.88          92.12
SVM           4.55          95.45
k-NN         24.85          75.15
Accent Detection in Time Domain

Figure: Comparison of the average test error (LDA, SVM-RBF, SVM-Poly) for accent detection based on the time domain. From research work by Zichen Ma.
Accent Detection in the MFCCs Domain

Table: Average accuracy of classifiers

# MFCCs   LDA      QDA      SVM(RBF)   SVM(PLY)   kNN
12        0.7353   0.8112   0.8208     0.8097     0.8548
19        0.7503   0.8647   0.8507     0.8851     0.9098
26        0.8063   0.9224   0.9080     0.9379     0.9398
33        0.8319   0.9543   0.9352     0.9509     0.9586
39        0.8260   0.9383   0.9223     0.9438     0.9605
Accent Detection using MFCCs

Figure: Comparison of the average accuracy (LDA, QDA, SVM-RBF, SVM-PLY, k-NN) for accent detection as a function of the number of MFCCs. From research work by Zichen Ma.
Comparison of Learning Methods

Table: Computations made with The Mighty R

Dataset   LDA      SVM      kNN      CART     adaBoost   rForest   NeuralNet   GaussPR   Logistic
Musk      0.2227   0.1184   0.1922   0.2450   0.1375     0.1152    0.1479      0.1511    0.2408
Pima      0.2193   0.2362   0.3094   0.2507   0.2243     0.2304    0.2570      0.2304    0.2186
Crabs     0.0452   0.0677   0.0938   0.1970   0.1208     0.1097    0.0350      0.0702    0.0363
Comparison of Learning Methods

Figure: Comparison of the average prediction error over R = 100 replications on the Musk data (LDA, SVM, CART, rForest, Gauss, kNN, Boost, NNET, Logistic).
Comparison of Learning Methods

Figure: Comparison of the average prediction error over R = 100 replications on the Musk data, plotted by method (LDA, SVM, kNN, CART, Boost, rForest, NNet, Gauss, Logistic).
Comparison of Learning Methods

Figure: Comparison of the average prediction error over R = 100 replications on the lymphoma data (LDA, SVM, CART, rForest, Gauss, kNN, Boost, Logistic, RDA).
Comparison of Learning Methods

Figure: Comparison of the average prediction error over R = 100 replications on the lymphoma data, plotted by method (LDA, SVM, kNN, CART, Boost, rForest, Gauss, Logistic, RDA).
No Free Lunch Theorem
Theorem (No Free Lunch). No learning method is universally superior to all other methods on all datasets. In other words, if a learning method is presented with a data set whose inherent patterns violate its assumptions, then that learning method will under-perform.

In short: there is no universally superior learning method that outperforms all others on all possible data, no matter how sophisticated the method may appear to be.
No Free Lunch Theorem

Traditional approaches typically break down; even when they appear to work, traditional methods like trees break down.
Off-the-shelf dimensionality reduction does not seem to work.
Domain-specific dimensionality reduction/feature extraction, like MFCCs, works great for sounds.
Regularization and aggregation sound great as devices for variance reduction, but must be combined.
Subspace learning appears to be best suited, as it combines both regularization/selection and aggregation.
Tuning is very difficult.
Existing Computing Tools

R packages for big data:

library(biglm)
library(foreach)
library(glmnet)
library(caret)
library(spikeslab)
library(bms)
library(kernlab)
library(randomForest)
library(ada)
library(audio)
library(rpart)
Concluding Remarks and Recommendations

Applications: Sharpen your intuition and your common sense by questioning things, reading about interesting open applied problems, and attempting to solve as many problems as possible.

Methodology: Read and learn about the fundamentals of statistical estimation and inference, get acquainted with the most commonly used methods and techniques, and consistently ask yourself and others what the natural extensions of those techniques could be.

Computation: Learn and master at least two programming languages. I strongly recommend getting acquainted with R: http://www.r-project.org

Theory: "Nothing is more practical than a good theory" (Vladimir N. Vapnik). When it comes to data mining, machine learning, and predictive analytics, those who truly understand the inner workings of algorithms and methods always solve problems better.