Machine Learning: Méthodes d'Apprentissage Statistique quand n ≪ p

Ernest Fokoué, Associate Professor of Statistics, Rochester Institute of Technology, Rochester, New York, USA
@ErnestFokoue

Invited Presentation, Deuxièmes Journées de Statistiques, Université de Bretagne Sud, FRANCE
Friday, November 21, 2014


Acknowledgments and Expression of Gratitude

I would like to express my sincere and heartfelt thanks to Professor Evans Gouno, Université de Bretagne Sud.


Statistical Speaker Accent Recognition

Sentence: Humanity as a whole is at a threshold of a monumental shift in consciousness. You can see it everywhere. It is breathtaking!

Figure: Waveforms (amplitude vs. time): (L) Loud Non US Female (R) Normal Non US Female.

Statistical Recognition of Dialect

Sentence: Humanity as a whole is at a threshold of a monumental shift in consciousness. You can see it everywhere. It is breathtaking!

Figure: Waveforms (amplitude vs. time): (L) Normal US Female (R) Low US Male.

Statistical Speaker Accent Recognition

Sentence: Humanity as a whole is at a threshold of a monumental shift in consciousness. You can see it everywhere. It is breathtaking!

Figure: US Male Normal: (L) Time Domain (R) Spectrogram.

Statistical Recognition of Dialect

Consider $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{-1, +1\}$, and the set
$$\mathcal{D} = \left\{ (X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n) \right\}$$
where
$$Y_i = \begin{cases} +1 & \text{if person $i$ is a Native US speaker} \\ -1 & \text{if person $i$ is a Non Native US speaker} \end{cases}$$
and $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ is the time domain representation of his/her reading of an English sentence. The design matrix is
$$X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}$$

Example 1: Statistical Recognition of Dialect

Consider the design matrix $X$ above. At RIT, we recently collected voices from n = 117 people. Each sentence required about 11 seconds to be read. At a sampling rate of 44,100 Hz, each sentence therefore requires a vector of dimension roughly p = 540,000 in the time domain. We therefore have a gravely underdetermined system with $X \in \mathbb{R}^{n \times p}$ where n ≪ p. Here, n = 117 and p = 540,000.

Design Matrix for an Overdetermined System

Consider the design matrix
$$X = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}$$
with many more rows (observations) than columns (variables). We therefore have a gravely overdetermined system with $X \in \mathbb{R}^{n \times p}$ where n ≫ p.

Design Matrix for an Underdetermined System

Consider the design matrix
$$X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}$$
with many more columns (variables) than rows (observations). With $X \in \mathbb{R}^{n \times p}$ where n ≪ p, we are in the presence of an underdetermined system. Data sets of this type are very challenging and often require both regularization and parallelization.

Basic Formulation of Binary Pattern Recognition

Consider $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{-1, +1\}$, and the set
$$\mathcal{D} = \left\{ (X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n) \right\}$$
where
$$Y_i = \begin{cases} +1 & \text{if observation $i$ belongs to the first group} \\ -1 & \text{if observation $i$ does not belong to the first group} \end{cases}$$
and $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ is the explanatory vector for observation $i$. The design matrix is
$$X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np}
\end{pmatrix}$$

Examples of Datasets and Their Characteristics

  Dataset               n      p        n/p       log(|cov(X)|)   c
  crabs                 200    5        40.0      -10.04          2
  pima                  532    7        76.0      -1.60           2
  spam                  4601   57       80.7      -15.24          2
  musk                  476    166      2.9       -541.15         2
  lymphoma              180    661      3e-01     −∞              3
  lung cancer           197    1000     2e-01     −∞              4
  breast cancer (A)     97     1213     8e-02     −∞              3
  colon cancer          62     2000     3e-02     −∞              2
  leukemia              72     3571     2e-02     −∞              2
  brain cancer          42     5597     7e-03     −∞              5
  breast cancer (W)     49     7129     7e-03     −∞              2
  accent recognition    117    5e+05    2e-04     −∞              2

Table: The last column c is the number of classes in the pattern recognition task. Normally we need to compute the class-conditional covariance matrices.

Brief Reminder of Logistic Regression

We have $X_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \{0, 1\}$, and the data set
$$\mathcal{D} = \left\{ (X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n) \right\}$$
Logistic Regression assumes that the response variable $Y_i$ is related to the explanatory vector $\mathbf{x}_i$ through the model
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta(\mathbf{x}_i; \boldsymbol{\beta}) \qquad (1)$$
where
$$\eta(\mathbf{x}_i; \boldsymbol{\beta}) = \mathbf{x}_i^\top \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \qquad (2)$$
and
$$\pi_i = \Pr[Y_i = 1 \mid \mathbf{x}_i] = \frac{e^{\eta(\mathbf{x}_i; \boldsymbol{\beta})}}{1 + e^{\eta(\mathbf{x}_i; \boldsymbol{\beta})}} = \frac{1}{1 + e^{-\eta(\mathbf{x}_i; \boldsymbol{\beta})}} = \pi(\mathbf{x}_i; \boldsymbol{\beta})$$
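To make the model concrete, here is a minimal R sketch of the linear logistic model of equations (1)-(2), fit with glm() on simulated data (the data-generating coefficients below are arbitrary choices for illustration):

## Minimal sketch: linear logistic regression via glm() on simulated data
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
eta  <- -0.5 + 1.2 * x1 - 0.8 * x2          # eta(x; beta) = beta0 + beta1*x1 + beta2*x2
prob <- 1 / (1 + exp(-eta))                 # pi_i = Pr[Y_i = 1 | x_i]
y    <- rbinom(n, size = 1, prob = prob)    # responses in {0, 1}

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
coef(fit)                                   # estimates of (beta0, beta1, beta2)
head(fitted(fit))                           # fitted probabilities pi(x_i; beta-hat)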

The Loss Function or Negative Log-Likelihood

Using the traditional $\{0, 1\}$ labelling, the empirical risk for the binary linear logistic regression model is given by
$$\hat{R}(\beta_0, \boldsymbol{\beta}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i (\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i) - \log\left(1 + \exp(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i)\right) \right],$$
which is really
$$\hat{R}(\beta_0, \boldsymbol{\beta}) = -\frac{1}{n} \log L(\boldsymbol{\beta}).$$
However, if we use the labelling $\{-1, +1\}$ as with the SVM, the empirical risk for the linear logistic regression model is given by
$$\hat{R}(\beta_0, \boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} \log\left[1 + \exp\left(-y_i (\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i)\right)\right].$$

Most Common Link Functions in Binary Regression

We can write $\eta(\mathbf{x}_i) = F^{-1}(\pi(\mathbf{x}_i)) = g(\pi(\mathbf{x}_i)) = g(\mathrm{E}(Y_i \mid \mathbf{x}_i))$. The table below gives the most commonly used link functions in binary regression, along with their corresponding cdfs.

  Model     Link function $g(v)$                        cdf $F(u)$
  Probit    $\Phi^{-1}(v)$                              $\Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz$
  Compit    $\log[-\log(1-v)]$                          $1 - e^{-e^{u}}$
  Cauchit   $\tan\left[\pi v - \frac{\pi}{2}\right]$    $\frac{1}{\pi}\left[\tan^{-1}(u) + \frac{\pi}{2}\right]$
  Logit     $\log\left[\frac{v}{1-v}\right]$            $\Lambda(u) = \frac{1}{1 + e^{-u}}$

Table: Source: Gunduz and Fokoue (2013).

Note: The R function glm() used for performing logistic regression also allows the above link functions.
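As a small illustration of the note above, the same simulated binary fit can be run under each link of the table with glm(); in R the compit link is called "cloglog":

## Minimal sketch: fitting the same binary model under the four link functions
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, 1, 1 / (1 + exp(-(0.5 + 2 * x))))
links <- c("logit", "probit", "cloglog", "cauchit")   # cloglog is the compit link
fits  <- lapply(links, function(l) glm(y ~ x, family = binomial(link = l)))
names(fits) <- links
sapply(fits, AIC)                                     # compare the four fits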

Figure: Source: Gunduz and Fokoue (2013). (Left) Densities corresponding to the link functions (probit, compit, cauchit, logit); (Right) cdfs corresponding to the link functions. The similarities are clear around the center of the distributions, but some differences can be seen at the tails.

Classification and Regression Trees

Given $\mathcal{D} = \{(\mathbf{x}_1, Y_1), \cdots, (\mathbf{x}_n, Y_n)\}$, with $\mathbf{x}_i \in \mathcal{X}$, $Y_i \in \{1, \cdots, k\}$. If $T$ denotes the tree represented by the partitioning of $\mathcal{X}$ into $q$ regions $R_1, R_2, \cdots, R_q$ such that $\mathcal{X} = \cup_{\ell=1}^{q} R_\ell$, then all the observations in a given terminal node (region) are assigned the same label, namely
$$c_\ell = \underset{j \in \{1, \cdots, k\}}{\arg\max} \left\{ \frac{1}{|R_\ell|} \sum_{\mathbf{x}_i \in R_\ell} I(Y_i = j) \right\}$$
As a result, for a new point $\mathbf{x}$, its predicted class is given by
$$\hat{Y}_{\mathrm{Tree}} = \hat{f}_{\mathrm{Tree}}(\mathbf{x}) = \sum_{\ell=1}^{q} c_\ell\, I_\ell(\mathbf{x}),$$
where $I_\ell(\cdot)$ is the indicator function of $R_\ell$, i.e. $I_\ell(\mathbf{x}) = 1$ if $\mathbf{x} \in R_\ell$ and $I_\ell(\mathbf{x}) = 0$ if $\mathbf{x} \notin R_\ell$. Trees are known to be notoriously unstable.
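As a small illustration of the piecewise-constant fit $\hat{f}_{\mathrm{Tree}}$, here is a minimal R sketch using rpart (listed among the tools at the end of the deck) on the built-in iris data; the data set and settings are illustrative choices, not the speaker's example:

## Minimal sketch: a classification tree on the iris data (k = 3 classes)
library(rpart)
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                           # the regions R_l and their majority labels c_l
pred <- predict(tree, iris, type = "class")
mean(pred != iris$Species)            # apparent (training) misclassification rate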

Typical Aspects of Big/Complex/Massive Data

1. Large p, small n: p ≫ n
2. Large n, small p: n ≫ p
3. Large p and large n
4. Presence of multicollinearity
5. Different measurement scales
6. Heterogeneous input space
7. Non-local storage
8. Dynamic/sequential data (streams)
9. Unstructured data (text)
10. Potential nonlinear underlying pattern

Common Challenges with Big Data

1. Statistical
2. Computational
3. Mathematical
4. Epistemological

Aspects of Big/Complex/Massive Data

1. Number of variables observed for each entity (p)
2. Number of entities observed (n)
3. Number of atoms (basis functions) in the model (k)

Different scenarios in linear model situations:

             small n                        large n
  small p    Traditional                    Sequential estimation
  large p    Regularization (Constraints)   Regularized Sequential

Questions:
1. How small is small? How large is large?
2. The ratio n/p is crucial.

Table: A taxonomy crossing sample size (n > 1000 versus n ≤ 1000) with the magnitude of p, ranging from information abundance (n ≫ p) to much smaller p. In this taxonomy, cells A and D pose a lot of challenges.

Introduction to Regression Analysis

We have $\mathbf{x}_i = (x_{i1}, \cdots, x_{ip})^\top \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$, and the data set
$$\mathcal{D} = \left\{ (\mathbf{x}_1, Y_1), (\mathbf{x}_2, Y_2), \cdots, (\mathbf{x}_n, Y_n) \right\}$$
We assume that the response variable $Y_i$ is related to the explanatory vector $\mathbf{x}_i$ through a function $f$ via the model
$$Y_i = f(\mathbf{x}_i) + \xi_i, \qquad i = 1, \cdots, n \qquad (3)$$
- The explanatory vectors $\mathbf{x}_i$ are fixed (non-random).
- The regression function $f : \mathbb{R}^p \to \mathbb{R}$ is unknown.
- The error terms $\xi_i$ are iid Gaussian, i.e. $\xi_i \overset{iid}{\sim} N(0, \sigma^2)$.

Goal: We seek to estimate the function $f$ using the data in $\mathcal{D}$.

Formulation of the Regression Problem

Let $X$ and $Y$ be two random variables such that $\mathrm{E}[Y] = \mu$ and $\mathrm{E}[Y^2] < \infty$.

Goal: Find the best predictor $f(X)$ of $Y$ given $X$.

Important questions:
- How does one define "best"?
- Is the very best attainable in practice?
- What does the function $f$ look like? (Function class)
- How do we select a candidate from the chosen class of functions?
- How hard is it computationally to find the desired function?

Loss Functions

1. When $f(X)$ is used to predict $Y$, a loss is incurred. Question: how is such a loss quantified? Answer: define a suitable loss function.
2. Common loss functions in regression:
   - Squared error ($\ell_2$) loss: $\ell(Y, f(X)) = (Y - f(X))^2$. The $\ell_2$ loss is by far the most used (prevalent) because of its differentiability; unfortunately, it is not very robust to outliers.
   - Absolute error ($\ell_1$) loss: $\ell(Y, f(X)) = |Y - f(X)|$. The $\ell_1$ loss is more robust to outliers, but not differentiable at zero.
3. Note that $\ell(Y, f(X))$ is a random variable.

Risk Functionals and Cost Functions

1. Definition of a risk functional:
$$R(f) = \mathrm{E}[\ell(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(\mathbf{x}))\, p_{XY}(\mathbf{x}, y)\, d\mathbf{x}\, dy$$
$R(f)$ is the expected loss over all pairs of the cross space $\mathcal{X} \times \mathcal{Y}$.
2. Ideally, one seeks the best out of all possible functions, i.e.,
$$f^*(X) = \underset{f}{\arg\min}\ R(f) = \underset{f}{\arg\min}\ \mathrm{E}[\ell(Y, f(X))]$$
$f^*(\cdot)$ is such that $R^* = R(f^*) = \min_f R(f)$.
3. This ideal function cannot be found in practice: because the distributions are unknown, it is impossible to form an expression for $R(f)$.

Cost Functions and Risk Functionals

Theorem: Under regularity conditions,
$$f^*(X) = \mathrm{E}[Y \mid X] = \underset{f}{\arg\min}\ \mathrm{E}[(Y - f(X))^2]$$
Under the squared error loss, the optimal function $f^*$ that yields the best prediction of $Y$ given $X$ is none other than the expected value of $Y$ given $X$. Since we know neither $p_{XY}(\mathbf{x}, y)$ nor $p_X(\mathbf{x})$, the conditional expectation
$$\mathrm{E}[Y \mid X] = \int_{\mathcal{Y}} y\, p_{Y|X}(y)\, dy = \int_{\mathcal{Y}} y\, \frac{p_{XY}(\mathbf{x}, y)}{p_X(\mathbf{x})}\, dy$$
cannot be directly computed.

Empirical Risk Minimization

Let $\mathcal{D} = \{(X_1, Y_1), (X_2, Y_2), \cdots, (X_n, Y_n)\}$ represent an iid sample. The empirical version of the risk functional is
$$\hat{R}(f) = \widehat{\mathrm{E}}[(Y - f(X))^2] = \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(X_i))^2$$
It turns out that $\hat{R}(f)$ provides an unbiased estimator of $R(f)$. We therefore seek the best by the empirical standard,
$$\hat{f}^*(X) = \underset{f}{\arg\min}\ \hat{R}(f) = \underset{f}{\arg\min} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(X_i))^2 \right\}$$
Since it is impossible to search over all possible functions, it is usually crucial to choose the "right" function space.

Function Spaces

For the function estimation task, for instance, one could assume that the input space $\mathcal{X}$ is a closed and bounded interval of $\mathbb{R}$, i.e. $\mathcal{X} = [a, b]$, and then consider estimating the dependencies between $x$ and $y$ from within the space $\mathcal{F}$ of all bounded functions on $\mathcal{X} = [a, b]$, i.e.,
$$\mathcal{F} = \{ f : \mathcal{X} \to \mathbb{R} \mid \text{there exists } B > 0 \text{ such that } |f(x)| \le B \text{ for all } x \in \mathcal{X} \}.$$
One could even be more specific and require the functions of the above $\mathcal{F}$ to be continuous, so that the space to search becomes
$$\mathcal{F} = \{ f : [a, b] \to \mathbb{R} \mid f \text{ is continuous} \} = C([a, b]),$$
which is the well-known space of all continuous functions on a closed and bounded interval $[a, b]$. This is indeed a very important function space.

Space of Univariate Polynomials

In fact, polynomial regression consists of searching a function space that is a subspace of $C([a, b])$. In other words, when we are doing the very common polynomial regression, we are searching the space
$$\mathcal{P}([a, b]) = \{ f \in C([a, b]) \mid f \text{ is a polynomial with real coefficients} \}.$$
It is interesting to note that Weierstrass proved that $\mathcal{P}([a, b])$ is dense in $C([a, b])$. One considers the space of all polynomials of degree at most $k$, i.e.,
$$\mathcal{F} = \mathcal{P}_k([a, b]) = \left\{ f \in C([a, b]) \;\middle|\; \exists\, \alpha_0, \alpha_1, \cdots, \alpha_k \in \mathbb{R} \text{ such that } f(x) = \sum_{j=0}^{k} \alpha_j x^j,\ \forall x \in [a, b] \right\}$$

Empirical Risk Minimization in F

Having chosen a class $\mathcal{F}$ of functions, we can now seek
$$\hat{f}(X) = \underset{f \in \mathcal{F}}{\arg\min}\ \hat{R}(f) = \underset{f \in \mathcal{F}}{\arg\min} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(X_i))^2 \right\}$$
We are seeking the best function in the chosen function space. For instance, if the function space is the space of all polynomials of degree $k$ on some interval $[a, b]$, finding $\hat{f}$ boils down to estimating the coefficients of the polynomial using the data; this is exactly the least squares estimation of the polynomial coefficients.
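A minimal R sketch of this empirical risk minimization over $\mathcal{P}_k([a,b])$: ordinary least squares on polynomial terms, here with k = 3 and simulated data (an illustrative setup, not taken from the slides):

## Minimal sketch: empirical risk minimization over polynomials of degree k = 3
set.seed(2)
n <- 100
x <- runif(n, -1, 1)
y <- sin(pi * x) + rnorm(n, sd = 0.2)            # Y_i = f(x_i) + xi_i with a smooth f
fit <- lm(y ~ poly(x, degree = 3, raw = TRUE))   # least squares estimates of alpha_0,...,alpha_3
coef(fit)
mean(residuals(fit)^2)                           # the minimized empirical risk R-hat(f-hat)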

Effect of the Bias-Variance Dilemma on Prediction

Figure: Optimal prediction is achieved at the point of the bias-variance trade-off.

Ridge Regression Estimator

Given $\{(\mathbf{x}_i, y_i),\ i = 1, \cdots, n\}$ and assuming the multiple linear regression model
$$y_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad (4)$$
where $\mathbf{x}_i^\top = (x_{i1}, x_{i2}, \cdots, x_{ip})$ and $\boldsymbol{\beta} = (\beta_1, \beta_2, \cdots, \beta_p)^\top$, the ridge estimator of $\boldsymbol{\beta}$ is the solution to the minimization problem
$$\hat{\boldsymbol{\beta}}^{(\mathrm{ridge})} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + g \sum_{j=1}^{p} \beta_j^2 \right\},$$
which yields what we intuitively derived earlier, namely
$$\hat{\boldsymbol{\beta}}^{(\mathrm{ridge})} = (X^\top X + g I_p)^{-1} X^\top \mathbf{y}.$$
Note: $g$ controls the trade-off between bias and variance; $g$ is often estimated by cross-validation.
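A minimal R sketch of the closed-form ridge estimator above on simulated, centered data (the value g = 1 is an arbitrary illustrative choice; in practice g would be chosen by cross-validation, as noted):

## Minimal sketch: ridge estimator (X'X + g I_p)^{-1} X'y
set.seed(3)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))
y <- drop(X %*% beta) + rnorm(n)

g <- 1                                                    # regularization parameter
beta_ridge <- solve(crossprod(X) + g * diag(p), crossprod(X, y))
cbind(true = beta, ridge = round(beta_ridge, 3))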

Bayesian Formulation of Ridge Regression

If we assume a multivariate Gaussian prior on $\boldsymbol{\beta}$, with mean vector $\boldsymbol{\beta}_0$ and variance-covariance matrix $W_0$, then we may write $\boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, W_0)$. Combining the prior with the likelihood $\mathbf{y} \sim N(X\boldsymbol{\beta}, \sigma^2 I)$ yields
$$\hat{\boldsymbol{\beta}}^{(\mathrm{Bayes})} = (\sigma^{-2} X^\top X + W_0^{-1})^{-1} (\sigma^{-2} X^\top \mathbf{y} + W_0^{-1} \boldsymbol{\beta}_0).$$
Relationship between ridge regression and Bayesian estimation: if we choose $\boldsymbol{\beta}_0 = (0, 0, \cdots, 0)^\top$ and $W_0 = \sigma_0^2 I$, then
$$\hat{\boldsymbol{\beta}}^{(\mathrm{Bayes})} = (X^\top X + g I)^{-1} X^\top \mathbf{y} = \hat{\boldsymbol{\beta}}^{(\mathrm{ridge})}, \qquad \text{where } g = \frac{\sigma^2}{\sigma_0^2}.$$

LASSO as Constrained Optimization

LASSO: Least Absolute Shrinkage and Selection Operator.
$$\hat{\boldsymbol{\beta}}^{(\mathrm{lasso})} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 \right\} \quad \text{subject to } \sum_{j=1}^{p} |\beta_j| \le s, \qquad (5)$$
which is also formulated as
$$\hat{\boldsymbol{\beta}}^{(\mathrm{lasso})} = \underset{\boldsymbol{\beta} \in \mathbb{R}^p}{\arg\min} \left\{ \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2 + 2\gamma \|\boldsymbol{\beta}\|_1 \right\}, \qquad (6)$$
where $\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^{p} |\beta_j|$ is the $\ell_1$-norm of $\boldsymbol{\beta}$.

- Most popular technique in current scientific circles.
- Computationally harder, but performs both shrinkage and selection.
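A minimal R sketch of the lasso of equation (6) via the glmnet package, with the penalty level chosen by cross-validation (simulated data, illustrative only):

## Minimal sketch: the lasso via glmnet (alpha = 1) with cross-validated penalty
library(glmnet)
set.seed(4)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, -2, rep(0, p - 2))) + rnorm(n)
cvfit <- cv.glmnet(X, y, alpha = 1)
coef(cvfit, s = "lambda.min")          # sparse: many coefficients are exactly 0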

Regularized Kernel Ridge Regression

$$\hat{R}_{\mathrm{reg}}(\beta_0, \boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i)\right\}\right] + \lambda \sum_{j=0}^{p} |\beta_j|$$
Using kernels, the regularized empirical risk for kernel logistic regression is given by
$$\hat{R}_{\mathrm{reg}}(g) = \frac{1}{n} \sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i\, g(\mathbf{x}_i)\right\}\right] + \frac{\lambda}{2} \|g\|^2_{\mathcal{H}_K}$$
where
$$g(\mathbf{x}_i) = v + \sum_{j=1}^{n} w_j K(\mathbf{x}_i, \mathbf{x}_j),$$
with $g = v + h$, $v \in \mathbb{R}$ and $h \in \mathcal{H}_K$. Here, $\mathcal{H}_K$ is the Reproducing Kernel Hilbert Space (RKHS) engendered by the kernel $K$.

Elastic Net and the Generalized Linear Model

One can combine, for instance, the ridge and the lasso penalties into one (Zou and Hastie), yielding the so-called Elastic Net:
$$\hat{R}_{\mathrm{reg}}(g) = \frac{1}{n} \sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i\, g(\mathbf{x}_i)\right\}\right] + \lambda\, \mathrm{Pen}_\alpha(\boldsymbol{\beta})$$
where
$$\mathrm{Pen}_\alpha(\boldsymbol{\beta}) = \frac{1}{2}(1 - \alpha)\|\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_1$$
so that
- $\alpha = 1$: LASSO
- $\alpha = 0$: Ridge
- $0 < \alpha < 1$: grouping/shrinkage and selection

library(glmnet)
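A minimal R sketch of the elastic net with glmnet, using the logistic loss of this slide and the mixing value alpha = 0.5 as an arbitrary illustration (alpha = 1 recovers the lasso, alpha = 0 the ridge):

## Minimal sketch: elastic net (alpha = 0.5) for binary data via glmnet
library(glmnet)
set.seed(5)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
eta <- drop(X[, 1:3] %*% c(2, -2, 1))
y <- rbinom(n, 1, 1 / (1 + exp(-eta)))
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
coef(cvfit, s = "lambda.1se")          # grouped shrinkage and selection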

Theoretical Aspects of Statistical Learning

With labels taken from $\{-1, +1\}$, we have
$$\hat{R}_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} |y_i - f(\mathbf{x}_i)| = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{y_i \ne f(\mathbf{x}_i)\}}$$
For every $f \in \mathcal{F}$, and $n > h$, with probability at least $1 - \eta$, we have
$$R(f) \le \hat{R}_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\log\frac{2n}{h} + 1\right) - \log\frac{\eta}{4}}{n}}$$
In the above formula, $h$ is the VC (Vapnik-Chervonenkis) dimension of the space $\mathcal{F}$ of functions from which $f$ is taken.

Fundamental Challenges of Massive Data

Ill-posedness and the inevitability of regularization. With massive data, it is often the case that either the number of parameters to be estimated is larger than the number of observations used to estimate them, or the intrinsic dimensionality of the solution space is far less than the observed (apparent) dimensionality. As a result, at least one of Hadamard's well-posedness conditions is violated; the problem is ill-posed in Hadamard's sense, i.e. it violates one or more of the three well-posedness conditions:
1. The solution exists.
2. The solution is unique.
3. The solution is stable.
Extra conditions must be imposed to achieve well-posedness, and extensions of old tools are required to get at least some feasible solutions.

Computational Challenges with Massive Data

1. Large n or large p can make storage in RAM literally impossible.
2. Large n or large p leads to instability of iterative numerical methods.
3. Large n or large p increases the computational complexity of operations like estimation, model space search, and Bayesian posterior simulation.
4. Nonhomogeneity of the input space creates representation challenges and requires hybridization.

Mathematical Challenges with Massive Data

1. Hadamard ill-posedness forces the need for regularization or Bayesianization to achieve the bias-variance trade-off.
2. Large p leads to the curse of dimensionality, e.g. in
   - k-Nearest Neighbors
   - Nonparametric density estimation
3. Large n or large p causes difficulty in approximation, as in finding a good hypothesis space.

Approaches to Dealing with Massiveness

1. Deletion and Imputation
2. Vectorization
3. Standardization
4. Reduction/Projection
5. Selection/Extraction
6. Randomization
7. Regularization
8. Kernelization
9. Aggregation/Ensemblization
10. Parallelization
11. Sequentialization

Regularization


Basics of Regularized Discriminant Analysis

Let $z_{ij} = 1$ if observation $i$ is from class $j$ and $0$ otherwise.

Step 1 - Basic estimation:
$$\hat{\Sigma}_j = \frac{1}{n} \sum_{i=1}^{n} z_{ij} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)^\top, \qquad \text{where } \hat{\boldsymbol{\mu}}_j = \frac{1}{n} \sum_{i=1}^{n} z_{ij} \mathbf{x}_i$$
Step 2 - Regularization:
$$\hat{\Sigma}_j(\lambda) = \lambda \hat{\Sigma}_j + (1 - \lambda) I_p$$
Step 3 - Further shrinkage/regularization:
$$\hat{\Sigma}_j(\lambda, \gamma) = (1 - \gamma)\, \hat{\Sigma}_j(\lambda) + \gamma\, \frac{\mathrm{tr}(\hat{\Sigma}_j(\lambda))}{p}\, I_p$$
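A minimal R sketch of the three RDA steps for a single class j, using the usual sample mean and covariance of the class-j observations in place of the z_ij-weighted sums; the values of lambda and gamma are arbitrary (in practice they are tuned, e.g. by cross-validation):

## Minimal sketch: the three regularization steps of RDA for one class j
set.seed(6)
p  <- 5
Xj <- matrix(rnorm(30 * p), 30, p)                 # observations from class j
mu_j    <- colMeans(Xj)                            # Step 1: class mean
Sigma_j <- cov(Xj)                                 # Step 1: class covariance
lambda <- 0.7; gamma <- 0.2
Sigma_lambda <- lambda * Sigma_j + (1 - lambda) * diag(p)              # Step 2
Sigma_reg    <- (1 - gamma) * Sigma_lambda +
                gamma * (sum(diag(Sigma_lambda)) / p) * diag(p)        # Step 3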

Ensemblization / Aggregation / Combination / Averaging

Random Forest via Random Subspace Learning

Choose a base learner class, with generic learner denoted $\hat{g}(\cdot)$, and choose an estimation technique/method.

For b = 1 to B:
  - Draw with replacement from $\mathcal{D}$ a bootstrap sample $\mathcal{D}^{(b)} = \{\mathbf{z}_1^{(b)}, \cdots, \mathbf{z}_n^{(b)}\}$.
  - Draw without replacement from $\{1, 2, \cdots, p\}$ a subset $\{i_1^{(b)}, \cdots, i_d^{(b)}\}$ of $d$ variables.
  - Drop the unselected variables from the bootstrap sample $\mathcal{D}^{(b)}$, so that $\mathcal{D}^{(b)}_{\mathrm{sub}}$ is $d$-dimensional.
  - Build the $b$th base learner $\hat{g}^{(b)}$ based on $\mathcal{D}^{(b)}_{\mathrm{sub}}$.
End

Use the ensemble $\{\hat{g}^{(b)},\ b = 1, \cdots, B\}$ to form the estimator
$$\hat{g}^{(\mathrm{bagging})}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{g}^{(b)}(\mathbf{x}).$$
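A minimal R sketch in the same spirit, using the randomForest package (listed among the tools at the end of the deck); note that randomForest draws a random subset of mtry variables at each split rather than once per bootstrap sample, so it is a close relative of, not an exact match to, the random subspace scheme above:

## Minimal sketch: a random forest (bootstrap + random variable subsets) on iris
library(randomForest)
set.seed(7)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
rf                                 # prints the out-of-bag error estimate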

Sequentialization


Sequential Estimation of the Sample Mean

Given $X_1, X_2, \cdots, X_n$, the mean of the sample can be calculated in two modes.

1. Batch mode: all the points in the sample are used together, all at once:
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
2. Sequential mode: update the average as the points arrive. Starting with $\bar{X}_1 = X_1$, the update uses only the present observation $X_t$ and the previous mean $\bar{X}_{t-1}$:
$$\bar{X}_t = \left(\frac{t-1}{t}\right) \bar{X}_{t-1} + \frac{1}{t} X_t.$$
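A minimal R sketch checking that the sequential update reproduces the batch mean:

## Minimal sketch: batch vs. sequential computation of the sample mean
set.seed(8)
x <- rnorm(1000)
xbar <- x[1]
for (t in 2:length(x)) {
  xbar <- ((t - 1) / t) * xbar + x[t] / t    # uses only the new point and the old mean
}
c(sequential = xbar, batch = mean(x))        # the two agree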

Sequential Estimation of the Sample Variance

Given a random sample $X_1, X_2, \cdots, X_n$, the sample variance can be calculated in two different modes.

1. Batch mode: all the points in the sample are used together, all at once:
$$S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = \frac{(X_1 - \bar{X}_n)^2 + \cdots + (X_n - \bar{X}_n)^2}{n-1}$$
2. Sequential mode: update the sample variance as the points arrive. The update uses only the present observation $X_t$, the previous sample mean $\bar{X}_{t-1}$ and the previous sample variance $S_{t-1}^2$:
$$(t - 1)\, S_t^2 = (t - 2)\, S_{t-1}^2 + \left(\frac{t-1}{t}\right) (X_t - \bar{X}_{t-1})^2.$$
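A minimal R sketch of the variance recursion above, updated jointly with the running mean and checked against the batch var():

## Minimal sketch: sequential sample variance, checked against batch var()
set.seed(9)
x <- rnorm(1000)
xbar <- x[1]; s2 <- 0
for (t in 2:length(x)) {
  s2   <- ((t - 2) * s2 + ((t - 1) / t) * (x[t] - xbar)^2) / (t - 1)  # uses the old mean
  xbar <- ((t - 1) / t) * xbar + x[t] / t                             # then update the mean
}
c(sequential = s2, batch = var(x))           # the two agree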

Sequential Learning for the Perceptron

Given a random sample $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n)\}$, where $y_i \in \{-1, +1\}$ and $\mathbf{x}_i \in \mathbb{R}^p$, the Perceptron learning algorithm is simply:

1. Initialization: set an initial value of $\mathbf{w}$, say $\mathbf{w}_0 = \mathbf{0}$, and normalize all the $\mathbf{x}_i$ so that they have unit length.
2. Update/Learning: repeat until convergence:
   - Receive observation $t$, i.e. the pair $(\mathbf{x}_t, y_t)$.
   - Make your prediction: $h(\mathbf{x}_t) = \mathbf{w}_t^\top \mathbf{x}_t$.
   - Updating: if $y_t h(\mathbf{x}_t) < 0$, then $\mathbf{w}_{t+1} = \mathbf{w}_t + y_t \mathbf{x}_t$.
   - Iterate: $t = t + 1$.

The predicted class of $\mathbf{x}$ by the Perceptron is
$$\hat{f}_{\mathrm{perc}}(\mathbf{x}) = \mathrm{sign}\left(\mathbf{w}_t^\top \mathbf{x}\right),$$
which means that, given $\mathbf{x}$, we predict class $+1$ if $\mathbf{w}_t^\top \mathbf{x} > 0$.
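A minimal R sketch of these Perceptron updates on separable simulated data (the update condition is written as y_t h(x_t) <= 0 so that learning can start from w_0 = 0):

## Minimal sketch: the Perceptron on linearly separable simulated data
set.seed(10)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n, 2))             # intercept column plus two features
y <- ifelse(drop(X %*% c(0.5, 2, -1)) > 0, +1, -1)    # separable labels in {-1, +1}
X <- X / sqrt(rowSums(X^2))                           # normalize each x_i to unit length

w <- rep(0, ncol(X))                                  # initialization: w_0 = 0
for (epoch in 1:50) {
  for (t in 1:n) {
    if (y[t] * sum(w * X[t, ]) <= 0) w <- w + y[t] * X[t, ]   # update only on mistakes
  }
}
mean(sign(drop(X %*% w)) == y)                        # training accuracy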

Kernelization


Simulated Example of Binary Classification

To gain deeper insights, consider the following cloud of points:

Figure: Simple 2D binary classification (scatter of X2 versus X1).

The logistic regression model of (1) should do well on the above.

Homogeneous Doughnut Classification Problem

However, if we have the following cloud of points:

Figure: Original homogeneous 2-dimensional data (classes A and B).

How well will logistic regression - under its underlying linearity assumption - fare on this data? Uhmmmm!

Heterogeneous Doughnut Classification Problem

Better yet, if the cloud of points is:

Figure: Original heterogeneous 2-dimensional data (classes A and B).

How well will logistic regression - under its underlying linearity assumption - fare on this data? Pretty pitiful!

Comparison of Performances of Classifiers

                   SVM     QDA-Tr   QDA     Logit   Logit-Tr
  Homogeneous      0.98    0.98     0.95    0.52    1.00
  Heterogeneous    0.95    0.83     0.75    0.54    0.77

Table: Average accuracy of classifiers over 50 replications of the same task.

- Failure of linear logistic regression.
- Success with a nonlinear mapping.
- Question: how is the nonlinearity modeled so that such a tremendous improvement is achieved?

Kernel Logistic Regression for Classification

Nonlinear decision boundary: what to do if the linearity assumption implied in equation (2) does not hold? How do we model situations where $\eta(\mathbf{x}; \boldsymbol{\beta})$ is heavily nonlinear?

Kernel Logistic Regression assumes that the response variable $Y_i$ is related to the explanatory vector $\mathbf{x}_i$ through the model
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta(\mathbf{x}_i; v, \mathbf{w}) \qquad (7)$$
where
$$\eta(\mathbf{x}_i; v, \mathbf{w}) = v + w_1 K(\mathbf{x}_i, \mathbf{x}_1) + w_2 K(\mathbf{x}_i, \mathbf{x}_2) + \cdots + w_n K(\mathbf{x}_i, \mathbf{x}_n) \qquad (8)$$
and
$$\pi_i = \Pr[Y_i = 1 \mid \mathbf{x}_i] = \frac{e^{\eta(\mathbf{x}_i; v, \mathbf{w})}}{1 + e^{\eta(\mathbf{x}_i; v, \mathbf{w})}} = \frac{1}{1 + e^{-\eta(\mathbf{x}_i; v, \mathbf{w})}} = \pi(\mathbf{x}_i; v, \mathbf{w})$$

Regularized Kernel Ridge Regression

$$\hat{R}_{\mathrm{reg}}(\beta_0, \boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i(\beta_0 + \boldsymbol{\beta}^\top \mathbf{x}_i)\right\}\right] + \lambda \sum_{j=0}^{p} |\beta_j|$$
Using kernels, the regularized empirical risk for kernel logistic regression is given by
$$\hat{R}_{\mathrm{reg}}(g) = \frac{1}{n} \sum_{i=1}^{n} \log\left[1 + \exp\left\{-y_i\, g(\mathbf{x}_i)\right\}\right] + \frac{\lambda}{2} \|g\|^2_{\mathcal{H}_K}$$
where
$$g(\mathbf{x}_i) = v + \sum_{j=1}^{n} w_j K(\mathbf{x}_i, \mathbf{x}_j),$$
with $g = v + h$, $v \in \mathbb{R}$ and $h \in \mathcal{H}_K$. Here, $\mathcal{H}_K$ is the Reproducing Kernel Hilbert Space (RKHS) engendered by the kernel $K$.

Support Vector Machines and the Hinge Loss

The hinge loss is defined as
$$\ell(y, f(\mathbf{x})) = (1 - y f(\mathbf{x}))_+ = \max(0,\, 1 - y f(\mathbf{x})) = \begin{cases} 0 & \text{if } y = f(\mathbf{x}) \\ 1 - y f(\mathbf{x}) & \text{if } y \ne f(\mathbf{x}) \end{cases}$$

Figure: The hinge loss plotted as a function of the margin $y f(\mathbf{x})$.

Notice that the hinge loss is simply a functional formulation of the constraint in the original SVM paradigm.

Support Vector Machines and the Hinge Loss

With the hinge loss, the Support Vector Machine classifier can be formulated as
$$\text{Minimize } E(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left(1 - y_i(\mathbf{w}^\top \Phi(\mathbf{x}_i) + b)\right)_+ \quad \text{subject to } \|\mathbf{w}\|_2^2 < \tau,$$
which is equivalent, in regularized (Lagrangian) form, to
$$\hat{\mathbf{w}} = \underset{\mathbf{w} \in \mathbb{R}^q}{\arg\min} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left(1 - y_i(\mathbf{w}^\top \Phi(\mathbf{x}_i) + b)\right)_+ + \lambda \|\mathbf{w}\|_2^2 \right\}$$
It is important in this formulation to maintain the responses coded as $\{-1, +1\}$.

Most Popular Kernels

The polynomial kernel of Vladimir Vapnik (Professor Vladimir Vapnik is the co-inventor of the Support Vector Machine paradigm):
$$K(\mathbf{x}_i, \mathbf{x}_j) = (b\, \mathbf{x}_i^\top \mathbf{x}_j + a)^d$$
where $d$ is the degree of the polynomial, $b$ is the scale parameter and $a$ is the offset parameter.

The Gaussian Radial Basis Function kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\omega^2}\right)$$
This is arguably the most popular kernel because of its flexibility, but also because it is found to provide a decent approximation in many real life problems.

Most Popular Kernels

The Laplace kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|\right)$$
This is the $\ell_1$-norm counterpart of the RBF kernel. Although apparently similar, they sometimes produce drastically different results when applied to practical situations.

The hyperbolic tangent kernel:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(b\, \mathbf{x}_i^\top \mathbf{x}_j + a\right)$$

Other kernels: (a) string kernels used in text categorization; (b) ANOVA kernels; (c) spline kernels; (d) Bessel kernels.

Note: A kernel in this context is essentially a measure of similarity.
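These kernels are available in the kernlab package (listed among the tools at the end of the deck); a minimal sketch follows, where the parameter values are arbitrary illustrations and kernlab's rbfdot parameterizes the RBF kernel with sigma = 1/(2*omega^2):

## Minimal sketch: popular kernels as kernlab objects and a Gram matrix
library(kernlab)
x <- matrix(rnorm(5 * 2), 5, 2)
k_rbf  <- rbfdot(sigma = 1)                           # Gaussian RBF kernel
k_lap  <- laplacedot(sigma = 1)                       # Laplace kernel
k_poly <- polydot(degree = 3, scale = 1, offset = 1)  # (b x'y + a)^d
k_tanh <- tanhdot(scale = 1, offset = 1)              # hyperbolic tangent kernel
kernelMatrix(k_rbf, x)                                # 5 x 5 similarity (Gram) matrix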

SVM on Doughnut Data

With the Laplace kernel, SVM delivers:

Figure: SVM classification plot (decision values over the (X1, X2) plane).

Performance: Separation is clear, and accuracy is 98.25%, with 174 support vectors out of the n = 400 observations. Pretty good!
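A minimal R sketch in the spirit of this experiment: an SVM with the Laplace kernel (kernlab::ksvm) on a simulated doughnut; the simulation and tuning values are illustrative, so the accuracy and number of support vectors will not match the figure exactly:

## Minimal sketch: Laplace-kernel SVM on a simulated doughnut data set
library(kernlab)
set.seed(11)
n <- 400
r <- c(runif(n / 2, 0, 1), runif(n / 2, 2, 3))    # inner disc vs. outer ring radii
theta <- runif(n, 0, 2 * pi)
X <- cbind(x1 = r * cos(theta), x2 = r * sin(theta))
y <- factor(rep(c("B", "A"), each = n / 2))

fit <- ksvm(X, y, kernel = "laplacedot", kpar = list(sigma = 1), C = 1)
fit                                               # training error and number of SVs
mean(predict(fit, X) == y)                        # training accuracy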

SVM on Doughnut Data

With the Gaussian Radial Basis Function kernel, SVM delivers:

Figure: SVM classification plot (decision values over the (X1, X2) plane).

Performance: Separation is clear, and accuracy is 96.5%, with 165 support vectors out of the n = 400 observations. Pretty good!

SVM on Doughnut Data

With the polynomial kernel, SVM delivers:

Figure: SVM classification plot (decision values over the (X1, X2) plane).

Performance: Nothing going (confusion), and accuracy is 50.25%, with 387 support vectors out of the n = 400 observations. Pretty pitiful, a total failure!

Appeal of Kernels

Consider the Gaussian RBF kernel for illustration:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\omega^2}\right)$$
Modeling of nonlinearity in extremely high dimensions: the choice of $\omega$ corresponds to the selection of an entire class of functions, which can be very large and very sophisticated.

Interesting intuition: the similarity-measure view provides some insight into why it works:
$$\lim_{\|\mathbf{x}_i - \mathbf{x}_j\| \to 0} K(\mathbf{x}_i, \mathbf{x}_j) = 1 \qquad \text{and} \qquad \lim_{\|\mathbf{x}_i - \mathbf{x}_j\| \to \infty} K(\mathbf{x}_i, \mathbf{x}_j) = 0$$
Therefore, we have local approximation in the right neighborhood, which is a good thing!

Support Vector Machine (SVM) Classifier

The support vector machine classifier $\hat{f}$ is obtained by solving a constrained minimization problem with objective function
$$E(\mathbf{w}, v, \boldsymbol{\xi}) = \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
subject to
$$y_i(\mathbf{w}^\top \Phi(\mathbf{x}_i) + v) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0, \qquad i = 1, \cdots, n.$$
Specifically, we have
$$\hat{f}(\mathbf{x}) = \mathrm{sign}\left( \sum_{j=1}^{|s|} \hat{\alpha}_{s_j}\, y_{s_j}\, K(\mathbf{x}_{s_j}, \mathbf{x}) + \hat{v} \right)$$
where $s_j \in \{1, 2, \cdots, n\}$, $\mathbf{s}^\top = (s_1, s_2, \cdots, s_{|s|})$ and $|s| \ll n$. The vectors $\mathbf{x}_{s_1}, \mathbf{x}_{s_2}, \cdots, \mathbf{x}_{s_{|s|}}$ are special and are referred to as support vectors - hence the name support vector machine (SVM).

SVM Classification with Linear Boundary

SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1.

Figure: Linear SVM classifier with a relatively small margin.

SVM Classification with Linear Boundary

SVM boundary: 3x + 2y + 1 = 0. Margins: 3x + 2y + 1 = ±1.

Figure: Linear SVM classifier with a relatively large margin.

SVM Classification with Nonlinear Boundary

SVM optimal separating and margin hyperplanes.

Figure: Nonlinear SVM classifier with a relatively small margin.

Kernel Logistic Regression for Classification

Nonlinear decision boundary: what to do if the linearity assumption implied in equation (2) does not hold? How do we model situations where $\eta(\mathbf{x}; \boldsymbol{\beta})$ is heavily nonlinear?

Kernel Logistic Regression assumes that the response variable $Y_i$ is related to the explanatory vector $\mathbf{x}_i$ through the model
$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \eta(\mathbf{x}_i; v, \mathbf{w}) \qquad (9)$$
where
$$\eta(\mathbf{x}_i; v, \mathbf{w}) = v + w_1 K(\mathbf{x}_i, \mathbf{x}_1) + w_2 K(\mathbf{x}_i, \mathbf{x}_2) + \cdots + w_n K(\mathbf{x}_i, \mathbf{x}_n) \qquad (10)$$
and
$$\pi_i = \Pr[Y_i = 1 \mid \mathbf{x}_i] = \frac{e^{\eta(\mathbf{x}_i; v, \mathbf{w})}}{1 + e^{\eta(\mathbf{x}_i; v, \mathbf{w})}} = \frac{1}{1 + e^{-\eta(\mathbf{x}_i; v, \mathbf{w})}} = \pi(\mathbf{x}_i; v, \mathbf{w})$$

Ridge Regression with Model Search [Fokoue, 2008]

Assume that each $w_i$ has prior density $p(w_i \mid \lambda) \propto \exp(-\lambda^{-1} w_i^2)$. Then direct empirical Bayes gives
$$E_\lambda(v, \mathbf{w}) = \frac{1}{n} \|\mathbf{y} - K\mathbf{w}\|_2^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w} \qquad (11)$$
- This is an inherently non-sparsity-inducing prior.
- However, Grandvalet and Canu (2002) use a sequential frequentist algorithm to find a sparse solution with it.
- Fokoue (2008) uses simple conjugate hyperpriors
$$\lambda \sim \mathrm{Ga}(a, b) \qquad \text{and} \qquad \frac{1}{\sigma^2} \sim \mathrm{Ga}(c_0, d_0)$$
and gets sparse representations via model space search.

Coordinate-wise Model Description

Define the coordinate-wise model index
$$\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \cdots, \gamma_q), \qquad \gamma_i \in \{0, 1\}.$$
Consider selecting from among submodels of the form
$$\mathcal{M}_{\boldsymbol{\gamma}}: \quad \mathbf{y} = K_{\boldsymbol{\gamma}} \mathbf{w}_{\boldsymbol{\gamma}} + \boldsymbol{\epsilon}$$
where
$$\gamma_i = \begin{cases} 1 & \text{if } K(\cdot, \mathbf{x}_i) \text{ is used by model } \mathcal{M}_{\boldsymbol{\gamma}} \\ 0 & \text{otherwise} \end{cases}$$
The model space $\mathcal{M}$, with $2^q - 1$ models, is defined as
$$\mathcal{M} = \{\mathcal{M}_{\boldsymbol{\gamma}}: \ \boldsymbol{\gamma} \in \{0, 1\}^q \text{ and } \boldsymbol{\gamma} \ne (0, 0, \cdots, 0)\}.$$

Relevance Vector Machine [Tipping, 2001]

Let the prior for each $w_i$ be
$$(w_i \mid \alpha_i) \sim N(w_i \mid 0, \alpha_i^{-1}), \qquad (12)$$
with
$$(\alpha_i \mid a, b) \sim \mathrm{Ga}(\alpha_i \mid a, b). \qquad (13)$$
Interestingly, the marginal prior for $w_i$ is then
$$p(w_i) = \int p(w_i \mid \alpha_i)\, p(\alpha_i)\, d\alpha_i = \frac{b^a\, \Gamma\!\left(a + \tfrac{1}{2}\right)}{(2\pi)^{\frac{1}{2}}\, \Gamma(a)} \left( b + \tfrac{1}{2} w_i^2 \right)^{-\left(a + \frac{1}{2}\right)}. \qquad (14)$$

Relevance Vector Machine [Tipping, 2001]

Tipping (2001) solves the RVM problem:
$$\text{Maximize} \quad -\frac{1}{2} \log \det\left( \sigma^2 I + \sum_{j=1}^{n} \alpha_j^{-1} \mathbf{h}_j \mathbf{h}_j^\top \right) - \frac{1}{2}\, \mathbf{y}^\top \left( \sigma^2 I + \sum_{j=1}^{n} \alpha_j^{-1} \mathbf{h}_j \mathbf{h}_j^\top \right)^{-1} \mathbf{y} + (a - 1) \sum_{j=1}^{n} \log \alpha_j - b \sum_{j=1}^{n} \alpha_j \qquad (15)$$
$$\text{Subject to} \quad \alpha_j > 0, \qquad j = 1, \cdots, n.$$
- RVM is more sparse than LASSO.
- RVM optimization is trickier.
- Fokoue and Goel (2008) link RVM to D-optimality.
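For completeness, kernlab also ships an rvm() implementation for regression; a minimal, hedged sketch on the classic noisy sinc curve (this is an illustration, not Tipping's exact experiment, and the kernel parameter is an arbitrary choice):

## Minimal sketch: relevance vector machine regression via kernlab::rvm
library(kernlab)
set.seed(12)
x <- seq(-10, 10, length.out = 100)
y <- sin(x) / x + rnorm(100, sd = 0.05)
fit <- rvm(x, y, kernel = "rbfdot", kpar = list(sigma = 0.1))
fit                          # reports the (small) number of relevance vectors
yhat <- predict(fit, x)      # fitted curve at the training inputs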

Comparison of the Three Prior Structures

Figure: Marginal prior shapes: (left) Gaussian; (center) Lasso; (right) RVM with $(a, b) = (1, \tfrac{1}{4})$.

- RVM is the most sparse of all; however, it suffers from (a) optimization difficulties and (b) inconsistent estimates.
- The basic isotropic Gaussian can still help achieve a sparse representation.
- The Lasso is good, but not straightforward for Bayesians.

Empirical Framework for Predictive Analytics

For r = 1 to R:
  - Draw $\ell$ items without replacement from $\mathcal{D}$ to form $\mathcal{T}_r$.
  - Train $\hat{f}^{(r)}(\cdot)$ based on the $\ell$ items in $\mathcal{T}_r$.
  - Predict $\hat{f}^{(r)}(\mathbf{x}_i)$ for the $m$ items in $\mathcal{V}_r = \mathcal{D} \setminus \mathcal{T}_r$.
  - Calculate
$$\widehat{\mathrm{EPMSE}}(\hat{f}^{(r)}) = \frac{1}{m} \sum_{\mathbf{z}_i \in \mathcal{V}_r} \ell(y_i, \hat{f}^{(r)}(\mathbf{x}_i))$$
End

Compute the average EPMSE for $\hat{f}$, namely
$$\mathrm{average}\{\mathrm{EPMSE}(\hat{f})\} = \frac{1}{R} \sum_{r=1}^{R} \widehat{\mathrm{EPMSE}}(\hat{f}^{(r)}).$$
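A minimal R sketch of this hold-out replication framework for a single learner (ordinary linear regression) on simulated data; the sizes N, ell and R are arbitrary illustrative choices:

## Minimal sketch: R replications of random hold-out evaluation (EPMSE)
set.seed(13)
N <- 200; p <- 5; R <- 50; ell <- 150
X <- matrix(rnorm(N * p), N, p)
y <- drop(X %*% rnorm(p)) + rnorm(N)
D <- data.frame(y = y, X)

epmse <- numeric(R)
for (r in 1:R) {
  tr   <- sample(N, ell)                        # T_r: ell items drawn without replacement
  fit  <- lm(y ~ ., data = D[tr, ])             # train f-hat^(r) on T_r
  yhat <- predict(fit, newdata = D[-tr, ])      # predict on V_r = D \ T_r
  epmse[r] <- mean((D$y[-tr] - yhat)^2)         # EPMSE(f-hat^(r)) under squared error loss
}
mean(epmse)                                     # average EPMSE over the R replications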

Comparison of Learning Methods on Accent Detection

Table: Accent data from research work by Zichen Ma

  Accent    Female   Male   Total
  US        90       75     165
  Non-US    90       75     165
  Total     180      150    330

Accent Detection in Time Domain

  # PC    Eigenvalue   Percentage (%)
  206     75.3270      89.94
  207     74.9464      90.08
  ...     ...          ...
  249     52.0964      94.97
  250     51.8996      95.07

Table: Number of principal components.

Accent Detection in Time Domain

  Method   Error (%)   Accuracy (%)
  LDA      7.88        92.12
  SVM      4.55        95.45
  k-NN     24.85       75.15

Table: Training errors on the time domain.

Accent Detection in Time Domain

Figure: Comparison of the average test error (LDA, SVM-RBF, SVM-Poly) for accent detection based on the time domain. From research work by Zichen Ma.

Accent Detection in the MFCC Domain

Table: Average accuracy of classifiers

  # MFCCs   LDA      QDA      SVM(RBF)   SVM(PLY)   kNN
  12        0.7353   0.8112   0.8208     0.8097     0.8548
  19        0.7503   0.8647   0.8507     0.8851     0.9098
  26        0.8063   0.9224   0.9080     0.9379     0.9398
  33        0.8319   0.9543   0.9352     0.9509     0.9586
  39        0.8260   0.9383   0.9223     0.9438     0.9605

Accent Detection using MFCCs

Figure: Comparison of classifier performance (accuracy as a function of the number of MFCCs, for LDA, QDA, SVM-RBF, SVM-PLY and k-NN) for accent detection using MFCCs, from research work by Zichen Ma.

Comparison of Learning Methods

           LDA      SVM      kNN      CART     adaBoost   rForest   NeuralNet   GaussPR   Logistic
  Musk     0.2227   0.1184   0.1922   0.2450   0.1375     0.1152    0.1479      0.1511    0.2408
  Pima     0.2193   0.2362   0.3094   0.2507   0.2243     0.2304    0.2570      0.2304    0.2186
  Crabs    0.0452   0.0677   0.0938   0.1970   0.1208     0.1097    0.0350      0.0702    0.0363

Table: Computations made with The Mighty R.

Comparison of Learning Methods

Figure: Comparison of the average prediction error over R = 100 replications on the Musk data (boxplots by method: LDA, SVM, CART, rForest, Gauss, kNN, Boost, NNET, Logistic).

Figure: Comparison of the average prediction error over R = 100 replications on the Musk data (average error by method).

Figure: Comparison of the average prediction error over R = 100 replications on the lymphoma data (boxplots by method: LDA, SVM, CART, rForest, Gauss, kNN, Boost, Logistic, RDA).

Figure: Comparison of the average prediction error over R = 100 replications on the lymphoma data (average error by method).

No Free Lunch Theorem

Theorem (No Free Lunch). There is no learning method that is universally superior to all other methods on all datasets. In other words, if a learning method is presented with a data set whose inherent patterns violate its assumptions, then that learning method will under-perform.

The no free lunch theorem basically says that there is no such thing as a universally superior learning method that outperforms all other methods on all possible data, no matter how sophisticated the method may appear to be.

No Free Lunch Theorem

- Traditional approaches typically break down; even when they appear to work, traditional methods like trees break down.
- Off-the-shelf dimensionality reduction does not seem to work.
- Domain-specific dimensionality reduction/feature extraction, like MFCCs for sounds, works great.
- Regularization and aggregation sound great as devices for variance reduction, but must be combined with regularization/selection.
- Subspace learning appears to be best suited, as it combines both regularization/selection and aggregation.
- Tuning is very difficult.

Existing Computing Tools

R packages for big data:
library(biglm)
library(foreach)
library(glmnet)
library(caret)
library(spikeslab)
library(bms)
library(kernlab)
library(randomForest)
library(ada)
library(audio)
library(rpart)

Concluding Remarks and Recommendations

- Applications: Sharpen your intuition and your common sense by questioning things, reading about interesting open applied problems, and attempting to solve as many problems as possible.
- Methodology: Read and learn about the fundamentals of statistical estimation and inference, get acquainted with the most commonly used methods and techniques, and consistently ask yourself and others what the natural extensions of the techniques could be.
- Computation: Learn and master at least two programming languages. I strongly recommend getting acquainted with R: http://www.r-project.org
- Theory: "Nothing is more practical than a good theory" (Vladimir N. Vapnik). When it comes to data mining, machine learning and predictive analytics, those who truly understand the inner workings of algorithms and methods always solve problems better.