http://www.ub.edu/riskcenter
Machine Learning & Data Science for Actuaries, with R Arthur Charpentier (Université de Rennes 1 & UQàM)
Universitat de Barcelona, April 2016. http://freakonometrics.hypotheses.org
@freakonometrics
1
http://www.ub.edu/riskcenter
Machine Learning & Data Science for Actuaries, with R Arthur Charpentier (Université de Rennes 1 & UQàM)
Professor, Economics Department, Univ. Rennes 1 In charge of Data Science for Actuaries program, IA Research Chair actinfo (Institut Louis Bachelier) (previously Actuarial Sciences at UQàM & ENSAE Paristech actuary in Hong Kong, IT & Stats FFSA) PhD in Statistics (KU Leuven), Fellow Institute of Actuaries MSc in Financial Mathematics (Paris Dauphine) & ENSAE Editor of the freakonometrics.hypotheses.org’s blog Editor of Computational Actuarial Science, CRC @freakonometrics
2
http://www.ub.edu/riskcenter
Agenda
0. Introduction, see slides 1. Classification, y ∈ {0, 1} 2. Regression Models, y ∈ R 3. Model Choice, Feature Selection, etc 4. Data Visualisation & Maps
@freakonometrics
3
http://www.ub.edu/riskcenter
Part 1. Classification, y ∈ {0, 1}
@freakonometrics
4
http://www.ub.edu/riskcenter
Classification? Example: Fraud detection, automatic reading (classifying handwriting symbols), face recognition, accident occurence, death, purchase of optinal insurance cover, etc
1.0
Here yi ∈ {0, 1} or yi ∈ {−1, +1} or yi ∈ {•, •}.
0.8
●
●
●
●
0.4
• the score function, s(x) = P(Y = 1|X = x) ∈ [0, 1]
●
●
0.6
We look for a (good) predictive model here. There will be two steps,
●
0.2
● ●
0.0
●
• the classification function s(x) → Yb ∈ {0, 1}.
@freakonometrics
0.0
0.2
0.4
0.6
0.8
1.0
5
http://www.ub.edu/riskcenter
Modeling a 0/1 random variable
Myocardial infarction of patients admited in E.R. ◦ heart rate (FRCAR), ◦ ◦ ◦ ◦ ◦ ◦ ◦
1
heart index (INCAR) stroke index (INSYS) diastolic pressure (PRDIA) pulmonary arterial pressure (PAPUL) ventricular pressure (PVENT) lung resistance (REPUL) death or survival (PRONO)
> myocarde = read . table ( " http : / / fre a ko no me tr ics . free . fr / myocarde . csv " , head = TRUE , sep = " ; " )
@freakonometrics
6
http://www.ub.edu/riskcenter
Logistic Regression Assume that P(Yi = 1) = πi , logit(πi ) = xT i β, where logit(πi ) = log or −1
πi = logit
(xT i β)
πi 1 − πi
,
exp[xT i β] = . T 1 + exp[xi β]
The log-likelihood is log L(β) =
n X
yi log(πi )+(1−yi ) log(1−πi ) =
i=1
n X
yi log(πi (β))+(1−yi ) log(1−πi (β))
i=1
and the first order conditions are solved numerically n
∂ log L(β) X = xk,i [yi − πi (β)] = 0. ∂βk i=1
@freakonometrics
7
http://www.ub.edu/riskcenter
Logistic Regression, Output (with R) 1
> logistic summary ( logistic )
3 4
Coefficients : Estimate Std . Error z value Pr ( >| z |)
5 6
( Intercept ) -10.187642
11.895227
-0.856
0.392
7
FRCAR
0.138178
0.114112
1.211
0.226
8
INCAR
-5.862429
6.748785
-0.869
0.385
9
INSYS
0.717084
0.561445
1.277
0.202
10
PRDIA
-0.073668
0.291636
-0.253
0.801
11
PAPUL
0.016757
0.341942
0.049
0.961
12
PVENT
-0.106776
0.110550
-0.966
0.334
13
REPUL
-0.003154
0.004891
-0.645
0.519
14 15
( Dispersion parameter for binomial family taken to be 1)
16 17
Number of Fisher Scoring iterations : 7
@freakonometrics
8
http://www.ub.edu/riskcenter
Logistic Regression, Output (with R) 1
> library ( VGAM )
2
> mlogistic summary ( mlogistic )
4 5
Coefficients :
6
Estimate Std . Error
z value
7
( Intercept ) 10.1876411 11.8941581
0.856525
8
FRCAR
-0.1381781
9
INCAR
5.8624289
10
INSYS
-0.7170840
11
PRDIA
0.0736682
12
PAPUL
-0.0167565
13
PVENT
0.1067760
0.1105456
0.965901
14
REPUL
0.0031542
0.0048907
0.644939
0.1141056 -1.210967 6.7484319
0.868710
0.5613961 -1.277323 0.2916276
0.252610
0.3419255 -0.049006
15 16
Name of linear predictor : log ( mu [ ,1] / mu [ ,2])
@freakonometrics
9
http://www.ub.edu/riskcenter
Logistic (Multinomial) Regression In the Bernoulli case, y ∈ {0, 1}, XTβ
P(Y = 1) =
p1 1 p0 e = ∝ p and P(Y = 0) = = ∝ p0 1 Tβ T X X p0 + p1 p0 + p1 1+e 1+e
In the multinomial case, y ∈ {A, B, C} X T βA
P(X = A) =
e pA ∝ pA i.e. P(X = A) = X T β T B + eX β B + 1 pA + pB + pC e T
pB eX βB P(X = B) = ∝ pB i.e. P(X = B) = X T β T A + eX β B + 1 pA + pB + pC e 1 pC ∝ pC i.e. P(X = C) = X T β P(X = C) = T A + eX β B + 1 pA + pB + pC e
@freakonometrics
10
http://www.ub.edu/riskcenter
Logistic Regression, Numerical Issues b is The algorithm to compute β 1. start with some initial value β 0 2. define β k = β k−1 − H(β k−1 )−1 ∇ log L(β k−1 ) where ∇ log L(β)is the gradient, and H(β) the Hessian matrix, also called Fisher’s score. The generic term of the Hessian is n
∂ 2 log L(β) X = Xk,i X`,i [yi − πi (β)] ∂βk ∂β` i=1 Define Ω = [ωi,j ] = diag(b πi (1 − π bi )) so that the gradient is writen ∂ log L(β) ∇ log L(β) = = X T (y − π) ∂β
@freakonometrics
11
http://www.ub.edu/riskcenter
Logistic Regression, Numerical Issues and the Hessian
∂ 2 log L(β) T H(β) = = −X ΩX T ∂β∂β
The gradient descent algorithm is then β k = (X T ΩX)−1 X T ΩZ where Z = Xβ k−1 + X T Ω−1 (y − π), b = lim β , From maximum likelihood properties, if β k k→∞
√
L
b − β) → N (0, I(β)−1 ). n(β
From a numerical point of view, this asymptotic variance I(β)−1 satisfies I(β)−1 = −H(β).
@freakonometrics
12
http://www.ub.edu/riskcenter
Logistic Regression, Numerical Issues 1
> X = cbind (1 , as . matrix ( myocarde [ ,1:7]) )
2
> Y = myocarde $ PRONO == " Survival "
3
> beta = as . matrix ( lm ( Y ~ 0+ X ) $ coefficients , ncol =1)
4
> for ( s in 1:9) {
5
+
pi = exp ( X % * % beta [ , s ]) / (1+ exp ( X % * % beta [ , s ]) )
6
+
gradient = t ( X ) % * % (Y - pi )
7
+
omega = matrix (0 , nrow ( X ) , nrow ( X ) ) ; diag ( omega ) =( pi * (1 - pi ) )
8
+
Hessian = - t ( X ) % * % omega % * % X
9
+
beta = cbind ( beta , beta [ , s ] - solve ( Hessian ) % * % gradient ) }
10
> beta
11
> - solve ( Hessian )
12
> sqrt ( - diag ( solve ( Hessian ) ) )
@freakonometrics
13
http://www.ub.edu/riskcenter
Predicted Probability Let m(x) = E(Y |X = x). With a logistic regression, we can get a prediction b exp[xT β] m(x) b = b 1 + exp[xT β] 1 2
> predict ( logistic , type = " response " ) [1:5] 1
2
3
4
5
3
0.6013894 0.1693769 0.3289560 0.8817594 0.1424219
4
> predict ( mlogistic , type = " response " ) [1:5 ,]
5
Death
Survival
6
1 0.3986106 0.6013894
7
2 0.8306231 0.1693769
8
3 0.6710440 0.3289560
9
4 0.1182406 0.8817594
10
5 0.8575781 0.1424219
@freakonometrics
14
http://www.ub.edu/riskcenter
Predicted Probability b exp[βb0 + βb1 x1 + · · · + βbk xk ] exp[xT β] m(x) b = = T b 1 + exp[x β] 1 + exp[βb0 + βb1 x1 + · · · + βbk xk ] use 1
> predict ( fit _ glm , newdata = data , type = " response " )
e.g. 3000
●
●
● ●
> GLM pred _ GLM = function (p , r ) {
● ●
●
●
●
●
●
●
+ return ( predict ( GLM , newdata =
●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
+ data . frame ( PVENT =p , REPUL = r ) , type = " response " ) }
●
● ● ●
● ●
●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
●
500
4
●
●
1500
2
● ●
●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
@freakonometrics
15
http://www.ub.edu/riskcenter
Predictive Classifier To go from a score to a class: if s(x) > s, then Yb (x) = 1 and s(x) ≤ s, then Yb (x) = 0 Plot T P (s) = P[Yb = 1|Y = 1] against F P (s) = P[Yb = 1|Y = 0]
@freakonometrics
16
http://www.ub.edu/riskcenter
Predictive Classifier With a threshold (e.g. s = 50%) and the predicted probabilities, one can get a classifier and the confusion matrix 1
> probabilities predictions .5) +1]
3
> table ( predictions , myocarde $ PRONO )
4 5
predictions Death Survival
6
Death
7
Survival
@freakonometrics
25
3
4
39
17
http://www.ub.edu/riskcenter
Visualization of a Classifier in Higher Dimension...
4
Death Survival
4
Death Survival
19
19
29
66
●
29
66
●
●
●
●
●
54
54 ●
2
●
64 34 ● 22 3 ● ●
15 11 10 ● 21 ● 67 37 7 56 ● ● 46● ● 12 51 36 ● ● 43 ● ● 61 47 35 ● ● ● 28 Survival 5324 71 ● 2 68 ● ● ● 32 17 ● 13 58 60 25● ● ● ● 16 ● ● ● ● ● 55 ● ● ● ● 20 Death 70 62 30 4841 ● ● 9● 44 40 38 ● ● 14 ● ● ● 50 5 39 ● ● ● ●4 1 ● 59 45 ● 26 6 ● 57 ● ● ● 18 ● ● ● ● 42 23 27 ● ● ● ● 8 63 49 ● ● ● 33
●
0
52 ●
52
−4 0
●
●
64 34 ● 22 3 ● ●
15 11 10 ● 21 ● 67 37 7 56 ● ● 46● ● 12 51 36 ● ● 43 ● ● 61 47 35 ● ● ● 28 Survival 5324 71 ● 2 68 ● ● ● 32 17 ● 13 58 60 25● ● ● ● 16 ● ● ● ● ● 55 ● ● ● ● 20 Death 70 62 30 4841 ● ● 9● 44 40 38 ● ● 14 ● ● ● 50 5 39 ● ● ● ●4 1 ● 59 45 ● 26 6 ● 57 ● ● ● 18 ● ● ● ● 42 23 27 ● ● ● ● 8 63 49 ● ● ● 33
●
●
−4
−2
31
●
●
−4
69 65
0
●
Dim 2 (18.64%)
31
−2
69 65
−2
Dim 2 (18.64%)
2
●
2
4
Dim 1 (54.26%)
5
0.
−4
−2
0
2
4
Dim 1 (54.26%)
Point z = (z1 , z2 , 0, · · · , 0) −→ x = (x1 , x2 , · · · , xk ).
@freakonometrics
18
http://www.ub.edu/riskcenter
... but be carefull about interpretation !
1
> prediction = predict ( logistic , type = " response " )
Use a 25% probability threshold 1
>
table ( prediction >.25 , myocarde $ PRONO ) Death Survival
2 3
FALSE
19
2
4
TRUE
10
40
or a 75% probability threshold 1
>
table ( prediction >.75 , myocarde $ PRONO ) Death Survival
2 3
FALSE
4
TRUE
@freakonometrics
27
9
2
33
19
http://www.ub.edu/riskcenter
Why a Logistic and not a Probit Regression? Bliss (1934) suggested a model such that P(Y = 1|X = x) = H(xT β) where H(·) = Φ(·) the c.d.f. of the N (0, 1) distribution. This is the probit model. This yields a latent model, yi = 1(yi? > 0) where yi? = xT i β + εi is a nonobservable score. In the logistic regression, we model the odds ratio, P(Y = 1|X = x) = exp[xT β] P(Y 6= 1|X = x) exp[·] P(Y = 1|X = x) = H(x β) where H(·) = 1 + exp[·] T
which is the c.d.f. of the logistic variable, see Verhulst (1845) @freakonometrics
20
http://www.ub.edu/riskcenter
k-Nearest Neighbors (a.k.a. k-NN) In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. (Source: wikipedia). X 1 yi E[Y |X = x] ∼ k kxi −xk small
For k-Nearest Neighbors, the class is usually the majority vote of the k closest neighbors of x. 1
3000
●
> library ( caret )
●
● ●
> KNN
● ●
●
●
●
●
●
●
> pred _ KNN = function (p , r ) {
●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
+ return ( predict ( KNN , newdata =
●
● ● ●
● ●
●
5
●
●
1500
3
● ●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
+ data . frame ( PVENT =p , REPUL = r ) , type = " prob " ) [ ,2]}
500
6
● ●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
@freakonometrics
21
http://www.ub.edu/riskcenter
k-Nearest Neighbors Distance k · k should not be sensitive to units: normalize by standard deviation 1
3000
●
> sP library ( rpart )
2
> cart library ( rpart . plot )
4
> library ( rattle )
5
> prp ( cart , type =2 , extra =1)
or 1
> fancyRpartPl ot ( cart , sub = " " )
@freakonometrics
24
http://www.ub.edu/riskcenter
Classification (and Regression) Trees, CART The impurity is a function ϕ of the probability to have 1 at node N , i.e. P[Y = 1| node N ], and I(N ) = ϕ(P[Y = 1| node N ]) ϕ is nonnegative (ϕ ≥ 0), symmetric (ϕ(p) = ϕ(1 − p)), with a minimum in 0 and 1 (ϕ(0) = ϕ(1) < ϕ(p)), e.g. • Bayes error: ϕ(p) = min{p, 1 − p} • cross-entropy: ϕ(p) = −p log(p) − (1 − p) log(1 − p) • Gini index: ϕ(p) = p(1 − p) Those functions are concave, minimum at p = 0 and 1, maximum at p = 1/2.
@freakonometrics
25
http://www.ub.edu/riskcenter
Classification (and Regression) Trees, CART To split N into two {NL , NR }, consider X
I(NL , NR ) =
x∈{L,R}
nx I(Nx ) n
e.g. Gini index (used originally in CART, see Breiman et al. (1984)) X nx X nx,y nx,y gini(NL , NR ) = − 1− n nx nx x∈{L,R}
y∈{0,1}
and the cross-entropy (used in C4.5 and C5.0) entropy(NL , NR ) = −
X x∈{L,R}
@freakonometrics
nx n
X nx,y nx,y log nx nx
y∈{0,1}
26
http://www.ub.edu/riskcenter
Classification (and Regression) Trees, CART
15
20
25
30
30
@freakonometrics
−0.14 −0.16 −0.18 −0.14 16
18
20
22
2000
20
22
24
26
28
−0.16
−0.14 1500
18
REPUL
−0.18 1000
16
PVENT
−0.20 500
32
−0.16 14
−0.16
−0.25 10 12 14 16
12
second split −→
REPUL
28
−0.20
35
−0.45 8
24
−0.14
25
20
PAPUL
−0.18 20
−0.35
−0.25 −0.35 −0.45
6
3.0
−0.20
24
PVENT
4
2.6
−0.18
−0.25
←− first split
−0.35 20
2.2
PRDIA
−0.45
−0.35 −0.45
16
−0.20 1.8
PAPUL
−0.25
PRDIA
12
−0.14 −0.16
j∈{1,··· ,k},s
3.0
{I(NL , NR )}
−0.18
2.5
max
−0.18
solve
−0.20
2.0
INSYS
−0.14
1.5
INCAR
−0.16
1.0
NR : {xi,j > s}
−0.20
−0.25 −0.35
NL : {xi,j ≤ s}
−0.45
−0.45
−0.25
INSYS
−0.35
INCAR
4
6
8
10
12
14
500
700
900
27
1100
http://www.ub.edu/riskcenter
Pruning Trees One can grow a big tree, until leaves have a (preset) small number of observations, and then possibly go back and prune branches (or leaves) that do not improve gains on good classification sufficiently. Or we can decide, at each node, whether we split, or not.
@freakonometrics
28
http://www.ub.edu/riskcenter
Pruning Trees In trees, overfitting increases with the number of steps, and leaves. Drop in impurity at node N is defined as n nR L ∆I(NL , NR ) = I(N ) − I(NL , NR ) = I(N ) − I(NL ) − I(NR ) n n
1
3000
●
> library ( rpart )
●
● ●
> CART
●
● ●
●
●
●
●
●
●
> pred _ CART = function (p , r ) {
●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
+ return ( predict ( CART , newdata =
●
● ● ●
●
●
5
●
●
1500
3
● ●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
+ data . frame ( PVENT =p , REPUL = r ) [ , " Survival " ]) }
500
6
● ●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
−→ we cut if ∆I(NL , NR )/I(N ) (relative gain) exceeds cp (complexity parameter, default 1%). @freakonometrics
29
http://www.ub.edu/riskcenter
Pruning Trees 1
3000
●
> library ( rpart )
●
● ●
> CART
● ● ●
●
● ●
●
●
●
●
●
●
> pred _ CART = function (p , r ) {
●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
+ return ( predict ( CART , newdata =
●
●
●
5
●
●
1500
3
● ●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
+ data . frame ( PVENT =p , REPUL = r ) [ , " Survival " ]) }
500
6
● ●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
See also 1
> library ( mvpart )
2
> ? prune
Define the missclassification rate of a tree R(tree)
@freakonometrics
30
http://www.ub.edu/riskcenter
Pruning Trees Given a cost-complexity parameter cp (see tunning parameter in Ridge-Lasso) define a penalized R(·) Rcp (tree) = R(tree) + cpktreek | {z } | {z } loss
complexity
If cp is small the optimal tree is large, if cp is large the optimal tree has no leaf, see Breiman et al. (1984). size of tree 2
3
7
9
2
> plotcp ( cart )
●
0.8
=3)
X−val Relative Error
> cart prune ( cart , cp =0.06) 0.4
3
Inf
0.27
0.06
0.024
0.013
cp
@freakonometrics
31
http://www.ub.edu/riskcenter
Bagging Bootstrapped Aggregation (Bagging) , is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification (Source: wikipedia). It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [boostrap]. The predictions from each separate model are combined together to provide a superior result [aggregation]. → can be used on any kind of model, but interesting for trees, see Breiman (1996) Boostrap can be used to define the concept of margin, B B 1 X 1 X margini = 1(b yi = yi ) − 1(b yi 6= yi ) B B b=1
b=1
Remark Probability that ith raw is not selection (1 − n−1 )n → e−1 ∼ 36.8%, cf training / validation samples (2/3-1/3) @freakonometrics
32
http://www.ub.edu/riskcenter
Bagging Trees 1
> margin for ( b in 1:1 e4 ) {
3
+ idx = sample (1: n , size =n , replace = TRUE )
4
> cart margin [j ,] .5) ! =
● ●
●
●
●●
●
●
●
●
( myocarde $ PRONO == " Survival " )
●
●
5
10
15
20
7
+ }
8
> apply ( margin , 2 , mean )
PVENT
@freakonometrics
33
http://www.ub.edu/riskcenter
Bagging Trees Interesting because of instability in CARTs (in terms of tree structure, not necessarily prediction)
@freakonometrics
34
http://www.ub.edu/riskcenter
Bagging and Variance, Bagging and Bias Assume that y = m(x) + ε. The mean squared error over repeated random samples can be decomposed in three parts Hastie et al. (2001) 2 2 b − E[(m(x)] b E[(Y − m(x)) b ] = |{z} σ + E[m(x)] b − m(x) + E m(x) {z } | {z } | 2
2
1
2
3
1 reflects the variance of Y around m(x) 2 is the squared bias of m(x) b 3 is the variance of m(x) b −→ bias-variance tradeoff. Boostrap can be used to reduce the bias, and he variance (but be careful of outliers)
@freakonometrics
35
http://www.ub.edu/riskcenter
1
3000
●
> library ( ipred )
●
● ●
> BAG
●
● ●
●
●
●
●
●
●
> pred _ BAG = function (p , r ) {
●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
+ return ( predict ( BAG , newdata =
●
● ● ●
●
●
5
●
●
1500
3
● ●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
+ data . frame ( PVENT =p , REPUL = r ) , type = " prob " ) [ ,2]) }
500
6
● ●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
@freakonometrics
36
http://www.ub.edu/riskcenter
Random Forests Strictly speaking, when boostrapping among observations, and aggregating, we use a bagging algorithm. In the random forest algorithm, we combine Breiman’s bagging idea and the random selection of features, introduced independently by Ho (1995)) and Amit & Geman (1997)) 1
3000
●
> library ( randomForest )
●
● ●
> RF
●
● ●
●
●
●
●
●
●
> pred _ RF = function (p , r ) {
●
● ●
●
● ●
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
+ return ( predict ( RF , newdata =
●
● ● ●
●
●
5
●
●
1500
3
● ●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
+ data . frame ( PVENT =p , REPUL = r ) , type = " prob " ) [ ,2]) }
500
6
● ●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
@freakonometrics
37
http://www.ub.edu/riskcenter
Random Forest At each node, select
√
k covariates out of k (randomly).
can deal with small n large k-problems Random Forest are used not only for prediction, but also to assess variable importance (discussed later on).
@freakonometrics
38
http://www.ub.edu/riskcenter
Support Vector Machine SVMs were developed in the 90’s based on previous work, from Vapnik & Lerner (1963), see Vailant (1984) Assume that points are linearly separable, i.e. there is ω and b such that +1 if ω T x + b > 0 Y = −1 if ω T x + b < 0 Problem: infinite number of solutions, need a good one, that separate the data, (somehow) far from the data. Concept : VC dimension. Let H : {h : Rd 7→ {−1, +1}}. Then H is said to shatter a set of points X is all dichotomies can be achieved. E.g. with those three points, all configurations can be achieved
@freakonometrics
39
http://www.ub.edu/riskcenter
Support Vector Machine
● ●
● ●
●
● ●
● ●
●
●
● ●
●
● ●
● ●
●
●
● ●
●
●
E.g. with those four points, several configurations cannot be achieved (with some linear separator, but they can with some quadratic one)
@freakonometrics
40
http://www.ub.edu/riskcenter
Support Vector Machine Vapnik’s (VC) dimension is the size of the largest shattered subset of X. This dimension is intersting to get an upper bound of the probability of miss-classification (with some complexity penalty, function of VC(H)). Now, in practice, where is the optimal hyperplane ? The distance from x0 to the hyperplane ω T x + b is ω T x0 + b d(x0 , Hω,b ) = kωk and the optimal hyperplane (in the separable case) is argmin min d(xi , Hω,b ) i=1,··· ,n
@freakonometrics
41
http://www.ub.edu/riskcenter
Support Vector Machine Define support vectors as observations such that |ω T xi + b| = 1 The margin is the distance between hyperplanes defined by support vectors. The distance from support vectors to Hω,b is kωk−1 , and the margin is then 2kωk−1 . −→ the algorithm is to minimize the inverse of the margins s.t. Hω,b separates ±1 points, i.e. 1 T min ω ω s.t. Yi (ω T xi + b) ≥ 1, ∀i. 2
@freakonometrics
42
http://www.ub.edu/riskcenter
Support Vector Machine Problem difficult to solve: many inequality constraints (n) −→ solve the dual problem... In the primal space, the solution was X X ω= αi Yi xi with αi Yi = 0. i=1
In the dual space, the problem becomes (hint: consider the Lagrangian) ) ( X X 1X T max αi αj Yi Yj xi xj s.t. αi Yi = 0. αi − 2 i=1 i=1 i=1 which is usually written 0 ≤ α ∀i 1 T i min α Qα − 1T α s.t. α yT α = 0 2 where Q = [Qi,j ] and Qi,j = yi yj xT i xj . @freakonometrics
43
http://www.ub.edu/riskcenter
Support Vector Machine Now, what about the non-separable case? Here, we cannot have yi (ω T xi + b) ≥ 1 ∀i. −→ introduce slack variables, ω T x + b ≥ +1 − ξ when y = +1 i i i ω T xi + b ≤ −1 + ξi when yi = −1 where ξi ≥ 0 ∀i. There is a classification error when ξi > 1. The idea is then to solve 1 T 1 T min ω ω + C1T 1ξ>1 , instead of min ω ω 2 2
@freakonometrics
44
http://www.ub.edu/riskcenter
Support Vector Machines, with a Linear Kernel So far, d(x0 , Hω,b ) = min {kx0 − xk`2 } x∈Hω,b
where k · k`2 is the Euclidean (`2 ) norm, kx0 − xk`2
●
●
> SVM2 library ( kernlab )
3000
1
p √ = (x0 − x) · (x0 − x) = x0 ·x0 − 2x0 ·x + x·x
●
myocarde ,
●
4
> pred _ SVM2 = function (p , r ) {
REPUL
+ prob . model = TRUE , kernel = " vanilladot " )
●
● ●
● ● ●
● ●
1500
3
2000
● ●
● ●
●
●
●
●
●
●
●
● ●
+ return ( predict ( SVM2 , newdata =
●
●
●
+ data . frame ( PVENT =p , REPUL = r ) , type = " probabilities " ) [ ,2]) }
●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
●
500
6
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
5
●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
@freakonometrics
45
http://www.ub.edu/riskcenter
Support Vector Machines, with a Non Linear Kernel More generally, d(x0 , Hω,b ) = min {kx0 − xkk } x∈Hω,b
where k · kk is some kernel-based norm, p kx0 − xkk = k(x0 ,x0 ) − 2k(x0 ,x) + k(x·x)
●
●
> SVM2 library ( kernlab )
3000
1
●
myocarde ,
●
4
> pred _ SVM2 = function (p , r ) {
REPUL
+ prob . model = TRUE , kernel = " rbfdot " )
●
● ●
● ● ●
● ●
1500
3
2000
● ●
● ●
●
●
●
●
●
●
●
● ●
+ return ( predict ( SVM2 , newdata =
●
●
●
+ data . frame ( PVENT =p , REPUL = r ) , type = " probabilities " ) [ ,2]) }
●
●
●
● ●
●
●
●
● ● ●
● ● ● ●
●
500
6
●
●
●
● ●
● ●
●
● ● ● ●
●
●
1000
5
●
● ● ●
0
●
●
● ●
5
●
●
10
15
20
PVENT
@freakonometrics
46
http://www.ub.edu/riskcenter
Heuristics on SVMs An interpretation is that data aren’t linearly seperable in the original space, but might be separare by some kernel transformation,
● ●● ● ●
● ●
● ●
●
● ● ● ●
●
● ●●● ● ● ● ● ●● ●● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ●●● ● ●● ●● ● ●● ● ● ● ● ●
@freakonometrics
●
●
● ● ●●●● ●
●
● ● ● ● ● ● ● ● ● ●● ● ● ● ●
●● ●● ●● ● ● ● ● ● ● ●
●
● ●● ● ● ● ●● ●●● ●● ● ●
●
● ●
●
●● ● ●●●● ●●
● ● ● ● ●●● ● ● ● ● ●
●
● ●
● ●
●
●
47
http://www.ub.edu/riskcenter
Still Hungry ? There are still several (machine learning) techniques that can be used for classification • Fisher’s Linear or Quadratic Discrimination (closely related to logistic regression, and PCA), see Fisher (1936)) X|Y = 0 ∼ N (µ0 , Σ0 ) and X|Y = 1 ∼ N (µ1 , Σ1 )
@freakonometrics
48
http://www.ub.edu/riskcenter
Still Hungry ? • Perceptron or more generally Neural Networks In machine learning, neural networks are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown. wikipedia, see Rosenblatt (1957) • Boosting (see next section) • Naive Bayes In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. wikipedia, see Russell & Norvig (2003) See also the (great) package 1
> library ( caret )
@freakonometrics
49
http://www.ub.edu/riskcenter
●
Promotion •
No Purchase
85.17%
61.60%
Purchase
14.83%
38.40%
0.8 0.6
●
●
0.4
●
●
● ● ● 0.0
Control
●
●
0.2
In many applications (e.g. marketing), we do need two models to analyze the impact of a treatment. We need two groups, a control and a treatment group. Data : {(xi , yi )} with yi ∈ {•, •} Data : {(xj , yj )} with yi ∈ {, } See clinical trials, treatment vs. control group E.g. direct mail campaign in a bank
1.0
Difference in Differences
0.0
0.2
0.4
0.6
0.8
1.0
overall uplift effect +23.57%, see Guelman et al. (2014) for more details.
@freakonometrics
50
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims Consider a (standard) logistic regression, on two covariate (age of driver, and age of camping-car) π = logit−1 (β0 + β1 x1 + β2 x2 ) 0.00
0.05
0.10
0.15
10
data = camping , family = binomial )
20
> reg _ glm = glm ( nombre ~ ageconducteur + agevehicule ,
0
1
Age du véhicule
30
40
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
20
30
40
50
60
70
80
Age du conducteur principal
@freakonometrics
51
90
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims Consider a (standard) logistic regression, on two covariate (age of driver, and age of camping-car), smoothed with splines π = logit−1 (β0 + s1 (x1 ) + s2 (x2 )) 0.00
0.05
0.10
0.15
10
agevehicule ) , data = camping , family = binomial )
20
> reg _ add = glm ( nombre ~ bs ( ageconducteur ) + bs (
0
1
Age du v?hicule
30
40
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
20
30
40
50
60
70
80
Age du conducteur principal
@freakonometrics
52
90
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims Consider a (standard) logistic regression, on two covariate (age of driver, and age of camping-car), smoothed with bivariate spline π = logit−1 (β0 + s(x1 ,x2 )) 0.00
0.05
0.10
0.15
2
> reg _ gam = gam ( nombre ~ s ( ageconducteur , agevehicule ) ,
0
10
data = camping , family = binomial )
20
> library ( mgcv ) Age du véhicule
1
30
40
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
20
30
40
50
60
70
80
Age du conducteur principal
@freakonometrics
53
90
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims One can also use k-Nearest Neighbours (k-NN) 0.00
0.05
0.10
0.15
3
> sv = sd ( camping $ agevehicule )
4
> knn = knn3 (( nombre ==1) ~ I ( ageconducteur / sc ) + I (
30
> sc = sd ( camping $ ageconducteur )
20
2
Age du véhicule
> library ( caret )
10
1
40
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0
agevehicule / sv ) , data = camping , k =100) 20
30
40
50
60
70
80
Age du conducteur principal
(be carefull about scaling problems)
@freakonometrics
54
90
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims We can also use a tree data = camping , cp =7 e -4)
0.00
0.05
0.10
0.15
0.00
0.10
0.15
40 30 0
10
20
Age du véhicule
30 20 10 20
30
40
50
60
70
Age du conducteur principal
@freakonometrics
0.05
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
40
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0
2
> tree = rpart (( nombre ==1) ~ ageconducteur + agevehicule ,
Age du véhicule
1
80
90
20
30
40
50
60
70
80
90
Age du conducteur principal
55
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims or bagging techniques (rather close to random forests)
0.00
0.05
0.10
0.15
> library ( ipred )
2
> bag = bagging (( nombre ==1) ~ ageconducteur +
> library ( randomForest )
4
> rf = randomForest (( nombre ==1) ~ ageconducteur + agevehicule , data = camping )
@freakonometrics
20 10
3
0
agevehicule , data = camping )
Age du véhicule
30
1
40
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
20
30
40
50
60
70
80
Age du conducteur principal
56
90
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims
> library ( dismo )
2
> library ( gbm )
3
> fit predict ( fit , type = " response " , n . trees =700)
0.00
4
0.05
learning . rate =0.001 , bag . fraction =0.5)
20
30
40
50
60
70
80
Age du conducteur principal
@freakonometrics
57
90
http://www.ub.edu/riskcenter
Application on Motor Insurance Claims
Boosting algorithms can also be considered (see next time) 1
> library ( dismo )
2
> library ( gbm )
0.05
0.10
0.15
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
> fit predict ( fit , type = " response " , n . trees =400)
0
4
20
learning . rate =0.01 , bag . fraction =0.5)
Age du véhicule
30
y =13 , family = " bernoulli " , tree . complexity =5 ,
20
30
40
50
60
70
80
Age du conducteur principal
@freakonometrics
58
90
http://www.ub.edu/riskcenter
Part 2. Regression
@freakonometrics
59
http://www.ub.edu/riskcenter
Regression? In statistics, regression analysis is a statistical process for estimating the relationships among variables [...] In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: wikipedia). Here regression is opposed to classification (as in the CART algorithm). y is either a continuous variable y ∈ R or a counting variable y ∈ N .
@freakonometrics
60
http://www.ub.edu/riskcenter
Regression? Parametrics, nonparametrics and machine learning In many cases in econometric and actuarial literature we simply want a good fit for the conditional expectation, E[Y |X = x]. Regression analysis estimates the conditional expectation of the dependent variable given the independent variables (Source: wikipedia). Example: A popular nonparametric technique, kernel based regression, P i Yi · Kh (X i − x) P m(x) b = i Kh (X i − x) In econometric litterature, interest on asymptotic normality properties and plug-in techniques. In machine learning, interest on out-of sample cross-validation algorithms.
@freakonometrics
61
http://www.ub.edu/riskcenter
Linear, Non-Linear and Generalized Linear
Linear Model: • (Y |X = x) ∼ N (θx , σ 2 ) • E[Y |X = x] = θx = xT β 1
> fit fit fit fit e for ( i in 1:100) {
3
+ W for ( i in 1:100) {
3
+ ind fit predict ( fit , newdata = data . frame ( X = x ) )
@freakonometrics
71
http://www.ub.edu/riskcenter
Regression Smoothers: Spline Functions
1
> fit predict ( fit , newdata = data . frame ( X = x ) )
see Generalized Additive Models.
@freakonometrics
72
http://www.ub.edu/riskcenter
Fixed Knots vs. Optimized Ones
1
> library ( f re ekn ot spli ne s )
2
> gen fit predict ( fit , newdata = data . frame ( X = x ) )
@freakonometrics
73
http://www.ub.edu/riskcenter
Interpretation of Penalty Unbiased estimators are important in mathematical statistics, but are they the best estimators ? Consider a sample, i.i.d., {y1 , · · · , yn } with distribution N (µ, σ 2 ). Define θb = αY . What is the optimal α? to get the best estimator of µ ? • bias: bias θb = E θb − µ = (α − 1)µ α2 σ 2 • variance: Var θb = n 2 2 α σ 2 2 b • mse: mse θ = (α − 1) µ + n ?
The optimal value is α =
µ2 2
µ2 +
@freakonometrics
σ n
< 1.
74
http://www.ub.edu/riskcenter
Linear Model Consider some linear model yi = xT i β + εi for all i = 1, · · · , n. Assume that εi are i.i.d. with E(ε) = 0 (and finite variance). Write β0 y1 1 x1,1 · · · x1,k ε1 β1 . .. . .. .. . . . . + . . = . . . . . .. . yn εn 1 xn,1 · · · xn,k βk {z } | {z } | | {z } | {z } y,n×1 ε,n×1 X,n×(k+1) β,(k+1)×1
Assuming ε ∼ N (0, σ 2 I), the maximum likelihood estimator of β is b = argmin{ky − X T βk` } = (X T X)−1 X T y β 2 ... under the assumtption that X T X is a full-rank matrix. T −1 b What if X T X T y does not exist, but i X cannot be inverted? Then β = [X X] b = [X T X + λI]−1 X T y always exist if λ > 0. β λ
@freakonometrics
75
http://www.ub.edu/riskcenter
Ridge Regression b = [X T X + λI]−1 X T y is the Ridge estimate obtained as solution The estimator β of n X 2 b = argmin β [yi − β0 − xT i β] + λ kβk`2 | {z } β i=1 1T β 2
for some tuning parameter λ. One can also write b = argmin {kY − X T βk` } β 2 β;kβk`2 ≤s
b = argmin {objective(β)} where Remark Note that we solve β β
objective(β) =
L(β) | {z }
training loss
@freakonometrics
+
R(β) | {z }
regularization
76
http://www.ub.edu/riskcenter
Going further on sparcity issues In severall applications, k can be (very) large, but a lot of features are just noise: βj = 0 for many j’s. Let s denote the number of relevent features, with s 0). Ici dim(β) = s. We wish we could solve b = argmin {kY − X T βk` } β 2 β;kβk`0 ≤s
Problem: it is usually not possible to describe all possible constraints, since s coefficients should be chosen here (with k (very) large). k Idea: solve the dual problem b= β
argmin β;kY
−X T βk
{kβk`0 }
`2 ≤h
where we might convexify the `0 norm, k · k`0 .
@freakonometrics
78
http://www.ub.edu/riskcenter
Regularization `0 , `1 et `2
@freakonometrics
79
http://www.ub.edu/riskcenter
Optimal LASSO Penalty Use cross validation, e.g. K-fold, b β (−k) (λ) = argmin
X
i6∈Ik
2 [yi − xT i β] + λkβk
then compute the sum of the squared errors, X 2 b Qk (λ) = [yi − xT i β (−k) (λ)] i∈Ik
and finally solve (
1 X λ = argmin Q(λ) = Qk (λ) K
)
?
k
Note that this might overfit, so Hastie, Tibshiriani & Friedman (2009) suggest the largest λ such that K X 1 Q(λ) ≤ Q(λ? ) + se[λ? ] with se[λ]2 = 2 [Qk (λ) − Q(λ)]2 K k=1
@freakonometrics
80
http://www.ub.edu/riskcenter
Going further on sparcity issues On [−1, +1]k , the convex hull of kβk`0 is kβk`1 On [−a, +a]k , the convex hull of kβk`0 is a−1 kβk`1 Hence, b = argmin {kY − X T βk` } β 2 β;kβk`1 ≤˜ s
is equivalent (Kuhn-Tucker theorem) to the Lagragian optimization problem b = argmin{kY − X T βk` +λkβk` } β 2 1
@freakonometrics
81
http://www.ub.edu/riskcenter
LASSO Least Absolute Shrinkage and Selection Operator b ∈ argmin{kY − X T βk` +λkβk` } β 2 1 is a convex problem (several algorithms? ), but not strictly convex (no unicity of b are unique b = xT β the minimum). Nevertheless, predictions y
?
MM, minimize majorization, coordinate descent Hunter (2003).
@freakonometrics
82
http://www.ub.edu/riskcenter
1
> freq = merge ( contrat , nombre _ RC )
2
> freq = merge ( freq , nombre _ DO )
3
> freq [ ,10]= as . factor ( freq [ ,10])
4
> mx = cbind ( freq [ , c (4 ,5 ,6) ] , freq [ ,9]== " D " ,
LASSO, third party
freq [ ,3]% in % c ( " A " ," B " ," C " ) ) 5
> colnames ( mx ) = c ( names ( freq ) [ c (4 ,5 ,6) ] , " diesel " ," zone " )
8
[1]
puissance agevehicule ageconducteur diesel
1
−10
−9
−8
−7
−6
−5
0.10
3
4
1
zone
−0.05
> names ( mx )
Coefficients
7
4
0.00
mx [ , i ]) ) / sd ( mx [ , i ])
4
0.05
> for ( i in 1: ncol ( mx ) ) mx [ , i ]=( mx [ , i ] - mean (
4
−0.10
6
4
10
> library ( glmnet )
−0.15
9
> fit = glmnet ( x = as . matrix ( mx ) , y = freq [ ,11] ,
−0.20
3
offset = log ( freq [ ,2]) , family = " poisson
5
Log Lambda
") 11
> plot ( fit , xvar = " lambda " , label = TRUE )
@freakonometrics
83
http://www.ub.edu/riskcenter
4
4 4
1
−0.05
0.00
0.10
3
−0.10
Coefficients
LASSO, third party
2
0.05
0
2
> cvfit = cv . glmnet ( x = as . matrix ( mx ) , y = freq
3
−0.15
> plot ( fit , label = TRUE )
−0.20
1
[ ,11] , offset = log ( freq [ ,2]) , family = "
5
0.0
0.1
0.2
0.4
L1 Norm
poisson " ) 3
0.3
> plot ( cvfit )
> log ( cvfit $ lambda . min )
7
[1] -8.16453
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
0.246
0.248
• Cross validation curve + error bars
0.256
6
0.254
[1] 0.0002845703
0.252
5
0.250
> cvfit $ lambda . min
Poisson Deviance
4
0.258
4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 2 1
−10
@freakonometrics
−9
−8
−7
log(Lambda)
−6
−5
84
http://www.ub.edu/riskcenter
1
> freq = merge ( contrat , nombre _ RC )
2
> freq = merge ( freq , nombre _ DO )
3
> freq [ ,10]= as . factor ( freq [ ,10])
4
> mx = cbind ( freq [ , c (4 ,5 ,6) ] , freq [ ,9]== " D " ,
LASSO, Fréquence DO
freq [ ,3]% in % c ( " A " ," B " ," C " ) ) 5
> colnames ( mx ) = c ( names ( freq ) [ c (4 ,5 ,6) ] , " > for ( i in 1: ncol ( mx ) ) mx [ , i ]=( mx [ , i ] - mean (
[1]
puissance agevehicule ageconducteur diesel
9 10
zone
> library ( glmnet ) > fit = glmnet ( x = as . matrix ( mx ) , y = freq [ ,12] , offset = log ( freq [ ,2]) , family = " poisson
2
1
1
−9
−8
−7
−6
−5
−4
4 1 5
−0.4
8
3
−0.6
> names ( mx )
Coefficients
7
4
−0.2
mx [ , i ]) ) / sd ( mx [ , i ])
4
−0.8
6
0.0
diesel " ," zone " )
2
Log Lambda
") 11
> plot ( fit , xvar = " lambda " , label = TRUE )
@freakonometrics
85
http://www.ub.edu/riskcenter
1
1
1
2
4
0.0
0
4
1
> plot ( fit , label = TRUE )
2
> cvfit = cv . glmnet ( x = as . matrix ( mx ) , y = freq
−0.4 −0.8
−0.6
LASSO, material
Coefficients
−0.2
1 5
2
[ ,12] , offset = log ( freq [ ,2]) , family = "
0.0
0.2
0.4
0.8
1.0
L1 Norm
poisson " ) 3
0.6
> plot ( cvfit )
6
> log ( cvfit $ lambda . min )
7
[1] -7.653266
●
●
●
● ● ● ●
0.215
• Cross validation curve + error bars
●
0.230
[1] 0.0004744917
0.225
5
●
0.220
> cvfit $ lambda . min
Poisson Deviance
4
0.235
4 4 4 4 3 3 3 3 3 2 2 1 1 1 1 1 1 1 1
● ● ●
●● ●● ●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
−9
@freakonometrics
−8
−7
−6
log(Lambda)
●
●
●
●
−5
−4
86
http://www.ub.edu/riskcenter
Some thoughts about Tuning parameters Regularization is a key issue in machine learning, to avoid overfitting. In (traditional) econometrics are based on plug-in methods: see Silverman bandwith rule in Kernel density estimation, 5 4b σ ∼ 1.06b σ n−1/5 . h? = 3n In machine learning literature, use on out-of-sample cross-validation methods for choosing amount of regularization.
@freakonometrics
87
http://www.ub.edu/riskcenter
Optimal LASSO Penalty Use cross validation, e.g. K-fold, X X 2 b β [yi − xT |βk | (−k) (λ) = argmin { i β] + λ
i6∈Ik
k
then compute the sum or the squared errors, X 2 b Qk (λ) = [yi − xT i β (−k) (λ)] i6∈Ik
and finally solve (
1 X λ = argmin Q(λ) = Qk (λ) K
)
?
k
Note that this might overfit, so Hastie, Tibshiriani & Friedman (2009) suggest the largest λ such that K X 1 Q(λ) ≤ Q(λ? ) + se[λ? ] with se[λ]2 = 2 [Qk (λ) − Q(λ)]2 K k=1
@freakonometrics
88
http://www.ub.edu/riskcenter
Big Data, Oracle and Sparcity Assume that k is large, and that β ∈ Rk can be partitioned as β = (β imp , β non-imp ), as well as covariates x = (ximp , xnon-imp ), with important and non-important variables, i.e. β non-imp ∼ 0. Goal : achieve variable selection and make inference of β imp Oracle property of high dimensional model selection and estimation, see Fan and Li (2001). Only the oracle knows which variables are important... k If sample size is large enough (n >> kimp 1 + log ) we can do inference as kimp if we knew which covariates were important: we can ignore the selection of covariates part, that is not relevant for the confidence intervals. This provides cover for ignoring the shrinkage and using regularstandard errors, see Athey & Imbens (2015).
@freakonometrics
89
http://www.ub.edu/riskcenter
Why Shrinkage Regression Estimates ? Interesting for model selection (alternative to peanlized criterions) and to get a good balance between bias and variance. In decision theory, an admissible decision rule is a rule for making a decisionsuch that there is not any other rule that is always better than it. When k ≥ 3, ordinary least squares are not admissible, see the improvement by James–Stein estimator.
@freakonometrics
90
http://www.ub.edu/riskcenter
Regularization and Scalability What if k is (extremely) large? never trust ols with more than five regressors (attributed to Zvi Griliches in Athey & Imbens (2015)) Use regularization techniques, see Ridge, Lasso, or subset selection ( n ) X X T 2 b = argmin β [yi − β0 − xi β] + λkβk`0 where kβk`0 = 1(βk 6= 0). β
@freakonometrics
i=1
k
91
http://www.ub.edu/riskcenter
Penalization and Splines In order to get a sufficiently smooth model, why not penalyse the sum of squares of errors, Z n X [yi − m(xi )]2 + λ [m00 (t)]2 dt i=1
for some tuning parameter λ. Consider some cubic spline basis, so that m(x) =
J X
θj Nj (x)
j=1
then the optimal expression for m is obtained using b = [N T N + λΩ]−1 N T y θ where N i,j is the matrix of Nj (X i )’s and Ωi,j =
@freakonometrics
R
Ni00 (t)Nj00 (t)dt 92
http://www.ub.edu/riskcenter
Smoothing with Multiple Regressors Actually n X
[yi − m(xi )]2 + λ
Z
[m00 (t)]2 dt
i=1
is based on some multivariate penalty functional, e.g. 2 Z X 2 Z X ∂ 2 m(t) 2 ∂ m(t) dt [m00 (t)]2 dt = +2 2 ∂ti ∂tj ∂ti i i,j
@freakonometrics
93
http://www.ub.edu/riskcenter
Regression Trees The partitioning is sequential, one covariate at a time (see adaptative neighbor estimation). Start with Q =
n X
[yi − y]2
i=1
For covariate k and threshold t, split the data according to {xi,k ≤ t} (L) or {xi,k > t} (R). Compute P P i,xi,k ≤t yi i,xi,k >t yi and y R = P yL = P i,xi,k ≤t 1 i,xi,k >t 1 and let (k,t)
mi
@freakonometrics
y if x ≤ t i,k L = y if xi,k > t R
94
http://www.ub.edu/riskcenter
Regression Trees Then compute (k ? , t? ) = argmin
( n X
) (k,t) 2
[yi − mi
]
, and partition the space
i=1 ?
intro two subspace, whether xk? ≤ t , or not. Then repeat this procedure, and minimize n X [yi − mi ]2 + λ · #{leaves}, i=1
(cf LASSO). One can also consider random forests with regression trees.
@freakonometrics
95
http://www.ub.edu/riskcenter
Local Regression
1
> W W W W library ( KernSmooth )
6
> library ( sp )
@freakonometrics
99
http://www.ub.edu/riskcenter
Local Regression : Kernel Based Smoothing
1
> library ( np )
2
> fit predict ( fit , newdata = data . frame ( X = x ) )
@freakonometrics
100
http://www.ub.edu/riskcenter
From Linear to Generalized Linear Models The (Gaussian) Linear Model and the logistic regression have been extended to the wide class of the exponential family, yθ − b(θ) f (y|θ, φ) = exp + c(y, φ) , a(φ) where a(·), b(·) and c(·) are functions, θ is the natural - canonical - parameter and φ is a nuisance parameter. The Gaussian distribution N (µ, σ 2 ) belongs to this family θ = µ , φ = σ 2 , a(φ) = φ, b(θ) = θ2 /2 | {z } | {z }
θ↔E(Y )
@freakonometrics
φ↔Var(Y )
101
http://www.ub.edu/riskcenter
From Linear to Generalized Linear Models The Bernoulli distribution B(p) belongs to this family p , a(φ) = 1, b(θ) = log(1 + exp(θ)), and φ = 1 θ = log 1−p {z } | θ=g? (E(Y ))
where the g? (·) is some link function (here the logistic transformation): the canonical link. Canonical links are 1
binomial ( link = " logit " )
2
gaussian ( link = " identity " )
3
Gamma ( link = " inverse " )
4
inverse . gaussian ( link = " 1 / mu ^2 " )
5
poisson ( link = " log " )
6
quasi ( link = " identity " , variance = " constant " )
7
quasibinomial ( link = " logit " )
8
quasipoisson ( link = " log " )
@freakonometrics
102
http://www.ub.edu/riskcenter
From Linear to Generalized Linear Models Observe that µ = E(Y ) = b0 (θ) and Var(Y ) = b00 (θ) · φ = b00 ([b0 ]−1 (µ)) · φ | {z }
variance function V (µ)
−→ distributions are characterized by this variance function, e.g. V (µ) = 1 for the Gaussian family (homoscedastic models), V (µ) = µ for the Poisson and V (µ) = µ2 for the Gamma distribution, V (µ) = µ3 for the inverse-Gaussian family. Note that g? (·) = [b0 ]−1 (·) is the canonical link. Tweedie (1984) suggested a power-type variance function V (µ) = µγ · φ. When γ ∈ [1, 2], then Y has a compound Poisson distribution with Gamma jumps. 1
> library ( tweedie )
@freakonometrics
103
http://www.ub.edu/riskcenter
From the Exponential Family to GLM’s So far, there no regression model. Assume that yi θi − b(θi ) f (yi |θi , φ) = exp + c(yi , φ) where θi = g?−1 (g(xT i β)) a(φ) so that the log-likelihood is L(θ, φ|y) =
n Y i=1
Pn f (yi |θi , φ) = exp
i=1
Pn
yi θ i − a(φ)
i=1
b(θi )
+
n X
! c(yi , φ) .
i=1
To derive the first order condition, observe that we can write ∂ log L(θ, φ|y i ) = ω i,j xi,j [yi − µi ] ∂β j for some ω i,j (see e.g. Müller (2004)) which are simple when g? = g.
@freakonometrics
104
http://www.ub.edu/riskcenter
From the Exponential Family to GLM’s The first order conditions can be writen X T W −1 [y − µ] = 0 which are first order conditions for a weighted linear regression model. As for the logistic regression, W depends on unkown β’s : use an iterative algorithm b 0 = y, θ 0 = g(b 1. Set µ µ0 ) and b 0 )g 0 (b z 0 = θ 0 + (y − µ µ0 ). Define W 0 = diag[g 0 (b µ0 )2 Var(b y )] and fit a (weighted) lineare regression of Z 0 on X, i.e. b = [X T W −1 X]−1 X T W −1 z 0 β 1 0 0 b , θ k = g(b b = Xβ 2. Set µ µ ) and k
k
k
b k )g 0 (b z k = θ k + (y − µ µk ). @freakonometrics
105
http://www.ub.edu/riskcenter
From the Exponential Family to GLM’s Define W k = diag[g 0 (b µk )2 Var(b y )] and fit a (weighted) lineare regression of Z k on X, i.e. T −1 −1 b β X T W −1 k+1 = [X W k X] k Zk b b and loop... until changes in β k+1 are (sufficiently) small. Then set β = β ∞ P b→ Under some technical conditions, we can prove that β β and
√
L
b − β) → N (0, I(β)−1 ). n(β
where numerically I(β) = φ · [X T W −1 ∞ X]).
@freakonometrics
106
http://www.ub.edu/riskcenter
From the Exponential Family to GLM’s We estimate (see linear regression estimation) φ by n
2 X 1 [y − µ b ] i i φb = ω i,i n − dim(X) i=1 Var(b µi )
This asymptotic expression can be used to derive confidence intervals, or tests. But is might be a poor approximation when n is small. See use of boostrap in claims reserving. Those are theorerical results: in practice, the algorithm may fail to converge
@freakonometrics
107
http://www.ub.edu/riskcenter
GLM’s outside the Exponential Family? Actually, it is possible to consider more general distributions, see Yee (2014)) 1
> library ( VGAM )
2
> vglm ( y ~ x , family = Makeham )
3
> vglm ( y ~ x , family = Gompertz )
4
> vglm ( y ~ x , family = Erlang )
5
> vglm ( y ~ x , family = Frechet )
6
> vglm ( y ~ x , family = pareto1 ( location =100) )
Those functions can also be used for a multivariate response y
@freakonometrics
108
http://www.ub.edu/riskcenter
GLM: Link and Distribution
@freakonometrics
109
http://www.ub.edu/riskcenter
GLM: Distribution? From a computational point of view, the Poisson regression is not (really) related to the Poisson distribution. Here we solve the first order conditions (or normal equations) X
[Yi − exp(X T i β)]Xi,j = 0 ∀j
i
with unconstraint β, using Fisher’s scoring technique β k+1 = β k − H −1 k ∇k where H k = −
X i
T exp(X T β )X X i i k i
and ∇k =
X
T XT [Y − exp(X i i i β k )]
i
−→ There is no assumption here about Y ∈ N: it is possible to run a Poisson regression on non-integers.
@freakonometrics
110
http://www.ub.edu/riskcenter
The Exposure and (Annual) Claim Frequency In General Insurance, we should predict blueyearly claims frequency. Let Ni denote the number of claims over one year for contrat i. We did observe only the contract for a period of time Ei Let Yi denote the observed number of claims, over period [0, Ei ].
@freakonometrics
111
http://www.ub.edu/riskcenter
The Exposure and (Annual) Claim Frequency Assuming that claims occurence is driven by a Poisson process of intensity λ, if Ni ∼ P(λ), then Yi ∼ P(λ · Ei ), where N is the annual frequency. L(λ, Y , E) =
n Y e−λEi [λEi ]Yi i=1
Yi !
the first order condition is n n X ∂ 1X log L(λ, Y , E) = − Ei + Yi = 0 ∂λ λ i=1 i=1
for
@freakonometrics
Pn n X Y Yi Ei i b = Pni=1 = ωi where ωi = Pn λ Ei i=1 Ei i=1 Ei i=1
112
http://www.ub.edu/riskcenter
The Exposure and (Annual) Claim Frequency Assume that Yi ∼ P(λi · Ei ) where λi = exp[X T i β]. Here E(Yi |X i ) = Var(Yi |X i ) = λi = exp[X T i β + log Ei ]. log L(β; Y ) =
n X
Yi · [X T i β + log Ei ] − (exp[X i β] + log Ei ) − log(Yi !)
i=1
1
> model model gbm . step ( data = myocarde , gbm . x = 1:7 , gbm . y = 8 ,
0.6
0.7
1
1.0
1.1
PRONO01, d − 5, lr − 0.01
200
400
600
800
1000
no. of trees
@freakonometrics
123
http://www.ub.edu/riskcenter
Exponential distribution, deviance, loss function, residuals, etc • Gaussian distribution ←→ `2 loss function n X Deviance is (yi − m(xi ))2 , with gradient εbi = yi − m(xi ) i=1
• Laplace distribution ←→ `1 loss function Deviance is
n X
|yi − m(xi ))|, with gradient εbi = sign(yi − m(xi ))
i=1
@freakonometrics
124
http://www.ub.edu/riskcenter
Exponential distribution, deviance, loss function, residuals, etc • Bernoullli {−1, +1} distribution ←→ `adaboost loss function Deviance is
n X
e−yi m(xi ) , with gradient εbi = −yi e−[yi ]m(xi )
i=1
• Bernoullli {0, 1} distribution n X Deviance 2 [yi · log i=1
εbi = yi −
yi m(xi )
(1 − yi ) log
1 − yi 1 − m(xi )
with gradient
exp[m(xi )] 1 + exp[m(xi )]
• Poisson distribution n X Deviance 2 yi · log i=1
@freakonometrics
yi m(xi )
yi − m(xi ) − [yi − m(xi )] with gradient εbi = p m(xi ) 125
http://www.ub.edu/riskcenter
Regularized GLM In Regularized GLMs, we introduced a penalty in the loss function (the deviance), see e.g. `1 regularized logistic regression n k X X T β0 +xi β max yi [β0 + xT ]] − λ |βj | i β − log[1 + e i=1
j=1
7
> glm _ ridge plot ( lm _ ridge )
6
4
4
> x y library ( glmnet )
7
−4
1
7
0
L1 Norm
@freakonometrics
126
http://www.ub.edu/riskcenter
Collective vs. Individual Model Consider a Tweedie distribution, with variance function power p ∈ (0, 1), mean µ and scale parameter φ, then it is a compound Poisson model, φµ2−p • N ∼ P(λ) with λ = 2−p p−2 φµ1−p • Yi ∼ G(α, β) with α = − and β = p−1 p−1 Consversely, consider a compound Poisson model N ∼ P(λ) and Yi ∼ G(α, β), α+2 • variance function power is p = α+1 λα • mean is µ = β • scale parameter is φ =
[λα]
α+2 α+1 −1
β α+1
2− α+2 α+1
seems to be equivalent... but it’s not. @freakonometrics
127
http://www.ub.edu/riskcenter
Collective vs. Individual Model In the context of regression Ni ∼ P(λi ) with λi = exp[X T i βλ ] Yj,i ∼ G(µi , φ) with µi = exp[X T i βµ ] Then Si = Y1,i + · · · + YN,i has a Tweedie distribution • variance function power is p =
φ+2 φ+1
• mean is λi µi 1 φ+1 −1
• scale parameter is
λi
φ φ+1
µi
φ 1+φ
There are 1 + 2dim(X) degrees of freedom. @freakonometrics
128
http://www.ub.edu/riskcenter
Collective vs. Individual Model Note that the scale parameter should not depend on i. A Tweedie regression is • variance function power is p =∈ (0, 1) • mean is µi = exp[X T i β Tweedie ] • scale parameter is φ There are 2 + dim(X) degrees of freedom. Note that oone can easily boost a Tweedie model 1
> library ( TDboost )
@freakonometrics
129
http://www.ub.edu/riskcenter
Part 3. Model Choice, Feature Selection, etc.
@freakonometrics
130
http://www.ub.edu/riskcenter
AIC, BIC AIC and BIC are both maximum likelihood estimate driven and penalize useless parameters(to avoid overfitting) AIC = −2 log[likelihood] + 2k and BIC = −2 log[likelihood] + log(n)k AIC focus on overfit, while BIC depends on n so it might also avoid underfit BIC penalize complexity more than AIC does. Minimizing AIC ⇔ minimizing cross-validation value, Stone (1977). Minimizing BIC ⇔ k-fold leave-out cross-validation, Shao (1997), with k = n[1 − (log n − 1)] −→ used in econometric stepwise procedures
@freakonometrics
131
http://www.ub.edu/riskcenter
Cross-Validation Formally, the leave-one-out cross validation is based on n
1X CV = `(yi , m b −i (xi )) n i=1 where m b −i is obtained by fitting the model on the sample where observation i has been dropped, e.g. n 1X 2 CV = [yi , m b −i (xi )] n i=1 The Generalized cross-validation, for a quadratic loss function, is defined as 2 n b −i (xi ) 1 X yi − m GCV = n i=1 1 − trace(S)/n
@freakonometrics
132
http://www.ub.edu/riskcenter
Cross-Validation for kernel based local regression
1
(β0 ,β1 )
2
● ● ●
● ●
●
1
● ●
● ●
●
● ● ●
●
● ● ●
● ●
●
● ● ● ●
● ●
● ●
●
● ● ●
●
@freakonometrics
●
● ● ●
●
● ● ● ● ● ●● ● ●●
●
●
●
● ● ● ●●●● ●
i=1
● ●●
●
●
●
● ● ● ● ● ●
● ● ●
−2
where h? is given by some rule of thumb (see previous discussion).
●
● ●
● ●
●
● ●
●
●
h
● ●
● ●● ● ● ● ●
−1
0
● ●
0
Econometric approach [x] [x] Define m(x) b = βb0 + βb1 x with ( n ) X [x] [x] [x] ω ? [yi − (β0 + β1 xi )]2 (βb , βb ) = argmin
0
2
4
6
8
● ●
10
133
http://www.ub.edu/riskcenter
Cross-Validation for kernel based local regression Bootstrap based approach
12
Use bootstrap samples, compute h?b , and get m b b (x)’s. 2
● ● ●
● ●
●
●
● ●
●
●
●
10
● ● ●
● ●
● ●
●
● ● ●
●
●
●
0
●
● ● ●
● ●
●
● ●
● ● ●● ● ● ● ● ● ● ● ● ●
8
●
●
1
● ●
●
● ● ●
●
6
● ●
●
● ●
●
● ● ● ●● ● ●●
●
●
4
● ● ● ●●●● ●
−1
● ●●
●
●
●
−2 0
@freakonometrics
● ● ●
2
4
6
8
● ●
0
● ●
2
● ● ● ●
10
0.85
0.90
0.95
1.00
1.05
1.10
1.15
1.20
134
http://www.ub.edu/riskcenter
Cross-Validation for kernel based local regression Statistical learning approach (Cross Validation (leave-one-out)) Given j ∈ {1, · · · , n}, given h, solve [(i),h]
(βb0
[(i),h]
, βb1
) = argmin (β0 ,β1 )
X
(i)
ωh [Yj − (β0 + β1 xj )]2
j6=i
[(i),h] [h] [(i),h] xi . Define + βb1 and compute m b (i) (xi ) = βb0
mse(h) =
n X
[h]
[yi − m b (i) (xi )]2
i=1
and set h? = argmin{mse(h)}. [x] [x] Then compute m(x) b = βb0 + βb1 x with ( n ) X [x] [x] b[x] b ω ? [yi − (β0 + β1 xi )]2 (β , β ) = argmin 0
@freakonometrics
1
(β0 ,β1 )
h
i=1
135
http://www.ub.edu/riskcenter
Cross-Validation for kernel based local regression
@freakonometrics
136
http://www.ub.edu/riskcenter
Cross-Validation for kernel based local regression Statistical learning approach (Cross Validation (k-fold)) Given I ∈ {1, · · · , n}, given h, solve [(I),h]
(βb0
[xi ,h]
, βb1
) = argmin (β0 ,β1 )
X
(I)
ωh [yj − (β0 + β1 xj )]2
j ∈I /
[(i),h] [h] [(i),h] xi , ∀i ∈ I. Define + βb1 and compute m b (I) (xi ) = βb0 XX [h] mse(h) = [yi − m b (I) (xi )]2 I
i∈I
and set h? = argmin{mse(h)}. [x] [x] Then compute m(x) b = βb0 + βb1 x with ( n ) X [x] [x] [x] (βb , βb ) = argmin ω ? [yi − (β0 + β1 xi )]2 0
@freakonometrics
1
(β0 ,β1 )
h
i=1
137
http://www.ub.edu/riskcenter
Cross-Validation for kernel based local regression
@freakonometrics
138
http://www.ub.edu/riskcenter
Cross-Validation for Ridge & Lasso
3
> x cvfit cvfit $ lambda . min
1.2
> y library ( glmnet )
0.6
1
Binomial Deviance
1.4
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
7
●●●●●●●●●●● ●●●●●●●●● ●●●●● ●●●● ●●● ●● ●● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ●●● ●●●●●
−2
0
2
4
6
log(Lambda)
> cvfit cvfit $ lambda . min
12
[1] 0.03315514
13
> plot ( cvfit )
7
7
6
6
6
6
5
5
6
5
4
4
3
3
2
1
4
9
7
3
> plot ( cvfit )
●●● ●●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ●
2
8
Binomial Deviance
[1] 0.0408752
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ●●● ● ●●● ● ●●●● ● ●●● ● ●● ●● ●●● ●● ●●● ●● ●●● ●●●●●●●●●●●●●●●●●
1
7
−10
−8
−6
−4
−2
log(Lambda)
@freakonometrics
139
http://www.ub.edu/riskcenter
Variable Importance for Trees 1 X X Nt Given some random forest with M trees, set I(Xk ) = ∆i(t) M m t N where the first sum is over all trees, and the second one is over all nodes where the split is done based on variable Xk . 1
> RF = randomForest ( PRONO ~ . , data = myocarde )
2
> varImpPlot ( RF , main = " " )
3
> importance ( RF ) INSYS
M ea nDe cr eas eGin i
4 5
FRCAR
1.107222
6
INCAR
8.194572
7
INSYS
9.311138
8
PRDIA
2.614261
9
PAPUL
2.341335
10
PVENT
3.313113
11
REPUL
7.078838
●
INCAR
●
REPUL
●
PVENT
●
PRDIA
@freakonometrics
●
PAPUL
●
FRCAR
●
0
2
4
6
8
MeanDecreaseGini
140
http://www.ub.edu/riskcenter
Partial Response Plots One can also compute Partial Response Plots, n
1 Xb E[Y |Xk = x, X i,(k) = xi,(k) ] x 7→ n i=1
1
> i mpo rt an ceO rd er names for ( name in names )
4
+ partialPlot ( RF , myocarde , eval ( name ) , col = " red " , main = " " , xlab = name )
@freakonometrics
141
http://www.ub.edu/riskcenter
Feature Selection Use Mallow’s Cp , from Mallow (1974) on all subset of predictors, in a regression n 1 X Cp = 2 [Yi − Ybi ]2 − n + 2p, S i=1
1
> library ( leaps )
2
> y x selec = leaps (x , y , method = " Cp " )
5
> plot ( selec $ size -1 , selec $ Cp )
@freakonometrics
142
http://www.ub.edu/riskcenter
Feature Selection Use random forest algorithm, removing some features at each iterations (the less relevent ones). The algorithm uses shadow attributes (obtained from existing features by shuffling the values). 20
> library ( Boruta )
2
> B plot ( B ) ●
@freakonometrics
INSYS
INCAR
REPUL
PRDIA
PAPUL
5
PVENT
3
143
http://www.ub.edu/riskcenter
Feature Selection Use random forests, and variable importance plots 1
> library ( varSelRFBoot )
2
> X Y library ( randomForest )
5
> rf VB plot ( VB )
@freakonometrics
Number of variables
144
8
7
6
0.00 2
FALSE )
0.05
5
> V library ( randomForest )
2
> fit = randomForest ( PRONO ~ . , data = train _ myocarde )
3
> train _ Y =( train _ myocarde $ PRONO == " Survival " )
4
> test _ Y =( test _ myocarde $ PRONO == " Survival " )
5
> train _ S = predict ( fit , type = " prob " , newdata = train _ myocarde ) [ ,2]
6
> test _ S = predict ( fit , type = " prob " , newdata = test _ myocarde ) [ ,2]
7
> vp = seq (0 ,1 , length =101)
8
> roc _ train = t ( Vectorize ( function ( u ) roc . curve ( train _Y , train _S , s = u ) ) ( vp ) )
9
> roc _ test = t ( Vectorize ( function ( u ) roc . curve ( test _Y , test _S , s = u ) ) ( vp ) )
10
> plot ( roc _ train , type = " b " , col = " blue " , xlim =0:1 , ylim =0:1)
@freakonometrics
146
http://www.ub.edu/riskcenter
Comparing Classifiers: ROC Curves The Area Under the Curve, AUC, can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, see Swets, Dawes & Monahan (2000) Many other quantities can be computed, see 1
> library ( hmeasures )
2
> HMeasure (Y , S ) $ metrics [ ,1:5]
3
Class labels have been switched from ( Death , Survival ) to (0 ,1) H
4 5
Gini
AUC
AUCH
KS
scores 0.7323154 0.8834154 0.9417077 0.9568966 0.8144499
with the H-measure (see hmeasure), Gini and AUC, as well as the area under the convex hull (AUCH).
@freakonometrics
147
http://www.ub.edu/riskcenter
Comparing Classifiers: ROC Curves Consider our previous logistic regression (on heart attacks) 1.0
> logistic S Y library ( ROCR )
2
> pred perf plot ( perf )
0.0
1
@freakonometrics
0.0
0.2
0.4
0.6
0.8
1.0
False positive rate
148
http://www.ub.edu/riskcenter
Comparing Classifiers: ROC Curves
One can get confidence bands (obtained using bootstrap procedures),
> library(pROC)
> roc <- plot.roc(Y, S, main = "", percent = TRUE, ci = TRUE)
> roc.se <- ci.se(roc, specificities = seq(0, 100, 5))        # arguments partly assumed
> plot(roc.se, type = "shape", col = "light blue")
[Figure: ROC curve with bootstrapped confidence band, Sensitivity (%) vs Specificity (%)]
See also the gains package, for Gains and Lift curves,
> library(gains)
@freakonometrics
149
http://www.ub.edu/riskcenter
Comparing Classifiers: Accuracy and Kappa
The Kappa statistic κ compares an Observed Accuracy with an Expected Accuracy (random chance), see Landis & Koch (1977).

           Y = 0    Y = 1
 Ŷ = 0     TN       FN       TN+FN
 Ŷ = 1     FP       TP       FP+TP
           TN+FP    FN+TP    n

See also the Observed and Random (expected) Confusion Tables,

 Observed     Y = 0  Y = 1           Random      Y = 0   Y = 1
 Ŷ = 0          25      3     28     Ŷ = 0       11.44   16.56     28
 Ŷ = 1           4     39     43     Ŷ = 1       17.56   25.44     43
                29     42     71                    29      42     71

    total accuracy = (TP + TN) / n ≈ 90.14%
    random accuracy = ([TN+FP] · [TN+FN] + [FN+TP] · [FP+TP]) / n² ≈ 51.93%
    κ = (total accuracy - random accuracy) / (1 - random accuracy) ≈ 79.48%
@freakonometrics
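As a numerical cross-check, a minimal sketch computing these three quantities in R from the observed confusion table above:
> TN <- 25; FN <- 3; FP <- 4; TP <- 39; n <- TN + FN + FP + TP
> total_accuracy  <- (TP + TN) / n                                          # ~ 0.9014
> random_accuracy <- ((TN + FP) * (TN + FN) + (FN + TP) * (FP + TP)) / n^2  # ~ 0.5193
> kappa <- (total_accuracy - random_accuracy) / (1 - random_accuracy)       # ~ 0.7948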
150
http://www.ub.edu/riskcenter
Comparing Models on the myocarde Dataset
@freakonometrics
151
http://www.ub.edu/riskcenter
Comparing Models on the myocarde Dataset
[Figure: comparison of the models (rf, gbm, boost, nn, bag, tree, svm, knn, aic, glm, loess) on the myocarde dataset]
If we average over all training samples.
@freakonometrics
152
http://www.ub.edu/riskcenter
Gini and Lorenz Type Curves
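A minimal sketch of a Lorenz-type (concentration) curve for a score S and a 0/1 outcome Y (the construction below is an assumption, for illustration): order the observations by decreasing score and plot the share of Y = 1 cases captured against the share of observations retained.
> lorenz_curve <- function(S, Y) {
+   idx <- order(S, decreasing = TRUE)
+   x <- (1:length(S)) / length(S)
+   y <- cumsum(Y[idx]) / sum(Y)
+   plot(x, y, type = "l", xlab = "proportion of observations (by decreasing score)",
+        ylab = "proportion of Y = 1 captured")
+   abline(0, 1, lty = 2)     # diagonal: non-informative score
+ }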
[Map: Italian regions, Abruzzo, Apulia, Basilicata, Calabria, ...]
@freakonometrics
164
http://www.ub.edu/riskcenter
Maps and Polygons
> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)
> abline(v = 5:20, col = "light blue")
> abline(h = 35:50, col = "light blue")
> axis(1)
> axis(2)
[Map: Italy with Sicily highlighted, longitude (5 to 20) / latitude (36 to 46) grid]
@freakonometrics
165
http://www.ub.edu/riskcenter
Maps and Polygons
> pos <- which(ita1$NAME_1 %in% southern_regions)   # southern_regions: vector of region names (assumed)
> ita_south <- ita1[pos, ]
> ita_north <- ita1[-pos, ]
> plot(ita1)
> plot(ita_south, col = "red", add = TRUE)
> plot(ita_north, col = "blue", add = TRUE)
@freakonometrics
166
http://www.ub.edu/riskcenter
Maps and Polygons
> library(xlsx)
> data_codes <- read.xlsx(...)                      # file with region codes (arguments not shown)
> names(data_codes)[1] = "NAME_1"
> ita2 <- merge(ita1, data_codes)                   # merge assumed to be on NAME_1
> library(rgeos)
> ita_s <- gUnionCascaded(ita_south)                # assumed: dissolve the southern regions into one polygon
> ita_n <- gUnionCascaded(ita_north)
> plot(ita1)
> plot(ita_s, col = "red", add = TRUE)
> plot(ita_n, col = "blue", add = TRUE)
@freakonometrics
168
http://www.ub.edu/riskcenter
Maps and Polygons
On polygons, it is also possible to visualize centroids,
> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)
> gCentroid(Poly_Sicily, byid = TRUE)
SpatialPoints:
          x        y
14 14.14668 37.58842
Coordinate Reference System (CRS) arguments:
+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0
> points(gCentroid(Poly_Sicily, byid = TRUE), pch = 19, col = "red")
@freakonometrics
169
http://www.ub.edu/riskcenter
Maps and Polygons
or
> G <- as.data.frame(gCentroid(ita1, byid = TRUE))   # centroid coordinates (construction assumed)
> plot(ita1, col = "light green")
> text(G$x, G$y, 1:20)
[Map: Italy with the 20 region centroids labelled]
@freakonometrics
170
http://www.ub.edu/riskcenter
Maps and Polygons
Consider two trajectories, characterized by a series of knots (from the centroid list),
> c1 <- ...                                           # first sequence of centroids (not shown)
> c2 <- ...                                           # second sequence of centroids (not shown)
> plot(cross_road, pch = 19, cex = 1.5, add = TRUE)   # cross_road: intersection of the two paths (assumed)
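A minimal sketch of such a construction with sp and rgeos (the centroid indices are arbitrary, and the use of gIntersection to get the crossing point is an assumption):
> library(sp); library(rgeos)
> c1 <- G[c(17, 20, 6, 16), ]                         # two paths through some centroids
> c2 <- G[c(13, 11, 1, 5), ]
> path1 <- SpatialLines(list(Lines(list(Line(as.matrix(c1))), ID = "a")))
> path2 <- SpatialLines(list(Lines(list(Line(as.matrix(c2))), ID = "b")))
> cross_road <- gIntersection(path1, path2)           # crossing point(s), possibly NULL
> plot(ita1, col = "light green")
> lines(path1); lines(path2)
> if (!is.null(cross_road)) plot(cross_road, pch = 19, cex = 1.5, add = TRUE)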
@freakonometrics
172
http://www.ub.edu/riskcenter
Maps and Polygons
To add elements on maps, consider, e.g.
> plot(ita1, col = "light green")
> grat <- ...                                          # graticule / grid to overlay (not shown)
To locate a point among the Paris polygons (poly_paris, one polygon per identifier PID), define
> point_in_i = function(i, point)
+   point.in.polygon(point[1], point[2], poly_paris[poly_paris$PID == i, "X"], poly_paris[poly_paris$PID == i, "Y"])
> where_is_point = function(point)
+   which(Vectorize(function(i) point_in_i(i, point))(1:length(paris)) > 0)
@freakonometrics
178
http://www.ub.edu/riskcenter
Maps, with R
> where_is_point(point)
[1] 100
This search can be sped up by first looking at bounding boxes,
> library(RColorBrewer)
> plotclr <- ...                                       # colour palette (not shown)
> is.box = function(i, loc) {
+   box_i = minmax(i)                                  # minmax(i): bounding box of polygon i (assumed helper)
+   ((loc[1] >= box_i[1]) & (loc[1] <= box_i[2]) &
+    (loc[2] >= box_i[3]) & (loc[2] <= box_i[4])) }
> which.box = function(loc) which(Vectorize(function(i) is.box(i, loc))(1:length(paris)))
> which.box(point)
[1]   1 100 101
> vizualize_point <- function(point) { ... }           # plot the polygons around a point (body not shown)
> plot(sub_poly[, c("X", "Y")], col = "white")
> polygon(poly_paris[poly_paris$PID == 1,   c("X", "Y")], col = plotclr[2])
> polygon(poly_paris[poly_paris$PID == 100, c("X", "Y")], col = plotclr[1])
> polygon(poly_paris[poly_paris$PID == 101, c("X", "Y")], col = plotclr[2])
[Figure: the three neighbouring polygons around the point, X ≈ 2.32 to 2.345, Y ≈ 48.854 to 48.864]
@freakonometrics
182
http://www.ub.edu/riskcenter
Maps, with R
Finally use
> which.poly = function(point) {
+   idx_box = which.box(point)
+   idx_valid = NULL
+   for (i in idx_box) {
+     if (point_in_i(i, point) > 0) idx_valid = c(idx_valid, i) }
+   return(idx_valid) }
to identify the IRIS polygon,
> which.poly(point)
[1] 100
@freakonometrics
183
http://www.ub.edu/riskcenter
Maps, with R
> vizualize_point(MAIF)
Addresses can be geocoded via the Google Maps API,
> library(ggmap)
> MAIF <- geocode("200 avenue Salvador Allende, 79000 Niort, France")
Information from URL : http://maps.googleapis.com/maps/api/geocode/...
> Rennes <- geocode("10-11 Place Hoche, 35000 Rennes, France")
and the geodesic distance between the two locations is
> library(geosphere)
> distHaversine(MAIF, Rennes, r = 6378.137)
[1] 217.9371
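As a sanity check on this value, a minimal sketch of the Haversine formula itself (points given as (longitude, latitude) in degrees, radius in km; a direct re-implementation, not the geosphere code):
> haversine <- function(p1, p2, r = 6378.137) {
+   to_rad <- pi / 180
+   dlon <- (p2[1] - p1[1]) * to_rad
+   dlat <- (p2[2] - p1[2]) * to_rad
+   a <- sin(dlat / 2)^2 + cos(p1[2] * to_rad) * cos(p2[2] * to_rad) * sin(dlon / 2)^2
+   2 * r * asin(sqrt(a))
+ }
> # with approximate coordinates, haversine(c(-0.46, 46.33), c(-1.68, 48.11)) is close to 218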
@freakonometrics
185
http://www.ub.edu/riskcenter
(in km), while the driving distance is
> mapdist(as.numeric(MAIF), as.numeric(Rennes), mode = 'driving')
by using this function you are agreeing to the terms at :
http://code.google.com/apis/maps/documentation/distancematrix/
                                              from                                       to      m      km   miles seconds minutes    hours
1 200 Avenue Salvador Allende, 79000 Niort, France 10-11 Place Hoche, 35000 Rennes, France 257002 257.002 159.701    9591  159.85 2.664167
@freakonometrics
186
http://www.ub.edu/riskcenter
Visualizing Maps, via Google Maps
> Loc = data.frame(rbind(MAIF, Rennes))
> library(ggmap)
> library(RgoogleMaps)
> CenterOfMap <- ...                                   # centre of the map (not shown)
> W <- get_map(CenterOfMap, zoom = 8)                  # assumed call to ggmap::get_map
> WMap <- ggmap(W)
> WMap
or
> W <- get_map(...)                                    # another map type (arguments not shown)
> ggmap(W)
[Map: area between Niort and Rennes, lon ≈ -2 to 0, lat ≈ 47.5 to 48]
@freakonometrics
187
http://www.ub.edu/riskcenter
Visualizing Maps, via Google Maps
From a technical point of view, those are ggplot2 maps, see the ggmap cheatsheet,
> W <- get_map(...)                                    # arguments not shown
> ggmap(W)
or
> W <- get_map(...)
> ggmap(W)
[Map: Singapore area, lon ≈ 103.7 to 103.9, lat ≈ 1.2 to 1.5]
@freakonometrics
188
http://www.ub.edu/riskcenter
Visualizing Maps, via Google Maps
> Paris <- get_map("Paris")                            # assumed call
> ParisMap <- ggmap(Paris)
> library(maptools)
> library(rgdal)
> paris = readShapeSpatial("paris-cartelec.shp")
> proj4string(paris) <- CRS(...)                       # native projection of the shapefile (not shown)
> paris <- spTransform(paris, CRS("+proj=longlat +datum=WGS84"))    # assumed reprojection
> ParisMapPlus <- ParisMap + ...                       # overlay the polygons on the map (call not shown)
> ParisMapPlus
[Map: Paris (lon ≈ 2.25 to 2.45, lat ≈ 48.81 to 48.87) with the paris-cartelec polygons overlaid]
@freakonometrics
189
http://www.ub.edu/riskcenter
OpenStreet Map
> library(OpenStreetMap)
> map <- openmap(c(lat1, lon1), c(lat2, lon2))         # upper-left / lower-right corners (values not shown)
> map <- openproj(map)                                 # assumed reprojection to longitude/latitude
> plot(map)
@freakonometrics
190
http://www.ub.edu/riskcenter
OpenStreet Map
> library(OpenStreetMap)
> map <- openmap(c(lat1, lon1), c(lat2, lon2), type = ...)   # another tile style (values not shown)
> plot(map)
It is a standard R plot, so we can add points on that graph.
@freakonometrics
191
http://www.ub.edu/riskcenter
OpenStreet Map
> library(maptools)
> deaths <- readShapePoints("Cholera_Deaths")          # shapefile of the deaths (file name assumed)
> head(deaths@coords)
  coords.x1 coords.x2
0  529308.7  181031.4
1  529312.2  181025.2
2  529314.4  181020.3
3  529317.4  181014.3
4  529320.7  181007.9
...
> points(deaths@coords, col = "red", pch = 19, cex = .7)
[Map: OpenStreetMap background with the locations of the deaths as red points]
Cholera outbreak, in London, 1854, dataset collected by John (not Jon) Snow
@freakonometrics
192
http://www.ub.edu/riskcenter
OpenStreet Map
> X <- deaths@coords
> library(KernSmooth)
> kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))
> library(grDevices)
> clrs <- ...                                          # semi-transparent colour ramp (definition not shown)
> image(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat, add = TRUE, col = clrs)
[Map: kernel density estimate of the deaths overlaid on the OpenStreetMap background]
@freakonometrics
193
http://www.ub.edu/riskcenter
OpenStreet Map
We can add a lot of things on that map,
> contour(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat, add = TRUE)
[Map: density contours (levels from 2e-06 to 1.4e-05) added on top of the previous plot]
@freakonometrics
194
http://www.ub.edu/riskcenter
OpenStreet Map
There are alternative packages related to OpenStreetMap. See for instance
> library(leafletR)
> devtools::install_github("rstudio/leaflet")
(see http://rstudio.github.io/leaflet/ for more information), or
> library(osmar)
> src <- osmsource_api()
> loc.london <- c(-0.137, 51.513)                             # assumed: longitude/latitude of Soho
> bb <- center_bbox(loc.london[1], loc.london[2], 800, 800)   # assumed bounding box, in metres
> ua <- get_osm(bb, source = src)
> bg_ids <- find(ua, way(tags(k == "building")))              # assumed tag for the buildings
> bg_ids <- find_down(ua, way(bg_ids))
> bg <- subset(ua, ids = bg_ids)
> bg_poly = as_sp(bg, "polygons")
> plot(bg_poly, col = gray.colors(12)[11], border = "gray")
@freakonometrics
196
http://www.ub.edu/riskcenter
OpenStreet Map
We can visualize waterways and leisure areas,
> nat_ids = find(ua, way(tags(k == "waterway")))
> nat_ids = find_down(ua, way(nat_ids))
> nat = subset(ua, ids = nat_ids)
> nat_poly = as_sp(nat, "polygons")
> nat_ids = find(ua, way(tags(k == "leisure")))
> nat_ids = find_down(ua, way(nat_ids))
> nat = subset(ua, ids = nat_ids)
> nat_poly = as_sp(nat, "polygons")
> plot(nat_poly, col = "#99dd99", add = TRUE, border = "#99dd99")
(here we consider waterway and leisure tags)
@freakonometrics
197
http://www.ub.edu/riskcenter
OpenStreet Map
and finally, add the deaths,
> points(df_deaths@coords, col = "red", pch = 19)
and some heat map,
> X <- df_deaths@coords
> library(KernSmooth)
> kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))   # bandwidths assumed, as above
> image(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat * 1000, add = TRUE, col = clrs)
> contour(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat, add = TRUE)
[Map: buildings, waterways and leisure areas with the deaths and the density contours overlaid]
@freakonometrics
198
http://www.ub.edu/riskcenter
The analogous Google Map plot
> library(ggmap)
> get_london <- get_map(c(-.137, 51.513), zoom = 16)   # assumed centre and zoom
> london <- ggmap(get_london)
> london
Let us add points,
> df_deaths <- data.frame(deaths@coords)
> library(sp)
> library(rgdal)
> coordinates(df_deaths) = ~ coords.x1 + coords.x2
> proj4string(df_deaths) = CRS("+init=epsg:27700")
> df_deaths = spTransform(df_deaths, CRS("+proj=longlat +datum=WGS84"))
> london + geom_point(aes(x = coords.x1, y = coords.x2), data = data.frame(df_deaths@coords), col = "red")
[Map: Google Maps background of Soho (lon ≈ -0.140 to -0.134, lat ≈ 51.511 to 51.515) with the deaths as red points]
and we can add some heat map, too,
@freakonometrics
200
http://www.ub.edu/riskcenter
> london + geom_point(aes(x = coords.x1, y = coords.x2),
+     data = data.frame(df_deaths@coords), col = "red") +
+   geom_density2d(data = data.frame(df_deaths@coords),
+     aes(x = coords.x1, y = coords.x2), size = 0.3) +
+   stat_density2d(data = data.frame(df_deaths@coords),
+     aes(x = coords.x1, y = coords.x2, fill = ..level.., alpha = ..level..),
+     size = 0.01, bins = 16, geom = "polygon") +
+   scale_fill_gradient(low = "green", high = "red", guide = FALSE) +
+   scale_alpha(range = c(0, 0.3), guide = FALSE)
[Map: the same Google Maps background with a density heat map of the deaths]
@freakonometrics
201
http://www.ub.edu/riskcenter
More Interactive Maps
As discussed previously, one can use RStudio and library(leaflet), see rpubs.com/freakonometrics
> devtools::install_github("rstudio/leaflet")
> require(leaflet)
> setwd("/cholera/")
> deaths <- readShapePoints("Cholera_Deaths")          # file name assumed
> df_deaths <- data.frame(deaths@coords)
> coordinates(df_deaths) = ~ coords.x1 + coords.x2
> proj4string(df_deaths) = CRS("+init=epsg:27700")
> df_deaths = spTransform(df_deaths, CRS("+proj=longlat +datum=WGS84"))
> df = data.frame(df_deaths@coords)
> lng = df$coords.x1
> lat = df$coords.x2
> m = leaflet() %>% addTiles()
> m %>% fitBounds(-.141, 51.511, -.133, 51.516)
@freakonometrics
202
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
203
http://www.ub.edu/riskcenter
More Interactive Maps
One can add points,
> rd = .5
> op = .8
> clr = "blue"
> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr)
@freakonometrics
204
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
205
http://www.ub.edu/riskcenter
More Interactive Maps
We can also add some heatmap.
> X = cbind(lng, lat)
> kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))   # bandwidths assumed, as above
> x = kde2d$x1
> y = kde2d$x2
> z = kde2d$fhat
> CL = contourLines(x, y, z)
We now have a list that contains the polygons corresponding to isodensity curves. To visualise one of them, use
> m = leaflet() %>% addTiles()
> m %>% addPolygons(CL[[5]]$x, CL[[5]]$y, fillColor = "red", stroke = FALSE)
@freakonometrics
206
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
207
http://www.ub.edu/riskcenter
More Interactive Maps
We can get at the same time the points and the polygons,
> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr) %>%
+   addPolygons(CL[[5]]$x, CL[[5]]$y, fillColor = "red", stroke = FALSE)
> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr) %>%
+   addPolygons(CL[[1]]$x, CL[[1]]$y, fillColor = "red", stroke = FALSE) %>%
+   addPolygons(CL[[3]]$x, CL[[3]]$y, fillColor = "red", stroke = FALSE) %>%
+   addPolygons(CL[[5]]$x, CL[[5]]$y, fillColor = "red", stroke = FALSE) %>%
+   addPolygons(CL[[7]]$x, CL[[7]]$y, fillColor = "red", stroke = FALSE) %>%
+   addPolygons(CL[[9]]$x, CL[[9]]$y, fillColor = "red", stroke = FALSE)
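Since CL is just a list, the same map can be built with a loop over all isodensity levels; a compact variant, sketched here rather than taken from the slides:
> m = leaflet() %>% addTiles() %>%
+   addCircles(lng, lat, radius = rd, opacity = op, col = clr)
> for (i in seq_along(CL))
+   m = m %>% addPolygons(CL[[i]]$x, CL[[i]]$y, fillColor = "red", stroke = FALSE)
> m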
@freakonometrics
208
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
209
http://www.ub.edu/riskcenter
More Interactive Maps
> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr) %>%
+   addPolylines(CL[[1]]$x, CL[[1]]$y, color = "red") %>%
+   addPolylines(CL[[5]]$x, CL[[5]]$y, color = "red") %>%
+   addPolylines(CL[[8]]$x, CL[[8]]$y, color = "red")
@freakonometrics
210
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
211
http://www.ub.edu/riskcenter
More Interactive Maps
Another package can be considered,
> require(rleafmap)
> library(sp)
> library(rgdal)
> library(maptools)
> library(KernSmooth)
> setwd("/home/arthur/Documents/")
> deaths <- readShapePoints("Cholera_Deaths")          # file name assumed
> df_deaths <- data.frame(deaths@coords)
> coordinates(df_deaths) = ~ coords.x1 + coords.x2
> proj4string(df_deaths) = CRS("+init=epsg:27700")
> df_deaths = spTransform(df_deaths, CRS("+proj=longlat +datum=WGS84"))
> df = data.frame(df_deaths@coords)
> stamen_bm <- basemap("stamen.toner")                 # assumed basemap tiles
> j_snow <- spLayer(df_deaths, stroke = FALSE)         # assumed point layer
> writeMap(stamen_bm, j_snow, width = 1000, height = 750, setView = c(mean(df[,1]), mean(df[,2])), setZoom = 14)
> writeMap(stamen_bm, j_snow, width = 1000, height = 750, setView = c(mean(df[,1]), mean(df[,2])), setZoom = 16)
@freakonometrics
213
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
214
http://www.ub.edu/riskcenter
More Interactive Maps
> library(spatstat)
> library(maptools)
> win <- owin(...)                                      # observation window (arguments not shown)
> df_deaths_ppp <- ppp(coordinates(df_deaths)[,1], coordinates(df_deaths)[,2], window = win)
> df_deaths_ppp_d <- density.ppp(df_deaths_ppp, ...)    # kernel density of the point pattern (bandwidth not shown)
> df_deaths_d <- as.SpatialGridDataFrame.im(df_deaths_ppp_d)
> df_deaths_d$v[df_deaths_d$v < 10^3] <- NA             # mask the low-density cells
> stamen_bm <- basemap("stamen.toner")
> mapquest_bm <- basemap("mapquest.map")
> j_snow <- spLayer(df_deaths, stroke = FALSE)
> df_deaths_den <- spLayer(df_deaths_d, layer = "v")    # density layer (options not shown)
> my_ui <- ui(layers = "topright")
> writeMap(stamen_bm, mapquest_bm, j_snow, df_deaths_den, width = 1000, height = 750,
+          interface = my_ui, setView = c(mean(df[,1]), mean(df[,2])), setZoom = 16)
@freakonometrics
216
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
217
http://www.ub.edu/riskcenter
Visualizing a Spatial Process
Consider car/bike accidents in Paris, see data.gouv.fr for bodily-injury road accidents in France (2006-2011, BAAC dataset) or opendata.paris.fr for accidents in Paris only.
> caraccident <- ...                                    # accident dataset (call not shown)
> geo_loc <- ...                                        # geocoded locations of the accidents (not shown)
> mat_geo_loc <- ...                                    # matrix of longitude/latitude coordinates (not shown)
> save(mat_geo_loc, file = "mat_geo_loc.RData")
@freakonometrics
218
http://www.ub.edu/riskcenter
Visualizing a Spatial Process
Keep only accidents located in Paris,
> x <- mat_geo_loc                                      # (construction assumed)
> idx <- which((x[,1] > 2.25) & (x[,1] < 2.4) & (x[,2] > 48.83) & (x[,2] < 48.9))   # bounds taken from the plot limits
> x <- x[idx, ]
> library(ks)
> fhat <- kde(x)                                        # kernel density estimate (bandwidth choice not shown)
> image(fhat$eval.points[[1]], fhat$eval.points[[2]], fhat$estimate, col = rev(heat.colors(100)),
+       xlab = "", ylab = "", xlim = c(2.25, 2.4), ylim = c(48.83, 48.9), axes = FALSE)
> plot(paris, add = TRUE, border = "grey")
> points(x, pch = 19, cex = .3)
@freakonometrics
219
http://www.ub.edu/riskcenter
Visualizing a Spatial Process
[Map: kernel density of the accident locations in Paris, with the district boundaries and the individual accidents as points]
@freakonometrics
220
http://www.ub.edu/riskcenter
Visualizing Hurricane Paths
The National Hurricane Center (NHC) collects datasets with all storms in the North Atlantic, the North Atlantic Hurricane Database (HURDAT, weather.unisys.com). For all storms, we have the location of the storm every six hours (at midnight, six a.m., noon and six p.m.), the maximal wind speed (over a 6-hour window) and the pressure in the eye of the storm. E.g. for 2012, http://weather.unisys.com/hurricane/atlantic/2012/index.php
@freakonometrics
221
http://www.ub.edu/riskcenter
http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat
Date: 21-31 OCT 2012 Hurricane-3 SANDY
ADV    LAT     LON     TIME       WIND    PR  STAT
  1  14.30  -77.40  10/21/18Z       25  1006  LOW
  2  13.90  -77.80  10/22/00Z       25  1005  LOW
  3  13.50  -78.20  10/22/06Z       25  1003  LOW
  4  13.10  -78.60  10/22/12Z       30  1002  TROPICAL DEPRESSION
  5  12.70  -78.70  10/22/18Z       35  1000  TROPICAL STORM
  6  12.60  -78.40  10/23/00Z       40   998  TROPICAL STORM
  7  12.90  -78.10  10/23/06Z       40   998  TROPICAL STORM
  8  13.40  -77.90  10/23/12Z       40   995  TROPICAL STORM
  9  14.00  -77.60  10/23/18Z       45   993  TROPICAL STORM
 10  14.70  -77.30  10/24/00Z       55   990  TROPICAL STORM
 11  15.60  -77.10  10/24/06Z       60   987  TROPICAL STORM
 12  16.60  -76.90  10/24/12Z       65   981  HURRICANE-1
@freakonometrics
222
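A minimal sketch of how such a track file could be read into R (the layout assumptions, six leading fields then a free-text status, come from the listing above; this is not the code used in the deck):
> read_track <- function(url) {
+   lines  <- readLines(url)
+   lines  <- lines[grepl("^ *[0-9]+ ", lines)]        # keep only the data rows (start with the ADV number)
+   fields <- strsplit(trimws(lines), " +")
+   data.frame(
+     ADV  = sapply(fields, `[`, 1),
+     LAT  = as.numeric(sapply(fields, `[`, 2)),
+     LON  = as.numeric(sapply(fields, `[`, 3)),
+     TIME = sapply(fields, `[`, 4),
+     WIND = as.numeric(sapply(fields, `[`, 5)),
+     PR   = as.numeric(sapply(fields, `[`, 6)),
+     STAT = sapply(fields, function(f) paste(f[-(1:6)], collapse = " ")),
+     stringsAsFactors = FALSE)
+ }
> # e.g. track <- read_track("http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat")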
http://www.ub.edu/riskcenter
Data Scraping: Hurricane Dataset
1. get the names of all hurricanes for a given year,
> library(XML)
> year <- 2012
> loc <- paste("http://weather.unisys.com/hurricane/atlantic/", year, "/index.php", sep = "")   # URL construction assumed
> tabs <- readHTMLTable(htmlParse(loc))
> tabs[1]
$`NULL`
  #                    Name           Date Wind Pres Cat
1 1  Tropical Storm ALBERTO      19-23 MAY   50  995   -
2 2    Tropical Storm BERYL 25 MAY - 2 JUN   60  992   -
3 3       Hurricane-1 CHRIS      17-24 JUN   75  974   1
4 4    Tropical Storm DEBBY      23-27 JUN   55  990   -
5 5     Hurricane-2 ERNESTO       1-10 AUG   85  973   2
6 6 Tropical Storm FLORENCE        3-8 AUG   50 1002   -
7 7   Tropical Storm HELENE       9-19 AUG   40 1004   -
...
223
http://www.ub.edu/riskcenter
Data Scraping: Hurricane Dataset We split the Name variable to extract the name 15
> storms storms
17
[1] " Tropical "
" Storm "
" ALBERTO "
18
[6] " BERYL "
" Hurricane -1 " " CHRIS "
" Tropical "
" Storm "
" Tropical "
" Storm "
But we keep only relevant information 19
> index nstorms
21
> nstorms
for ( i in length ( nstorms ) :1) {
26
if (( nstorms [ i ]== " SIXTEE " ) & ( year ==2008) ) nstorms [ i ] gridx gridy = gridy [ j ]) & ( TOTTRACK $ LAT < gridy [ j +1]) )
Then we look for all possible next moves (i.e. 6 hours later), if any,
> for (s in 1:length(idx)) {
+   locx <- ...                                        # longitude of the same storm 6 hours later (not shown)
+   locy <- ...                                        # latitude of the same storm 6 hours later (not shown)
+   ... }
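A minimal sketch of that idea (the column names of TOTTRACK are assumptions, and rows are supposed ordered in time within each storm):
> next_moves <- function(lon_range, lat_range) {
+   idx <- which((TOTTRACK$LON >= lon_range[1]) & (TOTTRACK$LON < lon_range[2]) &
+                (TOTTRACK$LAT >= lat_range[1]) & (TOTTRACK$LAT < lat_range[2]))
+   moves <- NULL
+   for (s in idx) {
+     if (s < nrow(TOTTRACK) && TOTTRACK$name[s + 1] == TOTTRACK$name[s]) {
+       moves <- rbind(moves, c(locx = TOTTRACK$LON[s + 1], locy = TOTTRACK$LAT[s + 1]))
+     }
+   }
+   moves     # empirical distribution of the positions 6 hours later
+ }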
Gas Price, in France
> rm(list = ls())
> year = 2014
> loc = paste("http://donnees.roulez-eco.fr/opendata/annee/", year, sep = "")
> download.file(loc, destfile = "oil.zip")
Content type 'application/zip' length 15248088 bytes (14.5 MB)
> unzip("oil.zip", exdir = "./")
> fichier = paste("PrixCarburants_annuel_", year, ".xml", sep = "")
> library(plyr)
> library(XML)
> library(lubridate)
> l = xmlToList(fichier)
> length(l)
[1] 11064
@freakonometrics
234
http://www.ub.edu/riskcenter
Gas Price, in France
To extract information for gas station no = 2 and Gazole, use
> prix = list()
> date = list()
> nom = list()
> j = 0; no = 2
> for (i in 1:length(l[[no]])) {
+   v = names(l[[no]])
+   if (!is.null(v[i])) {
+     if (v[i] == "prix") {
+       j = j + 1
+       date[[j]] = as.character(l[[no]][[i]]["maj"])
+       prix[[j]] = as.character(l[[no]][[i]]["valeur"])
+       nom[[j]] = as.character(l[[no]][[i]]["nom"])
+ }}}
> id = which(unlist(nom) == type_gas)          # type_gas, e.g. "Gazole" (definition not shown)
@freakonometrics
235
http://www.ub.edu/riskcenter
Gas Price, in France
> ext_y  = function(j) substr(date[[id[j]]], 1, 4)
> ext_m  = function(j) substr(date[[id[j]]], 6, 7)
> ext_d  = function(j) substr(date[[id[j]]], 9, 10)
> ext_h  = function(j) substr(date[[id[j]]], 12, 13)
> ext_mn = function(j) substr(date[[id[j]]], 15, 16)
> prix_essence = function(i) as.numeric(prix[[id[i]]]) / 1000
> Y  = unlist(lapply(1:n, ext_y))
> M  = unlist(lapply(1:n, ext_m))
> D  = unlist(lapply(1:n, ext_d))
> H  = unlist(lapply(1:n, ext_h))
> MN = unlist(lapply(1:n, ext_mn))
> date = paste(base1$Y, "-", base1$M, "-", base1$D,
+              " ", base1$H, ":", base1$MN, ":00", sep = "")
> date_base = as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
@freakonometrics
236
http://www.ub.edu/riskcenter
Gas Price, in France
> d = paste(year, "-01-01 12:00:00", sep = "")
> f = paste(year, "-12-31 12:00:00", sep = "")
> vecteur_date = seq(as.POSIXct(d, format = "%Y-%m-%d %H:%M:%S"),
+                    as.POSIXct(f, format = "%Y-%m-%d %H:%M:%S"), by = "days")
> vect_idx = Vectorize(function(t) sum(vecteur_date[t] >= date_base))(1:length(vecteur_date))
> prix_essence = function(i) as.numeric(prix[[id[i]]]) / 1000
> P = c(NA, unlist(lapply(1:n, prix_essence)))
> Z = ts(P[1 + vect_idx], start = year, frequency = 365)
@freakonometrics
237
http://www.ub.edu/riskcenter
Gas Price, in France
> dt = as.Date("2014-05-05")
> base = NULL
> for (no in 1:length(l)) {
+   prix = list()
+   date = list()
+   j = 0
+   for (i in 1:length(l[[no]])) {
+     v = names(l[[no]])
+     if (!is.null(v[i])) {
+       if (v[i] == "prix") {
+         j = j + 1
+         date[[j]] = as.character(l[[no]][[i]]["maj"])
+   }}}
+   n = j
+   D = as.Date(substr(unlist(date), 1, 10), "%Y-%m-%d")
+   k = which(D == D[which.max(D[D <= dt])])
+   if (length(k) > 0) {
+     B = Vectorize(function(i) l[[no]][[k[i]]])(1:length(k))
+     if ("nom" %in% rownames(B)) {
+       k = which(B["nom", ] == "Gazole")
+       prix = as.numeric(B["valeur", k]) / 1000
+       if (length(prix) == 0) prix = NA
+       base1 = data.frame(indice = no,
+         lat = as.numeric(l[[no]]$.attrs["latitude"]) / 100000,
+         lon = as.numeric(l[[no]]$.attrs["longitude"]) / 100000,
+         gaz = prix)
+       base = rbind(base, base1)
+ }}}
@freakonometrics
239
http://www.ub.edu/riskcenter
Gas Price, in France
> idx = which((base$lon > (-10)) & (base$lon < 20) & (base$lat > 35) & (base$lat < 55))   # upper bounds assumed
> B = base[idx, ]
> Q = quantile(B$gaz, seq(0, 1, by = .01), na.rm = TRUE)
> Q[1] = 0
> x = as.numeric(cut(B$gaz, breaks = unique(Q)))
> CL = c(rgb(0, 0, 1, seq(1, 0, by = -.025)),
+        rgb(1, 0, 0, seq(0, 1, by = .025)))
> plot(B$lon, B$lat, pch = 19, col = CL[x])
> library(maps)
> map("france")
> points(B$lon, B$lat, pch = 19, col = CL[x])
@freakonometrics
240
http://www.ub.edu/riskcenter
Gas Price, in France
> library(OpenStreetMap)
> map <- openmap(c(lat = 52, lon = -6), c(lat = 41, lon = 10))   # corner coordinates assumed
> map <- openproj(map)
> plot(map)
> points(B$lon, B$lat, pch = 19, col = CL[x])
> library(tripack)
> V <- voronoi.mosaic(B$lon, B$lat)                              # Voronoi tessellation of the stations (call assumed)
> plot(V, add = TRUE)
@freakonometrics
241
http://www.ub.edu/riskcenter
Gas Price, in France
> plot(map)
> P <- voronoi.polygons(V)                             # list of polygons of the tessellation (call assumed)
> library(sp)
> point_in_i = function(i, point) point.in.polygon(point[1], point[2], P[[i]][,1], P[[i]][,2])
> which_point = function(i) which(Vectorize(function(j) point_in_i(i, c(dB$lon[id[j]], dB$lat[id[j]])))(1:length(id)) > 0)
> for (i in 1:length(P)) polygon(P[[i]], col = CL[x[id[which_point(i)]]], border = NA)
@freakonometrics
242