http://www.ub.edu/riskcenter
Machine Learning & Data Science for Actuaries, with R Arthur Charpentier (Université de Rennes 1 & UQàM)
Universitat de Barcelona, April 2016. http://freakonometrics.hypotheses.org
@freakonometrics
Professor, Economics Department, Univ. Rennes 1
In charge of the Data Science for Actuaries program, IA Research Chair actinfo (Institut Louis Bachelier)
(previously Actuarial Sciences at UQàM & ENSAE Paristech, actuary in Hong Kong, IT & Stats FFSA)
PhD in Statistics (KU Leuven), Fellow of the Institute of Actuaries
MSc in Financial Mathematics (Paris Dauphine) & ENSAE
Editor of the freakonometrics.hypotheses.org blog
Editor of Computational Actuarial Science, CRC
@freakonometrics
Agenda
0. Introduction (see slides)
1. Classification, y ∈ {0, 1}
2. Regression Models, y ∈ R
3. Model Choice, Feature Selection, etc.
4. Data Visualisation & Maps
Part 1. Classification, y ∈ {0, 1}
Classification?

Examples: fraud detection, automatic reading (classifying handwritten symbols), face recognition, accident occurrence, death, purchase of optional insurance cover, etc.

Here yᵢ ∈ {0, 1}, or yᵢ ∈ {−1, +1}, or yᵢ ∈ {•, •}.

We look for a (good) predictive model. There will be two steps:
• the score function, s(x) = P(Y = 1 | X = x) ∈ [0, 1]
• the classification function s(x) → Ŷ ∈ {0, 1}.

[Figure: scatter of labelled points in the unit square]
Modeling a 0/1 random variable
Myocardial infarction of patients admitted to the E.R.:
◦ heart rate (FRCAR)
◦ heart index (INCAR)
◦ stroke index (INSYS)
◦ diastolic pressure (PRDIA)
◦ pulmonary arterial pressure (PAPUL)
◦ ventricular pressure (PVENT)
◦ lung resistance (REPUL)
◦ death or survival (PRONO)

> myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
Logistic Regression

Assume that P(Yᵢ = 1) = πᵢ, with logit(πᵢ) = xᵢᵀβ, where logit(πᵢ) = log[πᵢ / (1 − πᵢ)], or equivalently

    πᵢ = logit⁻¹(xᵢᵀβ) = exp[xᵢᵀβ] / (1 + exp[xᵢᵀβ]).

The log-likelihood is

    log L(β) = Σᵢ yᵢ log(πᵢ) + (1 − yᵢ) log(1 − πᵢ) = Σᵢ yᵢ log(πᵢ(β)) + (1 − yᵢ) log(1 − πᵢ(β))

and the first-order conditions, solved numerically, are

    ∂ log L(β)/∂βₖ = Σᵢ xₖ,ᵢ [yᵢ − πᵢ(β)] = 0.
Logistic Regression, Output (with R)

> logistic <- glm(PRONO ~ ., data = myocarde, family = binomial)
> summary(logistic)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) 10.187642  11.895227   0.856    0.392
FRCAR        0.138178   0.114112   1.211    0.226
INCAR        5.862429   6.748785   0.869    0.385
INSYS        0.717084   0.561445   1.277    0.202
PRDIA        0.073668   0.291636   0.253    0.801
PAPUL        0.016757   0.341942   0.049    0.961
PVENT        0.106776   0.110550   0.966    0.334
REPUL        0.003154   0.004891   0.645    0.519

(Dispersion parameter for binomial family taken to be 1)

Number of Fisher Scoring iterations: 7
Logistic Regression, Output (with R)

> library(VGAM)
> mlogistic <- vglm(PRONO ~ ., multinomial, data = myocarde)
> summary(mlogistic)

Coefficients:
              Estimate Std. Error  z value
(Intercept) 10.1876411 11.8941581 0.856525
FRCAR        0.1381781  0.1141056 1.210967
INCAR        5.8624289  6.7484319 0.868710
INSYS        0.7170840  0.5613961 1.277323
PRDIA        0.0736682  0.2916276 0.252610
PAPUL        0.0167565  0.3419255 0.049006
PVENT        0.1067760  0.1105456 0.965901
REPUL        0.0031542  0.0048907 0.644939

Name of linear predictor: log(mu[,1]/mu[,2])
Logistic (Multinomial) Regression

In the Bernoulli case, y ∈ {0, 1},

    P(Y = 1) = p₁ / (p₀ + p₁) = exp[xᵀβ] / (1 + exp[xᵀβ]) ∝ p₁   and   P(Y = 0) = p₀ / (p₀ + p₁) = 1 / (1 + exp[xᵀβ]) ∝ p₀.

In the multinomial case, y ∈ {A, B, C},

    P(Y = A) = exp[xᵀβ_A] / (exp[xᵀβ_A] + exp[xᵀβ_B] + 1) ∝ p_A, i.e. P(Y = A) = p_A / (p_A + p_B + p_C)
    P(Y = B) = exp[xᵀβ_B] / (exp[xᵀβ_A] + exp[xᵀβ_B] + 1) ∝ p_B, i.e. P(Y = B) = p_B / (p_A + p_B + p_C)
    P(Y = C) = 1 / (exp[xᵀβ_A] + exp[xᵀβ_B] + 1) ∝ p_C, i.e. P(Y = C) = p_C / (p_A + p_B + p_C).
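A minimal sketch of a genuinely three-class multinomial fit, using nnet::multinom on the iris data purely as an illustration (not from the original slides):

> library(nnet)
> fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
> head(predict(fit, type = "probs"))   # one column of probabilities per class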
Logistic Regression, Numerical Issues

The algorithm to compute β̂ is
1. start with some initial value β₀
2. define βₖ = βₖ₋₁ − H(βₖ₋₁)⁻¹ ∇ log L(βₖ₋₁)
where ∇ log L(β) is the gradient and H(β) the Hessian matrix (whose opposite is Fisher's information). The generic term of the Hessian is

    ∂² log L(β) / ∂βₖ∂βₗ = −Σᵢ Xₖ,ᵢ Xₗ,ᵢ πᵢ(β)[1 − πᵢ(β)].

Define Ω = [ωᵢ,ⱼ] = diag(π̂ᵢ(1 − π̂ᵢ)), so that the gradient can be written

    ∇ log L(β) = ∂ log L(β)/∂β = Xᵀ(y − π)

and the Hessian

    H(β) = ∂² log L(β)/∂β∂βᵀ = −XᵀΩX.

The update step is then βₖ = (XᵀΩX)⁻¹XᵀΩz, where z = Xβₖ₋₁ + Ω⁻¹(y − π).

From maximum likelihood properties, if β̂ = limₖ→∞ βₖ,

    √n(β̂ − β) →ᴸ N(0, I(β)⁻¹).

From a numerical point of view, this asymptotic variance I(β)⁻¹ can be estimated by −H(β̂)⁻¹.
Logistic Regression, Numerical Issues

> X = cbind(1, as.matrix(myocarde[, 1:7]))
> Y = myocarde$PRONO == "Survival"
> beta = as.matrix(lm(Y ~ 0 + X)$coefficients, ncol = 1)
> for (s in 1:9) {
+   pi = exp(X %*% beta[, s]) / (1 + exp(X %*% beta[, s]))
+   gradient = t(X) %*% (Y - pi)
+   omega = matrix(0, nrow(X), nrow(X)); diag(omega) = (pi * (1 - pi))
+   Hessian = -t(X) %*% omega %*% X
+   beta = cbind(beta, beta[, s] - solve(Hessian) %*% gradient)
+ }
> beta
> -solve(Hessian)
> sqrt(-diag(solve(Hessian)))
Predicted Probability

Let m(x) = E(Y | X = x). With a logistic regression, we can get a prediction

    m̂(x) = exp[xᵀβ̂] / (1 + exp[xᵀβ̂])

> predict(logistic, type = "response")[1:5]
        1         2         3         4         5
0.6013894 0.1693769 0.3289560 0.8817594 0.1424219
> predict(mlogistic, type = "response")[1:5, ]
      Death  Survival
1 0.3986106 0.6013894
2 0.8306231 0.1693769
3 0.6710440 0.3289560
4 0.1182406 0.8817594
5 0.8575781 0.1424219
Predicted Probability

    m̂(x) = exp[xᵀβ̂] / (1 + exp[xᵀβ̂]) = exp[β̂₀ + β̂₁x₁ + · · · + β̂ₖxₖ] / (1 + exp[β̂₀ + β̂₁x₁ + · · · + β̂ₖxₖ])

use

> predict(fit_glm, newdata = data, type = "response")

e.g.

> GLM <- glm(PRONO ~ PVENT + REPUL, data = myocarde, family = binomial)
> pred_GLM = function(p, r) {
+   return(predict(GLM, newdata = data.frame(PVENT = p, REPUL = r), type = "response"))
+ }

[Figure: predicted probability as a function of PVENT and REPUL]
Predictive Classifier

To go from a score to a class: if s(x) > s, then Ŷ(x) = 1, and if s(x) ≤ s, then Ŷ(x) = 0.
Plot TP(s) = P[Ŷ = 1 | Y = 1] against FP(s) = P[Ŷ = 1 | Y = 0].
Predictive Classifier

With a threshold (e.g. s = 50%) and the predicted probabilities, one can get a classifier and the confusion matrix

> probabilities <- predict(logistic, type = "response")
> predictions <- levels(myocarde$PRONO)[(probabilities > .5) + 1]
> table(predictions, myocarde$PRONO)

predictions Death Survival
   Death       25        3
   Survival     4       39
Visualization of a Classifier in Higher Dimension...

[Figure: projection of the data and of the Death/Survival regions on the first two principal components, Dim 1 (54.26%) and Dim 2 (18.64%)]

Point z = (z₁, z₂, 0, · · · , 0) −→ x = (x₁, x₂, · · · , xₖ).
... but be careful about interpretation!

> prediction = predict(logistic, type = "response")

Use a 25% probability threshold

> table(prediction > .25, myocarde$PRONO)
        Death Survival
  FALSE    19        2
  TRUE     10       40

or a 75% probability threshold

> table(prediction > .75, myocarde$PRONO)
        Death Survival
  FALSE    27        9
  TRUE      2       33
Why a Logistic and not a Probit Regression?

Bliss (1934) suggested a model such that P(Y = 1 | X = x) = H(xᵀβ) where H(·) = Φ(·), the c.d.f. of the N(0, 1) distribution. This is the probit model. It corresponds to a latent model, yᵢ = 1(yᵢ* > 0), where yᵢ* = xᵢᵀβ + εᵢ is an unobservable score.

In the logistic regression, we model the odds ratio,

    P(Y = 1 | X = x) / P(Y ≠ 1 | X = x) = exp[xᵀβ]

so that P(Y = 1 | X = x) = H(xᵀβ) with H(·) = exp[·] / (1 + exp[·]),

which is the c.d.f. of the logistic variable, see Verhulst (1845).
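As a quick sanity check that the two links give very similar fitted probabilities in practice, both can be fitted on the myocarde data used above (an illustrative sketch, not from the original slides):

> fit_logit  <- glm(PRONO ~ ., data = myocarde, family = binomial(link = "logit"))
> fit_probit <- glm(PRONO ~ ., data = myocarde, family = binomial(link = "probit"))
> cor(predict(fit_logit, type = "response"), predict(fit_probit, type = "response"))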
k-Nearest Neighbors (a.k.a. kNN)

In pattern recognition, the k-Nearest Neighbors algorithm (or kNN for short) is a non-parametric method used for classification and regression. (Source: wikipedia.)

    E[Y | X = x] ∼ (1/k) Σ yᵢ, the sum being over the k observations xᵢ closest to x.

For k-Nearest Neighbors, the class is usually the majority vote of the k closest neighbors of x.

> library(caret)
> KNN <- knn3(PRONO ~ PVENT + REPUL, data = myocarde, k = 10)   # the original call is truncated; k = 10 is illustrative
> pred_KNN = function(p, r) {
+   return(predict(KNN, newdata = data.frame(PVENT = p, REPUL = r), type = "prob")[, 2])
+ }

[Figure: kNN classification regions in the (PVENT, REPUL) plane]
k-Nearest Neighbors

The distance ∥ · ∥ should not be sensitive to units: normalize each covariate by its standard deviation before computing distances.

Classification (and Regression) Trees, CART

> library(rpart)
> cart <- rpart(PRONO ~ ., data = myocarde)
> library(rpart.plot)
> library(rattle)
> prp(cart, type = 2, extra = 1)

or

> fancyRpartPlot(cart, sub = "")
Classification (and Regression) Trees, CART

The impurity is a function ϕ of the probability to have 1 at node N, i.e. P[Y = 1 | node N], and I(N) = ϕ(P[Y = 1 | node N]).
ϕ is nonnegative (ϕ ≥ 0), symmetric (ϕ(p) = ϕ(1 − p)), with a minimum in 0 and 1 (ϕ(0) = ϕ(1) < ϕ(p)), e.g.
• Bayes error: ϕ(p) = min{p, 1 − p}
• cross-entropy: ϕ(p) = −p log(p) − (1 − p) log(1 − p)
• Gini index: ϕ(p) = p(1 − p)
Those functions are concave, with a minimum at p = 0 and 1, and a maximum at p = 1/2.
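A small illustration of these three impurity functions (a sketch, not from the original slides):

> bayes_err <- function(p) pmin(p, 1 - p)
> entropy   <- function(p) ifelse(p %in% c(0, 1), 0, -p * log(p) - (1 - p) * log(1 - p))
> gini      <- function(p) p * (1 - p)
> p <- seq(0, 1, by = .01)
> plot(p, entropy(p), type = "l")
> lines(p, gini(p), lty = 2); lines(p, bayes_err(p), lty = 3)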
Classification (and Regression) Trees, CART

To split N into two nodes {N_L, N_R}, consider

    I(N_L, N_R) = Σ_{x∈{L,R}} (n_x / n) I(N_x)

e.g. the Gini index (used originally in CART, see Breiman et al. (1984))

    gini(N_L, N_R) = − Σ_{x∈{L,R}} (n_x / n) Σ_{y∈{0,1}} (n_{x,y} / n_x)(1 − n_{x,y} / n_x)

and the cross-entropy (used in C4.5 and C5.0)

    entropy(N_L, N_R) = − Σ_{x∈{L,R}} (n_x / n) Σ_{y∈{0,1}} (n_{x,y} / n_x) log(n_{x,y} / n_x)
Classification (and Regression) Trees, CART

Given a variable j and a threshold s, define N_L : {x_{i,j} ≤ s} and N_R : {x_{i,j} > s}, and solve

    max_{j∈{1,··· ,k}, s} {I(N_L, N_R)}

[Figure: impurity gain as a function of the threshold s, for each covariate (INCAR, INSYS, PRDIA, PAPUL, PVENT, REPUL), at the first and at the second split]
Pruning Trees

One can grow a big tree, until leaves have a (preset) small number of observations, and then go back and prune branches (or leaves) that do not sufficiently improve the classification. Or we can decide, at each node, whether we split or not.
Pruning Trees

In trees, overfitting increases with the number of steps, and leaves. The drop in impurity at node N is defined as

    ΔI(N_L, N_R) = I(N) − I(N_L, N_R) = I(N) − (n_L/n) I(N_L) − (n_R/n) I(N_R)

> library(rpart)
> CART <- rpart(PRONO ~ PVENT + REPUL, data = myocarde)
> pred_CART = function(p, r) {
+   return(predict(CART, newdata = data.frame(PVENT = p, REPUL = r))[, "Survival"])
+ }

[Figure: CART classification regions in the (PVENT, REPUL) plane]

−→ we cut if ΔI(N_L, N_R)/I(N) (relative gain) exceeds cp (complexity parameter, default 1%).
Pruning Trees

[Figure: CART classification regions in the (PVENT, REPUL) plane, for a more deeply grown tree]

See also

> library(mvpart)
> ?prune

Define the misclassification rate of a tree, R(tree).
Pruning Trees

Given a cost-complexity parameter cp (see the tuning parameter in Ridge-Lasso), define a penalized risk

    R_cp(tree) = R(tree) + cp · ‖tree‖,   i.e. loss + complexity.

If cp is small the optimal tree is large; if cp is large the optimal tree has no leaf, see Breiman et al. (1984).

> cart <- rpart(PRONO ~ ., data = myocarde)
> plotcp(cart)
> prune(cart, cp = 0.06)

[Figure: cross-validated relative error as a function of cp and tree size]
Bagging

Bootstrapped Aggregation (Bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification (Source: wikipedia).

It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [bootstrap]. The predictions from each separate model are combined together to provide a superior result [aggregation].

→ it can be used on any kind of model, but it is particularly interesting for trees, see Breiman (1996).

The bootstrap can be used to define the concept of margin,

    marginᵢ = (1/B) Σ_{b=1}^B 1(ŷᵢ = yᵢ) − (1/B) Σ_{b=1}^B 1(ŷᵢ ≠ yᵢ)

Remark: the probability that the i-th observation is not selected in a bootstrap sample is (1 − n⁻¹)ⁿ → e⁻¹ ∼ 36.8%, cf. training/validation samples (2/3–1/3).
Bagging Trees

> n <- nrow(myocarde)
> margin <- matrix(NA, 1e4, n)
> for (b in 1:1e4) {
+   idx = sample(1:n, size = n, replace = TRUE)
+   cart <- rpart(PRONO ~ ., data = myocarde[idx, ])
+   margin[b, ] <- (predict(cart, newdata = myocarde, type = "prob")[, "Survival"] > .5) !=
+     (myocarde$PRONO == "Survival")
+ }
> apply(margin, 2, mean)

[Figure: bagged classification regions in the (PVENT, REPUL) plane]
Bagging Trees

Bagging is interesting because of the instability of CARTs (in terms of tree structure, not necessarily in terms of prediction).
Bagging and Variance, Bagging and Bias

Assume that y = m(x) + ε. The mean squared error over repeated random samples can be decomposed in three parts (Hastie et al. (2001))

    E[(Y − m̂(x))²] = σ²  +  (E[m̂(x)] − m(x))²  +  E[(m̂(x) − E[m̂(x)])²]
                     (1)          (2)                     (3)

(1) reflects the variance of Y around m(x); (2) is the squared bias of m̂(x); (3) is the variance of m̂(x).

−→ bias-variance tradeoff. The bootstrap can be used to reduce the bias, and the variance (but be careful with outliers).
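A minimal simulation sketch of this decomposition at one point x = 0.5 (the data-generating process, smoother and all names are illustrative, not from the slides):

> set.seed(1)
> m <- function(x) sin(2 * pi * x)
> est <- replicate(200, {
+   x <- runif(100); y <- m(x) + rnorm(100, sd = .3)
+   predict(loess(y ~ x), newdata = data.frame(x = .5))
+ })
> c(bias2 = (mean(est) - m(.5))^2, variance = var(est))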
With the ipred package:

> library(ipred)
> BAG <- bagging(PRONO ~ PVENT + REPUL, data = myocarde)
> pred_BAG = function(p, r) {
+   return(predict(BAG, newdata = data.frame(PVENT = p, REPUL = r), type = "prob")[, 2])
+ }

[Figure: bagged classification regions in the (PVENT, REPUL) plane]
Random Forests

Strictly speaking, when bootstrapping among observations, and aggregating, we use a bagging algorithm. In the random forest algorithm, we combine Breiman's bagging idea and the random selection of features, introduced independently by Ho (1995) and Amit & Geman (1997).

> library(randomForest)
> RF <- randomForest(PRONO ~ PVENT + REPUL, data = myocarde)
> pred_RF = function(p, r) {
+   return(predict(RF, newdata = data.frame(PVENT = p, REPUL = r), type = "prob")[, 2])
+ }

[Figure: random forest classification regions in the (PVENT, REPUL) plane]
Random Forest

At each node, select √k covariates out of k (randomly).

This can deal with "small n, large k" problems. Random forests are used not only for prediction, but also to assess variable importance (discussed later on).
Support Vector Machine

SVMs were developed in the 90's based on previous work, from Vapnik & Lerner (1963), see Valiant (1984).

Assume that the points are linearly separable, i.e. there exist ω and b such that

    Y = +1 if ωᵀx + b > 0
    Y = −1 if ωᵀx + b < 0

Problem: there is an infinite number of solutions; we need a good one, that separates the data while staying (somehow) far from the data.

Concept: VC dimension. Let H : {h : Rᵈ ↦ {−1, +1}}. Then H is said to shatter a set of points X if all dichotomies can be achieved. E.g. with three points, all configurations can be achieved by a linear separator.
Support Vector Machine

[Figure: configurations of three and of four labelled points in the plane]

E.g. with four points, several configurations cannot be achieved with a linear separator (but they can with a quadratic one).
Support Vector Machine

Vapnik's (VC) dimension is the size of the largest shattered subset of X. This dimension is interesting to get an upper bound on the probability of misclassification (with some complexity penalty, function of VC(H)).

Now, in practice, where is the optimal hyperplane? The distance from x₀ to the hyperplane ωᵀx + b = 0 is

    d(x₀, H_{ω,b}) = |ωᵀx₀ + b| / ‖ω‖

and the optimal hyperplane (in the separable case) maximizes

    min_{i=1,··· ,n} d(xᵢ, H_{ω,b}).
Support Vector Machine

Define support vectors as observations such that |ωᵀxᵢ + b| = 1. The margin is the distance between the hyperplanes defined by the support vectors. The distance from the support vectors to H_{ω,b} is ‖ω‖⁻¹, and the margin is then 2‖ω‖⁻¹.

−→ the algorithm minimizes the inverse of the margin s.t. H_{ω,b} separates the ±1 points, i.e.

    min (1/2) ωᵀω   s.t.   Yᵢ(ωᵀxᵢ + b) ≥ 1, ∀i.
Support Vector Machine

The problem is difficult to solve directly: many (n) inequality constraints −→ solve the dual problem.

In the primal space, the solution is

    ω = Σᵢ αᵢ Yᵢ xᵢ   with   Σᵢ αᵢ Yᵢ = 0.

In the dual space, the problem becomes (hint: consider the Lagrangian)

    max { Σᵢ αᵢ − (1/2) Σᵢ,ⱼ αᵢ αⱼ Yᵢ Yⱼ xᵢᵀxⱼ }   s.t.   Σᵢ αᵢ Yᵢ = 0

which is usually written

    min (1/2) αᵀQα − 1ᵀα   s.t.   0 ≤ αᵢ ∀i and yᵀα = 0

where Q = [Qᵢ,ⱼ] and Qᵢ,ⱼ = yᵢ yⱼ xᵢᵀxⱼ.
Support Vector Machine

Now, what about the non-separable case? Here, we cannot have yᵢ(ωᵀxᵢ + b) ≥ 1 ∀i.

−→ introduce slack variables ξᵢ ≥ 0,

    ωᵀxᵢ + b ≥ +1 − ξᵢ  when yᵢ = +1
    ωᵀxᵢ + b ≤ −1 + ξᵢ  when yᵢ = −1

There is a classification error when ξᵢ > 1. The idea is then to solve

    min (1/2) ωᵀω + C 1ᵀ1_{ξ>1}   instead of   min (1/2) ωᵀω.
Support Vector Machines, with a Linear Kernel

So far,

    d(x₀, H_{ω,b}) = min_{x∈H_{ω,b}} {‖x₀ − x‖_{ℓ₂}}

where ‖ · ‖_{ℓ₂} is the Euclidean (ℓ₂) norm,

    ‖x₀ − x‖_{ℓ₂} = √((x₀ − x)·(x₀ − x)) = √(x₀·x₀ − 2 x₀·x + x·x)

> library(kernlab)
> SVM2 <- ksvm(PRONO ~ PVENT + REPUL, data = myocarde,
+   prob.model = TRUE, kernel = "vanilladot")
> pred_SVM2 = function(p, r) {
+   return(predict(SVM2, newdata = data.frame(PVENT = p, REPUL = r), type = "probabilities")[, 2])
+ }

[Figure: linear-kernel SVM classification regions in the (PVENT, REPUL) plane]
Support Vector Machines, with a Non-Linear Kernel

More generally,

    d(x₀, H_{ω,b}) = min_{x∈H_{ω,b}} {‖x₀ − x‖_k}

where ‖ · ‖_k is some kernel-based norm,

    ‖x₀ − x‖_k = √(k(x₀, x₀) − 2 k(x₀, x) + k(x, x))

> library(kernlab)
> SVM2 <- ksvm(PRONO ~ PVENT + REPUL, data = myocarde,
+   prob.model = TRUE, kernel = "rbfdot")
> pred_SVM2 = function(p, r) {
+   return(predict(SVM2, newdata = data.frame(PVENT = p, REPUL = r), type = "probabilities")[, 2])
+ }

[Figure: radial-kernel SVM classification regions in the (PVENT, REPUL) plane]
Heuristics on SVMs

An interpretation is that the data aren't linearly separable in the original space, but might become separable after some kernel transformation.

[Figure: points that are not linearly separable in the original space become separable after a kernel mapping]
Still Hungry?

There are still several (machine learning) techniques that can be used for classification

• Fisher's Linear or Quadratic Discrimination (closely related to logistic regression, and PCA), see Fisher (1936):

    X | Y = 0 ∼ N(μ₀, Σ₀)   and   X | Y = 1 ∼ N(μ₁, Σ₁)
Still Hungry?

• Perceptron, or more generally Neural Networks. "In machine learning, neural networks are a family of statistical learning models inspired by biological neural networks and are used to estimate or approximate functions that can depend on a large number of inputs and are generally unknown" (wikipedia), see Rosenblatt (1957)

• Boosting (see next section)

• Naive Bayes. "In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features" (wikipedia), see Russell & Norvig (2003)

See also the (great) package

> library(caret)
Difference in Differences

In many applications (e.g. marketing), we need two models to analyze the impact of a treatment. We need two groups, a control and a treatment group.
Data: {(xᵢ, yᵢ)} for the treatment group and {(xⱼ, yⱼ)} for the control group.
See clinical trials, treatment vs. control group. E.g. a direct mail campaign in a bank:

              Control   Promotion
No Purchase    85.17%     61.60%
Purchase       14.83%     38.40%

overall uplift effect +23.57%, see Guelman et al. (2014) for more details.
Application on Motor Insurance Claims

Consider a (standard) logistic regression, on two covariates (the age of the driver, and the age of the camping-car)

    π = logit⁻¹(β₀ + β₁x₁ + β₂x₂)

> reg_glm = glm(nombre ~ ageconducteur + agevehicule, data = camping, family = binomial)

[Figure: predicted claim probability as a function of driver age ("Age du conducteur principal") and vehicle age ("Age du véhicule")]
Application on Motor Insurance Claims

Consider a (standard) logistic regression, on two covariates, smoothed with splines

    π = logit⁻¹(β₀ + s₁(x₁) + s₂(x₂))

> reg_add = glm(nombre ~ bs(ageconducteur) + bs(agevehicule), data = camping, family = binomial)

[Figure: predicted claim probability, additive spline model]
Application on Motor Insurance Claims

Consider a (standard) logistic regression, on two covariates, smoothed with a bivariate spline

    π = logit⁻¹(β₀ + s(x₁, x₂))

> library(mgcv)
> reg_gam = gam(nombre ~ s(ageconducteur, agevehicule), data = camping, family = binomial)

[Figure: predicted claim probability, bivariate spline surface]
Application on Motor Insurance Claims

One can also use k-Nearest Neighbours (kNN)

> library(caret)
> sc = sd(camping$ageconducteur)
> sv = sd(camping$agevehicule)
> knn = knn3((nombre == 1) ~ I(ageconducteur / sc) + I(agevehicule / sv), data = camping, k = 100)

(be careful about scaling problems)

[Figure: kNN-smoothed claim probability surface]
Application on Motor Insurance Claims

We can also use a tree

> tree = rpart((nombre == 1) ~ ageconducteur + agevehicule, data = camping, cp = 7e-4)

[Figure: tree-based claim probability surfaces, for two values of the complexity parameter]
Application on Motor Insurance Claims

or bagging techniques (rather close to random forests)

> library(ipred)
> bag = bagging((nombre == 1) ~ ageconducteur + agevehicule, data = camping)
> library(randomForest)
> rf = randomForest((nombre == 1) ~ ageconducteur + agevehicule, data = camping)

[Figure: bagged / random-forest claim probability surface]
Application on Motor Insurance Claims

> library(dismo)
> library(gbm)
> fit <- gbm.step(data = camping, gbm.y = 13, family = "bernoulli", tree.complexity = 5,
+   learning.rate = 0.001, bag.fraction = 0.5)   # the gbm.x argument (predictor columns) is truncated in the source
> predict(fit, type = "response", n.trees = 700)

[Figure: boosted-tree claim probability surface, learning rate 0.001]
Application on Motor Insurance Claims

Boosting algorithms can also be considered (see next time)

> library(dismo)
> library(gbm)
> fit <- gbm.step(data = camping, gbm.y = 13, family = "bernoulli", tree.complexity = 5,
+   learning.rate = 0.01, bag.fraction = 0.5)   # gbm.x (predictor columns) truncated in the source
> predict(fit, type = "response", n.trees = 400)

[Figure: boosted-tree claim probability surface, learning rate 0.01]
Part 2. Regression
Regression?

In statistics, regression analysis is a statistical process for estimating the relationships among variables [...] In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: wikipedia.)

Here regression is opposed to classification (as in the CART algorithm): y is either a continuous variable, y ∈ R, or a counting variable, y ∈ N.
Regression? Parametrics, Nonparametrics and Machine Learning

In many cases in the econometric and actuarial literature we simply want a good fit for the conditional expectation, E[Y | X = x]. Regression analysis estimates the conditional expectation of the dependent variable given the independent variables (Source: wikipedia).

Example: a popular nonparametric technique, kernel-based regression,

    m̂(x) = Σᵢ Yᵢ · K_h(Xᵢ − x) / Σᵢ K_h(Xᵢ − x)

In the econometric literature, the focus is on asymptotic normality properties and plug-in techniques. In machine learning, the focus is on out-of-sample cross-validation algorithms.
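A minimal hand-rolled sketch of this kernel estimator, with a Gaussian kernel (the simulated data and the bandwidth h are illustrative, not from the slides):

> set.seed(1)
> X <- runif(200); Y <- sin(2 * pi * X) + rnorm(200, sd = .3)
> m_hat <- function(x, h = .05) sapply(x, function(x0)
+   sum(Y * dnorm((X - x0) / h)) / sum(dnorm((X - x0) / h)))
> u <- seq(0, 1, by = .01)
> plot(X, Y); lines(u, m_hat(u), col = "red")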
Linear, Non-Linear and Generalized Linear

Linear Model:
• (Y | X = x) ∼ N(θ_x, σ²)
• E[Y | X = x] = θ_x = xᵀβ

> fit <- lm(y ~ x, data = df)

The uncertainty of the fit can be visualized by resampling, either the residuals or the observations, e.g.

> for (i in 1:100) {
+   ind <- sample(1:n, size = n, replace = TRUE)
+   fit <- lm(y ~ x, data = df[ind, ])
+ }
> predict(fit, newdata = data.frame(X = x))
Regression Smoothers: Spline Functions

> library(splines)
> fit <- lm(y ~ bs(x), data = df)
> predict(fit, newdata = data.frame(X = x))

see Generalized Additive Models.
Fixed Knots vs. Optimized Ones

> library(freeknotsplines)
> gen <- freelsgen(df$x, df$y, degree = 3, numknot = 2)   # the original call is truncated; arguments are illustrative
> fit <- lm(y ~ bs(x, knots = gen@optknot), data = df)
> predict(fit, newdata = data.frame(X = x))
Interpretation of Penalty

Unbiased estimators are important in mathematical statistics, but are they the best estimators? Consider an i.i.d. sample {y₁, · · · , yₙ} with distribution N(μ, σ²). Define θ̂ = αȲ. What is the optimal α to get the best estimator of μ?

• bias: bias(θ̂) = E[θ̂] − μ = (α − 1)μ
• variance: Var(θ̂) = α²σ²/n
• mse: mse(θ̂) = (α − 1)²μ² + α²σ²/n

The optimal value is α* = μ² / (μ² + σ²/n) < 1.
Linear Model

Consider the linear model yᵢ = xᵢᵀβ + εᵢ for all i = 1, · · · , n, where the εᵢ are i.i.d. with E(ε) = 0 (and finite variance). In matrix form, y = Xβ + ε, with y the n × 1 response vector, X the n × (k + 1) design matrix (with a first column of 1's), β the (k + 1) × 1 coefficient vector and ε the n × 1 error vector.

Assuming ε ∼ N(0, σ²I), the maximum likelihood estimator of β is

    β̂ = argmin{‖y − Xβ‖_{ℓ₂}} = (XᵀX)⁻¹Xᵀy

... under the assumption that XᵀX is a full-rank matrix. What if XᵀX cannot be inverted? Then β̂ = [XᵀX]⁻¹Xᵀy does not exist, but β̂_λ = [XᵀX + λI]⁻¹Xᵀy always exists if λ > 0.
Ridge Regression

The estimator β̂ = [XᵀX + λI]⁻¹Xᵀy is the Ridge estimate, obtained as the solution of

    β̂ = argmin_β { Σᵢ [yᵢ − β₀ − xᵢᵀβ]² + λ‖β‖²_{ℓ₂} }

for some tuning parameter λ. One can also write

    β̂ = argmin_{β; ‖β‖_{ℓ₂} ≤ s} {‖y − Xβ‖_{ℓ₂}}

Remark: note that we solve β̂ = argmin{objective(β)}, where

    objective(β) = L(β) [training loss] + R(β) [regularization]
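As an illustration of a ridge fit in R (a sketch using glmnet with alpha = 0 on the myocarde data; the variable choices are illustrative, not from the slides):

> library(glmnet)
> X <- model.matrix(~ . - PRONO, data = myocarde)[, -1]
> y <- (myocarde$PRONO == "Survival") * 1
> ridge <- glmnet(X, y, alpha = 0, family = "binomial")   # alpha = 0 gives the ridge (ℓ2) penalty
> plot(ridge, xvar = "lambda")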
Going Further on Sparsity Issues

In several applications, k can be (very) large, but a lot of features are just noise: βⱼ = 0 for many j's. Let s denote the number of relevant features, with s << k. We wish we could solve

    β̂ = argmin_{β; ‖β‖_{ℓ₀} ≤ s} {‖y − Xβ‖_{ℓ₂}}

Problem: it is usually not possible to enumerate all possible constraints, since choosing s coefficients out of k leads to a huge number of subsets when k is (very) large.

Idea: solve the dual problem

    β̂ = argmin_{β; ‖y − Xβ‖_{ℓ₂} ≤ h} {‖β‖_{ℓ₀}}

where we might convexify the ℓ₀ norm, ‖ · ‖_{ℓ₀}.
Regularization `0 , `1 et `2
@freakonometrics
79
http://www.ub.edu/riskcenter
Optimal LASSO Penalty Use cross validation, e.g. Kfold, b β (−k) (λ) = argmin
X
i6∈Ik
2 [yi − xT i β] + λkβk
then compute the sum of the squared errors, X 2 b Qk (λ) = [yi − xT i β (−k) (λ)] i∈Ik
and finally solve (
1 X λ = argmin Q(λ) = Qk (λ) K
)
?
k
Note that this might overfit, so Hastie, Tibshiriani & Friedman (2009) suggest the largest λ such that K X 1 Q(λ) ≤ Q(λ? ) + se[λ? ] with se[λ]2 = 2 [Qk (λ) − Q(λ)]2 K k=1
@freakonometrics
80
http://www.ub.edu/riskcenter
Going further on sparcity issues On [−1, +1]k , the convex hull of kβk`0 is kβk`1 On [−a, +a]k , the convex hull of kβk`0 is a−1 kβk`1 Hence, b = argmin {kY − X T βk` } β 2 β;kβk`1 ≤˜ s
is equivalent (KuhnTucker theorem) to the Lagragian optimization problem b = argmin{kY − X T βk` +λkβk` } β 2 1
@freakonometrics
81
http://www.ub.edu/riskcenter
LASSO Least Absolute Shrinkage and Selection Operator b ∈ argmin{kY − X T βk` +λkβk` } β 2 1 is a convex problem (several algorithms? ), but not strictly convex (no unicity of b are unique b = xT β the minimum). Nevertheless, predictions y
?
MM, minimize majorization, coordinate descent Hunter (2003).
@freakonometrics
82
http://www.ub.edu/riskcenter
1
> freq = merge ( contrat , nombre _ RC )
2
> freq = merge ( freq , nombre _ DO )
3
> freq [ ,10]= as . factor ( freq [ ,10])
4
> mx = cbind ( freq [ , c (4 ,5 ,6) ] , freq [ ,9]== " D " ,
LASSO, third party
freq [ ,3]% in % c ( " A " ," B " ," C " ) ) 5
> colnames ( mx ) = c ( names ( freq ) [ c (4 ,5 ,6) ] , " diesel " ," zone " )
8
[1]
puissance agevehicule ageconducteur diesel
1
−10
−9
−8
−7
−6
−5
0.10
3
4
1
zone
−0.05
> names ( mx )
Coefficients
7
4
0.00
mx [ , i ]) ) / sd ( mx [ , i ])
4
0.05
> for ( i in 1: ncol ( mx ) ) mx [ , i ]=( mx [ , i ]  mean (
4
−0.10
6
4
10
> library ( glmnet )
−0.15
9
> fit = glmnet ( x = as . matrix ( mx ) , y = freq [ ,11] ,
−0.20
3
offset = log ( freq [ ,2]) , family = " poisson
5
Log Lambda
") 11
> plot ( fit , xvar = " lambda " , label = TRUE )
@freakonometrics
83
http://www.ub.edu/riskcenter
4
4 4
1
−0.05
0.00
0.10
3
−0.10
Coefficients
LASSO, third party
2
0.05
0
2
> cvfit = cv . glmnet ( x = as . matrix ( mx ) , y = freq
3
−0.15
> plot ( fit , label = TRUE )
−0.20
1
[ ,11] , offset = log ( freq [ ,2]) , family = "
5
0.0
0.1
0.2
0.4
L1 Norm
poisson " ) 3
0.3
> plot ( cvfit )
> log ( cvfit $ lambda . min )
7
[1] 8.16453
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
0.246
0.248
• Cross validation curve + error bars
0.256
6
0.254
[1] 0.0002845703
0.252
5
0.250
> cvfit $ lambda . min
Poisson Deviance
4
0.258
4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 2 1
−10
@freakonometrics
−9
−8
−7
log(Lambda)
−6
−5
84
LASSO, material damage (DO) frequency

> freq = merge(contrat, nombre_RC)
> freq = merge(freq, nombre_DO)
> freq[, 10] = as.factor(freq[, 10])
> mx = cbind(freq[, c(4, 5, 6)], freq[, 9] == "D", freq[, 3] %in% c("A", "B", "C"))
> colnames(mx) = c(names(freq)[c(4, 5, 6)], "diesel", "zone")
> for (i in 1:ncol(mx)) mx[, i] = (mx[, i] - mean(mx[, i])) / sd(mx[, i])
> names(mx)
[1] puissance  agevehicule  ageconducteur  diesel  zone
> library(glmnet)
> fit = glmnet(x = as.matrix(mx), y = freq[, 12], offset = log(freq[, 2]), family = "poisson")
> plot(fit, xvar = "lambda", label = TRUE)

[Figure: LASSO coefficient paths against log λ and against the ℓ₁ norm]

> plot(fit, label = TRUE)
> cvfit = cv.glmnet(x = as.matrix(mx), y = freq[, 12], offset = log(freq[, 2]), family = "poisson")
> plot(cvfit)
> cvfit$lambda.min
[1] 0.0004744917
> log(cvfit$lambda.min)
[1] -7.653266

• Cross-validation curve + error bars

[Figure: cross-validated Poisson deviance as a function of log λ]
Some Thoughts about Tuning Parameters

Regularization is a key issue in machine learning, to avoid overfitting. (Traditional) econometrics is based on plug-in methods: see Silverman's bandwidth rule in kernel density estimation,

    h* = (4σ̂⁵ / 3n)^{1/5} ∼ 1.06 σ̂ n^{−1/5}.

In the machine learning literature, out-of-sample cross-validation methods are used for choosing the amount of regularization.
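In R this rule of thumb is directly available (a quick illustration on simulated data):

> x <- rnorm(1000)
> bw.nrd(x)                          # rule-of-thumb bandwidth, close to 1.06 * sd * n^(-1/5) for Gaussian-looking data
> 1.06 * sd(x) * length(x)^(-1/5)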
Optimal LASSO Penalty

(As above) use cross-validation, e.g. K-fold: compute β̂_{(−k)}(λ) on the data without fold I_k, the out-of-fold error Q_k(λ) = Σ_{i∈I_k} [yᵢ − xᵢᵀβ̂_{(−k)}(λ)]², and pick λ* = argmin{Q̄(λ) = (1/K) Σ_k Q_k(λ)}, or, to avoid overfitting, the largest λ with Q̄(λ) ≤ Q̄(λ*) + se[λ*], where se[λ]² = (1/K²) Σ_k [Q_k(λ) − Q̄(λ)]² (Hastie, Tibshirani & Friedman (2009)).
Big Data, Oracle and Sparsity

Assume that k is large, and that β ∈ Rᵏ can be partitioned as β = (β_imp, β_non-imp), as well as the covariates x = (x_imp, x_non-imp), with important and non-important variables, i.e. β_non-imp ∼ 0. The goal is to achieve variable selection and make inference on β_imp.

Oracle property of high-dimensional model selection and estimation, see Fan & Li (2001). Only the oracle knows which variables are important... If the sample size is large enough (n >> k_imp (1 + log(k / k_imp))) we can do inference as if we knew which covariates were important: we can ignore the selection-of-covariates step, which is not relevant for the confidence intervals. This provides cover for ignoring the shrinkage and using regular standard errors, see Athey & Imbens (2015).
Why Shrinkage Regression Estimates?

Interesting for model selection (an alternative to penalized criteria) and to get a good balance between bias and variance. In decision theory, an admissible decision rule is a rule for making a decision such that there is no other rule that is always better. When k ≥ 3, ordinary least squares estimators are not admissible, see the improvement given by the James-Stein estimator.
Regularization and Scalability

What if k is (extremely) large? "Never trust OLS with more than five regressors" (attributed to Zvi Griliches in Athey & Imbens (2015)).

Use regularization techniques, see Ridge, Lasso, or subset selection,

    β̂ = argmin { Σᵢ [yᵢ − β₀ − xᵢᵀβ]² + λ‖β‖_{ℓ₀} }   where ‖β‖_{ℓ₀} = Σₖ 1(βₖ ≠ 0).
Penalization and Splines

In order to get a sufficiently smooth model, why not penalize the sum of squares of errors,

    Σᵢ [yᵢ − m(xᵢ)]² + λ ∫ [m''(t)]² dt

for some tuning parameter λ. Consider a cubic spline basis, so that

    m(x) = Σ_{j=1}^J θⱼ Nⱼ(x)

then the optimal expression for m is obtained using

    θ̂ = [NᵀN + λΩ]⁻¹Nᵀy

where N_{i,j} is the matrix of the Nⱼ(Xᵢ)'s and Ω_{i,j} = ∫ Nᵢ''(t) Nⱼ''(t) dt.
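This penalized criterion is exactly what smooth.spline solves in R, with λ chosen by (generalized) cross-validation (a small illustration on simulated data, not from the slides):

> set.seed(1)
> x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = .3)
> fit <- smooth.spline(x, y)          # lambda picked by GCV by default
> plot(x, y); lines(fit, col = "red")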
Smoothing with Multiple Regressors

Actually

    Σᵢ [yᵢ − m(xᵢ)]² + λ ∫ [m''(t)]² dt

is based on some multivariate penalty functional, e.g.

    ∫ [m''(t)]² dt = ∫ [ Σᵢ (∂²m(t)/∂tᵢ²)² + 2 Σᵢ,ⱼ (∂²m(t)/∂tᵢ∂tⱼ)² ] dt
Regression Trees

The partitioning is sequential, one covariate at a time (see adaptive neighbor estimation). Start with

    Q = Σᵢ [yᵢ − ȳ]²

For covariate k and threshold t, split the data according to {x_{i,k} ≤ t} (L) or {x_{i,k} > t} (R). Compute

    ȳ_L = Σ_{i: x_{i,k} ≤ t} yᵢ / Σ_{i: x_{i,k} ≤ t} 1   and   ȳ_R = Σ_{i: x_{i,k} > t} yᵢ / Σ_{i: x_{i,k} > t} 1

and let

    mᵢ^{(k,t)} = ȳ_L if x_{i,k} ≤ t, and ȳ_R if x_{i,k} > t.
Regression Trees

Then compute

    (k*, t*) = argmin { Σᵢ [yᵢ − mᵢ^{(k,t)}]² }

and partition the space into two subspaces, whether x_{k*} ≤ t*, or not. Then repeat this procedure, and minimize

    Σᵢ [yᵢ − mᵢ]² + λ · #{leaves}

(cf. LASSO). One can also consider random forests with regression trees.
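A small sketch of a regression tree on simulated data (in rpart the cost-complexity penalty above is controlled through the cp parameter; all names are illustrative):

> library(rpart)
> set.seed(1)
> df <- data.frame(x = runif(500))
> df$y <- sin(2 * pi * df$x) + rnorm(500, sd = .3)
> fit <- rpart(y ~ x, data = df, cp = .01)   # cp is the relative-gain / complexity threshold
> u <- seq(0, 1, by = .01)
> plot(df$x, df$y); lines(u, predict(fit, newdata = data.frame(x = u)), type = "s", col = "red")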
Local Regression

The idea is to fit a (linear) model locally, with weights Wᵢ that decrease with the distance between xᵢ and the point x where the prediction is made.

> library(KernSmooth)
> library(sp)
Local Regression: Kernel-Based Smoothing

> library(np)
> fit <- npreg(y ~ x, data = df)
> predict(fit, newdata = data.frame(X = x))
From Linear to Generalized Linear Models

The (Gaussian) Linear Model and the logistic regression have been extended to the wide class of the exponential family,

    f(y | θ, φ) = exp( [yθ − b(θ)] / a(φ) + c(y, φ) ),

where a(·), b(·) and c(·) are functions, θ is the natural (canonical) parameter and φ is a nuisance parameter. The Gaussian distribution N(μ, σ²) belongs to this family, with θ = μ (θ ↔ E(Y)), φ = σ² (φ ↔ Var(Y)), a(φ) = φ and b(θ) = θ²/2.
From Linear to Generalized Linear Models

The Bernoulli distribution B(p) belongs to this family, with θ = log[p / (1 − p)] = g⋆(E(Y)), a(φ) = 1, b(θ) = log(1 + exp(θ)), and φ = 1, where g⋆(·) is some link function (here the logistic transformation): the canonical link.

Canonical links are

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
From Linear to Generalized Linear Models

Observe that μ = E(Y) = b'(θ) and Var(Y) = b''(θ) · φ = b''([b']⁻¹(μ)) · φ, where b''([b']⁻¹(μ)) is the variance function V(μ).

−→ distributions are characterized by this variance function, e.g. V(μ) = 1 for the Gaussian family (homoscedastic models), V(μ) = μ for the Poisson, V(μ) = μ² for the Gamma distribution, and V(μ) = μ³ for the inverse-Gaussian family. Note that g⋆(·) = [b']⁻¹(·) is the canonical link.

Tweedie (1984) suggested a power-type variance function, V(μ) = μᵞ · φ. When γ ∈ (1, 2), Y has a compound Poisson distribution with Gamma jumps.

> library(tweedie)
From the Exponential Family to GLM's

So far, there is no regression model. Assume that

    f(yᵢ | θᵢ, φ) = exp( [yᵢθᵢ − b(θᵢ)] / a(φ) + c(yᵢ, φ) )   where θᵢ = g⋆⁻¹(g(xᵢᵀβ))

so that the log-likelihood is

    log L(θ, φ | y) = [ Σᵢ yᵢθᵢ − Σᵢ b(θᵢ) ] / a(φ) + Σᵢ c(yᵢ, φ).

To derive the first-order conditions, observe that we can write

    ∂ log L(θ, φ | yᵢ) / ∂βⱼ = ωᵢ,ⱼ xᵢ,ⱼ [yᵢ − μᵢ]

for some ωᵢ,ⱼ (see e.g. Müller (2004)), which are simple when g⋆ = g.
From the Exponential Family to GLM's

The first-order conditions can be written

    XᵀW⁻¹[y − μ] = 0

which are the first-order conditions of a weighted linear regression model. As for the logistic regression, W depends on the unknown β's: use an iterative algorithm.

1. Set μ̂₀ = ȳ, θ₀ = g(μ̂₀) and z₀ = θ₀ + (y − μ̂₀)g'(μ̂₀). Define W₀ = diag[g'(μ̂₀)² Var(ŷ)] and fit a (weighted) linear regression of z₀ on X, i.e.

    β̂₁ = [XᵀW₀⁻¹X]⁻¹ XᵀW₀⁻¹ z₀

2. Set μ̂ₖ = Xβ̂ₖ, θₖ = g(μ̂ₖ) and zₖ = θₖ + (y − μ̂ₖ)g'(μ̂ₖ). Define Wₖ = diag[g'(μ̂ₖ)² Var(ŷ)] and fit a (weighted) linear regression of zₖ on X, i.e.

    β̂ₖ₊₁ = [XᵀWₖ⁻¹X]⁻¹ XᵀWₖ⁻¹ zₖ

and loop... until the changes in β̂ₖ₊₁ are (sufficiently) small. Then set β̂ = β̂∞.

Under some technical conditions, we can prove that β̂ → β and

    √n(β̂ − β) →ᴸ N(0, I(β)⁻¹)

where numerically I(β) = φ · [XᵀW∞⁻¹X].
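A minimal hand-rolled version of this iterative scheme, for a Poisson regression with log link (a sketch on simulated data, to be compared with glm; all names are illustrative):

> set.seed(1)
> x <- runif(200); y <- rpois(200, exp(1 + 2 * x))
> X <- cbind(1, x); beta <- c(0, 0)
> for (k in 1:25) {
+   eta <- X %*% beta; mu <- exp(eta)
+   z <- eta + (y - mu) / mu                  # working response, since g'(mu) = 1/mu
+   W <- diag(as.vector(mu))                  # weights for the log link
+   beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)
+ }
> cbind(beta, coef(glm(y ~ x, family = poisson)))   # the two columns should match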
From the Exponential Family to GLM's

We estimate (see linear regression estimation) φ by

    φ̂ = (1 / (n − dim(X))) Σᵢ ωᵢ,ᵢ [yᵢ − μ̂ᵢ]² / Var(μ̂ᵢ)

This asymptotic expression can be used to derive confidence intervals, or tests. But it might be a poor approximation when n is small. See the use of the bootstrap in claims reserving.

Those are theoretical results: in practice, the algorithm may fail to converge.
GLM's outside the Exponential Family?

Actually, it is possible to consider more general distributions, see Yee (2014)

> library(VGAM)
> vglm(y ~ x, family = Makeham)
> vglm(y ~ x, family = Gompertz)
> vglm(y ~ x, family = Erlang)
> vglm(y ~ x, family = Frechet)
> vglm(y ~ x, family = pareto1(location = 100))

Those functions can also be used for a multivariate response y.
GLM: Link and Distribution

[Figure: combinations of link functions and distributions]
GLM: Distribution?

From a computational point of view, the Poisson regression is not (really) related to the Poisson distribution. Here we solve the first-order conditions (or normal equations)

    Σᵢ [Yᵢ − exp(Xᵢᵀβ)] Xᵢ,ⱼ = 0   ∀j

with unconstrained β, using Fisher's scoring technique

    βₖ₊₁ = βₖ − Hₖ⁻¹∇ₖ,   where   Hₖ = −Σᵢ exp(Xᵢᵀβₖ) Xᵢ Xᵢᵀ   and   ∇ₖ = Σᵢ Xᵢ [Yᵢ − exp(Xᵢᵀβₖ)]

−→ There is no assumption here about Y ∈ N: it is possible to run a Poisson regression on non-integers.
The Exposure and (Annual) Claim Frequency

In General Insurance, we should predict yearly claims frequency. Let Nᵢ denote the number of claims over one year for contract i. We only observe the contract over a period of time Eᵢ. Let Yᵢ denote the observed number of claims, over the period [0, Eᵢ].
The Exposure and (Annual) Claim Frequency

Assuming that claims occurrence is driven by a Poisson process of intensity λ, if Nᵢ ∼ P(λ), then Yᵢ ∼ P(λ · Eᵢ), where N is the annual frequency.

    L(λ; Y, E) = Πᵢ e^{−λEᵢ} [λEᵢ]^{Yᵢ} / Yᵢ!

The first-order condition is

    ∂/∂λ log L(λ; Y, E) = −Σᵢ Eᵢ + (1/λ) Σᵢ Yᵢ = 0

for

    λ̂ = Σᵢ Yᵢ / Σᵢ Eᵢ = Σᵢ ωᵢ (Yᵢ / Eᵢ)   where   ωᵢ = Eᵢ / Σᵢ Eᵢ.
The Exposure and (Annual) Claim Frequency

Assume that Yᵢ ∼ P(λᵢ · Eᵢ) where λᵢ = exp[Xᵢᵀβ]. Here

    E(Yᵢ | Xᵢ) = Var(Yᵢ | Xᵢ) = λᵢEᵢ = exp[Xᵢᵀβ + log Eᵢ]

    log L(β; Y) = Σᵢ Yᵢ · [Xᵢᵀβ + log Eᵢ] − exp[Xᵢᵀβ + log Eᵢ] − log(Yᵢ!)
> model <- glm(Y ~ ., data = db, offset = log(E), family = poisson)   # call truncated in the source; the key point is the log-exposure offset

Boosting (with trees)

> gbm.step(data = myocarde, gbm.x = 1:7, gbm.y = 8, family = "bernoulli",
+   tree.complexity = 5, learning.rate = 0.01)   # remaining arguments inferred from the plot title ("d - 5, lr - 0.01")

[Figure: holdout deviance as a function of the number of trees, "PRONO01, d - 5, lr - 0.01"]
Exponential distribution, deviance, loss function, residuals, etc.

• Gaussian distribution ←→ ℓ₂ loss function: the deviance is Σᵢ (yᵢ − m(xᵢ))², with gradient ε̂ᵢ = yᵢ − m(xᵢ)

• Laplace distribution ←→ ℓ₁ loss function: the deviance is Σᵢ |yᵢ − m(xᵢ)|, with gradient ε̂ᵢ = sign(yᵢ − m(xᵢ))

• Bernoulli {−1, +1} distribution ←→ adaboost loss function: the deviance is Σᵢ e^{−yᵢ m(xᵢ)}, with gradient ε̂ᵢ = −yᵢ e^{−yᵢ m(xᵢ)}

• Bernoulli {0, 1} distribution: the deviance is 2 Σᵢ [yᵢ log(yᵢ / m(xᵢ)) + (1 − yᵢ) log((1 − yᵢ) / (1 − m(xᵢ)))], with gradient ε̂ᵢ = yᵢ − exp[m(xᵢ)] / (1 + exp[m(xᵢ)])

• Poisson distribution: the deviance is 2 Σᵢ [yᵢ log(yᵢ / m(xᵢ)) − (yᵢ − m(xᵢ))], with gradient ε̂ᵢ = (yᵢ − m(xᵢ)) / √m(xᵢ)
Regularized GLM

In Regularized GLMs, we introduce a penalty in the loss function (the deviance), see e.g. the ℓ₁-regularized logistic regression

    max { Σᵢ ( yᵢ [β₀ + xᵢᵀβ] − log[1 + e^{β₀ + xᵢᵀβ}] ) − λ Σⱼ |βⱼ| }

> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[, 1:7])
> glm_ridge <- glmnet(x, y, alpha = 0, family = "binomial")
> plot(glm_ridge)

[Figure: coefficient paths against the ℓ₁ norm]
Collective vs. Individual Model

Consider a Tweedie distribution, with variance-function power p ∈ (1, 2), mean μ and scale parameter φ. Then it is a compound Poisson model,

• N ∼ P(λ) with λ = φμ^{2−p} / (2 − p)
• Yᵢ ∼ G(α, β) with α = −(p − 2)/(p − 1) and β = φμ^{1−p} / (p − 1)

Conversely, consider a compound Poisson model N ∼ P(λ) and Yᵢ ∼ G(α, β):

• the variance-function power is p = (α + 2)/(α + 1)
• the mean is μ = λα/β
• the scale parameter is φ = [λα]^{(α+2)/(α+1) − 1} β^{2 − (α+2)/(α+1)} / (α + 1)

This seems to be equivalent... but it's not.

In the context of regression,

    Nᵢ ∼ P(λᵢ) with λᵢ = exp[Xᵢᵀβ_λ]
    Y_{j,i} ∼ G(μᵢ, φ) with μᵢ = exp[Xᵢᵀβ_μ]

Then Sᵢ = Y_{1,i} + · · · + Y_{N,i} has a Tweedie distribution

• the variance-function power is p = (φ + 2)/(φ + 1)
• the mean is λᵢμᵢ
• the scale parameter is λᵢ^{1/(φ+1) − 1} μᵢ^{−φ/(φ+1)} φ/(1 + φ)

There are 1 + 2 dim(X) degrees of freedom.
Collective vs. Individual Model

Note that the scale parameter should not depend on i. A Tweedie regression is

• variance-function power p ∈ (1, 2)
• mean μᵢ = exp[Xᵢᵀβ_Tweedie]
• scale parameter φ

There are 2 + dim(X) degrees of freedom. Note that one can also easily boost a Tweedie model

> library(TDboost)
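A Tweedie regression itself can be fitted with the tweedie() family from the statmod package (a sketch; the data frame, variables and the choice p = 1.5 are purely illustrative):

> library(statmod)   # provides the tweedie() family for glm()
> fit <- glm(cost ~ ageconducteur + agevehicule, data = camping,
+   family = tweedie(var.power = 1.5, link.power = 0))   # link.power = 0 is the log link
> summary(fit)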
Part 3. Model Choice, Feature Selection, etc.
AIC, BIC

AIC and BIC are both maximum-likelihood based and penalize useless parameters (to avoid overfitting),

    AIC = −2 log[likelihood] + 2k   and   BIC = −2 log[likelihood] + log(n) k

AIC focuses on overfit, while BIC depends on n, so it might also avoid underfit. BIC penalizes complexity more than AIC does.
Minimizing AIC ⇔ minimizing leave-one-out cross-validation, Stone (1977). Minimizing BIC ⇔ k-fold leave-out cross-validation, Shao (1997), with k = n[1 − 1/(log n − 1)].

−→ used in econometric stepwise procedures.
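In R both criteria are directly available, e.g. on the logistic model used throughout Part 1 (a quick illustration):

> logistic <- glm(PRONO ~ ., data = myocarde, family = binomial)
> AIC(logistic)
> BIC(logistic)
> step(logistic, k = 2, trace = 0)                       # stepwise search with the AIC penalty
> step(logistic, k = log(nrow(myocarde)), trace = 0)     # the same with the BIC penalty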
Cross-Validation

Formally, the leave-one-out cross-validation criterion is

    CV = (1/n) Σᵢ ℓ(yᵢ, m̂₋ᵢ(xᵢ))

where m̂₋ᵢ is obtained by fitting the model on the sample where observation i has been dropped, e.g.

    CV = (1/n) Σᵢ [yᵢ − m̂₋ᵢ(xᵢ)]²

The generalized cross-validation, for a quadratic loss function, is defined as

    GCV = (1/n) Σᵢ ( [yᵢ − m̂(xᵢ)] / [1 − trace(S)/n] )²
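A minimal leave-one-out loop, for the logistic regression on the myocarde data (a sketch, using a quadratic (Brier-type) loss):

> n <- nrow(myocarde)
> err <- rep(NA, n)
> for (i in 1:n) {
+   fit <- glm(PRONO ~ ., data = myocarde[-i, ], family = binomial)
+   p <- predict(fit, newdata = myocarde[i, ], type = "response")
+   err[i] <- ((myocarde$PRONO[i] == "Survival") - p)^2
+ }
> mean(err)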
Cross-Validation for kernel-based local regression

Econometric approach: define m̂(x) = β̂₀^{[x]} + β̂₁^{[x]} x with

    (β̂₀^{[x]}, β̂₁^{[x]}) = argmin_{(β₀,β₁)} { Σᵢ ω_{h*}^{[x]} [yᵢ − (β₀ + β₁xᵢ)]² }

where h* is given by some rule of thumb (see the previous discussion).

[Figure: local linear fit with a rule-of-thumb bandwidth]
Cross-Validation for kernel-based local regression

Bootstrap-based approach: use bootstrap samples, compute h*_b on each, and get the corresponding estimates m̂_b(x).

[Figure: local linear fits on bootstrap samples, and the distribution of the bootstrapped bandwidths]
Cross-Validation for kernel-based local regression

Statistical learning approach (cross-validation, leave-one-out): given i ∈ {1, · · · , n} and h, solve

    (β̂₀^{[(i),h]}, β̂₁^{[(i),h]}) = argmin_{(β₀,β₁)} { Σ_{j≠i} ω_h^{(i)} [yⱼ − (β₀ + β₁xⱼ)]² }

and compute m̂_{(i)}^{[h]}(xᵢ) = β̂₀^{[(i),h]} + β̂₁^{[(i),h]} xᵢ. Define

    mse(h) = Σᵢ [yᵢ − m̂_{(i)}^{[h]}(xᵢ)]²

and set h* = argmin{mse(h)}. Then compute m̂(x) = β̂₀^{[x]} + β̂₁^{[x]} x with

    (β̂₀^{[x]}, β̂₁^{[x]}) = argmin_{(β₀,β₁)} { Σᵢ ω_{h*}^{[x]} [yᵢ − (β₀ + β₁xᵢ)]² }
Cross-Validation for kernel-based local regression

[Figure: mse(h) as a function of h, leave-one-out, and the resulting fit]
Cross-Validation for kernel-based local regression

Statistical learning approach (cross-validation, k-fold): given a fold I ⊂ {1, · · · , n} and h, solve

    (β̂₀^{[(I),h]}, β̂₁^{[(I),h]}) = argmin_{(β₀,β₁)} { Σ_{j∉I} ω_h^{(I)} [yⱼ − (β₀ + β₁xⱼ)]² }

and compute m̂_{(I)}^{[h]}(xᵢ) = β̂₀^{[(I),h]} + β̂₁^{[(I),h]} xᵢ, ∀i ∈ I. Define

    mse(h) = Σ_I Σ_{i∈I} [yᵢ − m̂_{(I)}^{[h]}(xᵢ)]²

and set h* = argmin{mse(h)}. Then compute m̂(x) = β̂₀^{[x]} + β̂₁^{[x]} x with

    (β̂₀^{[x]}, β̂₁^{[x]}) = argmin_{(β₀,β₁)} { Σᵢ ω_{h*}^{[x]} [yᵢ − (β₀ + β₁xᵢ)]² }
Cross-Validation for kernel-based local regression

[Figure: mse(h) as a function of h, k-fold, and the resulting fit]
Cross-Validation for Ridge & Lasso

> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[, 1:7])
> cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0)   # ridge; the exact arguments are partly truncated in the source
> cvfit$lambda.min
[1] 0.0408752
> plot(cvfit)
> cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # lasso
> cvfit$lambda.min
[1] 0.03315514
> plot(cvfit)

[Figure: cross-validated binomial deviance as a function of log λ, for the ridge fit (all 7 coefficients kept) and the lasso fit (number of non-zero coefficients decreasing with λ)]
Variable Importance for Trees

Given some random forest with M trees, set

    I(Xₖ) = (1/M) Σₘ Σₜ (Nₜ/N) Δi(t)

where the first sum is over all trees, and the second one is over all nodes where the split is done based on variable Xₖ.

> RF = randomForest(PRONO ~ ., data = myocarde)
> varImpPlot(RF, main = "")
> importance(RF)
      MeanDecreaseGini
FRCAR         1.107222
INCAR         8.194572
INSYS         9.311138
PRDIA         2.614261
PAPUL         2.341335
PVENT         3.313113
REPUL         7.078838

[Figure: variable importance plot (mean decrease in Gini): INSYS, INCAR, REPUL, PVENT, PRDIA, PAPUL, FRCAR]
Partial Response Plots

One can also compute partial response plots,

    x ↦ (1/n) Σᵢ Ê[Y | Xₖ = x, X_{i,(k)} = x_{i,(k)}]

> importanceOrder <- order(-RF$importance)
> names <- rownames(RF$importance)[importanceOrder]
> for (name in names)
+   partialPlot(RF, myocarde, eval(name), col = "red", main = "", xlab = name)
Feature Selection

Use Mallows' Cp, from Mallow (1974), on all subsets of predictors, in a regression

    Cp = (1/S²) Σᵢ [Yᵢ − Ŷᵢ]² − n + 2p,

> library(leaps)
> y <- as.numeric(train_myocarde$PRONO)
> x <- data.frame(train_myocarde[, -8])
> selec = leaps(x, y, method = "Cp")
> plot(selec$size - 1, selec$Cp)
Feature Selection Use random forest algorithm, removing some features at each iterations (the less relevent ones). The algorithm uses shadow attributes (obtained from existing features by shuffling the values). 20
> library ( Boruta )
2
> B plot ( B ) ●
@freakonometrics
INSYS
INCAR
REPUL
PRDIA
PAPUL
5
PVENT
3
143
http://www.ub.edu/riskcenter
Feature Selection Use random forests, and variable importance plots 1
> library ( varSelRFBoot )
2
> X Y library ( randomForest )
5
> rf VB plot ( VB )
@freakonometrics
Number of variables
144
8
7
6
0.00 2
FALSE )
0.05
5
> V library ( randomForest )
2
> fit = randomForest ( PRONO ~ . , data = train _ myocarde )
3
> train _ Y =( train _ myocarde $ PRONO == " Survival " )
4
> test _ Y =( test _ myocarde $ PRONO == " Survival " )
5
> train _ S = predict ( fit , type = " prob " , newdata = train _ myocarde ) [ ,2]
6
> test _ S = predict ( fit , type = " prob " , newdata = test _ myocarde ) [ ,2]
7
> vp = seq (0 ,1 , length =101)
8
> roc _ train = t ( Vectorize ( function ( u ) roc . curve ( train _Y , train _S , s = u ) ) ( vp ) )
9
> roc _ test = t ( Vectorize ( function ( u ) roc . curve ( test _Y , test _S , s = u ) ) ( vp ) )
10
> plot ( roc _ train , type = " b " , col = " blue " , xlim =0:1 , ylim =0:1)
@freakonometrics
146
Comparing Classifiers: ROC Curves

The Area Under the Curve, AUC, can be interpreted as the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, see Swets, Dawes & Monahan (2000).

Many other quantities can be computed, see

> library(hmeasure)
> HMeasure(Y, S)$metrics[, 1:5]
Class labels have been switched from (Death, Survival) to (0, 1)
               H      Gini       AUC      AUCH        KS
scores 0.7323154 0.8834154 0.9417077 0.9568966 0.8144499

with the H-measure (see hmeasure), Gini and AUC, as well as the area under the convex hull (AUCH).
Comparing Classifiers: ROC Curves

Consider our previous logistic regression (on heart attacks)

> logistic <- glm(PRONO ~ ., data = myocarde, family = binomial)
> S <- predict(logistic, type = "response")
> Y <- (myocarde$PRONO == "Survival")
> library(ROCR)
> pred <- prediction(S, Y)
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)

[Figure: ROC curve of the logistic regression (true positive rate vs. false positive rate)]
Comparing Classifiers: ROC Curves

One can get confidence bands (obtained using bootstrap procedures)

> library(pROC)
> roc <- plot.roc(Y, S, main = "", percent = TRUE, ci = TRUE)
> roc.se <- ci.se(roc, specificities = seq(0, 100, 5))
> plot(roc.se, type = "shape", col = "light blue")

[Figure: ROC curve with bootstrap confidence band, Sensitivity (%) vs. Specificity (%)]

see also, for Gains and Lift curves,

> library(gains)
http://www.ub.edu/riskcenter
Comparing Classifiers: Accuracy and Kappa
Kappa statistic κ compares an Observed Accuracy with an Expected Accuracy (random chance), see Landis & Koch (1977).

         Ŷ = 0   Ŷ = 1
Y = 0      TN      FP    TN+FP
Y = 1      FN      TP    FN+TP
        TN+FN   FP+TP        n

See also Observed and Random Confusion Tables

Observed confusion table
         Ŷ = 0   Ŷ = 1
Y = 0      25       4     29
Y = 1       3      39     42
           28      43     71

Random confusion table
         Ŷ = 0   Ŷ = 1
Y = 0   11.44   17.56     29
Y = 1   16.56   25.44     42
           28      43     71

total accuracy = (TP + TN)/n ≈ 90.14%
random accuracy = ([TN+FP]·[TN+FN] + [FN+TP]·[FP+TP]) / n² ≈ 51.93%
κ = (total accuracy − random accuracy)/(1 − random accuracy) ≈ 79.48%
@freakonometrics
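A minimal sketch recomputing these quantities in R from the observed confusion table above.
TN <- 25; FP <- 4; FN <- 3; TP <- 39
n <- TN + FP + FN + TP
total_accuracy  <- (TP + TN) / n                                          # ~ 0.9014
random_accuracy <- ((TN + FP) * (TN + FN) + (FN + TP) * (FP + TP)) / n^2  # ~ 0.5193
kap <- (total_accuracy - random_accuracy) / (1 - random_accuracy)         # ~ 0.7948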
150
http://www.ub.edu/riskcenter
Comparing Models on the myocarde Dataset
@freakonometrics
151
http://www.ub.edu/riskcenter
Comparing Models on the myocarde Dataset
If we average over all training samples
[Figure: boxplots of performance across training samples for loess, glm, aic, knn, svm, bag tree, nn, boost, gbm and rf]
@freakonometrics
152
http://www.ub.edu/riskcenter
Gini and Lorenz Type Curves
> L <- ...
Maps and Polygons
> plot(Poly_Sicily, col = "yellow", add = TRUE)
[Figure: map of Italy with regions labelled (Abruzzo, Apulia, Basilicata, Calabria, ...)]
@freakonometrics
164
http://www.ub.edu/riskcenter
Maps and Polygons
> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)
> abline(v = 5:20, col = "light blue")
> abline(h = 35:50, col = "light blue")
> axis(1)
> axis(2)
[Figure: map of Italy with Sicily highlighted and a longitude/latitude grid]
@freakonometrics
165
http://www.ub.edu/riskcenter
Maps and Polygons
> pos <- which(ita1$NAME_1 %in% south)    # 'south' is the vector of southern region names (lost in the extraction)
> ita_south <- ita1[pos,]
> ita_north <- ita1[-pos,]
> plot(ita1)
> plot(ita_south, col = "red", add = TRUE)
> plot(ita_north, col = "blue", add = TRUE)
@freakonometrics
166
http://www.ub.edu/riskcenter
Maps and Polygons
> library(xlsx)
> data_codes <- read.xlsx(...)                    # region codes, file not recoverable from the extract
> names(data_codes)[1] = "NAME_1"
> ita2 <- merge(ita1, data_codes, by = "NAME_1")  # merge on region name (assumed)

Maps and Polygons
> library(rgeos)
> ita_s <- gUnionCascaded(ita_south)              # dissolve the southern regions into one polygon (assumed)
> ita_n <- gUnionCascaded(ita_north)
> plot(ita1)
> plot(ita_s, col = "red", add = TRUE)
> plot(ita_n, col = "blue", add = TRUE)
@freakonometrics
168
http://www.ub.edu/riskcenter
Maps and Polygons
On polygons, it is also possible to visualize centroids,
> plot(ita1, col = "light green")
> plot(Poly_Sicily, col = "yellow", add = TRUE)
> gCentroid(Poly_Sicily, byid = TRUE)
SpatialPoints:
          x        y
14 14.14668 37.58842
Coordinate Reference System (CRS) arguments:
+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0
> points(gCentroid(Poly_Sicily, byid = TRUE), pch = 19, col = "red")
@freakonometrics
169
http://www.ub.edu/riskcenter
Maps and Polygons
or
> G <- gCentroid(ita1, byid = TRUE)
> plot(ita1, col = "light green")
> text(G$x, G$y, 1:20)
[Figure: map of Italy with the region indices (1-20) printed at the centroids]
@freakonometrics
170
http://www.ub.edu/riskcenter
Maps and Polygons
Consider two trajectories, characterized by a series of knots (from the centroids list)
> c1 <- ...
> c2 <- ...
> plot(cross_road, pch = 19, cex = 1.5, add = TRUE)
@freakonometrics
172
http://www.ub.edu/riskcenter
Maps and Polygons
To add elements on maps, consider, e.g.
> plot(ita1, col = "light green")
> grat <- ...

Maps, with R
Given the Paris polygons (poly_paris), test whether a point lies in polygon i
> point_in_i = function(i, point)
+   point.in.polygon(point[1], point[2], poly_paris[poly_paris$PID == i, "X"],
+                    poly_paris[poly_paris$PID == i, "Y"])
> where_is_point = function(point)
+   which(Vectorize(function(i) point_in_i(i, point))(1:length(paris)) > 0)
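A tiny self-contained check of sp::point.in.polygon(), independent of the Paris shapefile: a unit square and two test points.
library(sp)
point.in.polygon(0.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))   # returns 1: inside
point.in.polygon(1.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))   # returns 0: outside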
@freakonometrics
178
http://www.ub.edu/riskcenter
Maps, with R 1
> where _ is _ point ( point )
2
[1] 100
1
> library ( RColorBrewer )
2
> plotclr vizualize _ point is . box = function (i , loc ) {
2
+ box _ i = minmax ( i )
3
+ x = box _ i [1]) & ( loc [1] which . box which . box ( point )
4
[1]
● ●
48.862
3
48.864
(1: length ( paris ) ) ) } 1 100 101
●
●
48.860
● ●
2
●
● ● ●●● ● ●● ●
> polygon ( poly _ paris [ poly _ paris $ PID ==1 , c ( " X "
48.858
> plot ( sub _ poly [ , c ( " X " ," Y " ) ] , col = " white " )
● ●● ● ●
●●
48.856
1
●
Y
see
● ●
●
●
," Y " ) ] , col = plotclr [2]) 3
> polygon ( poly _ paris [ poly _ paris $ PID ==100 , c ( "
48.854
●
●
2.320
2.325
2.330
2.335
2.340
2.345
X
X " ," Y " ) ] , col = plotclr [1]) 4
> polygon ( poly _ paris [ poly _ paris $ PID ==101 , c ( " X " ," Y " ) ] , col = plotclr [2])
@freakonometrics
182
http://www.ub.edu/riskcenter
Maps, with R Finally use 1
> which . poly 0) idx _ valid = c ( idx _ valid , i ) }
9
+ return ( idx _ valid ) }
to identify the IRIS polygon 1
> which . poly ( point )
2
[1] 100
@freakonometrics
183
http://www.ub.edu/riskcenter
Maps, with R
> vizualize_point(...)
> MAIF <- geocode("200 Avenue Salvador Allende, 79000 Niort, France")
Information from URL : http://maps.googleapis.com/maps/api/geocode/
> Rennes <- geocode("11 Place Hoche, 35000 Rennes, France")
> library(geosphere)
> distHaversine(MAIF, Rennes, r = 6378.137)
[1] 217.9371
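A quick sanity check with a hand-written haversine formula; the Niort and Rennes coordinates below are approximate, so the result is only close to the 217.9 km obtained above with the geocoded positions.
haversine_km <- function(p1, p2, r = 6378.137) {   # p1, p2 = (longitude, latitude) in degrees
  rad <- pi / 180
  dlat <- (p2[2] - p1[2]) * rad
  dlon <- (p2[1] - p1[1]) * rad
  a <- sin(dlat / 2)^2 + cos(p1[2] * rad) * cos(p2[2] * rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
haversine_km(c(-0.46, 46.32), c(-1.67, 48.12))     # Niort to Rennes, roughly 218-220 km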
@freakonometrics
185
http://www.ub.edu/riskcenter
(in km), while the driving distance is
> mapdist(as.numeric(MAIF), as.numeric(Rennes), mode = 'driving')
by using this function you are agreeing to the terms at :
http://code.google.com/apis/maps/documentation/distancematrix/
                                                from                                     to      m      km   miles seconds minutes    hours
1 200 Avenue Salvador Allende, 79000 Niort, France   11 Place Hoche, 35000 Rennes, France 257002 257.002 159.701    9591  159.85 2.664167
@freakonometrics
186
http://www.ub.edu/riskcenter
Visualizing Maps, via Google Maps
> Loc = data.frame(rbind(MAIF, Rennes))
> library(ggmap)
> library(RgoogleMaps)
> CenterOfMap <- geocode(...)
> W <- get_map(...)
> WMap <- ggmap(W)
> WMap
or
> W <- get_map(...)
> ggmap(W)
[Figure: two Google-Maps backgrounds covering lon ∈ (−2, 0), lat ∈ (47.5, 48)]
@freakonometrics
187
http://www.ub.edu/riskcenter
Visualizing Maps, via Google Maps
From a technical point of view, those are ggplot2 maps, see ggmapCheatsheet
> W <- get_map(...)
> ggmap(W)
or
> W <- get_map(...)
> ggmap(W)
[Figure: two Google-Maps backgrounds centred on Singapore (lon ≈ 103.8, lat ≈ 1.3)]
@freakonometrics
188
http://www.ub.edu/riskcenter
Visualizing Maps, via Google Maps
> Paris <- get_map(...)
> ParisMap <- ggmap(Paris)
> library(maptools)
> library(rgdal)
> paris = readShapeSpatial("paris-cartelec.shp")
> proj4string(paris) <- CRS(...)
> paris <- spTransform(paris, CRS("+proj=longlat +datum=WGS84"))
> ParisMapPlus <- ParisMap + ...
> ParisMapPlus
[Figure: Google-Maps background of Paris with the polling-station polygons overlaid]
@freakonometrics
189
http://www.ub.edu/riskcenter
OpenStreet Map 1
> library ( OpenStreetMap )
2
> map map plot ( map )
@freakonometrics
190
http://www.ub.edu/riskcenter
OpenStreet Map 1
> library ( OpenStreetMap )
2
> map plot ( map )
It is a standard R plot, we can add points on that graph.
@freakonometrics
191
http://www.ub.edu/riskcenter
OpenStreet Map 1
> library ( maptools ) ●
2
> deaths head ( deaths@coords )
●
● ●
● ● ● ● ● ●●
● ●
●
● ●
5 6
0 1
529308.7 529312.2
● ●
8 9 10
2
181025.2
3 4
529314.4
181020.3
529317.4
181014.3
529320.7
181007.9
● ●● ● ●
● ●
●
● ●
●● ● ●
●
7
●
●
●
● ● ● ● ●
●
● ● ●
● ● ● ●
● ●
●
●●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
● ●● ● ● ●● ●●
● ● ●●
●
● ● ●● ● ● ● ● ●
●● ● ●● ●
●
● ● ● ● ●●●
●
●
●●
●
● ●
●
●
● ● ●● ●●
● ● ● ●●
● ● ● ●
● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●
●
●
● ●●
● ●●
● ● ●●
● ● ●
● ● ● ●
● ● ● ● ● ● ● ● ●
● ● ● ●
●
● ●
● ● ● ● ● ● ●
●●
● ● ● ● ● ● ●
● ● ● ● ● ● ●
●
●
●
●
● ●● ● ●●
● ●
●
● ●
● ●
● ● ● ●
●
● ●● ●● ● ●●
●
181031.4
●
●
●
coords . x1 coords . x2
4
●
●
●
●
● ●
●
> points ( deaths@coords , col = " red " , pch =19 , cex =.7 )
Cholera outbreak, in London, 1854, dataset collected by John (not Jon) Snow
@freakonometrics
192
http://www.ub.edu/riskcenter
OpenStreet Map
> X <- deaths@coords
> library(KernSmooth)
> kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))
> library(grDevices)
> clrs <- ...
> image(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat, add = TRUE, col = clrs)
[Figure: kernel density heat map of the deaths on top of the OpenStreetMap background]
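The bkde2D machinery can be tried on synthetic data, independently of the cholera shapefile; a minimal sketch (the sample size and the use of bw.ucv for the bandwidths are illustrative choices only).
library(KernSmooth)
set.seed(1)
XY <- cbind(rnorm(500), rnorm(500))
est <- bkde2D(XY, bandwidth = c(bw.ucv(XY[, 1]), bw.ucv(XY[, 2])))
image(est$x1, est$x2, est$fhat, col = rev(heat.colors(100)))   # heat map of the density estimate
contour(est$x1, est$x2, est$fhat, add = TRUE)                  # and its isodensity curves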
@freakonometrics
193
http://www.ub.edu/riskcenter
OpenStreet Map
we can add a lot of things on that map
> contour(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat, add = TRUE)
[Figure: density contours (levels from about 2e-06 to 1.4e-05) added on top of the heat map]
@freakonometrics
194
http://www.ub.edu/riskcenter
OpenStreet Map
There are alternative packages related to OpenStreetMap. See for instance
> library(leafletR)
> devtools::install_github("rstudio/leaflet")
(see http://rstudio.github.io/leaflet/ for more information).
> library(osmar)
> src <- osmsource_api()
> loc.london <- c(-.137, 51.513)                  # centre and box size assumed (Soho), lost in the extraction
> bb <- center_bbox(loc.london[1], loc.london[2], 800, 800)
> ua <- get_osm(bb, source = src)
> bg_ids <- find(ua, way(tags(k == "building")))  # building layer, tag assumed
> bg_ids <- find_down(ua, way(bg_ids))
> bg <- subset(ua, ids = bg_ids)
> bg_poly = as_sp(bg, "polygons")
> plot(bg_poly, col = gray.colors(12)[11], border = "gray")
@freakonometrics
196
http://www.ub.edu/riskcenter
OpenStreet Map
We can visualize water and leisure areas
> nat_ids = find(ua, way(tags(k == "waterway")))
> nat_ids = find_down(ua, way(nat_ids))
> nat = subset(ua, ids = nat_ids)
> nat_poly = as_sp(nat, "polygons")
> nat_ids = find(ua, way(tags(k == "leisure")))
> nat_ids = find_down(ua, way(nat_ids))
> nat = subset(ua, ids = nat_ids)
> nat_poly = as_sp(nat, "polygons")
> plot(nat_poly, col = "#99dd99", add = TRUE, border = "#99dd99")
(here we consider waterway and leisure tags)
@freakonometrics
197
http://www.ub.edu/riskcenter
OpenStreet Map
and finally, add the deaths
> points(df_deaths@coords, col = "red", pch = 19)
and some heat map
> X <- df_deaths@coords
> library(KernSmooth)
> kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))   # bandwidth as before (assumed)
> image(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat * 1000, add = TRUE, col = clrs)
> contour(x = kde2d$x1, y = kde2d$x2, z = kde2d$fhat, add = TRUE)
[Figure: building, water and leisure layers with the death locations, heat map and contour levels]
@freakonometrics
198
http://www.ub.edu/riskcenter
The analogous Google Map plot
> library(ggmap)
> get_london <- get_map(...)
> london <- ggmap(get_london)
> london
Let us add points
> df_deaths <- data.frame(deaths@coords)
> library(sp)
> library(rgdal)
> coordinates(df_deaths) = ~ coords.x1 + coords.x2
> proj4string(df_deaths) = CRS("+init=epsg:27700")
> df_deaths = spTransform(df_deaths, CRS("+proj=longlat +datum=WGS84"))
> london + geom_point(aes(x = coords.x1, y = coords.x2),
+     data = data.frame(df_deaths@coords), col = "red")
[Figure: Google-Maps background (lon ≈ −0.137, lat ≈ 51.513) with the deaths as red points]
and we can add some heat map, too,
@freakonometrics
200
http://www.ub.edu/riskcenter
> london + geom_point(aes(x = coords.x1, y = coords.x2),
+     data = data.frame(df_deaths@coords), col = "red") +
+   geom_density2d(data = data.frame(df_deaths@coords),
+     aes(x = coords.x1, y = coords.x2), size = 0.3) +
+   stat_density2d(data = data.frame(df_deaths@coords),
+     aes(x = coords.x1, y = coords.x2, fill = ..level.., alpha = ..level..),
+     size = 0.01, bins = 16, geom = "polygon") +
+   scale_fill_gradient(low = "green", high = "red", guide = FALSE) +
+   scale_alpha(range = c(0, 0.3), guide = FALSE)
[Figure: 2D density estimate of the deaths overlaid on the Google-Maps background]
@freakonometrics
201
http://www.ub.edu/riskcenter
More Interactive Maps
As discussed previously, one can use RStudio and library(leaflet), see rpubs.com/freakonometrics/
> devtools::install_github("rstudio/leaflet")
> require(leaflet)
> setwd("/cholera/")
> deaths <- readShapePoints("Cholera_Deaths")    # shapefile name assumed
> df_deaths <- data.frame(deaths@coords)
> coordinates(df_deaths) = ~ coords.x1 + coords.x2
> proj4string(df_deaths) = CRS("+init=epsg:27700")
> df_deaths = spTransform(df_deaths, CRS("+proj=longlat +datum=WGS84"))
> df = data.frame(df_deaths@coords)
> lng = df$coords.x1
> lat = df$coords.x2
> m = leaflet() %>% addTiles()
> m %>% fitBounds(-.141, 51.511, -.133, 51.516)
@freakonometrics
202
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
203
http://www.ub.edu/riskcenter
More Interactive Maps
One can add points
rd = .5
op = .8
clr = "blue"
m = leaflet() %>% addTiles()
m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr)
@freakonometrics
204
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
205
http://www.ub.edu/riskcenter
More Interactive Maps
We can also add some heatmap.
> X = cbind(lng, lat)
> kde2d <- bkde2D(X, bandwidth = c(bw.ucv(X[,1]), bw.ucv(X[,2])))   # bandwidth as before (assumed)
> x = kde2d$x1
> y = kde2d$x2
> z = kde2d$fhat
> CL = contourLines(x, y, z)
We now have a list that contains lists of polygons corresponding to isodensity curves. To visualise one of them, use
> m = leaflet() %>% addTiles()
> m %>% addPolygons(CL[[5]]$x, CL[[5]]$y, fillColor = "red", stroke = FALSE)
@freakonometrics
206
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
207
http://www.ub.edu/riskcenter
More Interactive Maps
We can get at the same time the points and the polygon
> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr) %>%
    addPolygons(CL[[5]]$x, CL[[5]]$y, fillColor = "red", stroke = FALSE)

> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr) %>%
    addPolygons(CL[[1]]$x, CL[[1]]$y, fillColor = "red", stroke = FALSE) %>%
    addPolygons(CL[[3]]$x, CL[[3]]$y, fillColor = "red", stroke = FALSE) %>%
    addPolygons(CL[[5]]$x, CL[[5]]$y, fillColor = "red", stroke = FALSE) %>%
    addPolygons(CL[[7]]$x, CL[[7]]$y, fillColor = "red", stroke = FALSE) %>%
    addPolygons(CL[[9]]$x, CL[[9]]$y, fillColor = "red", stroke = FALSE)
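Rather than listing CL[[1]], CL[[3]], ... by hand, the whole list returned by contourLines() can be added in a loop; a sketch assuming m, CL, lng, lat, rd, op and clr are the objects defined above.
m <- leaflet() %>% addTiles() %>%
  addCircles(lng, lat, radius = rd, opacity = op, col = clr)
for (i in seq_along(CL))
  m <- m %>% addPolygons(CL[[i]]$x, CL[[i]]$y, fillColor = "red", stroke = FALSE)
m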
@freakonometrics
208
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
209
http://www.ub.edu/riskcenter
More Interactive Maps
> m = leaflet() %>% addTiles()
> m %>% addCircles(lng, lat, radius = rd, opacity = op, col = clr) %>%
    addPolylines(CL[[1]]$x, CL[[1]]$y, color = "red") %>%
    addPolylines(CL[[5]]$x, CL[[5]]$y, color = "red") %>%
    addPolylines(CL[[8]]$x, CL[[8]]$y, color = "red")
@freakonometrics
210
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
211
http://www.ub.edu/riskcenter
More Interactive Maps
Another package can be considered
> require(rleafmap)
> library(sp)
> library(rgdal)
> library(maptools)
> library(KernSmooth)
> setwd("/home/arthur/Documents/")
> deaths <- readShapePoints("Cholera_Deaths")
> df_deaths <- data.frame(deaths@coords)
> coordinates(df_deaths) = ~ coords.x1 + coords.x2
> proj4string(df_deaths) = CRS("+init=epsg:27700")
> df_deaths = spTransform(df_deaths, CRS("+proj=longlat +datum=WGS84"))
> df = data.frame(df_deaths@coords)
> stamen_bm <- basemap(...)                        # basemap choice lost in the extraction
> j_snow <- spLayer(df_deaths, ...)                # point layer, arguments lost
> writeMap(stamen_bm, j_snow, width = 1000, height = 750,
+     setView = c(mean(df[,1]), mean(df[,2])), setZoom = 14)
> writeMap(stamen_bm, j_snow, width = 1000, height = 750,
+     setView = c(mean(df[,1]), mean(df[,2])), setZoom = 16)
@freakonometrics
213
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
214
http://www.ub.edu/riskcenter
More Interactive Maps
> library(spatstat)
> library(maptools)
> win <- ...
> df_deaths_ppp <- ...
> df_deaths_ppp_d <- ...
> df_deaths_d <- ...
> df_deaths_d$v[df_deaths_d$v < 10^3] <- ...
> stamen_bm <- ...
> mapquest_bm <- ...
> j_snow <- ...
> df_deaths_den <- ...
> my_ui <- ...
> writeMap(stamen_bm, mapquest_bm, j_snow, df_deaths_den, width = 1000, height = 750,
+     interface = my_ui, setView = c(mean(df[,1]), mean(df[,2])), setZoom = 16)
@freakonometrics
216
http://www.ub.edu/riskcenter
More Interactive Maps
@freakonometrics
217
http://www.ub.edu/riskcenter
Visualizing a Spatial Process
Consider car/bike accidents in Paris, see data.gouv.fr for bodily-injury car accidents in France (2006-2011, BAAC dataset) or opendata.paris.fr for accidents in Paris only.
> caraccident <- ...
> geo_loc <- ...
> mat_geo_loc <- ...
> save(mat_geo_loc, file = "mat_geo_loc.RData")
@freakonometrics
218
http://www.ub.edu/riskcenter
Visualizing a Spatial Process
Keep only accidents located in Paris
> mat_loc <- ...
> x <- ...
> idx <- which((x[,1] > 2.25) & (x[,1] < 2.4) & (x[,2] > 48.83) & (x[,2] < 48.9))
> x <- x[idx,]
> library(ks)
> fhat <- kde(x)
> image(fhat$eval.points[[1]], fhat$eval.points[[2]], fhat$estimate,
+     col = rev(heat.colors(100)), xlab = "", ylab = "",
+     xlim = c(2.25, 2.4), ylim = c(48.83, 48.9), axes = FALSE)
> plot(paris, add = TRUE, border = "grey")
> points(x, pch = 19, cex = .3)
@freakonometrics
219
http://www.ub.edu/riskcenter
Visualizing a Spatial Process
[Figure: kernel density estimate of the accident locations over the map of Paris, with the accidents shown as points]
@freakonometrics
220
http://www.ub.edu/riskcenter
Visualizing Hurricane Paths
The National Hurricane Center (NHC) collects datasets with all storms in the North Atlantic, the North Atlantic Hurricane Database (HURDAT, weather.unisys.com). For all storms we have the location of the storm, every six hours (at midnight, six a.m., noon and six p.m.), the maximal wind speed (on a 6-hour window) and the pressure in the eye of the storm. E.g. for 2012, http://weather.unisys.com/hurricane/atlantic/2012/index.php
@freakonometrics
221
http://www.ub.edu/riskcenter
http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat
Date: 21-31 OCT 2012   Hurricane-3 SANDY
ADV   LAT     LON     TIME       WIND   PR   STAT
 1   14.30  -77.40  10/21/18Z     25  1006   LOW
 2   13.90  -77.80  10/22/00Z     25  1005   LOW
 3   13.50  -78.20  10/22/06Z     25  1003   LOW
 4   13.10  -78.60  10/22/12Z     30  1002   TROPICAL DEPRESSION
 5   12.70  -78.70  10/22/18Z     35  1000   TROPICAL STORM
 6   12.60  -78.40  10/23/00Z     40   998   TROPICAL STORM
 7   12.90  -78.10  10/23/06Z     40   998   TROPICAL STORM
 8   13.40  -77.90  10/23/12Z     40   995   TROPICAL STORM
 9   14.00  -77.60  10/23/18Z     45   993   TROPICAL STORM
10   14.70  -77.30  10/24/00Z     55   990   TROPICAL STORM
11   15.60  -77.10  10/24/06Z     60   987   TROPICAL STORM
12   16.60  -76.90  10/24/12Z     65   981   HURRICANE-1
@freakonometrics
222
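A sketch for reading one such track file (assuming the unisys.com URL is still reachable and the layout is as above: three header lines, then one row per six-hour position); only the numeric fields are kept, since the STAT column contains spaces.
raw  <- readLines("http://weather.unisys.com/hurricane/atlantic/2012/SANDY/track.dat")
body <- raw[-(1:3)]                       # drop the date, name and column-header lines
body <- body[body != ""]                  # drop possible blank lines
flds <- strsplit(trimws(body), "\\s+")
track <- data.frame(
  LAT  = as.numeric(sapply(flds, `[`, 2)),
  LON  = as.numeric(sapply(flds, `[`, 3)),
  TIME = sapply(flds, `[`, 4),
  WIND = as.numeric(sapply(flds, `[`, 5))
)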
http://www.ub.edu/riskcenter
Data Scraping: Hurricane Dataset
1. get the name of all hurricanes for a given year
> library(XML)
> year <- 2012
> loc <- paste("http://weather.unisys.com/hurricane/atlantic/", year, "/index.php", sep = "")   # URL as on the previous slide
> tabs <- readHTMLTable(htmlParse(loc))
> tabs[1]
$`NULL`
  #                    Name         Date Wind Pres Cat
1 1  Tropical Storm ALBERTO    19-23 MAY   50  995   -
2 2    Tropical Storm BERYL 25 MAY-2 JUN   60  992   -
3 3       Hurricane-1 CHRIS    17-24 JUN   75  974   1
4 4    Tropical Storm DEBBY    23-27 JUN   55  990   -
5 5     Hurricane-2 ERNESTO     1-10 AUG   85  973   2
6 6 Tropical Storm FLORENCE      3-8 AUG   50 1002   -
7 7   Tropical Storm HELENE     9-19 AUG   40 1004   -
@freakonometrics
223
http://www.ub.edu/riskcenter
Data Scraping: Hurricane Dataset
We split the Name variable to extract the name
> storms <- unlist(strsplit(as.character(tabs[[1]]$Name), split = " "))
> storms
 [1] "Tropical"    "Storm"       "ALBERTO"     "Tropical"    "Storm"
 [6] "BERYL"       "Hurricane-1" "CHRIS"       "Tropical"    "Storm"
But we keep only relevant information
> index <- storms %in% c("Tropical", "Storm", ...)     # tokens that are not storm names
> nstorms <- storms[!index]
> nstorms
> for (i in length(nstorms):1) {
+   if ((nstorms[i] == "SIXTEE") & (year == 2008)) nstorms[i] <- ...
+   ... }
On a regular grid, keep the track positions that fall in cell (i, j)
> gridx <- ...
> gridy <- ...
> idx <- which((TOTTRACK$LON >= gridx[i]) & (TOTTRACK$LON < gridx[i+1]) &
+              (TOTTRACK$LAT >= gridy[j]) & (TOTTRACK$LAT < gridy[j+1]))
Then we look for all possible next moves (i.e. 6 hours later), if any
> for (s in 1:length(idx)) {
+   locx <- ...
+   locy <- ... }

Gas Price, in France
> rm(list = ls())
> year = 2014
> loc = paste("http://donnees.roulez-eco.fr/opendata/annee/", year, sep = "")
> download.file(loc, destfile = "oil.zip")
Content type 'application/zip' length 15248088 bytes (14.5 MB)
> unzip("oil.zip", exdir = "./")
> fichier = paste("PrixCarburants_annuel_", year, ".xml", sep = "")
> library(plyr)
> library(XML)
> library(lubridate)
> l = xmlToList(fichier)
> length(l)
[1] 11064
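Each element of l corresponds to one gas station; before looping, its structure can be inspected (station number 2 is the one used on the next slide). The presence of "latitude" and "longitude" in .attrs is an assumption consistent with the code used further below.
str(l[[2]], max.level = 1)   # price records ("prix") plus an .attrs vector for the station
names(l[[2]]$.attrs)         # should contain "latitude" and "longitude", among others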
@freakonometrics
234
http://www.ub.edu/riskcenter
Gas Price, in France
To extract information for gas station no=2 and Gazole use
> prix = list()
> date = list()
> nom = list()
> j = 0; no = 2
> for (i in 1:length(l[[no]])) {
+   v = names(l[[no]])
+   if (!is.null(v[i])) {
+     if (v[i] == "prix") {
+       j = j + 1
+       date[[j]] = as.character(l[[no]][[i]]["maj"])
+       prix[[j]] = as.character(l[[no]][[i]]["valeur"])
+       nom[[j]] = as.character(l[[no]][[i]]["nom"])
+     }}
+ }
> id = which(unlist(nom) == type_gas)
@freakonometrics
235
http://www.ub.edu/riskcenter
Gas Price, in France
> ext_y = function(j) substr(date[[id[j]]], 1, 4)
> ext_m = function(j) substr(date[[id[j]]], 6, 7)
> ext_d = function(j) substr(date[[id[j]]], 9, 10)
> ext_h = function(j) substr(date[[id[j]]], 12, 13)
> ext_mn = function(j) substr(date[[id[j]]], 15, 16)
> prix_essence = function(i) as.numeric(prix[[id[i]]])/1000
> Y = unlist(lapply(1:n, ext_y))
> M = unlist(lapply(1:n, ext_m))
> D = unlist(lapply(1:n, ext_d))
> H = unlist(lapply(1:n, ext_h))
> MN = unlist(lapply(1:n, ext_mn))
> date = paste(base1$Y, "-", base1$M, "-", base1$D,
+     " ", base1$H, ":", base1$MN, ":00", sep = "")
> date_base = as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
@freakonometrics
236
http://www.ub.edu/riskcenter
Gas Price, in France
> d = paste(year, "-01-01 12:00:00", sep = "")
> f = paste(year, "-12-31 12:00:00", sep = "")
> vecteur_date = seq(as.POSIXct(d, format = "%Y-%m-%d %H:%M:%S"),
+     as.POSIXct(f, format = "%Y-%m-%d %H:%M:%S"), by = "days")
> vect_idx = Vectorize(function(t) sum(vecteur_date[t] >= date_base))(1:length(vecteur_date))
> prix_essence = function(i) as.numeric(prix[[id[i]]])/1000
> P = c(NA, unlist(lapply(1:n, prix_essence)))
> Z = ts(P[1 + vect_idx], start = year, frequency = 365)
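The vect_idx construction is a last-observation-carried-forward lookup: for each calendar day, count how many recorded price changes occurred on or before it, and use that count to index the price vector. A self-contained illustration on toy data (names chosen so as not to overwrite the objects above).
dates_chg  <- as.POSIXct(c("2014-01-10", "2014-03-01", "2014-07-15"), tz = "UTC")  # recorded price changes
prices_chg <- c(1.35, 1.32, 1.28)                                                  # price after each change
days <- seq(as.POSIXct("2014-01-01", tz = "UTC"), as.POSIXct("2014-12-31", tz = "UTC"), by = "days")
idx  <- Vectorize(function(t) sum(days[t] >= dates_chg))(1:length(days))
Zdemo <- ts(c(NA, prices_chg)[1 + idx], start = 2014, frequency = 365)             # NA before the first change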
@freakonometrics
237
http://www.ub.edu/riskcenter
Gas Price, in France
> dt = as.Date("2014-05-05")
> base = NULL
> for (no in 1:length(l)) {
+   prix = list()
+   date = list()
+   j = 0
+   for (i in 1:length(l[[no]])) {
+     v = names(l[[no]])
+     if (!is.null(v[i])) {
+       if (v[i] == "prix") {
+         j = j + 1
+         date[[j]] = as.character(l[[no]][[i]]["maj"])
+       }}}
+   n = j
+   D = as.Date(substr(unlist(date), 1, 10), "%Y-%m-%d")
+   k = which(D == D[which.max(D[D <= dt])])
+   if (length(k) > 0) {
+     B = Vectorize(function(i) l[[no]][[k[i]]])(1:length(k))
+     if ("nom" %in% rownames(B)) {
+       k = which(B["nom",] == "Gazole")
+       prix = as.numeric(B["valeur",k])/1000
+       if (length(prix) == 0) prix = NA
+       base1 = data.frame(indice = no,
+         lat = as.numeric(l[[no]]$.attrs["latitude"])/100000,
+         lon = as.numeric(l[[no]]$.attrs["longitude"])/100000,
+         gaz = prix)
+       base = rbind(base, base1)
+ }}
+ }
@freakonometrics
239
http://www.ub.edu/riskcenter
Gas Price, in France
> idx = which((base$lon > (-10)) & (base$lon < ...) & (base$lat > 35) & (base$lat < ...))
> B = base[idx,]
> Q = quantile(B$gaz, seq(0, 1, by = .01), na.rm = TRUE)
> Q[1] = 0
> x = as.numeric(cut(B$gaz, breaks = unique(Q)))
> CL = c(rgb(0, 0, 1, seq(1, 0, by = -.025)),
+     rgb(1, 0, 0, seq(0, 1, by = .025)))
> plot(B$lon, B$lat, pch = 19, col = CL[x])
> library(maps)
> map("france")
> points(B$lon, B$lat, pch = 19, col = CL[x])
@freakonometrics
240
http://www.ub.edu/riskcenter
Gas Price, in France 1
> library ( OpenStreetMap )
2
> map map plot ( map )
6
> points ( B $ lon , B $ lat , pch =19 , col = CL [ x ])
1
> library ( tripack )
2
> V plot (V , add = TRUE )
@freakonometrics
c ( lat = 47 ,
241
http://www.ub.edu/riskcenter
Gas Price, in France 1
> plot ( map )
2
> P library ( sp )
4
> point _ in _ i = function (i , point ) point . in . polygon ( point [1] , point [2] , P [[ i ]][ ,1] , P [[ i ]][ ,2])
5
> which _ point = function ( i ) which ( Vectorize ( function ( j ) point _ in _ i (i , c ( dB $ lon [ id [ j ]] , dB $ lat [ id [ j ]]) ) ) (1: length ( id ) ) >0)
6
> for ( i in 1: length ( P ) ) polygon ( P [[ i ]] , col = CL [ x [ id [ which _ point ( i ) ]]] , border = NA )
@freakonometrics
242