A poor man's Wilks phenomenon

S. Boucheron (Laboratoire de Probabilités et Modèles Aléatoires, Département de Mathématiques, Université Paris-Diderot)
P. Massart (Département de Mathématiques, Université Paris-Sud)

3rd of July 2008

Early Motivations: Wilks phenomenon

Context: maximum likelihood estimation
• $(P_\theta, \theta \in \Theta \subseteq \mathbb{R}^m)$: distributions over $\mathcal{X}$.
• $\forall \theta$, $p_\theta$: density of $P_\theta$ w.r.t. $\mu$.
• Sample $x_1, \dots, x_n$, $x_i \in \mathcal{X}$.
• $\ell_n(\theta) = \sum_{i=1}^n \log p_\theta(x_i)$.
• Assumption: $\mu^{\otimes n}$-a.s. $\exists\, \hat\theta \in \Theta$ such that
  $\ell_n(\hat\theta) = \sup_{\theta \in \Theta} \sum_{i=1}^n \log p_\theta(x_i)$.

Early Motivations: Wilks phenomenon

• If the model is "smooth enough" and $X_1, \dots, X_n, \dots \sim_{\text{i.i.d.}} P_\theta$, then
  $2\big( \ell_n(\hat\theta) - \ell_n(\theta) \big) \rightsquigarrow \chi^2_m$ and $2\, n\, D(P_\theta, P_{\hat\theta}) \rightsquigarrow \chi^2_m$
  (a small simulation sketch follows below).
• $\ell_n(\hat\theta) - \ell_n(\theta)$: excess empirical risk.
• $2\, n\, D(P_\theta, P_{\hat\theta})$: excess risk.
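A minimal simulation sketch of the $\chi^2_m$ approximation above, using the two-parameter Gaussian location-scale model ($m = 2$); the model, sample size and number of replications are illustrative choices, not taken from the talk.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, n, reps = 2, 500, 2000            # N(mu, sigma^2): m = 2 free parameters
mu0, sigma0 = 1.0, 2.0               # true parameter theta = (mu0, sigma0)

def loglik(x, mu, sigma):
    return stats.norm.logpdf(x, loc=mu, scale=sigma).sum()

lr = []
for _ in range(reps):
    x = rng.normal(mu0, sigma0, size=n)
    mu_hat, sigma_hat = x.mean(), x.std()          # MLEs of (mu, sigma)
    # 2 * (ell_n(theta_hat) - ell_n(theta)) should be approximately chi^2_m
    lr.append(2.0 * (loglik(x, mu_hat, sigma_hat) - loglik(x, mu0, sigma0)))

lr = np.array(lr)
print("mean (should be ~ m):", lr.mean())
print("95% quantile vs chi2_m:", np.quantile(lr, 0.95), stats.chi2.ppf(0.95, df=m))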


Notes (more than 2 mn):
1. Dealing with a contrast minimization problem; statistical learning also relies on M-estimation.
2. The empirical risk minimizer is well-defined.
3. The ingredients.

Applications: model selection/identification

• Wilks phenomenon and model assessment.
• Embedded models: $\Theta$ an $m$-dimensional submodel of $\Theta' \subseteq \mathbb{R}^{m+d}$.
• If $X_1, \dots, X_n, \dots \sim_{\text{i.i.d.}} P_\theta$ with $\theta \in \Theta$, then
  $2\big( \ell_n(\hat\theta') - \ell_n(\hat\theta) \big) \rightsquigarrow \chi^2_d$.
• Akaike's AIC criterion for model selection [1972].
• Other model selection criteria ...

Markov order identification [Csiszár IEEE IT 2002]

• Observations: sequences over a fixed finite alphabet $\mathcal{X}$.
• Model $k$: Markov chains of order $k$ (dimension $|\mathcal{X}|^k \times (|\mathcal{X}| - 1)$).
• Goal: Markov order identification.
• If the true order is $k^*$ and $k > k^*$ is fixed,
  $\ell_n(\hat\theta_k) - \ell_n(\hat\theta_{k^*}) \rightsquigarrow \tfrac{1}{2}\, \chi^2_{d - d^*}$
  with $d - d^* = (|\mathcal{X}| - 1) \times \big( |\mathcal{X}|^k - |\mathcal{X}|^{k^*} \big)$.
• Order identification: need uniformity over $k = O(\log n)$.
• Consistency of penalized maximum log-likelihood: BIC,
  $\tilde k = \arg\max_k \Big\{ \ell_n(\hat\theta_k) - \frac{(|\mathcal{X}| - 1)\, |\mathcal{X}|^k}{2} \log n \Big\}$
  (a computational sketch follows below).
• Considering a growing family of models.
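A rough computational sketch of the BIC criterion above for a binary alphabet; the simulated order-1 chain, the candidate orders and the use of empirical transition counts for the maximized log-likelihood are illustrative assumptions, not part of the talk.

import math
import random
from collections import Counter

random.seed(0)
alphabet = [0, 1]
n = 20000

# Simulate an order-1 binary Markov chain (the transition probabilities are arbitrary choices).
p_next_one = {0: 0.8, 1: 0.3}          # P(next symbol = 1 | current symbol)
seq = [0]
for _ in range(n - 1):
    seq.append(1 if random.random() < p_next_one[seq[-1]] else 0)

def max_loglik(seq, k):
    # Maximized log-likelihood of an order-k chain, conditionally on the first k symbols:
    # sum over contexts c and symbols a of N(c, a) * log(N(c, a) / N(c)).
    counts = Counter((tuple(seq[i - k:i]), seq[i]) for i in range(k, len(seq)))
    totals = Counter()
    for (c, a), N in counts.items():
        totals[c] += N
    return sum(N * math.log(N / totals[c]) for (c, a), N in counts.items())

def bic(seq, k):
    penalty = (len(alphabet) - 1) * len(alphabet) ** k / 2.0 * math.log(len(seq))
    return max_loglik(seq, k) - penalty

scores = {k: bic(seq, k) for k in range(6)}
print(scores, "-> selected order:", max(scores, key=scores.get))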


Notes (more than 2 mn):
1. Generalized likelihood ratio in nested exponential models.
2. Comparison of smooth parametric models.
3. Asymptotic results.
4. Seems to rely too much on likelihood inference.
5. Known to fail under loss of identifiability.

Generalizations: possible directions

1. Considering models of increasing dimensions.
2. Beyond likelihood ratio inference.
3. Generalized likelihood ratio statistics and Wilks phenomenon, Fan, Zhang & Zhang, AoS, 2001:
   • nonparametric Gaussian regression model where the parameter space is a Sobolev ball;
   • testing whether the regression function is affine against the Sobolev ball;
   • the maximum likelihood estimator in the Sobolev ball tends to have increasing dimension;
   • as $m \nearrow$, $\big( \chi^2_p - \mathbb{E}[\chi^2_p] \big) \big/ \sqrt{2\, \mathbb{E}[\chi^2_p]} \rightsquigarrow \mathcal{N}(0, 1)$ (illustrated numerically below).
4. Generalization: when centered and scaled, the log of the ratio of maximum likelihoods converges to a non-degenerate random variable.
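A quick numerical illustration of the normal approximation in item 3; the degrees of freedom and the sample size are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for p in (5, 50, 500):
    z = rng.chisquare(p, size=100_000)
    # (chi2_p - E[chi2_p]) / sqrt(2 E[chi2_p]) should look standard normal for large p
    w = (z - p) / np.sqrt(2 * p)
    print(p, w.mean(), w.std(), stats.kstest(w, "norm").statistic)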


Note: this asymptotic pivotality property paves the way to non-trivial statistical applications.

Statistical learning setting

• Bounded contrast minimization.
• $\mathcal{X} \times \mathcal{Y}$ endowed with an unknown distribution $P$; coordinate projections: $X$ and $Y$.
  • Binary classification: $\mathcal{Y} = \{-1, 1\}$.
  • Bounded regression: $\mathcal{Y} = [-b, b]$.
• Loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$.
  • Hard loss: $\ell(f(X), Y) = \mathbf{1}_{f(X) \neq Y}$.
  • Hinge loss: $\ell(f(X), Y) = (1 - f(X)\, Y)_+$.
• Risk of $f \in \mathcal{Y}^{\mathcal{X}}$: $R(f) = P\, \ell(f(X), Y) = \mathbb{E}_P\, \ell(f(X), Y)$.

Statistical learning setting

• Assumption/notation: $f^*$ minimizes $R(f)$ over $\mathcal{Y}^{\mathcal{X}}$.
• Example: the Bayes classifier in binary classification,
  $f^*(x) = 2\, \mathbf{1}_{\mathbb{E}[Y \mid X = x] > 0} - 1 = \mathrm{sign}(\mathbb{E}[Y \mid X = x])$.
• Goal: given a model $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$, find $\bar f \in \mathcal{F}$ that minimizes the risk $R(\cdot)$ over $\mathcal{F}$.
• Recipe: minimize the empirical risk (a toy sketch follows below)
  $R_n(f) = P_n\, \ell(f(X), Y) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i)$.
• Assumption/notation: $\hat f$ minimizes the empirical risk over $\mathcal{F}$.
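A toy empirical-risk-minimization sketch with the hard loss over a grid of threshold classifiers; the data-generating distribution and the class $\mathcal{F}$ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Data: X uniform on [0, 1], labels agree with the Bayes classifier sign(x - 0.5) with probability 0.8.
X = rng.uniform(size=n)
bayes = np.where(X > 0.5, 1, -1)
Y = np.where(rng.uniform(size=n) < 0.8, bayes, -bayes)

# Model F: threshold classifiers f_t(x) = sign(x - t) over a grid of thresholds.
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    # R_n(f_t) = (1/n) * sum_i 1{f_t(X_i) != Y_i}   (hard loss)
    pred = np.where(X > t, 1, -1)
    return np.mean(pred != Y)

risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[risks.argmin()]                 # empirical risk minimizer f_hat
print("t_hat =", t_hat, " R_n(f_hat) =", risks.min())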


The settings considered in the references we are aware of share a common feature: they are connected with density estimation in a Gaussian framework, and they disregard robustness considerations. This is in sharp contrast with the setting we are interested in: statistical learning. For example, in binary classification, $\mathcal{X} \times \{-1, 1\}$ is endowed with a probability distribution $P$, and the coordinate projections are denoted by $X$ and $Y$. The problem consists in finding a function $f$ on $\mathcal{X}$ such that the risk $R(f) = P\{f(X) \neq Y\}$ is as small as possible, starting from an i.i.d. sample $(X_1, Y_1), \dots, (X_n, Y_n)$ drawn from $P$. The best possible classifier $f^*$, the so-called Bayes classifier, is defined from the regression function $\eta(x) = \mathbb{E}[Y \mid X = x]$ by $f^*(x) = \mathrm{sign}(\eta(x))$.

Excess risks

• Model bias: $L(\bar f) = R(\bar f) - R(f^*)$.
• Excess risk: $R(\hat f) - R(\bar f)$.
• Excess empirical risk: $R_n(\bar f) - R_n(\hat f)$.
• Notation: $\bar R_n(f) = R_n(f) - R(f)$.
• Key identity (verified below):
  $\bar R_n(\bar f) - \bar R_n(\hat f) = \underbrace{R(\hat f) - R(\bar f)}_{\text{excess risk}} + \underbrace{R_n(\bar f) - R_n(\hat f)}_{\text{excess empirical risk}}$
• Control of excess risk / excess empirical risk $\hookrightarrow$ control of increments of the centered empirical process.
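For the record, the identity above is pure algebra (add and subtract the true risks):
$\bar R_n(\bar f) - \bar R_n(\hat f) = \big[ R_n(\bar f) - R(\bar f) \big] - \big[ R_n(\hat f) - R(\hat f) \big] = \big[ R(\hat f) - R(\bar f) \big] + \big[ R_n(\bar f) - R_n(\hat f) \big].$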

Control of increments of the centered empirical process

• If a random function $\varphi_n$ satisfies
  $\forall f \in \mathcal{F}, \quad \bar R_n(f) - \bar R_n(\bar f) \le \varphi_n(R(f))$,
• then look for the largest value of $R(f)$ that satisfies
  $R(f) - R(\bar f) \le \varphi_n(R(f))$
• $\hookrightarrow$ upper bound on $R(\hat f) - R(\bar f)$.

Controlling the modulus of continuity of $\bar R_n(\cdot) - \bar R_n(\bar f)$

• Loss class: $\mathcal{H} = \{ \ell(f(\cdot), \cdot),\ f \in \mathcal{F} \}$,
  $h(X, Y) = \ell(f(X), Y)$, $h^*(X, Y) = \ell(f^*(X), Y)$, $\bar h(X, Y) = \ell(\bar f(X), Y)$.
• Complexity of the $L_2$ neighborhood of $\bar h$ in $\mathcal{H}$:
  $\sqrt{n}\; \mathbb{E}\Big[ \sup_{h \in \mathcal{H},\, P(h - \bar h)^2 \le r^2} (P_n - P)(h - \bar h) \Big] \le \psi(r)$.
• Noise conditions:
  $\sup\Big\{ \big[ P(h - h^*)^2 \big]^{1/2} : P(h - h^*) \le r^2 \Big\} \le \omega(r)$.
• Assumptions: $\psi, \omega$ non-decreasing, continuous, non-negative; $\psi(x)/x$ and $\omega(x)/x$ non-increasing; $\psi(1), \omega(1) \ge 1$.


1. The connection between the richness of $\mathcal{F}$, the closeness of the Bayes classifier $g$ to $\mathcal{F}$, the distribution of $|\eta(\cdot)|$ (the so-called noise conditions), the distribution of the risk $R(\hat f) = P\, \mathbf{1}_{\hat f(X) \neq Y}$ and, more recently, the excess risk $R(\hat f) - R(\bar f) = P\big( \mathbf{1}_{\hat f(X) \neq Y} - \mathbf{1}_{\bar f(X) \neq Y} \big)$ (where $\bar f$ minimizes the risk in $\mathcal{F}$) has been the subject of intense research during the last thirty-five years.
2. Thanks to ideas borrowed from robust statistics, empirical process theory and the theory of concentration inequalities, we have a rather precise idea of the aforementioned connections.
3. The function $\psi$, defined by
   $\sqrt{n}\; \mathbb{E}\Big[ \sup_{f \in \mathcal{F},\, P(f - \bar f)^2 \le r^2} \big| (P - P_n)(f - \bar f) \big| \Big] \le \psi(r)$,
   aims at describing the richness of the $L_2$ neighborhood of $\bar f$ in $\mathcal{F}$,
4. while the function $\omega$ aims at describing the so-called noise conditions:
   $\sup\Big\{ \big[ P(f - g)^2 \big]^{1/2} : R(f) - R(f^*) \le r^2 \Big\} \le \omega(r)$.
5. In general $\psi$ and $\omega$ are sublinear (see the definition of the class $\mathcal{C}_1$ below). Defining $r_*$ as the positive root of the equation $\sqrt{n}\, r^2 = \psi(2\, \omega(r))$, we will check that a function of $r_*^2$ upper-bounds the expected excess risk and the expected excess empirical risk.
6. Moreover, as a by-product of the analysis, it turns out that a function of $r_*$ also upper-bounds the expected value of $P(\bar f - \hat f)^2$ and the expected value of $P_n(\bar f - \hat f)^2$.
7. As a by-product of the proofs, we establish that the tails of the distributions of those quantities are at worst exponential.

Benchmark: VC classes under gentle noise

• Examples: half-spaces in $\mathbb{R}^d$.
• Classification with the hard loss: $\ell(y, y') = \mathbf{1}_{y \neq y'}$.
• VC classes with dimension $V$.
• Random classification noise: $|\mathbb{E}[Y \mid X]| = \beta$, i.e. $Y = \mathrm{sign}(\eta(X))$ with probability $(1 + \beta)/2$.
• $\omega(r) = r / \sqrt{\beta}$
  $\hookrightarrow$ if $P(h - h^*) \le r^2$ then $P(h - h^*)^2 \le r^2 / \beta$.
• $\psi(r) = C\, r\, \sqrt{V \big( 1 + \log(1 \vee r^{-1}) \big)}$; $V$: a kind of model dimension.

Deviation for excess risk

• $(P, \ell, \mathcal{F})$: learning task.
• $r_*$: solution of $\sqrt{n}\, r^2 = \psi(2\, \omega(r))$ (a numerical sketch follows below).
• $\exists\, \kappa_1, \kappa_2, \kappa_3 \ge 1$ such that, with probability $\ge 1 - 2\delta$,
  $\max\big( R(\hat f) - R(\bar f),\ R_n(\bar f) - R_n(\hat f) \big) \le \kappa_1\, L(\bar f) + \kappa_2\, r_*^2 + \kappa_3\, r_*^2 \log\tfrac{1}{\delta}$,
  and in expectation,
  $\max\big( \mathbb{E}[R(\hat f) - R(\bar f)],\ \mathbb{E}[R_n(\bar f) - R_n(\hat f)] \big) \le \kappa_1\, L(\bar f) + (\kappa_2 + \kappa_3)\, r_*^2$.
• Tools: peeling (Huber, 1967), Vapnik & Chervonenkis, Talagrand's concentration inequality, Mammen & Tsybakov (AoS 2000), Koltchinskii (AoS 2006), Massart et al. (Toulouse 2000, Saint-Flour 2003, AoS 2006), Bartlett & Mendelson (PTRF 2004), ...

Learning VC classes

Toy problem from Kearns et al., Machine Learning, 1997.
• VC-dimension of $\mathcal{F}$: 1600.
• $R(f^*) = .2$.
• $\omega(r) = r / \sqrt{\beta}$.
• $n = 20000$.
• 1000 trials, $\beta^2 = .3$.
• $\mathbb{E}\big[ n\, (R_n(f^*) - R_n(\hat f)) \big] \approx 956$.
• Sample variance: 784.
• Blue line: $\Gamma(1165, 1.21)$.

Objectives: a poor man's Wilks phenomenon

• Understand the moments of the excess empirical risk (EER),
  $n P_n(\bar h - \hat h) = n \sup_{h \in \mathcal{H}} P_n(\bar h - h)$.
• Relate bounds on expectation, variance and ...
• Hook: the EER is a supremum of an empirical process:
  $\mathbb{E}\big[ n P_n(\bar h - h) \big] \le 0$ for each fixed $h \in \mathcal{H}$, while $\mathbb{E}\big[ n \sup_{h \in \mathcal{H}} P_n(\bar h - h) \big] \ge 0$.
• Proving concentration inequalities for $n \sup_{h \in \mathcal{H}} P_n(\bar h - h)$.

Efron-Stein estimates of variance

• $Z = h(X_1, X_2, \dots, X_n)$ for independent random variables $X_1, \dots, X_n$.
• Let $X_1', \dots, X_n' \sim X_1, \dots, X_n$, independent of $X_1, \dots, X_n$.
• For each $i \in \{1, \dots, n\}$:
  • $Z_i' = h(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$;
  • $X^{(i)} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$;
  • $h_i$: a function of $n - 1$ arguments;
  • $Z_i = h_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) = h_i(X^{(i)})$.
• Jackknife estimates of variance:
  $V_+ = \sum_{i=1}^n \mathbb{E}\big[ (Z - Z_i')_+^2 \mid X_1, \dots, X_n \big]$, $\qquad V = \sum_i (Z - Z_i)^2$.
• Efron-Stein inequalities:
  $\mathrm{Var}[Z] \le \mathbb{E}[V_+] \le \mathbb{E}[V]$.


Notes:
1. Let $Z$ now denote a (square-integrable) function of a sequence of independent random variables $(X_1, X_2, \dots, X_n)$, that is, $Z = h(X_1, \dots, X_n)$ for some function $h$. Let $X_1', \dots, X_n'$ be distributed as $X_1, \dots, X_n$ and be independent of $X_1, \dots, X_n$. For each $i$ in $1, \dots, n$, let $Z_i' = h(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$, and let $X^{(i)} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$. Let $h_i$ denote a function of $n - 1$ arguments and let $Z_i = h_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) = h_i(X^{(i)})$.
2. The jackknife estimates of variance are
   $V_+ = \sum_{i=1}^n \mathbb{E}\big[ (Z - Z_i')_+^2 \mid X_1, \dots, X_n \big]$ and $V = \sum_i (Z - Z_i)^2$.
3. Note that the last quantity is a bona fide estimator while the first one is just $X^{(i)}$-measurable. The Efron-Stein inequalities assert that the jackknife estimates of variance are upper bounds:
   $\mathrm{Var}[Z] \le \mathbb{E}[V_+] \le \mathbb{E}[V]$.   (1)
   Let us now recall how the Efron-Stein inequalities can be used to upper-bound suprema of bounded centered empirical processes (a small simulation follows below). We may define $Z_i$ as
   $Z_i = \sup_{f \in \mathcal{F}} \sum_{j \le n,\, j \neq i} f(X_j)$
   and $Z_i'$ similarly, with the missing $i$-th summand replaced by $f(X_i')$.
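A small simulation sketch of the leave-one-out bound $\mathrm{Var}[Z] \le \mathbb{E}[V]$ for a supremum of a centered empirical process; the finite class of functions (centered indicator-type functions on $[0,1]$), the sample size and the number of Monte Carlo trials are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 100, 500
thresholds = np.linspace(0.05, 0.95, 20)

def sup_process(x):
    # Z = sup_f sum_j f(x_j) over the class f_t(x) = 1{x <= t} - t (centered under Uniform[0, 1]).
    sums = (x[:, None] <= thresholds[None, :]).sum(axis=0) - len(x) * thresholds
    return sums.max()

zs, jack = [], []
for _ in range(n_trials):
    x = rng.uniform(size=n)
    z = sup_process(x)
    # Leave-one-out versions Z_i and the jackknife estimate V = sum_i (Z - Z_i)^2.
    zi = np.array([sup_process(np.delete(x, i)) for i in range(n)])
    zs.append(z)
    jack.append(((z - zi) ** 2).sum())

print("Var[Z] (Monte Carlo):", np.var(zs))
print("E[V]   (Monte Carlo):", np.mean(jack), " (Efron-Stein: Var[Z] <= E[V])")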

Variance bounds for empirical excess risk

• Let $\mathcal{F}, \mathcal{H}, f^*, \bar f, \bar h, L, \rho, \hat f, \dots$ be defined as usual.
• Assumption: the loss functions $h = \ell(f(\cdot), \cdot)$, $f \in \mathcal{F}$, are $[0, 1]$-valued.
• $\hat h_n$: minimizer of $P_n h$.
• Consequences of the Efron-Stein inequalities:
  $\mathrm{Var}\big[ n P_n(\bar h - \hat h_n) \big] \le 2n\, \mathbb{E}\big[ (P_{n-1} - P)(\bar h - \hat h_{n-1}) \big] + 2n\, \mathbb{E}\big[ P(\bar h - \hat h_n)^2 \big]$
  and
  $\mathrm{Var}\big[ n P_n(\bar h - \hat h_n) \big] \le 2n \Big( \mathbb{E}\big[ P_n(\bar h - \hat h_n)^2 \big] + \mathbb{E}\big[ P(\bar h - \hat h_n)^2 \big] \Big)$.

Notice ...

Bounds:
  $\mathrm{Var}\big[ n P_n(\bar h - \hat h_n) \big] \le 2n\, \mathbb{E}\big[ (P_{n-1} - P)(\bar h - \hat h_{n-1}) \big] + 2n\, \mathbb{E}\big[ P(\bar h - \hat h_n)^2 \big]$
and
  $\mathrm{Var}\big[ n P_n(\bar h - \hat h_n) \big] \le 2n \Big( \mathbb{E}\big[ P_n(\bar h - \hat h_n)^2 \big] + \mathbb{E}\big[ P(\bar h - \hat h_n)^2 \big] \Big)$,
to be compared with
  $\mathrm{Var}\big[ n \big( R_n(\bar f) - R_n(\hat f) \big) \big] \le n\, \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} P_n(\bar h - h)^2 \Big] + n \sup_{h \in \mathcal{H}} P(\bar h - h)^2$.


Note: $\mathrm{Var}\big[ n \big( R_n(\bar f) - R_n(\hat f) \big) \big] \le 2n \big( \mathbb{E}[R_n(\bar f) - R_n(\hat f)] + \rho^2 \big)$.

Proof

• $R_n(\bar f) - R_n(\hat f)$ is the supremum of a bounded (non-centered) empirical process.
• Refrain from using
  $\mathbb{E}\big[ (\bar h(X) - \hat h(X))^2 \mid X^{(n)} \big] \le \sup_{h \in \mathcal{H}} \mathbb{E}\big[ (\bar h(X) - h(X))^2 \big]$.
• Take advantage of bounds on the $L_2$ distance between $\hat h$ and $\bar h$:
  $\mathbb{E}\big[ L(\hat f_n) \big] + \mathbb{E}\big[ L(\hat f_n) \big] \le 6\, n\, \rho^2(C\, r_*)$.

Sketch of proof

• First Efron-Stein inequality:
  $(Z - Z_i') \le \big( \bar h - \hat h \big)(X_i, Y_i) - \big( \bar h - \hat h \big)(X_i', Y_i')$,
  $\hookrightarrow\ V_+ \le 2n \big( P_n(\bar h - \hat h)^2 + P(\bar h - \hat h)^2 \big)$.
• Taking expectations over $X_1, \dots, X_n$:
  $\mathrm{Var}\big[ n P_n(\bar h - \hat h) \big] \le \mathbb{E}[V_+] \le 2n\, \mathbb{E}\big[ P_n(\bar h - \hat h)^2 \big] + 2n\, \mathbb{E}\big[ P(\bar h - \hat h)^2 \big]$.

Deviation for excess risk

• $(P, \ell, \mathcal{F})$: learning task.
• $r_*$: solution of $\sqrt{n}\, r^2 = \psi(2\, \omega(r))$.
• $\exists\, \kappa_1, \kappa_2, \kappa_3 \ge 1$ such that, with probability $\ge 1 - 2\delta$,
  $\max\big( R(\hat f) - R(\bar f),\ R_n(\bar f) - R_n(\hat f) \big) \le \kappa_1\, L(\bar f) + \kappa_2\, r_*^2 + \kappa_3\, r_*^2 \log\tfrac{1}{\delta}$,
  and in expectation,
  $\max\big( \mathbb{E}[R(\hat f) - R(\bar f)],\ \mathbb{E}[R_n(\bar f) - R_n(\hat f)] \big) \le \kappa_1\, L(\bar f) + (\kappa_2 + \kappa_3)\, r_*^2$.
• Tools: peeling (Huber, 1967), Vapnik & Chervonenkis, Talagrand's concentration inequality, Mammen & Tsybakov (AoS 2000), Koltchinskii (AoS 2006), Massart et al. (Toulouse 2000, Saint-Flour 2003, AoS 2006), Bartlett & Mendelson (PTRF 2004), ...

Back to variance

• $(P, \ell, \mathcal{F})$: learning task.
• $\psi, \omega \in \mathcal{C}_1$: complexity and noise functions.
• Let $r_*$ denote the positive solution of $\sqrt{n}\, r^2 = \psi(2\, \omega(r))$.
• $\exists\, \kappa_4$ such that
  $\mathrm{Var}\big[ n \big( R_n(\bar f) - R_n(\hat f) \big) \big] \le n\, \kappa_4 \Big( \omega^2(r_*) + \omega^2\big( \sqrt{L(\bar f)} \big) \Big)$.

VC Classes

• VC classes under random classification noise ($L(\bar f) = 0$):
  $\hookrightarrow\ r_*^2 \le C^2 \Big( \frac{V\, (1 + \log(n\beta^2/V))}{n\beta} \wedge \sqrt{\tfrac{V}{n}} \Big)$
• $\omega^2(r_*) \le C^2 \Big( \frac{V\, (1 + \log(n\beta^2/V))}{n\beta^2} \wedge \sqrt{\tfrac{V}{n\beta^2}} \Big)$
• Hence
  $\mathbb{E}\big[ n \big( R_n(\bar f) - R_n(\hat f) \big) \big] \le (\kappa_2 + \kappa_3)\, C^2 \Big( \frac{V\, (1 + \log(n\beta^2/V))}{\beta} \wedge \sqrt{nV} \Big)$
  $\mathrm{Var}\big[ n \big( R_n(\bar f) - R_n(\hat f) \big) \big] \le \kappa_4\, C^2 \Big( \frac{V\, (1 + \log(n\beta^2/V))}{\beta^2} \wedge \sqrt{\tfrac{nV}{\beta^2}} \Big)$

Ingredients for Bernstein-like inequalities

• $Z$ satisfies a Bernstein inequality with parameters $v$ and $c$ if
  $\mathbb{P}\{Z \ge t\} \le \exp\Big( -\kappa\, \min\Big( \frac{t^2}{v}, \frac{t}{c} \Big) \Big)$.
• Recentered $\Gamma(p, c)$ random variables satisfy Bernstein inequalities.
• If $\|Z\|_q \le \sqrt{v q} + c q$ for $q \ge 2$, then $Z$ satisfies a Bernstein inequality (the standard computation is recalled below).
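To spell out the last item (this is the standard moment-to-tail computation, stated under the assumption that the moment bound holds for every $q \ge 2$): by Markov's inequality, $\mathbb{P}\{Z \ge t\} \le \|Z\|_q^q / t^q \le \big( (\sqrt{vq} + cq)/t \big)^q$. Choosing $q = \kappa \min(t^2/v,\ t/c)$ with, say, $\kappa = 1/16$ (provided this $q$ is at least $2$) gives $\sqrt{vq} \le \sqrt{\kappa}\, t$ and $cq \le \kappa\, t$, so that $\sqrt{vq} + cq \le (\sqrt{\kappa} + \kappa)\, t \le e^{-1} t$, hence
$\mathbb{P}\{Z \ge t\} \le e^{-q} = \exp\big( -\kappa \min(t^2/v,\ t/c) \big).$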

General moment bounds (B., Bousquet, Lugosi & Massart, AoP, 2005)

• Assuming:
  • $(X_1, \dots, X_n)$ independent random variables,
  • $Z = F(X_1, \dots, X_n)$,
  • $X_1', \dots, X_n'$ independent copies of $X_1, \dots, X_n$,
  • $Z_i' = F(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$,
  • $V_+ = \sum_{i=1}^n \mathbb{E}'\big[ (Z - Z_i')_+^2 \big]$,
• for any $q \ge 2$:
  $\big\| (Z - \mathbb{E}[Z])_+ \big\|_q \le \sqrt{3 q\, \| V_+ \|_{q/2}} = \sqrt{3 q}\, \big\| \sqrt{V_+} \big\|_q$.
• Assuming there exists a random variable $M$ with $(Z - Z_i')_+ \le M$ for all $i \le n$,
• for all $q \ge 2$:
  $\big\| (Z - \mathbb{E}[Z])_- \big\|_q \le \sqrt{5 q}\, \Big( \big\| \sqrt{V_+} \big\|_q \vee \| M \|_q \Big)$.

Main statement

A Bernstein-like inequality for the excess empirical risk.
Let $Z = n P_n(\bar h - \hat h_n)$. For $q \ge 2$,
  $\| Z - \mathbb{E}[Z] \|_q \le \sqrt{n}\, \kappa_5' \big( \omega(\sqrt{L(\bar f)}) + \omega(r_*) \big)\, q^{1/2} + \sqrt{n}\, \kappa_6'\, \omega(r_*)\, q$.

Deviation inequalities for L2 distances

• $\exists\, \kappa_5$ and $\kappa_6$ such that for $q \ge 2$:
  $\Big\| P(\hat h - \bar h)^2 \vee P_n(\hat h - \bar h)^2 \Big\|_q \le \kappa_5 \big( \omega^2(\sqrt{L(\bar f)}) + \omega^2(r_*) \big) + \kappa_6\, \omega^2(r_*)\, q$.
• Argument: the same as for deriving deviation inequalities for excess risk.
  • Work on $\{ (\bar h - h)^2 : h \in \mathcal{H} \}$.
  • Risk: expectation!
  • Bounded process ...
  • $P(\bar h - h)^2 \le \omega^2\big( \sqrt{L(h)} \big)$.
  • Use the contraction principle to get a convenient complexity function.

Sketch of proof

• Back to the variance bounds:
  $V_+ \le 2n \big( P_n(\bar h - \hat h_n)^2 + P(\bar h - \hat h_n)^2 \big)$.
• For $q \ge 2$:
  $\| (Z - \mathbb{E}[Z])_+ \|_q \le \sqrt{3 q\, \Big\| 2n \big( P_n(\bar h - \hat h_n)^2 + P(\bar h - \hat h_n)^2 \big) \Big\|_{q/2}} \le \sqrt{6 n q}\; \sqrt{\Big\| P_n(\bar h - \hat h_n)^2 + P(\bar h - \hat h_n)^2 \Big\|_{q/2}}$.
• Plugging the bounds on $L_2$ distances:
  $\| (Z - \mathbb{E}[Z])_+ \|_q \le 2 \sqrt{6 n \kappa_5}\, \big( \omega(\sqrt{L(\bar f)}) + \omega(r_*) \big)\, q^{1/2} + 2 \sqrt{3 n \kappa_6}\, \omega(r_*)\, q$.

Learning VC classes

Toy problem from Kearns et al., Machine Learning, 1997.
• VC-dimension of $\mathcal{F}$: 1600.
• $R(f^*) = .2$.
• $\omega(r) = r / \sqrt{\beta}$.
• $n = 20000$.
• 1000 trials, $\beta^2 = .3$.
• $\mathbb{E}\big[ n\, (R_n(f^*) - R_n(\hat f)) \big] \approx 956$.
• Sample variance: 784.
• Blue line: $\Gamma(1165, 1.21)$ (moment matching sketched below).
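A small sketch of the moment matching presumably behind the fitted curve, assuming the blue line is the Gamma distribution matching the reported sample mean and variance (shape = mean²/variance, rate = mean/variance); with mean ≈ 956 and variance ≈ 784 this gives shape ≈ 1166 and rate ≈ 1.22, close to the Gamma(1165, 1.21) reported on the slide.

# Moment-matched Gamma: shape = mean^2 / var, rate = mean / var.
mean, var = 956.0, 784.0      # values reported on the slide
shape = mean ** 2 / var
rate = mean / var
print("Gamma(shape=%.0f, rate=%.2f)" % (shape, rate))   # Gamma(shape=1166, rate=1.22)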