Bandit Optimization: Theory and Applications - Richard Combes

Bandit Optimization: Theory and Applications

Richard Combes (Centrale-Supélec / L2S, France) and Alexandre Proutière (KTH, Royal Institute of Technology, Sweden)

SIGMETRICS 2015

1 / 40

Outline

Introduction and examples of application

Tools and techniques

Discrete bandits with independent arms

2 / 40

A first example: sequential treatment allocation

- There are T patients with the same symptoms awaiting treatment
- Two treatments exist, one is better than the other
- Based on past successes and failures, which treatment should you use?

3 / 40

The model

- At time n, choose action x_n ∈ X, observe feedback y_n(x_n) ∈ Y, and obtain reward r_n(x_n) ∈ R_+.
- "Bandit feedback": rewards and feedback depend on actions (often y_n ≡ r_n)
- Admissible algorithm: x_{n+1} = f_{n+1}(x_0, r_0(x_0), y_0(x_0), ..., x_n, r_n(x_n), y_n(x_n))
- Performance metric: regret
  R(T) = max_{x∈X} E[ Σ_{n=1}^T r_n(x) ] − E[ Σ_{n=1}^T r_n(x_n) ]
  where the first term is the oracle and the second is your algorithm.

4 / 40
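
To fix ideas, here is a minimal Python sketch of the interaction protocol and the resulting (pseudo-)regret, assuming Bernoulli rewards; the function names, the toy policy and the arm means are illustrative choices, not from the slides.

```python
import numpy as np

def run_bandit(policy, theta, T, seed=0):
    """Play T rounds of a Bernoulli bandit and return the realized (pseudo-)regret.

    policy(history) -> arm index, where history is the list of past (arm, reward) pairs.
    theta: vector of arm means (unknown to the policy).
    """
    rng = np.random.default_rng(seed)
    history = []
    regret = 0.0
    for n in range(T):
        x = policy(history)                  # choose action x_n from past observations only
        r = float(rng.random() < theta[x])   # bandit feedback: only the chosen arm's reward is seen
        history.append((x, r))
        regret += max(theta) - theta[x]      # oracle mean minus mean of the played arm
    return regret

# Example: a naive policy that always plays arm 0 incurs linear regret
print(run_bandit(lambda h: 0, theta=[0.3, 0.7], T=1000))  # ≈ 400
```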

Bandit taxonomy: adversarial vs stochastic

Stochastic bandit:
- Game against a stochastic environment
- Unknown parameters θ ∈ Θ
- (r_n(x))_n is i.i.d. with expectation θ_x

Adversarial bandit:
- Game against a non-adaptive adversary
- For all x, (r_n(x))_n is an arbitrary sequence in [0, 1]
- At time 0, the adversary "writes down (r_n(x))_{n,x} in an envelope"

Engineering problems are mainly stochastic.

5 / 40

Independent vs correlated arms

- Independent arms: Θ = [0, 1]^K
- Correlated arms: Θ ≠ [0, 1]^K; choosing arm 1 gives information on both arms 1 and 2

Correlation enables (sometimes much) faster learning.

6 / 40

Bandit taxonomy: frequentist vs Bayesian

How to assess an algorithm applied to a set of problems?

Frequentist (classical):
- Problem-dependent regret: R^π_θ(T), θ fixed
- Minimax regret: max_{θ∈Θ} R^π_θ(T)
- Usually very different regret scaling

Bayesian:
- Prior distribution θ ∼ P, known to the algorithm
- Bayesian regret: E_{θ∼P}[R^π_θ(T)]
- P naturally includes information on the problem structure

7 / 40

Bandit taxonomy: cardinality of the set of arms

Discrete bandits:
- X = {1, ..., K}
- All arms can be sampled infinitely many times
- Regret O(log(T)) (stochastic), O(√T) (adversarial)

Infinite bandits:
- X = N, Bayesian setting (otherwise trivial)
- Explore o(T) arms until a good one is found
- Regret: O(√T)

Continuous bandits:
- X ⊂ R^d convex, x ↦ µ_θ(x) has a structure
- Structures: convex, Lipschitz, linear, unimodal (quasi-convex), etc.
- Similar to derivative-free stochastic optimization
- Regret: O(poly(d)√T)

8 / 40

Bandit taxonomy: regret minimization vs best arm identification

Sample arms and output the best arm with a given probability, similar to PAC learning.

Fixed budget setting:
- T fixed, sample arms x_1, ..., x_T, and output x̂_T
- Easier problem: estimation + budget allocation
- Goal: minimize P[x̂_T ≠ x*]

Fixed confidence setting:
- δ fixed, sample arms x_1, ..., x_τ, and output x̂_τ
- Harder problem: estimation + budget allocation + optimal stopping (τ is a stopping time)
- Goal: minimize E[τ] s.t. P[x̂_τ ≠ x*] ≤ δ

9 / 40

Example 1: Rate adaptation in wireless networks

- Adapting the modulation/coding scheme to the radio environment
- Rates: r_1, r_2, ..., r_K
- Success probabilities: θ_1, θ_2, ..., θ_K
- Throughputs: µ_1, µ_2, ..., µ_K

Structure: unimodality + θ_1 > θ_2 > · · · > θ_K.

10 / 40

Example 2: Shortest path routing

- Choose a path minimizing expected delay
- Stochastic delays: X_i(n) ∼ Geometric(θ_i)
- Path M ∈ {0, 1}^d, expected delay Σ_{i=1}^d M_i/θ_i
- Semi-bandit feedback: X_i(n), for {i : M_i(n) = 1}

[Figure: a grid network with nodes 1 to 16; edges carry parameters θ_{i,j}, and the path 1 → 6 → 11 → 15 → 16 is labelled by θ_{1,6}, θ_{6,11}, θ_{11,15}, θ_{15,16}.]

11 / 40

Example 3: Learning to Rank (search engines)

- Given a query, N relevant items, L display slots
- A user is shown L items, scrolls down and selects the first relevant item
- One must show the most relevant items in the first slots
- θ_n: probability of clicking on item n (independence between items is assumed)
- Reward r(ℓ) if the user clicks on the ℓ-th item, and 0 if the user does not click

[Figure: screenshot of Google search results for the query "jaguar", illustrating ranked display slots.]

12 / 40

Example 4: Ad-display optimization

- Users are shown ads relevant to their queries
- Advertisers x ∈ {1, ..., K}, with click-through rate µ_x and budget per unit of time c_x
- Bandit with budgets: each arm has a budget of plays
- The displayed advertiser is charged per impression/click

13 / 40

Outline

Introduction and examples of application

Tools and techniques

Discrete bandits with independent arms

14 / 40

Optimism in the face of uncertainty

- Replace arm values by upper confidence bounds
- "Index" b_x(n) such that b_x(n) ≥ θ_x with high probability
- Select the arm with highest index: x_n ∈ arg max_{x∈X} b_x(n)
- Analysis idea:
  E[t_x(T)] ≤ Σ_{n=1}^T P[b_{x*}(n) ≤ θ*] + Σ_{n=1}^T P[x_n = x, b_x(n) ≥ θ*]
  where the first sum is o(log(T)) and the second is the dominant term.

Almost all algorithms in the literature are optimistic (sic!)

15 / 40

Information theory and statistics

- Distributions P, Q with densities p and q w.r.t. a measure m
- Kullback-Leibler divergence:
  D(P||Q) = ∫ p(x) log(p(x)/q(x)) m(dx)
- Pinsker's inequality:
  sqrt(D(P||Q)/2) ≥ TV(P, Q) = (1/2) ∫ |p(x) − q(x)| m(dx)
- If P ∼ Ber(p) and Q ∼ Ber(q):
  D(P||Q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q))
- Also (Pinsker + inequality log(x) ≤ x − 1):
  2(p − q)² ≤ D(P||Q) ≤ (p − q)²/(q(1 − q))

The KL divergence is ubiquitous in bandit problems.

16 / 40
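
A quick numerical check of the Bernoulli KL divergence and the two bounds on the last line; this is a sketch, and kl_bernoulli (with clipping to avoid log(0)) is our own helper.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """D(Ber(p) || Ber(q)), clipped away from 0 and 1 to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.5, 0.3
d = kl_bernoulli(p, q)
print(d)                                   # ≈ 0.087
print(2 * (p - q) ** 2 <= d)               # Pinsker lower bound: True
print(d <= (p - q) ** 2 / (q * (1 - q)))   # upper bound via log(x) ≤ x − 1: True
```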

Empirical divergence and Sanov's inequality

- P a discrete distribution with support 𝒫
- P̂_n: empirical distribution of an i.i.d. sample of size n from P
- Sanov's inequality:
  P[D(P̂_n || P) ≥ δ] ≤ C(n + |𝒫| − 1, |𝒫| − 1) e^{−nδ}
- Suggests confidence regions (risk α) of the type:
  {P : n D(P̂_n || P) ≤ log(1/α)}

17 / 40

Hypothesis testing and sample complexity

- How many samples are needed to distinguish P from Q?
- Observe X = (X_1, ..., X_n) i.i.d. with distribution P^n or Q^n
- Additivity of the KL divergence: D(P^n || Q^n) = n D(P||Q)
- Test φ(X) ∈ {0, 1}, risk α > 0: E_P[φ(X)] + E_Q[1 − φ(X)] ≤ α
- Tsybakov's inequality:
  (1/2) e^{−min{D(P^n||Q^n), D(Q^n||P^n)}} ≤ E_P[φ(X)] + E_Q[1 − φ(X)]
- Minimal number of samples:
  n ≥ (log(1/α) − log(2)) / min{D(P||Q), D(Q||P)}

18 / 40
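
As a small worked instance of the last formula, the sketch below evaluates the sample-complexity lower bound for two Bernoulli distributions; the values p = 0.5, q = 0.6 and α = 0.05 are illustrative, not from the slides.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def min_samples(p, q, alpha):
    """Lower bound on the number of i.i.d. samples needed to distinguish Ber(p) from Ber(q) at risk alpha."""
    d = min(kl_bernoulli(p, q), kl_bernoulli(q, p))
    return (np.log(1 / alpha) - np.log(2)) / d

print(min_samples(0.5, 0.6, alpha=0.05))  # ≈ 115 samples for these illustrative values
```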

Regret lower bounds: general technique

- Decision x, two parameters θ, λ, with x*(λ) = x ≠ x*(θ)
- Consider an algorithm with R^π(T) = O(log(T)) for all parameters (uniformly good):
  E_θ[t_x(T)] = O(log(T)),  E_λ[t_x(T)] = T − O(log(T))
- Markov inequality:
  P_θ[t_x(T) ≥ T/2] + P_λ[t_x(T) < T/2] ≤ O(T^{−1} log(T))
- 1{t_x(T) ≤ T/2} is a hypothesis test with risk O(T^{−1} log(T))
- Hence (Neyman-Pearson / Tsybakov):
  Σ_x E_θ[t_x(T)] I(θ_x, λ_x) ≥ log(T) − O(log(log(T)))
  (the left-hand side is the KL divergence of the observations)

19 / 40

Concentration inequalities: Chernoff bounds

- Building indexes requires tight concentration inequalities
- Chernoff bounds: upper bound the MGF
- X = (X_1, ..., X_n) independent with mean µ, S_n = Σ_{n'=1}^n X_{n'}
- G such that log(E[e^{λ(X_n − µ)}]) ≤ G(λ), λ ≥ 0
- Generic technique:
  P[S_n − nµ ≥ δ] = P[e^{λ(S_n − nµ)} ≥ e^{λδ}]
                  ≤ e^{−λδ} E[e^{λ(S_n − nµ)}]   (Markov)
                  ≤ exp(nG(λ) − λδ)              (independence)
                  ≤ exp(−n max_{λ≥0} {λδ/n − G(λ)})

20 / 40

Concentration inequalities: Chernoff and Hoeffding's inequality

- Bounded variables: if X_n ∈ [a, b] a.s. then
  E[e^{λ(X_n − µ)}] ≤ e^{λ²(b−a)²/8}   (Hoeffding's lemma)
- Hoeffding's inequality:
  P[S_n − nµ ≥ δ] ≤ exp(−2δ²/(n(b − a)²))
- Subgaussian variables: E[e^{λ(X_n − µ)}] ≤ e^{σ²λ²/2}, similar
- Bernoulli variables:
  E[e^{λ(X_n − µ)}] = µe^{λ(1−µ)} + (1 − µ)e^{−λµ}
- Chernoff's inequality:
  P[S_n − nµ ≥ δ] ≤ exp(−n I(µ + δ/n, µ)),
  where I(p, q) denotes the KL divergence between Ber(p) and Ber(q)
- Pinsker's inequality: Chernoff is stronger than Hoeffding.

21 / 40
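
To see the gap between the two bounds numerically, here is a small sketch for Bernoulli variables, with the deviation written per sample (δ = nε); the values µ = 0.1, ε = 0.1, n = 100 are our own illustrative choices.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def hoeffding_bound(eps, n):
    """P[S_n - n*mu >= n*eps] <= exp(-2 n eps^2) for [0, 1]-valued variables."""
    return np.exp(-2 * n * eps ** 2)

def chernoff_bound(mu, eps, n):
    """P[S_n - n*mu >= n*eps] <= exp(-n I(mu + eps, mu)) for Bernoulli variables."""
    return np.exp(-n * kl_bernoulli(mu + eps, mu))

mu, eps, n = 0.1, 0.1, 100
print(hoeffding_bound(eps, n))      # ≈ 0.135
print(chernoff_bound(mu, eps, n))   # ≈ 1.2e-2, much tighter away from mu = 1/2
```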

Concentration inequalities: variable sample size and peeling

- In bandit problems, the sample size is random and depends on the samples themselves
- Intervals N_k = {n_k, ..., n_{k+1}}, with N = ∪_{k=1}^K N_k
- Idea: Z_n = e^{λ(S_n − nµ)} is a positive sub-martingale:
  P[max_{n∈N_k}(S_n − µn) ≥ δ] = P[max_{n∈N_k} Z_n ≥ e^{λδ}]
                               ≤ e^{−λδ} E[Z_{n_{k+1}}]   (Doob's inequality)
                               = exp(−λδ + n_{k+1} G(λ))
                               ≤ exp(−n_{k+1} max_{λ≥0} {λδ/n_{k+1} − G(λ)})
- Peeling trick (Neveu): union bound over k, with n_k = (1 + α)^k.

22 / 40

Concentration inequalities: self-normalized versions

- Self-normalized versions of classical inequalities
- Garivier's inequality:
  P[max_{1≤n≤T} n I(S_n/n, µ) ≥ δ] ≤ 2e⌈log(T)δ⌉ e^{−δ}
- From Pinsker's inequality (self-normalized Hoeffding):
  P[max_{1≤n≤T} √n |S_n/n − µ| ≥ δ] ≤ 4e⌈log(T)δ²⌉ e^{−2δ²}
- Multi-dimensional version, with Y_k^{n_k} = n_k I(S_{n_k}/n_k, µ):
  P[max_{(n_1,...,n_K)∈[1,T]^K} Σ_{k=1}^K Y_k^{n_k} ≥ δ] ≤ C_K (log(T)δ)^K e^{−δ}

23 / 40

Outline

Introduction and examples of application

Tools and techniques

Discrete bandits with independent arms

24 / 40

The Lai-Robbins bound

- Actions X = {1, ..., K}
- Rewards θ = (θ_1, ..., θ_K) ∈ [0, 1]^K
- Uniformly good algorithm: R(T) = O(log(T)), ∀θ

Theorem (Lai '85)
For any uniformly good algorithm and any x s.t. θ_x < θ*:
  lim inf_{T→∞} E[t_x(T)] / log(T) ≥ 1 / I(θ_x, θ*)

- For x ≠ x*, apply the generic technique with:
  λ = (θ_1, ..., θ_{x−1}, θ* + ε, θ_{x+1}, ..., θ_K)

25 / 40
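
As a hedged sketch: using R(T) = Σ_x (θ* − θ_x) E[t_x(T)], the theorem yields an asymptotic regret lower bound c(θ) log(T). The code below computes c(θ) for Bernoulli arms; the helper names and the instance are our own choices.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def lai_robbins_constant(theta):
    """c(theta) such that R(T) >= (c(theta) + o(1)) * log(T) for uniformly good algorithms."""
    best = max(theta)
    return sum((best - t) / kl_bernoulli(t, best) for t in theta if t < best)

print(lai_robbins_constant([0.9, 0.8, 0.5]))  # ≈ 3.0 for this illustrative instance
```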

The Lai-Robbins bound

[Figure: arm means θ_x plotted against x, indicating the "most confusing parameter" λ.]

26 / 40

Optimistic algorithms

- Select the arm with highest index: x_n ∈ arg max_{x∈X} b_x(n)
- UCB algorithm (Hoeffding's inequality):
  b_x(n) = θ̂_x(n) + sqrt(2 log(n) / t_x(n))
  (empirical mean + exploration bonus)
- KL-UCB algorithm (using Garivier's inequality):
  b_x(n) = max{ q ≤ 1 : t_x(n) I(θ̂_x(n), q) ≤ f(n) }
  (likelihood ratio bounded by the log of the inverse confidence level)
  with f(n) = log(n) + 3 log(log(n)).

27 / 40
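
A minimal sketch of the two indexes for Bernoulli rewards; the bisection routine, its tolerance, and the example values (20 plays, empirical mean 0.4, round n = 100) are our own choices.

```python
import numpy as np

def ucb_index(mean_hat, t_x, n):
    """UCB index: empirical mean plus Hoeffding exploration bonus."""
    return mean_hat + np.sqrt(2 * np.log(n) / t_x)

def kl_bernoulli(p, q, eps=1e-12):
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def klucb_index(mean_hat, t_x, n, tol=1e-6):
    """KL-UCB index: largest q <= 1 with t_x * I(mean_hat, q) <= f(n), found by bisection."""
    f_n = np.log(n) + 3 * np.log(max(np.log(n), 1.0))
    lo, hi = mean_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_x * kl_bernoulli(mean_hat, mid) <= f_n:
            lo = mid
        else:
            hi = mid
    return lo

print(ucb_index(0.4, 20, 100), klucb_index(0.4, 20, 100))
```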

Regret of optimistic algorithms

Theorem (Auer '02)
Under algorithm UCB, for all x s.t. θ_x < θ*:
  E[t_x(T)] ≤ 8 log(T)/(θ_x − θ*)² + π²/6.

Theorem (Garivier '11)
Under algorithm KL-UCB, for all x s.t. θ_x < θ* and for all δ < θ* − θ_x:
  E[t_x(T)] ≤ log(T)/I(θ_x + δ, θ*) + C log(log(T)) + δ^{−2}.

28 / 40

Regret of KL-UCB: sketch of proof

Decompose: E[t_x(T)] ≤ E[|A|] + E[|B|] + E[|C|], with
  A = {n ≤ T : b_{x*}(n) ≤ θ*},
  B = {n ≤ T : n ∉ A, x_n = x, |θ̂_x(n) − θ_x| ≥ δ},
  C = {n ≤ T : n ∉ A, x_n = x, |θ̂_x(n) − θ_x| ≤ δ, t_x(n) ≤ f(T)/I(θ_x + δ, θ*)}.

Then:
  E[|A|] ≤ C log(log(T))            (index property)
  E[|B|] ≤ δ^{−2}                   (Hoeffding + union bound)
  |C| ≤ f(T)/I(θ_x + δ, θ*)         (counting)

29 / 40

Randomized algorithms: Thompson Sampling

- Prior distribution θ ∼ P
- At time n, select x with probability P[θ_x = max_{x'} θ_{x'} | x_0, r_0, ..., x_n, r_n]
- Bernoulli rewards with uniform priors:
  Z_x(n) ∼ Beta(t_x(n)θ̂_x(n) + 1, t_x(n)(1 − θ̂_x(n)) + 1)
  x_{n+1} ∈ arg max_x Z_x(n)

Theorem (Kaufmann '12, Agrawal '12)
Thompson Sampling is asymptotically optimal. If θ_x < θ* then:
  lim sup_{T→∞} E[t_x(T)] / log(T) ≤ 1 / I(θ_x, θ*).

30 / 40
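
A sketch of Thompson Sampling for Bernoulli arms with independent uniform priors, matching the Beta posterior on the slide; the instance and horizon in the example are illustrative.

```python
import numpy as np

def thompson_bernoulli(theta, T, seed=0):
    """Thompson Sampling for Bernoulli bandits with independent Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    K = len(theta)
    successes = np.zeros(K)   # t_x(n) * theta_hat_x(n)
    failures = np.zeros(K)    # t_x(n) * (1 - theta_hat_x(n))
    regret = 0.0
    for n in range(T):
        z = rng.beta(successes + 1, failures + 1)  # posterior samples Z_x(n)
        x = int(np.argmax(z))                      # play the arm with the largest sample
        r = float(rng.random() < theta[x])
        successes[x] += r
        failures[x] += 1 - r
        regret += max(theta) - theta[x]
    return regret

print(thompson_bernoulli([0.3, 0.5, 0.7], T=5000))  # regret grows roughly like log(T)
```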

Illustration of algorithms (one sample path)

[Figure: two panels at n = 10², showing the empirical means µ̂_1(n), µ̂_2(n), µ̂_3(n) and their indexes; UCB panel with t(n) = (20, 29, 54), KL-UCB panel with t(n) = (2, 5, 96).]

[Figure: "Regret of various algorithms": regret R(T) versus time T (up to T = 100) for UCB, KL-UCB and Thompson Sampling.]

31 / 40

The EXP3 algorithm

- Adversarial setting: arm selection must be randomized
- At time n, select x_n with distribution p(n) = (p_1(n), ..., p_K(n))
- Reward estimate for x (unbiased):
  R_x(n) = Σ_{n'=1}^n r̃_x(n') = Σ_{n'=1}^n r_{n'} 1{x_{n'} = x} / p_x(n')
- Action distribution: p_x(n) ∝ exp(η R_x(n)), with η > 0 fixed
- Favor actions with good historical rewards + explore a bit: p(n) is a soft approximation of the max function
- For small η, EXP3 is the replicator dynamics (!)

32 / 40
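
A sketch of EXP3 using the learning rate η = sqrt(2 log(K)/(KT)) from the regret bound below; the reward_fn interface and the Bernoulli "adversary" in the example are our own illustrative choices.

```python
import numpy as np

def exp3(reward_fn, K, T, seed=0):
    """EXP3: exponential weights over importance-weighted reward estimates.

    reward_fn(n, x) -> reward in [0, 1] of arm x at round n (may be fixed in
    advance by an oblivious adversary).
    """
    rng = np.random.default_rng(seed)
    eta = np.sqrt(2 * np.log(K) / (K * T))   # learning rate from the regret bound
    R = np.zeros(K)                          # cumulative reward estimates R_x(n)
    total = 0.0
    for n in range(T):
        w = np.exp(eta * (R - R.max()))      # subtract max for numerical stability
        p = w / w.sum()                      # p_x(n) proportional to exp(eta * R_x(n))
        x = rng.choice(K, p=p)
        r = reward_fn(n, x)
        R[x] += r / p[x]                     # unbiased importance-weighted estimate
        total += r
    return total

# Example: a stochastic "adversary" with Bernoulli rewards of means 0.4 and 0.6
rng = np.random.default_rng(1)
print(exp3(lambda n, x: float(rng.random() < [0.4, 0.6][x]), K=2, T=10000))
```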

Regret of EXP3

Theorem
Under EXP3 with η = sqrt(2 log(K)/(KT)), the regret is upper bounded by:
  R^π(T) ≤ sqrt(2TK log(K))

- Larger exponent in T, but smaller in K
- Suggests two regimes for (K, T): stochastic regime vs. adversarial regime
- Matching lower bound: consider a stochastic adversary with close arms

33 / 40

Bibliography

Discrete bandits
- Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, 1933
- Robbins, Some aspects of the sequential design of experiments, 1952
- Lai and Robbins, Asymptotically efficient adaptive allocation rules, 1985
- Lai, Adaptive treatment allocation and the multi-armed bandit problem, 1987
- Gittins, Bandit Processes and Dynamic Allocation Indices, 1989
- Auer, Cesa-Bianchi and Fischer, Finite time analysis of the multiarmed bandit problem, 2002
- Garivier and Moulines, On upper-confidence bound policies for non-stationary bandit problems, 2008
- Slivkins and Upfal, Adapting to a changing environment: the Brownian restless bandits, 2008

34 / 40

Bibliography

- Garivier and Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, 2011
- Honda and Takemura, An Asymptotically Optimal Bandit Algorithm for Bounded Support Models, 2010

Discrete bandits with correlated arms
- Anantharam, Varaiya, and Walrand, Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, 1987
- Graves and Lai, Asymptotically efficient adaptive choice of control laws in controlled Markov chains, 1997
- György, Linder, Lugosi and Ottucsák, The on-line shortest path problem under partial monitoring, 2007
- Yu and Mannor, Unimodal bandits, 2011
- Cesa-Bianchi and Lugosi, Combinatorial bandits, 2012

35 / 40

Bibliography

- Gai, Krishnamachari, and Jain, Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations, 2012
- Chen, Wang and Yuan, Combinatorial multi-armed bandit: General framework and applications, 2013
- Combes and Proutiere, Unimodal bandits: Regret lower bounds and optimal algorithms, 2014
- Magureanu, Combes, and Proutiere, Lipschitz bandits: Regret lower bounds and optimal algorithms, 2014

Thompson Sampling
- Chapelle and Li, An Empirical Evaluation of Thompson Sampling, 2011
- Korda, Kaufmann and Munos, Thompson Sampling: an asymptotically optimal finite-time analysis, 2012
- Korda, Kaufmann and Munos, Thompson Sampling for one-dimensional exponential family bandits, 2013

36 / 40

Bibliography

- Agrawal and Goyal, Further optimal regret bounds for Thompson Sampling, 2013
- Agrawal and Goyal, Thompson Sampling for contextual bandits with linear payoffs, 2013

Discrete adversarial bandits
- Auer, Cesa-Bianchi, Freund and Schapire, The non-stochastic multi-armed bandit, 2002

Continuous bandits (Lipschitz)
- R. Agrawal, The continuum-armed bandit problem, 1995
- Auer, Ortner, and Szepesvári, Improved rates for the stochastic continuum-armed bandit problem, 2007
- Bubeck, Munos, Stoltz, and Szepesvári, Online optimization in X-armed bandits, 2008

37 / 40

Bibliography

- Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, 2004
- Kleinberg, Slivkins, and Upfal, Multi-armed bandits in metric spaces, 2008
- Bubeck, Stoltz and Yu, Lipschitz bandits without the Lipschitz constant, 2011

Continuous bandits (strongly convex)
- Cope, Regret and convergence bounds for a class of continuum-armed bandit problems, 2009
- Flaxman, Kalai, and McMahan, Online convex optimization in the bandit setting: gradient descent without a gradient, 2005
- Shamir, On the complexity of bandit and derivative-free stochastic convex optimization, 2013
- Agarwal, Foster, Hsu, Kakade, and Rakhlin, Stochastic convex optimization with bandit feedback, 2013

38 / 40

Bibliography

Continuous bandits (linear)
- Dani, Hayes, and Kakade, Stochastic linear optimization under bandit feedback, 2008
- Rusmevichientong and Tsitsiklis, Linearly Parameterized Bandits, 2010
- Abbasi-Yadkori, Pal and Szepesvári, Improved Algorithms for Linear Stochastic Bandits, 2011

Best arm identification
- Mannor and Tsitsiklis, The sample complexity of exploration in the multi-armed bandit problem, 2004
- Even-Dar, Mannor and Mansour, Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems, 2006
- Audibert, Bubeck, and Munos, Best arm identification in multi-armed bandits, 2010

39 / 40

Bibliography

- Kalyanakrishnan, Tewari, Auer, and Stone, PAC subset selection in stochastic multi-armed bandits, 2012
- Kaufmann and Kalyanakrishnan, Information complexity in bandit subset selection, 2013
- Kaufmann, Garivier and Cappé, On the Complexity of A/B Testing, 2014

Infinite bandits
- Berry, Chen, Zame, Heath, and Shepp, Bandit problems with infinitely many arms, 1997
- Wang, Audibert, and Munos, Algorithms for infinitely many-armed bandits, 2008
- Bonald and Proutiere, Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards, 2013

40 / 40