
Bandit Optimization: Theory and Applications
Part 2
R. Combes, A. Proutiere

Part 2. Structured Bandits

Discrete Structured Bandits
1. Regret lower bounds
2. Examples
3. Efficient algorithms for some structures: unimodal, Lipschitz

Infinite Bandits
1. Regret lower bounds
2. Optimal algorithms

Continuous Structured Bandits
1. Regret lower bounds
2. Unimodal bandits
3. Lipschitz bandits

Conclusion and Open Problems

2-A. Discrete Structured Bandits

Discrete Structured Bandits

- K arms
- Reward distributions parametrized by $\theta = (\theta_1, \ldots, \theta_K)$
- Average reward of arm k: $\mu_k = \mu_k(\theta)$
- Most often, reward distributions are taken from a single-parameter exponential family (e.g. Bernoulli, $\theta_k = \mu_k$)
- K can be very large, yielding a prohibitive regret if arms are independent, i.e., $\Theta(K \log(T))$
- Structure matters and has to be exploited!
- Notation: $\mu^\star(\theta) = \max_k \mu_k(\theta) = \mu_{k^\star}(\theta)$

Discrete Structured Bandits

- Unstructured bandits: average rewards are not related: $\mu = (\mu_1, \ldots, \mu_K) \in \Theta$, with $\Theta = \prod_{i=1}^{K} [a_i, b_i]$

[Figure: the set $\Theta$ in the $(\mu_1, \mu_2)$ plane, a full rectangle]

- Structured bandits: the decision maker knows that average rewards are related, i.e., that $\mu \in \Theta$ with $\Theta \neq \prod_{i=1}^{K} [a_i, b_i]$

[Figure: the set $\Theta$ in the $(\mu_1, \mu_2)$ plane, no longer a full rectangle]

- The rewards observed for a given action provide side-information about the average rewards of other actions
- How can we exploit this side-information optimally?

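Example 1: Graphical Unimodality

- The arms are the vertices of a graph $G = (V, E)$
- $\mu = (\mu_i)_{i \in V} \in U_G$
- Graphical unimodality: from any vertex, there is a path with increasing rewards to the best vertex $k^\star$.

[Figure: graph $G = (V, E)$ whose vertices are the arms; an edge between $i$ and $j$ is oriented towards $j$ when $\mu_j > \mu_i$; the best vertex is $k^\star$]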
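The path condition can be checked locally. The following is a minimal sketch, assuming a unique best arm and a graph stored as an adjacency dictionary (the names `is_graphically_unimodal` and `line_graph` are illustrative and do not come from the slides). On a finite graph, every vertex reaches the best one along increasing rewards exactly when every other vertex has a strictly better neighbor, because following strictly better neighbors must stop at the unique maximum.

```python
from typing import Dict, Hashable, List


def is_graphically_unimodal(neighbors: Dict[Hashable, List[Hashable]],
                            mu: Dict[Hashable, float]) -> bool:
    """Return True if mu is graphically unimodal on the graph `neighbors`.

    Assumption of this sketch: the best arm must be unique.
    """
    best_value = max(mu.values())
    best = [v for v in mu if mu[v] == best_value]
    if len(best) != 1:
        return False
    k_star = best[0]
    # Every non-optimal vertex needs at least one strictly better neighbor.
    for v in mu:
        if v != k_star and not any(mu[u] > mu[v] for u in neighbors[v]):
            return False
    return True


def line_graph(K: int) -> Dict[int, List[int]]:
    """Adjacency of the line graph 0 - 1 - ... - (K-1), i.e. classical unimodality."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < K] for i in range(K)}


# Hypothetical average rewards, for illustration only.
mu_ok = dict(enumerate([0.1, 0.3, 0.6, 0.8, 0.5, 0.2]))
mu_bad = dict(enumerate([0.1, 0.6, 0.3, 0.8, 0.5, 0.2]))  # local maximum at vertex 1
```

With the line graph, `mu_ok` passes the check while `mu_bad` fails because vertex 1 is a local maximum that is not the best arm.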

Example 1: Unimodality

[Figure: average reward $\mu_i$ versus arm index $i$, for arms 1 to 8]

Classical unimodality, graph = line

Example 2: Lipschitz

[Figure: average reward $\mu_i$ versus arm position $x_1, \ldots, x_8$, with a cone of slopes $+L$ and $-L$]

Let $x_1 < x_2 < \ldots < x_K$ denote the positions of the arms. We assume that: $|\mu_k - \mu_{k'}| \le L \times |x_k - x_{k'}|$.
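The Lipschitz assumption can be verified directly on a candidate reward vector. The sketch below checks all pairs of arms and also returns the smallest constant for which the assumption holds; the function names and the numerical values are illustrative assumptions, not part of the slides.

```python
from itertools import combinations
from typing import Sequence


def is_lipschitz(positions: Sequence[float], mu: Sequence[float],
                 L: float, tol: float = 1e-12) -> bool:
    """Check |mu_k - mu_k'| <= L * |x_k - x_k'| for every pair of arms."""
    return all(
        abs(mu[i] - mu[j]) <= L * abs(positions[i] - positions[j]) + tol
        for i, j in combinations(range(len(mu)), 2)
    )


def smallest_lipschitz_constant(positions: Sequence[float],
                                mu: Sequence[float]) -> float:
    """Largest slope between two arms; positions are assumed distinct."""
    return max(
        abs(mu[i] - mu[j]) / abs(positions[i] - positions[j])
        for i, j in combinations(range(len(mu)), 2)
    )


# Hypothetical arm positions and average rewards.
positions = [0.0, 0.25, 0.5, 0.75, 1.0]
mu = [0.2, 0.4, 0.5, 0.4, 0.3]
```

Here the steepest pair is the first two arms, so the assumption holds for L = 1 but not for L = 0.5.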

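A Markov Chain Control Perspective (Graves-Lai 1997)

[Figure: transition from state $x$ to state $y$ with probability $p(x, y; u, \theta)$; reward $r(x, u)$]

- Finite state space $X$ and action space $U$
- Unknown parameter $\theta \in \Theta$, with $\Theta$ a compact metric space
- Control: finite set of irreducible control laws $g : X \to U$, with average reward

$$\mu_g(\theta) = \sum_{x \in X} \pi_\theta^g(x)\, r(x, g(x))$$

- Optimal control law: $g^\star$
- Regret:

$$R^\pi(T) = T \mu_{g^\star}(\theta) - E\left[\sum_{t=1}^{T} r(X_t, g^\pi(X_t))\right]$$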

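Regret lower bound

- KL number under policy $g$:

$$I^g(\theta, \lambda) = \sum_{x, y} \pi_\theta^g(x)\, p(x, y; g(x), \theta) \log \frac{p(x, y; g(x), \theta)}{p(x, y; g(x), \lambda)}$$

- Bad parameter set:

$$B(\theta) = \{\lambda \in \Theta : g^\star \text{ not opt.},\ I^{g^\star}(\theta, \lambda) = 0\}$$

- Lower bound: $\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta)$, where

$$c(\theta) = \inf \sum_{g \neq g^\star} c_g \left(\mu_{g^\star}(\theta) - \mu_g(\theta)\right) \quad \text{s.t.} \quad \inf_{\lambda \in B(\theta)} \sum_{g \neq g^\star} c_g I^g(\theta, \lambda) \ge 1$$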

Application to Structured Bandits

- State space: set of possible rewards
- Control laws: constant mappings to the set of arms, e.g. $g = k$
- Transitions (i.i.d. process): $p(x, y; k, \theta) = \theta_k$ if $y = 1$, and $1 - \theta_k$ if $y = 0$
- KL number: $I^k(\theta, \lambda) = KL(\theta_k, \lambda_k)$
- Average rewards: for $g = k$, $\mu_g(\theta) = \theta_k = \mu_k$

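Regret Lower Bound

- Lower bound: $\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta)$, where

$$c(\theta) = \inf_{c_k \ge 0,\ \forall k} \sum_{k \neq k^\star} c_k (\mu_{k^\star} - \mu_k) \quad \text{s.t.} \quad \inf_{\lambda \in B(\theta)} \sum_{k \neq k^\star} c_k I^k(\theta, \lambda) \ge 1$$

$$B(\theta) = \{\lambda \in \Theta : I^{k^\star}(\theta, \lambda) = 0,\ \mu^\star(\lambda) > \mu_{k^\star}(\lambda)\}$$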

- Identifying the worst $\lambda$ can be challenging
- Examples where it is explicit: unimodal, Lipschitz. In this case, the regret lower bound solves an LP
- Interpretation: when optimal, an algorithm plays sub-optimal arm $k$ about $c_k \log(T)$ times

Asymptotically Optimal Algorithm

- Graves-Lai's algorithm
- Uses the doubling trick
- Needs to solve the regret lower bound problem repeatedly
- Too complex, and inefficient for reasonable time horizons

2-A.1. Discrete Unimodal Bandits

Combes, Proutiere. Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms, ICML 2014
Combes et al. Optimal Rate Sampling in 802.11 Systems, IEEE Infocom 2014

Regret Lower Bound

[Figure: an arm $k$ and its neighborhood $N(k)$ in the graph]

Theorem: For any uniformly good algorithm $\pi$,

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c_G(\theta), \qquad c_G(\theta) = \sum_{k \in N(k^\star)} \frac{\mu^\star - \mu_k(\theta)}{KL(\theta_k, \theta_{k^\star})}$$

The performance limit does not depend on the size of the decision space! Structure could really help.
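To make the theorem concrete, here is a small sketch that evaluates $c_G(\theta)$ for Bernoulli rewards on a line graph and compares it with the classical Lai-Robbins constant for unstructured Bernoulli bandits, which sums the same ratio over all sub-optimal arms. The reward values and helper names are hypothetical, and $N(k^\star)$ is taken as the neighbors of $k^\star$ excluding $k^\star$ itself.

```python
import math
from typing import Dict, List, Sequence


def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def line_neighbors(K: int) -> Dict[int, List[int]]:
    """Neighbors on the line graph (classical unimodality)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < K] for i in range(K)}


def unimodal_constant(theta: Sequence[float],
                      neighbors: Dict[int, List[int]]) -> float:
    """c_G(theta): sum over the neighbors of the best arm only."""
    k_star = max(range(len(theta)), key=lambda k: theta[k])
    return sum((theta[k_star] - theta[k]) / kl_bernoulli(theta[k], theta[k_star])
               for k in neighbors[k_star])


def unstructured_constant(theta: Sequence[float]) -> float:
    """Lai-Robbins constant: sum over every sub-optimal arm."""
    k_star = max(range(len(theta)), key=lambda k: theta[k])
    return sum((theta[k_star] - theta[k]) / kl_bernoulli(theta[k], theta[k_star])
               for k in range(len(theta)) if k != k_star)


theta_small = [0.1, 0.3, 0.6, 0.8, 0.5, 0.2]
theta_large = [0.05, 0.1, 0.3, 0.6, 0.8, 0.5, 0.2, 0.1]  # same peak, more arms
```

Adding arms far from the peak leaves `unimodal_constant` unchanged, while `unstructured_constant` keeps growing with the number of arms.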

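Proof

$$\inf \sum_{g \neq g^\star} c_g \left(\mu_{g^\star}(\theta) - \mu_g(\theta)\right) \quad \text{s.t.} \quad \inf_{\lambda \in B(\theta)} \sum_{g \neq g^\star} c_g I^g(\theta, \lambda) \ge 1$$

Example: classical unimodality

[Figure: average rewards $\mu_i$ versus actions 1 to 8, showing $\mu$ and the most confusing $\lambda$]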

Optimal Action Sampling

- Empirical average reward: $\hat\mu_k(n) = \frac{1}{t_k(n)} \sum_{s=1}^{t_k(n)} X_k(s)$
- Leader at time $n$: $L(n) \in \arg\max_k \hat\mu_k(n)$
- Number of times $k$ has been the leader: $l_k(n) = \sum_{s=1}^{n} 1_{L(s) = k}$
- Index of $k$:

$$b_k(n) = \max\left\{q \in [0, 1] : t_k(n)\, KL(\hat\mu_k(n), q) \le \log(l_{L(n)}(n)) + c \log\log(l_{L(n)}(n))\right\}$$

Optimal Action Sampling

Algorithm: Optimal Action Sampling (OAS)
For $n = 1, \ldots, K$, select action $k(n) = n$
For $n \ge K + 1$, select action $k(n)$:

$$k(n) = \begin{cases} L(n) & \text{if } (l_{L(n)}(n) - 1)/(\gamma + 1) \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise.} \end{cases}$$

Theorem: For any $\mu \in U_G$,

$$\limsup_{T \to \infty} \frac{R^{OAS}(T)}{\log(T)} \le c_G(\theta).$$
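The OAS rule can be simulated in a few lines. The following is a rough sketch under several assumptions that the slides do not fix: rewards are Bernoulli, $\gamma$ is taken as the maximum degree of the graph, $c = 3$, the leader competes with its neighbors in the arg max, leader counts start after the initialization rounds, and a negative exploration budget is truncated to zero. All function names are illustrative.

```python
import math
import random
from typing import Dict, List, Sequence


def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def index(mean: float, pulls: int, budget: float, tol: float = 1e-6) -> float:
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= budget, by bisection."""
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def run_oas(theta: Sequence[float], neighbors: Dict[int, List[int]],
            horizon: int, gamma: int, c: float = 3.0, seed: int = 0) -> List[int]:
    """Simulate the OAS rule on Bernoulli arms; returns the number of pulls per arm."""
    rng = random.Random(seed)
    K = len(theta)
    pulls, sums, leader_count = [0] * K, [0.0] * K, [0] * K
    for n in range(1, horizon + 1):
        if n <= K:
            k = n - 1                                  # play each arm once
        else:
            means = [sums[i] / pulls[i] for i in range(K)]
            leader = max(range(K), key=means.__getitem__)
            leader_count[leader] += 1
            l = leader_count[leader]
            if (l - 1) % (gamma + 1) == 0:
                k = leader                             # forced play of the leader
            else:
                # Here l >= 2, so log(log(l)) is well defined (possibly negative).
                budget = max(0.0, math.log(l) + c * math.log(math.log(l)))
                candidates = [leader] + list(neighbors[leader])  # assumption: leader included
                k = max(candidates, key=lambda i: index(means[i], pulls[i], budget))
        reward = 1.0 if rng.random() < theta[k] else 0.0
        pulls[k] += 1
        sums[k] += reward
    return pulls


def line_neighbors(K: int) -> Dict[int, List[int]]:
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < K] for i in range(K)}
```

For example, `run_oas([0.1, 0.2, 0.35, 0.5, 0.7, 0.9, 0.6, 0.3], line_neighbors(8), 3000, gamma=2)` only ever compares the leader with its two neighbors, which is why the per-round cost does not grow with the number of arms.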

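Proof

$$R^{OAS}(T) \le \sum_{k \neq k^\star} E[l_k(T)] + \sum_{k \in N(k^\star)} (\mu^\star - \mu_k(\theta))\, E\left[\sum_{t=1}^{T} 1_{L(t) = k^\star,\ k(t) = k}\right]$$

First term $\le O(\log\log(T))$
Second term $\le (1 + \epsilon)\, c(\theta) \log(T) + O(\log\log(T))$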

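Proof ingredients

1. Decomposition of the set of events
2. Deviation bounds (refined concentration inequalities), e.g.

Lemma. Let $\{Z_t\}_{t \in \mathbb{Z}}$ be independent random variables in $[0, B]$, $\mathcal{F}_n = \sigma(\{Z_t\}_{t \le n})$ and $\mathcal{F} = (\mathcal{F}_n)_{n \in \mathbb{Z}}$. Let $s \in \mathbb{N}$, $n_0 \in \mathbb{Z}$ and $T \ge n_0$. Define $S_n = \sum_{t = n_0}^{n} B_t (Z_t - E[Z_t])$, where $B_t \in \{0, 1\}$ is previsible, and $t_n = \sum_{t = n_0}^{n} B_t$. Let $\phi \in \{n_0, \ldots, T + 1\}$ be an $\mathcal{F}$-stopping time with: either $t_\phi \ge s$ or $\phi = T + 1$. Then:

$$P[S_\phi \ge t_\phi \delta,\ \phi \le T] \le \exp\left(-\frac{2 s \delta^2}{B^2}\right).$$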

Non-stationary environments

- Average rewards may evolve over time: $\theta(t)$
- Best decision at time $t$: $k^\star(t)$
- Goal: track the best decision
- Regret: $R^\pi(T) = \sum_{t=1}^{T} \left(\mu_{k^\star(t)}(t) - \mu_{k^\pi(t)}(t)\right)$
- Sub-linear regret cannot be achieved (Garivier-Moulines 2011)
- Assumptions: $\theta(t)$ is $\sigma$-Lipschitz (w.r.t. time), and separation:

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{n=1}^{T} \sum_{k,\, k' \in N(k)} 1_{|\theta_k(n) - \theta_{k'}(n)| < \Delta} \le \Phi(K)\, \Delta$$

OAS with Sliding Window

- SW-OAS (applies OAS over a sliding window of size $\tau$)
- Graphical unimodality holds at any time
- Parameters: $\tau = \sigma^{-3/4} \log(1/\sigma)/8$, $\Delta = \sigma^{1/4} \log(1/\sigma)$

Theorem: Under $\pi$ = SW-OAS,

$$\limsup_{T} \frac{R^\pi(T)}{T} \le C\, \Phi(K)\, \sigma^{1/4} \log(1/\sigma)\,(1 + K o(1)), \quad \sigma \to 0^+$$

OAS with Sliding Window

- Analysis made complicated by the smoothness of the rewards vs. time (previous analysis by Garivier-Moulines assumes separation of rewards at any time)
- Upper bound on regret per time unit:
  - Tends to zero when the evolution of average rewards gets smoother: $\sigma^{1/4} \log(1/\sigma) \to 0$ as $\sigma \to 0^+$
  - Does not depend on the size of the decision space if $\Phi(K) \le C$

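Application: Rate adaptation in 802.11

Adapting the modulation/coding scheme to the radio environment

- 802.11 a/b/g

[Figure: a transmission with binary Yes/No feedback]

Rates: $r_1, r_2, \ldots, r_N$
Success probabilities: $\theta_1, \theta_2, \ldots, \theta_N$
Throughputs: $\mu_1, \mu_2, \ldots, \mu_N$, with $\mu_i = r_i \theta_i$

- Structure: unimodality + $\theta_1 > \theta_2 > \ldots > \theta_N$
- Graph $G$: a line over the rates 6, 9, 12, 18, 24, 36, 48, 54 (Mbit/s)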
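As a small worked example of the mapping $\mu_i = r_i \theta_i$, the sketch below uses the 802.11g rates listed above together with made-up success probabilities (these numbers are assumptions chosen for illustration, not measurements) and checks that the resulting throughputs are unimodal along the line graph.

```python
from typing import List, Sequence

RATES_MBIT = [6, 9, 12, 18, 24, 36, 48, 54]

# Hypothetical success probabilities, decreasing with the rate.
SUCCESS_PROB = [0.99, 0.98, 0.96, 0.93, 0.85, 0.60, 0.20, 0.05]


def throughputs(rates: Sequence[float], success: Sequence[float]) -> List[float]:
    """Average throughput of each rate: mu_i = r_i * theta_i."""
    return [r * p for r, p in zip(rates, success)]


def is_unimodal(values: Sequence[float]) -> bool:
    """Strictly increasing up to the maximum, strictly decreasing afterwards."""
    peak = max(range(len(values)), key=lambda i: values[i])
    rising = all(values[i] < values[i + 1] for i in range(peak))
    falling = all(values[i] > values[i + 1] for i in range(peak, len(values) - 1))
    return rising and falling


mu = throughputs(RATES_MBIT, SUCCESS_PROB)
best_rate = RATES_MBIT[max(range(len(mu)), key=lambda i: mu[i])]
```

With these numbers the best rate is neither the most reliable nor the fastest one, which is the situation a rate adaptation algorithm has to detect.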

Rate adaptation in 802.11

- 802.11 n/ac MIMO: rate + MIMO mode (32 combinations in n)
- Example: two modes, single-stream (SS) or double-stream (DS)

[Figure: graph $G$ linking the DS rates 27, 54, 81, 108, 162, 216, 243, 270 and the SS rates 13.5, 27, 40.5, 54, 81, 108, 121.5, 135]

State-of-the-art

- ARF (Auto Rate Fallback): after n successive successes, probe a higher rate; after two consecutive failures, reduce the rate
- AARF: vary n dynamically depending on the speed at which the radio environment evolves
- SampleRate: based on achieved throughputs over a sliding window, explore a new rate every 10 packets
- Measurement-based approaches: map SNR to packet error rate (does not work with OFDM): RBAR, OAR, CHARM, …
- 802.11n MIMO: MiRA, RAMAS, …

All existing algorithms are heuristics.
Rate adaptation design: a graphically unimodal bandit with a large strategy set

Optimal Rate Sampling

Algorithm: Optimal Rate Sampling (ORS)
For $n = 1, \ldots, K$, select action $k(n) = n$
For $n \ge K + 1$, select action $k(n)$:

$$k(n) = \begin{cases} L(n) & \text{if } (l_{L(n)}(n) - 1)/(\gamma + 1) \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise.} \end{cases}$$

ORS is asymptotically optimal (minimizes regret).
Its performance does not depend on the number of possible rates!
For non-stationary environments: SW-ORS (ORS with sliding window)

802.11g, stationary environment

GRADUAL (success prob. smoothly decreases with rate)

[Figure: regret versus time (s), from 0 to 100 s, for SampleRate, SW-G-ORS and G-ORS]

802.11g, stationary environment

STEEP (success prob. is either close to 1 or to 0)

[Figure: regret versus time (s), from 0 to 100 s, for SampleRate, SW-G-ORS and G-ORS]

802.11g, non-stationary environment

TRACES

[Figure: instantaneous throughput (Mbps) versus time (s), from 0 to 300 s, for the rates 54, 48, 36, 24, 18, 12, 9 and 6 Mbps]

802.11g, non-stationary environment

RESULTS

[Figure: instantaneous throughput (Mbps) versus time (s), from 0 to 300 s, for Oracle, SW-G-ORS and SampleRate]

2-A.2. Discrete Lipschitz Bandits

Combes, Magureanu, Proutiere. Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms, COLT 2014

Discrete Lipschitz Bandits

[Figure: average reward $\mu_i$ versus arm position $x_1, \ldots, x_8$, with a cone of slopes $+L$ and $-L$]

Let $x_1 < x_2 < \ldots < x_K$ denote the positions of the arms. We assume that: $|\mu_k - \mu_{k'}| \le L \times |x_k - x_{k'}|$.

Related work

- Continuous set of actions (e.g. [0,1]): Agrawal 1995, Kleinberg 2004, Kleinberg-Slivkins-Upfal 2008, Bubeck-Munos-Stoltz-Szepesvári 2008, …

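Regret lower bound

Theorem: For any uniformly good algorithm $\pi$,

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge C(\theta)$$

where $C(\theta)$ is the minimal value of:

$$\min_{c_k \ge 0,\ \forall k \in \mathcal{K}} \sum_{k \in \mathcal{K}} c_k \times (\theta^\star - \theta_k) \quad \text{s.t.} \quad \forall k \in \mathcal{K}^-,\ \sum_{i \in \mathcal{K}} c_i I(\theta_i, \lambda_i^k) \ge 1.$$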

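Regret lower bound

$$\min_{c_k \ge 0,\ \forall k \in \mathcal{K}} \sum_{k \in \mathcal{K}} c_k \times (\theta^\star - \theta_k) \quad \text{s.t.} \quad \forall k \in \mathcal{K}^-,\ \sum_{i \in \mathcal{K}} c_i I(\theta_i, \lambda_i^k) \ge 1.$$

[Figure: illustration of the optimization problem]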

Algorithms

$$b_k(n) = \sup\left\{q \in [\hat\theta_k(n), 1] : \sum_{k'=1}^{K} t_{k'}(n)\, I^+(\hat\theta_{k'}(n), \lambda_{k'}^{q,k}) \le \log(n) + 3 \log\log(n)\right\}.$$

[Figure: illustration of the value $q$ and of the parameter $\lambda^{q,k}$]

The OSLB algorithm

- Apparently optimal arm sampling rate $c_k(n)$: solution of the regret lower bound problem, replacing $\theta$ by $\hat\theta(n)$
- Set of arms apparently under-sampled: $K_e(n) = \{k \in K^-(n) : t_k(n) \le c_k(n) \log(n)\}$
- $\overline{k}(n) = \arg\min_{k \in K_e(n)} t_k(n)$
- $\underline{k}(n) = \arg\min_{k} t_k(n)$

Algorithm: OSLB
Select the leader if $\hat\theta_{L(n)}(n) \ge \max_{k \neq L(n)} b_k(n)$
Else if $t_{\underline{k}(n)}(n) < \frac{\epsilon}{K}\, t_{\overline{k}(n)}(n)$, select $\underline{k}(n)$
Else select $\overline{k}(n)$

A Simplified Algorithm

Algorithm: CKL-UCB
Select the leader if it has the highest index
Else select the least explored arm with an index higher than the leader

Regret under OSLB and CKL-UCB

Theorem: For any $\theta \in \Theta_L$, under $\pi = \mathrm{OSLB}(\epsilon)$, we have: for all $\delta > 0$ and all $T$,

$$R^\pi(T) \le C^\delta(\theta)(1 + \epsilon) \log(T) + C_1 \log\log(T) + K^3 \epsilon^{-1} \delta^{-2} + 3 K \delta^{-2}$$

where $C^\delta(\theta) \to C(\theta)$ as $\delta \to 0^+$.

Theorem: For any $\theta \in \Theta_L$, under $\pi$ = CKL-UCB, we have:

$$\limsup_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \le C'(\theta),$$

where $C'(\theta)$ is the minimal value of an optimization problem "close" to that providing the regret lower bound.

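Proof ingredients

A concentration inequality for the sum of KL divergences:

$$P\left[\sum_{k=1}^{K} t_k(n)\, I^+(\hat\theta_k(n), \theta_k) \ge \delta\right] \le e^{-\delta} \left(\frac{\lceil \delta \log(n) \rceil\, \delta}{K}\right)^{K} e^{K+1}.$$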

Example

46 arms, T = 500,000

[Figure: normalized expected number of plays versus arm index (0 to 1), showing θ, CKL-UCB and KL-UCB]

Example

[Figure: regret versus time (0 to 5 x 10^5) for KL-UCB and CKL-UCB]

Summary: Discrete Structured Bandits

- Regret lower bounds by Graves-Lai 1997: work for any structure
  - When is the solution explicit?
  - How does it scale with the dimension of the decision space?
  - When explicit, it provides guidelines on the design of optimal algorithms, optimally exploiting the known structure
- Simple and efficient algorithms: unimodal and Lipschitz
  - Other structures? Linear, convex?
- Thompson Sampling
  - Is it always asymptotically optimal?
  - How to sample from the posterior?
- Complexity vs. performance?

2-B. Infinite Bandits

Bonald, Proutiere. Two-Target Algorithm for Infinite-Armed Bandits, NIPS 2013

Actions and rewards

- An infinite number of Bernoulli arms
- Decision in each round: take a new arm, or play arms previously selected
- Bayesian setting: the expected reward $\theta_k$ of the k-th selected arm follows a known distribution: $F(u) = P[\theta_k > u]$, with $F(u) \sim \alpha (1 - u)^\beta$ as $u \to 1$
- Regret: $R(T) = T - E\left[\sum_{t=1}^{T} X_t\right]$
- More like a stopping time problem …

Related work

- Mallows-Robbins 1964, Herschkorn-Peköz-Ross 1996: no-regret policies
- Berry-Chen-Zame-Heath-Shepp 1997: uniformly distributed parameter, policy with regret $2\sqrt{T}$, conjectured to be optimal

1-failure policy: keep the first arm with more than $\sqrt{T}$ successive 1's

[Figure: reward sequences 110, 10, 11110, 11111110101011100… observed on arms 1, 2, 3, 4]

1-failure policies are actually sub-optimal …

Related work

- Wang-Audibert-Munos 2008: more general parameter distribution, regret scaling as $T^{\beta/(\beta+1)}$ up to log factors. Policy: select X arms and run UCB … Not a stopping rule: the number of arms tested does not depend on the realizations of the rewards.

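Regret lower bound

Theorem: For any algorithm $\pi$ knowing the time horizon,

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{T^{\frac{\beta}{\beta+1}}} \ge \left(\frac{\beta + 1}{\alpha}\right)^{\frac{1}{\beta+1}}$$

Conjecture: When the time horizon is unknown,

$$\liminf_{T \to \infty} \frac{R^\pi(T)}{T^{\frac{\beta}{\beta+1}}} \ge \frac{\beta + 1}{\beta} \left(\frac{\beta}{\alpha}\right)^{\frac{1}{\beta+1}}$$

Example: parameter uniformly distributed: $\sqrt{2T}$ and $2\sqrt{T}$, respectively.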

Two-target algorithms

Exploration of arm k:

[Figure: runs of successes, each ended by a failure: Run 1 = 11111111110, Run 2 = 11110, Run 3 = 111111110, …, Run m+1 = 111111111110, with the lengths $L_1$ and $L_2$ marked]

If $L_1 < \ell_1$, explore a new arm
Else if $L_2 < \ell_2$, explore a new arm
Else keep it forever

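Two-target algorithms

Theorem: Select $\ell_1 = \left\lfloor \left(\frac{\alpha n}{\beta + 1}\right)^{\frac{1}{\beta+2}} \right\rfloor$ and $\ell_2 = \left\lfloor m \left(\frac{\alpha n}{\beta + 1}\right)^{\frac{1}{\beta+1}} \right\rfloor$. Then

$$\limsup_{T \to \infty} \frac{R^\pi(T)}{T^{\frac{\beta}{\beta+1}}} \le \left(\frac{\beta + 1}{\alpha}\right)^{\frac{1}{\beta+1}} \left(1 + O\left(\frac{1}{m}\right)\right).$$

Example: parameters for the uniform distribution: $\ell_1 \sim (n/2)^{1/3}$, $\ell_2 \sim m \sqrt{n/2}$.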
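A two-target policy can be simulated as follows. This is a sketch under stated assumptions rather than the authors' implementation: arm means are drawn uniformly on [0, 1] (so $\alpha = \beta = 1$), $L_1$ is read as the number of successes in the first run, $L_2$ as the total number of successes in the first $m$ runs, and $n$ is taken to be the known time horizon. The function names are illustrative.

```python
import math
import random
from typing import Tuple


def targets(horizon: int, m: int, alpha: float = 1.0, beta: float = 1.0) -> Tuple[int, int]:
    """Targets (l1, l2) from the theorem, with n taken as the time horizon."""
    base = alpha * horizon / (beta + 1)
    l1 = math.floor(base ** (1 / (beta + 2)))
    l2 = math.floor(m * base ** (1 / (beta + 1)))
    return l1, l2


def two_target_regret(horizon: int, m: int = 3, seed: int = 0) -> int:
    """Realized regret (number of failures) of a two-target policy, uniform prior."""
    rng = random.Random(seed)
    l1, l2 = targets(horizon, m)
    t, total_reward = 0, 0
    while t < horizon:
        theta = rng.random()                 # new arm: mean drawn uniformly on [0, 1]
        successes, failures, kept = 0, 0, False
        while t < horizon:
            reward = 1 if rng.random() < theta else 0
            t += 1
            total_reward += reward
            if kept:
                continue                     # arm kept forever: no more tests
            if reward:
                successes += 1
            else:
                failures += 1
                if failures == 1 and successes < l1:
                    break                    # first run shorter than the first target
                if failures == m:
                    if successes < l2:
                        break                # first m runs shorter than the second target
                    kept = True
    return horizon - total_reward
```

For a horizon of 10,000 and m = 3, `targets` returns (17, 212); the simulated regret can then be compared with the $\sqrt{2T} \approx 141$ scaling of the lower bound for the uniform case.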

Numerical Example

- Beta(1,2) mean reward distribution
- Expected failure rate = mean regret per round

Summary: Infinite Bandits

- Regret lower bound and optimal algorithms when the support of the reward distribution includes 1 and the time horizon is known
- What about an unknown time horizon?
- What if the support of the reward distribution does not include 1?
- What if the reward distribution is only partially known?

2-C. Continuous Structured Bandits

Continuous Structured Bandits

- Set of arms: $[0, 1]$
- Bernoulli reward for arm $x$ of mean $\mu(x)$
- Reward realizations: $(X_n(x), n \ge 1)$ i.i.d. over time, independent over arms
- Algorithm $\pi$: selects arm $x^\pi(n)$ in round $n$
- Bandit feedback: $X_n(x^\pi(n))$
- Regret: $R^\pi(T) = T \mu^\star - \sum_{n=1}^{T} \mu(x^\pi(n))$, with $\mu^\star = \sup_{x \in [0,1]} \mu(x) = \mu(x^\star)$
- Structure: $x \mapsto \mu(x)$ is unimodal, linear, concave, Lipschitz, …

2-C.1. Continuous Unimodal Bandits

Combes, Proutiere. Unimodal Bandits without Smoothness, arXiv 2014

Continuous Unimodal Bandit

[Figure: $\mu(x)$ versus $x \in [0, 1]$, with maximum $\mu^\star$ reached at $x^\star$]

The mapping $x \mapsto \mu(x)$ is unimodal.

Golden Section Algorithm (Kiefer 1953)

[Figure: $\mu(x)$ versus $x \in [0, 1]$, with maximum $\mu^\star$ at $x^\star$ and two evaluation points $x_1 < x_2$]

- Deterministic setting
- Evaluate the function at points $x_1$, $x_2$
- If $\mu(x_1) < \mu(x_2)$, keep $[x_1, 1]$, else keep $[0, x_2]$
- Design choices: (i) the ratio of the lengths of the new and old intervals is always $r$, and (ii) we need to evaluate the function only once in each step

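Golden Section Algorithm (Kiefer 1953)

[Figure: $\mu(x)$ versus $x \in [0, 1]$, with maximum $\mu^\star$ at $x^\star$ and two evaluation points $x_1 < x_2$]

$$\frac{r}{1} = \frac{1 - r}{r} \implies r = \frac{-1 + \sqrt{5}}{2} \approx 0.618$$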
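The deterministic procedure can be sketched in a few lines. The code below is an illustrative implementation of golden section search for the maximum of a unimodal function on an interval (the function name, the tolerance and the test function are choices made here); thanks to $r^2 = 1 - r$, one of the two interior points is reused at every step, so each iteration costs a single new evaluation.

```python
import math
from typing import Callable


def golden_section_max(f: Callable[[float], float],
                       lo: float = 0.0, hi: float = 1.0,
                       tol: float = 1e-6) -> float:
    """Locate the maximizer of a unimodal function f on [lo, hi]."""
    r = (math.sqrt(5) - 1) / 2            # ~0.618, solves r^2 = 1 - r
    x1 = hi - r * (hi - lo)
    x2 = lo + r * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The maximum lies in [x1, hi]; the old x2 becomes the new x1.
            lo = x1
            x1, f1 = x2, f2
            x2 = lo + r * (hi - lo)
            f2 = f(x2)
        else:
            # The maximum lies in [lo, x2]; the old x1 becomes the new x2.
            hi = x2
            x2, f2 = x1, f1
            x1 = hi - r * (hi - lo)
            f1 = f(x1)
    return (lo + hi) / 2


def triangle(x: float) -> float:
    """A unimodal, non-smooth test function with maximum at x = 0.3."""
    return 1.0 - abs(x - 0.3)
```

Each iteration shrinks the interval by the factor $r$, so reaching a width of $10^{-6}$ from $[0, 1]$ takes about 29 evaluations. With noisy rewards the comparison `f1 < f2` is no longer reliable.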

Stochastic Setting: Related Work

- Smoothness assumption: $|\mu(x) - \mu(x^\star)| \sim C |x - x^\star|^\alpha$ as $x \to x^\star$, with $\alpha > 0$
- Regret lower bound (Dani et al. 2008, linear): $\Omega(\sqrt{T})$
- Existing approaches yielding a regret $\tilde{O}(\sqrt{T})$:
  - Kleinberg 2004: discretization with step $(\log(T)/\sqrt{T})^{1/\alpha}$
  - Cope 2009: stochastic gradient, works for $\alpha \ge 2$ only
  - Yu-Mannor 2011: stochastic version of the golden section algorithm, assumes the knowledge of $\alpha$, $C$
- Without any knowledge of the function smoothness: interval trimming algorithm yielding a regret $\tilde{O}(\sqrt{T})$, Combes-Proutiere 2014

Interval Trimming

- Idea: construct a sequence of intervals $I_T \subset \ldots \subset I_0 = [0, 1]$ with $x^\star \in \cap_{t=0}^{T} I_t$ with high probability
- Step $t$: start with $I_t = [\underline{x}, \overline{x}]$
- Sample the function at $K$ points $\underline{x} \le x_1 \le \ldots \le x_K \le \overline{x}$ until enough information is gathered to eliminate either the left or the right part of $I_t$

[Figure: $\mu(x)$ versus $x \in [0, 1]$]

The Failure of Golden Section Algorithm: Unknown Smoothness

- We need to sample at least 3 arms in the interior of the interval to be trimmed to guarantee that $x^\star \in \cap_{t=0}^{T} I_t$ with high probability

[Figure: $\mu$ sampled at $\underline{x} = x_1$, $x_2$, $x_3$ and $x_4 = \overline{x}$, with gaps $\lambda\epsilon$ and $\epsilon$ marked]

Optimal Interval Trimming

- Sample 3 points in the interior of the interval: $x_1 < x_2 < x_3$
- If $x^\star > x_2$, and $\mu(x_1) < \mu(x_2)$: remove $[\underline{x}, x_1]$
- If $x^\star < x_1$, and $\mu(x_3) < \mu(x_2)$: remove $[x_3, \overline{x}]$
- Sample long enough until $\hat\mu(x_2) - \hat\mu(x_1)$ or $\hat\mu(x_2) - \hat\mu(x_3)$ is large enough

[Figure: $\mu$ sampled at $x_1$, $x_2$, $x_3$ inside $[\underline{x}, \overline{x}]$, with $x^\star$ and the quantities $\lambda$ and $2\delta$ marked]

Optimal Interval Trimming

- Location test: $KL^\star(\mu_1, \mu_2) = 1_{\mu_1 […]}$ […] 0. Then the proposed algorithm has regret $O(\sqrt{T} \log(T))$.

Examples

[Figure: three example reward functions, each plotting $\mu(x)$ versus $x \in [0, 1]$]

2-C.2. Continuous Lipschitz Bandits

Related work

- Continuous set of actions (e.g. [0,1]): Agrawal 1995, Kleinberg 2004, Kleinberg-Slivkins-Upfal 2008, Bubeck-Munos-Stoltz-Szepesvári 2008, …
- For continuous bandits, algorithms should
  1. Adapt the subset of arms to sample from
  2. Optimally exploit the Lipschitz structure to select the arm based on all past observations
- Existing algorithms perform 1, but not 2 (for 2, simple UCB-like indexes are used …)
- Alternative approach: optimal algorithm for discrete bandits, and then optimal discretization of the set of arms

Zooming Algorithm

- Kleinberg-Slivkins-Upfal 2008

[Figure: the interval [0, 1] covered by balls]

- Maintains a set of active balls $A_t$:

$$\mathrm{conf}_t(B) = 4 \sqrt{\frac{\log(T)}{1 + n_t(B)}}, \qquad \mathrm{dom}_t(B) = B \setminus \bigcup_{B' \in A_t :\ r(B') […]} […]$$

[…] 0

Algorithm
1. Discretization of the set of arms: step size $(\log(T)/\sqrt{T})^{1/\alpha}$
2. Apply discrete bandit algorithms

The above algorithm is order-optimal, as (discretization + KL-UCB), HOO algorithms: regret $\tilde{O}(T^{1/2})$
The zooming algorithm does not take the smoothness into account: in general sub-optimal, regret $\tilde{O}(T^{2/3})$
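The two-step recipe (discretize, then run a discrete bandit algorithm) can be sketched as below. This is an illustration under assumptions made here: the grid uses the midpoints of equal cells whose width is at most the step $(\log(T)/\sqrt{T})^{1/\alpha}$, the discrete algorithm is a plain KL-UCB with exploration budget $\log(n) + 3 \log\log(n)$, and the triangular reward function in the usage note is hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean: float, pulls: int, budget: float, tol: float = 1e-4) -> float:
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= budget, by bisection."""
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def grid(horizon: int, alpha: float) -> List[float]:
    """Midpoints of equal cells of width at most (log(T) / sqrt(T))^(1/alpha)."""
    step = (math.log(horizon) / math.sqrt(horizon)) ** (1 / alpha)
    num = math.ceil(1 / step)
    return [(i + 0.5) / num for i in range(num)]


def discretized_kl_ucb(mu: Callable[[float], float], horizon: int,
                       alpha: float, seed: int = 0) -> Tuple[List[float], List[int]]:
    """Run KL-UCB on the discretized arm set; returns grid points and pull counts."""
    rng = random.Random(seed)
    arms = grid(horizon, alpha)
    K = len(arms)
    pulls, sums = [0] * K, [0.0] * K
    for n in range(1, horizon + 1):
        if n <= K:
            k = n - 1                                  # play each grid point once
        else:
            budget = math.log(n) + 3 * math.log(math.log(n))
            k = max(range(K),
                    key=lambda i: kl_ucb_index(sums[i] / pulls[i], pulls[i], budget))
        reward = 1.0 if rng.random() < mu(arms[k]) else 0.0
        pulls[k] += 1
        sums[k] += reward
    return arms, pulls
```

For instance, with the triangular function `lambda x: 0.9 - 0.8 * abs(x - 0.5)` (so $\alpha = 1$) and a horizon of 5,000, the grid has 9 points and the plays concentrate around $x = 0.5$.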

Example: Continuous set of arms

Triangular reward function

[Figure: regret versus time (0 to 2.5 x 10^4) for Zooming+, HOO+, Zooming, HOO, KL-UCB with $(T/\log(T))^{1/2}$ arms and CKL-UCB with $(T/\log(T))^{1/2}$ arms]

Example: Continuous set of arms

Quadratic reward function

[Figure: regret versus time (0 to 2.5 x 10^4) for Zooming+, HOO+, Zooming, HOO, KL-UCB with $(T/\log(T))^{1/2}$ arms, CKL-UCB with $(T/\log(T))^{1/2}$ arms and CKL-UCB with $(T/\log(T))^{1/4}$ arms]

Summary: Continuous Bandits

- State-of-the-art algorithms apply an appropriate discretization of the set of arms, and optimally exploit the structure
- Discretization: depends on the smoothness of the expected reward function
- Without smoothness: optimal location test + interval trimming approach
- No problem-specific regret lower bound

2-D. Conclusions and Open Problems

Conclusions: Stochastic Bandits

- Regret: the right performance metric when dealing with uncertain and time-varying (non-stationary) environments
  - Tracking the best decision with minimum exploration cost
  - Many applications
- A well developed theory (essentially in the control and stat. communities, from the 70's to the late 90's)
- Further insights and new applications (ML community)
- Many open questions …

Anytime Regret Guarantees

- Classical unstructured discrete bandits: the asymptotic lower bound is not tight for small time horizons

[Figure: regret versus time for the lower bound and KL-UCB, with a time threshold $T(K, \theta)$ marked]

- $T(K, \theta)$ rapidly grows with K!
- Optimality for small time horizon?
- Preliminary result: Guha-Munagala 2014 (COLT), Thompson sampling is 2-competitive for very specific problems

Discrete Structured Bandits

- Simple and yet asymptotically optimal algorithm for generic structure?
  - The Graves-Lai lower bound indicates the number of times sub-optimal arms should be selected
  - These numbers solve a complex optimization problem
  - … that we need to solve to get asymptotic optimality
  - What about the trade-off between complexity and regret?
- How does the lower bound scale with the number of arms?
  - Example: combinatorial bandits (e.g. routing problems)
- Performance of Thompson sampling?

Continuous Structured Bandits

- Problem-specific lower bounds?
- How to optimally exploit the structure? Linear, convex, and other structures?
- The optimal discretization depends on the structure and the smoothness of the expected reward function: is there an algorithm learning the structure and the smoothness?

Bibliography

- Graves and Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains, 1997
- Garivier and Moulines. On Upper-Confidence Bound Policies for Non-stationary Bandit Problems, 2011
- Agrawal. The Continuum-Armed Bandit Problem, 1995
- Kleinberg. Nearly tight bounds for the continuum-armed bandit problem, 2004
- Kleinberg, Slivkins, and Upfal. Multi-armed bandits in metric spaces, 2008
- Bubeck, Munos, Stoltz, and Szepesvári. X-Armed Bandits, 2011
- Mallows and Robbins. Some Problems of Optimal Sampling Strategy, 1964
- Berry, Chen, Zame, Heath, and Shepp. Bandit problems with infinitely many arms, 1997
- Wang, Audibert, and Munos. Algorithms for infinitely many-armed bandits, 2008
- Kiefer. Sequential minimax search for a maximum, 1953
- Guha and Munagala. Stochastic Regret Minimization via Thompson Sampling, 2014

Thanks!

- Richard Combes: https://dl.dropboxusercontent.com/u/19365883/site/index.html
- Alexandre Proutiere: http://people.kth.se/~alepro/
