Bandit Optimization: Theory and Applications - Part 2 - R. Combes, A. Proutiere
Part 2. Structured Bandits

Discrete Structured Bandits
1. Regret lower bounds
2. Examples
3. Efficient algorithms for some structures: unimodal, Lipschitz

Infinite Bandits
1. Regret lower bounds
2. Optimal algorithms

Continuous Structured Bandits
1. Regret lower bounds
2. Unimodal bandits
3. Lipschitz bandits

Conclusion and Open Problems
2-A. Discrete Structured Bandits
Discrete Structured Bandits

- K arms
- Reward distributions parametrized by $\theta = (\theta_1, \ldots, \theta_K)$
- Average reward of arm k: $\mu_k = \mu_k(\theta)$
- Most often, reward distributions are taken from a single-parameter exponential family (e.g. Bernoulli, $\theta_k = \mu_k$)
- K can be very large, yielding a prohibitive regret if arms are independent, i.e., $\Theta(K \log(T))$
- Structure matters and has to be exploited!
- Notation: $\mu^\star(\theta) = \max_k \mu_k(\theta) = \mu_{k^\star}(\theta)$
Discrete Structured Bandits

- Unstructured bandits: average rewards are not related: $\mu = (\mu_1, \ldots, \mu_K) \in \Theta$ with $\Theta = \prod_{i=1}^K [a_i, b_i]$
  [Figure: $\Theta$ is a full box in the $(\mu_1, \mu_2)$ plane]

- Structured bandits: the decision maker knows that average rewards are related, i.e., that $\mu \in \Theta$ with $\Theta \neq \prod_{i=1}^K [a_i, b_i]$
  [Figure: $\Theta$ is a strict subset of the box in the $(\mu_1, \mu_2)$ plane]

- The rewards observed for a given action provide side-information about the average rewards of other actions
- How can we exploit this side-information optimally?
Example 1: Graphical Unimodality

$G = (V, E)$

[Figure: arms as vertices of a graph G]
Example 1: Graphical Unimodality

$G = (V, E)$, $\mu = (\mu_i)_{i \in V} \in \mathcal{U}_G$

[Figure: for any vertex $i \neq k^\star$, some neighbor $j$ satisfies $\mu_j > \mu_i$]

Graphical unimodality: from any vertex, there is a path with increasing rewards to the best vertex.
Example 1: Unimodality

[Figure: $\mu_i$ vs. arms $i = 1, \ldots, 8$; rewards increase up to the best arm, then decrease]

Classical unimodality, graph = line
Example 2: Lipschitz

[Figure: $\mu_i$ vs. arm positions $x_1, \ldots, x_8$; slopes bounded by $\pm L$]

Let $x_1 < x_2 < \ldots < x_K$ denote the positions of the arms. We assume that: $|\mu_k - \mu_{k'}| \le L \, |x_k - x_{k'}|$.
A Markov Chain Control Perspective - Graves-Lai 1997

[Figure: controlled Markov chain with transition probabilities $p(x, y; u, \theta)$ and reward $r(x, u)$]

- Finite state space $\mathcal{X}$ and action space $\mathcal{U}$
- Unknown parameter $\theta \in \Theta$, $\Theta$: compact metric space
- Control: finite set of irreducible control laws $g : \mathcal{X} \to \mathcal{U}$, with stationary reward $\mu^g(\theta) = \sum_{x \in \mathcal{X}} \pi_\theta^g(x)\, r(x, g(x))$
- Optimal control law: $g^\star$
- Regret: $R^\pi(T) = T \mu^{g^\star}(\theta) - \mathbb{E}\left[\sum_{t=1}^T r(X_t, g^\pi(X_t))\right]$
Regret lower bound

- KL number under control law g: $I^g(\theta, \lambda) = \sum_{x,y} \pi_\theta^g(x)\, p(x, y; g(x), \theta) \log \frac{p(x, y; g(x), \theta)}{p(x, y; g(x), \lambda)}$
- Bad parameter set: $B(\theta) = \{\lambda \in \Theta : g^\star \text{ not optimal under } \lambda,\ I^{g^\star}(\theta, \lambda) = 0\}$
- Lower bound: $\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta)$, where
  $c(\theta) = \inf \sum_{g \neq g^\star} c_g \left(\mu^{g^\star}(\theta) - \mu^g(\theta)\right)$ s.t. $\inf_{\lambda \in B(\theta)} \sum_{g \neq g^\star} c_g I^g(\theta, \lambda) \ge 1$
Application to Structured Bandits

- State space: set of possible rewards
- Control laws: constant mappings to the set of arms, e.g. $g = k$
- Transitions (i.i.d. process): $p(x, y; k, \theta) = \theta_k$ if $y = 1$, $1 - \theta_k$ if $y = 0$, so that $I^k(\theta, \lambda) = KL(\theta_k, \lambda_k)$
- Average rewards: for $g = k$, $\mu^g(\theta) = \theta_k = \mu_k$
Regret Lower Bound

- Lower bound: $\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta)$, where
  $c(\theta) = \inf_{c_k \ge 0, \forall k} \sum_{k \neq k^\star} c_k (\mu_{k^\star} - \mu_k)$ s.t. $\inf_{\lambda \in B(\theta)} \sum_{k \neq k^\star} c_k I^k(\theta, \lambda) \ge 1$
  $B(\theta) = \{\lambda \in \Theta : I^{k^\star}(\theta, \lambda) = 0,\ \mu^\star(\lambda) > \mu_{k^\star}(\lambda)\}$

- Identifying the worst $\lambda$ can be challenging
- Examples where it is explicit: unimodal, Lipschitz. In these cases, the regret lower bound problem reduces to an LP
- Interpretation: an asymptotically optimal algorithm plays each sub-optimal arm k roughly $c_k \log(T)$ times
Asymptotically Optimal Algorithm

- Graves-Lai's algorithm
- Uses the doubling trick
- Needs to solve the regret lower bound problem repeatedly
- Too complex, and inefficient for reasonable time horizons
2-A.1. Discrete Unimodal Bandits

Combes, Proutiere. Unimodal Bandits: Regret Lower Bounds and Optimal Algorithms, ICML 2014
Combes et al. Optimal Rate Sampling in 802.11 Systems, IEEE Infocom 2014
Regret Lower Bound

[Figure: the neighborhood $N(k^\star)$ of the best arm $k^\star$ in the graph]

Theorem: For any uniformly good algorithm $\pi$,
$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge c_G(\theta)$, where $c_G(\theta) = \sum_{k \in N(k^\star)} \frac{\mu^\star - \mu_k(\theta)}{KL(\theta_k, \theta_{k^\star})}$

The performance limit does not depend on the size of the decision space! Structure could really help.
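As an illustration, a minimal sketch (under a line-graph assumption, reusing `kl_bernoulli` from the earlier sketch) of the constant $c_G(\theta)$, which only involves the neighbors of the best arm:

```python
# Lower-bound constant c_G(theta) for classical unimodality (graph = line).
theta = [0.1, 0.3, 0.7, 0.5, 0.2]   # unimodal mean rewards, peak at arm 3
k_star = max(range(len(theta)), key=lambda k: theta[k])
nbrs = [k for k in (k_star - 1, k_star + 1) if 0 <= k < len(theta)]
c_G = sum((theta[k_star] - theta[k]) / kl_bernoulli(theta[k], theta[k_star])
          for k in nbrs)
print(c_G)  # only the neighbors of k* contribute, whatever K is
```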
Proof

$\inf \sum_{g \neq g^\star} c_g \left(\mu^{g^\star}(\theta) - \mu^g(\theta)\right)$ s.t. $\inf_{\lambda \in B(\theta)} \sum_{g \neq g^\star} c_g I^g(\theta, \lambda) \ge 1$

Example: classical unimodality

[Figure: unimodal $\mu$ over actions $1, \ldots, 8$, with the most confusing parameter $\lambda$ shown]
Optimal Action Sampling

- Empirical average reward: $\hat\mu_k(n) = \frac{1}{t_k(n)} \sum_{s=1}^{t_k(n)} X_k(s)$
- Leader at time n: $L(n) \in \arg\max_k \hat\mu_k(n)$
- Number of times k has been the leader: $l_k(n) = \sum_{s=1}^n \mathbb{1}_{L(s)=k}$
- Index of k: $b_k(n) = \max\left\{q \in [0, 1] : t_k(n)\, KL(\hat\mu_k(n), q) \le \log(l_{L(n)}(n)) + c \log\log(l_{L(n)}(n))\right\}$
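A minimal sketch of how such an index can be computed in practice, by bisection over q (it reuses `kl_bernoulli` from the earlier sketch; `kl_index` is a hypothetical name):

```python
import math

def kl_index(mu_hat, t_k, budget):
    """Largest q in [mu_hat, 1] with t_k * KL(mu_hat, q) <= budget (bisection)."""
    lo, hi = mu_hat, 1.0
    for _ in range(50):               # ~50 bisection steps are plenty
        mid = (lo + hi) / 2
        if t_k * kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

l = 1000                              # times the current leader has led
budget = math.log(l) + 3 * math.log(math.log(l))   # taking c = 3
print(kl_index(0.4, 100, budget))     # exceeds 0.4, shrinks as t_k grows
```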
Optimal Action Sampling

Algorithm - Optimal Action Sampling (OAS)
For $n = 1, \ldots, K$, select action $k(n) = n$
For $n \ge K + 1$, select action
$k(n) = \begin{cases} L(n) & \text{if } (l_{L(n)}(n) - 1)/(\gamma + 1) \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise.} \end{cases}$

Theorem: For any $\mu \in \mathcal{U}_G$, $\limsup_{T \to \infty} \frac{R^{OAS}(T)}{\log(T)} \le c_G(\theta)$.
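A compact simulation sketch of the OAS selection rule (my own rendering, assuming Bernoulli arms on a line graph and reusing `kl_bernoulli`/`kl_index` from the earlier sketches; taking $\gamma = 2$ here is an illustrative choice):

```python
import math, random

def neighbors(k, K):
    """N(k) on the line graph, including k itself."""
    return [j for j in (k - 1, k, k + 1) if 0 <= j < K]

def oas(theta, T, gamma=2, c=3):
    K = len(theta)
    t, s, lead = [0] * K, [0.0] * K, [0] * K
    for n in range(T):
        if n < K:
            k = n                     # initialization: play each arm once
        else:
            L = max(range(K), key=lambda i: s[i] / t[i])   # leader
            lead[L] += 1
            if (lead[L] - 1) % (gamma + 1) == 0:
                k = L                 # exploit the leader at a fixed rate
            else:                     # otherwise: best index in N(L)
                budget = math.log(lead[L]) + c * math.log(max(math.log(lead[L]), 1))
                k = max(neighbors(L, K),
                        key=lambda i: kl_index(s[i] / t[i], t[i], budget))
        s[k] += random.random() < theta[k]
        t[k] += 1
    return t                          # plays concentrate on k* and N(k*)

print(oas([0.1, 0.3, 0.7, 0.5, 0.2], 5000))
```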
Proof

$R^{OAS}(T) \le \sum_{k \neq k^\star} \mathbb{E}[l_k(T)] + \sum_{k \in N(k^\star)} (\mu^\star - \mu_k(\theta))\, \mathbb{E}\left[\sum_{t=1}^T \mathbb{1}_{L(t)=k^\star,\, k(t)=k}\right]$

First term: $O(\log\log(T))$
Second term: $\le (1 + \epsilon)\, c(\theta) \log(T) + O(\log\log(T))$
Proof ingredients

1. Decomposition of the set of events
2. Deviation bounds (refined concentration inequalities), e.g.

Lemma. Let $\{Z_t\}_{t \in \mathbb{Z}}$ be independent random variables in $[0, B]$, $\mathcal{F}_n = \sigma(\{Z_t\}_{t \le n})$, $\mathcal{F} = (\mathcal{F}_n)_{n \in \mathbb{Z}}$. Let $s \in \mathbb{N}$, $n_0 \in \mathbb{Z}$ and $T \ge n_0$. Define $S_n = \sum_{t=n_0}^n B_t (Z_t - \mathbb{E}[Z_t])$, where $B_t \in \{0, 1\}$ is previsible, and $t_n = \sum_{t=n_0}^n B_t$. Let $\phi \in \{n_0, \ldots, T+1\}$ be an $\mathcal{F}$-stopping time such that either $t_\phi \ge s$ or $\phi = T + 1$. Then:
$\mathbb{P}[S_\phi \ge \delta t_\phi,\ \phi \le T] \le \exp\left(-\frac{2 s \delta^2}{B^2}\right)$.
Non-stationary environments

- Average rewards may evolve over time: $\theta(t)$
- Best decision at time t: $k^\star(t)$
- Goal: track the best decision
- Regret: $R^\pi(T) = \sum_{t=1}^T \left(\mu_{k^\star(t)}(t) - \mu_{k^\pi(t)}(t)\right)$
- Sub-linear regret cannot be achieved (Garivier-Moulines 2011)
- Assumptions: $\theta(t)$ $\sigma$-Lipschitz (w.r.t. time), and separation:
  $\limsup_{T \to \infty} \frac{1}{T} \sum_{n=1}^T \sum_{k} \sum_{k' \in N(k)} \mathbb{1}_{|\theta_k(n) - \theta_{k'}(n)| < \Delta} \le \delta(K)$
OAS with Sliding Window

- SW-OAS (applies OAS over a sliding window of size $\tau$)
- Graphical unimodality holds at any time
- Parameters: $\tau = \sigma^{-3/4} \sqrt{\log(1/\sigma)}/8$, $\delta = \sigma^{1/4} \sqrt{\log(1/\sigma)}$

Theorem: Under $\pi$ = SW-OAS,
$\limsup_{T \to \infty} \frac{R^\pi(T)}{T} \le C\, \delta(K)\, \sigma^{1/4} \sqrt{\log(1/\sigma)}\, (1 + K\, o(1))$, as $\sigma \to 0^+$
OAS with Sliding Window

- Analysis made complicated by the smoothness of the rewards vs. time (the previous analysis by Garivier-Moulines assumes separation of rewards at any time)
- Upper bound on regret per time unit: $\sigma^{1/4} \sqrt{\log(1/\sigma)} \to 0$, as $\sigma \to 0^+$
- Tends to zero when the evolution of average rewards gets smoother
- Does not depend on the size of the decision space if $\delta(K) \le C$
Application: Rate adaptation in 802.11

Adapting the modulation/coding scheme to the radio environment

- 802.11 a/b/g: binary (Yes/No) success feedback per transmission

  rates:                  $r_1, r_2, \ldots, r_N$
  success probabilities:  $\theta_1, \theta_2, \ldots, \theta_N$
  throughputs:            $\mu_1, \mu_2, \ldots, \mu_N$, with $\mu_i = r_i \theta_i$

- Structure: unimodality + $\theta_1 > \theta_2 > \ldots > \theta_N$

[Figure: line graph G over the 802.11g rates 6, 9, 12, 18, 24, 36, 48, 54 Mbit/s]
Rate adaptation in 802.11

- 802.11 n/ac MIMO: rate + MIMO mode (32 combinations in n)
- Example: two modes, single-stream (SS) or double-stream (DS)

[Figure: graph G over rate/mode pairs;
 DS rates: 27, 54, 81, 108, 162, 216, 243, 270 Mbit/s;
 SS rates: 13.5, 27, 40.5, 54, 81, 108, 121.5, 135 Mbit/s]
State-of-the-art

- ARF (Auto Rate Fallback): after n successive successes, probe a higher rate; after two consecutive failures, reduce the rate (see the sketch below)
- AARF: vary n dynamically depending on the speed at which the radio environment evolves
- SampleRate: based on achieved throughputs over a sliding window, explore a new rate every 10 packets
- Measurement-based approaches: map SNR to packet error rate (does not work with OFDM): RBAR, OAR, CHARM, …
- 802.11n MIMO: MiRA, RAMAS, …

All existing algorithms are heuristics. Rate adaptation design: a graphically unimodal bandit with a large strategy set
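A minimal sketch of the ARF rule described above (my own rendering; `n_up` and the state dictionary are illustrative names, not from any driver implementation):

```python
def arf_step(rate_idx, success, state, n_up=10, max_idx=7):
    """Return the next rate index given the last transmission outcome."""
    if success:
        state["succ"] += 1
        state["fail"] = 0
        if state["succ"] >= n_up and rate_idx < max_idx:
            state["succ"] = 0
            return rate_idx + 1   # probe a higher rate
    else:
        state["fail"] += 1
        state["succ"] = 0
        if state["fail"] >= 2 and rate_idx > 0:
            state["fail"] = 0
            return rate_idx - 1   # fall back after two consecutive failures
    return rate_idx

state = {"succ": 0, "fail": 0}
idx = 0
for outcome in [True] * 10 + [False, False]:
    idx = arf_step(idx, outcome, state)
print(idx)  # probed up after 10 successes, then fell back: 0
```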
Optimal Rate Sampling

Algorithm - Optimal Rate Sampling (ORS)
For $n = 1, \ldots, K$, select action $k(n) = n$
For $n \ge K + 1$, select action
$k(n) = \begin{cases} L(n) & \text{if } (l_{L(n)}(n) - 1)/(\gamma + 1) \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise.} \end{cases}$

ORS is asymptotically optimal (minimizes regret). Its performance does not depend on the number of possible rates! For non-stationary environments: SW-ORS (ORS with sliding window)
802.11g - stationary environment

GRADUAL scenario (success prob. smoothly decreases with rate)

[Figure: regret vs. time (0-100 s) for SampleRate, SW-G-ORS, and G-ORS]
802.11g - stationary environment

STEEP scenario (success prob. is either close to 1 or to 0)

[Figure: regret vs. time (0-100 s) for SampleRate, SW-G-ORS, and G-ORS]
802.11g - non-stationary environment

TRACES

[Figure: instantaneous throughput (Mbps) vs. time (0-300 s) for each rate from 6 to 54 Mbps]
802.11g - non-stationary environment

RESULTS

[Figure: instantaneous throughput (Mbps) vs. time (0-300 s) for Oracle, SW-G-ORS, and SampleRate]
2-A.2. Discrete Lipschitz Bandits

Combes, Magureanu, Proutiere. Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms, COLT 2014
Discrete Lipschitz Bandits

[Figure: $\mu_i$ vs. arm positions $x_1, \ldots, x_8$; slopes bounded by $\pm L$]

Let $x_1 < x_2 < \ldots < x_K$ denote the positions of the arms. We assume that: $|\mu_k - \mu_{k'}| \le L \, |x_k - x_{k'}|$.
Related work

- Continuous set of actions (e.g. [0,1]): Agrawal 1995, Kleinberg 2004, Kleinberg-Slivkins-Upfal 2008, Bubeck-Munos-Stoltz-Szepesvári 2008, …
Regret lower bound

Theorem: For any uniformly good algorithm $\pi$,
$\liminf_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \ge C(\theta)$
where $C(\theta)$ is the minimal value of:
$\min_{c_k \ge 0, \forall k \in K} \sum_{k \in K} c_k (\theta^\star - \theta_k)$ s.t. $\forall k \in K$, $\sum_{i \in K} c_i I(\theta_i, \lambda_i^k) \ge 1$,
where $\lambda^k$ denotes the most confusing parameter making arm k optimal.
Regret lower bound

$\min_{c_k \ge 0, \forall k \in K} \sum_{k \in K} c_k (\theta^\star - \theta_k)$ s.t. $\forall k \in K$, $\sum_{i \in K} c_i I(\theta_i, \lambda_i^k) \ge 1$

[Figure: $\theta$ (crosses) and the most confusing parameter $\lambda^k$ (circles), which deviates from $\theta$ only around arm k, subject to the Lipschitz constraint]
Algorithms

$b_k(n) = \sup\left\{q \in [\hat\theta_k(n), 1] : \sum_{k'=1}^K t_{k'}(n)\, I^+(\hat\theta_{k'}(n), \lambda_{k'}^{q,k}) \le \log(n) + 3 \log\log(n)\right\}$

[Figure: $\hat\theta(n)$ (crosses) and the perturbed parameter $\lambda^{q,k}$ (circles), equal to q at arm k and respecting the Lipschitz constraint]
The OSLB algorithm

- Apparently optimal arm sampling rates: $c_k(n)$, solving the regret lower bound problem with $\theta$ replaced by $\hat\theta(n)$
- Set of arms apparently under-sampled: $K_e(n) = \{k : t_k(n) \le c_k(n) \log(n)\}$
- $\bar{k}(n) = \arg\min_{k \in K_e(n)} t_k(n)$, $\tilde{k}(n) = \arg\min_k t_k(n)$

Algorithm - OSLB
Select the leader if $\hat\theta_{L(n)}(n) \ge \max_{k \neq L(n)} b_k(n)$
Else, if $t_{\tilde{k}(n)}(n) < \frac{\epsilon}{K}\, t_{\bar{k}(n)}(n)$, select $\tilde{k}(n)$; else select $\bar{k}(n)$
A Simplified Algorithm

Algorithm - CKL-UCB
Select the leader if it has the highest index
Else select the least explored arm with an index higher than the leader's
Regret under OSLB and CKL-UCB

Theorem: For any $\theta \in \Theta_L$, under $\pi = \text{OSLB}(\epsilon)$, we have: for all $\delta > 0$ and all T,
$R^\pi(T) \le C_\delta(\theta)(1 + \epsilon) \log(T) + C' \log\log(T) + K^3 \epsilon^{-1} \delta^{-2} + 3K \delta^{-1}$
where $C_\delta(\theta) \to C(\theta)$ as $\delta \to 0^+$.

Theorem: For any $\theta \in \Theta_L$, under $\pi = \text{CKL-UCB}$, we have:
$\limsup_{T \to \infty} \frac{R^\pi(T)}{\log(T)} \le C'(\theta)$,
where $C'(\theta)$ is the minimal value of an optimization problem "close" to that providing the regret lower bound.
Proof ingredients

A concentration inequality for the sum of KL divergences:
$\mathbb{P}\left[\sum_{k=1}^K t_k(n)\, I^+(\hat\theta_k(n), \theta_k) \ge \delta\right] \le e^{K+1} \left(\frac{\lceil \delta \log(n) \rceil\, \delta}{K}\right)^K e^{-\delta}$
Example

46 arms, T = 500,000

[Figure: normalized expected number of plays vs. arm index, for CKL-UCB and KL-UCB, with the mean-reward profile $\theta$ overlaid]
Example

[Figure: regret vs. time (up to $5 \times 10^5$) for KL-UCB and CKL-UCB]
Summary: Discrete Structured Bandits

- Regret lower bounds by Graves-Lai 1997: work for any structure
  - When is the solution explicit?
  - How does it scale with the dimension of the decision space?
  - When explicit, provides guidelines for the design of optimal algorithms, optimally exploiting the known structure
- Simple and efficient algorithms: unimodal and Lipschitz
- Other structures? Linear, convex?
- Thompson Sampling
  - Is it always asymptotically optimal?
  - How to sample from the posterior?
- Complexity vs. performance?
2-B. Infinite Bandits

Bonald, Proutiere. Two-Target Algorithm for Infinite-Armed Bandits, NIPS 2013
Actions and rewards

- An infinite number of Bernoulli arms
- Decision in each round: try a new arm, or play an arm previously selected
- Bayesian setting: the expected reward $\theta_k$ of the k-th selected arm follows a known distribution, $F(u) = \mathbb{P}[\theta_k > u]$, with $F(u) \sim \alpha(1-u)^\beta$ as $u \to 1$
- Regret: $R(T) = T - \mathbb{E}\left[\sum_{t=1}^T X_t\right]$
- More like a stopping time problem …
Related work

- Mallows-Robbins 1964, Herschkorn-Pekoz-Ross 1996: no-regret policies
- Berry-Chen-Zame-Heath-Shepp 1997: uniformly distributed parameter, policy with regret $2\sqrt{T}$, conjectured to be optimal
  1-failure policy: keep the first arm with more than $\sqrt{T}$ consecutive 1-rewards
  rewards: 110 | 10 | 11110 | 11111110101011100…
  arm:       1    2     3     4
- 1-failure policies are actually sub-optimal …
Related work

- Wang-Audibert-Munos 2008: more general parameter distributions, regret scaling as $T^{\beta/(\beta+1)}$ up to log factors. Policy: select X arms and run UCB … Not a stopping rule: the number of arms tested does not depend on the realizations of the rewards.
Regret lower bound

Theorem: For any algorithm $\pi$ knowing the time horizon,
$\liminf_{T \to \infty} \frac{R^\pi(T)}{T^{\beta/(\beta+1)}} \ge \left(\frac{\beta+1}{\alpha}\right)^{\frac{1}{\beta+1}}$

Conjecture: When the time horizon is unknown,
$\liminf_{T \to \infty} \frac{R^\pi(T)}{T^{\beta/(\beta+1)}} \ge \frac{\beta+1}{\beta}\left(\frac{\beta}{\alpha}\right)^{\frac{1}{\beta+1}}$

Example: uniformly distributed parameter ($\alpha = 1$, $\beta = 1$): $\sqrt{2T}$ and $2\sqrt{T}$, respectively.
Two-target algorithms

Exploration of arm k (a run = a maximal block of 1-rewards ended by a 0):

Run 1: 11111111110   Run 2: 11110   Run 3: 111111110   …   Run m+1: 111111111110
       (length L1)                                          (length L2)

If $L_1 < \ell_1$, explore a new arm
Else, if $L_2 < \ell_2$, explore a new arm; else keep the arm forever
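A minimal sketch of this exploration test for a single arm (my own reading of the rule above; it assumes $m \ge 1$ and tests the second target on run m+1; the target values below follow the theorem on the next slide):

```python
import random

def draw_run(theta):
    """Play until the first 0; return the number of leading 1s."""
    n = 0
    while random.random() < theta:
        n += 1
    return n

def two_target_keep(theta, l1, l2, m):
    """True if the arm passes both targets and is kept forever (m >= 1)."""
    if draw_run(theta) < l1:        # target 1: first run must reach l1
        return False
    for _ in range(m):              # runs 2, ..., m+1
        last = draw_run(theta)
    return last >= l2               # target 2: run m+1 must reach l2

# Uniform prior (alpha = 1, beta = 1) with horizon n:
# l1 ~ (n/2)^(1/3), l2 ~ m * sqrt(n/2)
n, m = 2 * 10**4, 10
l1, l2 = int((n / 2) ** (1 / 3)), int(m * (n / 2) ** 0.5)
print(two_target_keep(0.99, l1, l2, m))
```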
Two-target algorithms

Theorem: Select $\ell_1 = \left\lfloor \left(\frac{\alpha n}{2}\right)^{\frac{1}{\beta+2}} \right\rfloor$, $\ell_2 = \left\lfloor m \left(\frac{\alpha n}{2}\right)^{\frac{1}{\beta+1}} \right\rfloor$. Then:
$\limsup_{T \to \infty} \frac{R^\pi(T)}{T^{\beta/(\beta+1)}} \le \left(\frac{\beta+1}{\alpha}\right)^{\frac{1}{\beta+1}} \left(1 + O\left(\frac{1}{m}\right)\right)$.

Example: parameters for the uniform distribution: $\ell_1 \sim (n/2)^{1/3}$, $\ell_2 \sim m\sqrt{n/2}$.
Numerical Example

- Beta(1,2) mean reward distribution
- Expected failure rate = mean regret per round
Summary: Infinite Bandits

- Regret lower bound and optimal algorithms when the support of the reward distribution includes 1, and the time horizon is known
- What about an unknown time horizon?
- What if the support of the reward distribution does not include 1?
- What if the reward distribution is only partially known?
2-C. Continuous Structured Bandits
Continuous Structured Bandits

- Set of arms: $[0, 1]$
- Bernoulli reward for arm x with mean $\mu(x)$
- Reward realizations: $(X_n(x),\ n \ge 1)$ i.i.d. over time, independent across arms
- Algorithm $\pi$: selects arm $x^\pi(n)$ in round n
- Bandit feedback: $X_n(x^\pi(n))$
- Regret: $R^\pi(T) = T \mu^\star - \mathbb{E}\left[\sum_{n=1}^T \mu(x^\pi(n))\right]$, with $\mu^\star = \sup_{x \in [0,1]} \mu(x) = \mu(x^\star)$
- Structure: $x \mapsto \mu(x)$ is unimodal, linear, concave, Lipschitz, …
2-C.1. Continuous Unimodal Bandits

Combes, Proutiere. Unimodal Bandits without Smoothness, arXiv 2014
Continuous Unimodal Bandits

[Figure: a unimodal function $\mu(x)$ on $[0, 1]$ with maximum $\mu^\star$ at $x^\star$]

The mapping $x \mapsto \mu(x)$ is unimodal.
Golden Section Algorithm - Kiefer 1953

[Figure: a unimodal $\mu(x)$ on $[0, 1]$ evaluated at two interior points $x_1 < x_2$]

- Deterministic setting
- Evaluate the function at points $x_1, x_2$
- If $\mu(x_1) < \mu(x_2)$, keep $[x_1, 1]$; else keep $[0, x_2]$
- Design choices: (i) the ratio of the lengths of the old and new intervals is always r, and (ii) we need to evaluate the function only once in each step
Golden Section Algorithm - Kiefer 1953

[Figure: after trimming, the kept evaluation point becomes one of the two evaluation points of the new interval]

$\frac{1-r}{r} = r \implies r = \frac{\sqrt{5}-1}{2} \approx 0.618$
Stochastic Setting - Related Work

- Smoothness assumption: $|\mu(x) - \mu(x^\star)| \sim_{x \to x^\star} C |x - x^\star|^\alpha$, $\alpha > 0$
- Regret lower bound (Dani et al. 2008, linear case): $\Omega(\sqrt{T})$
- Existing approaches yielding a regret $\tilde{O}(\sqrt{T})$:
  - Kleinberg 2004: discretization with step size $(\log(T)/\sqrt{T})^{1/\alpha}$
  - Cope 2009: stochastic gradient, works for $\alpha \ge 2$ only
  - Yu-Mannor 2011: stochastic version of the golden section algorithm, assumes knowledge of $\alpha$, C
- Without any knowledge of the function smoothness: interval trimming algorithm yielding a regret $\tilde{O}(\sqrt{T})$, Combes-Proutiere 2014
Interval Trimming

- Idea: construct a sequence of intervals $I_T \subset \ldots \subset I_0 = [0, 1]$ with $x^\star \in \cap_{t=0}^T I_t$ with high probability
- Step t: start with $I_t = [\underline{x}, \bar{x}]$
- Sample the function at K points $\underline{x} \le x_1 \le \ldots \le x_K \le \bar{x}$ until enough information is gathered to eliminate either the left or the right part of $I_t$

[Figure: a unimodal $\mu(x)$ on $[0, 1]$ with sampling points inside the current interval]
The Failure of the Golden Section Algorithm - Unknown Smoothness

- We need to sample at least 3 arms in the interior of the interval to be trimmed to guarantee that $x^\star \in \cap_{t=0}^T I_t$ with high probability

[Figure: a function $\mu$ sampled at points $\underline{x} = x_1, x_2, x_3, x_4 = \bar{x}$, with a peak confined to a region of width $\epsilon$ (resp. $\lambda\epsilon$)]
Optimal Interval Trimming

- Sample 3 points in the interior of the interval: $x_1 < x_2 < x_3$
- If $x^\star > x_2$, then $\mu(x_1) < \mu(x_2)$: remove $[\underline{x}, x_1]$
- If $x^\star < x_1$, then $\mu(x_3) < \mu(x_2)$: remove $[x_3, \bar{x}]$
- Sample long enough, until $\hat{\mu}(x_2) - \hat{\mu}(x_1)$ or $\hat{\mu}(x_2) - \hat{\mu}(x_3)$ is large enough

[Figure: samples of $\mu$ at $x_1, x_2, x_3$ inside the current interval $[\underline{x}, \bar{x}]$, with the maximizer $x^\star$ marked]
Optimal Interval Trimming

- Location test based on the one-sided statistic $KL^\star(\mu_1, \mu_2) = \mathbb{1}_{\mu_1 < \mu_2}\, KL(\mu_1, \mu_2)$

Theorem: The proposed algorithm has regret $O(\sqrt{T \log(T)})$.
Examples

[Figure: three unimodal reward functions $\mu(x)$ on $[0, 1]$ to which the algorithm applies]
2-C.2. Continuous Lipschitz Bandits
Related work

- Continuous set of actions (e.g. [0,1]): Agrawal 1995, Kleinberg 2004, Kleinberg-Slivkins-Upfal 2008, Bubeck-Munos-Stoltz-Szepesvári 2008, …

- For continuous bandits, algorithms should
  1. Adapt the subset of arms to sample from
  2. Optimally exploit the Lipschitz structure to select the arm based on all past observations

- Existing algorithms perform 1, but not 2 (for 2, simple UCB-like indexes are used …)
- Alternative approach: an optimal algorithm for discrete bandits, combined with an optimal discretization of the set of arms
Zooming Algorithm - Kleinberg-Slivkins-Upfal 2008

- Maintains a set of active balls $A_t$, each ball B with radius $r(B)$ and $n_t(B)$ samples
- Confidence radius: $\mathrm{conf}_t(B) = 4\sqrt{\frac{\log(T)}{1 + n_t(B)}}$
- Domain: $\mathrm{dom}_t(B) = B \setminus \cup_{B' \in A_t : r(B') < r(B)} B'$

Algorithm (sketched below)
1. Discretization of the set of arms: step size $(\log(T)/\sqrt{T})^{1/\alpha}$
2. Apply discrete bandit algorithms

The above algorithm is order-optimal, as are (discretization + KL-UCB) and the HOO algorithm: regret $\tilde{O}(T^{1/2})$. The zooming algorithm does not take the smoothness into account: in general sub-optimal, regret $\tilde{O}(T^{2/3})$.
Example: Continuous set of arms

Triangular reward function

[Figure: regret vs. time (up to $2.5 \times 10^4$) for Zooming+, HOO+, Zooming, HOO, KL-UCB with $(T/\log(T))^{1/2}$ arms, and CKL-UCB with $(T/\log(T))^{1/2}$ arms]
Example: Continuous set of arms

Quadratic reward function

[Figure: regret vs. time (up to $2.5 \times 10^4$) for Zooming+, HOO+, Zooming, HOO, KL-UCB with $(T/\log(T))^{1/2}$ arms, CKL-UCB with $(T/\log(T))^{1/2}$ arms, and CKL-UCB with $(T/\log(T))^{1/4}$ arms]
Summary: Continuous Bandits

- State-of-the-art algorithms apply an appropriate discretization of the set of arms, and optimally exploit the structure
- Discretization: depends on the smoothness of the expected reward function
- Without smoothness: optimal location test + interval trimming approach
- No problem-specific regret lower bound
2-D. Conclusions and Open Problems
Conclusions: Stochastic Bandits

- Regret: the right performance metric when dealing with uncertain and time-varying (non-stationary) environments
- Tracking the best decision with minimum exploration cost
- Many applications

- A well-developed theory (essentially in the control and statistics communities, from the 70s to the late 90s)
- Further insights and new applications (ML community)
- Many open questions …
Anytime Regret Guarantees

- Classical unstructured discrete bandits: the asymptotic lower bound is not tight for small time horizons

[Figure: regret vs. time; the KL-UCB regret curve stays above the asymptotic lower bound until a time $T(K, \theta)$, which rapidly grows with K!]

- Optimality for small time horizons?
- Preliminary result: Guha 2014 (COLT), Thompson sampling is 2-competitive for very specific problems
Discrete Structured Bandits

- A simple and yet asymptotically optimal algorithm for generic structures?
  - The Graves-Lai lower bound indicates the number of times sub-optimal arms should be selected
  - These numbers solve a complex optimization problem …
  - … that we need to solve to get asymptotic optimality
  - What about the trade-off between complexity and regret?
- How does the lower bound scale with the number of arms?
  - Example: combinatorial bandits (e.g. routing problems)
- Performance of Thompson sampling?
Continuous Structured Bandits

- Problem-specific lower bounds?
- How to optimally exploit the structure? Linear, convex, and other structures?
- The optimal discretization depends on the structure and the smoothness of the expected reward function: is there an algorithm learning the structure and the smoothness?
Bibliography

- Graves and Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains, 1997
- Garivier and Moulines. On Upper-Confidence Bound Policies for Non-stationary Bandit Problems, 2011
- Agrawal. The Continuum-Armed Bandit Problem, 1995
- Kleinberg. Nearly tight bounds for the continuum-armed bandit problem, 2004
- Kleinberg, Slivkins, and Upfal. Multi-armed bandits in metric spaces, 2008
- Bubeck, Munos, Stoltz, and Szepesvári. X-Armed Bandits, 2011
- Mallows and Robbins. Some Problems of Optimal Sampling Strategy, 1964
- Berry, Chen, Zame, Heath, and Shepp. Bandit problems with infinitely many arms, 1997
- Wang, Audibert, and Munos. Algorithms for infinitely many-armed bandits, 2008
- Kiefer. Sequential minimax search for a maximum, 1953
- Guha and Munagala. Stochastic Regret Minimization via Thompson Sampling, 2014
Thanks!

- Richard Combes: https://dl.dropboxusercontent.com/u/19365883/site/index.html
- Alexandre Proutiere: http://people.kth.se/~alepro/