Nonstochastic Bandit Problems: Vanilla, Partial, Delayed, Composite, Contextual

Claudio Gentile, INRIA and Google NY
[email protected]
Toulouse September 13th, 2018
Based on joint work with: N. Alon, N. Cesa-Bianchi, P. Gaillard, S. Gerchinovitz, Y. Mansour, S. Mannor, O. Shamir
Goal of this presentation

Recent activity in the analysis of bandit problems in nonstochastic settings, under various modeling assumptions and kinds of available feedback.

Outline:
• Nonstochastic bandit game:
  – vanilla
  – delayed
  – composite anonymous
  – graph
• Contextual bandits for nonparametric policies
Examples thereof
Nonstochastic bandit game/1

$N$ actions for Player.

For $t = 1, 2, \dots$:
1. Losses $\ell_t(i) \in [0,1]$ are assigned by the opponent to every action $i = 1 \dots N$ (hidden to the player)
2. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Player gets feedback information: $\ell_t(I_t)$
Nonstochastic bandit game/2

Goal [external regret]: Given $T$ rounds, Player's total loss
$$\sum_{t=1}^{T} \ell_t(I_t)$$
must be close to that of the single best action in hindsight for the Player.

(Pseudo) Regret of Player for $T$ rounds:
$$R_T = \max_{i=1\dots N} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t) - \sum_{t=1}^{T} \ell_t(i)\right]$$

Want: $R_T = o(T)$ as $T$ grows large ("no regret"). Lower bound: $\Omega(\sqrt{TN})$.

Regret:
$$R_T^* = \max_{i=1\dots N}\left(\sum_{t=1}^{T} \ell_t(I_t) - \sum_{t=1}^{T} \ell_t(i)\right)$$

Want: $R_T^* = o(T)$ as $T$ grows large, w.h.p.
Nonstochastic bandit game/3: Exp3 Alg. [Auer et al. 02]

At round $t$ pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right), \qquad i = 1 \dots N$$
where
$$\hat\ell_s(i) = \begin{cases} \dfrac{\ell_s(i)}{\Pr_s\left(\ell_s(i) \text{ is observed in round } s\right)} & \text{if } \ell_s(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$

• Only one nonzero component in $\hat\ell_t$
• Exponentially-weighted alg. with (importance sampling) loss estimates $\hat\ell_t(i) \approx \ell_t(i)$
• Upper bound on regret: $R_T \le \sqrt{TN \ln N}$
• Improved upper bound: $O(\sqrt{TN})$ (the INF alg.) [AB09]
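A minimal sketch of this update in Python (my own illustration; the learning-rate value and the array-based loss oracle are assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(losses, eta):
    """Minimal Exp3 sketch. `losses` is a (T, N) array of [0,1] losses;
    only the played entry losses[t, It] is ever revealed to the player."""
    T, N = losses.shape
    cum_est = np.zeros(N)          # cumulative estimates  sum_s lhat_s(i)
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))   # shift for stability
        p = w / w.sum()            # play i with prob. prop. to exp(-eta * cum_est[i])
        It = rng.choice(N, p=p)
        total += losses[t, It]
        cum_est[It] += losses[t, It] / p[It]   # importance-weighted estimate
    return total
```

With $\eta \approx \sqrt{\ln N / (TN)}$ this matches the $\sqrt{TN \ln N}$ bound quoted above.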
Nonstochastic bandit game with delay/1

For $t = 1, 2, \dots$:
1. Losses $\ell_t(i)$ are assigned by the opponent to every action $i = 1 \dots N$ (hidden to the player)
2. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Player gets delayed feedback information: $\ell_{t-d}(I_{t-d})$ (when $t > d$)

Lower bound: $R_T \ge \sqrt{T(d + N)}$
Nonstochastic bandit game with delay/2 [CB+16]

Upper bound:
• Use the importance-sampling estimate within Exp3, and update as soon as the loss becomes available:
$$\hat\ell_t(i) = \begin{cases} \dfrac{\ell_{t-d}(i)}{\Pr_{t-d}(I_{t-d} = i)} & \text{if } I_{t-d} = i \\ 0 & \text{otherwise} \end{cases}$$
• Cumulative regret (matching the lower bound up to logs):
$$R_T = \tilde O\left(\sqrt{T(d + N)}\right)$$

Unknown delays [Li+18]: collect the (delayed) loss observations at time $t$, but use $\Pr_t$ instead of $\Pr_{t-d}$
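A sketch of the bookkeeping this requires: a buffer of pending observations, each tagged with the sampling probability at play time (the names and the buffer layout are mine, not from [CB+16]):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

def delayed_exp3(losses, d, eta):
    """Exp3 with fixed feedback delay d: the loss of round t arrives at t + d."""
    T, N = losses.shape
    cum_est = np.zeros(N)
    pending = deque()                  # (arm played, prob. at play time, loss)
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        It = rng.choice(N, p=p)
        pending.append((It, p[It], losses[t, It]))
        if t >= d:                     # feedback for round t - d arrives now
            i, p_play, loss = pending.popleft()
            cum_est[i] += loss / p_play   # divide by Pr_{t-d}(I_{t-d} = i)
    return cum_est
```

With unknown delays one would instead divide by the current probability $p_t(i)$, as in [Li+18].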
Composite anonymous feedback/1 [D+14,A+15,PB+17,CB+18]

• The loss of an action is not charged immediately but spread arbitrarily over $d$ consecutive steps
• Generalizes $d$-delayed feedback
• Several motivating examples in online businesses:
  – impression resulting in an immediate clickthrough, later followed by a conversion
  – user interacting with a recommended item (e.g. media content) multiple times over several days
• The loss observed by the player at time $t$ is a composite loss, i.e. the sum of $d$ loss components (the accumulated effect of the $d$ most recent actions):
$$\ell_t^{(0)}(I_t) + \ell_{t-1}^{(1)}(I_{t-1}) + \dots + \ell_{t-d+1}^{(d-1)}(I_{t-d+1})$$
where $\ell_{t-s}^{(s)}(I_{t-s})$ is the $s$-th loss component from action $I_{t-s}$
Composite anonymous feedback/2 [D+14,A+15,PB+17,CB+18]

[Figure: $N = 3$ actions $\{1, 2, 3\}$, $d = 4$ loss components, with $I_{t-3} = 2$, $I_{t-2} = 3$, $I_{t-1} = 1$, $I_t = 2$. The single number in $[0,1]$ observed at time $t$ sums one component from each of the last $d$ plays: $\ell_t^{(0)}(I_t) + \ell_{t-1}^{(1)}(I_{t-1}) + \ell_{t-2}^{(2)}(I_{t-2}) + \ell_{t-3}^{(3)}(I_{t-3})$.]
Composite anonymous feedback/3 [D+14,A+15,PB+17,CB+18]

For $t = 1, 2, \dots$:
1. Losses $\ell_t(i) \in [0,1]$ are assigned (obliviously) by the opponent to every action $i = 1 \dots N$ (hidden to the player)
2. Losses $\ell_t(i)$ are broken up into $d$ components (arbitrarily but obliviously):
$$\ell_t(i) = \ell_t^{(0)}(i) + \ell_t^{(1)}(i) + \dots + \ell_t^{(d-1)}(i)$$
3. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
4. Player gets composite loss feedback information:
$$\ell_t^o(I_{t-d+1} \dots I_t) = \ell_t^{(0)}(I_t) + \ell_{t-1}^{(1)}(I_{t-1}) + \dots + \ell_{t-d+1}^{(d-1)}(I_{t-d+1})$$
Composite Loss Wrapper [CB+18]

• Take a base MAB($\eta$) as input
• $I_0 \sim p_1 =$ uniform on actions $1 \dots N$
• Interleave update (up), draw (dr), stay (st) rounds:
  ... up dr st st st ... st up dr st st st ... st up dr st st st ... st ...
  Each stretch of stay rounds is $\ge 2d - 2$; its length is $2d - 2 + \mathrm{Geom}(1/(2d))$
• draw round: $I_t \sim p_t$ without updating $p_t$
• stay round: $I_t = I_{t-1}$ without updating $p_t$
• update round: $I_t = I_{t-1}$, but $p_t \to p_{t+1}$ by feeding the base MAB with the average composite loss
$$\bar\ell_t = \frac{1}{2d} \sum_{\tau = t-d+1}^{t} \ell_\tau^o(I_{\tau-d+1} \dots I_\tau)$$

A code sketch of this schedule follows below.
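A minimal sketch of the round scheduling, assuming a base MAB object with `sample()` and `update()` methods; these method names and the composite-loss oracle are my own illustration, not the interface of [CB+18]:

```python
import numpy as np

rng = np.random.default_rng(0)

def composite_wrapper(base_mab, observe_composite, T, d):
    """Interleave update / draw / stay rounds; the base MAB only ever sees
    the average composite loss, mirroring the bar-ell_t formula above."""
    I = rng.integers(base_mab.N)        # I_0 drawn from the uniform p_1
    obs = []                            # composite-loss observations so far
    t = 0
    while t < T:
        stretch = 2 * d - 2 + rng.geometric(1.0 / (2 * d))
        for phase in ["up", "dr"] + ["st"] * stretch:
            if t >= T:
                break
            if phase == "up" and obs:   # keep playing I, feed the base MAB
                base_mab.update(I, sum(obs[-d:]) / (2 * d))
            elif phase == "dr":         # fresh draw from p_t, no update
                I = base_mab.sample()
            # "st": keep playing the previous action, no update
            obs.append(observe_composite(t, I))
            t += 1
```

Holding the action fixed through each stay stretch is what lets the wrapper attribute the averaged composite loss to a single arm.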
Stability and regret bounds [CB+18]

Stability: a base MAB $A(\eta)$ generating $p_1, p_2, \dots, p_t, \dots$ is $\xi$-stable if
$$\mathbb{E} \sum_{i \,:\, p_{t+1}(i) > p_t(i)} \left(p_{t+1}(i) - p_t(i)\right) \le \xi$$

Regret of base MAB $R_A(T, N, \eta)$ $\Longrightarrow$ regret of the Composite Loss Wrapper:
$$R_T \le T\xi + O\left(d \cdot R_A(T/d, N, \eta)\right)$$

Examples:
• Exp3 is $\xi$-stable with $\xi = \eta$ $\Longrightarrow$ $R_T = O\left(\sqrt{dNT \log N}\right)$
• The reduction is far more general (still pay a factor $\sqrt d$):
  – Combinatorial Bandits
  – Bandit/Linear Convex Optimization

Lower bound (for vanilla MAB): $R_T = \Omega\left(\sqrt{dNT}\right)$
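To see where the Exp3 rate above comes from, plug the standard Exp3 bound $R_{\mathrm{Exp3}}(T', N, \eta) \lesssim \frac{\ln N}{\eta} + \eta T' N$ into the reduction (a back-of-the-envelope check, constants handled loosely):
$$R_T \;\le\; T\xi + O\left(d \cdot R_{\mathrm{Exp3}}(T/d, N, \eta)\right) \;\lesssim\; \eta T + \frac{d \ln N}{\eta} + \eta T N \;\lesssim\; \eta T N + \frac{d \ln N}{\eta}$$
and choosing $\eta = \sqrt{d \ln N / (TN)}$ balances the two terms, giving $R_T = O(\sqrt{dNT \log N})$ as stated.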
Feedback graphs/1 [MS11,A+13,K+15]

$N$ actions for Player. Before the game starts, a sequence of feedback graphs $G_t = (V, E_t)$, $V = \{1, \dots, N\}$, is generated by an exogenous source (hidden to the player). All self-loops are included.

For $t = 1, 2, \dots$:
1. Losses $\ell_t(i) \in [0,1]$ are assigned by the opponent to every action $i = 1 \dots N$ (hidden to the player)
2. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Player gets feedback information: $\{\ell_t(j) : (I_t, j) \in E_t\}$
Feedback graphs/2: Exp3-IX Alg. [K+15]

At round $t$ pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\right), \qquad i = 1 \dots N$$
where
$$\hat\ell_s(i) = \begin{cases} \dfrac{\ell_s(i)}{\gamma_s + \Pr_s\left(\ell_s(i) \text{ is observed in round } s\right)} & \text{if } \ell_s(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$

• Note: the probability of observing the loss of an action $\ne$ the probability of playing that action
• Exponentially-weighted alg. with $\gamma_t$-biased (importance sampling) loss estimates $\hat\ell_t(i) \approx \ell_t(i)$
• Bias is controlled by $\gamma_t = 1/\sqrt t$
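In code, the only change relative to Exp3 is the denominator: the observation probability of arm $i$ (the total probability mass on its in-neighbors in $G_t$) plus the bias $\gamma_t$. A sketch, with the graph given as a boolean adjacency matrix (my own convention):

```python
import numpy as np

def ix_estimates(p, losses, observed_from, It, gamma):
    """Exp3-IX loss estimates for one round.

    p:             sampling distribution over the N arms at this round
    losses:        true losses of the N arms (only some are observed)
    observed_from: N x N boolean matrix; observed_from[j, i] is True if
                   playing j reveals the loss of i (self-loops included)
    It:            arm actually played
    gamma:         bias parameter, e.g. 1 / sqrt(t)
    """
    N = len(p)
    lhat = np.zeros(N)
    for i in range(N):
        if observed_from[It, i]:                 # loss of i is revealed
            q_i = p @ observed_from[:, i]        # Pr(observing loss of i)
            lhat[i] = losses[i] / (gamma + q_i)  # biased IW estimate
    return lhat
```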
Feedback graphs/3 [A+13,K+15]

Independence number $\alpha(G_t)$: disregard edge orientation.
$$1 \;\le\; \alpha(G_t) \;\le\; N$$
($\alpha(G_t) = 1$: clique, full info game; $\alpha(G_t) = N$: edgeless, bandit game)

Regret analysis:
$$R_T = O\left(\ln(TN) \sqrt{\sum_{t=1}^{T} \alpha(G_t)}\right)$$

If $G_t = G$ for all $t$: $R_T = \tilde O\left(\sqrt{T \alpha(G)}\right)$
Feedback graphs/4: Simple example

[Figure: revenue (= 1 − loss) as a function of the played reserve price $I_t$, with the two highest bids $b_1, b_2$ marked; the feedback structure $G_t$ is revealed after play.]

• Second-price auction with reserve (seller side), highest bid revealed to the seller (e.g. AppNexus)
• The auctioneer is a third party
• After the seller plays reserve price $I_t$, both the seller's revenue and the highest bid are revealed to him/her
• Seller/Player is in a position to observe all revenues for prices $j \ge I_t$
• $\alpha(G) = 1$: $R_T = O\left(\ln(TN)\sqrt T\right)$ (full info game up to logs)
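A sketch of why one play reveals a whole half-line of revenues, under the usual second-price-with-reserve rules (revenue $\max(b_2, r)$ if the reserve $r$ is at most the highest bid $b_1$, zero otherwise); the price grid is my own discretization:

```python
import numpy as np

def revealed_revenues(prices, reserve, b1, b2):
    """Revenues the seller can reconstruct after playing `reserve`.

    Second-price auction with reserve r: if b1 >= r the item sells at
    max(b2, r), otherwise revenue is 0. Knowing b1 and the realized
    revenue max(b2, reserve), the seller can recover the revenue of
    every price j >= reserve."""
    realized = max(b2, reserve) if b1 >= reserve else 0.0
    out = {}
    for j in prices:
        if j < reserve:
            continue                       # not revealed by this play
        out[j] = max(realized, j) if j <= b1 else 0.0
    return out

# Example: 5-point price grid, bids b1 = 0.8, b2 = 0.3, reserve 0.2 played.
print(revealed_revenues(np.linspace(0, 1, 5), 0.2, 0.8, 0.3))
```

This total-order feedback structure is exactly why $\alpha(G) = 1$: every price observes all prices above it.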
Goal of this presentation (outline recap): we now turn to contextual bandits for nonparametric policies.
Learning against Lipschitz policies/1

Ingredients:
• Context (metric) space $X$ (e.g., $X = \mathbb{R}^n$)
• Action (metric) space $Y$ (e.g., $Y = [0,1]$)
• Class of Lipschitz (and bounded) policies $F = \{f : X \to Y\}$
• (One-sided) Lipschitz loss functions $\ell_t : Y \to [0,1]$

Learning protocol(s):
• Opponent picks context $x_t \in X$ and loss $\ell_t(\cdot)$
• Player observes $x_t$, picks action $\hat y_t \in Y$, and incurs loss $\ell_t(\hat y_t)$
• Player observes:
  – $\ell_t(\hat y_t)$ only [bandit info: contextual bandit]
  – $\ell_t(y)$ for all $y \ge \hat y_t$ [one-sided full info: contextual one-sided expert]
  – $\ell_t(y)$ for all $y \in Y$ [full info: contextual expert]
Learning against Lipschitz policies/2

(Pseudo) Regret of Player for $T$ rounds w.r.t. $F$:
$$R_T(F) = \max_{f \in F} \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(\hat y_t) - \sum_{t=1}^{T} \ell_t(f(x_t))\right]$$

Want: $R_T = o(T)$ as $T$ grows large ("no regret") for any sequence of contexts $x_1, x_2, \dots, x_t, \dots$

Yardstick: value of the full info game [RST15]
$$V_T(F) = \sup_{x_1} \inf_{q_1 \in \Delta(Y)} \sup_{y_1} \mathbb{E}_{\hat y_1 \sim q_1} \cdots \sup_{x_T} \inf_{q_T \in \Delta(Y)} \sup_{y_T} \mathbb{E}_{\hat y_T \sim q_T} \left[\sum_{t=1}^{T} \ell(\hat y_t, y_t) - \min_{f \in F} \sum_{t=1}^{T} \ell(f(x_t), y_t)\right]$$

In particular, $F = \{f : [0,1]^n \to [0,1],\ f \text{ is 1-Lipschitz}\}$ gives
$$V_T(F) = \begin{cases} \tilde O\left(T^{\frac{n-1}{n}}\right) & \text{if } n \ge 2 \\ \tilde O(\sqrt T) & \text{if } n = 1 \end{cases}$$
Contextual bandit game: a folk algorithm [K04,S14,...]

[Figure: context space $X$ covered by $\epsilon$-balls centered at incoming contexts $x_1, x_2, \dots$; action space $Y$ discretized into $Y_\epsilon$.]

Each newly created ball centered at $x_t$ hosts an instance of Exp3 over the discretized action space $Y_\epsilon$:
• If $x_t$ falls outside every ball created so far, create a new ball centered at $x_t$
• Determine the active Exp3 instance by the past center $x_s$ closest to $x_t$
• Draw action $\hat y_t$ according to the active Exp3 instance and update its weights only

Remark: the number of balls never exceeds $T$. A sketch of this routing is given below.
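A sketch of the routing logic, assuming an Exp3 class with `sample()`/update methods like the one sketched earlier (the factory interface and the radius parameter are my own illustration):

```python
import numpy as np

class FolkContextualBandit:
    """One Exp3 instance per eps-ball over contexts (a minimal sketch)."""

    def __init__(self, make_exp3, eps):
        self.make_exp3 = make_exp3   # factory: () -> fresh Exp3 instance
        self.eps = eps
        self.centers = []            # ball centers x_s
        self.instances = []          # one Exp3 instance per ball

    def _active(self, x):
        if self.centers:
            dists = [np.linalg.norm(x - c) for c in self.centers]
            k = int(np.argmin(dists))
            if dists[k] <= self.eps:
                return k             # x falls in an existing ball
        # outside all balls so far: open a new one centered at x
        self.centers.append(x)
        self.instances.append(self.make_exp3())
        return len(self.centers) - 1

    def act(self, x):
        k = self._active(x)
        return k, self.instances[k].sample()   # only instance k gets updated
```

Since at most one ball is created per round, the number of balls (and the per-round routing work) stays $O(T)$, matching the remark above.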
Contextual bandit game: regret bounds [K04,S14,...]

• $n$ = metric dimension of $X$
• $1$ = metric dimension of $Y$

Then:
• Lipschitz losses: $\tilde O\left(T^{\frac{n+2}{n+3}}\right)$ [folk alg]
• Convex losses: $\tilde O\left(T^{\frac{n+1}{n+2}}\right)$ [folk alg + BEL16]
• Lower bound for $n = 0$ (no context): $\Omega\left(T^{2/3}\right)$ [B+11]

In all cases:
• Exploit finite coverability of $X$ and $Y$
• Set the radius $\epsilon$ appropriately

Very recent improvement in the finite action space case [FK18]: $\tilde O\left(T^{\frac{n}{n+1}}\right)$
Contextual one-sided expert game/1 [CB+17]

Using an Exp3-IX-like algorithm combined with the folk alg. on $\epsilon$-balls over $X$ yields regret
$$R_T(F) \;\lesssim\; \sqrt{T \ln N} + \epsilon T \quad \text{if the } \ell_t \text{ are (one-sided) Lipschitz}$$
$$\lesssim\; T^{\frac{n+1}{n+2}} \quad \text{when optimizing over } \epsilon$$

[Figure: the second-price auction example again; revenue (= 1 − loss) as a function of the played reserve price $I_t$, feedback structure $G_t$ revealed after play.]

Remark 1: No context ($n = 0$) case: $R_T(F) = \tilde O(\sqrt T)$
Remark 2: More general notions of one-sided Lipschitzness have recently been used in online optimization (dispersion condition) and in the regret analysis of auction algorithms ($\Delta_0$-Lipschitz) [F+18,B+18]

We can do better in the Lipschitz case.
Contextual one-sided expert game/2: Chaining/1 [CB+17]

Ideas of the algorithm: hierarchical covering of $F$ = a tree whose nodes are functions in $F$
• The nodes at each depth $m$ define a $2^{-m}$-covering of $F$ (down to a $2^{-M}$-covering at the leaves, level $M$)
• Any function $f^* \in F$ is represented by a unique path/chain in the tree
• Run an instance of Exp4 (adapted to one-sided expert feedback) on each node of the tree
• The instance $A_f$ at node $f$ uses the predictions of its child instances as expert advice
Contextual one-sided expert game/2: Chaining/2 [CB+17]

Key issues (Lipschitz losses):
• Small local ranges: losses associated with neighboring nodes are close
• Local version of Exp4 scaling with the loss range: possible because of the richer feedback
• Regret:
$$R_T(F) \;\lesssim\; \gamma T + \int_\gamma^1 \sqrt{\frac{T \ln N(F, \epsilon)}{\gamma}}\, d\epsilon \qquad \forall \gamma > 0$$
$$\lesssim\; T^{\frac{n}{n+1}} \quad \text{(when } F \text{ are Lipschitz on } [0,1]^n\text{)}$$
• Improvements when $F$ = Lipschitz functions on $[0,1]^n$; time-efficient algorithm (wavelet-based approximation):
  – Improved regret rate $T^{\frac{n-1/3}{n+2/3}}$
  – Running time per round: $\approx T^\alpha$, $\alpha < 2$
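As a sanity check on the $T^{n/(n+1)}$ rate above: for 1-Lipschitz functions on $[0,1]^n$ the metric entropy satisfies $\ln N(F, \epsilon) \asymp \epsilon^{-n}$ (a standard fact), so, ignoring constants and logs and taking $n > 2$,
$$\gamma T + \int_\gamma^1 \sqrt{\frac{T \epsilon^{-n}}{\gamma}}\, d\epsilon \;\asymp\; \gamma T + \sqrt T\, \gamma^{\frac{1-n}{2}}$$
and balancing the two terms at $\gamma = T^{-1/(n+1)}$ gives $R_T(F) \lesssim T^{\frac{n}{n+1}}$.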
Learning against Lipschitz policies

Bounds abound! Exponents of $T$:
• Contextual bandits:
  – General Lipschitz losses: $\frac{n+2}{n+3}$
  – Convex losses: $\frac{n+1}{n+2}$
  – General Lipschitz but finite actions: $\frac{n}{n+1}$ [FK18]
• Contextual one-sided:
  – General Lipschitz losses: $\frac{n}{n+1}$
  – One-sided Lipschitz losses: $\frac{n+1}{n+2}$
  – Rectangular context space and general Lipschitz losses ($n \ge 1$): $\frac{n-1/3}{n+2/3}$
• Contextual experts ($n \ge 2$): $\frac{n-1}{n}$ (tight) [RST15]
Conclusions and open questions

• Recent activity in nonstochastic bandit problems
• Several combinations are possible

Some open questions. In the composite anonymous feedback setting:
• Time-varying delay $d$
• Fully adaptive adversaries (partially adaptive is still possible)

In learning with Lipschitz policies:
• Tighter upper bounds with efficient algorithms:
  – the folk approach need not capture the complexity of $F$
  – covering $F$ in function space does the job, but the algorithms are not efficient
• Lower bounds