Nonstochastic Bandit Bros: Vanilla, Partial, Delayed, Composite, Contextual

Claudio Gentile
INRIA and Google NY
[email protected]

Toulouse September 13th, 2018

Based on joint work with: N. Alon, N. Cesa-Bianchi, P. Gaillard, S. Gerchinovitz, Y. Mansour, S. Mannor, O. Shamir


Goal of this presentation

Recent activity in the analysis of bandit problems in nonstochastic settings, under various modeling assumptions and kinds of available feedback.

Outline:
• Nonstochastic bandit game:
  – vanilla
  – delayed
  – composite anonymous
  – graph
• Contextual bandits for nonparametric policies
Examples thereof


Nonstochastic bandit game/1

N actions for Player
[Figure: the N actions; only the loss of the played action (e.g. 0.3) is revealed]

For t = 1, 2, . . . :
1. Losses $\ell_t(i) \in [0,1]$ are assigned by opponent to every action $i = 1 \ldots N$ (hidden to player)
2. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Player gets feedback information: $\ell_t(I_t)$


Nonstochastic bandit game/2

Goal [external regret]: Given T rounds, Player's total loss
$$\sum_{t=1}^T \ell_t(I_t)$$
must be close to that of the single best action in hindsight for Player.

(Pseudo) Regret of Player for T rounds:
$$R_T = \max_{i=1\ldots N} \; \mathbb{E}\left[\sum_{t=1}^T \ell_t(I_t) - \sum_{t=1}^T \ell_t(i)\right]$$

Want: $R_T = o(T)$ as T grows large ("no regret")
Lower bound: $\Omega(\sqrt{TN})$

Regret:
$$R_T^* = \max_{i=1\ldots N} \left(\sum_{t=1}^T \ell_t(I_t) - \sum_{t=1}^T \ell_t(i)\right)$$

Want: $R_T^* = o(T)$ as T grows large w.h.p.
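
A minimal simulation of this protocol and of the regret computation may help fix notation (the loss table, the uniform placeholder policy, and the horizon below are illustrative choices, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 1000, 5

# Oblivious opponent: the whole loss table is fixed in advance, hidden from the player.
losses = rng.uniform(size=(T, N))          # losses[t, i] = loss of action i at round t

played = np.empty(T, dtype=int)
for t in range(T):
    I_t = rng.integers(N)                  # placeholder policy: pick an action at random
    played[t] = I_t
    feedback = losses[t, I_t]              # bandit feedback: only the played action's loss

player_loss = losses[np.arange(T), played].sum()
best_fixed_action_loss = losses.sum(axis=0).min()    # single best action in hindsight
print("regret:", player_loss - best_fixed_action_loss)
```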

Nonstochastic bandit game/3: Exp3 Alg. [Auer et al. 02]

At round t pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(i)\right), \qquad i = 1 \ldots N$$
where
$$\hat{\ell}_s(i) = \begin{cases} \dfrac{\ell_s(i)}{\Pr_s(\ell_s(i) \text{ is observed in round } s)} & \text{if } \ell_s(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$

• Only one nonzero component in $\hat{\ell}_t$
• Exponentially-weighted alg with (importance sampling) loss estimates $\hat{\ell}_t(i) \approx \ell_t(i)$
• Upper bound on regret: $R_T \le \sqrt{TN\ln N}$
• Improved upper bound: $O(\sqrt{TN})$ (the INF alg.) [AB09]
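
A sketch of Exp3 as described above; the learning-rate tuning shown is a standard choice for the $\sqrt{TN\ln N}$ bound, not something the slide fixes:

```python
import numpy as np

def exp3(losses, eta, rng):
    """Exp3 on an oblivious loss table `losses` of shape (T, N); returns the played actions."""
    T, N = losses.shape
    cum_est = np.zeros(N)                 # cumulative importance-sampled loss estimates
    played = np.empty(T, dtype=int)
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))   # subtract min for numerical stability
        p = w / w.sum()
        I = rng.choice(N, p=p)
        played[t] = I
        # Only the played action's loss is observed; inverse-propensity weight it.
        cum_est[I] += losses[t, I] / p[I]
    return played

rng = np.random.default_rng(0)
T, N = 10_000, 10
losses = rng.uniform(size=(T, N))
eta = np.sqrt(np.log(N) / (T * N))        # standard tuning for the sqrt(T N ln N) regret bound
actions = exp3(losses, eta, rng)
```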

Nonstochastic bandit game with delay/1

For t = 1, 2, . . . :
1. Losses $\ell_t(i)$ are assigned by opponent to every action $i = 1 \ldots N$ (hidden to player)
2. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Player gets delayed feedback information: $\ell_{t-d}(I_{t-d})$ [for t > d]

Lower bound: $R_T = \Omega\bigl(\sqrt{T(d+N)}\bigr)$


Nonstochastic bandit game with delay/2 [CB+16]

Upper bound:
• Use importance-sampling estimate within Exp3, and update as soon as the loss becomes available:
$$\hat{\ell}_t(i) = \begin{cases} \dfrac{\ell_{t-d}(i)}{\Pr_{t-d}(I_{t-d}=i)} & \text{if } I_{t-d} = i \\ 0 & \text{otherwise} \end{cases}$$
• Cumulative regret (matching lower bound up to logs):
$$R_T = \tilde{O}\left(\sqrt{T(d+N)}\right)$$

Unknown delays: [Li+18]
Collect (delayed) loss observations at time t, but use $\Pr_t$ instead of $\Pr_{t-d}$
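
A sketch of the delayed update of [CB+16]: the importance weight stored at play time ($\Pr_{t-d}$) is used when the loss arrives d rounds later. The fixed delay, buffer, and tuning below are illustrative; the unknown-delay variant of [Li+18] would divide by the current probability instead.

```python
import numpy as np
from collections import deque

def exp3_delayed(losses, d, eta, rng):
    """Exp3 with feedback delayed by d rounds: the loss of round t-d is observed at round t."""
    T, N = losses.shape
    cum_est = np.zeros(N)
    pending = deque()                      # (round s, played action I_s, probability p_s(I_s))
    played = np.empty(T, dtype=int)
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        I = rng.choice(N, p=p)
        played[t] = I
        pending.append((t, I, p[I]))
        if t >= d:                         # feedback for round t-d becomes available now
            s, I_s, p_s = pending.popleft()
            cum_est[I_s] += losses[s, I_s] / p_s   # weight by Pr_{t-d}, as in [CB+16]
    return played
```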

Composite anonymous feedback/1 [D+14,A+15,PB+17,CB+18]

• Loss of an action is not charged immediately but spread arbitrarily over d consecutive steps
• Generalizes d-delayed feedback
• Several motivating examples in online businesses:
  – impression resulting in immediate clickthrough, later followed by conversion
  – user interacting with a recommended item (e.g. media content) multiple times over several days
• Loss observed by player at time t is a composite loss, i.e. the sum of d loss components (the accumulated effect of the d most recent actions):
$$\ell^{(0)}_t(I_t) + \ell^{(1)}_{t-1}(I_{t-1}) + \ldots + \ell^{(d-1)}_{t-d+1}(I_{t-d+1})$$
where $\ell^{(s)}_{t-s}(I_{t-s})$ = s-th loss component from action $I_{t-s}$

Composite anonymous feedback/2 [D+14,A+15,PB+17,CB+18]

[Figure: N = 3 actions, d = 4 loss components. At time t the player has played $I_{t-3}=2$, $I_{t-2}=3$, $I_{t-1}=1$, $I_t=2$; the loss of each played action is split into d components in [0,1], spread over the following rounds, and at each round the player only observes their sum.]


Composite anonymous feedback/3 [D+14,A+15,PB+17,CB+18]

For t = 1, 2, . . . :
1. Losses $\ell_t(i) \in [0,1]$ are assigned (obliviously) by opponent to every action $i = 1 \ldots N$ (hidden to player)
2. Losses $\ell_t(i)$ broken up into d components (arbitrarily but obliviously):
$$\ell_t(i) = \ell^{(0)}_t(i) + \ell^{(1)}_t(i) + \ldots + \ell^{(d-1)}_t(i)$$
3. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
4. Player gets composite loss feedback information:
$$\ell^{o}_t(I_{t-d+1} \ldots I_t) = \ell^{(0)}_t(I_t) + \ell^{(1)}_{t-1}(I_{t-1}) + \ldots + \ell^{(d-1)}_{t-d+1}(I_{t-d+1})$$
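
A small sketch of how the composite observation is assembled from the hidden per-round decompositions (the random Dirichlet split and random actions are placeholders; the adversary's decomposition is arbitrary but oblivious):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, d = 12, 3, 4

losses = rng.uniform(size=(T, N))
# Split each loss l_t(i) into d nonnegative components summing to l_t(i).
splits = rng.dirichlet(np.ones(d), size=(T, N))      # shape (T, N, d)
components = losses[:, :, None] * splits             # components[t, i, s] = l_t^{(s)}(i)

actions = rng.integers(N, size=T)                     # stand-in for the player's choices
for t in range(T):
    # Observed at time t: the s-th component of the action played s rounds earlier.
    obs = sum(components[t - s, actions[t - s], s] for s in range(d) if t - s >= 0)
    print(t, round(float(obs), 3))
```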

Composite Loss Wrapper [CB+18]

• Take Base MAB(η) as input
• $I_0 \sim p_1$ = uniform on actions $1 \ldots N$

Interleave update (up), draw (dr), stay (st) rounds:
  ... up dr st st st ... st | up dr st st st ... st | up dr st st st ... st ...
Stretch of stay rounds: $2d-2 + \mathrm{Geom}(1/(2d))$ long (so each stretch is at least 2d−2 rounds)

• draw round: $I_t \sim p_t$ without updating $p_t$
• stay round: $I_t = I_{t-1}$ without updating $p_t$
• update round: $I_t = I_{t-1}$, but $p_t \to p_{t+1}$ by feeding Base MAB with the average composite loss
$$\bar{\ell}_t = \frac{1}{2d} \sum_{\tau=t-d+1}^{t} \ell^{o}_\tau(I_{\tau-d+1} \ldots I_\tau)$$

[Figure: the composite-loss picture from Composite anonymous feedback/2, repeated]
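
A sketch of the round-type schedule only; how the base MAB is fed the averaged composite loss on update rounds follows [CB+18] and is omitted here. The geometric stretch length matches the slide; everything else is illustrative.

```python
import numpy as np

def round_schedule(T, d, rng):
    """Yield 'update', 'draw', then a stretch of 'stay' rounds of length 2d-2 + Geom(1/(2d))."""
    t = 0
    while t < T:
        for kind in ("update", "draw"):
            if t >= T:
                return
            yield kind
            t += 1
        stay_len = (2 * d - 2) + rng.geometric(1.0 / (2 * d))
        for _ in range(stay_len):
            if t >= T:
                return
            yield "stay"
            t += 1

rng = np.random.default_rng(0)
print(list(round_schedule(T=40, d=3, rng=rng)))
```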

Stability and regret bounds [CB+18]

Stability: Base MAB A(η) generating $p_1, p_2, \ldots, p_t, \ldots$ is ξ-stable if
$$\mathbb{E}\left[\sum_{i \,:\, p_{t+1}(i) > p_t(i)} \bigl(p_{t+1}(i) - p_t(i)\bigr)\right] \le \xi$$

Regret of Base MAB: $R_A(T, N, \eta)$ $\Longrightarrow$ regret of Composite Loss Wrapper
$$R_T \le T\xi + O\bigl(d \cdot R_A(T/d, N, \eta)\bigr)$$

Examples:
• Exp3 is ξ-stable with ξ = η $\Longrightarrow$ $R_T = O\bigl(\sqrt{dNT\log N}\bigr)$
• Reduction is far more general (still pay factor $\sqrt{d}$):
  – Combinatorial Bandits
  – Bandit/Linear Convex Optimization

Lower bound (for vanilla MAB): $R_T = \Omega\bigl(\sqrt{dNT}\bigr)$
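
The stability quantity is easy to track empirically for a given base MAB: compare consecutive distributions and sum the probability mass gained by the actions whose weight went up. Below is a sketch for Exp3 with uniformly random observed losses (all choices illustrative).

```python
import numpy as np

def upward_shift(p_old, p_new):
    """Mass gained by actions whose probability increased: the quantity bounded by xi."""
    diff = p_new - p_old
    return diff[diff > 0].sum()

rng = np.random.default_rng(0)
N, T, eta = 10, 5000, 0.01
cum_est = np.zeros(N)
shifts = []
for t in range(T):
    w = np.exp(-eta * (cum_est - cum_est.min()))
    p = w / w.sum()
    I = rng.choice(N, p=p)
    cum_est[I] += rng.uniform() / p[I]     # Exp3 importance-weighted update on a random loss
    w_next = np.exp(-eta * (cum_est - cum_est.min()))
    shifts.append(upward_shift(p, w_next / w_next.sum()))
print("mean upward shift:", float(np.mean(shifts)), "eta:", eta)
```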

Feedback graphs/1 [MS11,A+13,K+15]

N actions for Player
Before the game starts, a sequence of feedback graphs $G_t = (V, E_t)$, $V = \{1, \ldots, N\}$, is generated by an exogenous source (hidden to player). All self-loops included.
[Figure: directed graph over the N actions]

For t = 1, 2, . . . :
1. Losses $\ell_t(i) \in [0,1]$ are assigned by opponent to every action $i = 1 \ldots N$ (hidden to player)
2. Player picks action $I_t$ (possibly using randomization) and incurs loss $\ell_t(I_t)$
3. Player gets feedback information: $\{\ell_t(j) : (I_t, j) \in E_t\}$


Feedback graphs/2: Exp3-IX Alg. [K+15]

At round t pick action $I_t = i$ with probability proportional to
$$\exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(i)\right), \qquad i = 1 \ldots N$$
where
$$\hat{\ell}_s(i) = \begin{cases} \dfrac{\ell_s(i)}{\gamma_s + \Pr_s(\ell_s(i) \text{ is observed in round } s)} & \text{if } \ell_s(i) \text{ is observed} \\ 0 & \text{otherwise} \end{cases}$$

• Note: prob. of observing the loss of an action ≠ prob. of playing that action
• Exponentially-weighted alg with $\gamma_t$-biased (importance sampling) loss estimates $\hat{\ell}_t(i) \approx \ell_t(i)$
• Bias is controlled by $\gamma_t = 1/\sqrt{t}$
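
A sketch of the biased estimate with graph feedback: the observation probability of action j is the total probability mass of actions with an edge into j, and γ is added in the denominator as on the slide. Graphs and losses are left abstract; only the estimate follows the slide.

```python
import numpy as np

def exp3_ix(losses, neighbors, eta, rng):
    """losses: (T, N) array; neighbors[t][i] = set of actions observed when playing i in G_t
    (out-neighborhood, self-loops included). Returns the played actions."""
    T, N = losses.shape
    cum_est = np.zeros(N)
    played = np.empty(T, dtype=int)
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        I = rng.choice(N, p=p)
        played[t] = I
        gamma = 1.0 / np.sqrt(t + 1)                   # decreasing bias, as on the slide
        for j in neighbors[t][I]:                      # losses observed this round
            # Probability that the loss of j is observed: mass of actions pointing to j.
            q_j = sum(p[i] for i in range(N) if j in neighbors[t][i])
            cum_est[j] += losses[t, j] / (gamma + q_j)
    return played
```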

Feedback graphs/3 [A+13,K+15]

Independence number $\alpha(G_t)$: disregard edge orientation
$$1 \le \alpha(G_t) \le N$$
($\alpha(G_t) = 1$: clique, full info game; $\alpha(G_t) = N$: edgeless, bandit game)

Regret analysis:
$$R_T = O\left(\ln(TN)\sqrt{\sum_{t=1}^T \alpha(G_t)}\right)$$

If $G_t = G$ for all t:
$$R_T = \tilde{O}\left(\sqrt{T\,\alpha(G)}\right)$$
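
α(G) is NP-hard to compute in general, but for the small examples on these slides a brute-force search over vertex subsets is enough (sketch; edges are given as pairs, orientation disregarded):

```python
from itertools import combinations

def independence_number(n_nodes, edges):
    """Largest set of nodes with no edge between any two of them (self-loops ignored)."""
    undirected = {frozenset(e) for e in edges if e[0] != e[1]}
    for size in range(n_nodes, 0, -1):
        for subset in combinations(range(n_nodes), size):
            if all(frozenset(pair) not in undirected for pair in combinations(subset, 2)):
                return size
    return 0

clique = [(i, j) for i in range(4) for j in range(4) if i != j]
print(independence_number(4, clique))   # 1: full-information game
print(independence_number(4, []))       # 4 = N: bandit game
```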

Feedback graphs/4: Simple example

[Figure: seller revenue (= 1 − loss) as a function of the played reserve price $I_t$, with bids $b_1$, $b_2$; feedback structure $G_t$ revealed after play]

• Second-price auction with reserve (seller side), highest bid revealed to seller (e.g. AppNexus)
• Auctioneer is a third party
• After seller plays reserve price $I_t$, both the seller's revenue and the highest bid are revealed to him/her
• Seller/Player is in a position to observe all revenues for prices $j \ge I_t$
• $\alpha(G) = 1$: $R_T = O\bigl(\ln(TN)\sqrt{T}\bigr)$ (full info game up to logs)
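
A concrete rendering of the observation structure in this example (hypothetical discretized price grid; for brevity the sketch computes revenues directly from both bids, whereas under the slide's feedback the seller would reconstruct the same values for every price at or above the one played from his own revenue and the revealed highest bid):

```python
def second_price_revenue(reserve, b1, b2):
    """Seller revenue with a reserve price: a sale happens iff the highest bid meets the reserve."""
    return max(reserve, b2) if b1 >= reserve else 0.0

def observed_revenues(played_reserve, prices, b1, b2):
    """Feedback after playing `played_reserve`: the revenue of every price j >= played reserve."""
    return {r: second_price_revenue(r, b1, b2) for r in prices if r >= played_reserve}

prices = [round(0.1 * k, 1) for k in range(11)]       # hypothetical price grid in [0, 1]
print(observed_revenues(0.3, prices, b1=0.8, b2=0.45))
```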


Learning against Lipschitz policies/1

Ingredients:
• Context (metric) space $\mathcal{X}$ (e.g., $\mathcal{X} = \mathbb{R}^n$)
• Action (metric) space $\mathcal{Y}$ (e.g., $\mathcal{Y} = [0,1]$)
• Class of Lipschitz (and bounded) policies $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$
• (One-sided) Lipschitz loss functions $\ell_t : \mathcal{Y} \to [0,1]$

Learning protocol(s):
• Opponent picks context $x_t \in \mathcal{X}$ and loss $\ell_t(\cdot)$
• Player observes $x_t$, picks action $\hat{y}_t \in \mathcal{Y}$, and incurs loss $\ell_t(\hat{y}_t)$
• Player observes:
  – $\ell_t(\hat{y}_t)$ only   [bandit info: contextual bandit]
  – $\ell_t(y)$ for all $y \ge \hat{y}_t$   [one-sided full info: contextual one-sided expert]
  – $\ell_t(y)$ for all $y \in \mathcal{Y}$   [full info: contextual expert]
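
A sketch of the three feedback regimes over a discretized action grid (grid, loss, and played action are placeholders; only the shape of the revealed sets follows the slide):

```python
import numpy as np

Y = np.linspace(0.0, 1.0, 11)                 # discretized action space

def feedback(loss_fn, y_hat, mode):
    """Losses revealed to the player under the three protocols."""
    if mode == "bandit":                      # contextual bandit: played action only
        return {round(float(y_hat), 2): loss_fn(y_hat)}
    if mode == "one_sided":                   # contextual one-sided expert: all y >= y_hat
        return {round(float(y), 2): loss_fn(y) for y in Y if y >= y_hat}
    if mode == "full":                        # contextual expert: every action
        return {round(float(y), 2): loss_fn(y) for y in Y}
    raise ValueError(mode)

loss_fn = lambda y: abs(y - 0.37)             # placeholder 1-Lipschitz loss
print(feedback(loss_fn, 0.6, "one_sided"))
```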

Learning against Lipschitz policies/2

(Pseudo) Regret of Player for T rounds w.r.t. $\mathcal{F}$:
$$R_T(\mathcal{F}) = \max_{f \in \mathcal{F}} \; \mathbb{E}\left[\sum_{t=1}^T \ell_t(\hat{y}_t) - \sum_{t=1}^T \ell_t(f(x_t))\right]$$

Want: $R_T = o(T)$ as T grows large ("no regret") for any sequence of contexts $x_1, x_2, \ldots, x_t, \ldots$

Yardstick: Value of the full info game [RST15]
$$V_T(\mathcal{F}) = \sup_{x_1} \inf_{q_1 \in \Delta(\mathcal{Y})} \sup_{y_1} \mathbb{E}_{\hat{y}_1 \sim q_1} \cdots \sup_{x_T} \inf_{q_T \in \Delta(\mathcal{Y})} \sup_{y_T} \mathbb{E}_{\hat{y}_T \sim q_T} \left[\sum_{t=1}^T \ell(\hat{y}_t, y_t) - \min_{f \in \mathcal{F}} \sum_{t=1}^T \ell(f(x_t), y_t)\right]$$

In particular: $\mathcal{F} = \{f : [0,1]^n \to [0,1],\ f \text{ is 1-Lipschitz}\}$ gives
$$V_T(\mathcal{F}) = \begin{cases} \tilde{O}\bigl(T^{\frac{n-1}{n}}\bigr) & \text{if } n \ge 2 \\ \tilde{O}(\sqrt{T}) & \text{if } n = 1 \end{cases}$$

Contextual bandit game: a folk algorithm [K04,S14,...]

[Figure: contexts $x_1, x_2, \ldots$ arriving in $\mathcal{X}$, covered by ε-balls; each ball hosts an EXP3 instance over a discretized $\mathcal{Y}$]

Each newly created ball centered in $x_t$ hosts an instance of EXP3 over the discretized action space $\mathcal{Y}$
• If $x_t$ falls outside every ball created so far, create a new ball centered on $x_t$
• Determine the active EXP3 instance via the past center $x_s$ closest to $x_t$
• Draw action $\hat{y}_t$ according to the active EXP3 and update its weights only
Remark: the number of balls never exceeds T

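
A compact sketch of the folk algorithm: ε-balls over contexts, one Exp3 per ball over a discretized action grid. The Euclidean metric, grid, and learning rate are illustrative choices.

```python
import numpy as np

class FolkContextualBandit:
    """epsilon-balls over contexts; each ball hosts its own Exp3 over a discretized action grid."""

    def __init__(self, eps, actions, eta, rng):
        self.eps, self.actions, self.eta, self.rng = eps, actions, eta, rng
        self.centers, self.cum_est = [], []              # ball centers, per-ball Exp3 estimates

    def _active_ball(self, x):
        if self.centers:
            dists = [np.linalg.norm(x - c) for c in self.centers]
            j = int(np.argmin(dists))
            if dists[j] <= self.eps:                     # closest existing center covers x
                return j
        self.centers.append(x)                           # x outside every ball: open a new one
        self.cum_est.append(np.zeros(len(self.actions)))
        return len(self.centers) - 1

    def act(self, x):
        j = self._active_ball(x)
        est = self.cum_est[j]
        w = np.exp(-self.eta * (est - est.min()))
        p = w / w.sum()
        k = self.rng.choice(len(self.actions), p=p)
        return j, k, p[k], self.actions[k]

    def update(self, j, k, prob, loss):                  # update only the active ball's Exp3
        self.cum_est[j][k] += loss / prob

rng = np.random.default_rng(0)
alg = FolkContextualBandit(eps=0.1, actions=np.linspace(0, 1, 21), eta=0.05, rng=rng)
x_t = rng.uniform(size=2)                                # context in [0,1]^2
j, k, prob, y_hat = alg.act(x_t)
alg.update(j, k, prob, loss=abs(y_hat - 0.5))            # placeholder loss
```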

Contextual bandit game: regret bounds [K04,S14,...]

• n = metric dimension of $\mathcal{X}$
• 1 = metric dimension of $\mathcal{Y}$

Then:
• Lipschitz losses: $\tilde{O}\bigl(T^{\frac{n+2}{n+3}}\bigr)$   [folk alg]
• Convex losses: $\tilde{O}\bigl(T^{\frac{n+1}{n+2}}\bigr)$   [folk alg + BEL16]
• Lower bound for n = 0 (no context): $\Omega\bigl(T^{2/3}\bigr)$   [B+11]

In all cases:
• Exploit finite coverability of $\mathcal{X}$ and $\mathcal{Y}$
• Set radius ε appropriately

Very recent improvement in the finite action space case [FK18]: $\tilde{O}\bigl(T^{\frac{n}{n+1}}\bigr)$

Contextual one-sided expert game/1 [CB+17]

Using an Exp3-IX-like algorithm combined with the folk alg on ε-balls over $\mathcal{X}$ yields regret
$$R_T(\mathcal{F}) \lesssim \sqrt{T \ln N} + T\varepsilon \quad \text{if the } \ell_t \text{ are (one-sided) Lipschitz}$$
$$\lesssim T^{\frac{n+1}{n+2}} \quad \text{when optimizing over } \varepsilon$$

[Figure: second-price auction revenue picture, as in Feedback graphs/4]

Remark 1: No context (n = 0) case: $R_T(\mathcal{F}) = \tilde{O}(\sqrt{T})$
Remark 2: More general notions of one-sided Lipschitzness have recently been used in online optimization (dispersion condition) and in the regret analysis of auction algorithms ($\Delta_0$-Lipschitz) [F+18,B+18]

We can do better in the Lipschitz case

Contextual one-sided expert game/2: Chaining/1 [CB+17]

Ideas of the algorithm: hierarchical covering of $\mathcal{F}$ = tree whose nodes are functions in $\mathcal{F}$
• The nodes at each depth m define a $2^{-m}$-covering of $\mathcal{F}$
• Any function $f^* \in \mathcal{F}$ is represented by a unique path/chain in the tree
• Run an instance of Exp4 (adapted to one-sided expert feedback) on each node of the tree
• Instance $A_f$ at node f uses the predictions of the child instances as expert advice

[Figure: covering tree; the nodes at level m form a $2^{-m}$-covering of $\mathcal{F}$, those at level m+1 a $2^{-(m+1)}$-covering, down to the leaves at level M, a $2^{-M}$-covering; each node runs Exp4]

Contextual one-sided expert game/2: Chaining/2 [CB+17]

Key issues (Lipschitz losses):
• Small local ranges: losses associated with neighboring nodes are close
• Local version of Exp4 scaling with the loss range: possible because of the richer feedback
• Regret:
$$R_T(\mathcal{F}) \lesssim \gamma T + \int_\gamma^1 \sqrt{T \ln N(\mathcal{F}, \varepsilon)}\, d\varepsilon \qquad \forall \gamma > 0$$
$$\lesssim T^{\frac{n}{n+1}} \quad \text{(when } \mathcal{F} \text{ are Lipschitz on } [0,1]^n\text{)}$$
• Improvements when $\mathcal{F}$ = Lipschitz functions on $[0,1]^n$: time-efficient algorithm (wavelet-based approx.) with
  – Improved regret rate $T^{\frac{n-1/3}{n+2/3}}$
  – Running time per round: $\approx T^\alpha$, $\alpha < 2$

Learning against Lipschitz policies

Bounds abound! Exponents of T:
• Contextual bandits:
  – General Lipschitz losses: $\frac{n+2}{n+3}$
  – Convex losses: $\frac{n+1}{n+2}$
  – General Lipschitz but finite actions: $\frac{n}{n+1}$   [FK18]
• Contextual one-sided:
  – General Lipschitz losses: $\frac{n}{n+1}$
  – One-sided Lipschitz losses: $\frac{n+1}{n+2}$
  – Rectangular context space and general Lipschitz losses (n ≥ 1): $\frac{n-1/3}{n+2/3}$
• Contextual experts (n ≥ 2): $\frac{n-1}{n}$ (tight)   [RST15]

Conclusions and open questions

• Recent activity in nonstochastic bandit problems
• Several combinations are possible

Some open questions

In composite anonymous feedback:
• Time-varying delay d
• Fully adaptive adversaries (partially adaptive is still possible)

In learning with Lipschitz policies:
• Tighter upper bounds with efficient algorithms:
  – the folk approach need not capture the complexity of $\mathcal{F}$
  – covering $\mathcal{F}$ in function space does the job, but the resulting algorithms are not efficient
• Lower bounds