Bayesian optimization and Gaussian process bandits: Theory and Applications
Andreas Krause
Workshop on Optimization and Learning, CIMI Toulouse

Navigating the Protein Fitness Landscape [Romero, K, Arnold PNAS ‘13]

Want to engineer proteins with desirable properties:
- Vaccine design
- Contrast agents
- ...

Need experiments! Sequence space is vast.

How can we design experiments to find good sequences?

Designing P450s chimeras [Romero, K, Arnold PNAS ‘13]

[Figure: design space built from parent sequences A, B, C, yielding candidate designs 1, 2, 3, ..., n]

Protein Fitness Landscape

[Figure: thermostability as a function over the design space; a few evaluated designs marked x]

How can we experiment to learn and optimize thermostability?

Automatic Machine Learning

[Cf. Snoek et al ’12; Google Vizier, Golovin et al ’17]

[Figure: validation performance as a function of the hyperparameters; evaluated settings marked x]

How can we automatically tune model & hyperparameters?

Recommender Systems

How can we recommend to maximize relevance (clicks)?

Explore-exploit in Recommendation
[cf Li et al ’10, Vanchinathan et al ’14]

[Figure: relevance (CTR) as a function over items such as Economics, Politics, Sport; evaluated items marked x]

How can we recommend to learn and optimize relevance?

Exploration—Exploitation Tradeoffs

Numerous applications require trading off experimentation (exploration) and optimization (exploitation):
- Experimental design
- Recommender systems
- Online advertising
- Automatic ML
- Robotic control

Often:
- #alternatives >> #trials
- experiments are noisy & expensive
- similar alternatives have similar performance

Can one exploit this regularity?

Outline
- Motivating examples and problem setting
- Review of Gaussian processes
- GP bandits and Bayesian optimization
- More complex settings: parallelization, multi-task / contextual optimization, level sets, multi-objective optimization, high dimensions, constraints and “safe” Bayesian optimization

k-armed (stochastic) bandits

[Figure: k slot-machine arms with unknown mean payoffs f1, f2, f3, ..., fk]

- Sequentially allocate T tokens to k “arms” of a slot machine
- Each time: pick arm i; get an iid payoff with (unknown) mean fi
- Want to maximize the expected cumulative reward
- Classical model of the exploration–exploitation tradeoff
- Has been extensively studied (since Robbins ’52)
- In some cases, one can calculate the optimal allocation (Gittins ’79)
- Tight bounds on cumulative regret (Auer et al ’02, …)
- Very successful in applications (e.g., drug trials, scheduling, …)

Typically assume every “arm” is tried multiple times.

∞-armed bandits

[Figure: infinitely many arms f1, f2, ..., f∞]

In many domains, the number of choices is very large:
- Space of parameters for possible lab experiments or NN architectures
- Recommender systems
- Policy parameters for robotic control

- Can’t even try every choice once!
- Classical algorithms don’t scale, and guarantees become useless
- Substantial work on “structured” bandits (linear, Lipschitz, combinatorial, networked, etc.)

Another viewpoint: Bayesian Optimization [Močkus ’75]

[Figure: loop between the acquisition function, the chosen point x_t, and the unknown function f]

y_t = f(x_t) + ε_t

Acquisition functions: expected/most probable improvement [Močkus et al. ’78, ’89], information gain about the maximum [Villemonteix et al. ’09], knowledge gradient [Powell et al. ’10], Predictive Entropy Search [Hernández-Lobato et al. ’14], TruVaR [Bogunovic et al ’17], Max-Value Entropy Search [Wang et al ’17]

Bandits vs Bayesian optimization

(Stochastic) Bandits:
- Finite arms [Robbins ’52, Gittins ’79, Auer et al ’02, ...]
- Linear objectives [Dani et al. ’08; Rusmevichientong & Tsitsiklis ’08]
- Lipschitz objectives [Slivkins et al. ’08, Bubeck et al. ’08], ...
- Strong theory
- Not as “flexible”
- (Often) frequentist
- Contextual, dueling, ...

Bayesian optimization:
- Sample a Bayesian (GP) model of f acc. to Expected Improvement [Močkus et al. ’78], Most Probable Improvement [Močkus ’89], information gain about the maximum [Villemonteix et al. ’09], knowledge gradient [Powell et al. ’10], …
- Little theory
- Highly configurable
- Bayesian
- Parallel, multi-fidelity, ...

Combine insights to get the best of both worlds!

Learning to optimize

Given: a set of possible inputs D; noisy black-box access to an unknown function f ∈ F, f : D → R
Task: choose inputs x_1, ..., x_T from D
After each selection, observe y_t = f(x_t) + ε_t

Cumulative regret:  R_T = Σ_{t=1}^T [ max_x f(x) − f(x_t) ]
Sublinear if R_T / T → 0

Simple regret:  S_T = min_{t ∈ {1,...,T}} [ max_x f(x) − f(x_t) ]

Note that S_T ≤ R_T / T.
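As a concrete illustration (not from the slides), a minimal Python sketch of how both regret notions are computed from an evaluation trace, assuming the optimum value f_star is known for evaluation purposes:

```python
import numpy as np

def regrets(f_values, f_star):
    """Compute cumulative and simple regret from a trace of
    true function values f(x_1), ..., f(x_T).

    f_values: array of f(x_t) for the points queried so far
    f_star:   max_x f(x), assumed known for evaluation only
    """
    inst = f_star - np.asarray(f_values)   # instantaneous regret per round
    R_T = inst.sum()                       # cumulative regret
    S_T = inst.min()                       # simple regret
    return R_T, S_T

# Example: S_T <= R_T / T holds on any trace
R, S = regrets([0.2, 0.7, 0.9], f_star=1.0)
assert S <= R / 3
```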



Brief review of Gaussian Processes


Gaussian processes

[c.f. Rasmussen & Williams 2006]

Prior P(f); likelihood P(data | f) → posterior P(f | data)

[Figure: sample functions f(x) from the prior, some likely, some unlikely; after conditioning on observations (+), functions inconsistent with the data become unlikely]

Predictive uncertainty + tractable inference

Gaussian Processes

[Figure: normal distribution (1-D Gaussian), multivariate normal (n-D Gaussian), Gaussian process (∞-D Gaussian)]

- Gaussian process (GP) = normal distribution over functions
- Finite marginals are multivariate Gaussians
- Closed-form formulae for the Bayesian posterior update exist
- Parameterized by the covariance function K(x,x’) = Cov(f(x), f(x’))

Gaussian process

A Gaussian Process (GP) is an (infinite) set of random variables, indexed by some set X, i.e., for each x ∈ X there is a random variable Y_x.

There exist functions µ : X → R and K : X × X → R such that for all A ⊆ X, A = {x_1, ..., x_k}, it holds that

Y_A = [Y_{x_1}, ..., Y_{x_k}] ~ N(µ_A, Σ_AA),  where Σ_AA = [K(x_i, x_j)]_{i,j}

K is called the kernel (covariance) function; µ is called the mean function.

Predictive confidence in GPs

[Figure: joint Gaussian over f(x) and f(x’), with marginals P(f(x)) and P(f(x’))]

Typically, we only care about marginals, i.e.,

P(f(x)) = N(f(x); µ_t(x), σ_t²(x))

Parameterized by the covariance function K(x,x’) = Cov(f(x), f(x’))

Kernel functions

- K must be symmetric: K(x,x’) = K(x’,x) for all x, x’
- K must be positive definite: for all A, Σ_AA is a positive definite matrix
- The kernel function K encodes assumptions about correlation!
- Decades of research in ML on kernels for different data types (vectors, graphs, sets, sequences, …)

Kernel functions: Examples

Squared exponential kernel: K(x,x’) = exp(−(x−x’)²/h²)

[Figure: kernel value as a function of the distance |x−x’|; samples from P(f) for bandwidth h = .3 (smooth) and h = .1 (wiggly)]
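To make the kernel examples concrete, here is a small NumPy sketch (not from the slides) that draws samples from a zero-mean GP prior with the squared exponential kernel; the grid and bandwidths are illustrative choices:

```python
import numpy as np

def sq_exp_kernel(x, xp, h=0.3):
    """Squared exponential kernel K(x, x') = exp(-(x - x')^2 / h^2)."""
    return np.exp(-np.subtract.outer(x, xp) ** 2 / h ** 2)

def sample_gp_prior(x, kernel, n_samples=3, jitter=1e-8):
    """Draw functions from a zero-mean GP prior evaluated on grid x."""
    K = kernel(x, x) + jitter * np.eye(len(x))  # jitter for numerical stability
    L = np.linalg.cholesky(K)
    return L @ np.random.randn(len(x), n_samples)  # each column is one sample

x = np.linspace(0, 1, 200)
samples_smooth = sample_gp_prior(x, lambda a, b: sq_exp_kernel(a, b, h=0.3))
samples_rough = sample_gp_prior(x, lambda a, b: sq_exp_kernel(a, b, h=0.1))
```

Smaller bandwidths h yield samples that vary more rapidly, matching the h = .3 vs. h = .1 panels.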

Kernel functions: Examples

Exponential kernel: K(x,x’) = exp(−|x−x’|/h)

[Figure: kernel value as a function of the distance |x−x’|; samples from P(f) for bandwidth h = 1 and h = .3; samples are continuous but rough]

Kernel functions: Examples

Linear kernel: K(x,x’) = xᵀx’

[Figure: samples from P(f) are linear functions]

Corresponds to (Bayesian) linear regression!

Kernel functions: Examples

Linear kernel with features: K(x,x’) = F(x)ᵀF(x’)

[Figure: samples from P(f) for F(x) = [0, x, x²] (quadratics) and for F(x) = sin(x)]

Application: Protein Engineering [with Romero, Arnold, PNAS ‘13]


Making predictions with GPs

Suppose P(f) = GP(f; µ, k) and we observe y_i = f(x_i) + ε_i, with ε_i ~ N(0, σ²) and A = {x_1, ..., x_n}.

Then the posterior is also a GP:

P(f | x_1, ..., x_n, y_1, ..., y_n) = GP(f; µ’, k’)

µ’(x) = µ(x) + Σ_{x,A} (Σ_AA + σ²I)⁻¹ (y_A − µ_A)
k’(x, x’) = k(x, x’) − Σ_{x,A} (Σ_AA + σ²I)⁻¹ Σ_{A,x’}

Thus, the predictive distribution for some test point x is:

P(f(x) | x_1, ..., x_n, y_1, ..., y_n) = N(f(x); µ’(x), k’(x, x))
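A minimal NumPy implementation of these update equations (an illustrative sketch, not the slides’ code); `kernel` is any covariance function as above, and a zero prior mean is assumed:

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, kernel, noise_var):
    """Posterior mean and covariance of a zero-mean GP at X_test.

    Implements mu'(x) = S_xA (S_AA + s^2 I)^-1 y_A and
    k'(x,x') = k(x,x') - S_xA (S_AA + s^2 I)^-1 S_Ax'.
    """
    K_AA = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_xA = kernel(X_test, X_train)
    K_xx = kernel(X_test, X_test)
    # Solve via Cholesky instead of forming the inverse explicitly.
    L = np.linalg.cholesky(K_AA)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    V = np.linalg.solve(L, K_xA.T)
    mu = K_xA @ alpha                 # posterior mean mu'(x)
    cov = K_xx - V.T @ V              # posterior covariance k'(x, x')
    return mu, cov
```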

GP Inference Illustration

[Figure: posterior mean and confidence band of f(x) after a few observations]

Where do we get the kernel (parameters) from?
- Prior knowledge
- Empirical Bayes (maximizing the marginal likelihood)
- Integrating over hyperparameters

For now, assume the kernel is given.

Active Learning and Optimization with Gaussian Processes


How do we quantify utility? Information gain [c.f. Lindley ’56]

Set D of points at which to evaluate f. Find S ⊆ D maximizing the information gain:

F(S) = H(f) − H(f | y_S) = I(f; y_S) = ½ log |I + σ⁻² Σ_SS|

where H(f) is the uncertainty of f before evaluation, H(f | y_S) the uncertainty after evaluation, and y_S the noisy observations at locations S.

[Figure: prior; a design with high information gain; a design with low information gain]

Optimizing mutual information [cf Shewry & Wynn ’87]

Mutual information F(S) is NP-hard to optimize. Simple strategy: the greedy algorithm. For S_t = {x_1, ..., x_t}:

x_{t+1} = arg max_{x ∈ D} F(S_t ∪ {x})
        = arg max_{x ∈ D} [ H(y_x | y_{S_t}) − H(y_x | f) ]
        = arg max_{x ∈ D} σ²_{x|S_t}

The term H(y_x | f) is constant for fixed noise variance, and the entropy of a Gaussian is monotonic in its variance.
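A sketch of this greedy rule in Python (illustrative; reuses the gp_posterior helper assumed above): at each step, pick the candidate with the largest posterior variance.

```python
import numpy as np

def greedy_uncertainty_sampling(candidates, kernel, noise_var, T):
    """Greedily select T points with maximal posterior variance.

    Equivalent to greedy maximization of mutual information, since the
    Gaussian entropy is monotone in the variance.
    """
    selected = []
    for _ in range(T):
        if selected:
            # The posterior variance does not depend on the y-values,
            # so dummy zeros suffice here.
            _, cov = gp_posterior(np.array(selected), np.zeros(len(selected)),
                                  candidates, kernel, noise_var)
            var = np.diag(cov)          # sigma^2_{x | S_t}
        else:
            var = np.diag(kernel(candidates, candidates))  # prior variance
        selected.append(candidates[int(np.argmax(var))])
    return np.array(selected)
```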

Side note: Submodularity of Mutual Information [cf K & Guestrin ’05]

Mutual information F(S) is monotone submodular:

∀x ∈ D, ∀A ⊆ B ⊆ D:  F(A ∪ {x}) − F(A) ≥ F(B ∪ {x}) − F(B)

The greedy algorithm provides a constant-factor approximation [Nemhauser et al ’78]:

F(S_T) ≥ (1 − 1/e) · max_{S ⊆ D, |S| ≤ T} F(S)

I.e., uncertainty sampling is near-optimal!

Active Learning: Uncertainty sampling

Pick:  x_t = arg max_{x ∈ D} σ²_{t−1}(x)

In active learning, we reduce uncertainty everywhere. In Bayesian optimization, we only care about the maximum!

[Figure: uncertainty sampling wastes samples by exploring f everywhere!]

Exploiting only

Pick:  x_t = arg max_{x ∈ D} µ_{t−1}(x)

[Figure: pure exploitation gets stuck in local optima!]

Bayesian Optimization [Močkus et al. ’78]

[Figure: loop between the acquisition function, the chosen point x_t, and the unknown function f]

y_t = f(x_t) + ε_t

Acquisition functions: expected/most probable improvement [Močkus et al. ’78, ’89], information gain about the maximum [Villemonteix et al. ’09], knowledge gradient [Powell et al. ’10], Predictive Entropy Search [Hernández-Lobato et al. ’14], TruVaR [Bogunovic et al ’17], Max-Value Entropy Search [Wang et al ’17]

Gaussian process bandit optimization

Goal: pick inputs x_1, x_2, … such that the average regret vanishes:

(1/T) Σ_{t=1}^T [f(x*) − f(x_t)] → 0

How should we pick samples to minimize our regret?

Avoiding unnecessary samples

[Figure: GP confidence bands, the UCB, and the best lower bound]

Key insight: we never need to sample where the upper confidence limit is below the best lower bound!

Upper confidence sampling (GP-UCB) [used in bandits: e.g., Auer et al ’02, Dani ’08, …]

Pick the input that maximizes the upper confidence bound:

x_t = arg max_{x ∈ D} µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

How should we choose β_t?

- Naturally trades off exploration and exploitation
- Does not waste samples (with high probability)
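A compact GP-UCB loop over a finite candidate set (an illustrative sketch reusing the gp_posterior helper assumed above; the schedule beta_t = 2 log(t+1) is a placeholder for the theoretically motivated O(log t) choice):

```python
import numpy as np

def gp_ucb(f_noisy, candidates, kernel, noise_var, T):
    """Run GP-UCB for T rounds over a finite candidate set.

    f_noisy: callable returning a noisy evaluation y = f(x) + eps
    """
    X, y = [], []
    for t in range(1, T + 1):
        if X:
            mu, cov = gp_posterior(np.array(X), np.array(y),
                                   candidates, kernel, noise_var)
            sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
        else:
            mu = np.zeros(len(candidates))
            sigma = np.sqrt(np.diag(kernel(candidates, candidates)))
        beta_t = 2.0 * np.log(t + 1.0)          # assumed O(log t) schedule
        ucb = mu + np.sqrt(beta_t) * sigma      # upper confidence bound
        x_t = candidates[int(np.argmax(ucb))]   # pick the UCB maximizer
        X.append(x_t)
        y.append(f_noisy(x_t))
    return np.array(X), np.array(y)
```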

Information capacity of GPs

We will see that regret bounds depend on how quickly we can gain information. Mathematically:

γ_T = max_{|A| ≤ T} I(f; y_A),  where  I(f; y_A) = H(f) − H(f | y_A)

This is exactly the quantity optimized in active learning / uncertainty sampling.

Performance of GP-UCB

Theorem [Srinivas, Krause, Kakade, Seeger IEEE IT ’12]: If we choose β_t = O(log t), then

(1/T) Σ_{t=1}^T [f(x*) − f(x_t)] = O*( √(γ_T / T) )

Hereby γ_T = max_{|A| ≤ T} I(f; y_A) is the information capacity (effective degrees of freedom).

Proof Sketch
- The true function is contained in the confidence bounds w.h.p.
- Instantaneous regret is bounded by the confidence interval at the UCB action
- By Cauchy-Schwarz, the cumulative regret is bounded by the sum of squared predictive variances at the evaluated points
- The latter is bounded by the log-determinant (= mutual information) of the selected points

Performance of GP-UCB

Theorem [Srinivas, Krause, Kakade, Seeger IEEE IT ’12]: If we choose β_t = O(log t), then

(1/T) Σ_{t=1}^T [f(x*) − f(x_t)] = O*( √(γ_T / T) )

Hereby γ_T = max_{|A| ≤ T} I(f; y_A) is the information capacity (effective degrees of freedom).

The slower γ_T grows, the easier f is to learn. Key question: how quickly does γ_T grow?

Growth of information gain

[Figure: bound on γ_T vs. T. Hard (little/no diminishing returns): the independent kernel grows fastest, followed by Matérn (ν = 2.5) and the squared exponential; easy (strong diminishing returns): linear (d = 4).]

Can exploit submodularity of mutual information to compute tight data-dependent bounds.

Bounds for common kernels [Srinivas, Krause, Kakade, Seeger ICML ’10; IEEE Trans. IT ’12]

Theorem: For the following kernels, we have:

- Linear:  γ_T = O(d log T);  R_T/T = O*( √(d / T) )
- Squared exponential:  γ_T = O((log T)^{d+1});  R_T/T = O*( √((log T)^{d+1} / T) )
- Matérn with ν > 2:  γ_T = O( T^{d(d+1)/(2ν+d(d+1))} log T );  R_T/T = O*( T^{−ν/(2ν+d(d+1))} )

Smoothness of f helps battle the curse of dimensionality! The bounds crucially rely on submodularity of the mutual information.

Robustness?

So far, we have assumed:
- the objective f is drawn from a known GP prior
- the noise is iid Gaussian with known variance

How robust are the results w.r.t. these assumptions?

Reproducing Kernel Hilbert Spaces (RKHS)

Given a kernel k : D × D → R, consider functions

f(x) = Σ_i α_i k(x_i, x),  where α_i ∈ R, x_i ∈ D

with inner product ⟨f, g⟩ = Σ_{i,j} α_i^f α_j^g k(x_i^f, x_j^g) and norm ||f|| = √⟨f, f⟩.

A Reproducing Kernel Hilbert Space (RKHS) is

H_k(D) = { f : D → R, f(x) = Σ_i α_i k(x_i, x) s.t. ||f|| < ∞ }

Frequentist confidence intervals for GPs?

Theorem [w Srinivas, Kakade, Seeger ’10; cf Abbasi-Yadkori ’12]: Want

Pr( ∀x, t: f(x) ∈ [µ_t(x) ± β_t σ_t(x)] ) ≥ 1 − δ

This holds for β_t = O( ||f||_k + √(γ_t log³ t) ), where ||f||_k is the “complexity” of f (its RKHS norm) and γ_T = max_{|A| ≤ T} I(f; y_A) is the information capacity.

What if f is not from a GP? [Srinivas, Krause, Kakade, Seeger ICML ’10; IEEE Trans. IT ’12]

In practice, f may not be Gaussian.

Theorem: Let f lie in the RKHS of kernel K with ||f||²_K ≤ B, and let the noise be bounded almost surely by σ. Choose β_t = O(B + γ_t log³ t). Then with high probability

R_T / T = O*( √(γ_T β_T / T) )

- We don’t need to know the “true prior”
- Intuitively, the bound depends on the “complexity” of the function through its RKHS norm

Side note: Lower Bounds?

- The upper bounds are tight for the Gaussian kernel [Scarlett ICML ’18]
- Unclear whether they can be improved for the Matérn kernel

Side note: Optimizing the acquisition function

GP-UCB requires solving the problem

x_t = arg max_{x ∈ D} µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x)

This is generally non-convex!
- In low dimensions, can use Lipschitz optimization (DIRECT, etc.)
- In high dimensions, can use gradient ascent (based on random initialization)
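A sketch of the random-restart approach in Python (illustrative; uses scipy.optimize.minimize, and a generic acquisition callable `acq` is an assumption):

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_restarts=10, rng=None):
    """Maximize a (generally non-convex) acquisition function by
    multi-start local optimization from random initializations.

    acq:    callable mapping a point (1-D array) to its acquisition value
    bounds: list of (low, high) pairs, one per dimension
    """
    rng = rng or np.random.default_rng()
    lo, hi = np.array(bounds).T
    best_x, best_val = None, -np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(lo, hi)                    # random initialization
        res = minimize(lambda x: -acq(x), x0,       # minimize the negative
                       bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x
```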

Beyond Basic BO: More Complex Settings


Confidence based sampling

Key ideas behind GP-UCB:
- Utilize high-probability bounds on the function value to constrain sampling
- The information capacity bounds the problem complexity

Can generalize to more complex settings:
- Parallelizing exploration tradeoffs
- Context / side-information
- Multi-objective optimization
- Level-set identification
- High dimensions
- Constraints

Beyond Basic BO: Parallel Exploration / Delayed Feedback

Parallelizing exploration—exploitation tradeoffs [cf Azimi et al ’10; Ginsbourger et al ’10]

The basic algorithm is fully sequential: it needs y_1, ..., y_t to choose x_{t+1}.

In many applications, we wish to perform a batch of multiple (say B) evaluations in parallel.

How should we choose the batch? How much “informational speedup” can we get?

Naïve approaches

- Pick a single query and run this experiment B times? Intuitively it seems we could do better.
- Pick the top B queries under the GP-UCB criterion? Problem: they are likely close to one another, so no diversity.

[Figure: the top-B UCB picks cluster around the current peak of the acquisition function]

Batch Mode GP Optimization [cf. “Kriging Believer”, Ginsbourger et al ’10]

x_t = arg max_{x ∈ D} [ µ_{fb[t]}(x) + β_t^{1/2} σ_{t−1}(x) ]

The mean is computed from the feedback received so far (fb[t] indexes the last round with available feedback), while σ_{t−1} is updated after each selection using “hallucinated” observations for the pending evaluations; this is possible because the posterior variance does not depend on the observed values.

This scheme anticipates the information we are gaining. Must be careful to avoid overconfidence!
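A sketch of batch selection with hallucinated observations (illustrative; reuses the gp_posterior helper assumed earlier, and exploits the fact that the GP posterior variance is independent of the y-values):

```python
import numpy as np

def select_batch_hallucinated(X_obs, y_obs, candidates, kernel,
                              noise_var, B, beta=4.0):
    """Select a batch of B points UCB-style, hallucinating each pending
    observation at its posterior mean (cf. the Kriging Believer idea).

    Assumes at least one real observation in (X_obs, y_obs).
    """
    X = list(X_obs)
    y = list(y_obs)
    batch = []
    for _ in range(B):
        mu, cov = gp_posterior(np.array(X), np.array(y),
                               candidates, kernel, noise_var)
        sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
        i = int(np.argmax(mu + np.sqrt(beta) * sigma))
        batch.append(candidates[i])
        X.append(candidates[i])   # hallucinate: condition on the new point...
        y.append(mu[i])           # ...using its predicted mean as the "value"
    return batch
```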

GP-BUCB guarantees

Theorem [Desautels, K, JMLR ’14]: For sufficiently regular kernels (with kernel-specific choices of the exploration schedule β_t, e.g., for the Matérn and squared exponential kernels), we can choose a batch size

B = O(log T)

and nevertheless obtain regret

R_T^batch = c(d, K) · R_T^seq + O(polylog(T, B))

where c(d, K) is independent of B and T.

→ Near-linear speedup in the convergence rate!

Simulation Results

[Figure: optimality gap (minimum regret) vs. time (actions), comparing GP-UCB, GP-BUCB (B = 5), “Top 5 UCB”, and “Repeat 5 times”]

Application: Protein Engineering [with Romero, Arnold, PNAS ’13]

- Task: design cytochrome P450 chimeras
- Action: experiment with a protein sequence
- Feedback: thermostability T50
- Kernel: structure-based kernel function

Wet-lab results [w Romero, Arnold PNAS ’13]

Identification of new thermostable P450 chimeras: 5.3°C more stable than the best published sequence!

Other strategies
- Multi-point expected improvement [Schonlau ’97]
- Simulation matching [Azimi et al ’10]
- DPP sampling [Kathuria et al ’16]

Beyond Basic BO: Multi-task / Contextual BO

Contextual GP bandits

In each round t:
- Observe context z_t ∈ Z
- Pick x_t ∈ D
- Observe y_t = f(x_t, z_t) + ε_t  (f modeled as a GP)
- Incur regret r_t = sup_x f(x, z_t) − f(x_t, z_t)

Cumulative contextual regret:  R_T = Σ_{t=1}^T r_t

Obtaining sublinear regret R_T / T → 0 requires learning the optimal mapping from contexts to actions!

CGP-UCB [generalizes LinUCB: Li et al ’10]

Pick the input that maximizes the upper confidence bound at the current context:

x_t = arg max_{x ∈ D} µ_{t−1}(x, z_t) + β_t^{1/2} σ_{t−1}(x, z_t)

[Figure: payoff surface over contexts × actions]

Where do we get the kernel from?

In principle, we can choose any kernel on D × Z. Suppose we have kernels k_D(x, x’) and k_Z(z, z’). We can compose them into a kernel on D × Z through:

- Multiplication:  k((x, z), (x’, z’)) = k_D(x, x’) · k_Z(z, z’)
- Addition:  k((x, z), (x’, z’)) = k_D(x, x’) + k_Z(z, z’)
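For concreteness, a small sketch of composite kernels over (action, context) pairs (illustrative; assumes 1-D base kernels with the array signature used earlier, e.g., sq_exp_kernel):

```python
import numpy as np

def product_kernel(kD, kZ):
    """k((x,z),(x',z')) = kD(x,x') * kZ(z,z') on stacked [x, z] inputs."""
    def k(A, B):
        # A, B: arrays of shape (n, 2) holding (action, context) pairs
        return kD(A[:, 0], B[:, 0]) * kZ(A[:, 1], B[:, 1])
    return k

def sum_kernel(kD, kZ):
    """k((x,z),(x',z')) = kD(x,x') + kZ(z,z')."""
    def k(A, B):
        return kD(A[:, 0], B[:, 0]) + kZ(A[:, 1], B[:, 1])
    return k

# e.g., smooth in the action and smooth in the context:
k_joint = product_kernel(lambda a, b: sq_exp_kernel(a, b, h=0.3),
                         lambda a, b: sq_exp_kernel(a, b, h=0.5))
```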

Examples

[Figure: sample payoff surfaces over contexts × actions, for the product kernel and for the additive kernel]

Can bound the information gain for composite kernels based on that of the constituent kernels [cf K & Ong NIPS ’11].

Performance of CGP-UCB

Theorem [K, Ong NIPS ’11]: If we choose β_t = O(log t), then

(1/T) Σ_{t=1}^T [f(x_t*, z_t) − f(x_t, z_t)] = O*( √(γ_T / T) )

Hereby γ_T = max_{|A| ≤ T} I(f; y_A).

Thus, the information gain even bounds the stronger notion of contextual regret!

Book recommendation results [w Nikolic, Vanchinathan, de Bona, RecSys ’14]

[Figure: expected clicks vs. time (days, 0–42) for CGP-UCB, UCB1, and the previous system]

Beyond Basic BO: Multiple objectives

Multi-objective performance optimization [w Zuluaga, Sergent, Püschel, JMLR ’16]

- Protein structure optimization: trade off binding affinity & thermostability
- Empirical algorithmics: trade off performance & memory footprint
- Design of special-purpose hardware: trade off area, throughput, energy, runtime, ...
- ...

Evaluation can be costly and noisy.

High-level synthesis for high-performance computing

[Figure: a high-level specification is compiled by HLS into designs trading off throughput, area, latency, and energy; we seek the best tradeoffs]

Exploring the Design Space

[Figure: sorting networks, n = 256; throughput vs. area, with the Pareto-optimal solutions highlighted]

Low-level synthesis takes minutes to days per design.

Goal

Sample as few designs as possible to predict the Pareto-optimal designs.

[Figure: throughput vs. area; samples and the predicted Pareto frontier]

Running the Algorithm: Modeling
[Figure: confidence regions (from the GP model) for each design; evaluated designs marked]

Running the Algorithm: Classification
[Figure: designs classified as Pareto-optimal, not Pareto-optimal, or unclassified]

Running the Algorithm: Sampling
[Figure: the next evaluation is chosen among the unclassified designs]

Running the Algorithm: Evaluation
[Figure: the chosen design is evaluated]

Running the Algorithm: Modeling
[Figure: the GP model is updated with the new evaluation]

Running the Algorithm: Classification
[Figure: more designs become classified; some remain unclassified]

Running the Algorithm: Tolerances
[Figure: ε-classification allows classifying points whose confidence regions are within tolerance ε]

Running the Algorithm: Termination
[Figure: the algorithm stops once all designs are classified]

Active Learning for Multi-Objective Optimization: The PAL Algorithm

For each design x ∈ E, the GP models predict µ_t(x) = (µ_{t,i}(x))_{1≤i≤n} with uncertainty σ_t(x), captured by the confidence region

Q_{µ,σ,β}(x) = { y : µ(x) − β^{1/2} σ(x) ⪯ y ⪯ µ(x) + β^{1/2} σ(x) }

Algorithm 1: The PAL algorithm.
Input: design space E; GP priors µ_{0,i}, σ_0, k_i for all 1 ≤ i ≤ n; ε; β_t for t ∈ N
Output: predicted Pareto set P̂
1:  P_0 = ∅, N_0 = ∅, U_0 = E  {classification sets}
2:  S_0 = ∅  {evaluated set}
3:  R_0(x) = R^n for all x ∈ E
4:  t = 0
5:  repeat
6:    {Modeling}
7:    obtain µ_t(x) and σ_t(x) for all x ∈ E  {µ_t(x) = y(x) and σ_t(x) = 0 for all x ∈ S_t}
8:    R_t(x) = R_{t−1}(x) ∩ Q_{µ_t, σ_t, β_{t+1}}(x) for all x ∈ E
9:    {Classification}
10:   P_t = P_{t−1}, N_t = N_{t−1}, U_t = U_{t−1}
11:   for all x ∈ U_t do
12:     if there is no x’ ≠ x such that min(R_t(x)) + ε ⪯ max(R_t(x’)) − ε then
13:       P_t = P_t ∪ {x}, U_t = U_t \ {x}
14:     else if there exists x’ ≠ x such that max(R_t(x)) − ε ⪯ max(R_t(x’)) + ε then
15:       N_t = N_t ∪ {x}, U_t = U_t \ {x}
16:     end if
17:   end for
18:   {Sampling}
19:   find w_t(x) for all x ∈ (U_t ∪ P_t) \ S_t
20:   choose x_{t+1} = arg max_{x ∈ (U_t ∪ P_t) \ S_t} w_t(x)
21:   t = t + 1
22:   sample y_t(x_t) = f(x_t) + ν_t
23: until U_t = ∅
24: P̂ = P_t

Here w_t(x) is the length of the diagonal of the uncertainty region R_t(x), i.e., max_{y,y’ ∈ R_t(x)} ||y − y’||_2; this steers sampling towards the points we are most uncertain about among those that could still be Pareto-optimal.

Theoretical Guarantee

If the kernel imposes strong regularity, γ_T grows sublinearly (strong diminishing returns); if it imposes little regularity (e.g., is close to diagonal), γ_T grows almost linearly with T. Srinivas et al. (2010; 2012) established γ_T as the key quantity in bounding the regret of single-objective GP optimization; the same quantity governs convergence in predicting the Pareto-optimal set. Given a target error η, PAL is guaranteed to stop in at most T iterations:

Theorem 1. Let δ ∈ (0, 1). Running PAL with β_t = 2 log(n|E|π²t²/(6δ)), the following holds with probability 1 − δ. To achieve a maximum hypervolume error of η, it is sufficient to choose

ε = η(n−1)! / (2n a^{n−1}),  where a = max_{x ∈ E, 1 ≤ i ≤ n} √(β_1 k_i(x, x)).

In this case, the algorithm terminates after at most T iterations, where T is the smallest number satisfying

√( T / (C_1 β_T γ_T) ) ≥ n a^{n−1} / (η(n−1)!)

Here C_1 = 8 / log(1 + σ^{−2}), and γ_T depends on the type of kernel used.

This means that by specifying δ and a target hypervolume error η, PAL can be configured (through the parameter ε) to stop when the target error is achieved with probability 1 − δ.

Same γ_T!
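A sketch of PAL’s ε-classification step over axis-aligned confidence rectangles (illustrative; rectangles are given as per-objective [lo, hi] bounds, ⪯ is the componentwise order, and all objectives are assumed to be maximized):

```python
import numpy as np

def pal_classify(R, eps):
    """Epsilon-classify designs given confidence rectangles.

    R: array of shape (m, n, 2); R[x, i] = [lo, hi] for objective i
    Returns boolean masks (pareto, not_pareto); the rest stay unclassified.
    """
    lo, hi = R[..., 0], R[..., 1]     # pessimistic / optimistic corners
    m = len(R)
    pareto = np.zeros(m, bool)
    not_pareto = np.zeros(m, bool)
    for x in range(m):
        others = np.arange(m) != x
        # Rule 12: Pareto if no other design's optimistic corner
        # eps-dominates this design's pessimistic corner.
        dominated = np.all(lo[x] + eps <= hi[others] - eps, axis=1)
        if not dominated.any():
            pareto[x] = True
        # Rule 14: not Pareto if some other optimistic corner (plus slack)
        # dominates this design's optimistic corner.
        elif np.any(np.all(hi[x] - eps <= hi[others] + eps, axis=1)):
            not_pareto[x] = True
    return pareto, not_pareto
```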

Experiments: Data Sets

- Sorting networks: SNW (|E| = 206)
- Network on Chip: NoC (|E| = 259)
- Compiler settings: SW-LLVM (|E| = 1023)

[Figure: objective space log(f1) vs. log(f2) with the Pareto frontier for each data set]

Marcela Zuluaga, Andreas Krause, Peter Milder, Markus Püschel. Streaming Sorting Networks. LCTES 2012.
Oscar Almer, Nigel Topham, Björn Franke. A Learning-Based Approach to the Automated Design of MP-SoC Networks. ARCS 2011.
N. Siegmund, S. Kolesnikov, C. Kästner, S. Apel, D. Batory, M. Rosenmüller, G. Saake. Predicting Performance via Automated Feature-Interaction Detection. ICSE 2012.

Experimental results

[Figure: percentage errors E_T,1 and E_T,2 vs. number of evaluations for PAL (with ε ranging from 0.001% to 0.512%) and ParEGO on SNW (|E| = 206), NoC (|E| = 259), and SW-LLVM (|E| = 1023)]

[Figure 5: average percentage error vs. number of evaluations after termination; for PAL, different values of ε are used]

Beyond Basic BO: Active learning of level sets

Robotic sensor platform

[www.limnobotics.ch]

Measuring along a transect

[Figure: robotic boat measuring along a lake transect; depth (m) vs. length (m), 0–1,400 m]

Measurements obtained

[Figure: raw measurements along the transect]

Level set estimation

[Figure: regions of the transect above and below a threshold]

Classification

[Figure: classification of locations along the transect relative to the level set]

Confidence-based classification

[Figure: confidence intervals Q_t(x) at several points; points whose intervals lie entirely above or below the threshold can be classified]

Sampling rule

Pick the point with the largest ambiguity:

a_t(x) = min{ µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x) − h,  h − µ_{t−1}(x) + β_t^{1/2} σ_{t−1}(x) }

i.e., how far the confidence interval extends past the threshold h on either side.
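A sketch of this rule in Python (illustrative; reuses the gp_posterior helper assumed earlier):

```python
import numpy as np

def lse_ambiguity_step(X, y, candidates, kernel, noise_var, h, beta=9.0):
    """Pick the next point for level-set estimation at threshold h:
    the candidate whose confidence interval straddles h the most."""
    mu, cov = gp_posterior(np.array(X), np.array(y),
                           candidates, kernel, noise_var)
    sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
    half_width = np.sqrt(beta) * sigma
    # ambiguity: min distance from h to the two confidence bounds
    a = np.minimum(mu + half_width - h, h - mu + half_width)
    return candidates[int(np.argmax(a))]
```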

[Figure: inferred level sets along the transect; depth (m) vs. length (m), 0–1,400 m]

Performance guarantee for LSE

Same γ_T!

Experimental results

[Figure: inferred level sets on lake transect data (depth vs. length), and F1-score vs. number of samples (0–400) for the compared strategies]

Implicit thresholds

What if we don’t know the threshold h? Implicitly define it:

h = ω · max_x f(x)

No existing algorithms for this problem. Can extend LSE to this implicit setting!

[Figure: level sets estimated with the implicit threshold on the transect data; depth (m) vs. length (m), 0–1,400 m]

Path planning

For a robotic sensor, we need to account for travel cost. Idea: plan batches of samples, then connect them with a TSP tour.

[Figure: planned sampling path along the transect; depth (m) vs. length (m)]

Experiments on Lake Zurich

[with Hitz, Gotovos, Pomerleau, Garneau, Pradalier, Siegwart, ICRA‘14]


Beyond Basic BO: Constraints and “Safe” Exploration

Safe Controller Tuning

[with Berkenkamp, Schoellig ICRA ’16]


Illustration

Policy a_t = π(s_t, θ), executed for parameters θ.

[Figure: target trajectory vs. tracking performance of the controller]

Desiderata: few experiments; safety for all experiments (safety constraint).

Constrained Bayesian Optimization

max_x f(x)  s.t.  g(x) ≥ τ

f and g are unknown; access only via a noisy black box: at time t, pick x_t given past observations; observe f(x_t) + noise and g(x_t) + noise.

Want to never evaluate an infeasible x (i.e., ensure g(x_t) ≥ τ ∀t).

Safe optimization

[Figure: f(x) = g(x); starting from a safe seed, the reachable optimum may differ from the global optimum, which lies beyond the safety threshold]

[Figure: starting from the safe seed, evaluations (+) gradually expand the region known to be above the safety threshold]

Starting point: Bayesian Optimization [Močkus ’75]

[Figure: loop between the acquisition function, the chosen point x_t, and the unknown function f]

y_t = f(x_t) + ε_t

Unconstrained: expected/most probable improvement [Močkus et al. ’78, ’89], information gain about the maximum [Villemonteix et al. ’09], knowledge gradient [Powell et al. ’10], Predictive Entropy Search [Hernández-Lobato et al. ’14], TruVaR [Bogunovic et al ’17], Max-Value Entropy Search [Wang et al ’17]

Constraints / multiple objectives: [Snoek et al. ’13, Gelbart et al. ’14, Gardner et al. ’14, Zuluaga et al. ’16]

Plausible maximizers

[Figure: confidence bands on f(x) with the best lower bound]

Focus exploration where the upper confidence bound ≥ best lower bound!

Certifying Safety

[Figure: confidence band on g(x) and the safety threshold]

Statistically certify safety where the lower bound > threshold!
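A sketch of this certification step (illustrative; reuses the gp_posterior helper assumed earlier, now applied to the constraint g):

```python
import numpy as np

def certified_safe_set(X_g, y_g, candidates, kernel, noise_var,
                       tau, beta=9.0):
    """Return the candidates certified safe with high probability:
    those whose lower confidence bound on g exceeds the threshold tau."""
    mu, cov = gp_posterior(np.array(X_g), np.array(y_g),
                           candidates, kernel, noise_var)
    lcb = mu - np.sqrt(beta) * np.sqrt(np.maximum(np.diag(cov), 0.0))
    return candidates[lcb > tau]
```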

First Attempt: SafeUCB

Maximize the acquisition function (GP-UCB, EI, …) over the certified safe domain.
→ Gets stuck in local optima!

SAFEOPT [Sui, Gotovos, Burdick, K ICML ’15], [Berkenkamp, Schoellig, K ’16]

[Figure: confidence bands on f(x) and g(x); SAFEOPT samples within the certified safe set]

SAFEOPT Guarantees [with Sui, Gotovos, Burdick ICML ’15; Berkenkamp, Schoellig, K ’16]

Theorem (informal): Under suitable conditions on the kernel and on f, g, there exists a function T(ε,δ) such that for any ε > 0 and δ > 0, it holds with probability at least 1 − δ that
1) SAFEOPT never makes an unsafe decision
2) After at most T(ε,δ) iterations, it has found an ε-optimal reachable point

For the Gaussian kernel (fixed domain & dimension):

T(ε, δ) ∈ Õ( (||f||_k + ||g||_k)³ log(1/δ) / ε² )

Safe controller tuning

[with Berkenkamp, Schoellig ICRA ’16]


Beyond Basic BO: Outlook & Further topics

Outlook: Further topics
- Dealing with high dimensions
- Efficient kernel approximations and beyond GPs
- Exploiting gradient information
- Heteroscedastic noise
- …

High dimensions

Statistical and computational challenges → need assumptions:

- Low-dimensional linear embeddings: f(x) = g(Ax), A ∈ R^{d×D} [Wang et al ’13, Djolonga & K ’13, …]
- Additive decompositions: f(x) = Σ_i f_i(x_i) [Kandasamy et al ’15, Rolland et al ’18, Mutny & K ’18]

Efficient kernel approximations

Naively, predictions in GPs require Cholesky decompositions of T × T matrices → O(T³). Considerable work exists on efficient approximations:
- Data-independent (Fourier features, …)
- Data-dependent (pseudo-inputs, Nyström approximation, DNN basis functions, …)

Can provably reduce the cost to O(T polylog(T)) for generalized additive GPs while still yielding no regret [Mutny & K NIPS ’18].

Much recent work on replacing GPs with neural nets [cf Springenberg et al NIPS ’16, Garnelo et al ICML ’18].
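As an illustration of the data-independent route, a minimal random Fourier feature approximation of the squared exponential kernel (a sketch; the feature count m is a tuning knob, not a prescribed value):

```python
import numpy as np

def rff_features(X, h=0.3, m=200, rng=None):
    """Random Fourier features approximating K(x,x') = exp(-(x-x')^2 / h^2).

    The kernel is stationary, so by Bochner's theorem it equals an
    expectation of cosines over a Gaussian frequency distribution.
    """
    rng = rng or np.random.default_rng(0)
    # exp(-(x-x')^2 / h^2) has spectral standard deviation sqrt(2)/h in 1-D
    w = rng.normal(0.0, np.sqrt(2.0) / h, size=m)
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(np.outer(X, w) + b)

# Phi @ Phi.T then approximates the kernel matrix, and Bayesian linear
# regression on Phi costs O(T m^2) instead of O(T^3).
```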

Exploiting gradient information

- In some applications, (noisy) gradient information may be available
- Gradients correspond to linear observations → the posterior is still a Gaussian process
- May have to be careful in the choice of acquisition function [Wu et al NIPS ’17]

Heteroscedastic Noise

Var(y(x)) = g(x)

→ Information Directed Sampling [w Kirschner ’18, cf. Russo & van Roy ’14]

Conclusions

- Bridging bandits and Bayesian optimization
- Key idea: exploit confidence bounds to constrain sampling
- Extensions: parallelization, context, multi-objective, level sets, active search and discovery, safety constraints, …
- Performance bounds based on the information capacity, bounded via submodularity
- Strong performance on real-world problems

References


Abbasi-Yadkori, Y. Online Learning for Linearly Parametrized Control Problems. PhD thesis, 2012.
Abdelrahman, H., Berkenkamp, F., Poland, J., & Krause, A. Bayesian Optimization for Maximum Power Point Tracking in Photovoltaic Power Plants. ECC 2016.
Almer, O., Topham, N., & Franke, B. (2011). A learning-based approach to the automated design of MP-SoC networks. In ARCS (pp. 243-258). Springer, Berlin, Heidelberg.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235-256.
Azimi, J., Fern, A., & Fern, X. Z. (2010). Batch Bayesian optimization via simulation matching. NIPS.
Berkenkamp, F., Schoellig, A. P., & Krause, A. (2016). Safe controller optimization for quadrotors with Gaussian processes. ICRA.
Bogunovic, I., Scarlett, J., Krause, A., & Cevher, V. (2016). Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. NIPS.
Bubeck, S., Munos, R., Stoltz, G., & Szepesvári, C. (2011). X-armed bandits. Journal of Machine Learning Research, 12, 1655-1695.
Cesa-Bianchi, N., & Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78(5), 1404-1422.
Dani, V., Hayes, T. P., & Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback.
Desautels, T., Krause, A., & Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. JMLR.
Djolonga, J., Krause, A., & Cevher, V. (2013). High-dimensional Gaussian process bandits. NIPS (pp. 1025-1033).
Frazier, P., Powell, W., & Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4), 599-613.
Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., & Cunningham, J. P. (2014). Bayesian optimization with inequality constraints. ICML (pp. 937-945).
Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M., & Teh, Y. W. (2018). Neural processes. arXiv preprint arXiv:1807.01622.
Garnett, R., Osborne, M. A., & Hennig, P. (2013). Active learning of linear embeddings for Gaussian processes. arXiv preprint arXiv:1310.6740.
Gelbart, M. A., Snoek, J., & Adams, R. P. (2014). Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607.
Ginsbourger, D., Le Riche, R., & Carraro, L. (2010). Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 148-177.
Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., & Sculley, D. (2017). Google Vizier: A service for black-box optimization. KDD.
Gotovos, A., Casati, N., Hitz, G., & Krause, A. (2013). Active learning for level set estimation. IJCAI (pp. 1344-1350).
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5), 527-535.
Hernández-Lobato, J. M., Gelbart, M. A., Hoffman, M. W., Adams, R. P., & Ghahramani, Z. (2015). Predictive entropy search for Bayesian optimization with unknown constraints. ICML.
Hernández-Lobato, J. M., Hoffman, M. W., & Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. NIPS.
Hitz, G., Gotovos, A., Garneau, M. É., Pradalier, C., Krause, A., & Siegwart, R. Y. (2014). Fully autonomous focused exploration for robotic environmental monitoring. ICRA.
Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4), 455-492.
Kandasamy, K., Schneider, J., & Póczos, B. (2015). High dimensional Bayesian optimisation and bandits via additive models. ICML.
Kathuria, T., Deshpande, A., & Kohli, P. (2016). Batched Gaussian process bandit optimization via determinantal point processes. NIPS.
Kirschner, J., & Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. COLT.
Kleinberg, R., Slivkins, A., & Upfal, E. (2008). Multi-armed bandits in metric spaces. STOC.
Krause, A., & Guestrin, C. (2005). Near-optimal nonmyopic value of information in graphical models. UAI.
Krause, A., & Ong, C. S. (2011). Contextual Gaussian process bandit optimization. NIPS (pp. 2447-2455).
Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. WWW.
Lindley, D. V. (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 986-1005.
Marco, A., Berkenkamp, F., Hennig, P., Schoellig, A. P., Krause, A., Schaal, S., & Trimpe, S. (2017). Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization. ICRA.
Močkus, J. (1975). On Bayesian methods for seeking the extremum. Optimization Techniques IFIP Technical Conference. Springer, Berlin, Heidelberg.
Mockus, J. (1989). The Bayesian approach to local optimization. In Bayesian Approach to Global Optimization (pp. 125-156). Springer, Dordrecht.
Mutny, M., & Krause, A. (2018). Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. NIPS.
Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1), 265-294.
Rasmussen, C. E., & Williams, C. K. (2006). Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA.
Rolland, P., Scarlett, J., Bogunovic, I., & Cevher, V. (2018). High-dimensional Bayesian optimization via additive models with overlapping groups. arXiv preprint arXiv:1802.07028.
Romero, P. A., Krause, A., & Arnold, F. H. (2013). Navigating the protein fitness landscape with Gaussian processes. Proc. National Academy of Sciences, 110(3), E193-E201.
Rusmevichientong, P., & Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2), 395-411.
Russo, D., & Van Roy, B. (2014). Learning to optimize via information-directed sampling. NIPS (pp. 1583-1591).
Scarlett, J. (2018). Tight regret bounds for Bayesian optimization in one dimension. ICML.
Schonlau, M. (1997). Computer experiments and global optimization.
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148-175.
Shewry, M. C., & Wynn, H. P. (1987). Maximum entropy sampling. Journal of Applied Statistics, 14(2), 165-170.
Siegmund, N., Kolesnikov, S. S., Kästner, C., Apel, S., Batory, D., Rosenmüller, M., & Saake, G. (2012). Predicting performance via automated feature-interaction detection. ICSE.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NIPS.
Springenberg, J. T., Klein, A., Falkner, S., & Hutter, F. (2016). Bayesian optimization with robust Bayesian neural networks. NIPS.
Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. W. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5), 3250-3265.
Sui, Y., Gotovos, A., Burdick, J., & Krause, A. (2015). Safe exploration for optimization with Gaussian processes. ICML (pp. 997-1005).
Swersky, K., Snoek, J., & Adams, R. P. (2013). Multi-task Bayesian optimization. NIPS (pp. 2004-2012).
Vanchinathan, H. P., Nikolic, I., De Bona, F., & Krause, A. (2014). Explore-exploit in top-N recommender systems via Gaussian processes. ACM RecSys.
Villemonteix, J., Vazquez, E., & Walter, E. (2009). An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4), 509.
Wang, Z., & Jegelka, S. (2017). Max-value entropy search for efficient Bayesian optimization. arXiv preprint arXiv:1703.01968.
Wang, Z., Zoghi, M., Hutter, F., Matheson, D., & De Freitas, N. (2013). Bayesian optimization in high dimensions via random embeddings. IJCAI (pp. 1778-1784).
Wu, J., Poloczek, M., Wilson, A. G., & Frazier, P. (2017). Bayesian optimization with gradients. NIPS (pp. 5267-5278).
Zuluaga, M., Krause, A., & Püschel, M. (2016). ε-PAL: An active learning approach to the multi-objective optimization problem. JMLR.
Zuluaga, M., Krause, A., Milder, P., & Püschel, M. (2012). Smart design space sampling to predict Pareto-optimal solutions. ACM SIGPLAN Notices, 47(5), 119-128.