Distributionally Ambiguous Optimization Techniques in Batch Bayesian Optimization

07/13/2017 · Nikitas Rontsis et al. · University of Oxford

We propose a novel, theoretically-grounded, acquisition function for batch Bayesian optimization informed by insights from distributionally ambiguous optimization. Our acquisition function is a lower bound on the well-known Expected Improvement function -- which requires a multi-dimensional Gaussian expectation over a piecewise affine function -- and is computed by evaluating instead the best-case expectation over all probability distributions consistent with the same mean and variance as the original Gaussian distribution. Unlike alternative approaches, including Expected Improvement, our proposed acquisition function avoids multi-dimensional integrations entirely, and can be computed exactly as the solution of a convex optimization problem in the form of a tractable semidefinite program (SDP). Moreover, we prove that the solution of this SDP also yields exact numerical derivatives, which enable efficient optimization of the acquisition function. Finally, our approach efficiently handles marginalized posteriors with respect to the Gaussian Process's hyperparameters. We demonstrate superior performance to heuristic alternatives and approximations of the intractable expected improvement, justifying this performance difference based on simple examples that break the assumptions of state-of-the-art methods.


1 Introduction

When dealing with numerical optimization problems in engineering applications, one is often faced with the optimization of an expensive process that depends on a number of tuning parameters. Examples include the outcome of a biological experiment, the training of large scale machine learning algorithms, or the outcome of exploratory drilling. We are concerned with problem instances wherein there is the capacity to perform $q$ experiments in parallel and, if needed, repeatedly select further batches of cardinality $q$ as part of some sequential decision making process. Given the cost of the process, we wish to select the parameters at each stage of evaluations carefully so as to optimize the process using as few experimental evaluations as possible.

It is common to assume a surrogate model for the outcome of the process to be optimized. This model, which is based on both prior assumptions and past function evaluations, is used to determine a collection of input points for the next set of evaluations. Bayesian Optimization provides an elegant surrogate model approach and has been shown to outperform other state-of-the-art algorithms on a number of challenging benchmark functions (Jones, 2001). It models the unknown function with a Gaussian Process (GP) (Rasmussen and Williams, 2005), a probabilistic function approximator which can incorporate prior knowledge such as smoothness, trends, etc.

A comprehensive introduction to GPs can be found in Rasmussen and Williams (2005). In short, modeling a function $f$ with a GP amounts to modeling the function as a realization of a stochastic process. In particular, we assume that the outcomes of function evaluations are normally distributed random variables with known prior mean function $m(\cdot)$ and prior covariance function $k(\cdot, \cdot)$. Prior knowledge about $f$, such as smoothness and trends, can be incorporated through judicious choice of the covariance function $k$, while the mean function $m$ is commonly assumed to be zero. A training dataset $\mathcal{D} = \{(\hat{x}_i, f(\hat{x}_i)) \mid i = 1, \dots, n\}$ of past function evaluations is then used to calculate the posterior of $f$.

The celebrated GP regression equations (Rasmussen and Williams, 2005) give the posterior

$$y \sim \mathcal{N}\big(\mu(X), \Sigma(X)\big) \tag{1}$$

on a batch of test locations $X = (x_1, \dots, x_q)$ as a multivariate normal distribution in a closed form formula. The posterior mean $\mu(X)$ and covariance $\Sigma(X)$ also depend on the dataset $\mathcal{D}$, but we do not explicitly indicate this dependence in order to simplify the notation.
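As an illustration, the following is a minimal NumPy sketch of the standard GP regression equations behind (1), assuming a zero prior mean and a squared exponential kernel; all names here are ours, not the authors' implementation.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between row-wise point sets A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X_train, y_train, X_batch, noise=1e-6):
    """Posterior mean mu(X) and covariance Sigma(X) of eq. (1) on a batch."""
    K = se_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = se_kernel(X_train, X_batch)
    Kss = se_kernel(X_batch, X_batch)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, Ks)
    mu = Ks.T @ alpha          # posterior mean (zero prior mean assumed)
    Sigma = Kss - v.T @ v      # posterior covariance
    return mu, Sigma
```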

Given the surrogate model, we wish to identify some selection criterion for choosing the next batch of points to be evaluated. Such a selection criterion is known as an acquisition function in the terminology of Bayesian Optimization. Ideally, such an acquisition function would take into account the number of remaining evaluations that we can afford, e.g. by computing a solution via dynamic programming to construct an optimal sequence of policies for future batch selections. However, a probabilistic treatment of such a criterion is computationally intractable, involving multiple nested minimization-marginalization steps (Gonzalez et al., 2016b).

To avoid this computational complexity, myopic acquisition functions that only consider the one-step return are typically used instead. For example, one could choose to minimize the one-step Expected Improvement (described more fully in §1.1) over the best evaluation observed so far, or maximize the probability of having an improvement in the next batch over the best evaluation. Other criteria use ideas from the bandit (Desautels et al., 2014) and information theory (Shah and Ghahramani, 2015) literature. In other words, the intractability of the multistep lookahead problem has spurred instead the introduction of a wide variety of myopic heuristics for batch selection.

1.1 Expected improvement

We will focus on the (one-step) Expected Improvement criterion, which is a standard choice and has been shown to achieve good results in practice (Snoek et al., 2012). In order to give a formal description we first require some definitions related to the optimization procedure of the original process. At each step of the optimization procedure, define $\hat{y} = \big(f(\hat{x}_1), \dots, f(\hat{x}_n)\big)$ as the vector of past function values evaluated at the points $\hat{x}_1, \dots, \hat{x}_n$, and $X = (x_1, \dots, x_q)$ as a candidate set of points for the next batch of evaluations. Then the classical expected improvement acquisition function is defined as

$$\alpha(X) = \mathbb{E}\big[\min(y_1, \dots, y_q, \eta) - \eta\big], \qquad y \sim \mathcal{N}\big(\mu(X), \Sigma(X)\big), \tag{2}$$

where $\eta = \min_i \hat{y}_i$ is the element-wise minimum of $\hat{y}$, i.e. the minimum value of the function achieved so far by any known function input. In the above definition we assume perfect knowledge of $\hat{y}$, which implies a noiseless objective. A noisy objective requires the introduction of heuristics discussed in detail in Picheny et al. (2013). For the purposes of clarity, a noiseless objective is assumed for the rest of the document. This is not constraining, as most of the heuristics discussed in (Picheny et al., 2013) are compatible with the theoretical analysis presented in the rest of the paper.
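For concreteness, a brute-force Monte Carlo estimate of the expectation in (2), the $q$-dimensional integral that makes direct evaluation expensive, might look as follows (a sketch with our own naming, not the authors' code):

```python
import numpy as np

def qei_monte_carlo(mu, Sigma, eta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of (2): E[min(y_1, ..., y_q, eta) - eta]."""
    rng = np.random.default_rng(seed)
    y = rng.multivariate_normal(mu, Sigma, size=n_samples)  # (n, q) samples
    return np.mean(np.minimum(y.min(axis=1), eta) - eta)
```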

Selection of a batch of points to be evaluated with optimal expected improvement amounts to finding some $X^\star \in \arg\min_X \alpha(X)$. Unfortunately, direct evaluation of the acquisition function $\alpha$ requires the $q$-dimensional integration of a piecewise affine function; this is potentially a computationally expensive operation. This is particularly problematic for gradient-based optimization methods, wherein $\alpha$ may be evaluated many times when searching for a minimizer. Regardless of the optimization method used, such a minimizer must also be computed again for every step in the original optimization process, i.e. every time a new batch of points is selected for evaluation. Therefore a tractable acquisition function should be used. In contrast to (2), the acquisition function we will introduce in Section 2 avoids expensive integrations and can be calculated efficiently with standard software tools.

1.2 Related work

Despite the intractability of (2), Chevalier and Ginsbourger (2013) presented an efficient way of approximating $\alpha$, and Marmin et al. (2015) its derivative, by decomposing it into a sum of $q$-dimensional Gaussian cumulative distributions, which can be calculated efficiently using the seminal work of Genz and Bretz (2009). There are two issues with this approach: first, the number of calls to the $q$-dimensional Gaussian cumulative distribution grows quadratically with the batch size $q$; and secondly, there are no guarantees about the accuracy of the approximation or its gradient. Indeed, approximations of the multi-point expected improvement as calculated with the R package DiceOptim (Roustant et al., 2012) can exhibit points of failure in simple examples (see Figure 1 and Table 2). To avoid these issues, Gonzalez et al. (2016a) and Ginsbourger et al. (2009) rely on heuristics to derive a multi-point criterion. Both methods choose the batch points in a greedy, sequential way, which prevents them from exploiting the interactions between the batch points in a probabilistic manner. Finally, Parallel Predictive Entropy Search (PPES), a non-greedy information theoretic acquisition function, was derived by Shah and Ghahramani (2015).

2 Distributionally ambiguous optimization for Bayesian optimization

We now proceed to the main contribution of the paper. We draw upon ideas from the Distributionally Ambiguous Optimization community to derive a novel, tractable, acquisition function that lower bounds the expectation in (2). Our acquisition function:

  • is theoretically grounded;

  • is numerically reliable and consistent, unlike Expected Improvement-based alternatives (see §3);

  • is fast and scales well with the batch size; and

  • provides first and second order derivative information inexpensively.

In particular, we use the GP posterior (1) to determine the mean $\mu(X)$ and covariance $\Sigma(X)$ of $y$ given a candidate batch selection $X$, but we thereafter ignore the Gaussian assumption and consider only that $y$ has a distribution embedded within a family of distributions that share the mean and covariance calculated by the standard GP regression equations. In other words, we define

$$\mathcal{P}(\mu, \Sigma) := \Big\{ \mathbb{P} \;:\; \mathbb{E}_\mathbb{P}[\omega] = \mu, \ \mathbb{E}_\mathbb{P}\big[(\omega - \mu)(\omega - \mu)^T\big] = \Sigma \Big\},$$

the set of all distributions on $\mathbb{R}^q$ with mean $\mu$ and covariance $\Sigma$. We denote the set simply as $\mathcal{P}$ where the context is clear. Note in particular that $\mathcal{N}(\mu, \Sigma) \in \mathcal{P}(\mu, \Sigma)$ for any choice of mean $\mu$ or covariance $\Sigma$.

One can then construct lower and upper bounds for the Expected Improvement by minimizing or maximizing over the set $\mathcal{P}$ respectively, i.e. by writing

$$\inf_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_\mathbb{P}\big[g(\omega)\big] \;\le\; \mathbb{E}_{\mathcal{N}(\mu, \Sigma)}\big[g(\omega)\big] \;\le\; \sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_\mathbb{P}\big[g(\omega)\big] \tag{3}$$

where the random vector $\omega$ and the function $g$ are chosen such that $\mathbb{E}_{\mathcal{N}(\mu, \Sigma)}[g(\omega)] = \alpha(X)$, i.e. $\omega \sim y$ and

$$g(\omega) := \min(\omega_1, \dots, \omega_q, \eta) - \eta. \tag{4}$$

Observe that the middle term in (3) is equivalent to the expectation in (2).

Perhaps surprisingly, both of the bounds in (3) are computationally tractable even though they seemingly require optimization over the infinite-dimensional (but convex) set of distributions $\mathcal{P}$. For either case, these bounds can be computed exactly via transformation of the problem to a tractable, convex semidefinite optimization problem using distributionally ambiguous optimization techniques (Zymler et al., 2013).

We will focus on the lower bound in (3), hence adopting an optimistic modeling approach. This is because the upper bound is of limited use, as it can be shown to be trivial to evaluate and independent of $\Sigma$. Hence, we informally assume that the distribution of function values is such that it yields the lowest possible expectation compatible with the mean and covariance computed by the GP, which we put together in the second order moment matrix

$$M := \begin{bmatrix} \Sigma + \mu \mu^T & \mu \\ \mu^T & 1 \end{bmatrix} \tag{5}$$

of the posterior. We will occasionally write this explicitly as $M(X)$ to highlight the dependency of the second order moment matrix on $X$.
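Assembling (5) is a one-line operation; a small sketch (the function name is ours) that we reuse below:

```python
import numpy as np

def moment_matrix(mu, Sigma):
    """Assemble the second-order moment matrix M(X) of eq. (5)."""
    mu = np.asarray(mu, dtype=float)
    return np.block([[np.asarray(Sigma) + np.outer(mu, mu), mu[:, None]],
                     [mu[None, :],                          np.ones((1, 1))]])
```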

The following result says that the lower (i.e. optimistic) bound in (3) can be computed via the solution of a convex optimization problem whose objective function is linear in $M$:

Theorem 1. For any $M \succ 0$, the optimal value of the semi-infinite optimization problem

$$p(M) := \inf_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_\mathbb{P}\big[g(\omega)\big]$$

coincides with the optimal value of the following semidefinite program:

$$\begin{aligned} & \text{maximize} && \langle \Omega, M \rangle \\ & \text{s.t.} && \Omega \preceq C_i, \quad i = 0, \dots, q \end{aligned} \tag{O}$$

where $\Omega \in \mathbb{S}^{q+1}$ is the decision variable, and

$$C_0 := 0, \qquad C_i := \tfrac{1}{2}\big(e_i e_{q+1}^T + e_{q+1} e_i^T\big) - \eta\, e_{q+1} e_{q+1}^T, \quad i = 1, \dots, q, \tag{6}$$

are auxiliary matrices defined using $\eta$ and the standard basis vectors $e_i$ in $\mathbb{R}^{q+1}$.

Proof. See Appendix A.

Problem (O) is a semidefinite program (SDP). SDPs are convex optimization problems with a linear objective and convex conic constraints (i.e. constraints over the set of symmetric matrices $\mathbb{S}^n$ and the positive semidefinite/definite cones $\mathbb{S}_+^n$/$\mathbb{S}_{++}^n$), and can be solved to global optimality with standard software tools (Sturm, 1999; O'Donoghue et al., 2016a). We therefore propose the computationally tractable acquisition function

$$\alpha_{\mathrm{OEI}}(X) := p\big(M(X)\big),$$

which we will call Optimistic Expected Improvement (OEI), as it is an optimistic variant of the Expected Improvement function in (2).
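The following is a minimal sketch of evaluating OEI with CVXPY, under the constraint structure (6); the helper is ours and not the authors' implementation (which calls SCS directly, see Section 3):

```python
import cvxpy as cp
import numpy as np

def oei(mu, Sigma, eta):
    """Evaluate OEI by solving the SDP (O) for the posterior N(mu, Sigma)."""
    mu, Sigma = np.asarray(mu, dtype=float), np.asarray(Sigma, dtype=float)
    q = len(mu)
    # Second-order moment matrix M(X) of eq. (5)
    M = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                  [mu[None, :],              np.ones((1, 1))]])
    Omega = cp.Variable((q + 1, q + 1), symmetric=True)
    constraints = [Omega << 0]                 # Omega <= C_0 = 0
    for i in range(q):                         # Omega <= C_i of eq. (6)
        C = np.zeros((q + 1, q + 1))
        C[i, q] = C[q, i] = 0.5
        C[q, q] = -eta
        constraints.append(Omega << C)
    prob = cp.Problem(cp.Maximize(cp.trace(Omega @ M)), constraints)
    prob.solve(solver=cp.SCS)
    return prob.value, Omega.value             # p(M) and its gradient dp/dM

# Example: a 2-point batch with posterior mean above the incumbent eta = 0
value, grad = oei(mu=[0.1, 0.2], Sigma=np.array([[1.0, 0.3], [0.3, 1.0]]), eta=0.0)
```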

This computational tractability comes at the cost of inexactness in the bounds (3), which is a consequence of minimizing over a set of distributions containing the Gaussian distribution as just one of its members. Indeed, it can be proved that the minimizing distribution is discrete, with a finite number of possible outcomes, and can be constructed from the Lagrange multipliers of (O) (Parys, 2015). We show in §3 that this inexactness is of limited consequence in practice, and that it mainly renders the acquisition function more explorative. Nevertheless, there remains significant scope for tightening the bounds in (3) via imposition of additional convex constraints on the set $\mathcal{P}$, e.g. by restricting $\mathcal{P}$ to symmetric or unimodal distributions (Parys et al., 2015). Most of the results in this work would still apply, mutatis mutandis, if such structural constraints were to be included.

In contrast to the side-effect of inexactness, the distributional ambiguity is useful for integrating out the uncertainty of the GP's hyperparameters efficiently for our acquisition function. Given $s$ samples of the hyperparameters, resulting in second order moment matrices $M_1(X), \dots, M_s(X)$, we can estimate the second moment matrix of the marginalized, non-Gaussian, posterior as

$$\bar{M}(X) = \frac{1}{s} \sum_{i=1}^{s} M_i(X).$$

Due to the distributional ambiguity of our approach, both bounds of (3) can be calculated directly based on $\bar{M}(X)$, hence avoiding multiple calls to the acquisition function.
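Reusing moment_matrix from the sketch above, the marginalized moment matrix is just an average; a sketch:

```python
import numpy as np

def marginalized_moment_matrix(posteriors):
    """Average of per-sample moment matrices (5): the second moment of a
    mixture distribution is the mean of its components' second moments."""
    return np.mean([moment_matrix(mu_s, Sigma_s)
                    for mu_s, Sigma_s in posteriors], axis=0)
```

The resulting matrix can then be passed to the SDP (O) in place of $M(X)$, so a single solve covers all hyperparameter samples.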

Although the value of $\alpha_{\mathrm{OEI}}(X)$ for any fixed $X$ is computable via solution of an SDP, the non-convexity of the GP posterior (1) that defines the mapping $X \mapsto M(X)$ means that $\alpha_{\mathrm{OEI}}$ is still non-convex in $X$. This is unfortunate, since we ultimately wish to minimize $\alpha_{\mathrm{OEI}}$ in order to identify the next batch of points to be evaluated experimentally.

However, we can still optimize $\alpha_{\mathrm{OEI}}$ locally via non-linear programming. We will establish that a second order method is applicable by showing that $\alpha_{\mathrm{OEI}}$ is twice differentiable under mild conditions. Such an approach is also efficient, as the hessian of $\alpha_{\mathrm{OEI}}$ can be calculated inexpensively. To show these results we begin by considering the differentiability of $p$ as a function of $M$.

Theorem 2. The optimal value function $p$ defined in problem (O) is differentiable on its domain, with $\nabla p(M) = \Omega^\star(M)$, where $\Omega^\star(M)$ is the unique optimal solution of (O) at $M$.

Proof. See Appendix B.

The preceding result shows that $\nabla p(M)$ is produced as a byproduct of the evaluation of $p(M)$, since it is simply $\Omega^\star(M)$, the unique optimizer of (O). The simplicity of this result suggests consideration of second derivative information of $p$, i.e. derivatives of $\Omega^\star$. The following result proves that this is well defined and tractable for any $M \succ 0$:

Theorem 3. The optimal solution function $\Omega^\star$ of problem (O) is differentiable on $\mathbb{S}_{++}^{q+1}$. Every directional derivative of $\Omega^\star$ is the unique solution to a sparse linear system.

Proof. See Appendix C.

We can now consider the differentiability of $\alpha_{\mathrm{OEI}}$. The following Corollary establishes this under certain conditions.

Corollary 1. $\alpha_{\mathrm{OEI}}$ is twice differentiable at any $X$ for which $\Sigma(X) \succ 0$ and the mean and kernel functions of the underlying GP are twice differentiable.

Proof. By examining the GP regression equations (Rasmussen and Williams, 2005) and Equation (5), we conclude that $M(X)$ is twice differentiable in $X$ if the kernel and mean functions of the underlying Gaussian Process are twice differentiable. Hence, $\alpha_{\mathrm{OEI}} = p \circ M$ is twice differentiable at any such $X$ as a composition of twice differentiable functions. Examining (5) reveals that $M(X) \succ 0$ is equivalent to $\Sigma(X) \succ 0$, which concludes the proof.

A rank deficient $\Sigma(X)$ implies perfectly correlated outcomes. At these points both OEI and QEI are, in general, non-differentiable, as shown in Proposition D. This contradicts Marmin et al. (2015), which claims differentiability of QEI in the general setting. However, this is not constraining in practice, as both QEI and OEI can be calculated by considering a smaller, equivalent problem. It is also not an issue for descent based methods for minimizing $\alpha_{\mathrm{OEI}}$, as a descent direction can be obtained by an appropriate perturbation of the perfectly correlated points.

We are now in a position to derive expressions for the gradient and the hessian of $\alpha_{\mathrm{OEI}}$. For simplicity of notation we consider derivatives over the individual entries $X_{ij}$ of the batch $X$. Application of the chain rule to $\alpha_{\mathrm{OEI}} = p \circ M$ gives:

$$\frac{\partial \alpha_{\mathrm{OEI}}(X)}{\partial X_{ij}} = \left\langle \nabla p\big(M(X)\big),\ \frac{\partial M(X)}{\partial X_{ij}} \right\rangle = \left\langle \Omega^\star,\ \frac{\partial M(X)}{\partial X_{ij}} \right\rangle. \tag{7}$$

Note that the second term in the rightmost inner product above depends on the particular choice of covariance function $k$ and mean function $m$. It is straightforward to compute (7) in modern graph-based autodiff frameworks, such as the TensorFlow-based GPflow. Differentiating (7) again gives the hessian of $\alpha_{\mathrm{OEI}}$:

$$\frac{\partial^2 \alpha_{\mathrm{OEI}}(X)}{\partial X_{ij}\, \partial X_{kl}} = \left\langle \mathrm{D}\Omega^\star\!\left[\frac{\partial M(X)}{\partial X_{kl}}\right],\ \frac{\partial M(X)}{\partial X_{ij}} \right\rangle + \left\langle \Omega^\star,\ \frac{\partial^2 M(X)}{\partial X_{ij}\, \partial X_{kl}} \right\rangle \tag{8}$$

where $\mathrm{D}\Omega^\star[\Delta M]$ is the directional derivative of $\Omega^\star$ across the perturbation $\Delta M$. According to Theorem 3, each of these directional derivatives exists and can be computed via solution of a sparse linear system.
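To make (7) concrete, here is a minimal TensorFlow sketch assuming two hypothetical helpers: M_of_X, a differentiable construction of $M(X)$ from the GP posterior, and solve_sdp, a numerical solver for (O) such as the CVXPY sketch above. By Theorem 2 the SDP solution $\Omega^\star$ is the exact gradient $\partial p / \partial M$, so it can be treated as a constant while automatic differentiation handles $\partial M / \partial X$.

```python
import tensorflow as tf

def oei_value_and_grad(X, M_of_X, solve_sdp):
    """Gradient of OEI via eq. (7): treat the SDP solution Omega* as constant
    (Theorem 2: dp/dM = Omega*) and differentiate <Omega*, M(X)> w.r.t. X."""
    with tf.GradientTape() as tape:
        tape.watch(X)
        M = M_of_X(X)                              # differentiable moment matrix
        Omega = tf.constant(solve_sdp(M.numpy()))  # SDP solve; no gradient flows
        value = tf.reduce_sum(Omega * M)           # <Omega*, M(X)> = p(M) at optimum
    return value, tape.gradient(value, X)
```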

3 Empirical analysis

Key Description
OEI Optimistic Expected Improvement (Our novel algorithm)
QEI Multi-point Expected Improvement (Marmin et al., 2015)
QEI-CL Constant Liar (“mix” strategy) (Ginsbourger et al., 2010)
LP-EI Local Penalization Expected Improvement (Gonzalez et al., 2016a)
BLCB Batch Lower Confidence Bound (Desautels et al., 2014)
Table 1: List of acquisition functions

In this section we demonstrate the effectiveness of our acquisition function against a number of state-of-the-art alternatives. The acquisition functions we consider are listed in Table 1. We do not compare against PPES as it is substantially more expensive and elaborate than our approach and there is no publicly available implementation of this method.

We show that our acquisition function OEI achieves better performance than alternatives and highlight simple failure cases exhibited by competing methods. In making the following comparisons, extra care should be taken in the setup used. This is because Bayesian Optimization is a multifaceted procedure that depends on a collection of disparate elements (e.g. kernel/mean function choice, normalization of data, acquisition function, optimization of the acquisition function) each of which can have a considerable effect on the resulting performance (Snoek et al., 2012; Shahriari et al., 2016). For this reason we test the different algorithms on a unified testing framework, based on GPflow, available online at https://github.com/oxfordcontrol/Bayesian-Optimization.

Our acquisition function is evaluated via solution of a semidefinite program, and as such it benefits from the huge advances of the convex optimization field. A variety of standard tools exist for solving such problems, including MOSEK, SCS (O'Donoghue et al., 2016b) and CDCS (Zheng et al., 2017). We chose SCS, a freely-available first-order solver (Boyd and Vandenberghe, 2004), which scales well with batch size and allows for solver warm-starting between acquisition function evaluations.

Warm starting allows for a significant speedup, since the acquisition function is evaluated repeatedly at nearby points by the non-linear solver. This results in solving (O) repeatedly for similar $M$. Warm-starting the SDP solver with the previous solution substantially reduces SCS's iterations when performing the experiments of Figure 3. Moreover, Theorem 3 provides the means for a first-order warm starting: the derivative of the solution across the change of the cost matrix $M$ can be calculated, allowing us to take a step in the direction of the gradient and warm start from that point. This reduces SCS's iterations further.
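A sketch of this reuse pattern with CVXPY's warm_start flag (the first-order warm start via Theorem 3 is not shown); the parametrized problem and the stand-in values of $M$ are ours:

```python
import cvxpy as cp
import numpy as np

def make_oei_problem(q, eta):
    """Parametrized version of (O); on repeated solves CVXPY caches the
    compiled problem and SCS is warm-started from the previous solution."""
    M = cp.Parameter((q + 1, q + 1), symmetric=True)
    Omega = cp.Variable((q + 1, q + 1), symmetric=True)
    constraints = [Omega << 0]                  # Omega <= C_0 = 0
    for i in range(q):                          # Omega <= C_i, eq. (6)
        C = np.zeros((q + 1, q + 1))
        C[i, q] = C[q, i] = 0.5
        C[q, q] = -eta
        constraints.append(Omega << C)
    return cp.Problem(cp.Maximize(cp.trace(Omega @ M)), constraints), M

prob, M_param = make_oei_problem(q=3, eta=0.0)
for t in (0.0, 0.01, 0.02):              # nearby batches give similar M(X)
    M_param.value = np.eye(4) + t        # stand-in for M(X) along a search path
    prob.solve(solver=cp.SCS, warm_start=True)
```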

Indicative timing results for the calculation of OEI, QEI and their derivatives are listed in Table 2. The dominant operation for calculating OEI and its gradient is solving (O). This makes OEI much faster than QEI, in line with the per-iteration complexity of the dominant operation in SDP solvers based on first-order operator splitting methods, such as SCS or CDCS. Assume that, given the solution of (O), we want to also calculate the hessian of OEI. This would entail the following two operations:

1. Calculating the directional derivatives $\mathrm{D}\Omega^\star[\cdot]$ given $\Omega^\star$. According to Theorem 3, these can be obtained as solutions to sparse linear systems. We used Intel MKL PARDISO to solve these linear systems efficiently.

2. Calculating and applying the chain rules of (8) to get the hessian of the acquisition function, given the gradient and hessian of $M(X)$. We used TensorFlow's automatic differentiation for this part, without any effort to optimize its performance. Considerable speedups could be brought by e.g. running this part on a GPU, or automatically generating low-level code optimized specifically for this operation (Vasilache et al., 2018).

Note that the computational tractability of the hessian is only possible due to Theorem 3, which exploits the structure of (O)'s optimizer.


Batch sizes: 2, 3, 6, 10, 20, 40. Columns: QEI; OEI, split into solving (O) and the TensorFlow part.

Table 2: Average execution time of the acquisition function, its gradient and hessian when running BO on the Eggholder function on an Intel E5-2640v3 CPU. For batch size 40, QEI fails, i.e. it always returns 0 without any warning message. For the execution time of the hessian we assume knowledge of the solution of (O); its timing is split into two parts as described in the main text.

We chose the KNITRO v10.3 (Byrd et al., 2006) Sequential Quadratic Programming (SQP) non-linear solver with the default parameters for the optimization of OEI. Explicitly providing the hessian in the experiments of Figure 3 reduces KNITRO's iterations as compared to estimating the hessian via the symmetric-rank-one (SR1) update method included in the KNITRO suite. Given the inexpensiveness of calculating the hessian, and the fact that KNITRO requests the calculation of the hessian less than a third as often as it requires the objective function evaluation, we conclude that including the hessian is beneficial.
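KNITRO is a commercial solver; as an illustrative stand-in (not the paper's setup), the same second-order local optimization can be sketched with SciPy's trust-constr method, assuming hypothetical callables oei_value, oei_grad and oei_hess that implement (7) and (8) on the flattened batch vec(X):

```python
from scipy.optimize import minimize, Bounds

def optimize_batch(oei_value, oei_grad, oei_hess, x0, lower, upper):
    """Second-order local minimization of OEI over the flattened batch vec(X)."""
    res = minimize(oei_value, x0, jac=oei_grad, hess=oei_hess,
                   method="trust-constr", bounds=Bounds(lower, upper))
    return res.x, res.fun
```

As in the paper's setup, exploiting the exact Hessian pays off when the solver requests it far less often than objective evaluations.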

We will now present simulation results to demonstrate the performance of OEI in various scenarios.

3.1 Perfect modeling assumptions

We first demonstrate that the competing Expected Improvement based algorithms produce clearly suboptimal choices in a simple 1-dimensional example. We consider a 1-d Gaussian Process with a squared exponential kernel (Rasmussen and Williams, 2005) of fixed lengthscale, variance and noise, and a quadratic mean function. An example posterior given 10 observations is depicted in Figure 0(a). Given the GP and the 10 observations, we depict the optimal 3-point batch chosen by minimizing each acquisition function. Note that in this case we assume perfect modeling assumptions: the GP is completely known and representative of the actual process. We can observe in Figure 0(a) that the OEI choice is very close to the one proposed by QEI while being slightly more explorative, as OEI allows for the possibility of more exotic distributions than the Gaussian.

(a) Suggested 3-point batches of different algorithms for a GP posterior given 10 observations. The thick blue line depicts the GP mean, the light blue shading the uncertainty intervals, and the black dots the observations. The locations of the batch chosen by each algorithm are depicted with colored vertical lines at the bottom of the figure.
(b) Averaged one-step Expected Improvement on 200 GP posteriors of sets of 10 observations with the same generative characteristics (kernel, mean, noise) as the one in Figure 0(a), for different algorithms across varying batch size.

In contrast the LP-EI heuristic samples almost at the same point all three times. This can be explained as follows: LP-EI is based on a Lipschitz criterion to penalize areas around previous choices. However, the Lipschitz constant for this function is dominated by the end points of the function (due to the quadratic trend), which is clearly not suitable for the area of the minimizer (around zero), where the Lipschitz constant is approximately zero. On the other hand, QEI-CL favors suboptimal regions. This is because QEI-CL postulates outputs equal to the mean value of the observations which significantly alter the GP posterior.

Testing the algorithms on 200 different posteriors, generated by creating sets of 10 observations drawn from the previously defined GP, suggests OEI as the clear winner. For each of the 200 posteriors, each algorithm chooses a batch, the performance of which is evaluated by sampling the multipoint expected improvement (2). The averaged results are depicted in Figure 0(b). For a batch size of 1 all of the algorithms perform the same, except for OEI, which performs slightly worse due to the relaxation of Gaussianity. For batch sizes 2-3, QEI is the best strategy, while OEI is a very close second. For batch sizes 4-5, OEI performs significantly better. The deterioration of QEI's performance at batch sizes 4 and 5 is explained by Figure 1: after batch size 3, the calculation of QEI via the R package DiceOptim exhibits points of failure, leading to suboptimal choices. Figure 1 also explains the very good performance of OEI. Although always different from the sampled estimate, it is reliable and closely approximates the actual expected improvement in the sense that their optimizers and level sets are in close agreement.

Figure 1: Demonstration of a case of failure in the calculation of the Multi-Point Expected Improvement with the R package DiceOptim v2 (Roustant et al., 2012). The figure depicts how the expected improvement varies around a starting batch of 5 points when perturbing the selection along a specific direction. DiceOptim's calculation is almost always very close to the sampled estimate, except for a single breaking point. In contrast, our acquisition function OEI, although always different from the sampled estimate, is reliable and closely approximates QEI in the sense that their optimizers and level sets nearly coincide.

3.2 Synthetic functions

Figure 2: BO of batch size 5 on synthetic functions. Red, blue, green, yellow and black dots depict runs of OEI, LP-EI, BLCB, QEI and Random algorithms respectively. Diamonds depict the median regret for each algorithm.
Figure 3: BO of batch size 20 on the challenging Hartmann 6-d and Eggholder 2-d functions, where OEI exhibits clearly superior performance. Red, blue, green and black dots depict runs of the OEI, LP-EI, BLCB and Random algorithms respectively. Diamonds depict the median regret for each algorithm. Compare the above results with Figure 2 for the same runs with a smaller batch size. Note that QEI is not included in this case, as it does not scale to large batch sizes (see Table 2).

Next, we evaluate the performance of OEI in minimizing synthetic benchmark functions. The functions considered are: the Six-Hump Camel function, the 6-dimensional Hartmann function, and the Eggholder function, each on its standard domain. We compare the performance of OEI against QEI, LP-EI and BLCB, as well as random uniform sampling. The initial dataset consists of 10 random points for all the functions. A Matern 3/2 kernel is used for the GP modeling (Rasmussen and Williams, 2005). As all of the considered functions are noiseless, we set the likelihood variance to a fixed small number for numerical reasons. For the purpose of generality, the input domain of every function is scaled to the unit hypercube, while the observation dataset is normalized at each iteration. The same transformations are applied for QEI, LP-EI and BLCB for reasons of consistency. All the acquisition functions except OEI are optimized with the quasi-Newton L-BFGS-B algorithm (Fletcher, 1987) with 20 random restarts. We use point estimates for the kernel's hyperparameters, obtained by maximizing the marginal likelihood via L-BFGS-B restarted at 20 random initial points.
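A minimal sketch of the preprocessing described above; the exact normalization convention (zero mean, unit scale) is our assumption, as the text does not spell it out:

```python
import numpy as np

def to_unit_cube(X, lower, upper):
    """Affinely map inputs from the box [lower, upper] to the unit hypercube."""
    return (X - lower) / (upper - lower)

def from_unit_cube(U, lower, upper):
    """Inverse map, used before evaluating the true objective."""
    return lower + U * (upper - lower)

def normalize_outputs(y):
    """Rescale the observation dataset at each iteration."""
    return (y - np.mean(y)) / np.std(y)
```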

First, we consider a small-scale scenario of batch size 5. The results of 40 runs of Bayesian Optimization on a mixture of Cosines, the Six-Hump Camel, Eggholder, and 6-d Hartmann functions are depicted in Figure 2. In these functions, OEI approaches the minima faster than QEI and BLCB, while considerably outperforming LP-EI, which exhibits outliers with bad behavior. The explorative nature of OEI can be observed in the optimization of the Hartmann function: OEI quickly reaches the vicinity of the minimum, but then decides not to refine the solution further and instead explores the rest of the 6-d space.

Increasing the batch size to 20 for the challenging Eggholder and Hartmann functions shows a further advantage for OEI. Indeed, as we can observe in Figure 3, OEI successfully exploits the increased batch size. BLCB also improves its performance, though not to the extent of OEI. In contrast, LP-EI fails to manage the increased batch size. This is partially expected due to the heuristic nature of LP-EI: the Lipschitz constant estimated by LP-EI is rarely suitable for all 20 locations of the suggested batch. Even worse, LP-EI's performance deteriorates as compared to smaller batch sizes. LP-EI is plagued by numerical issues in the calculation of its gradient and suggests multiple re-evaluations of the same points. These repeated evaluations affect the GP modeling, resulting in inferior overall BO performance.

3.3 Real world example: Tuning a Reinforcement Learning Algorithm on various tasks

Figure 4: BO of batch size 20 on tuning PPO on a variety of robotic tasks. Red, blue, green and black dots depict runs of OEI, LP-EI, BLCB and Random algorithms respectively. Diamonds depict the median loss of all the runs for each algorithm.

Finally, we perform Bayesian Optimization to tune Proximal Policy Optimization (PPO), a state-of-the-art deep reinforcement learning algorithm that has been shown to outperform several policy gradient reinforcement learning algorithms (Schulman et al., 2017). The problem is particularly challenging, as deep reinforcement learning can be notoriously hard to tune, without any guarantees about convergence or performance. We use the implementation of Dhariwal et al. (2017), published by the authors of PPO, and tune the reinforcement learning algorithm on 4 OpenAI Gym tasks (Hopper, InvertedDoublePendulum, Reacher and InvertedPendulumSwingup) using the Roboschool robot simulator. We tune a set of 5 hyperparameters, listed in Table 3. We define as objective function the negative average reward per episode over the entire training period, which favors fast learning (Schulman et al., 2017). All of the other parameters are the same as in the original implementation (Schulman et al., 2017).

We run Bayesian Optimization with batch size of 20, with the same modeling, preprocessing and optimization choices as the ones used in the benchmark functions. The results of 20 runs are depicted in Figure 4. OEI outperforms BLCB, which performs comparably to Random search, and LP-EI, which exhibits high variance with occasional outliers stuck in inferior solutions.


Hyperparameter Range
Adam step-size
Clipping parameter
Batch size
Discount Factor
GAE parameter
Table 3: List of PPO’s Hyperparameters used for tuning. Items with asterisk are tuned in the log-space.

4 Conclusions

We have introduced a new acquisition function that is a tractable, probabilistic relaxation of the multi-point Expected Improvement, drawing ideas from the Distributionally Ambiguous Optimization community. Novel theoretical results allowed inexpensive calculation of first and second derivative information resulting in efficient Bayesian Optimization on large batch sizes.

This work was supported by the EPSRC AIMS CDT grant EP/L015987/1 and Schlumberger. The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (http://dx.doi.org/10.5281/zenodo.22558). Many thanks to Leonard Berrada for various useful discussions.

Appendix A Value of the Expected Improvement Lower Bound

In this section we provide a proof of the first of our main results, Theorem 1, which establishes that for any $M \succ 0$ one can compute the value of our optimistic lower bound function

$$p(M) = \inf_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_\mathbb{P}\big[g(\omega)\big] \tag{9}$$

via solution of a convex optimization problem in the form of a semidefinite program.

Proof of Theorem 1:

Recall that the set $\mathcal{P}$ is the set of all distributions with mean $\mu$ and covariance $\Sigma$. Following the approach of Zymler et al. (2013, Lemma 1), we first remodel problem (9) as:

$$\begin{aligned} & \text{inf} && \int_{\mathbb{R}^q} g(\omega)\, \nu(\mathrm{d}\omega) \\ & \text{s.t.} && \int_{\mathbb{R}^q} \begin{bmatrix} \omega \\ 1 \end{bmatrix} \begin{bmatrix} \omega \\ 1 \end{bmatrix}^T \nu(\mathrm{d}\omega) = M, \qquad \nu \in \mathcal{M}_+ \end{aligned} \tag{10}$$

where $\mathcal{M}_+$ represents the cone of nonnegative Borel measures on $\mathbb{R}^q$. The optimization problem (10) is a semi-infinite linear program, with infinite dimensional decision variable $\nu$ and a finite collection of linear equalities in the form of moment constraints.

As shown by Zymler et al. (2013), the dual of problem (10) has instead a finite dimensional set of decision variables and an infinite collection of constraints, and can be written as

$$\begin{aligned} & \text{sup} && \langle \Omega, M \rangle \\ & \text{s.t.} && \begin{bmatrix} \omega \\ 1 \end{bmatrix}^T \Omega \begin{bmatrix} \omega \\ 1 \end{bmatrix} \le g(\omega) \quad \forall \omega \in \mathbb{R}^q \end{aligned} \tag{11}$$

with decision variable $\Omega \in \mathbb{S}^{q+1}$ and $M$ the second order moment matrix of $\omega$ (see (5)). Strong duality holds between problems (10) and (11) for any $M \succ 0$, i.e. there is zero duality gap and their optimal values coincide.

The dual decision variables in (11) form a matrix of Lagrange multipliers for the constraints in (10) that is block decomposable as

$$\Omega = \begin{bmatrix} \Lambda & \lambda \\ \lambda^T & \lambda_0 \end{bmatrix},$$

where $\Lambda$ are multipliers for the second moment constraint, $\lambda$ multipliers for the mean value constraint, and $\lambda_0$ a scalar multiplier for the constraint that $\nu$ should integrate to $1$, i.e. that $\nu$ should be a probability measure.

For our particular problem we have $g(\omega) = \min(\omega_1, \dots, \omega_q, \eta) - \eta$ as defined in (4), so that (11) can be rewritten as

$$\begin{aligned} & \text{sup} && \langle \Omega, M \rangle \\ & \text{s.t.} && \begin{bmatrix} \omega \\ 1 \end{bmatrix}^T \Omega \begin{bmatrix} \omega \\ 1 \end{bmatrix} \le \omega_i - \eta \quad \forall \omega \in \mathbb{R}^q, \; i = 1, \dots, q, \\ & && \begin{bmatrix} \omega \\ 1 \end{bmatrix}^T \Omega \begin{bmatrix} \omega \\ 1 \end{bmatrix} \le 0 \quad \forall \omega \in \mathbb{R}^q. \end{aligned} \tag{12}$$

The infinite dimensional constraints in (12) can be replaced by the equivalent conic constraints

$$\Omega \preceq \tfrac{1}{2}\big(e_i e_{q+1}^T + e_{q+1} e_i^T\big) - \eta\, e_{q+1} e_{q+1}^T = C_i \qquad \text{and} \qquad \Omega \preceq 0 = C_0,$$

respectively, where $e_i$ are the standard basis vectors in $\mathbb{R}^{q+1}$. Substituting the above constraints in (12) results in (O), which proves the claim. ∎
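As a numerical sanity check of this result, reusing the oei and qei_monte_carlo sketches from earlier sections, the SDP value should lower-bound the Gaussian expectation for any valid posterior:

```python
import numpy as np

# The SDP value must lower-bound the Gaussian member of the ambiguity set.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
mu = rng.standard_normal(4)
Sigma = A @ A.T + 1e-3 * np.eye(4)          # a random valid posterior covariance
eta = -0.5
lower, _ = oei(mu, Sigma, eta)              # inf over all P in P(mu, Sigma)
gaussian = qei_monte_carlo(mu, Sigma, eta)  # the Gaussian member of the family
assert lower <= gaussian + 1e-3             # holds up to Monte Carlo error
```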

Appendix B Gradient of the Expected Improvement Lower Bound

In this section we provide a proof of our second main result, Theorem 2, which shows that the gradient¹ of our lower bound function in (3) with respect to $M$ coincides with the optimal solution of the semidefinite program (O). We will find it useful to exploit also the dual of this SDP, which we can write as

$$\begin{aligned} & \text{inf} && \textstyle\sum_{i=0}^{q} \langle C_i, Y_i \rangle \\ & \text{s.t.} && \textstyle\sum_{i=0}^{q} Y_i = M, \qquad Y_i \succeq 0, \quad i = 0, \dots, q, \end{aligned} \tag{13}$$

and the Karush-Kuhn-Tucker conditions for the pair of primal and dual solutions $\Omega^\star$, $(Y_i^\star)_{i=0}^{q}$:

$$\sum_{i=0}^{q} Y_i^\star = M \tag{14}$$
$$\Omega^\star \preceq C_i, \quad i = 0, \dots, q \tag{15}$$
$$Y_i^\star \big(C_i - \Omega^\star\big) = 0, \quad i = 0, \dots, q \tag{16}$$
$$Y_i^\star \succeq 0, \quad i = 0, \dots, q \tag{17}$$

where (14) is the stationarity condition $\nabla_\Omega \mathcal{L} = 0$ for the Lagrangian $\mathcal{L}$ of (O), and (15)-(17) are primal feasibility, complementary slackness and dual feasibility respectively.

¹Technically, the gradient is not defined, as $M$ is by definition symmetric. We can get around this technicality by a slight abuse of notation, allowing for a non-symmetric argument $M$ of $p$.

Before proving Theorem 2 we require three ancillary results. The first of these establishes that any feasible point of the optimization problem (O) has a strictly negative definite principal minor in the upper left corner.

Lemma B.1. For any feasible point of (O), partitioned as

$$\Omega = \begin{bmatrix} \Omega_{11} & \omega_{12} \\ \omega_{12}^T & \omega_{22} \end{bmatrix}, \qquad \Omega_{11} \in \mathbb{S}^q,$$

the upper left matrix $\Omega_{11}$ is negative definite.

Proof. Since each $C_i$ in (6) is nonzero only in its last row and column, the upper left block of each constraint $C_i - \Omega \succeq 0$ reads $-\Omega_{11} \succeq 0$. Assume, for the sake of contradiction, that $\Omega_{11} v = 0$ for some $v \ne 0$. Then $(v, 0)$ annihilates the quadratic form of each positive semidefinite matrix $C_i - \Omega$, so $(C_i - \Omega)(v, 0) = 0$ for every $i$. Reading off the last row of these identities gives $\omega_{12}^T v = \tfrac{1}{2} e_i^T v$ for $i = 1, \dots, q$ and, from the constraint with $C_0 = 0$, $\omega_{12}^T v = 0$. Together these give $e_i^T v = 0$ for all $i$, i.e. $v = 0$, since the $e_i$ span $\mathbb{R}^q$; a contradiction. Hence $\Omega_{11} \prec 0$.
The second ancillary result lists some useful properties of the dual solution.

Lemma B.2. The optimal Lagrange multipliers $Y_i^\star$ of (O) are of rank one, with $\mathrm{range}(Y_i^\star) = \mathrm{null}(C_i - \Omega^\star)$, where $\mathrm{null}$ and $\mathrm{range}$ denote the nullspace and the range of a matrix.

Proof. Lemma B.1 implies that the upper left block of $C_i - \Omega^\star$ equals $-\Omega_{11}^\star \succ 0$ (recall that $C_i$ is nonzero only in the last column or the last row), which means that $\mathrm{null}(C_i - \Omega^\star)$ is at most one-dimensional. Due to the complementary slackness condition (16), the range of $Y_i^\star$ is orthogonal to the range of $C_i - \Omega^\star$, and consequently $\mathrm{rank}(Y_i^\star) \le 1$. However, according to (14) and (17) we have

$$\sum_{i=0}^{q} Y_i^\star = M \succ 0,$$

which results in $\mathrm{rank}(Y_i^\star) = 1$ for every $i$, since the ranks of the $q + 1$ multipliers must sum to at least $\mathrm{rank}(M) = q + 1$. Using (16) again, the single eigenvector of each $Y_i^\star$ must lie in $\mathrm{null}(C_i - \Omega^\star)$, which proves the claim.

Our final ancillary result considers the (directional) derivative of the value function $p$ when its argument is varied linearly along some direction $\Delta M$.

Lemma B.3. Given any direction $\Delta M \in \mathbb{S}^{q+1}$ and any moment matrix $M \succ 0$, define the scalar function

$$\varphi(t) := p(M + t\, \Delta M).$$

Then $\varphi$ is differentiable at $t = 0$ with $\varphi'(0) = \langle \Omega^\star, \Delta M \rangle$, where $\Omega^\star$ is the optimal solution of (O) at $M$.

Proof. Define the set $\mathcal{M}$ as

$$\mathcal{M} := \big\{ M \in \mathbb{S}^{q+1} \;:\; p(M) > -\infty \big\},$$

i.e. the set of all $M$ for which problem (O) has a bounded solution. In order to prove the result it is then sufficient to show:

  (i) $M + t\, \Delta M \in \mathcal{M}$ for all $t$ sufficiently small, and

  (ii) the solution of (O) at $M$ is unique.

The remainder of the proof then follows from Goldfarb and Scheinberg (1999, Lemma 3.3), wherein it is shown that (i) implies that $\Omega^\star$ is a subgradient of $p$ at $M$, and subsequent remarks in Goldfarb and Scheinberg (1999) establish that uniqueness of the solution ensures differentiability.

We will now show that both of the conditions (i) and (ii) above are satisfied.

(i): Proof that $M + t\, \Delta M \in \mathcal{M}$ for all $t$ sufficiently small:

It is well-known that if both of the primal and dual problems (O) and (13) are strictly feasible then their optimal values coincide, i.e. Slater's condition holds and we obtain strong duality; see (Boyd and Vandenberghe, 2004, Section 5.2.3) and (Ramana et al., 1997).

For (O) it is obvious that one can construct a strictly feasible point. For (13), $Y_i = M / (q + 1)$, $i = 0, \dots, q$, defines a strictly feasible point for any $M \succ 0$. Hence (O) is solvable for $M + t\, \Delta M$ with $|t|$ sufficiently small. As a result, $M + t\, \Delta M \in \mathcal{M}$.

(ii): Proof that the solution to (O) at $M$ is unique:

We begin by showing that the ranges of the dual variables remain the same for every primal-dual solution. Assume that there exists another optimal primal-dual pair, denoted by $\tilde{\Omega}$ and $(\tilde{Y}_i)_{i=0}^{q}$. Due to Lemma B.2, there exist vectors $y_i, \tilde{y}_i$ such that

$$Y_i^\star = y_i y_i^T, \qquad \tilde{Y}_i = \tilde{y}_i \tilde{y}_i^T. \tag{18}$$

Obviously $y_i \ne 0$ and, by definition, we have

$$\big\langle Y_i^\star,\ C_i - \Omega^\star \big\rangle = 0, \quad i = 0, \dots, q. \tag{19}$$

Moreover, as $\tilde{\Omega}$ is feasible we have $\tilde{\Omega} \preceq C_i$, resulting in

$$\big\langle Y_i^\star,\ C_i - \tilde{\Omega} \big\rangle \ge 0, \quad i = 0, \dots, q. \tag{20}$$

Since $\Omega^\star$ and $\tilde{\Omega}$ have the same objective value, we conclude that $\sum_i \langle Y_i^\star, C_i - \tilde{\Omega} \rangle = \sum_i \langle Y_i^\star, C_i - \Omega^\star \rangle = 0$, so every term in (20) must vanish. Moreover, according to (17) and (18) we can decompose each term as $\langle Y_i^\star, C_i - \tilde{\Omega} \rangle = y_i^T (C_i - \tilde{\Omega})\, y_i = 0$. Hence $y_i$ is, like $\tilde{y}_i$, a null vector of $C_i - \tilde{\Omega}$. Since the null space of $C_i - \tilde{\Omega}$ is of dimension one (Lemma B.1), we get $\tilde{y}_i = c_i\, y_i$ for some $c_i \in \mathbb{R}$, resulting in $\mathrm{range}(\tilde{Y}_i) = \mathrm{range}(Y_i^\star)$.

Now we can show that the dual solution is unique. Assume that $\tilde{Y}_i \ne Y_i^\star$, i.e. $c_i^2 \ne 1$, for some $i$. Feasibility of $(Y_i^\star)$ and $(\tilde{Y}_i)$ gives

$$\sum_{i=0}^{q} y_i y_i^T = M = \sum_{i=0}^{q} c_i^2\, y_i y_i^T,$$

i.e. the matrices $y_i y_i^T$ are linearly dependent. This contradicts Lemma B.2 and (17), which suggest linear independence, as each $Y_i^\star$ is of rank one and the $q + 1$ vectors $y_i$ together span the whole space $\mathbb{R}^{q+1}$ (their outer products sum to $M \succ 0$). Hence $c_i^2 = 1$ for all $i$, i.e. the dual solution is unique.

Finally, the uniqueness of the primal solution can be established from the uniqueness of the dual solution. Indeed, summing (16) gives

$$M\, \Omega^\star = \sum_{i=0}^{q} Y_i^\star C_i,$$

so that $\Omega^\star = M^{-1} \sum_{i=0}^{q} Y_i^\star C_i$ is uniquely determined. ∎

Proof of Theorem 2:

Given the preceding supporting results of this section, we are now in a position to prove Theorem 2.

First, we show that $p$ is differentiable on its domain. Note that $p$ is convex due to (Rockafellar and Wets, 2009, Corollary 3.32) and hence continuous on $\mathrm{int}\,\mathcal{M}$ (Rockafellar and Wets, 2009, Theorem 2.35). Also, note that due to Lemma B.3 the regular directional derivatives (Rockafellar and Wets, 2009, Theorem 8.22) of $p$ exist and are a linear map of the direction. Hence, according to (Rockafellar and Wets, 2009, Theorem 9.18 (a, f)), $p$ is differentiable on $\mathrm{int}\,\mathcal{M}$, with $\nabla p(M) = \Omega^\star(M)$ by Lemma B.3.

Consider now the derivative of the solution of (O) when perturbing $M$ across a specific direction $\Delta M$, i.e. $M + t\, \Delta M$ with $t \to 0$. Lemma B.3 shows that $p(M + t\, \Delta M)$ is differentiable when $M + t\, \Delta M \succ 0$. The proof then follows element-wise from Lemma B.3 by choosing sparse symmetric directions $\Delta M$ with $\Delta M_{ij} = \Delta M_{ji}$ the only nonzero elements. ∎

Appendix C Derivative of the Optimal Solution

In this section we provide a constructive proof of Theorem 3, and show in particular that $\mathrm{D}\Omega^\star[\Delta M]$, the directional derivative of $\Omega^\star$ when perturbing $M$ linearly across a direction $\Delta M$, can be computed by the solution of a sparse linear system in the unknowns $\mathrm{svec}(\mathrm{D}\Omega^\star[\Delta M])$ and $\mathrm{D}v_i[\Delta M]$, $i = 0, \dots, q$, where

  • $v_i$ is defined such that $Y_i^\star = v_i v_i^T$, i.e. $v_i$ is the non-zero eigenvector of the Lagrange multiplier $Y_i^\star$, which was shown to be rank one in Lemma B.2;

  • $\mathrm{vec}(\cdot)$ is the operator that stacks the columns of a matrix into a vector, and $\mathrm{svec}(\cdot)$ the operator that stacks only the upper triangular elements in a similar fashion;

  • $U$ is the matrix that maps $\mathrm{vec}(A) = U\, \mathrm{svec}(A)$ for any symmetric matrix $A$, and $U^\dagger$ the matrix performing the inverse operation;

  • $\otimes$ and $\oplus$ denote the Kronecker product and sum respectively.

The uniqueness results of Appendix B guarantee that the solutions of (O) and (13) are unique for any $M \succ 0$. Hence, according to Freund and Jarre (2004), the directional derivatives of the primal-dual solution along the perturbation $\Delta M$ exist and are given as the unique solution of the following overdetermined system, obtained by differentiating the KKT conditions (14) and (16):

$$\sum_{i=0}^{q} \mathrm{D}Y_i[\Delta M] = \Delta M, \qquad \mathrm{D}Y_i[\Delta M]\,\big(C_i - \Omega^\star\big) - Y_i^\star\, \mathrm{D}\Omega^\star[\Delta M] = 0, \quad i = 0, \dots, q. \tag{21}$$

The above linear system is over-determined and has symmetric unknowns. This renders standard solution methods, such as LU decomposition, inapplicable. Moreover, expressing the above system in a standard matrix form results in a very large system matrix, which makes its direct solution costly.

To avoid these issues, we exploit Lemma B.2 of Appendix B to express the dual solution compactly as $Y_i^\star = v_i v_i^T$. One can choose a differentiable mapping $M \mapsto v_i$, e.g. $v_i = \sqrt{\lambda_i}\, u_i$, where $\lambda_i$ is the only positive eigenvalue of $Y_i^\star$ and $u_i$ its corresponding unit-norm eigenvector. Differentiability of $v_i$ follows from the differentiability of $\lambda_i$ and $u_i$ (Kato, 1976) and the positivity of $\lambda_i$ due to Lemma B.2. The chain rule then gives $\mathrm{D}Y_i[\Delta M] = \mathrm{D}v_i[\Delta M]\, v_i^T + v_i\, \mathrm{D}v_i[\Delta M]^T$. Hence (21) can be expressed as

$$\sum_{i=0}^{q} \big(\mathrm{D}v_i[\Delta M]\, v_i^T + v_i\, \mathrm{D}v_i[\Delta M]^T\big) = \Delta M, \tag{22}$$
$$\big(\mathrm{D}v_i[\Delta M]\, v_i^T + v_i\, \mathrm{D}v_i[\Delta M]^T\big)\big(C_i - \Omega^\star\big) = Y_i^\star\, \mathrm{D}\Omega^\star[\Delta M], \quad i = 0, \dots, q. \tag{23}$$

Exploiting $(C_i - \Omega^\star)\, v_i = 0$ from (16), and vectorizing via the operators defined above, then yields the sparse linear system stated at the beginning of this appendix.