BoTorch: Programmable Bayesian Optimization in PyTorch

10/14/2019 · by Maximilian Balandat, et al.

Bayesian optimization provides sample-efficient global optimization for a broad range of applications, including automatic machine learning, molecular chemistry, and experimental design. We introduce BoTorch, a modern programming framework for Bayesian optimization. Enabled by Monte-Carlo (MC) acquisition functions and auto-differentiation, BoTorch's modular design facilitates flexible specification and optimization of probabilistic models written in PyTorch, radically simplifying implementation of novel acquisition functions. Our MC approach is made practical by a distinctive algorithmic foundation that leverages fast predictive distributions and hardware acceleration. In experiments, we demonstrate the improved sample efficiency of BoTorch relative to other popular libraries. BoTorch is open source and available at https://github.com/pytorch/botorch.

1 Introduction

The realization that time- and resource-intensive machine learning (ML) experimentation could be addressed through efficient global optimization (O'Hagan, 1978; Jones et al., 1998; Osborne, 2010) was brought to the attention of the broader ML community by Snoek et al. (2012) and the accompanying library, Spearmint. In the years that followed, Bayesian optimization (BO) experienced an explosion of methodological advances for sample-efficient, general-purpose sequential optimization of black-box functions, with applications including hyperparameter tuning (Feurer et al., 2015; Shahriari et al., 2016; Wu et al., 2019), robotic control (Calandra et al., 2016; Antonova et al., 2017), chemical design (Griffiths & Hernández-Lobato, 2017; Li et al., 2017), and internet-scale ranking systems and infrastructure (Agarwal et al., 2018; Letham et al., 2019; Letham & Bakshy, 2019).

Meanwhile, the broader field of ML has been undergoing its own revolution, driven largely by advances in systems and hardware, including the development of libraries such as PyTorch, TensorFlow, Caffe, and MXNet (Paszke et al., 2017; Abadi et al., 2016; Jia et al., 2014; Chen et al., 2015). While BO has become rich with new methodologies, the programming paradigms supporting new innovations have generally not fully exploited these computational advances.

In this paper, we introduce BoTorch (https://github.com/pytorch/botorch), a flexible, modular, and scalable programming framework for Bayesian optimization research, built around modern paradigms of computation. Its contributions include:

Systems:


  • A set of flexible model-agnostic abstractions for Bayesian sequential decision making, together with a lightweight API for composing these primitives.

  • Robust implementations of models and acquisition functions for BO and active learning, including MC acquisition functions with support for parallel optimization, composite objectives, and asynchronous evaluation.


  • Scalable BO that utilizes modern numerical linear algebra from GPyTorch (Gardner et al., 2018), including sparse interpolation and structure-exploiting algebra (Wilson & Nickisch, 2015), and fast predictive distributions and sampling (Pleiss et al., 2018).

  • Auto-differentiation, GPU hardware acceleration, and integration with deep learning components via PyTorch.

Methodology:


  • A novel approach to optimizing Monte-Carlo (MC) acquisition functions that effectively combines with deterministic higher-order optimization algorithms.

  • A simple and flexible formulation of the Knowledge Gradient — a powerful look-ahead algorithm known for being difficult to implement — with improved performance over the state of the art (Wu & Frazier, 2016).

In the sections that follow, we demonstrate how BoTorch’s distinctive algorithmic approach, explicitly designed around differentiable programming and modern parallel processing capabilities (Sec. 3, 4, and 5), simplifies the development of complex acquisition functions (Sec. 6 and 9), while obtaining greater computational efficiency (Sec. 8) and optimization performance (Sec. 10).

2 Background and Related Work

In BO, we aim to solve the problem $\max_{x \in \mathbb{X}} f_{\mathrm{true}}(x)$, where $f_{\mathrm{true}}$ is an expensive-to-evaluate black-box function and $\mathbb{X} \subset \mathbb{R}^d$ is a feasible set. BO consists of two main components: a probabilistic surrogate model of the observed function – most commonly, a Gaussian process (GP) – and an acquisition function that encodes a strategy for navigating the exploration vs. exploitation trade-off (Shahriari et al., 2016).

Popular libraries for BO include Spearmint (Snoek et al., 2012), GPyOpt (The GPyOpt authors, 2016), Cornell-MOE (Wu & Frazier, 2016), RoBO (Klein et al., 2017), Emukit (The Emukit authors, 2018), and Dragonfly (Kandasamy et al., 2019). We provide further discussion of these packages in Appendix A.

ProBO (Neiswanger et al., 2019) and GPFlowOpt (Knudde et al., 2017) are of particular relevance. ProBO is a recently suggested framework (no implementation is available at the time of this writing) for using general probabilistic programming in BO. While its distribution-agnostic approach is similar to ours, ProBO, unlike BoTorch, does not benefit from gradient-based optimization provided by differentiable programming, or algebraic methods designed to benefit from GPU acceleration. GPFlowOpt inherits support for auto-differentiation and hardware acceleration from TensorFlow (via GPFlow (Matthews et al., 2017)), but unlike BoTorch does not use algorithms designed to specifically exploit this potential (e.g. fast matrix multiplications or batched computation). Neither ProBO nor GPFlowOpt naturally support parallel optimization, MC acquisition functions, or fast batch evaluation, which form the core of BoTorch.

In contrast to all existing libraries, BoTorch pursues novel algorithmic approaches specifically designed to benefit from modern computing in order to achieve its high degree of flexibility, programmability, performance, and scalability. Robust implementation and testing makes BoTorch suitable for use in both research and production settings.

3 BoTorch Architecture

BoTorch provides abstractions for combining BO primitives in a way that takes advantage of modern computing architectures, enabling BO with auto-differentiation, automatic parallelization, device-agnostic hardware acceleration, and generic neural network operators and modules. Implementing a new acquisition function can be as simple as defining a differentiable forward pass, after which gradient-based optimization comes for free. Figure 1 outlines what BoTorch considers the basic primitives of Bayesian sequential decision making:

Model: In BoTorch, the Model is a PyTorch module. Recent work has produced packages such as GPyTorch (Gardner et al., 2018) and Pyro (Bingham et al., 2018) that enable high-performance differentiable Bayesian modeling. Given those models, our focus here is on constructing acquisition functions and optimizing them effectively, using modern computing paradigms. BoTorch is model-agnostic — the only requirement for a model is that, given a set of inputs, it can produce posterior draws of one or more outcomes (explicit posteriors, such as those provided by a GP, can also be used directly). BoTorch provides seamless integration with GPyTorch and its algorithmic approaches specifically designed for scalable GPs, fast predictive distributions (Pleiss et al., 2018), and GPU acceleration.

Objective: An Objective is a module that applies a transformation to model outputs. For instance, it may scalarize the outputs of a multi-output model for multi-objective optimization (Paria et al., 2018), or it could handle black-box constraints by weighting the objective outcome with the probability of feasibility (Gardner et al., 2014).

Acquisition function: An AcquisitionFunction implements a forward pass that takes a candidate set $X = \{x_1, \ldots, x_q\}$ of design points and computes their utility $\alpha(X)$. Internally (indicated by the shaded block in Figure 1), MC acquisition functions operate on samples from the posterior evaluated under the objective, while analytic acquisition functions operate on the explicit posterior. The external API is the same for both.

Optimizer: Given the acquisition function, we must find a maximizer in order to proceed with the next iteration of BO. Auto-differentiation makes it straightforward to use gradient-based optimization even for complex acquisition functions and objectives, which typically performs better than derivative-free approaches.


Figure 1: High-level BoTorch primitives

4 Acquisition Functions

Figure 2: Monte Carlo acquisition functions in BoTorch. Samples from the posterior provided by the model at $X$ are evaluated in parallel and averaged as in (2). All operations are fully differentiable.

Suppose we have collected data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{X}$ and $y_i = f_{\mathrm{true}}(x_i) + \epsilon_i$, with some noise $\epsilon_i$ corrupting the true function value $f_{\mathrm{true}}(x_i)$. We allow $f_{\mathrm{true}}$ to be multi-output, in which case $y_i \in \mathbb{R}^M$. In some applications we may also have access to distributional information about the noise $\epsilon_i$, such as its (possibly heteroskedastic) (co-)variance. Suppose further that we have a surrogate model $f$ that for any set of points $X = \{x_1, \ldots, x_q\}$ provides a posterior $\mathbb{P}(f(X) \mid \mathcal{D})$ over $f(X) := (f(x_1), \ldots, f(x_q))$. In BO, the model traditionally is a GP, and the $\epsilon_i$ are assumed i.i.d. normal, in which case both $f(X) \mid \mathcal{D}$ and $y(X) \mid \mathcal{D}$ are multivariate normal. However, BoTorch makes no particular assumptions about the form of these posteriors.

The next step in BO is to optimize an acquisition function evaluated on the model posterior over the candidate set $X$. Following Wilson et al. (2018), many myopic acquisition functions can be written as

$$\alpha(X; \Phi, \mathcal{D}) = \mathbb{E}\bigl[ a(g(\xi), \Phi) \,\big|\, \xi \sim \mathbb{P}(f(X) \mid \mathcal{D}) \bigr], \tag{1}$$

where $g: \mathbb{R}^{q \times M} \to \mathbb{R}^q$ is a (composite) objective function, $\Phi$ are parameters independent of $X$, and $a$ is a utility function that defines the acquisition function.

4.1 Analytic Acquisition Functions

In some simple situations, the expectation in (1) and its gradient can be computed analytically. The classic case considered in the literature is a single-output ($M = 1$) model with a single candidate ($q = 1$) point $x$, a Gaussian model posterior $f(x) \mid \mathcal{D} \sim \mathcal{N}(\mu(x), \sigma^2(x))$, and the identity objective $g(\xi) = \xi$. Expected Improvement (EI) is a popular acquisition function that maximizes the expected difference between the currently observed best value $f^*$ and the objective at the next query point, through the utility $a(\xi, f^*) = \max(\xi - f^*, 0)$. EI and its gradient have a well-known analytic form (Jones et al., 1998). BoTorch implements popular analytic acquisition functions.
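For intuition, with $z = (\mu(x) - f^*)/\sigma(x)$ the closed form is $\mathrm{EI}(x) = \sigma(x)\,[z\,\Phi(z) + \varphi(z)]$, where $\Phi$ and $\varphi$ denote the standard normal CDF and PDF. Below is a minimal PyTorch sketch of this formula (the function name is ours; BoTorch ships its own ExpectedImprovement class):

    import torch
    from torch.distributions import Normal

    def analytic_ei(mu: torch.Tensor, sigma: torch.Tensor, best_f: float) -> torch.Tensor:
        # EI(x) = sigma(x) * [z * Phi(z) + phi(z)], z = (mu(x) - best_f) / sigma(x)
        standard_normal = Normal(torch.zeros_like(mu), torch.ones_like(mu))
        z = (mu - best_f) / sigma
        return sigma * (z * standard_normal.cdf(z) + standard_normal.log_prob(z).exp())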

4.2 Monte-Carlo Acquisition Functions

In general, analytic expressions are not available for more general objective functions $g$, utility functions $a$, non-Gaussian model posteriors, or collections of points $X$ that are to be evaluated in a parallel or asynchronous fashion (Snoek et al., 2012; Wu & Frazier, 2016; Wang et al., 2016a; Wilson et al., 2017). Instead, using samples from the posterior, Monte Carlo integration can be used to approximate the expectation (1), as well as its gradient via the reparameterization trick (Kingma & Welling, 2013; Wilson et al., 2018).

An MC approximation $\hat{\alpha}_N$ of the acquisition function (1) using $N$ samples is straightforward:

$$\hat{\alpha}_N(X; \Phi, \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} a(g(\xi_i), \Phi), \qquad \xi_i \sim \mathbb{P}(f(X) \mid \mathcal{D}). \tag{2}$$

MC acquisition functions enable modularity by naturally handling a much broader class of models and objectives than could be expressed analytically. In particular, we show how MC acquisition functions may be used to support:

Arbitrary posteriors: Analytic acquisition functions are usually defined for univariate normal posteriors. In contrast, MC acquisition functions can be directly applied to any posterior from which samples can be drawn, including probabilistic programs (Tran et al., 2017; Bingham et al., 2018), Bayesian neural networks (Neal, 1996; Saatci & Wilson, 2017; Maddox et al., 2019; Izmailov et al., 2019), and more general types of Gaussian processes (Damianou & Lawrence, 2013; Gardner et al., 2018).

Arbitrary objectives and constraints: MC acquisition functions can be used with arbitrary composite objective functions $g$ of the outcome(s). This is not the case for analytic acquisition functions, as the transformation will generally change the distributional form of the posterior. This setup also enables generic support for including unknown (and to-be-modeled) outcome constraints by using a feasibility-weighted improvement criterion similar to Schonlau et al. (1998); Gardner et al. (2014); Gelbart et al. (2014); Letham et al. (2019), implemented in a differentiable fashion by approximating the feasibility indicator with a sigmoid function on the sample level.

Parallel evaluation: Wilson et al. (2017) show how MC acquisition functions can be used to generalize a number of acquisition functions to the parallel BO setting, where $q > 1$ evaluations are made simultaneously. In BoTorch, MC acquisition functions by default consider multiple candidate points, and have a consistent interface for any value of $q$.

Asynchronous evaluation: Since MC acquisition functions compute the joint acquisition value for a set of points, asynchronous candidate generation, in which a set $X_{\mathrm{pend}}$ of pending points has been submitted for evaluation but has not yet returned results, is easily handled in a generic fashion by automatically concatenating $X_{\mathrm{pend}}$ into the arguments of an acquisition function. We thus compute the joint acquisition value of all points $X \cup X_{\mathrm{pend}}$, pending and new, but optimize only with respect to the new candidate points $X$.

Fig. 2 summarizes the chain of operations involved in evaluating an MC acquisition function in BoTorch.
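As an illustration of this pattern, a bare-bones parallel EI in the style of (2) could be written as follows (a simplified sketch, not BoTorch's actual qExpectedImprovement; model, sampler, and objective follow the abstractions of Section 5):

    import torch
    from torch import Tensor

    class SketchQExpectedImprovement:
        # Minimal sketch of an MC acquisition function following Figure 2 and (2).
        def __init__(self, model, sampler, objective, best_f: float):
            self.model, self.sampler = model, sampler
            self.objective, self.best_f = objective, best_f

        def __call__(self, X: Tensor) -> Tensor:
            posterior = self.model.posterior(X)    # joint posterior at the q points
            samples = self.sampler(posterior)      # N x q x M posterior samples xi_i
            obj = self.objective(samples)          # N x q objective values g(xi_i)
            improvement = (obj - self.best_f).clamp_min(0)     # utility a(g(xi), f*)
            return improvement.max(dim=-1).values.mean(dim=0)  # joint q-EI, MC average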

4.3 Computing Gradients

Effectively optimizing the acquisition function $\hat{\alpha}_N$, especially in higher dimensions, typically requires using gradient information. In many cases, an unbiased estimate of $\nabla_X \hat{\alpha}_N(X)$ can be obtained via the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014). The basic idea is that a sample $\xi$ from $\mathbb{P}(f(X) \mid \mathcal{D})$ can be expressed as a suitable (differentiable) deterministic transformation $\xi = h(X, \epsilon)$ of an auxiliary base sample $\epsilon$ drawn from some base distribution independent of $X$. If $h$ is differentiable with respect to $X$, then $\nabla_X a(g(h(X, \epsilon)))$ follows from the chain rule. For instance, if $f(X) \mid \mathcal{D} \sim \mathcal{N}(\mu(X), \Sigma(X))$, then $h(X, \epsilon) = \mu(X) + L(X)\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and $L(X)$ a root decomposition of $\Sigma(X)$ such that $L(X)L(X)^T = \Sigma(X)$. In BoTorch, these chains of derivatives are handled by auto-differentiation.

4.4 Variance Reduction via qMC Sampling

Quasi-Monte Carlo (qMC) methods are an established technique for reducing the variance of MC-integration (Caflisch, 1998). Instead of drawing i.i.d. samples from the base distribution, in many cases one can instead use qMC sampling to significantly lower the variance of the estimate (2) and its gradient (see Appendix B for additional details). BoTorch by default uses scrambled Sobol sequences (Owen, 2003) as the basis for evaluating MC acquisition functions.
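For reference, normal qMC base samples can be generated with PyTorch's built-in SobolEngine and an inverse-CDF transform (a sketch of the idea; BoTorch wraps this in its samplers):

    import torch
    from torch.quasirandom import SobolEngine

    N, q, M = 128, 2, 1                          # MC samples, candidates, outcomes
    engine = SobolEngine(dimension=q * M, scramble=True, seed=0)
    u = engine.draw(N).to(torch.double)          # scrambled Sobol points in (0, 1)
    normal = torch.distributions.Normal(0.0, 1.0)
    base_samples = normal.icdf(u).view(N, q, M)  # qMC normal base samples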

5 API Abstractions

BoTorch provides a consistent, model-agnostic interface through the following API functions:

Model:


  • posterior(X, observation_noise=False): Given a candidate set $X$, return a Posterior object representing the joint posterior $\mathbb{P}(f(X) \mid \mathcal{D})$ (if observation_noise=True, return the posterior predictive $\mathbb{P}(y(X) \mid \mathcal{D})$ instead).

  • fantasize(X, sampler): Given a candidate set $X$ and an MCSampler sampler, construct a batched fantasy model over $N_f$ fantasies (specified in the sampler), i.e. a set of models $\{f_i\}_{i=1}^{N_f}$, where $f_i$ is conditioned on $\mathcal{D} \cup \{(X, y_i)\}$ with $y_i$ a sample from the posterior predictive distribution at $X$.

Posterior:


  • rsample(sample_shape, base_samples): Given base samples $\epsilon$, draw sample_shape samples from the joint posterior over the $q$ points, with $M$ outcomes each.

MCSampler:


  • forward(p): Given an MCSampler object sampler and a Posterior object p, use sampler(p) to draw samples from p by automatically constructing appropriate base samples $\epsilon$. If sampler is constructed with resample=False, do not re-construct base samples between evaluations, ensuring that posterior samples are deterministic (conditional on the base samples).

MCObjective:


  • forward(samples): Compute the (composite) objective of a number of (multi-dimensional) samples drawn from a (multi-output) Posterior.
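Putting these pieces together, evaluating the chain from Figure 2 reduces to a few calls. The sketch below uses toy data; SingleTaskGP, SobolQMCNormalSampler, and GenericMCObjective are real BoTorch components, though exact signatures may differ across versions:

    import torch
    from botorch.models import SingleTaskGP
    from botorch.sampling import SobolQMCNormalSampler
    from botorch.acquisition.objective import GenericMCObjective

    train_X = torch.rand(20, 2, dtype=torch.double)
    train_Y = train_X.sum(dim=-1, keepdim=True)           # toy observations
    model = SingleTaskGP(train_X, train_Y)                # Model

    X = torch.rand(3, 2, dtype=torch.double)              # q = 3 candidate points
    posterior = model.posterior(X)                        # Posterior P(f(X) | D)
    sampler = SobolQMCNormalSampler(256, resample=False)  # MCSampler, fixed base samples
    samples = sampler(posterior)                          # 256 x 3 x 1 posterior samples
    objective = GenericMCObjective(lambda Y, X=None: Y.squeeze(-1))  # MCObjective
    obj = objective(samples)                              # 256 x 3 objective values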

6 Implementation Examples

To demonstrate the modularity and flexibility of BoTorch, we show how both existing approaches and novel acquisition functions can be implemented succinctly and easily using the API abstractions introduced in the previous section.

6.1 Composite Objectives

In many applications of BO, the objective may have some known functional form, such as the mean squared error between simulator outputs and empirical data in the context of simulation calibration. Astudillo & Frazier (2019) show that modeling the individual components of the objective can be advantageous. They propose the use of composite functions and develop an MC-based variant of EI that achieves improvements in sample efficiency. Such composite functions are a special case of BoTorch's Objective abstraction, and can be readily implemented as such.

We consider the model calibration example from Astudillo & Frazier (2019), where the goal is to minimize the error between the output of a model of pollutant concentrations and observed concentrations at 12 locations. Code Example 1 shows how the setup from Astudillo & Frazier (2019) can be extended to work with the Knowledge Gradient (KG), a sample-efficient acquisition function that, so far, has not been used with composite objectives. This is achieved simply by passing a multi-output BoTorch model modeling the individual components, pending points X_pending, and an objective callable via the GenericMCObjective module.

qKG = qKnowledgeGradient(  # construct KG with a composite objective
    model=model,
    objective=GenericMCObjective(
        lambda Y: -(Y - c_obs).pow(2).sum(dim=-1)
    ),
    X_pending=X_pending,
)
Code Example 1 Model calibration via Objectives

Figure 3 presents results for random search, EI, and KG, for both the standard and composite function (CF) implementations (we consider the case $q = 1$ here; additional results for parallel optimization are provided in Appendix E.1).


Figure 3: Composite function (CF) optimization ($q = 1$), showing log regret evaluated at the maximizer of the posterior mean, averaged over 250 trials. The CF variant of BoTorch's Knowledge Gradient algorithm, OKG-CF, achieves superior performance compared to EI-CF from Astudillo & Frazier (2019).

6.2 Parallel Noisy Expected Improvement

Letham et al. (2019) introduce Noisy EI (NEI), an extension of EI that is well-suited for settings with noisy function evaluations, such as A/B tests. In this setting, it is important to account for the uncertainty in the best function value at the points observed so far. That is, the incumbent best value $f^* = \max_i f(x_i)$ is itself a random variable correlated with $f(X)$.

Here, we propose a novel full MC formulation of NEI that extends Letham et al. (2019) to joint parallel optimization and generic objectives. Our implementation avoids the need to characterize $f^*$ explicitly by averaging improvements over samples from the joint posterior over new and previously evaluated points:

$$\alpha_{\mathrm{NEI}}(X) = \frac{1}{N} \sum_{i=1}^{N} \Bigl[ \max g(\xi_i^X) - \max g(\xi_i^{X_{\mathrm{obs}}}) \Bigr]^+, \tag{3}$$

where $(\xi_i^X, \xi_i^{X_{\mathrm{obs}}}) \sim \mathbb{P}(f(X \cup X_{\mathrm{obs}}) \mid \mathcal{D})$. The implementation of (3) is given in Code Example 2 (here and in following examples, we omit the constructor for brevity; full implementations are provided in Appendix E.4). X_baseline is an appropriate subset $X_{\mathrm{obs}}$ of the points at which the function was observed.

We achieve support for asynchronous evaluation by automatically concatenating pending points into $X$ (via the @concatenate_pending_points decorator). We can incorporate constraints on modeled outcomes, as discussed in Letham et al. (2019), using a differentiable relaxation in the objective $g$, as we describe in Section 4.2.

    @concatenate_pending_points
    @t_batch_mode_transform()
    def forward(self, X: Tensor) -> Tensor:
        q = X.shape[-2]
        X_bl = match_batch_shape(self.X_baseline, X)
        X_full = torch.cat([X, X_bl], dim=-2)
        posterior = self.model.posterior(X_full)
        samples = self.sampler(posterior)
        obj = self.objective(samples)
        obj_n = obj[...,:q].max(dim=-1)[0]
        obj_p = obj[...,q:].max(dim=-1)[0]
        return (obj_n - obj_p).clamp_min(0).mean(dim=0)
Code Example 2 Parallel Noisy Expected Improvement
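For completeness, a constructor consistent with Code Example 2 might look roughly as follows; this is our hypothetical reconstruction (the actual one is in Appendix E.4 and the BoTorch source):

    def __init__(self, model, X_baseline, sampler, objective, X_pending=None):
        # Hypothetical sketch: store the components used in forward above.
        super().__init__(model=model)
        self.sampler = sampler          # MCSampler (fixed base samples)
        self.objective = objective      # MCObjective mapping samples to utilities
        self.register_buffer("X_baseline", X_baseline)  # observed points
        self.X_pending = X_pending      # pending points, used by the decorator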

6.3 Active Learning

While BO aims to identify the extrema of a black-box function, the goal in active learning (e.g., Settles, 2009) is to explore the design space in a sample-efficient way in order to reduce model uncertainty in a more global sense. For example, one can maximize the negative integrated posterior variance (NIPV) (Seo et al., 2000; Chen & Zhou, 2014) of the model:

$$\alpha_{\mathrm{NIPV}}(X) = -\int_{\mathbb{X}} \mathrm{Var}\bigl( f(x) \mid \mathcal{D} \cup \mathcal{D}_X \bigr)\, dx. \tag{4}$$

Here $\mathcal{D}_X$ denotes the additional (random) set of observations collected at a candidate set $X$. The integral in (4) must be evaluated numerically, for which we use MC integration.

We can implement (4) using standard BoTorch components, as shown in Code Example 3. Here mc_points is the set of points used for MC-approximating the integral. In the most basic case, one can use qMC samples drawn uniformly in $\mathbb{X}$. By allowing for arbitrary mc_points, we permit weighting regions of $\mathbb{X}$ using non-uniform sampling. Using mc_points as samples of the maximizer of the posterior, we recover the recently proposed Posterior Variance Reduction Search (Nguyen et al., 2017) for BO.

    @concatenate_pending_points
    @t_batch_mode_transform()
    def forward(self, X: Tensor) -> Tensor:
        fant_model = self.model.fantasize(
            X=X, sampler=self._dummy_sampler,
            observation_noise=True
        )
        sz = [1] * len(X.shape[:-2]) + [-1, X.size(-1)]
        mc_points = self.mc_points.view(*sz)
        with settings.propagate_grads(True):
            posterior = fant_model.posterior(mc_points)
        ivar = posterior.variance.mean(dim=-2)
        return -ivar.view(X.shape[:-2])
Code Example 3 Active Learning (NIPV)

This acquisition function supports both parallel selection of points and asynchronous evaluation. Since MC integration requires evaluating the posterior variance at a large number of points, this acquisition function benefits significantly from the fast predictive variance computations in GPyTorch (Pleiss et al., 2018; Gardner et al., 2018).

To illustrate how NIPV may be used in combination with scalable probabilistic modeling, we examine the problem of efficiently allocating surveys across a geographic region. Inspired by Cutajar et al. (2019), we utilize publicly available data from the Malaria Atlas Project (2019), which includes the yearly mean parasite rate (along with standard errors) of Plasmodium falciparum on a spatial grid across Africa. In particular, we consider the following active learning problem: given a spatio-temporal probabilistic model fit to data from 2011-2016, which geographic locations in and around Nigeria should one sample in 2017 in order to minimize the model's error for 2017 across all of Nigeria?

We fit a heteroskedastic GP model to 2500 training points prior to 2017 (using a noise model that is itself a GP fit to the provided standard errors). We then select sample locations for 2017 using the NIPV acquisition function, and make predictions across the entirety of Nigeria using this new data. Compared to using no 2017 data, we find that our new dataset reduces MSE by 16.7% on average (SEM = 0.96%) across 60 subsampled datasets. By contrast, sampling the new 2017 points on a regularly spaced grid results in only a 12.4% reduction in MSE (SEM = 0.99%). The mean relative improvement in MSE reduction from NIPV optimization is 21.8% (SEM = 6.64%). Figure 4 shows the NIPV-selected locations on top of the base model's estimated parasite rate and standard deviation.



Figure 4: Locations for 2017 samples produced by NIPV. Observe how the samples cluster in higher variance areas.

7 Optimizing Acquisition Functions

BoTorch uses differentiable programming to automatically evaluate chains of derivatives when computing gradients (such as the gradient of $\hat{\alpha}_N$ from Figure 2). This functionality enables gradient-based optimization without having to implement gradients for each step in the chain (the samples drawn from the posterior must be differentiable w.r.t. the input $X$, as is the case with all models provided by GPyTorch). Normally, any change to the model, objective, or acquisition function would require manually deriving and implementing the new gradient computation. In BoTorch, that time-consuming and error-prone process is unnecessary, making it much easier to apply gradient-based acquisition function optimization to new formulations in a modular fashion, greatly improving research efficiency.

Evaluating MC acquisition functions yields noisy, unbiased gradient estimates. Thus maximization of these acquisition functions can be conveniently implemented in BoTorch using standard first-order stochastic optimization algorithms such as SGD (Wilson et al., 2018) available in the torch.optim module. However, it is often preferable to solve a biased, deterministic optimization problem instead: Rather than re-drawing the base samples for each evaluation, we can draw the base samples once, and hold them fixed between evaluations of the acquisition function. The resulting MC estimate, conditioned on the drawn samples, is deterministic and differentiable.

To optimize the acquisition function, one can now employ the full toolbox of deterministic optimization, including quasi-Newton methods that provide faster convergence speeds and are generally less sensitive to optimization hyperparameters than stochastic first-order methods. To this end, BoTorch provides an interface with support for any algorithm in the scipy.optimize module.

By default, we use multi-start optimization via L-BFGS-B in conjunction with an initialization heuristic that exploits fast batch evaluation of acquisition functions (see Appendix D for details).

We find that the added bias from fixing the base samples only has a minor effect on the performance relative to using the analytic ground truth, and can improve performance relative to stochastic optimizers that do not fix the base samples (Appendix B), while avoiding tedious tuning of optimization hyperparameters such as learning rates.
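A minimal version of this pattern is sketched below, assuming the acquisition function's sampler was constructed with resample=False so that evaluations are deterministic (the helper name is ours; BoTorch's built-in optimizers encapsulate this logic):

    import numpy as np
    import torch
    from scipy.optimize import minimize

    def optimize_acqf_lbfgsb(acqf, x0: torch.Tensor, bounds):
        # Deterministic L-BFGS-B optimization of an MC acquisition function
        # whose base samples are held fixed between evaluations.
        shape = x0.shape

        def f_and_grad(x: np.ndarray):
            X = torch.tensor(x, dtype=torch.double).view(shape).requires_grad_(True)
            loss = -acqf(X).sum()  # negate: scipy minimizes
            loss.backward()
            return loss.item(), X.grad.reshape(-1).numpy()

        res = minimize(f_and_grad, x0.reshape(-1).numpy(), jac=True,
                       method="L-BFGS-B", bounds=bounds)
        return torch.from_numpy(res.x).view(shape)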

8 Exploiting Parallelism and Hardware Acceleration

BoTorch is the only Bayesian optimization framework with inference methods specifically designed for hardware acceleration and fast predictive distributions. In particular, BoTorch works with GPyTorch models that employ new stochastic Krylov subspace methods for inference, which perform all computations through matrix multiplication. Compared to the standard Cholesky-based approaches in all other available packages, these methods can be parallelized and massively accelerated on modern hardware such as GPUs (Gardner et al., 2018), with additional test-time scalability benefits (Pleiss et al., 2018) that are particularly relevant to online settings such as Bayesian optimization.

Figure 5: Wall times for batched evaluation of qEI
Figure 6: Speedups from using fast predictive distributions

8.1 Batch Evaluation

Batch evaluation, an important element of modern computing, enables automatic dispatch of independent operations across multiple computational resources (e.g. CPU and GPU cores) for parallelization and memory sharing. All BoTorch components support batch evaluation, which makes it easy to write concise and highly efficient code in a platform-agnostic fashion.

Specifically, batch evaluation provides fast queries of acquisition functions at a large number of candidate sets in parallel, enabling novel initialization heuristics and optimization techniques. Instead of sequentially evaluating an acquisition function at candidate sets $X_1, \ldots, X_t$, where $X_k \in \mathbb{R}^{q \times d}$ for each $k$, BoTorch evaluates a single batched tensor $\mathbf{X} \in \mathbb{R}^{t \times q \times d}$. Computation is automatically distributed so that, depending on the hardware used, speedups can be close to linear in the batch size $t$. Batch evaluation is also heavily used in computing MC acquisition functions, with the effect that increasing the number of MC samples up to several thousands often has little impact on wall time.
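As a concrete illustration, a single call scores many candidate sets at once (a sketch with toy data; qExpectedImprovement and SingleTaskGP are real BoTorch classes):

    import torch
    from botorch.models import SingleTaskGP
    from botorch.acquisition import qExpectedImprovement

    train_X = torch.rand(20, 6, dtype=torch.double)
    train_Y = train_X.sum(dim=-1, keepdim=True)
    model = SingleTaskGP(train_X, train_Y)
    acqf = qExpectedImprovement(model, best_f=train_Y.max().item())

    t, q, d = 1024, 2, 6
    X_batch = torch.rand(t, q, d, dtype=torch.double)  # t candidate sets of q points
    values = acqf(X_batch)                             # one batched call; shape: t
    best_set = X_batch[values.argmax()]                # most promising set, q x d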

Figure 5 reports wall times for batch evaluation of qExpectedImprovement as a function of $t$ for different MC sample sizes $N$, on both CPU and GPU for a GPyTorch GP. (Reported times do not include one-time computation of the test caches, which need to be computed only once per BO step. Evaluation was performed on a Tesla M40 GPU; on more recent hardware we observe even more significant speedups.) We observe significant speedups from running on the GPU, with scaling essentially linear in the batch size, except for very large $t$ and $N$. There is a fixed cost due to communication overhead that renders CPU evaluation faster for small batch and sample sizes.

8.2 Fast Posterior Evaluation

While much of the literature on scalable GPs focuses on the space-time complexity of training, it is fast test-time (predictive) distributions that are crucial for applications in which the same model is evaluated many times, such as when optimizing an acquisition function in BO. GPyTorch makes use of structure-exploiting algebra and local interpolation for $\mathcal{O}(1)$ computations in querying the predictive distribution, and $\mathcal{O}(T)$ for a posterior sample at $T$ points, compared to the standard $\mathcal{O}(n^2)$ and $\mathcal{O}(T^3)$ computations (Pleiss et al., 2018). Figure 6 shows 10-40X speedups when using fast predictive covariance estimates over performing standard posterior inference in the setting from Section 8.1. Reported relative speedups grow more slowly on the GPU, whose cores do not saturate as quickly as on the CPU.

Together with the extreme speedups from batch evaluation, fast predictive distributions enable users to efficiently compute the acquisition function at a very large number (e.g., tens of thousands) of points in parallel. Such parallel prediction can be used as a simple and computationally efficient way of doing random search over the domain. The resulting candidate points can either be used directly in constructing a candidate set, or as initial conditions for further gradient-based optimization.

9 One-Shot Optimization of KG

At a basic level, BO aims to find solutions to a full stochastic dynamic programming problem over the remaining (possibly infinite) horizon — a problem that is generally intractable. Look-ahead acquisition functions are approximations to this problem that explicitly take into account the effect observations have on the model in subsequent optimization steps. Popular look-ahead acquisition functions include the Knowledge Gradient (KG) (Frazier et al., 2008), Predictive Entropy Search (Hernández-Lobato et al., 2014), Max-Value Entropy Search (Wang & Jegelka, 2017), as well as various heuristics (González et al., 2016b).

KG quantifies the expected increase in the maximum of the posterior mean $\max_{x} \mathbb{E}[g(f(x)) \mid \mathcal{D}]$ from obtaining the additional (random) data set $\mathcal{D}_X$. KG often shows improved BO performance relative to simpler acquisition functions such as EI (Scott et al., 2011), but in its traditional form it is computationally expensive and hard to implement, two caveats that we alleviate in this work.

A generalized variant of parallel KG (Wu & Frazier, 2016) is given by

$$\alpha_{\mathrm{KG}}(X) = \mathbb{E}\Bigl[ \max_{x' \in \mathbb{X}} \mathbb{E}\bigl[ g(f(x')) \mid \mathcal{D} \cup \mathcal{D}_X \bigr] \Bigr] - \mu^*, \tag{5}$$

where the inner expectation is taken under the posterior conditioned on $\mathcal{D}_X$, the (random) dataset observed at $X$, and $\mu^* := \max_{x \in \mathbb{X}} \mathbb{E}[g(f(x)) \mid \mathcal{D}]$. KG as in (5) quantifies the expected increase in the maximum posterior mean of $g \circ f$ after gathering samples at $X$. For simplicity, we only consider the standard BO problem, but multi-fidelity extensions (Poloczek et al., 2017; Wu et al., 2019) can also easily be used.

The standard approach for optimizing parallel KG (where $q > 1$) is to apply stochastic gradient ascent, with each gradient observation potentially being an average over multiple samples (Wu & Frazier, 2016; Wu et al., 2017). For each sampled fantasy dataset $\mathcal{D}_X^i$, the inner optimization problem $\max_{x' \in \mathbb{X}} \mathbb{E}[g(f(x')) \mid \mathcal{D} \cup \mathcal{D}_X^i]$ for the posterior mean is solved numerically, either via another stochastic gradient ascent (Wu et al., 2017) or via multi-start L-BFGS (Frazier, 2018). An unbiased stochastic gradient of KG can then be computed by leveraging the envelope theorem and the optimal inner points. Alternatively, the inner problem can be discretized and the gradient computation of Wu & Frazier (2016) can be used. We emphasize that these approaches require optimizing the inner and outer problems separately (in an alternating fashion). The associated computational expense can be quite large; our main insight is that these computations can also be unnecessary.

We propose to treat optimizing $\alpha_{\mathrm{KG}}(X)$ in (5) as an entirely deterministic optimization problem. We draw $N_f$ fixed base samples $\epsilon_i$ for the outer expectation, sample fantasy data $\mathcal{D}_X^i$, and construct associated fantasy models $f_i$. We then move the inner maximization outside of the sample average, resulting in the following optimization problem:

$$\max_{X \in \mathbb{X}^q,\; X' \in \mathbb{X}^{N_f}} \frac{1}{N_f} \sum_{i=1}^{N_f} \mathbb{E}\bigl[ g(f_i(x'_i)) \mid \mathcal{D} \cup \mathcal{D}_X^i \bigr], \tag{6}$$

with $X' := \{x'_1, \ldots, x'_{N_f}\}$ and $f_i$ the $i$-th fantasy model. If the inner expectation does not have an analytic expression, we also draw fixed inner base samples and use an MC approximation of the form (2). In either case we are left with a deterministic optimization problem. Conditional on the base samples, the maximization problems (5) and (6) are equivalent. The key difference from the envelope theorem approach is that we do not solve the inner optimization problem to completion for every fantasy point for every gradient step with respect to $X$. Instead, we solve (6) jointly over $X$ and the fantasy points $X'$. The resulting optimization problem is of higher dimension, namely $(q + N_f) \cdot d$ instead of $q \cdot d$, but unlike the envelope theorem formulation it can be solved as a single optimization problem, using methods for deterministic optimization. Consequently, we dub the KG variant utilizing this optimization strategy the "One-Shot Knowledge Gradient" (OKG). The ability to auto-differentiate w.r.t. $X$ and $X'$ allows BoTorch to solve this problem effectively.

As discussed in the previous section, the bias introduced by fixing base samples diminishes with the number of samples, and our empirical results in Section 10 show that OKG can use sufficiently few base samples to be computationally more efficient while at the same time improving closed-loop BO performance compared to the stochastic gradient approach implemented in other packages.

Code Example 4 shows a slightly simplified OKG implementation (for the full version see Appendix E.4). The fixed base samples are defined as part of the sampler (and possibly inner_sampler) modules. SimpleRegret computes the inner expectation from (6). By expecting forward's input X to be the concatenation of $X$ and the fantasy points $X'$, OKG can be optimized using the same APIs as all other acquisition functions (in doing so, we differentiate through fantasize).

    def forward(self, X: Tensor) -> Tensor:
        splits = [X.size(-2) - self.N_f, self.N_f]
        X, X_fantasies = torch.split(X, splits, dim=-2)
        if self.X_pending is not None:
            X_p = match_batch_shape(self.X_pending, X)
            X = torch.cat([X, X_p], dim=-2)
        fmodel = self.model.fantasize(
            X=X, sampler=self.sampler,
            observation_noise=True
        )
        inner_acqf = SimpleRegret(
            fmodel, self.inner_sampler, self.objective,
        )
        with settings.propagate_grads(True):
            values = inner_acqf(X_fantasies)
        return values.mean(dim=0)
Code Example 4 One-Shot Knowledge Gradient

10 Optimization Experiments

So far, we have shown how BoTorch provides an expressive framework for implementing and optimizing acquisition functions, considering experiments in model calibration (Section 6.1), noisy objectives (Section 6.2), and epidemiology (Section 6.3). In this section, we compare (i) the empirical performance of standard algorithms implemented in BoTorch against implementations in other popular BO libraries, and (ii) our novel acquisition function, OKG, against other acquisition functions, both within BoTorch and in other packages. Specifically, we compare BoTorch with GPyOpt, Cornell MOE (MOE EI and MOE KG), and the recent Dragonfly library. (We were unable to install GPFlowOpt due to its incompatibility with current versions of GPFlow/TensorFlow.) GPyOpt uses an extension of EI with a local penalization heuristic (henceforth GPyOpt LP-EI) for parallel optimization (González et al., 2016a). Dragonfly does not provide a modular API, so we consider its default ensemble heuristic (henceforth Dragonfly GP Bandit) (Kandasamy et al., 2019).

Our results provide three main takeaways. First, we find that BoTorch’s algorithms tend to achieve greater sample efficiency compared to those of other packages (all packages use their default models and settings). Second, we find that OKG often outperforms all other acquisition functions. Finally, the ability of BoTorch to exploit conjugate gradient methods and GPU acceleration makes OKG more computationally scalable than MOE KG.

10.1 Synthetic Test Functions

We consider BO for parallel optimization of $q$ design points on four noisy synthetic functions used in Wang et al. (2016a): Branin, Rosenbrock, Ackley, and Hartmann. We report results for Hartmann here; results for the other functions are qualitatively similar and are provided in Appendix C.1. Algorithms start from the same set of qMC-sampled initial points for each trial, the number of which scales with the dimension $d$ of the design space. We evaluate based on the true noiseless function value at the "suggested point" (i.e., the point that would be chosen if BO were to end at this batch). OKG, MOE KG, and NEI use "out-of-sample" suggestions, while the others use "in-sample" suggestions (Frazier, 2018). Figure 7 reports means and 95% confidence intervals over 100 trials. Empirical results for constrained BO using a differentiable relaxation of the feasibility indicator on the sample level are provided in Appendix C.2.

10.2 Computational Scaling

We measure the wall times incurred by OKG and MOE KG on Hartmann with $q$ parallel points. In the case of OKG, we compare Cholesky versus linear conjugate gradient (CG) solves, on both CPU and GPU. Figure 8 shows the computational scaling of BO over 10 trials. OKG achieves improved optimization performance over MOE KG (see Figure 7), while generally requiring less wall time. Although there is some constant overhead to utilizing the GPU, wall time scales very well to models with a large number of observations.

Figure 7: Hartmann, noisy, best suggested point
Figure 8: KG wall times

10.3 Hyperparameter Optimization

In this section, we illustrate the performance of BoTorch on real-world applications, represented by three hyperparameter optimization (HPO) experiments. As HPO typically involves long and resource-intensive training jobs, it is standard to select the configuration with the best observed performance, rather than to evaluate a "suggested" configuration (we cannot perform noiseless function evaluations).

10.3.1 DQN and Cartpole

We first consider the setting of reinforcement learning (RL), illustrating the case of tuning a deep Q-network (DQN) learning algorithm (Mnih et al., 2013, 2015) on the classic Cartpole task. Our experiment uses the Cartpole environment from OpenAI Gym (Brockman et al., 2016) and the default DQN agent implemented in Horizon (Gauci et al., 2018). We allow for a maximum of 60 training episodes or 2000 training steps, whichever occurs first. The five hyperparameters tuned by BO are the exploration parameter ("epsilon"), the target update rate, the discount factor, the learning rate, and the learning rate decay. To reduce noise, each "function evaluation" is taken to be an average of 10 independent training runs of DQN.

Figure 9 presents the optimization performance of various acquisition functions from the different packages, using 15 rounds of parallel evaluations of size $q$, over 100 trials. While in later iterations all algorithms achieve reasonable performance, BoTorch OKG, EI, NEI, and GPyOpt LP-EI show faster learning early on.

Figure 9: DQN tuning benchmark (Cartpole)
Figure 10: NN surrogate model, best observed accuracy

10.3.2 Neural Network Surrogate

As a representative of a standard neural network parameter tuning problem, we consider the neural network surrogate model (for the UCI Adult data set) introduced by Falkner et al. (2018), which is available as part of HPOlib2 (Eggensperger et al., 2019). This is a six-dimensional problem over network parameters (number of layers, units per layer) and training parameters (initial learning rate, batch size, dropout, exponential decay factor for learning rate). We use a surrogate model to achieve a high level of precision in comparing the performance of the algorithms without incurring excessive computational training costs.

Figure 10 shows optimization performance in terms of the best observed classification accuracy. Results are means and 95% confidence intervals computed from 200 trials with 75 iterations of size $q$. All BoTorch algorithms perform quite similarly here, with OKG doing slightly better in earlier iterations. Notably, they all achieve significantly better accuracy than the other algorithms.

10.3.3 Stochastic Weight Averaging for CIFAR-10

Our final example considers the recently proposed Stochastic Weight Averaging (SWA) procedure of Izmailov et al. (2018). This example is indicative of how BO is used in practice: since SWA is a relatively new optimization procedure, it is less well known which hyperparameters provide good performance. We consider the HPO problem on one of the CIFAR-10 experiments performed in Izmailov et al. (2018): 300 epochs of training on the VGG-16 architecture (Simonyan & Zisserman, 2014). Izmailov et al. (2018) report the mean and standard deviation of the test accuracy over three runs to be and , respectively, which corresponds to a 95% confidence interval of .

Due to the significant computational expense of running a large number of trials, we provide a limited comparison of OKG versus random search, illustrating BoTorch’s ability to provide value in a computationally expensive real-world problem. We tune three SWA hyperparameters: learning rate, update frequency, and starting iteration. After 25 trials of BO with 8 initial points and 5 iterations of size , OKG achieved an average accuracy of , while random search resulted in . Full learning curves are provided in Appendix C.3.

11 Discussion and Outlook

BoTorch's modular design and flexible API — in conjunction with algorithms specifically designed to exploit parallelization, auto-differentiation, and modern computing — provide a modern programming framework for Bayesian optimization and active learning. This framework is particularly valuable in helping developers rapidly assemble novel acquisition functions. Specifically, the basic MC acquisition function abstraction provides generic support for batch optimization, asynchronous evaluation, qMC integration, and composite objectives (including outcome constraints).

We presented a strategy for effectively optimizing MC acquisition functions using deterministic optimization. As an example, we developed an extension of this approach for “one-shot” optimization of look-ahead acquisition functions, specifically of OKG. This feature itself constitutes a significant development of KG, allowing for generic composite objectives and outcome constraints.

Our empirical results show that besides increased flexibility, these advancements in both methodology and computational efficiency translate into significantly faster and more accurate closed-loop optimization performance on a range of standard problems. While we do not specifically consider other settings such as high-dimensional BO (Kandasamy et al., 2015; Wang et al., 2016b), these approaches can be readily implemented in BoTorch.

The new BoTorch procedures for optimizing acquisition functions, designed to exploit fast parallel function and gradient evaluations, help make it possible to significantly broaden the applicability of Bayesian optimization procedures in future work.

For example, if one can condition on gradient observations of the objective, then it may be possible to apply Bayesian optimization where traditional gradient-based optimizers are used — but with faster convergence and robustness to multimodality. These settings could include higher-dimensional objectives, and objectives that are inexpensive to query. While acquisition functions have been developed to effectively leverage gradient observations (Wu et al., 2017), the addition of gradients incurs a significant computational burden. This burden can be eased by the scalable inference methods and hardware acceleration available in BoTorch.

One can also naturally generalize Bayesian optimization procedures to incorporate neural architectures in BoTorch. In particular, deep kernel architectures (Wilson et al., 2016), deep Gaussian processes (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017), and variational auto-encoders (Gómez-Bombarelli et al., 2018; Moriconi et al., 2019) are supported in BoTorch, and can be used for more expressive kernels in high-dimensions.

In short, BoTorch provides the research community with a robust and extensible basis for implementing new ideas and algorithms using modern computational paradigms.

References

  • Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.
  • Agarwal et al. (2018) Agarwal, D., Basu, K., Ghosh, S., Xuan, Y., Yang, Y., and Zhang, L. Online parameter selection for web-based ranking problems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pp. 23–32, New York, NY, USA, 2018. ACM.
  • Antonova et al. (2017) Antonova, R., Rai, A., and Atkeson, C. G. Deep kernels for optimizing locomotion controllers. In Proceedings of the 1st Conference on Robot Learning, CoRL, 2017.
  • Astudillo & Frazier (2019) Astudillo, R. and Frazier, P. Bayesian optimization of composite functions. Forthcoming, in Proceedings of the 35th International Conference on Machine Learning, 2019.
  • Bingham et al. (2018) Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research, 2018.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Buchholz et al. (2018) Buchholz, A., Wenzel, F., and Mandt, S. Quasi-Monte Carlo variational inference. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2018.
  • Caflisch (1998) Caflisch, R. E. Monte carlo and quasi-monte carlo methods. Acta numerica, 7:1–49, 1998.
  • Calandra et al. (2016) Calandra, R., Seyfarth, A., Peters, J., and Deisenroth, M. P. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 2016.
  • Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
  • Chen & Zhou (2014) Chen, X. and Zhou, Q. Sequential experimental designs for stochastic kriging. In Proceedings of the 2014 Winter Simulation Conference, WSC ’14, pp. 3821–3832, Piscataway, NJ, USA, 2014. IEEE Press.
  • Cutajar et al. (2019) Cutajar, K., Pullin, M., Damianou, A., Lawrence, N., and González, J. Deep gaussian processes for multi-fidelity modeling. arXiv preprint arXiv:1903.07320, 2019.
  • Damianou & Lawrence (2013) Damianou, A. and Lawrence, N. Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.
  • Eggensperger et al. (2019) Eggensperger, K., Feurer, M., Klein, A., and Falkner, S. Hpolib2 (development branch), 2019. URL https://github.com/automl/HPOlib2.
  • Falkner et al. (2018) Falkner, S., Klein, A., and Hutter, F. BOHB: robust and efficient hyperparameter optimization at scale. CoRR, abs/1807.01774, 2018.
  • Feurer et al. (2015) Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2962–2970. Curran Associates, Inc., 2015.
  • Frazier (2018) Frazier, P. I. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • Frazier et al. (2008) Frazier, P. I., Powell, W. B., and Dayanik, S. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
  • Gardner et al. (2014) Gardner, J., Kusner, M., Zhixiang, Weinberger, K., and Cunningham, J. Bayesian optimization with inequality constraints. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 937–945, Beijing, China, 22–24 Jun 2014. PMLR.
  • Gardner et al. (2018) Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Advances in Neural Information Processing Systems, pp. 7576–7586, 2018.
  • Gauci et al. (2018) Gauci, J., Conti, E., Liang, Y., Virochsiri, K., He, Y., Kaden, Z., Narayanan, V., and Ye, X. Horizon: Facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260, 2018.
  • Gelbart et al. (2014) Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, UAI, 2014.
  • Gómez-Bombarelli et al. (2018) Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 02 2018.
  • González et al. (2016a) González, J., Dai, Z., Hennig, P., and Lawrence, N. D. Batch bayesian optimization via local penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 648–657, 2016a.
  • González et al. (2016b) González, J., Osborne, M., and Lawrence, N. D. GLASSES: Relieving the myopia of bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS, pp. 790–799, 2016b.
  • GPy (since 2012) GPy. GPy: A gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
  • Griffiths & Hernández-Lobato (2017) Griffiths, R.-R. and Hernández-Lobato, J. M. Constrained bayesian optimization for automatic chemical design. arXiv preprint arXiv:1709.05501, 2017.
  • Hansen & Ostermeier (2001) Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evol. Comput., 9(2):159–195, June 2001.
  • Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pp. 918–926, Cambridge, MA, USA, 2014. MIT Press.
  • Hernández-Lobato et al. (2015) Hernández-Lobato, J. M., Gelbart, M. A., Hoffman, M. W., Adams, R. P., and Ghahramani, Z. Predictive entropy search for bayesian optimization with unknown constraints. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015.
  • Izmailov et al. (2018) Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.
  • Izmailov et al. (2019) Izmailov, P., Maddox, W., Garipov, T., Kirichenko, P., Vetrov, D., and Wilson, A. G. Subspace inference for Bayesian deep learning. In Uncertainty in Artificial Intelligence, 2019.
  • Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • Jones et al. (1993) Jones, D. R., Perttunen, C. D., and Stuckman, B. E. Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, Oct 1993.
  • Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
  • Kandasamy et al. (2015) Kandasamy, K., Schneider, J., and Póczos, B. High dimensional bayesian optimisation and bandits via additive models. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015.
  • Kandasamy et al. (2016) Kandasamy, K., Dasarathy, G., Oliva, J., Schneider, J., and Póczos, B. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, NIPS, 2016.
  • Kandasamy et al. (2019) Kandasamy, K., Raju Vysyaraju, K., Neiswanger, W., Paria, B., Collins, C. R., Schneider, J., Póczos, B., and Xing, E. P. Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly. arXiv e-prints, art. arXiv:1903.06694, Mar 2019.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv e-prints, pp. arXiv:1312.6114, Dec 2013.
  • Klein et al. (2016) Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. Fast bayesian optimization of machine learning hyperparameters on large datasets. CoRR, 2016.
  • Klein et al. (2017) Klein, A., Falkner, S., Mansur, N., and Hutter, F. Robo: A flexible and robust bayesian optimization framework in python. In NIPS 2017 Bayesian Optimization Workshop, December 2017.
  • Knudde et al. (2017) Knudde, N., van der Herten, J., Dhaene, T., and Couckuyt, I. GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint arXiv:1711.03845, 2017.
  • Letham & Bakshy (2019) Letham, B. and Bakshy, E. Bayesian optimization for policy search via online-offline experimentation. arXiv preprint arXiv:1904.01049, 2019.
  • Letham et al. (2019) Letham, B., Karrer, B., Ottoni, G., and Bakshy, E. Constrained bayesian optimization with noisy experiments. Bayesian Analysis, 14(2):495–519, 06 2019. doi: 10.1214/18-BA1110.
  • Li et al. (2017) Li, C., de Celis Leal, D. R., Rana, S., Gupta, S., Sutti, A., Greenhill, S., Slezak, T., Height, M., and Venkatesh, S. Rapid bayesian optimisation for synthesis of short polymer fiber materials. Scientific reports, 7(1):5683, 2017.
  • Maddox et al. (2019) Maddox, W., Garipov, T., Izmailov, P., Vetrov, D., and Wilson, A. G. A simple baseline for Bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, 2019.
  • Malaria Atlas Project (2019) Malaria Atlas Project, 2019. URL https://map.ox.ac.uk/malaria-burden-data-download.
  • Matthews et al. (2017) Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, apr 2017.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Moriconi et al. (2019) Moriconi, R., Sesh Kumar, K. S., and Deisenroth, M. P. High-Dimensional Bayesian Optimization with Manifold Gaussian Processes. arXiv e-prints, pp. arXiv:1902.10675, Feb 2019.
  • Neal (1996) Neal, R. M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 1996.
  • Neiswanger et al. (2019) Neiswanger, W., Kandasamy, K., Póczos, B., Schneider, J., and Xing, E. ProBO: a Framework for Using Probabilistic Programming in Bayesian Optimization. arXiv e-prints, pp. arXiv:1901.11515, January 2019.
  • Nguyen et al. (2017) Nguyen, V., Gupta, S., Rana, S., Li, C., and Venkatesh, S. Predictive variance reduction search. In NIPS 2017 Workshop on Bayesian Optimization, Dec 2017.
  • O’Hagan (1978) O’Hagan, A. On curve fitting and optimal design for regression. J. Royal Stat. Soc. B, 40:1–32, 1978.
  • Osborne (2010) Osborne, M. A. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. PhD thesis, Oxford University, UK, 2010.
  • Owen (2003) Owen, A. B. Quasi-monte carlo sampling. Monte Carlo Ray Tracing: Siggraph, 1:69–88, 2003.
  • Paria et al. (2018) Paria, B., Kandasamy, K., and Póczos, B. A Flexible Multi-Objective Bayesian Optimization Approach using Random Scalarizations. ArXiv e-prints, May 2018.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.
  • Pleiss et al. (2018) Pleiss, G., Gardner, J. R., Weinberger, K. Q., and Wilson, A. G. Constant-time predictive distributions for gaussian processes. In International Conference on Machine Learning, 2018.
  • Poloczek et al. (2017) Poloczek, M., Wang, J., and Frazier, P. Multi-information source optimization. In Advances in Neural Information Processing Systems, pp. 4288–4298, 2017.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML'14, pp. II-1278–II-1286. JMLR.org, 2014.
  • Rowland et al. (2018) Rowland, M., Choromanski, K. M., Chalus, F., Pacchiano, A., Sarlos, T., Turner, R. E., and Weller, A. Geometrically coupled monte carlo sampling. In Advances in Neural Information Processing Systems 31, pp. 195–206. Curran Associates, Inc., 2018.
  • Saatci & Wilson (2017) Saatci, Y. and Wilson, A. G. Bayesian GAN. In Advances in neural information processing systems, pp. 3622–3631, 2017.
  • Salimbeni & Deisenroth (2017) Salimbeni, H. and Deisenroth, M. Doubly stochastic variational inference for deep gaussian processes. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 4588–4599. Curran Associates, Inc., 2017.
  • Schonlau et al. (1998) Schonlau, M., Welch, W. J., and Jones, D. R. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, 34:11–25, 1998.
  • Scott et al. (2011) Scott, W., Frazier, P., and Powell, W. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal of Optimization, 21:996–1026, 2011.
  • Seo et al. (2000) Seo, S., Wallat, M., Graepel, T., and Obermayer, K. Gaussian process regression: active data selection and test point rejection. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, volume 3, pp. 241–246 vol.3, July 2000.
  • Settles (2009) Settles, B. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
  • Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:1–28, 2016.
  • Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
  • Snoek et al. (2014) Snoek, J., Swersky, K., Zemel, R., and Adams, R. P. Input warping for bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning, ICML’14, 2014.
  • Springenberg et al. (2016) Springenberg, J. T., Klein, A., S.Falkner, and Hutter, F. Bayesian optimization with robust bayesian neural networks. In Advances in Neural Information Processing Systems 29, December 2016.
  • The Emukit authors (2018) The Emukit authors. Emukit: Emulation and uncertainty quantification for decision making. https://github.com/amzn/emukit, 2018.
  • The GPyOpt authors (2016) The GPyOpt authors. GPyOpt: A bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
  • Tran et al. (2017) Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., and Blei, D. M. Deep probabilistic programming. Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Wang et al. (2016a) Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. Parallel bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016a.
  • Wang & Jegelka (2017) Wang, Z. and Jegelka, S. Max-value Entropy Search for Efficient Bayesian Optimization. ArXiv e-prints, pp. arXiv:1703.01968, March 2017.
  • Wang et al. (2016b) Wang, Z., Hutter, F., Zoghi, M., Matheson, D., and De Freitas, N. Bayesian optimization in a billion dimensions via random embeddings. J. Artif. Int. Res., 55(1):361–387, January 2016b.
  • Wilson & Nickisch (2015) Wilson, A. and Nickisch, H. Kernel interpolation for scalable structured gaussian processes (kiss-gp). In International Conference on Machine Learning, pp. 1775–1784, 2015.
  • Wilson et al. (2016) Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370–378, 2016.
  • Wilson et al. (2018) Wilson, J., Hutter, F., and Deisenroth, M. Maximizing acquisition functions for bayesian optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 9905–9916. Curran Associates, Inc., 2018.
  • Wilson et al. (2017) Wilson, J. T., Moriconi, R., Hutter, F., and Deisenroth, M. P. The reparameterization trick for acquisition functions. ArXiv e-prints, December 2017.
  • Wu & Frazier (2016) Wu, J. and Frazier, P. The parallel knowledge gradient method for batch bayesian optimization. In Advances in Neural Information Processing Systems, pp. 3126–3134, 2016.
  • Wu et al. (2017) Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. I. Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pp. 5267–5278, 2017.
  • Wu et al. (2019) Wu, J., Toscano-Palmerin, S., Frazier, P. I., and Wilson, A. G. Practical multi-fidelity bayesian optimization for hyperparameter tuning. CoRR, abs/1903.04703, 2019.

Appendix A Review of other BO packages

One of the earliest commonly-used packages is Spearmint (Snoek et al., 2012), which implements a variety of modeling techniques such as MCMC hyperparameter sampling and input warping (Snoek et al., 2014). Spearmint also supports parallel optimization via fantasies, and constrained optimization with the expected improvement and predictive entropy search acquisition functions (Gelbart et al., 2014; Hernández-Lobato et al., 2015). Spearmint was among the first libraries to make BO easily accessible to the end user.

GPyOpt (The GPyOpt authors, 2016) builds on the popular GP regression framework GPy (GPy, since 2012). It supports a similar set of features as Spearmint, along with a local penalization-based approach for parallel optimization (González et al., 2016a). It also provides the ability to customize different components through an alternative, more modular API.

Cornell-MOE (Wu & Frazier, 2016) implements the Knowledge Gradient (KG) acquisition function, which allows for parallel optimization, and includes recent advances such as large-scale models incorporating gradient evaluations (Wu et al., 2017) and multi-fidelity optimization (Wu et al., 2019). Its core is implemented in C++, which provides performance benefits but renders it hard to modify and extend.

RoBO (Klein et al., 2017) implements a collection of models and acquisition functions, including Bayesian neural nets (Springenberg et al., 2016) and multi-fidelity optimization (Klein et al., 2016).

Emukit (The Emukit authors, 2018) is a Bayesian optimization and active learning toolkit with a collection of acquisition functions, including ones for parallel and multi-fidelity optimization. It does not provide specific abstractions for implementing new algorithms, but rather specifies a model API that allows it to be used with the other toolkit components.

The recent Dragonfly (Kandasamy et al., 2019) library supports parallel optimization, multi-fidelity optimization (Kandasamy et al., 2016), and high-dimensional optimization with additive kernels (Kandasamy et al., 2015). It takes an ensemble approach and aims to work out-of-the-box across a wide range of problems, a design choice that makes it relatively hard to extend.

Appendix B Optimization with Fixed Samples

qMC methods have been used in other applications in machine learning, including variational inference (Buchholz et al., 2018) and evolutionary strategies (Rowland et al., 2018), but rarely in BO. Letham et al. (2019) use qMC in the context of a specific acquisition function. BoTorch’s abstractions make it straightforward (and mostly automatic) to use qMC integration with any acquisition function.

Fixing the base samples introduces a consistent bias in the function approximation. While i.i.d. re-sampling in each evaluation ensures that the MC estimates of the acquisition value at two candidate sets are conditionally independent given the model, this no longer holds once the base samples are fixed.


Figure 11: MC and qMC acquisition functions, with and without re-drawing the base samples between evaluations. The model is a GP fit on 15 points randomly sampled from $[0, 1]^6$ and evaluated on the (negative) Hartmann6 test function. The acquisition functions are evaluated along a one-dimensional slice $x(\lambda)$ of the domain.

Figure 11 illustrates this behavior for EI (we consider the simple case $q = 1$, for which an analytic solution is available). The top row shows the MC and qMC versions, respectively, when re-drawing base samples for every evaluation. The solid lines correspond to a single realization, and the shaded region covers four standard deviations around the mean, estimated across 50 evaluations. It is evident that qMC sampling significantly reduces the variance of the estimate. The bottom row shows the same functions for 10 different realizations of fixed base samples. Each of these realizations is differentiable w.r.t. $x$ (and hence w.r.t. $\lambda$ in the slice parameterization). In expectation (over the base samples), this function coincides with the true function (the dashed black line). Conditional on the base sample draw, however, the estimate displays a consistent bias. The variance of this bias is much smaller for the qMC versions.
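To make this concrete, consider the following minimal sketch (plain PyTorch, not BoTorch internals; all names are illustrative) of a qMC estimate of q-EI computed from base samples that are drawn once and then held fixed, assuming the posterior over the q candidates is Gaussian with given mean and Cholesky factor:

    import torch
    from torch.quasirandom import SobolEngine

    def draw_base_samples(n: int, q: int, seed: int = 0) -> torch.Tensor:
        """Draw N(0, 1) base samples via scrambled Sobol + inverse Gaussian cdf."""
        u = SobolEngine(dimension=q, scramble=True, seed=seed).draw(n)
        u = u.clamp(1e-6, 1 - 1e-6)  # keep the inverse cdf finite
        return torch.distributions.Normal(0.0, 1.0).icdf(u)

    def qei_fixed(mean, chol, base_samples, best_f):
        """qMC estimate of q-EI under the reparameterization y = mu + L z.

        With `base_samples` fixed, this is a deterministic, differentiable
        function of `mean` and `chol` (and hence of the candidate set).
        """
        y = mean + base_samples @ chol.transpose(-1, -2)  # n x q posterior draws
        return (y.max(dim=-1).values - best_f).clamp_min(0.0).mean(dim=0)

Re-drawing the base samples inside qei_fixed would instead yield the unbiased but noisy estimator shown in the top row of Figure 11.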


Figure 12: Performance for optimizing qMC-based EI. Solid lines: fixed base samples, optimized via L-BFGS-B. Dashed lines: re-sampling base samples, optimized via Adam.

The reason for this is that while the function values may show noticeable bias, the bias of the maximizer is typically very small. Figure 12 illustrates this behavior, showing empirical cdfs of the relative gap in acquisition value and of the distance between maximizers over 250 optimization runs for different numbers of MC samples, comparing the maximizer of the analytic EI with that of its qMC approximation. The quality of the solution of the deterministic problem is excellent even for relatively small sample sizes, and generally better than that of the stochastic optimizer.

A somewhat subtle point is that whether better optimization of the acquisition function results in improved closed-loop BO performance depends on both the acquisition function and the underlying problem. More exploitative acquisition functions, such as EI, tend to show worse performance on problems with high noise levels. In these settings, not solving the EI maximization exactly adds randomness and thus induces additional exploration, which can improve closed-loop performance. While a general discussion of this point is outside the scope of this paper, BoTorch provides a framework for optimizing acquisition functions well, so that acquisition function performance can be investigated independently of the quality of the optimization.

Perhaps the most significant advantage of using deterministic optimization algorithms is that, unlike algorithms such as SGD that require tuning the learning rate, the optimization procedure is essentially hyperparameter-free. Figure 13 shows the closed-loop optimization performance of qEI for both deterministic and stochastic optimization for different optimizers and learning rates. While some of the stochastic variants (e.g. Adam with learning rate 0.01) achieve performance similar to the deterministic optimization, the type of optimizer and the learning rate matter. In fact, the rank order of SGD and Adam w.r.t. the learning rate is reversed, illustrating that selecting the right hyperparameters for the optimizer is itself a non-trivial problem.
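Given a deterministic acquisition surrogate, hooking it up to a quasi-second order method is straightforward. The following hedged sketch shows the generic pattern using SciPy's L-BFGS-B with gradients from PyTorch's autograd; acqf stands in for any deterministic, differentiable acquisition callable and is an assumption of the sketch:

    import numpy as np
    import torch
    from scipy.optimize import minimize

    def maximize_acqf_lbfgs(acqf, x0: np.ndarray, bounds):
        """Maximize a deterministic acquisition function via L-BFGS-B.

        acqf: maps a torch.Tensor of shape (d,) to a scalar tensor and is
            differentiable w.r.t. its input (e.g., an MC acquisition function
            with fixed base samples).
        bounds: sequence of (lower, upper) pairs, one per dimension.
        """
        def neg_value_and_grad(x: np.ndarray):
            x_t = torch.tensor(x, dtype=torch.double, requires_grad=True)
            loss = -acqf(x_t)  # negate, since scipy minimizes
            loss.backward()
            return loss.item(), x_t.grad.numpy()

        res = minimize(neg_value_and_grad, x0, jac=True,
                       method="L-BFGS-B", bounds=bounds)
        return res.x, -res.fun

In a multi-start setting, this routine is simply run from each of the initial conditions produced by the heuristic of Appendix D, and the best of the final solutions is returned.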

Appendix C Additional Empirical Results

C.1 Synthetic Functions

All functions are evaluated with additive Gaussian observation noise. Figures 14–16 give the results for all synthetic functions from Section 10.1. The results show that BoTorch's NEI and OKG acquisition functions provide highly competitive performance in all cases.

Figure 13: Stochastic/deterministic opt. of EI on Hartmann6
Figure 14: Branin ()
Figure 15: Rosenbrock ()
Figure 16: Ackley ()

C.2 Constrained Bayesian Optimization

We present results for constrained BO on a synthetic function. We consider a multi-output function $f = (f_1, f_2)$ and the optimization problem

    $\max_{x \in \mathcal{X}} \; f_1(x) \quad \text{s.t.} \quad f_2(x) \leq 0.$    (7)

Both $f_1$ and $f_2$ are observed with noise, and we model the two components using independent GP models. A constraint-weighted composite objective is used in each of the BoTorch acquisition functions EI, NEI, and OKG.
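Such a constraint-weighted objective can be expressed with BoTorch's ConstrainedMCObjective. A minimal sketch, assuming the objective and constraint outcomes are stacked in the last dimension of the posterior samples (objective first) and that the constraint is satisfied when non-positive; exact callable signatures may vary across BoTorch versions:

    from botorch.acquisition.objective import ConstrainedMCObjective

    # Outcome 0 is the objective f_1, outcome 1 is the constraint f_2
    # (feasible iff f_2(x) <= 0). The sampled objective values are
    # down-weighted by a smooth approximation of the feasibility indicator.
    constrained_obj = ConstrainedMCObjective(
        objective=lambda Z: Z[..., 0],
        constraints=[lambda Z: Z[..., 1]],
    )

This objective can then be passed unchanged to EI, NEI, and OKG.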

Results for the case of a Hartmann6 objective and two types of constraints are given in Figures 17 and 18 (we only show results for BoTorch's algorithms, since the other packages do not natively support optimization subject to unknown constraints).

The regret values are computed using a feasibility-weighted objective, where "infeasible" is assigned an objective value of zero. For random search and EI, the suggested point is taken to be the best feasible noisily observed point; for NEI and OKG, we use out-of-sample suggestions obtained by optimizing the feasibility-weighted version of the posterior mean. The results displayed in Figure 18 are for the constrained Hartmann6 benchmark from Letham et al. (2019). Note, however, that the results here are not directly comparable to the figures in Letham et al. (2019), because (1) we use feasibility-weighted objectives to compute regret, and (2) they follow a different convention for suggested points. We emphasize that our treatment of outcome constraints for KG has not previously appeared in the literature.

Figure 17: Constrained Hartmann6,
Figure 18: Constrained Hartmann6,

C.3 Stochastic Weight Averaging on CIFAR-10

Figure 19 shows the closed-loop performance of OKG versus random search on the SWA benchmark described in Section 10.3.3.


Figure 19: SWA algorithm on CIFAR10, best observed accuracy

Appendix D Batch Initialization for Multi-Start Optimization

For most acquisition functions, the optimization surface is highly non-convex, multi-modal, and (especially for "improvement-based" ones such as EI or KG) often flat (i.e., has zero gradient) in much of the domain $\mathcal{X}$. Therefore, optimizing the acquisition function is itself a challenging problem.

The simplest approach is to use zeroth-order optimizers that do not require gradient information, such as DIRECT or CMA-ES (Jones et al., 1993; Hansen & Ostermeier, 2001). These approaches are feasible for lower-dimensional problems, but do not scale to higher dimensions. Note that performing parallel optimization over $q$ candidates in a $d$-dimensional feature space means solving a $qd$-dimensional optimization problem; optimizing $q = 8$ candidates over a $d = 6$ domain, for instance, is a 48-dimensional problem.

A more scalable approach incorporates gradient information into the optimization. As described in Section 7, BoTorch by default uses quasi-second order methods, such as L-BFGS-B. Because of the complex structure of the objective, the initial conditions for the algorithm are extremely important in order to avoid getting stuck in a potentially highly sub-optimal local optimum. To reduce this risk, one typically employs multi-start optimization (i.e., starts the solver from multiple initial conditions and picks the best of the final solutions). To generate a good set of initial conditions, BoTorch heavily exploits the fast batch evaluation discussed in the previous section. Specifically, BoTorch by default generates initial candidates using the following heuristic:

  1. Sample quasi-random $q$-tuples of points $X^{(i)}$ from the domain $\mathcal{X}$ using quasi-random Sobol sequences (see Section 4.4).

  2. Batch-evaluate the acquisition function at these candidate sets: $v^{(i)} = \alpha(X^{(i)})$.

  3. Sample initial candidate sets according to a weight vector $w$ with $w_i \propto \exp(\eta \, \tilde{v}^{(i)})$, where $\tilde{v}^{(i)}$ denotes the standardized acquisition value and $\eta$ is a temperature parameter. (Acquisition functions that are known to be flat in large parts of $\mathcal{X}$ are handled with additional care in order to avoid starting in locations with zero gradients.)

Sampling initial conditions this way achieves an exploration/exploitation trade-off controlled by the magnitude of $\eta$: as $\eta \to 0$ we recover uniform Sobol sampling, while $\eta \to \infty$ means the initialization is chosen in a purely greedy fashion. The latter is generally not advisable, since the highest-valued candidate sets are likely to be clustered together, which would run counter to the goal of multi-start optimization. Fast batch evaluation allows evaluating a large number of samples (sample counts in the tens of thousands are feasible even for moderately sized models).
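The core of this heuristic takes only a few lines of PyTorch. The following is an illustrative re-implementation under the assumptions above (acq_func is any batch-evaluable acquisition callable), not BoTorch's internal code:

    import torch

    def boltzmann_init(acq_func, bounds, num_restarts, raw_samples, q, eta=1.0):
        """Heuristic initial conditions for multi-start acquisition optimization.

        bounds: 2 x d tensor of lower and upper bounds on the domain.
        """
        d = bounds.size(-1)
        # 1. Quasi-random q-tuples of points from the domain via Sobol sequences.
        sobol = torch.quasirandom.SobolEngine(dimension=q * d, scramble=True)
        u = sobol.draw(raw_samples).to(bounds).view(raw_samples, q, d)
        X_raw = bounds[0] + (bounds[1] - bounds[0]) * u
        # 2. Batch-evaluate the acquisition function at all candidate sets at once.
        with torch.no_grad():
            v = acq_func(X_raw)  # shape: (raw_samples,)
        # 3. Sample restart points with weights ~ exp(eta * standardized value).
        v_std = (v - v.mean()) / v.std().clamp_min(1e-9)
        weights = torch.exp(eta * v_std)
        idx = torch.multinomial(weights, num_restarts, replacement=False)
        return X_raw[idx]

For acquisition functions with large flat regions, one would additionally filter out candidate sets with (near-)zero acquisition value before sampling, as noted above.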

Appendix E Examples

E.1 Composite Objectives

Figure 20 shows results for the case of parallel optimization with $q > 1$. We find similar results as in the sequential ($q = 1$) case in Figure 3. While EI-CF performance is similar in the sequential and parallel settings, KG-CF reaches lower regret significantly faster in the parallel setting, suggesting that "looking ahead" is beneficial in this context.


Figure 20: Composite function results,

E.2 Active Learning

Figure 21 shows the locations of the base grid in comparison to the samples obtained by minimizing IPV.


Figure 21: Locations for 2017 samples from IPV minimization and the base grid.

E.3 Generalized UCB

Code Example 5 presents a generalized version of parallel UCB from Wilson et al. (2017) supporting pending candidates, generic objectives, and qMC sampling. If no sampler is specified, a default qMC sampler is used. Similarly, if no objective is specified, the identity objective is assumed.

class qUpperConfidenceBound(MCAcquisitionFunction):  # class line restored; name assumed, mirroring BoTorch's built-in variant
    def __init__(
        self,
        model: Model,
        beta: float,
        sampler: Optional[MCSampler] = None,
        objective: Optional[MCAcquisitionObjective] = None,
        X_pending: Optional[Tensor] = None,
    ) -> None:
        super().__init__(model, sampler, objective, X_pending)
        self.beta_prime = math.sqrt(beta * math.pi / 2)
    @concatenate_pending_points
    @t_batch_mode_transform()
    def forward(self, X: Tensor) -> Tensor:
        posterior = self.model.posterior(X)
        samples = self.sampler(posterior)
        obj = self.objective(samples)
        mean = obj.mean(dim=0)
        z = mean + self.beta_prime * (obj - mean).abs()
        return z.max(dim=-1)[0].mean(dim=0)
Code Example 5 Generalized Parallel UCB
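For completeness, a hypothetical usage sketch of the class above (here assumed to be named qUpperConfidenceBound, mirroring BoTorch's built-in variant) on a SingleTaskGP; the toy data and all settings are illustrative:

    import torch
    from botorch.models import SingleTaskGP
    from botorch.fit import fit_gpytorch_model
    from botorch.optim import optimize_acqf
    from gpytorch.mlls import ExactMarginalLogLikelihood

    # Fit a GP to toy training data.
    train_X = torch.rand(20, 3, dtype=torch.double)
    train_Y = (train_X ** 2).sum(dim=-1, keepdim=True)
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_model(ExactMarginalLogLikelihood(model.likelihood, model))

    # Instantiate the generalized parallel UCB (defaults: qMC sampler,
    # identity objective) and optimize it over a q = 2 candidate set.
    acqf = qUpperConfidenceBound(model=model, beta=0.2)
    bounds = torch.stack([torch.zeros(3), torch.ones(3)]).to(train_X)
    candidates, value = optimize_acqf(
        acqf, bounds=bounds, q=2, num_restarts=10, raw_samples=512,
    )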

E.4 Full Code Examples

In this section we provide full implementations for the code examples. Specifically, we include parallel Noisy EI (Code Example 6), OKG (Code Example 7), and (negative) Integrated Posterior Variance (Code Example 8).

class qNoisyExpectedImprovement(MCAcquisitionFunction):  # class line restored; name assumed, matching BoTorch
    def __init__(
        self,
        model: Model,
        X_baseline: Tensor,
        sampler: Optional[MCSampler] = None,
        objective: Optional[MCAcquisitionObjective] = None,
        X_pending: Optional[Tensor] = None,
    ) -> None:
        super().__init__(model, sampler, objective, X_pending)
        self.register_buffer("X_baseline", X_baseline)
    @concatenate_pending_points
    @t_batch_mode_transform()
    def forward(self, X: Tensor) -> Tensor:
        q = X.shape[-2]
        X_bl = match_batch_shape(self.X_baseline, X)
        X_full = torch.cat([X, X_bl], dim=-2)
        posterior = self.model.posterior(X_full)
        samples = self.sampler(posterior)
        obj = self.objective(samples)
        obj_n = obj[...,:q].max(dim=-1)[0]
        obj_p = obj[...,q:].max(dim=-1)[0]
        return (obj_n - obj_p).clamp_min(0).mean(dim=0)
Code Example 6 Parallel Noisy EI (full)
class qKnowledgeGradient(MCAcquisitionFunction):  # class line restored; name assumed, matching BoTorch
    def __init__(
        self,
        model: Model,
        sampler: MCSampler,
        objective: Optional[Objective] = None,
        inner_sampler: Optional[MCSampler] = None,
        X_pending: Optional[Tensor] = None,
    ) -> None:
        super().__init__(
            model, sampler, objective, X_pending,
        )
        self.inner_sampler = inner_sampler
        # number of fantasy samples drawn by the outer sampler (assumed helper)
        self.N_f = sampler.sample_shape.numel()
    def forward(self, X: Tensor) -> Tensor:
        splits = [X.size(-2) - self.N_f, self.N_f]
        X, X_fantasies = torch.split(X, splits, dim=-2)
        [...] # some shaping for batch eval purposes
        if self.X_pending is not None:
            X_p = match_batch_shape(self.X_pending, X)
            X = torch.cat([X, X_p], dim=-2)
        fmodel = self.model.fantasize(
            X=X, sampler=self.sampler,
            observation_noise=True
        )
        obj = self.objective
        if isinstance(obj, MCAcquisitionObjective):
            inner_acqf = SimpleRegret(
                fmodel, self.inner_sampler, obj,
            )
        else:
            inner_acqf = PosteriorMean(fmodel, obj)
        with settings.propagate_grads(True):
            values = inner_acqf(X_fantasies)
        return values.mean(dim=0)
Code Example 7 One-Shot Knowledge Gradient (full)
class qNegIntegratedPosteriorVariance(  # class line restored; name assumed, matching BoTorch's active-learning module
    AnalyticAcquisitionFunction
):
    def __init__(
        self,
        model: Model,
        mc_points: Tensor,
        X_pending: Optional[Tensor] = None,
    ) -> None:
        super().__init__(model=model)
        self._dummy_sampler = IIDNormalSampler(1)
        self.X_pending = X_pending
        self.register_buffer("mc_points", mc_points)
    @concatenate_pending_points
    @t_batch_mode_transform()
    def forward(self, X: Tensor) -> Tensor:
        fant_model = self.model.fantasize(
            X=X, sampler=self._dummy_sampler,
            observation_noise=True
        )
        sz = [1] * len(X.shape[:-2]) + [-1, X.size(-1)]
        mc_points = self.mc_points.view(*sz)
        with settings.propagate_grads(True):
            posterior = fant_model.posterior(mc_points)
        ivar = posterior.variance.mean(dim=-2)
        return -ivar.view(X.shape[:-2])
Code Example 8 Active Learning (full)
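A matching hypothetical usage sketch for the active-learning example (class name assumed as reconstructed above), with the qMC integration grid over the domain supplied via mc_points:

    import torch
    from botorch.models import SingleTaskGP
    from botorch.fit import fit_gpytorch_model
    from botorch.optim import optimize_acqf
    from gpytorch.mlls import ExactMarginalLogLikelihood

    train_X = torch.rand(20, 2, dtype=torch.double)
    train_Y = train_X.sum(dim=-1, keepdim=True)
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_model(ExactMarginalLogLikelihood(model.likelihood, model))

    # Integration grid for the posterior variance: a Sobol sample of the domain.
    sobol = torch.quasirandom.SobolEngine(dimension=2, scramble=True)
    mc_points = sobol.draw(512).to(train_X)

    acqf = qNegIntegratedPosteriorVariance(model=model, mc_points=mc_points)
    bounds = torch.stack([torch.zeros(2), torch.ones(2)]).to(train_X)
    candidate, _ = optimize_acqf(
        acqf, bounds=bounds, q=1, num_restarts=10, raw_samples=256,
    )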