botorch
Bayesian optimization in PyTorch
Bayesian optimization provides sample-efficient global optimization for a broad range of applications, including automatic machine learning, molecular chemistry, and experimental design. We introduce BoTorch, a modern programming framework for Bayesian optimization. Enabled by Monte-Carlo (MC) acquisition functions and auto-differentiation, BoTorch's modular design facilitates flexible specification and optimization of probabilistic models written in PyTorch, radically simplifying implementation of novel acquisition functions. Our MC approach is made practical by a distinctive algorithmic foundation that leverages fast predictive distributions and hardware acceleration. In experiments, we demonstrate the improved sample efficiency of BoTorch relative to other popular libraries. BoTorch is open source and available at https://github.com/pytorch/botorch.
The realization that time- and resource-intensive machine learning (ML) experimentation could be addressed through efficient global optimization (O’Hagan, 1978; Jones et al., 1998; Osborne, 2010) was brought to the attention of the broader ML community by Snoek et al. (2012) and the accompanying library, Spearmint. In the years that followed, Bayesian optimization (BO) experienced an explosion of methodological advances for sample-efficient, general-purpose sequential optimization of black-box functions, with applications including hyperparameter tuning
(Feurer et al., 2015; Shahriari et al., 2016; Wu et al., 2019), robotic control (Calandra et al., 2016; Antonova et al., 2017), chemical design (Griffiths & Hernández-Lobato, 2017; Li et al., 2017), and internet-scale ranking systems and infrastructure (Agarwal et al., 2018; Letham et al., 2019; Letham & Bakshy, 2019).

Meanwhile, the broader field of ML has been undergoing its own revolution, driven largely by advances in systems and hardware, including the development of libraries such as PyTorch, TensorFlow, Caffe, and MXNet (Paszke et al., 2017; Abadi et al., 2016; Jia et al., 2014; Chen et al., 2015). While BO has become rich with new methodologies, the programming paradigms supporting new innovations have generally not fully exploited these computational advances.

In this paper, we introduce BoTorch (https://github.com/pytorch/botorch), a flexible, modular, and scalable programming framework for Bayesian optimization research, built around modern paradigms of computation. Its contributions include:
Systems:
- A set of flexible, model-agnostic abstractions for Bayesian sequential decision making, together with a lightweight API for composing these primitives.
- Robust implementations of models and acquisition functions for BO and active learning, including MC acquisition functions with support for parallel optimization, composite objectives, and asynchronous evaluation.
- Scalable BO that utilizes modern numerical linear algebra from GPyTorch (Gardner et al., 2018), including sparse interpolation and structure-exploiting algebra (Wilson & Nickisch, 2015), and fast predictive distributions and sampling (Pleiss et al., 2018).
- Auto-differentiation, GPU hardware acceleration, and integration with deep learning components via PyTorch.
Methodology:
- A novel approach to optimizing Monte-Carlo (MC) acquisition functions that effectively combines with deterministic higher-order optimization algorithms.
- A simple and flexible formulation of the Knowledge Gradient, a powerful look-ahead algorithm known for being difficult to implement, with improved performance over the state of the art (Wu & Frazier, 2016).
In the sections that follow, we demonstrate how BoTorch’s distinctive algorithmic approach, explicitly designed around differentiable programming and modern parallel processing capabilities (Sec. 3, 4, and 5), simplifies the development of complex acquisition functions (Sec. 6 and 9), while obtaining greater computational efficiency (Sec. 8) and optimization performance (Sec. 10).
In BO, we aim to solve the problem max_{x ∈ X} f(x), where f is an expensive-to-evaluate black-box function and X ⊂ R^d is a feasible set. BO consists of two main components: a probabilistic surrogate model of the observed function (most commonly, a Gaussian process (GP)) and an acquisition function that encodes a strategy for navigating the exploration vs. exploitation trade-off (Shahriari et al., 2016).
Popular libraries for BO include Spearmint (Snoek et al., 2012), GPyOpt (The GPyOpt authors, 2016), Cornell-MOE (Wu & Frazier, 2016), RoBO (Klein et al., 2017), Emukit (The Emukit authors, 2018), and Dragonfly (Kandasamy et al., 2019). We provide further discussion of these packages in Appendix A.
ProBO (Neiswanger et al., 2019) and GPFlowOpt (Knudde et al., 2017) are of particular relevance. ProBO is a recently suggested framework (no implementation is available at the time of this writing) for using general probabilistic programming in BO. While its distribution-agnostic approach is similar to ours, ProBO, unlike BoTorch, does not benefit from gradient-based optimization provided by differentiable programming, or algebraic methods designed to benefit from GPU acceleration. GPFlowOpt inherits support for auto-differentiation and hardware acceleration from TensorFlow (via GPFlow (Matthews et al., 2017)), but unlike BoTorch does not use algorithms designed to specifically exploit this potential (e.g. fast matrix multiplications or batched computation). Neither ProBO nor GPFlowOpt naturally support parallel optimization, MC acquisition functions, or fast batch evaluation, which form the core of BoTorch.
In contrast to all existing libraries, BoTorch pursues novel algorithmic approaches specifically designed to benefit from modern computing in order to achieve its high degree of flexibility, programmability, performance, and scalability. Robust implementation and testing makes BoTorch suitable for use in both research and production settings.
BoTorch
provides abstractions for combining BO primitives in a way that takes advantage of modern computing architectures, enabling BO with auto-differentiation, automatic parallelization, device-agnostic hardware acceleration, and generic neural network operators and modules. Implementing a new acquisition function can be as simple as defining a differentiable forward pass, after which gradient-based optimization comes for free. Figure 1 outlines what BoTorch considers the basic primitives of Bayesian sequential decision making:

Model: In BoTorch, the Model is a PyTorch module. Recent work has produced packages such as GPyTorch (Gardner et al., 2018) and Pyro (Bingham et al., 2018) that enable high-performance differentiable Bayesian modeling. Given those models, our focus here is on constructing acquisition functions and optimizing them effectively, using modern computing paradigms. BoTorch is model-agnostic: the only requirement for a model is that, given a set of inputs, it can produce posterior draws of one or more outcomes (explicit posteriors, such as those provided by a GP, can also be used directly). BoTorch provides seamless integration with GPyTorch and its algorithmic approaches specifically designed for scalable GPs, fast predictive distributions (Pleiss et al., 2018), and GPU acceleration.
Objective: An Objective is a module that applies a transformation to model outputs. For instance, it may scalarize outputs of a multi-output model for multi-objective optimization (Paria et al., 2018), or it could handle black-box constraints by weighting the objective outcome with the probability of feasibility (Gardner et al., 2014).

Acquisition function: An AcquisitionFunction implements a forward pass that takes a candidate set x = {x_1, ..., x_q} of design points and computes their joint utility α(x). Internally (indicated by the shaded block in Figure 1), MC acquisition functions operate on samples from the posterior evaluated under the objective, while analytic acquisition functions operate on the explicit posterior. The external API is the same for both.
Optimizer: Given the acquisition function, we must find a maximizer in order to proceed with the next iteration of BO. Auto-differentiation makes it straightforward to use gradient-based optimization even for complex acquisition functions and objectives, which typically performs better than derivative-free approaches.
Suppose we have collected data D = {(x_i, y_i)}_{i=1,...,n}, where x_i ∈ X and y_i = f(x_i) + v_i, with noise v_i corrupting the true function value f(x_i). We allow f to be multi-output, in which case y_i, v_i ∈ R^m. In some applications we may also have access to distributional information about the noise, such as its (possibly heteroskedastic) (co-)variance. Suppose further that we have a probabilistic surrogate model that for any set of points x = {x_1, ..., x_q} provides a posterior P(f(x) | D) over f(x) = (f(x_1), ..., f(x_q)). In BO, the model traditionally is a GP and the v_i are assumed i.i.d. normal, in which case both the posterior P(f(x) | D) and the posterior predictive P(y(x) | D) are multivariate normal. However, BoTorch makes no particular assumptions about the form of these posteriors.

The next step in BO is to optimize an acquisition function evaluated on the model posterior over the candidate set x. Following Wilson et al. (2018), many myopic acquisition functions can be written as
α(x; Φ) = E[ u(g(f(x)), Φ) | D ]    (1)

where g is a (composite) objective function, Φ are parameters independent of x, and u is a utility function that defines the acquisition function.
In some simple situations, the expectation in (1) and its gradient can be computed analytically. The classic case considered in the literature is a single-output (m = 1) model with a single candidate (q = 1) point x, a Gaussian model posterior f(x) | D ∼ N(μ(x), σ²(x)), and the identity objective g(f) = f. Expected Improvement (EI) is a popular acquisition function that maximizes the expected difference between the currently observed best value f* and the objective at the next query point, through the utility u(f(x)) = max(f(x) − f*, 0). EI and its gradient have a well-known analytic form (Jones et al., 1998). BoTorch implements popular analytic acquisition functions.
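To make the closed form concrete, the following is a minimal sketch of analytic EI for a Gaussian posterior at a single point, using only the standard normal CDF and PDF. It is an illustrative stand-alone function, not BoTorch's ExpectedImprovement module.

```python
import math

def expected_improvement(mu: float, sigma: float, best_f: float) -> float:
    """Analytic EI for a Gaussian posterior N(mu, sigma^2) at one point:
    EI = sigma * (z * CDF(z) + PDF(z)), with z = (mu - best_f) / sigma."""
    if sigma <= 0.0:
        # Degenerate posterior: improvement is deterministic.
        return max(mu - best_f, 0.0)
    z = (mu - best_f) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return sigma * (z * cdf + pdf)
```

At z = 0 (posterior mean equal to the incumbent best), EI reduces to σ·φ(0) ≈ 0.399·σ, reflecting pure exploration value from the posterior uncertainty.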
In general, analytic expressions are not available for more general objective functions , utility functions , non-Gaussian model posteriors, or collections of points which are to be evaluated in a parallel or asynchronous fashion (Snoek et al., 2012; Wu & Frazier, 2016; Wang et al., 2016a; Wilson et al., 2017). Instead, using samples from the posterior, Monte Carlo integration can be used to approximate the expectation (1), as well as its gradient via the reparameterization trick (Kingma & Welling, 2013; Wilson et al., 2018).
An MC approximation α̂_N of the acquisition function (1) using N posterior samples ξ_i ∼ P(f(x) | D) is straightforward:

α̂_N(x; Φ) = (1/N) Σ_{i=1}^N u(g(ξ_i), Φ)    (2)
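The estimator (2) can be sketched in plain Python for the common case of a multivariate-normal joint posterior over q points and an EI-style utility. This is a toy illustration of the computation, not BoTorch's qExpectedImprovement; the reparameterization ξ = μ + Lε used here is introduced in more detail below.

```python
def mc_q_expected_improvement(mu, chol, best_f, base_samples):
    """MC estimate of parallel (q-)EI as in (2).

    mu:           posterior mean over q points (length-q list)
    chol:         lower-triangular Cholesky factor of the q x q posterior covariance
    best_f:       incumbent best objective value
    base_samples: list of N base draws eps ~ N(0, I), each of length q
    """
    q = len(mu)
    total = 0.0
    for eps in base_samples:
        # Reparameterized joint sample: xi = mu + L @ eps
        xi = [mu[j] + sum(chol[j][k] * eps[k] for k in range(j + 1)) for j in range(q)]
        # Utility: improvement of the best of the q sampled values
        total += max(max(xi) - best_f, 0.0)
    return total / len(base_samples)
```

Holding `base_samples` fixed between calls makes the estimate a deterministic function of the posterior parameters, which is exactly the property exploited later for deterministic optimization.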
MC acquisition functions enable modularity by naturally handling a much broader class of models and objectives than could be expressed analytically. In particular, we show how MC acquisition functions may be used to support:
Arbitrary posteriors: Analytic acquisition functions are usually defined for univariate normal posteriors. In contrast, MC acquisition functions can be directly applied to any posterior from which samples can be drawn, including probabilistic programs (Tran et al., 2017; Bingham et al., 2018), Bayesian neural networks (Neal, 1996; Saatci & Wilson, 2017; Maddox et al., 2019; Izmailov et al., 2019), and more general types of Gaussian processes (Damianou & Lawrence, 2013; Gardner et al., 2018).
Arbitrary objectives and constraints: MC acquisition functions can be used with arbitrary composite objective functions of the outcome(s). This is not the case for analytic acquisition functions, as the transformation will generally change the distributional form of the posterior. This setup also enables generic support for unknown (and to-be-modeled) outcome constraints by using a feasibility-weighted improvement criterion similar to Schonlau et al. (1998), Gardner et al. (2014), Gelbart et al. (2014), and Letham et al. (2019), implemented in a differentiable fashion by approximating the feasibility indicator with a sigmoid function at the sample level.
Parallel evaluation: Wilson et al. (2017) show how MC acquisition functions can be used to generalize a number of acquisition functions to the parallel BO setting, where q evaluations are made simultaneously. In BoTorch, MC acquisition functions by default consider multiple candidate points and have a consistent interface for any value of q.
Asynchronous evaluation: Since MC acquisition functions compute the joint acquisition value for a set of points, asynchronous candidate generation, in which a set of pending points x_pend has been submitted for evaluation but has not yet returned results, is handled generically by automatically concatenating x_pend into the arguments of the acquisition function. We thus compute the joint acquisition value of all points, pending and new, but optimize only with respect to the new candidate points x_new.
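The concatenation logic above can be sketched as a small decorator (BoTorch implements it as one); the toy `SumAcquisition` class below, whose "acquisition value" is simply the number of points, is a hypothetical stand-in used only to show the mechanism.

```python
def concatenate_pending_points(method):
    """Append pending points to the candidate set before computing the joint
    acquisition value; the optimizer still only varies the new candidates."""
    def wrapper(acqf, X):
        if getattr(acqf, "X_pending", None):
            X = X + acqf.X_pending  # joint set: new candidates followed by pending points
        return method(acqf, X)
    return wrapper

class SumAcquisition:
    """Toy acquisition function: the joint value is just the point count."""
    def __init__(self, X_pending=None):
        self.X_pending = X_pending or []

    @concatenate_pending_points
    def forward(self, X):
        return len(X)
```

With one pending point registered, evaluating two new candidates yields a joint set of three points, matching the behavior described in the text.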
Fig. 2 summarizes the chain of operations involved in evaluating an MC acquisition function in BoTorch.
Effectively optimizing the acquisition function, especially in higher dimensions, typically requires using gradient information. In many cases, an unbiased estimate of ∇_x α̂_N(x) can be obtained via the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014). The basic idea is that a sample ξ from P(f(x) | D) can be expressed as a suitable (differentiable) deterministic transformation ξ = h(x, ε) of an auxiliary base sample ε drawn from some base distribution independent of x. If h is differentiable with respect to x, then ∇_x ξ = ∇_x h(x, ε). For instance, if f(x) | D ∼ N(μ(x), Σ(x)), then ξ = μ(x) + L(x) ε, with ε ∼ N(0, I) and L(x) a root decomposition of Σ(x), such as the Cholesky factor. In BoTorch, these chains of derivatives are handled by auto-differentiation.

Quasi-Monte Carlo (qMC) methods are an established technique for reducing the variance of MC integration (Caflisch, 1998). Instead of drawing i.i.d. samples from the base distribution, in many cases one can instead use qMC sampling to significantly lower the variance of the estimate (2) and its gradient (see Appendix B for additional details). By default, BoTorch uses scrambled Sobol sequences (Owen, 2003) as the basis for evaluating MC acquisition functions.
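A toy one-dimensional example makes the reparameterization gradient concrete. Assume (purely for illustration) a posterior with mean μ(x) = x² and scale σ(x) = x, so ξ = x² + x·ε and dξ/dx = 2x + ε; averaging over base samples gives an estimate of d/dx E[ξ] = 2x.

```python
def reparam_grad_estimate(x, base_samples):
    """Gradient estimate of alpha(x) = E[xi], with the toy reparameterization
    xi = x**2 + x * eps (assumed posterior, not a real model):
    d xi / dx = 2*x + eps, averaged over the base samples."""
    return sum(2.0 * x + eps for eps in base_samples) / len(base_samples)
```

With base samples that average to zero (as i.i.d. or qMC normal draws do in expectation), the estimate recovers the exact gradient 2x; in BoTorch the analogous chain of derivatives is produced by auto-differentiation rather than by hand.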
BoTorch provides a consistent, model-agnostic interface through the following API functions:
Model:
- posterior(X, observation_noise=False): Given a candidate set x, return a Posterior object representing the joint posterior P(f(x) | D) (if observation_noise=True, return the posterior predictive P(y(x) | D)).
- fantasize(X, sampler): Given a candidate set x and an MCSampler sampler, construct a batched fantasy model over N_f samples (specified in the sampler), i.e. a set of models f_i ∼ P(f | D ∪ D_x^i) for i = 1, ..., N_f, where D_x^i = {(x, y_i)} and y_i is a sample from the posterior predictive distribution at x.
Posterior:
- rsample(sample_shape, base_samples): Given base samples ε, draw sample_shape samples from the joint posterior over the q points, with m outcomes each.
MCSampler:
- forward(p): Given an MCSampler object sampler and a Posterior object p, use sampler(p) to draw samples from p by automatically constructing appropriate base samples ε. If sampler is constructed with resample=False, do not re-construct base samples between evaluations, ensuring that posterior samples are deterministic (conditional on the base samples).
MCObjective:
- forward(ξ): Compute the (composite) objective g(ξ) of a number of (multi-dimensional) samples ξ drawn from a (multi-output) Posterior.
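The contract between these components can be sketched with minimal stand-ins: a one-dimensional Gaussian "Posterior" whose rsample uses the reparameterization ξ = μ + σ·ε, and a sampler that constructs its base samples once (the resample=False behavior). These classes are illustrative analogues of the API above, not BoTorch's actual implementations.

```python
class ToyGaussianPosterior:
    """Minimal Posterior analogue (1-d, single outcome): rsample draws via
    xi = mu + sigma * eps, so fixed base samples give deterministic draws."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def rsample(self, base_samples):
        return [self.mu + self.sigma * eps for eps in base_samples]

class ToyFixedSampler:
    """Minimal MCSampler analogue with resample=False semantics:
    base samples are constructed once and reused on every call."""
    def __init__(self, base_samples):
        self.base_samples = base_samples

    def __call__(self, posterior):
        return posterior.rsample(self.base_samples)
```

Because the sampler reuses its base samples, repeated evaluations of the same posterior return identical draws, which is what makes the downstream MC acquisition value deterministic.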
To demonstrate the modularity and flexibility of BoTorch, we show how both existing approaches and novel acquisition functions can be implemented succinctly and easily using the API abstractions introduced in the previous section.
In many applications of BO, the objective may have some known functional form, such as the mean squared error between simulator outputs and empirical data in the context of simulation calibration. Astudillo & Frazier (2019) show that modeling the individual components of the objective can be advantageous. They propose the use of composite functions and develop an MC-based variant of EI that achieves improvements in sample efficiency. Such composite functions are a special case of BoTorch's Objective abstraction and can be readily implemented as such.
We consider the model calibration example from Astudillo & Frazier (2019), where the goal is to minimize the error between the output of a model of pollutant concentrations and observed concentrations at 12 locations. Code Example 1 shows how the setup from Astudillo & Frazier (2019) can be extended to work with the Knowledge Gradient (KG), a sample-efficient acquisition function that, so far, has not been used with composite objectives.
This is achieved simply by passing a multi-output BoTorch model modeling the individual components, pending points X_pending, and an objective callable via the GenericMCObjective module.
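A composite calibration objective of this kind can be sketched as a plain callable of the sort one would wrap in GenericMCObjective: it maps a multi-output sample (modeled concentrations at the observation locations) to the negative mean squared error against the observed data, so that maximizing the objective minimizes calibration error. The factory function name and toy inputs are hypothetical.

```python
def make_calibration_objective(observed):
    """Return a composite objective g(sample) = -MSE(sample, observed).
    `observed` holds the empirical concentrations (e.g., 12 locations)."""
    def objective(sample):
        return -sum((s - o) ** 2 for s, o in zip(sample, observed)) / len(observed)
    return objective
```

Applied per posterior sample inside an MC acquisition function, this transforms samples of the individual components into samples of the scalar objective, as in the composite-function setup of Astudillo & Frazier (2019).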
Letham et al. (2019) introduce Noisy EI (NEI), an extension of EI that is well suited for settings with noisy function evaluations, such as A/B tests. In this setting, it is important to account for the uncertainty in the best function value at the points X_obs observed so far. That is, the incumbent best value f* = max g(f(X_obs)) is a random variable correlated with f(x).

Here, we propose a novel full-MC formulation of NEI that extends Letham et al. (2019) to joint parallel optimization and generic objectives. Our implementation avoids the need to characterize f* explicitly by averaging improvements over samples from the joint posterior over new and previously evaluated points:
α_NEI(x) = (1/N) Σ_{i=1}^N max( max g(ξ_i^x) − max g(ξ_i^obs), 0 )    (3)

where (ξ_i^x, ξ_i^obs) ∼ P(f(x), f(X_obs) | D) are joint samples over the new and previously observed points. The implementation of (3) is given in Code Example 2 (here and in following examples, we omit the constructor for brevity; full implementations are provided in Appendix E.4). X_baseline is an appropriate subset of the points at which the function was observed.
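The per-sample computation in (3) can be sketched directly on sampled objective values: each joint posterior sample yields one draw over the new points and one over the baseline points, and the estimator averages the positive part of the difference of their maxima. This is a toy forward pass operating on pre-drawn samples, not BoTorch's qNoisyExpectedImprovement module.

```python
def q_noisy_ei(samples_new, samples_baseline):
    """MC estimate of (3). samples_new[i] and samples_baseline[i] together form
    the i-th joint posterior sample (objective values at the q new points and
    at X_baseline, respectively)."""
    n = len(samples_new)
    return sum(
        max(max(xi_new) - max(xi_base), 0.0)
        for xi_new, xi_base in zip(samples_new, samples_baseline)
    ) / n
```

Because the incumbent best is re-computed per joint sample, its uncertainty and its correlation with the new points are handled implicitly, with no explicit characterization of f*.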
We achieve support for asynchronous evaluation by automatically concatenating pending points into x (via the @concatenate_pending_points decorator). We can incorporate constraints on modeled outcomes, as discussed in Letham et al. (2019), using a differentiable relaxation in the objective, as we describe in Section 4.2.
While BO aims to identify the extrema of a black-box function, the goal in active learning (e.g., Settles, 2009) is to explore the design space in a sample-efficient way in order to reduce model uncertainty in a more global sense. For example, one can maximize the negative integrated posterior variance (NIPV) (Seo et al., 2000; Chen & Zhou, 2014) of the model:
α_NIPV(x) = − ∫_X Var( f(x') | D ∪ D_x ) dx'    (4)

Here D_x denotes the additional (random) set of observations collected at a candidate set x. The integral in (4) must be evaluated numerically, for which we use MC integration.
We can implement (4) using standard BoTorch components, as shown in Code Example 3. Here mc_points is the set of points used for MC-approximating the integral. In the most basic case, one can use qMC samples drawn uniformly in X. By allowing for arbitrary mc_points, we permit weighting regions of X using non-uniform sampling. Using mc_points as samples of the maximizer of the posterior, we recover the recently proposed Posterior Variance Reduction Search (Nguyen et al., 2017) for BO.
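The MC approximation of (4) is a short computation once a posterior-variance oracle is available; the sketch below takes any callable (x', X) → Var(f(x') | D ∪ D_X) as an assumed input and is not BoTorch's qNegIntegratedPosteriorVariance.

```python
def neg_integrated_posterior_variance(posterior_var, X, mc_points):
    """MC approximation of (4): average the conditioned posterior variance
    over mc_points and negate, so that maximizing the acquisition value
    minimizes integrated model uncertainty."""
    return -sum(posterior_var(x, X) for x in mc_points) / len(mc_points)
```

Non-uniform weighting of the design space falls out for free: supplying mc_points drawn from any distribution over X weights the integral accordingly, which is how the maximizer-sample variant mentioned above is recovered.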
This acquisition function supports both parallel selection of points and asynchronous evaluation. Since MC integration requires evaluating the posterior variance at a large number of points, this acquisition function benefits significantly from the fast predictive variance computations in GPyTorch (Pleiss et al., 2018; Gardner et al., 2018).
To illustrate how NIPV may be used in combination with scalable probabilistic modeling, we examine the problem of efficiently allocating surveys across a geographic region. Inspired by Cutajar et al. (2019), we utilize publicly available data from The Malaria Atlas Project (2019), which includes the yearly mean parasite rate (along with standard errors) of Plasmodium falciparum on a spatial grid across Africa. In particular, we consider the following active learning problem: given a spatio-temporal probabilistic model fit to data from 2011-2016, which geographic locations in and around Nigeria should one sample in 2017 in order to minimize the model's error for 2017 across all of Nigeria?

We fit a heteroskedastic GP model to 2500 training points prior to 2017 (using a noise model that is itself a GP fit to the provided standard errors). We then select sample locations for 2017 using the NIPV acquisition function, and make predictions across the entirety of Nigeria using this new data. Compared to using no 2017 data, we find that our new dataset reduces MSE by 16.7% on average (SEM = 0.96%) across 60 subsampled datasets. By contrast, sampling the new 2017 points on a regularly spaced grid results in only a 12.4% reduction in MSE (SEM = 0.99%). The mean relative improvement in MSE reduction from NIPV optimization is 21.8% (SEM = 6.64%). Figure 4 shows the NIPV-selected locations on top of the base model's estimated parasite rate and standard deviation.
BoTorch uses differentiable programming to automatically evaluate chains of derivatives when computing gradients (such as the gradient of α̂_N from Figure 2). This functionality enables gradient-based optimization without having to implement gradients for each step in the chain (the samples drawn from the posterior must be differentiable w.r.t. the input x, as is the case for all models provided by GPyTorch). Normally, any change to the model, objective, or acquisition function would require manually deriving and implementing the new gradient computation. In BoTorch, that time-consuming and error-prone process is unnecessary, making it much easier to apply gradient-based acquisition function optimization to new formulations in a modular fashion, greatly improving research efficiency.
Evaluating MC acquisition functions yields noisy, unbiased gradient estimates. Thus maximization of these acquisition functions can be conveniently implemented in BoTorch using standard first-order stochastic optimization algorithms such as SGD (Wilson et al., 2018) available in the torch.optim module. However, it is often preferable to solve a biased, deterministic optimization problem instead: Rather than re-drawing the base samples for each evaluation, we can draw the base samples once, and hold them fixed between evaluations of the acquisition function. The resulting MC estimate, conditioned on the drawn samples, is deterministic and differentiable.
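The effect of fixing base samples can be shown with a toy MC acquisition function. Under an assumed posterior with mean μ(x) = −(x − 1)² and a fixed scale of 0.1 (illustrative choices, not a fitted model), the closure below is a deterministic, smooth function of x once the base samples are frozen, so it can be handed to a quasi-Newton optimizer.

```python
def make_deterministic_acqf(base_samples):
    """Build a deterministic MC acquisition value: the base samples are drawn
    once (outside) and closed over, so repeated evaluations at the same x
    return identical values."""
    def acqf(x):
        # EI-style utility against incumbent best 0.0, averaged over samples
        return sum(
            max(-(x - 1.0) ** 2 + 0.1 * eps, 0.0) for eps in base_samples
        ) / len(base_samples)
    return acqf
```

Re-drawing the base samples on every call would instead make each evaluation noisy, forcing the use of stochastic first-order methods with their attendant learning-rate tuning.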
To optimize the acquisition function, one can now employ the full toolbox of deterministic optimization, including quasi-Newton methods that provide faster convergence speeds and are generally less sensitive to optimization hyperparameters than stochastic first-order methods. To this end, BoTorch provides an interface with support for any algorithm in the scipy.optimize module.
By default, we use multi-start optimization via L-BFGS-B in conjunction with an initialization heuristic that exploits fast batch evaluation of acquisition functions (see Appendix D for details). We find that the added bias from fixing the base samples has only a minor effect on performance relative to using the analytic ground truth, and can improve performance relative to stochastic optimizers that do not fix the base samples (Appendix B), while avoiding tedious tuning of optimization hyperparameters such as learning rates.
BoTorch is the only Bayesian optimization framework with inference methods specifically designed for hardware acceleration and fast predictive distributions. In particular, BoTorch works with GPyTorch models that employ new stochastic Krylov subspace methods for inference, which perform all computations through matrix multiplication. Compared to the standard Cholesky-based approaches in all other available packages, these methods can be parallelized and massively accelerated on modern hardware such as GPUs (Gardner et al., 2018), with additional test-time scalability benefits (Pleiss et al., 2018) that are particularly relevant to online settings such as Bayesian optimization.
Batch evaluation, an important element of modern computing, enables automatic dispatch of independent operations across multiple computational resources (e.g. CPU and GPU cores) for parallelization and memory sharing. All BoTorch components support batch evaluation, which makes it easy to write concise and highly efficient code in a platform-agnostic fashion.
Specifically, batch evaluation provides fast queries of acquisition functions at a large number of candidate sets in parallel, enabling novel initialization heuristics and optimization techniques. Instead of sequentially evaluating an acquisition function at b candidate sets x_1, ..., x_b, where x_i ∈ R^{q×d} for each i, BoTorch evaluates a single batched tensor X ∈ R^{b×q×d}. Computation is automatically distributed so that, depending on the hardware used, speedups can be close to linear in the batch size b. Batch evaluation is also heavily used in computing MC acquisition functions, with the effect that increasing the number of MC samples up to several thousand often has little impact on wall time.
Figure 6 reports wall times for batch evaluation of qExpectedImprovement as a function of the batch size b for different MC sample sizes N, on both CPU and GPU, for a GPyTorch GP (reported times do not include the one-time computation of the test caches, which need to be computed only once per BO step; evaluation was performed on a Tesla M40 GPU, and on more recent hardware we observe even more significant speedups). We observe significant speedups from running on the GPU, with scaling essentially linear in the batch size except for very large b and N. There is a fixed cost due to communication overhead that renders CPU evaluation faster for small batch and sample sizes.
While much of the literature on scalable GPs focuses on the space and time complexity of training, it is fast test-time (predictive) distributions that are crucial for applications in which the same model is evaluated many times, such as when optimizing an acquisition function in BO. GPyTorch uses structure-exploiting algebra and local interpolation to substantially reduce the cost of querying the predictive distribution and of drawing posterior samples, compared to the standard Cholesky-based computations (Pleiss et al., 2018). Figure 6 shows 10-40X speedups from using fast predictive covariance estimates over performing standard posterior inference in the setting from Section 8.1. Reported relative speedups grow more slowly on the GPU, whose cores do not saturate as quickly as the CPU's.
Together with the extreme speedups from batch evaluation, fast predictive distributions enable users to efficiently compute the acquisition function at a very large number (e.g., tens of thousands) of points in parallel. Such parallel prediction can be used as a simple and computationally efficient way of doing random search over the domain. The resulting candidate points can either be used directly in constructing a candidate set, or as initial conditions for further gradient-based optimization.
At a basic level, BO aims to find solutions to a full stochastic dynamic programming problem over the remaining (possibly infinite) horizon — a problem that is generally intractable. Look-ahead acquisition functions are approximations to this problem that explicitly take into account the effect observations have on the model in subsequent optimization steps. Popular look-ahead acquisition functions include the Knowledge Gradient (KG) (Frazier et al., 2008), Predictive Entropy Search (Hernández-Lobato et al., 2014), Max-Value Entropy Search (Wang & Jegelka, 2017), as well as various heuristics (González et al., 2016b).
KG quantifies the expected increase in the maximum of the model's posterior mean from obtaining the additional (random) data set D_x. KG often shows improved BO performance relative to simpler acquisition functions such as EI (Scott et al., 2011), but in its traditional form it is computationally expensive and hard to implement, two caveats that we alleviate in this work.
A generalized variant of parallel KG (Wu & Frazier, 2016) is given by

α_KG(x) = E[ max_{x' ∈ X} E[ g(f(x')) | D ∪ D_x ] ] − μ*    (5)

where the inner expectation is taken under the posterior conditioned on D ∪ D_x, with D_x the (random) dataset observed at x, and μ* = max_{x' ∈ X} E[ g(f(x')) | D ]. KG as in (5) quantifies the expected increase in the maximum posterior mean of g ∘ f after gathering samples at x. For simplicity, we consider only the standard BO problem, but multi-fidelity extensions (Poloczek et al., 2017; Wu et al., 2019) can also easily be used.
The standard approach for optimizing parallel KG (where q > 1) is to apply stochastic gradient ascent, with each gradient observation potentially being an average over multiple samples (Wu & Frazier, 2016; Wu et al., 2017). For each sampled dataset D_x^i, the inner optimization problem over the posterior mean is solved numerically, either via another stochastic gradient ascent (Wu et al., 2017) or via multi-start L-BFGS (Frazier, 2018). An unbiased stochastic gradient of KG can then be computed by leveraging the envelope theorem and the optimal inner points. Alternatively, the inner problem can be discretized and the gradient computation of Wu & Frazier (2016) can be used. We emphasize that these approaches require optimizing the inner and outer problems separately (in an alternating fashion). The associated computational expense can be quite large; our main insight is that these computations can also be unnecessary.
We propose to treat optimizing α_KG in (5) as an entirely deterministic optimization problem. We draw fixed base samples ε_i for the outer expectation, sample fantasy data D_x^i, and construct associated fantasy models f_i ∼ P(f | D ∪ D_x^i). We then move the inner maximization outside of the sample average, resulting in the following optimization problem:

max_{x ∈ X^q, X' ∈ X^{N_f}} (1/N_f) Σ_{i=1}^{N_f} E[ g(f_i(x'_i)) ]    (6)

with X' = {x'_1, ..., x'_{N_f}} a set of inner maximizer candidates, one per fantasy model. If the inner expectation does not have an analytic expression, we also draw fixed base samples and use an MC approximation of the form (2). In either case we are left with a deterministic optimization problem. Conditional on the base samples, the maximization problems (5) and (6) are equivalent. The key difference from the envelope theorem approach is that we do not solve the inner optimization problem to completion for every fantasy point for every gradient step with respect to x. Instead, we solve (6) jointly over x and the fantasy points X'. The resulting optimization problem is of higher dimension, namely d(q + N_f) instead of dq, but unlike the envelope theorem formulation it can be solved as a single optimization problem, using methods for deterministic optimization. Consequently, we dub the KG variant utilizing this optimization strategy the "One-Shot Knowledge Gradient" (OKG). The ability to auto-differentiate w.r.t. x and X' allows BoTorch to solve this problem effectively.
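The structure of (6) can be illustrated with a deliberately simple toy: for each fixed base sample ε_i, take the fantasy posterior mean to be μ_i(x') = −(x' − x)² + ε_i·x' (an assumed analytic form standing in for conditioning a model on fantasy data at x, not a real GP update). The one-shot objective then averages each μ_i at that fantasy's own free maximizer candidate, and x and the candidates are optimized jointly.

```python
def okg_value(x, x_fantasy, base_samples):
    """Toy one-shot KG objective in the shape of (6): average, over fantasy
    models indexed by base samples eps_i, the fantasy posterior mean
    mu_i(x') = -(x' - x)**2 + eps_i * x' evaluated at x_fantasy[i].
    Instead of solving max_{x'} mu_i(x') per fantasy at every outer step,
    the entries of x_fantasy are decision variables optimized jointly with x."""
    return sum(
        -(xf - x) ** 2 + eps * xf for xf, eps in zip(x_fantasy, base_samples)
    ) / len(base_samples)
```

Moving each x_fantasy[i] toward its fantasy's maximizer increases the value at fixed x, which is exactly why a single deterministic optimizer over the concatenated variables (x, X') can replace the alternating inner/outer scheme.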
As discussed in the previous section, the bias introduced by fixing base samples diminishes with the number of samples, and our empirical results in Section 10 show that OKG can utilize sufficiently few base samples to both be computationally more efficient while at the same time improving closed-loop BO performance compared to the stochastic gradient approach implemented in other packages.
Code Example 4 shows a slightly simplified OKG implementation (for the full one see Appendix E.4). The fixed base samples are defined as part of the sampler (and possibly inner_sampler) modules. SimpleRegret computes the inner terms E[ g(f_i(x'_i)) ] from (6). By expecting forward's input X to be the concatenation of x and X', OKG can be optimized using the same APIs as all other acquisition functions (in doing so, we differentiate through fantasize).
So far, we have shown how BoTorch provides an expressive framework for implementing and optimizing acquisition functions, considering experiments in model calibration (Section 6.1), noisy objectives (Section 6.2), and epidemiology (Section 6.3). In this section, we compare (i) the empirical performance of standard algorithms implemented in BoTorch with implementations in other popular BO libraries, and (ii) our novel acquisition function, OKG, against other acquisition functions, both within BoTorch and in other packages. Specifically, we compare BoTorch with GPyOpt, Cornell MOE (MOE EI and MOE KG), and the recent Dragonfly library (we were unable to install GPFlowOpt due to its incompatibility with current versions of GPFlow/TensorFlow). GPyOpt uses an extension of EI with a local penalization heuristic (henceforth GPyOpt LP-EI) for parallel optimization (González et al., 2016a). Dragonfly does not provide a modular API, so we consider its default ensemble heuristic (henceforth Dragonfly GP Bandit) (Kandasamy et al., 2019).
Our results provide three main takeaways. First, we find that BoTorch’s algorithms tend to achieve greater sample efficiency compared to those of other packages (all packages use their default models and settings). Second, we find that OKG often outperforms all other acquisition functions. Finally, the ability of BoTorch to exploit conjugate gradient methods and GPU acceleration makes OKG more computationally scalable than MOE KG.
We consider BO for parallel optimization of multiple design points on four noisy synthetic functions used in Wang et al. (2016a): Branin, Rosenbrock, Ackley, and Hartmann. We report results for Hartmann here; results for the other functions are qualitatively similar and are provided in Appendix C.1. All algorithms start from the same set of qMC-sampled initial points for each trial, with the number of initial points depending on the dimension of the design space. We evaluate based on the true noiseless function value at the “suggested point” (i.e., the point that would be chosen if BO were to end at this batch). OKG, MOE KG, and NEI use “out-of-sample” suggestions, while the others use “in-sample” suggestions (Frazier, 2018). Figure 8 reports means and 95% confidence intervals over 100 trials. Empirical results for constrained BO using a differentiable relaxation of the feasibility indicator on the sample level are provided in Appendix C.2.
We measure the wall times incurred by OKG and MOE KG on Hartmann with parallel candidate points. In the case of OKG, we compare Cholesky versus linear conjugate gradient (CG) solves, on both CPU and GPU. Figure 8 shows the computational scaling of BO over 10 trials. OKG achieves improved optimization performance over MOE KG (see Figure 8), while generally requiring less wall time. Although there is some constant overhead to utilizing the GPU, wall time scales very well to models with a large number of observations.
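A sketch of why linear CG solves can be preferable to dense Cholesky factorization at scale: CG touches the matrix only through matrix-vector products and can be truncated after a few iterations. A minimal pure-Python version on a toy SPD system (illustrative, not BoTorch/GPyTorch code):

```python
def cg(matvec, b, iters=50, tol=1e-10):
    """Minimal linear conjugate gradients for an SPD system A x = b.
    Only matrix-vector products are needed, so structured kernel matrices
    can be exploited and the iteration can stop early."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x (x = 0 initially)
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Toy SPD system (a small "kernel matrix" with jitter on the diagonal).
A = [[2.0, 0.5, 0.1], [0.5, 2.0, 0.5], [0.1, 0.5, 2.0]]
b = [1.0, 0.0, -1.0]
x = cg(lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A], b)
residual = [bi - sum(a * xi for a, xi in zip(row, x)) for bi, row in zip(b, A)]
assert max(abs(ri) for ri in residual) < 1e-6
```

In exact arithmetic CG converges in at most n iterations for an n-by-n system; in practice far fewer iterations suffice for well-conditioned kernel matrices, which is where the wall-time advantage comes from.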
In this section, we illustrate the performance of BoTorch on real-world applications, represented by three hyperparameter optimization (HPO) experiments. As HPO typically involves long and resource-intensive training jobs, it is standard to select the configuration with the best observed performance, rather than to evaluate a “suggested” configuration (noiseless function evaluations are not possible in this setting).
We first consider the setting of reinforcement learning (RL), illustrating the case of tuning a deep Q-network (DQN) learning algorithm
(Mnih et al., 2013, 2015) on the classical Cartpole task. Our experiment uses the Cartpole environment from OpenAI Gym (Brockman et al., 2016) and the default DQN agent implemented in Horizon (Gauci et al., 2018). We allow for a maximum of 60 training episodes or 2,000 training steps, whichever occurs first. The five hyperparameters tuned by BO are the exploration parameter (“epsilon”), the target update rate, the discount factor, the learning rate, and the learning rate decay. To reduce noise, each “function evaluation” is taken to be the average of 10 independent training runs of DQN.
Figure 10 presents the optimization performance of the various acquisition functions from the different packages, using 15 rounds of parallel evaluations over 100 trials. While all algorithms achieve reasonable performance in later iterations, BoTorch OKG, EI, NEI, and GPyOpt LP-EI show faster learning early on.
As a representative of a standard neural network parameter tuning problem, we consider the neural network surrogate model (for the UCI Adult data set) introduced by Falkner et al. (2018), which is available as part of HPOlib2 (Eggensperger et al., 2019). This is a six-dimensional problem over network parameters (number of layers, units per layer) and training parameters (initial learning rate, batch size, dropout, exponential decay factor for learning rate). We use a surrogate model to achieve a high level of precision in comparing the performance of the algorithms without incurring excessive computational training costs.
Figure 10 shows optimization performance in terms of best observed classification accuracy. Results are means and 95% confidence intervals computed from 200 trials with 75 iterations. All BoTorch algorithms perform quite similarly here, with OKG doing slightly better in earlier iterations. Notably, they all achieve significantly better accuracy than the algorithms from the other packages.
Our final example considers the recently proposed Stochastic Weight Averaging (SWA) procedure of Izmailov et al. (2018). This example is indicative of how BO is used in practice: since SWA is a relatively new optimization procedure, it is less known which hyperparameters provide good performance. We consider the HPO problem on one of the CIFAR-10 experiments performed in Izmailov et al. (2018)
: 300 epochs of training on the VGG-16
(Simonyan & Zisserman, 2014) architecture. Izmailov et al. (2018) report the mean and standard deviation of the test accuracy over three runs. Due to the significant computational expense of running a large number of trials, we provide a limited comparison of OKG against random search, illustrating BoTorch’s ability to provide value on a computationally expensive real-world problem. We tune three SWA hyperparameters: the learning rate, the update frequency, and the starting iteration. After 25 trials of BO with 8 initial points and 5 subsequent parallel iterations, OKG achieved a higher average accuracy than random search; full learning curves are provided in Appendix C.3.
BoTorch’s modular design and flexible API — in conjunction with algorithms specifically designed to exploit parallelization, auto-differentiation, and modern computing — provides a modern programming framework for Bayesian optimization and active learning. This framework is particularly valuable in helping developers to rapidly assemble novel acquisition functions. Specifically, the basic MC acquisition function abstraction provides generic support for batch optimization, asynchronous evaluation, qMC integration, and composite objectives (including outcome constraints).
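The abstraction described above can be sketched as a small pure-Python class (the posterior sampler below is a made-up stand-in, not a real model, and the class is not BoTorch’s actual API): an MC acquisition function composes posterior sampling via fixed base samples, a composite objective on the sampled outcomes, and a reduction over the q-batch.

```python
import random

class MCAcquisition:
    """Sketch of a generic MC acquisition function: draw joint posterior
    samples via fixed base samples, map them through a composite objective,
    and reduce over the q-batch."""
    def __init__(self, posterior_sample, objective, num_samples=128, q_max=8, seed=0):
        self.posterior_sample = posterior_sample   # (x, z) -> one outcome sample
        self.objective = objective                 # composite objective on outcomes
        rng = random.Random(seed)
        self.base = [[rng.gauss(0.0, 1.0) for _ in range(q_max)]
                     for _ in range(num_samples)]  # fixed base samples

    def __call__(self, X, best_f=0.0):
        """qEI-style value of a q-batch X: average of max_j (obj(y_j) - best_f)+."""
        total = 0.0
        for z in self.base:
            improvements = [max(self.objective(self.posterior_sample(x, zj)) - best_f, 0.0)
                            for x, zj in zip(X, z)]
            total += max(improvements)
        return total / len(self.base)

# Toy posterior: mean -(x - 0.3)^2, constant std 0.2; identity objective.
acq = MCAcquisition(lambda x, z: -(x - 0.3) ** 2 + 0.2 * z, lambda y: y)
assert acq([0.5]) == acq([0.5])     # fixed base samples: deterministic evaluation
assert acq([0.3]) > acq([0.9])      # higher posterior mean -> higher value
```

Swapping the objective (e.g., for a constraint-weighted or otherwise composite objective) or the reduction changes the acquisition function without touching the rest of the machinery, which is the modularity claimed above.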
We presented a strategy for effectively optimizing MC acquisition functions using deterministic optimization. As an example, we developed an extension of this approach for “one-shot” optimization of look-ahead acquisition functions, specifically of OKG. This feature itself constitutes a significant development of KG, allowing for generic composite objectives and outcome constraints.
Our empirical results show that besides increased flexibility, these advancements in both methodology and computational efficiency translate into significantly faster and more accurate closed-loop optimization performance on a range of standard problems. While we do not specifically consider other settings such as high-dimensional BO (Kandasamy et al., 2015; Wang et al., 2016b), these approaches can be readily implemented in BoTorch.
The new BoTorch procedures for optimizing acquisition functions, designed to exploit fast parallel function and gradient evaluations, help make it possible to significantly broaden the applicability of Bayesian optimization procedures in future work.
For example, if one can condition on gradient observations of the objective, then it may be possible to apply Bayesian optimization where traditional gradient-based optimizers are used — but with faster convergence and robustness to multimodality. These settings could include higher-dimensional objectives, and objectives that are inexpensive to query. While acquisition functions have been developed to effectively leverage gradient observations (Wu et al., 2017), the addition of gradients incurs a significant computational burden. This burden can be eased by the scalable inference methods and hardware acceleration available in BoTorch.
One can also naturally generalize Bayesian optimization procedures to incorporate neural architectures in BoTorch. In particular, deep kernel architectures (Wilson et al., 2016), deep Gaussian processes (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017), and variational auto-encoders (Gómez-Bombarelli et al., 2018; Moriconi et al., 2019) are supported in BoTorch, and can be used for more expressive kernels in high-dimensions.
In short, BoTorch provides the research community with a robust and extensible basis for implementing new ideas and algorithms using modern computational paradigms.
One of the earliest commonly-used packages is Spearmint (Snoek et al., 2012), which implements a variety of modeling techniques such as MCMC hyperparameter sampling and input warping (Snoek et al., 2014). Spearmint also supports parallel optimization via fantasies, and constrained optimization with the expected improvement and predictive entropy search acquisition functions (Gelbart et al., 2014; Hernández-Lobato et al., 2015). Spearmint was among the first libraries to make BO easily accessible to the end user.
GPyOpt (The GPyOpt authors, 2016) builds on the popular GP regression framework GPy (GPy, since 2012). It supports a similar set of features as Spearmint, along with a local penalization-based approach for parallel optimization (González et al., 2016a). It also provides the ability to customize different components through an alternative, more modular API.
Cornell-MOE (Wu & Frazier, 2016) implements the Knowledge Gradient (KG) acquisition function, which allows for parallel optimization, and includes recent advances such as large-scale models incorporating gradient evaluations (Wu et al., 2017) and multi-fidelity optimization (Wu et al., 2019). Its core is implemented in C++, which provides performance benefits but renders it hard to modify and extend.
RoBO (Klein et al., 2017) implements a collection of models and acquisition functions, including Bayesian neural nets (Springenberg et al., 2016) and multi-fidelity optimization (Klein et al., 2016).
Emukit (The Emukit authors, 2018) is a Bayesian optimization and active learning toolkit with a collection of acquisition functions, including for parallel and multi-fidelity optimization. It does not provide specific abstractions for implementing new algorithms, but rather specifies a model API that allows it to be used with the other toolkit components.
The recent Dragonfly (Kandasamy et al., 2019) library supports parallel optimization, multi-fidelity optimization (Kandasamy et al., 2016), and high-dimensional optimization with additive kernels (Kandasamy et al., 2015). It takes an ensemble approach and aims to work out-of-the-box across a wide range of problems, a design choice that makes it relatively hard to extend.
qMC methods have been used in other applications in machine learning, including variational inference (Buchholz et al., 2018) and evolutionary strategies (Rowland et al., 2018), but rarely in BO. Letham et al. (2019) use qMC in the context of a specific acquisition function. BoTorch’s abstractions make it straightforward (and mostly automatic) to use qMC integration with any acquisition function.
Fixing the base samples introduces a consistent bias into the function approximation. While i.i.d. re-sampling in each iteration ensures that evaluations of the MC estimator at different points are conditionally independent given the model, this no longer holds when the base samples are fixed.
Figure 11 illustrates this behavior for EI (we consider the simple case of q = 1, for which an analytic solution is available). The top row shows the MC and qMC versions, respectively, when re-drawing base samples for every evaluation. The solid lines correspond to a single realization, and the shaded region covers four standard deviations around the mean, estimated across 50 evaluations. It is evident that qMC sampling significantly reduces the variance of the estimate. The bottom row shows the same functions for 10 different realizations of fixed base samples. Each of these realizations is differentiable with respect to the candidate point (and hence in the slice parameterization shown). In expectation (over the base samples), the estimate coincides with the true function (the dashed black line). Conditional on the base sample draw, however, the estimate displays a consistent bias. The variance of this bias is much smaller for the qMC versions.
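The variance reduction from qMC can be reproduced with a tiny stdlib-only experiment. The base-2 van der Corput sequence is the one-dimensional Sobol sequence; the random shift is a standard way to randomize a qMC point set, and the smooth improvement-like integrand below is an arbitrary illustrative choice:

```python
import random

def van_der_corput(n):
    """First n points of the base-2 van der Corput sequence (the 1-d Sobol sequence)."""
    out = []
    for i in range(n):
        x, denom, k = 0.0, 1.0, i
        while k:
            denom *= 2.0
            x += (k & 1) / denom
            k >>= 1
        out.append(x)
    return out

def estimate(f, n, qmc, rng):
    if qmc:  # randomly shifted qMC point set (the shift makes replications differ)
        shift = rng.random()
        us = [(u + shift) % 1.0 for u in van_der_corput(n)]
    else:    # plain i.i.d. Monte Carlo
        us = [rng.random() for _ in range(n)]
    return sum(f(u) for u in us) / n

# Compare the spread of MC vs qMC estimates of a smooth integrand over [0, 1].
f = lambda u: max(u - 0.5, 0.0) ** 2   # smooth, improvement-like integrand
rng = random.Random(0)
reps = 200
mc = [estimate(f, 64, False, rng) for _ in range(reps)]
qmc = [estimate(f, 64, True, rng) for _ in range(reps)]
spread = lambda xs: (sum((x - sum(xs) / len(xs)) ** 2 for x in xs) / len(xs)) ** 0.5
assert spread(qmc) < spread(mc)  # qMC estimates have much lower variance
```

Both estimators are unbiased for the true integral (1/24 here); the qMC replications simply cluster far more tightly around it, mirroring the narrower shaded bands in Figure 11.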
The reason for this is that, while the function values may show noticeable bias, the bias of the maximizer is typically very small. Figure 12 illustrates this behavior, showing empirical CDFs of the relative gap in acquisition value and the distance between the optimizer of the analytic EI and the optimizer of its qMC approximation, over 250 optimization runs for different numbers of samples. The quality of the solution of the deterministic problem is excellent even for relatively small sample sizes, and generally better than that of the stochastic optimizer.
A somewhat subtle point is that whether better optimization of the acquisition function results in improved closed-loop BO performance depends on the acquisition function as well as the underlying problem. More exploitative acquisition functions, such as EI, tend to show worse performance for problems with high noise levels. In these settings, not solving the EI maximization exactly adds randomness and thus induces additional exploration, which can improve closed-loop performance. While a general discussion of this point is outside the scope of this paper, BoTorch does provide a framework for optimizing acquisition functions well, so that these questions can be compartmentalized and acquisition function performance can be investigated independently from the quality of optimizing acquisition functions.
Perhaps the most significant advantage of using deterministic optimization algorithms is that, unlike for algorithms such as SGD that require tuning the learning rate, the optimization procedure is essentially hyperparameter-free. Figure 14 shows the closed-loop optimization performance of qEI for both deterministic and stochastic optimization across different optimizers and learning rates. While some of the stochastic variants (e.g. ADAM with learning rate 0.01) achieve performance similar to the deterministic optimization, the type of optimizer and the learning rate matter. In fact, the rank order of SGD and ADAM with respect to the learning rate is reversed, illustrating that selecting the right hyperparameters for the optimizer is itself a non-trivial problem.
We present results for constrained BO on a synthetic function. We consider a multi-output function f = (f_obj, f_con) and the optimization problem

max_x f_obj(x)  subject to  f_con(x) ≤ 0.   (7)

Both f_obj and f_con are observed with noise, and we model the two components using independent GP models. A constraint-weighted composite objective is used in each of the BoTorch acquisition functions EI, NEI, and OKG.
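A sample-level sketch of such a constraint-weighted composite objective (the sigmoid relaxation of the feasibility indicator and the toy sampling distributions are illustrative assumptions, not the exact BoTorch implementation):

```python
import math, random

def smooth_feasibility(c, tau=0.05):
    """Differentiable relaxation of the feasibility indicator 1[c <= 0]
    applied at the sample level (a sigmoid; tau controls the sharpness)."""
    return 1.0 / (1.0 + math.exp(c / tau))

def constrained_objective(sample):
    """Composite objective for a joint sample (f, c) of the two outputs:
    the objective value weighted by the relaxed feasibility of the constraint."""
    f, c = sample
    return f * smooth_feasibility(c)

rng = random.Random(0)
# Toy MC samples of the two independent outputs at some design point.
samples = [(1.0 + 0.1 * rng.gauss(0, 1), -0.5 + 0.1 * rng.gauss(0, 1))
           for _ in range(256)]
value = sum(constrained_objective(s) for s in samples) / len(samples)
assert 0.9 < value < 1.1            # c around -0.5 is essentially always feasible
assert smooth_feasibility(0.0) == 0.5
```

Because the relaxation is differentiable, the weighted samples can flow through any MC acquisition function unchanged, which is what makes outcome constraints compose with EI, NEI, and OKG.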
Results for the case of a Hartmann6 objective and two types of constraints are given in Figure 18 (we only show results for BoTorch’s algorithms, since the other packages do not natively support optimization subject to unknown constraints).
The regret values are computed using a feasibility-weighted objective, where “infeasible” is assigned an objective value of zero. For random search and EI, the suggested point is taken to be the best feasible noisily observed point; for NEI and OKG, we use out-of-sample suggestions obtained by optimizing the feasibility-weighted version of the posterior mean. The results displayed in Figure 18 are for the constrained Hartmann6 benchmark from Letham et al. (2019). Note, however, that the results here are not directly comparable to the figures in Letham et al. (2019), because (1) we use feasibility-weighted objectives to compute regret, and (2) they follow a different convention for suggested points. We emphasize that support for outcome constraints in KG is, to our knowledge, novel.
For most acquisition functions, the optimization surface is highly non-convex, multi-modal, and (especially for “improvement-based” ones such as EI or KG) often flat (i.e., has zero gradient) in much of the domain. Therefore, optimizing the acquisition function is itself a challenging problem.
The simplest approach is to use zeroth-order optimizers that do not require gradient information, such as DIRECT or CMA-ES (Jones et al., 1993; Hansen & Ostermeier, 2001). These approaches are feasible for lower-dimensional problems, but do not scale to higher dimensions. Note that performing parallel optimization over q candidates in a d-dimensional feature space means solving a qd-dimensional optimization problem.
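The dimension bookkeeping is mechanical but worth making concrete; a minimal sketch (the helper names and toy acquisition are hypothetical, for illustration only):

```python
def flatten(X):
    """A set of q candidates in d dimensions becomes one length-q*d vector,
    so joint (parallel) acquisition optimization is a single qd-dimensional
    problem for a generic continuous optimizer."""
    return [xij for xi in X for xij in xi]

def unflatten(v, q, d):
    return [v[i * d : (i + 1) * d] for i in range(q)]

def acq(X):
    """Toy joint acquisition: per-point quality plus a batch-diversity bonus,
    so the q points genuinely interact and must be optimized jointly."""
    quality = -sum((xij - 0.5) ** 2 for xi in X for xij in xi)
    diversity = sum(abs(a - b) for i, xi in enumerate(X) for xj in X[i + 1:]
                    for a, b in zip(xi, xj))
    return quality + 0.01 * diversity

q, d = 3, 2
X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
v = flatten(X)
assert len(v) == q * d                       # the qd-dimensional search space
assert unflatten(v, q, d) == X               # round trip is lossless
assert acq(unflatten(v, q, d)) == acq(X)     # optimizer sees the same objective
```

An off-the-shelf gradient-based routine then operates on the flat vector and the candidate set is recovered by reshaping, exactly as in the qd-dimensional problem described above.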
A more scalable approach incorporates gradient information into the optimization. As described in Section 7, BoTorch by default uses quasi-second order methods, such as L-BFGS-B. Because of the complex structure of the objective, the initial conditions for the algorithm are extremely important so as to avoid getting stuck in a potentially highly sub-optimal local optimum. To reduce this risk, one typically employs multi-start optimization (i.e. start the solver from multiple initial conditions and pick the best of the final solutions). To generate a good set of initial conditions, BoTorch heavily exploits the fast batch evaluation discussed in the previous section. Specifically, BoTorch by default uses candidates generated using the following heuristic:
1. Sample quasi-random q-tuples of points from the design space using quasi-random Sobol sequences (see Section 4.4).
2. Batch-evaluate the acquisition function at these candidate sets.
3. Sample the initial candidate sets according to a weight vector with weights proportional to exp(η α̃), where α̃ denotes the standardized acquisition value and η is a temperature parameter. (Acquisition functions that are known to be flat in large parts of the domain are handled with additional care in order to avoid starting in locations with zero gradients.)
Sampling initial conditions this way achieves an exploration/exploitation trade-off controlled by the magnitude of η. As η → 0 we recover uniform Sobol sampling, while η → ∞ means the initialization is chosen in a purely greedy fashion. The latter is generally not advisable, since for large η the highest-valued points are likely to all be clustered together, which would run counter to the goal of multi-start optimization. Fast batch evaluation allows evaluating a large number of samples (sample sizes in the tens of thousands are feasible even for moderately sized models).
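The heuristic above can be sketched in a few lines (draw_q_tuple stands in for Sobol sampling of q-tuples, and the toy acquisition and parameter values are illustrative assumptions):

```python
import math, random

def init_restarts(acq, draw_q_tuple, n_raw=500, n_restarts=10, eta=2.0, seed=0):
    """Restart selection heuristic: evaluate the acquisition on many
    quasi-random candidate sets, then sample restart points with probability
    proportional to exp(eta * standardized acquisition value)."""
    rng = random.Random(seed)
    raw = [draw_q_tuple(rng) for _ in range(n_raw)]
    vals = [acq(X) for X in raw]
    mu = sum(vals) / n_raw
    sd = (sum((v - mu) ** 2 for v in vals) / n_raw) ** 0.5 or 1.0
    # eta -> 0 recovers uniform sampling; eta -> inf is purely greedy.
    w = [math.exp(eta * (v - mu) / sd) for v in vals]
    idx = rng.choices(range(n_raw), weights=w, k=n_restarts)
    return [raw[i] for i in idx]

# Toy setting: q = 2 points in 1-d, acquisition favors points near 0.3.
toy_acq = lambda X: -sum((x - 0.3) ** 2 for x in X)
starts = init_restarts(toy_acq, lambda rng: [rng.random(), rng.random()])
assert len(starts) == 10
assert all(0.0 <= x <= 1.0 for X in starts for x in X)
```

The single batched acquisition evaluation over all raw candidate sets is where fast batch evaluation pays off; the softmax-style weighting then trades off exploring the domain against concentrating restarts near promising regions.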
Figure 20 shows results for the case of parallel optimization with a larger batch size. We find results similar to those in Figure 3. While EI-CF performance is similar across batch sizes, KG-CF reaches lower regret significantly faster for the larger batch size, suggesting that “looking ahead” is beneficial in this context.
Figure 21 shows the location of the base grid in comparison to the samples obtained by minimizing IPV.