Visualize optimization algorithms in MATLAB.
Computational models in fields such as computational neuroscience are often evaluated via stochastic simulation or numerical approximation. Fitting these models implies a difficult optimization problem over complex, possibly noisy parameter landscapes. Bayesian optimization (BO) has been successfully applied to solving expensive black-box problems in engineering and machine learning. Here we explore whether BO can be applied as a general tool for model fitting. First, we present a novel hybrid BO algorithm, Bayesian adaptive direct search (BADS), that achieves competitive performance with an affordable computational overhead for the running time of typical models. We then perform an extensive benchmark of BADS vs. many common and state-of-the-art nonconvex, derivative-free optimizers, on a set of model-fitting problems with real data and models from six studies in behavioral, cognitive, and computational neuroscience. With default settings, BADS consistently finds comparable or better solutions than other methods, including `vanilla' BO, showing great promise for advanced BO techniques, and BADS in particular, as a general model-fitting tool.READ FULL TEXT VIEW PDF
There are a large number of optimization problems in physical models whe...
We consider derivative-free black-box global optimization of expensive n...
Game theory finds nowadays a broad range of applications in engineering ...
Hyperparameter optimization and neural architecture search can become
We propose a Bayesian optimization algorithm for objective functions tha...
In this paper, we show how to transform any optimization problem that ar...
Attributed graphs, which contain rich contextual features beyond just ne...
Visualize optimization algorithms in MATLAB.
Many complex, nonlinear computational models in fields such as behaviorial, cognitive, and computational neuroscience cannot be evaluated analytically, but require moderately expensive numerical approximations or simulations. In these cases, finding the maximum-likelihood (ML) solution – for parameter estimation, or model selection – requires the costly exploration of a rough or noisy nonconvex landscape, in which gradients are often unavailable to guide the search.
Here we consider the problem of finding the (global) optimum of a possibly noisy objective over a (bounded) domain , where the function
can be intended as the (negative) log likelihood of a parameter vectorfor a given dataset and model, but is generally a black box. With many derivative-free optimization algorithms available to the researcher rios2013derivative , it is unclear which one should be chosen. Crucially, an inadequate optimizer can hinder progress, limit the complexity of the models that can be fit, and even cast doubt on the reliability of one’s findings.
Bayesian optimization (BO) is a state-of-the-art machine learning framework for optimizing expensive and possibly noisy black-box functions jones1998efficient ; brochu2010tutorial ; shahriari2016taking . This makes it an ideal candidate for solving difficult model-fitting problems. Yet there are several obstacles to a widespread usage of BO as a general tool for model fitting. First, traditional BO methods target very
costly problems, such as hyperparameter tuningsnoek2012practical , whereas evaluating a typical behavioral model might only have a moderate computational cost (e.g., 0.1-10 s per evaluation). This implies major differences in what is considered an acceptable algorithmic overhead, and in the maximum number of allowed function evaluations (e.g., hundreds vs. thousands). Second, it is unclear how BO methods would fare in this regime against commonly used and state-of-the-art, non-Bayesian optimizers. Finally, BO might be perceived by non-practitioners as an advanced tool that requires specific technical knowledge to be implemented or tuned.
We address these issues by developing a novel hybrid BO algorithm, Bayesian Adaptive Direct Search (BADS), that achieves competitive performance at a small computational cost. We tested BADS, together with a wide array of commonly used optimizers, on a novel benchmark set of model-fitting problems with real data and models drawn from studies in cognitive, behaviorial and computational neuroscience. Finally, we make BADS available as a free MATLAB package with the same user interface as existing optimizers and that can be used out-of-the-box with no tuning.111Code available at https://github.com/lacerbi/bads.
BADS is a hybrid BO method in that it combines the mesh adaptive direct search (MADS) framework audet2006mesh (Section 2.1) with a BO search performed via a local Gaussian process (GP) surrogate (Section 2.2
), implemented via a number of heuristics for efficiency (Section3). BADS proves to be highly competitive on both artificial functions and real-world model-fitting problems (Section 4), showing promise as a general tool for model fitting in computational neuroscience and related fields.
There is a large literature about (Bayesian) optimization of expensive, possibly stochastic, computer simulations, mostly used in machine learning brochu2010tutorial ; shahriari2016taking ; snoek2012practical or engineering (known as kriging-based optimization) taddy2009bayesian ; picheny2014noisy ; gramacy2015mesh . Recent work has combined MADS with treed GP models for constrained optimization (TGP-MADS gramacy2015mesh
). Crucially, these methods have large overheads and may require problem-specific tuning, making them impractical as a generic tool for model fitting. Cheaper but less precise surrogate models than GPs have been proposed, such as random forestshutter2011sequential , Parzen estimators bergstra2011algorithms , and dynamic trees TaLeDKo2014 . In this paper, we focus on BO based on traditional GP surrogates, leaving the analysis of alternative models for future work (see Conclusions).
The MADS algorithm is a directional direct search framework for nonlinear optimization audet2006mesh ; audet2008erratum . Briefly, MADS seeks to improve the current solution by testing points in the neighborhood of the current point (the incumbent), by moving one step in each direction on an iteration-dependent mesh. In addition, the MADS framework can incorporate in the optimization any arbitrary search strategy which proposes additional test points that lie on the mesh.
MADS defines the current mesh at the -th iteration as , where is the set of all points evaluated since the start of the iteration, is the mesh size, and D is a fixed matrix in whose columns represent viable search directions. We choose , where
is the identity matrix in dimension.
Each iteration of MADS comprises of two stages, a search stage and an optional poll stage. The search stage evaluates a finite number of points proposed by a provided search strategy, with the only restriction that the tested points lie on the current mesh. The search strategy is intended to inject problem-specific information in the optimization. In BADS, we exploit the freedom of search to perform Bayesian optimization in the neighborhood of the incumbent (see Section 2.2 and 3.3). The poll stage is performed if the search fails in finding a point with an improved objective value. poll constructs a poll set of candidate points, , defined as where is the incumbent and is the set of polling directions constructed by taking discrete linear combinations of the set of directions D. The poll size parameter defines the maximum length of poll displacement vectors , for (typically, ). Points in the poll set can be evaluated in any order, and the poll is opportunistic in that it can be stopped as soon as a better solution is found. The poll stage ensures theoretical convergence to a local stationary point according to Clarke calculus for nonsmooth functions audet2006mesh ; clarke1983optimization .
If either search or poll are a success, finding a mesh point with an improved objective value, the incumbent is updated and the mesh size remains the same or is multiplied by a factor . If neither search or poll are successful, the incumbent does not move and the mesh size is divided by . The algorithm proceeds until a stopping criterion is met (e.g., maximum budget of function evaluations).
The typical form of Bayesian optimization (BO) jones1998efficient builds a Gaussian process (GP) approximation of the objective , which is used as a relatively inexpensive surrogate to guide the search towards regions that are promising (low GP mean) and/or unknown (high GP uncertainty), according to a rule, the acquisition function, that formalizes the exploitation-exploration trade-off.
GPs are a flexible class of models for specifying prior distributions over unknown functions rasmussen2006gaussian . GPs are specified by a mean function and a positive definite covariance, or kernel function . Given any finite collection of points , the value of at these points is assumed to be jointly Gaussian with mean and covariance matrix K, where for . We assume i.i.d. Gaussian observation noise such that evaluated at returns , and is the vector of observed values. For a deterministic , we still assume a small to improve numerical stability of the GP gramacy2012cases . Conveniently, observation of such (noisy) function values will produce a GP posterior whose latent marginal conditional mean
and varianceat a given point are available in closed form (see Supplementary Material), where is a hyperparameter vector for the mean, covariance, and likelihood. In the following, we omit the dependency of and from the data and GP parameters to reduce clutter.
Our main choice of stationary (translationally-invariant) covariance function is the automatic relevance determination (ARD) rational quadratic (RQ) kernel,
where is the signal variance, are the kernel length scales along each coordinate direction, and is the shape parameter. More common choices for Bayesian optimization include the squared exponential (SE) kernel gramacy2015mesh or the twice-differentiable ARD Matérn 5/2 (M) kernel snoek2012practical , but we found the RQ kernel to work best in combination with our method (see Section 4.2). We also consider composite periodic kernels for circular or periodic variables (see Supplementary Material).
For a given GP approximation of , the acquisition function, , determines which point in should be evaluated next via a proxy optimization . We consider here the GP lower confidence bound (LCB) metric srinivas2010gaussian ,
where is a tunable parameter, is the number of function evaluations so far, is a probabilistic tolerance, and is a learning rate chosen to minimize cumulative regret under certain assumptions. For BADS we use the recommended values and srinivas2010gaussian . Another popular choice is the (negative) expected improvement (EI) over the current best function value mockus1978application , and an historical, less used metric is the (negative) probability of improvement (PI) kushner1964new .
We describe here the main steps of BADS (Algorithm 1). Briefly, BADS alternates between a series of fast, local BO steps (the search stage of MADS) and a systematic, slower exploration of the mesh grid (poll stage). The two stages complement each other, in that the search can explore the space very effectively, provided an adequate surrogate model. When the search repeatedly fails, meaning that the GP model is not helping the optimization (e.g., due to a misspecified model, or excess uncertainty), BADS switches to poll. The poll stage performs a fail-safe, model-free optimization, during which BADS gathers information about the local shape of the objective function, so as to build a better surrogate for the next search. This alternation makes BADS able to deal effectively and robustly with a variety of problems. See Supplementary Material for a full description.
The algorithm is initialized by providing a starting point , vectors of hard lower/upper bounds LB, UB, and optional vectors of plausible lower/upper bounds PLB, PUB, with the requirement that for each dimension , .222A variable can be fixed by setting . Fixed variables become constants, and BADS runs on an optimization problem with reduced dimensionality. Plausible bounds identify a region in parameter space where most solutions are expected to lie. Hard upper/lower bounds can be infinite, but plausible bounds need to be finite. Problem variables whose hard bounds are strictly positive and are automatically converted to log space. All variables are then linearly rescaled to the standardized box such that the box bounds correspond to in the original space. BADS supports bound or no constraints, and optionally other constraints via a provided barrier function (see Supplementary Material). The user can also specify circular or periodic dimensions (such as angles); and whether the objective is deterministic or noisy (stochastic), and in the latter case provide a coarse estimate of the noise (see Section 3.4).
The initial design consists of the provided starting point and additional points chosen via a space-filling quasi-random Sobol sequence bratley1988algorithm in the standardized box, and forced to lie on the mesh grid. If the user does not specify whether is deterministic or stochastic, the algorithm assesses it by performing two consecutive evaluations at .
The default GP has hyperparameters . We impose an empirical Bayes prior on the GP hyperparameters based on the current training set (see Supplementary Material), and select via maximum a posteriori (MAP) estimation. We fit via a gradient-based nonlinear optimizer, starting from either the previous value of or a weighted draw from the prior, as a means to escape local optima. We refit the hyperparameters every to function evaluations; more often earlier in the optimization, and whenever the current GP is particularly inaccurate at predicting new points, according to a normality test on the residuals, (assumed independent, in first approximation).
The GP training set X consists of a subset of the points evaluated so far (the cache), selected to build a local approximation of the objective in the neighborhood of the incumbent , constructed as follows. Each time X is rebuilt, points in the cache are sorted by their -scaled distance (Eq. 1) from . First, the closest points are automatically added to X. Second, up to additional points with are included in the set, where is a radius function that depends on the decay of the kernel. For the RQ kernel, (see Supplementary Material). Newly evaluated points are added incrementally to the set, using fast rank-one updates of the GP posterior. The training set is rebuilt any time the incumbent is moved.
We initialize and (in standardized space), such that the initial poll steps can span the plausible region, whereas the mesh grid is relatively fine. We use , and increase the mesh size only after a successful poll. We skip the poll after a successful search.
We apply an aggressive, repeated search strategy that consists of up to unsuccessful search steps. In each step, we use a search oracle, based on a local BO with the current GP, to produce a search point (see below). We evaluate and add it to the training set. If the improvement in objective value is none or insufficient, that is less than , we continue searching, or switch to poll after steps. Otherwise, we call it a success and start a new search from scratch, centered on the updated incumbent.
We choose via a fast, approximate optimization inspired by CMA-ES hansen2003reducing . We sample batches of points in the neighborhood of the incumbent , drawn , where is the current search focus, a search covariance matrix, and a scaling factor, and we pick the point that optimizes the acquisition function (see Supplementary Material). We remove from the search set candidate points that violate non-bound constraints (), and we project candidate points that fall outside hard bounds to the closest mesh point inside the bounds. Across search steps, we use both a diagonal matrix with diagonal , and a matrix proportional to the weighted covariance matrix of points in X (each point weighted according to a function of its ranking in terms of objective values ). We choose between and probabilistically via a hedge strategy, based on their track record of cumulative improvement hoffman2011portfolio .
We incorporate the GP approximation in the poll in two ways: when constructing the set of polling directions , and when choosing the polling order. We generate according to the random LTMADS algorithm audet2006mesh , but then rescale each vector coordinate proportionally to the GP length scale (see Supplementary Material). We discard poll vectors that do not satisfy the given bound or nonbound constraints. Second, since the poll is opportunistic, we evaluate points in the poll set according to the ranking given by the acquisition function gramacy2015mesh .
We stop the optimization when the poll size goes below a threshold (default ); when reaching a maximum number of objective evaluations (default ); or if there is no significant improvement of the objective for more than iterations. The algorithm returns the optimum (transformed back to original coordinates) with the lowest objective value .
In case of a noisy objective, we assume for the noise a hyperprior, with a base noise magnitude (default , but the user can provide an estimate). To account for additional uncertainty, we also make the following changes: double the minimum number of points added to the training set, , and increase the maximum number to 200; increase the initial design to ; and double the number of allowed stalled iterations before stopping.
Due to noise, we cannot simply use the output values as ground truth in the search and poll stages. Instead, we replace
with the GP latent quantile functionpicheny2013quantile
where is the quantile function of the standard normal (plugin approach picheny2013benchmark ). Moreover, we modify the MADS procedure by keeping an incumbent set , where is the incumbent at the end of the -th iteration. At the end of each poll we re-evaluate for all elements of the incumbent set, in light of the new points added to the cache. We select as current (active) incumbent the point with lowest . During optimization we set (mean prediction only), which promotes exploration. We use a conservative for the last iteration, to select the optimum returned by the algorithm in a robust manner. Instead of , we return either
or an unbiased estimate ofobtained by averaging multiple evaluations (see Supplementary Material).
We tested BADS and many optimizers with implementation available in MATLAB (R2015b, R2017a) on a large set of artificial and real optimization problems (see Supplementary Material for details).
Besides BADS, we tested 16 optimization algorithms, including popular choices such as Nelder-Mead (fminsearch lagarias1998convergence ), several constrained nonlinear optimizers in the fmincon function (default interior-point waltz2006interior , sequential quadratic programming sqp wright1999numerical , and active-set actset gill1981practical
), genetic algorithms (ga goldberg1989genetic ), random search (randsearch) as a baseline bergstra2012random ; and also less-known state-of-the-art methods for nonconvex derivative-free optimization rios2013derivative , such as Multilevel Coordinate Search (MCS huyer1999global ) and CMA-ES hansen2003reducing ; jastrebski2006improving (cmaes, in different flavors). For noisy objectives, we included algorithms that explicitly handle uncertainty, such as snobfit csendes2008global and noisy CMA-ES hansen2009method . Finally, to verify the advantage of BADS’ hybrid approach to BO, we also tested a standard, ‘vanilla’ version of BO snoek2012practical (bayesopt, R2017a) on the set of real model-fitting problems (see below). For all algorithms, including BADS, we used default settings (no fine-tuning).
First, we considered a standard benchmark set of artificial, noiseless functions (bbob09 hansen2009real , 24 functions) in dimensions , for a total of test functions. We also created ‘noisy’ versions of the same set. Second, we collected model-fitting problems from six published or ongoing studies in cognitive and computational neuroscience (ccn17). The objectives of the ccn17 set are negative log likelihood functions of an input parameter vector, for specified datasets and models, and can be deterministic or stochastic. For each study in the ccn17
set we asked its authors for six different real datasets (i.e., subjects or neurons), divided between one or two main models of interest; collecting a total of 36 test functions with.
We ran 50 independent runs of each algorithm on each test function, with randomized starting points and a budget of function evaluations ( for noisy problems). If an algorithm terminated before depleting the budget, it was restarted from a new random point. We consider a run successful if the current best (or returned, for noisy problems) function value is within a given error tolerance from the true optimum (or our best estimate thereof).333Note that the error tolerance is not a fractional error, as sometimes reported in optimization, because for model comparison we typically care about (absolute) differences in log likelihoods. For noiseless problems, we compute the fraction of successful runs as a function of number of objective evaluations, averaged over datasets/functions and over (log spaced). This is a realistic range for , as differences in log likelihood below 0.01 are irrelevant for model selection; an acceptable tolerance is (a difference in deviance, the metric used for AIC or BIC, less than 1); larger associate with coarse solutions, but errors larger than 10 would induce excessive biases in model selection. For noisy problems, what matters most is the solution that the algorithm actually returns, which, depending on the algorithm, may not necessarily be the point with the lowest observed function value. Since, unlike the noiseless case, we generally do not know the solutions that would be returned by any algorithm at every time step, but only at the last step, we plot instead the fraction of successful runs at function evaluations as a function of , for
(noise makes higher precisions moot), and averaged over datasets/functions. In all plots we omit error bars for clarity (standard errors would be about the size of the line markers or less).
The bbob09 noiseless set hansen2009real comprises of 24 functions divided in 5 groups with different properties: separable; low or moderate conditioning; unimodal with high conditioning; multi-modal with adequate / with weak global structure. First, we use this benchmark to show the performance of different configurations for BADS. Note that we selected the default configuration (RQ kernel, ) and other algorithmic details by testing on a different benchmark set (see Supplementary Material). Fig 1 (left) shows aggregate results across all noiseless functions with , for alternative choices of kernels and acquisition functions (only a subset is shown, such as the popular M, EI combination), or by altering other features (such as setting , or fixing the search covariance matrix to or ). Almost all changes from the default configuration worsen performance.
We then compared BADS to other algorithms (Fig 1 middle). Depending on the number of function evaluations, the best optimizers are BADS, methods of the fmincon family, and, for large budget of function evaluations, CMA-ES with active update of the covariance matrix.
We produce noisy versions of the bbob09 set by adding i.i.d. Gaussian observation noise at each function evaluation, , with . We consider a variant with moderate homoskedastic (constant) noise (), and a variant with heteroskedastic noise with , which follows the observation that variability generally increases for solutions away from the optimum. For many functions in the bbob09 set, this heteroskedastic noise can become substantial () away from the optimum. Fig 1 (right) shows aggregate results for the heteroskedastic set (homoskedastic results are similar). BADS outperforms all other optimizers, with CMA-ES (active, with or without the noisy option) coming second.
Notably, BADS performs well even on problems with non-stationary (location-dependent) features, such as heteroskedastic noise, thanks to its local GP approximation.
The algorithmic cost of BADS is s to s per function evaluation, depending on , mostly due to the refitting of the GP hyperparameters. This produces a non-negligible overhead, defined as (total optimization time / total function time ). For a fair comparison with other methods with little or no overhead, for deterministic problems we also plot the effective performance of BADS by accounting for the extra cost per function evaluation. In practice, this correction shifts rightward the performance curve of BADS in log-iteration space, since each function evaluation with BADS has an increased fractional time cost. For stochastic problems, we cannot compute effective performance as easily, but there we found small overheads (), due to more costly evaluations (more than 1 s).
For a direct comparison with standard BO, we also tested on the ccn17 set a ‘vanilla’ BO algorithm, as implemented in MATLAB R2017a (bayesopt). This implementation closely follows snoek2012practical , with optimization instead of marginalization over GP hyperparameters. Due to the fast-growing cost of BO as a function of training set size, we allowed up to 300 training points for the GP, restarting the BO algorithm from scratch with a different initial design every 300 BO iterations (until the total budget of function evaluations was exhausted). The choice of 300 iterations already produced a large average algorithmic overhead of s per function evaluation. In showing the results of bayesopt, we display raw performance without penalizing for the overhead.
Causal inference (CI) in perception is the process whereby the brain decides whether to integrate or segregate multisensory cues that could arise from the same or from different sources kording2007causal . This study investigates CI in visuo-vestibular heading perception across tasks and under different levels of visual reliability, via a factorial model comparison acerbi2017bayesian . For our benchmark we fit three subjects with a Bayesian CI model (), and another three with a fixed-criterion CI model () that disregards visual reliability. Both models include heading-dependent likelihoods and marginalization of the decision variable over the latent space of noisy sensory measurements , solved via nested numerical integration in 1-D and 2-D.
This study investigates the Bayesian confidence hypothesis
that subjective judgments of confidence are directly related to the posterior probability the observer assigns to a learnt perceptual categoryadler2017human (e.g., whether the orientation of a drifting Gabor patch belongs to a ‘narrow’ or to a ‘wide’ category). For our benchmark we fit six subjects to the ‘Ultrastrong’ Bayesian confidence model (), which uses the same mapping between posterior probability and confidence across two tasks with different distributions of stimuli. This model includes a latent noisy decision variable, marginalized over via 1-D numerical integration.
The authors of this study explore the origins of diversity of neuronal orientation selectivity in visual cortex via novel stimuli (orientation mixtures) and modeling goris2015origin . We fit the responses of five V1 and one V2 cells with the authors’ neuronal model () that combines effects of filtering, suppression, and response nonlinearity goris2015origin . The model has one circular parameter, the preferred direction of motion of the neuron. The model is analytical but still computationally expensive due to large datasets and a cascade of several nonlinear operations.
This study models a word recognition task in which subjects rated their confidence that a presented word was in a previously studied list van2017fechner (data from mickes2007direct ). We consider six subjects divided between two normative models, the ‘Retrieving Effectively from Memory’ model shiffrin1997model () and a similar, novel model444Unpublished; upcoming work from Aspen H. Yoo and Wei Ji Ma. (). Both models use Monte Carlo methods to draw random samples from a large space of latent noisy memories, yielding a stochastic log likelihood.
This study looks at differences in observers’ decision making strategies in target detection (‘was the target present?’) and localization (‘which one was the target?’) with displays of or oriented Gabor patches.555Unpublished; upcoming work from Andra Mihali and Wei Ji Ma. Here we fit six subjects with a previously derived ideal observer model ma2011behavior ; mazyar2012does () with variable-precision noise van2012variability , assuming shared parameters between detection and localization. The log likelihood is evaluated via simulation due to marginalization over latent noisy measurements of stimuli orientations with variable precision.
This study analyzes people’s strategies in a four-in-a-row game played on a 4-by-9 board against human opponents (van2016people , Experiment 1). We fit the data of six players with the main model (
), which is based on a Best-First exploration of a decision tree guided by a feature-based value heuristic. The model also includes feature dropping, value noise, and lapses, to better capture human variability. Model evaluation is computationally expensive due to the construction and evaluation of trees of future board states, and achieved viainverse binomial sampling, an unbiased stochastic estimator of the log likelihood van2016people . Due to prohibitive computational costs, here we only test major algorithms (MCS is the method used in the paper van2016people ); see Fig 3 right.
In all problems, BADS consistently performs on par with or outperforms all other tested optimizers, even when accounting for its extra algorithmic cost. The second best algorithm is either some flavor of CMA-ES or, for some deterministic problems, a member of the fmincon family. Crucially, their ranking across problems is inconsistent, with both CMA-ES and fmincon performing occasionally quite poorly (e.g., fmincon does poorly in the causal inference set because of small fluctuations in the log likelihood landscape caused by coarse numerical integration). Interestingly, vanilla BO (bayesopt) performs poorly on all problems, often at the level of random search, and always substantially worse than BADS, even without accounting for the much larger overhead of bayesopt. The solutions found by bayesopt are often hundreds (even thousands) points of log likelihood from the optimum. This failure is possibly due to the difficulty of building a global GP surrogate for BO, coupled with strong non-stationarity of the log likelihood functions; and might be ameliorated by more complex forms of BO (e.g., input warping to produce nonstationary kernels snoek2014input , hyperparameter marginalization snoek2012practical ). However, these advanced approaches would substantially increase the already large overhead. Importantly, we expect this poor perfomance to extend to any package which implements vanilla BO (such as BayesOpt martinez2014bayesopt ), regardless of the efficiency of implementation.
We have developed a novel BO method and an associated toolbox, BADS, with the goal of fitting moderately expensive computational models out-of-the-box. We have shown on real model-fitting problems that BADS outperforms widely used and state-of-the-art methods for nonconvex, derivative-free optimization, including ‘vanilla’ BO. We attribute the robust performance of BADS to the alternation between the aggressive search strategy, based on local BO, and the failsafe poll stage, which protects against failures of the GP surrogate – whereas vanilla BO does not have such failsafe mechanisms, and can be strongly affected by model misspecification. Our results demonstrate that a hybrid Bayesian approach to optimization can be beneficial beyond the domain of very costly black-box functions, in line with recent advancements in probabilistic numerics hennig2015probabilistic .
Like other surrogate-based methods, the performance of BADS is linked to its ability to obtain a fast approximation of the objective, which generally deteriorates in high dimensions, or for functions with pathological structure (often improvable via reparameterization). From our tests, we recommend BADS, paired with some multi-start optimization strategy, for models with up to variables, a noisy or jagged log likelihood landscape, and when algorithmic overhead is (e.g., model evaluation s). Future work with BADS will focus on testing alternative statistical surrogates instead of GPs TaLeDKo2014 ; combining it with a smart multi-start method for global optimization; providing support for tunable precision of noisy observations picheny2013quantile ; improving the numerical implementation; and recasting some of its heuristics in terms of approximate inference.
We thank Will Adler, Robbe Goris, Andra Mihali, Bas van Opheusden, and Aspen Yoo for sharing data and model evaluation code that we used in the ccn17 benchmark set; Maija Honig, Andra Mihali, Bas van Opheusden, and Aspen Yoo for providing user feedback on earlier versions of the bads package for MATLAB; Will Adler, Andra Mihali, Bas van Opheusden, and Aspen Yoo for helpful feedback on a previous version of this manuscript; John Wixted and colleagues for allowing us to reuse their data for the ccn17 ‘word recognition memory’ problem set; and three anonymous reviewers for useful feedback. This work has utilized the NYU IT High Performance Computing resources and services.
Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligencepp. 327–336.
In this section, we describe definitions and additional specifications of the Gaussian process (GP) model used for Bayesian optimization (BO) in BADS. Specifically, this part expands on Sections 2.2 and 3.2 in the main text.
We consider a GP based on a training set X with points, a vector of observed function values , and GP mean function and GP covariance or kernel function , with i.i.d. Gaussian observation noise . The GP posterior latent marginal conditional mean and variance are available in closed form at a chosen point as
where , for , is the kernel matrix, is the -dimensional column vector of cross-covariances, and is the vector of GP hyperparameters.
Besides the automatic relevance determination (ARD) rational quadratic (RQ) kernel described in the main text (and BADS default), we also considered the common squared exponential (SE) kernel
and the ARD Matérn 5/2 kernel snoek2012practical ,
where is the signal variance, and are the kernel length scales along each coordinate. Note that the RQ kernel tends to the SE kernel for .
The Matérn 5/2 kernel has become a more common choice for Bayesian global optimization because it is only twice-differentiable snoek2012practical
, whereas the SE and RQ kernels are infinitely differentiable – a stronger assumption of smoothness which may cause extrapolation issues. However, this is less of a problem for a local interpolating approximation (as in BADS) than it is for a global approach, and in fact we find the RQ kernel to work well empirically (see main text).
We allow the user to specify one or more periodic (equivalently, circular) coordinate dimensions , which is a feature of some models in computational neuroscience (e.g., the preferred orientation of a neuron, as in the ‘neuronal selectivity’ problem set goris2015origin of the ccn17 benchmark; see Section 4.3 in the main text). For a chosen base stationary covariance function (e.g., RQ, SE, M), we define the composite ARD periodic kernel as
for , where is the period in the -th coordinate dimension, and the length scale of is shared between pairs when . In BADS, the period is determined by the provided hard bounds as (where the hard bounds are required to be finite).
We construct the training set X according to a simple subset-of-data quinonero2007approximation local GP approximation. Points are added to the training set sorted by their -scaled distance from the incumbent . The training set contains a minimum of points (if available in the cache of all points evaluated so far), and then up to additional points with , where is a radius function that depends on the decay of the kernel. For a given stationary kernel of the form , we define as the distance such that . We have then
where for example , and .
We fit the GP hyperparameters by maximizing their posterior probability (MAP), , which, thanks to the Gaussian likelihood, is available in closed form as rasmussen2006gaussian
where is the identity matrix in dimension (the number of points in the training set), and is the prior over hyperparameters, described in the following.
We adopt an approximate empirical Bayes approach by defining the prior based on the data in the training set, that is . Empirical Bayes can be intended as a quick, heuristic approximation to a proper but more expensive hierarchical Bayesian approach. We assume independent priors for each hyperparameter, with bounded (truncated) distributions. Hyperparameter priors and hard bounds are reported in Table S1. In BADS, we include an observation noise parameter also for deterministic objectives , merely for the purpose of fitting the GP, since it has been shown to yield several advantages gramacy2012cases . In particular, we assume a prior such that decreases as a function of the poll size , as the optimization ‘zooms in’ to smaller scales. Another distinctive choice for BADS is that we set the mean for the GP mean equal to the 90-th percentile of the observed values in the current training set , which encourages the exploration to remain local.
|RQ kernel shape|
|GP observation noise|
|noisy||1 (or user-provided estimate)|
denotes the standard deviation of a set of elements;is the poll size parameter at the current iteration ; Q denotes the -th quantile of a set of elements (Q is the median).
We optimize Eq. S6 with a gradient-based optimizer (see Section D), providing the analytical gradient to the algorithm. We start the optimization from the previous hyparameter values . If the optimization seems to be stuck in a high-noise mode, or we find an unusually low value for the GP mean , we attempt a second fit starting from a draw from the prior averaged with . If the optimization fails due to numerical issues, we keep the previous value of the hyperparameters. We refit the hyperparameters every to function evaluations; more often earlier in the optimization, and whenever the current GP is particularly inaccurate at predicting new points. We test accuracy on newly evaluated points via a Shapiro-Wilk normality test on the residuals royston1982extension , (assumed independent, in first approximation), and flag the approximation as inaccurate if .
Besides the GP lower confidence bound (LCB) metric srinivas2010gaussian described in the main text (and default in BADS), we consider two other choices that are available in closed form using Eq. S1 for the GP predictive mean and variance.
This strategy maximizes the probability of improving over the current best minimum kushner1964new . For consistency with the main text, we define here the negative PI,
where is an optional trade-off parameter to promote exploration, and
is the cumulative distribution function of the standard normal.is known to excessively favor exploitation over exploration, and it is difficult to find a correct setting for to offset this tendency lizotte2008practical .
We then consider the popular predicted improvement criterion snoek2012practical ; mockus1978application ; jones1998efficient . The expected improvement over the current best minimum (with an offset ) is defined as . For consistency with the main text we consider the negative EI, which can be computed in closed form as
where is the standard normal pdf.
We report here extended details of the BADS algorithm, and how the various steps of the MADS framework are implemented (expanding on Sections 3.1 and 3.3 of the main text). Main features of the algorithm are summarized in Table S2. Refer also to Algorithm 1 in the main text.
|GP training set size||70 (), 250 () (min 200 for noisy problems)|
|poll directions generation||LTMADS with GP rescaling|
|search set generation||Two-step ES algorithm with search matrix|
|search evals. ()|
|Supported constraints||None, bound, and non-bound via a barrier function|
|Initial mesh size|
BADS solves the optimization problem
where is defined by pairs of hard bound constraints for each coordinate, for , and we allow and similarly . We also consider optional non-bound constraints specified via a barrier function that returns constraint violations. We only consider solutions such that is zero or less.
The algorithm takes as input a starting point ; vectors of hard lower/upper bounds LB, UB; optional vectors of plausible lower/upper bounds PLB, PUB; and an optional barrier function . We require that, if specified, ; and for each dimension , and . Plausible bounds identify a region in parameter space where most solutions are expected to lie, which in practice we usually think of as the region where starting points for the algorithm would be drawn from. Hard upper/lower bounds can be infinite, but plausible bounds need to be finite. As an exception to the above bound ordering, the user can specify that a variable is fixed by setting . Fixed variables become constants, and BADS runs on an optimization problem with reduced dimensionality. The user can also specify circular or periodic dimensions (such as angles), which change the definition of the GP kernel as per Section A.1. The user can specify whether the objective is deterministic or noisy (stochastic), and in the latter case provide a coarse estimate of the noise (see Section B.5).
Problem variables whose hard bounds are strictly positive and are automatically converted to log space for all internal calculations of the algorithm. All variables are also linearly rescaled to the standardized box such that the box bounds correspond to in the original space. BADS converts points back to the original coordinate space when calling the target function or the barrier function , and at the end of the optimization. BADS never violates constraints, by removing from the poll and search sets points that violate either bound or non-bound constraints (). During the search stage, we project candidate points that violate a bound constraint to the closest mesh point within the bounds. We assume that , if provided, is known and inexpensive to evaluate.
We assume that the scale of interest for differences in the objective (and the scale of other features, such as noise in the proximity of the solution) is of order , and that differences in the objective less than are negligible. For this reason, BADS is not invariant to arbitrary rescalings of the objective . This assumption does not limit the actual values taken by the objective across the optimization. If the objective is the log likelihood of a dataset and model (e.g., summed over trials), these assumptions are generally satisfied. They would not be if, for example, one were to feed to BADS the average log likelihood per trial, instead of the total (summed) log likelihood. In cases in which has an unusual scale, we recommend to rescale the objective such that the magnitude of differences of interest becomes of order .
We initialize and (in standardized space). The initial design comprises of the provided starting point and additional points chosen via a low-discrepancy Sobol quasirandom sequence bratley1988algorithm in the standardized box, and forced to be on the mesh grid. If the user does not specify whether is deterministic or stochastic, the algorithm assesses it by performing two consecutive evaluations at . For all practical purposes, a function is deemed noisy if the two evaluations at differ more than .111Since this simple test might fail, users are encouraged to actively specify whether the function is noisy.
In BADS we perform an aggressive search stage in which, in practice, we keep evaluating candidate points until we fail for consecutive steps to find a sufficient improvement in function value, with ; and only then we switch to the poll stage. At any iteration , we define an improvement sufficient if , where is the poll size.
In each search step we choose the final candidate point to evaluate, , by performing a fast, approximate optimization of the chosen acquisition function in the neighborhood of the incumbent , using a two-step evolutionary heuristic inspired by CMA-ES hansen2003reducing . This local search is governed by a search covariance matrix , and works as follows.
We draw a first generation of candidates for , where we project each point onto the closest mesh point (see Section 2.1 in the main text); is a search covariance matrix with unit trace,222Unit trace (sum of diagonal entries) for implies that a draw has unit expected squared length. and by default. For each candidate point, we assign a number of offsprings inversely proportionally to the square root of its ranking according to , for a total of offsprings hansen2003reducing . We then draw a second generation and project it onto the mesh grid, where is the index of the parent of the -th candidate in the 2nd generation, and is a zooming factor (we choose ). Finally, we pick . At each step, we remove candidate points that violate non-bound constraints (), and we project candidate points that fall outside hard bounds to the closest mesh point inside the bounds.
The search covariance matrix can be constructed in several ways. Across search steps we use both a diagonal matrix with diagonal , and a matrix proportional to the weighted covariance matrix of points in X (each point weighted according to a function of its ranking in terms of objective values , see hansen2003reducing ). At each step, we compute the probability of choosing , with , according to a hedging strategy taken from the Exp3 Hedge algorithm hoffman2011portfolio ,
where , , is the number of considered search matrices, and is a running estimate of the reward for option . The running estimate is updated each search step as
where is a decay factor, and is the improvement in objective of the -th strategy (0 if was not chosen in the current search step). This method allows us to switch between searching along coordinate axes (), and following an approximation of the local curvature around the incumbent (), according to their track record of cumulative improvement.
We perform the poll stage only after a search stage that did not produce a sufficient improvement after steps. We incorporate the GP approximation in the poll in two ways: when constructing the set of polling directions , and when choosing the polling order.
At the beginning of the poll stage, we generate a preliminary set of directions according to the random LTMADS algorithm audet2006mesh . We then transform it to a rescaled set based on the current GP kernel length scales: for , we define a rescaled vector with , for , and , where
denotes the geometric mean, and we use(resp. ) whenever (resp. ) is unbounded. This construction of deviates from the standard MADS framework. However, since the applied rescaling is bounded, we could redefine the mesh parameters and the set of polling directions to accomodate our procedure (as long as we appropriately discretize ). We remove from the poll set points that violate constraints, if present.
Since the poll is opportunistic, we evaluate points in the poll set starting from most promising, according to the ranking given by the chosen acquisition function gramacy2015mesh .
If the search stage was successful in finding a sufficient improvement, we skip the poll, move the incumbent and start a new iteration, without changing the mesh size (note that mesh expansion under a success is not required in the MADS framework audet2006mesh ). If the poll stage was executed, we verify if overall the iteration was successful or not, update the incumbent in case of success, and double (halven, in case of failure) the mesh size (). If the optimization has been stalling (no sufficient improvement) for more than three iterations, we accelerate the mesh contraction by temporarily switching to .
The optimization stops when one of these conditions is met:
the poll size goes below a threshold (default );
the maximum number of objective evaluations is reached (default );
the algorithm is stalling, that is there has no sufficient improvement of the objective , for more than iterations.
The algorithm returns the optimum (transformed back to original coordinates) that has the lowest objective value . For a noisy objective, we return instead the stored point with the lowest quantile across iterations, with ; see Section 3.4 in the main text. We also return the function value at the optimum, , or, for a noisy objective, our estimate thereof (see below, Section B.5). See the online documentation for more information about the returned outputs.
For noisy objectives, we change the behavior and default parameters of the algorithm to offset measurement uncertainty and allow for an accurate local approximation of . First, we:
double the minimum number of points added to the GP training set, ;
increase the total number of points (within radius ) to at least 200, regardless of ;
increase the initial design set size to points;
double the number of allowed stalled iterations before stopping.
The main difference with a deterministic objective is that, due to observation noise, we cannot simply use the output values as ground truth in the search and poll stages. Instead, we adopt a plugin approach picheny2013benchmark and replace with the GP latent quantile function picheny2013quantile (see Eq. 3 in the main text). Moreover, we modify the MADS procedure by keeping an incumbent set , where is the incumbent at the end of the -th iteration. At the end of each poll stage, we re-evaluate for all elements of the incumbent set, in light of the new points added to the cache which might change the GP prediction. We select as current (active) incumbent the point with lowest . During optimization, we set (mean prediction only), which promotes exploration. For the last iteration, we instead use a conservative to select the optimum returned by the algorithm in a robust manner. For a noisy objective, instead of the noisy measurement , we return either our best GP prediction and its uncertainty , or, more conservatively, an estimate of and its standard error, obtained by averaging function evaluations at (default ). The latter approach is a safer option to obtain an unbiased value of , since the GP approximation may occasionally fail or have substantial bias.
The user can optionally provide a noise estimate which is used to set the mean of the hyperprior over the observation noise (see Table S1). We recommend to set to the standard deviation of the noisy objective in the proximity of a good solution. If the problem has tunable precision (e.g., number of samples for log likelihoods evaluated via Monte Carlo), we recommend to set it, compatibly with computational cost, such that the standard deviation of noisy evaluations in the neighborhood of a good solution is of order 1.
We tested the performance of BADS on a large set of artificial and real problems and compared it with that of many optimization methods with implementation available in MATLAB (R2015b, R2017a).333MATLAB’s bayesopt optimizer was tested on version R2017a, since it is not available for R2015b. We include here details that expand on Section 4.1 of the main text.
|bads||Bayesian Adaptive Direct Search||GitHub page 444https://github.com/lacerbi/bads||This||✓|
|fminsearchbnd||Nelder-Mead (fminsearch) w/ bounded domain||File Exchange555https://www.mathworks.com/matlabcentral/fileexchange/8277-fminsearchbnd--fminsearchcon.||lagarias1998convergence||✗||✗|
|cmaes||Covariance Matrix Adaptation Evolution Strategy||Author’s website666https://www.lri.fr/~hansen/cmaes_inmatlab.html||hansen2003reducing||✗|
|— (active)||CMA-ES with active covariance adaptation||—||jastrebski2006improving||✗|
|— (noise)||CMA-ES with uncertainty handling||—||hansen2009method||✓|
|mcs||Multilevel Coordinate Search||Author’s website777https://www.mat.univie.ac.at/~neum/software/mcs/||huyer1999global||✗||✓|
|snobfit||Stable Noisy Optimization by Branch and FIT||Author’s website888http://www.mat.univie.ac.at/~neum/software/snobfit/||huyer2008snobfit||✓||✓|
|randsearch||Random search||GitHub page101010https://github.com/lacerbi/neurobench/tree/master/matlab/algorithms||bergstra2012random||✗||✓|
|fmincon||Interior point (interior-point, default)||Opt. Toolbox||waltz2006interior||✗||✗|
|— (sqp)||Sequential quadratic programming||—||wright1999numerical||✗||✗|
|patternsearch||Pattern search||Global Opt. Toolbox||kolda2003optimization||✗||✗|
|ga||Genetic algorithms||Global Opt. Toolbox||goldberg1989genetic||✗|
|particleswarm||Particle swarm||Global Opt. Toolbox||eberhart1995new||✗|
|simulannealbnd||Simulated annealing w/ bounded domain||Global Opt. Toolbox||kirkpatrick1983optimization||✗|
|bayesopt||Vanilla Bayesian optimization||Stats. & ML Toolbox||snoek2012practical||✓||✓|
The list of tested algorithms is reported in Table S3. For all methods, we used their default options unless stated otherwise. For BADS, CMA-ES, and bayesopt, we activated their uncertainty handling option when dealing with noisy problems (for CMA-ES, see hansen2009method ). For noisy problems of the ccn17 set, within the fmincon family, we only tested the best representative method (active-set), since we found that these methods perform comparably to random search on noisy problems (see Fig S1 right, and Fig 1, right panel, in the main text). For the combinatorial game-playing problem subset in the ccn17 test set, we used the settings of MCS provided by the authors as per the original study van2016people . We note that we developed algorithmic details and internal settings of BADS by testing it on the cec14 test set for expensive optimization liang2013problem and on other model-fitting problems which differ from the test problems presented in this benchmark. For bayesopt, we allowed up to 300 training points for the GP, restarting the BO algorithm from scratch with a different initial design every 300 BO iterations (until the total budget of function evaluations was exhausted). The choice of 300 iterations already produced a large average algorithmic overhead of s per function evaluation. As acquisition function, we used the default EI-per-second snoek2012practical , except for problems for which the computational cost is constant across all parameter space, for which we used the simple EI. All algorithms in Table S3 accept hard bound constraints lb, ub, which were provided with the bbob09 set and with the original studies in the ccn17 set. For all studies in the ccn17 set we also asked the original authors to provide plausible lower/upper bounds plb, pub for each parameter, which we would use for all problems in the set (if not available, we used the hard bounds instead). For all algorithms, plausible bounds were used to generate starting points. We also used plausible bounds (or their range) as inputs for algorithms that allow the user to provide additional information to guide the search, e.g. the length scale of the covariance matrix in CMA-ES, the initialization box for MCS, and plausible bounds in BADS.
For all problems and algorithms, for the purpose of our benchmark, we first transformed the problem variables according to the mapping described in ‘Transformation of variables and constraints’ (Section B.1). In particular, this transformation maps the plausible region to the hypercube, and transforms to log space positive variables that span more than one order of magnitude. This way, all methods dealt with the same standardized domains. Starting points during each optimization run were drawn uniformly randomly from inside the box of provided plausible bounds.
For deterministic problems, during each optimization run we kept track of the best (lowest) function value found so far after function evaluations. We define the immediate regret (or error) at time as , where is the true minimum or our best estimate thereof, and we use the error to judge whether the run is a success at step (error less than a given tolerance ). For problems in the bbob09 set (both noiseless and noisy variants), we know the ground truth . For problems in the ccn17 set, we do not know , and we define it as the minimum function value found across all optimization runs of all algorithms ( function evaluations per noiseless problem), with the rationale that it would be hard to beat this computational effort. We report the effective performance of an algorithm with non-negligible fractional overhead by plotting at step its performance at step , which corresponds to a shift of the performance curve when is plotted in log scale (Fig 2 in the main text).111111We did not apply this correction when plotting the results of vanilla BO (bayesopt), since the algorithm’s performance is already abysmal even without accounting for the substantial overhead.
For noisy problems, we care about the true function value(s) at the point(s) returned by the algorithm, since, due to noise, it is possible for an algorithm to visit a neighborhood of the solution during the course of the optimization but then return another point. For each noisy optimization run, we allowed each algorithm to return up to three solutions, obtained either from multiple sub-runs, or from additional outputs available from the algorithm, such as with MCS, or with population-based methods (CMA-ES, ga, and particleswarm). If more than three candidate solutions were available, we gave precedence to the main output of the algorithm, and then we took the two additional solutions with lowest observed function value. We limited the number of candidates per optimization run to allow for a fair comparison between methods, since some methods only return one point and others potentially hundreds (e.g., ga) – under the assumption that evaluating the true value of the log likelihood for a given candidate would be costly. For the combinatorial game-playing problem subset in the ccn17 set, we increased the number of allowed solutions per run to 10 to match the strategy used in the original study van2016people . For noisy problems in the ccn17 set, we estimated the log likelihood at each provided candidate solution via 200 function evaluations, and took the final estimate with lowest average.
For plotting, we determined ranking of the algorithms in the legend proportionally to the overall performance (area under the curve), across iterations (deterministic problems) or across error tolerances (noisy problems.)
In our benchmark, we made some relatively arbitrary choices to assess algorithmic performance, such as the range of tolerances or the number of function evaluations. We show here that our findings are robust to variations in these parameters, by plotting results from the bbob09 set with a few key changes (see Fig 1 in the main text for comparison). First, we restrict the error tolerance range for deterministic functions to instead of the wider range used in the main text (Fig S1 left and middle). This narrower range covers realistic practical requirements for model selection. Second, we reran the bbob09 noisy benchmark, allowing functions evaluation, as opposed to in the main text (Fig S1 right). Our main conclusions do not change, in that BADS performs on par with or better than other algorithms.
BADS is currently freely available as a MATLAB toolbox, bads (a Python version is planned).
The basic design of bads is simplicity and accessibility for the non-expert end user. First, we adopted an interface that resembles that of other common MATLAB optimizers, such as fminsearch or fmincon. Second, bads is plug-and-play, with no requirements for installation of additional toolboxes or compiling C/C++ code via mex files, which usually requires specific expertise. Third, bads hides most of its complexity under the hood, providing the standard user with thoroughly tested default options that need no tweaking.
For the expert user or developer, bads has a modular design, such that poll set generation, the search oracle, acquisition functions (separately for search and poll), and initial design can be freely selected from a large list (under development), and new options are easy to add.
We based our GP implementation in MATLAB on the GPML Toolbox rasmussen2010gaussian (v3.6), modified for increased efficiency of some algorithmic steps, such as computation of gradients,121212We note that version 4.0 of the GPML toolbox was released while BADS was in development. GPML v4.0 solved efficiency issues of previous versions, and might be supported in future versions of BADS., and we added specific functionalities. We optimize the GP hyperparameters with fmincon in MATLAB (if the Optimization Toolbox is available), or otherwise via a the minimize function provided with the GPML package, modified to support bound constraints.