New Heuristics for Parallel and Scalable Bayesian Optimization

07/01/2018 · Ran Rubin

Bayesian optimization has emerged as a strong candidate tool for global optimization of functions with expensive evaluation costs. However, due to the dynamic nature of research in Bayesian approaches, and the evolution of computing technology, using Bayesian optimization in a parallel computing environment remains a challenge for the non-expert. In this report, I review the state-of-the-art in parallel and scalable Bayesian optimization methods. In addition, I propose practical ways to avoid a few of the pitfalls of Bayesian optimization, such as oversampling of edge parameters and over-exploitation of high performance parameters. Finally, I provide relatively simple, heuristic algorithms, along with their open source software implementations, that can be immediately and easily deployed in any computing environment.


1 Introduction

Choosing a strategy for searching for the global minimum or maximum of an unknown function is a non-trivial task. A major consideration is the balance between the function's evaluation cost and the available resources. Different strategies are optimal for different cases. For example, the function values may be the results of experiments that take days or weeks to complete, and only a few can be run concurrently. In this scenario, we would like our search strategy to be as efficient as possible, i.e., to maximize the function as quickly as possible by choosing the parameters of the next function evaluation given the previous results. In addition, we would probably be willing to invest considerable computational resources to produce these parameters. On the other hand, when the durations of function evaluations are only a few seconds, new parameters must be chosen quickly, which restricts the type of computations that can be performed.

Here, we consider an intermediate case encountered in many computational fields of science. In this case, the dimensionality of the function is moderately high and function evaluations are moderately expensive. On the other hand, evaluating the function usually involves some sort of computer simulation, and current cluster computers allow us to run many function evaluations in parallel. In addition, we assume that the optimization is done given a particular amount of computational resources, for example the number of CPUs we can use and the amount of time we are willing to use them. An example that fits this scenario, from the field of machine learning, is the training of neural networks, where the parameters to optimize are usually the various learning rates of the learning algorithm, the number of neurons and connections, etc.

What would be the requirements for a good search strategy in this scenario? First, in high dimensions exhaustive searches are unreasonable, so the search must be efficient and find local extrema of the function as quickly as possible. Second, our strategy must allow for as many parallel evaluations of the function as our resources allow. In most scenarios, the duration of function evaluations is variable. Thus, it is also important that our search strategy is able to operate in an asynchronous way, proposing new parameters as soon as each function evaluation ends. Finally, as our computing resources continually increase, we expect to be able to evaluate ever larger numbers of parameter sets in a reasonable amount of time. Thus, our search strategy must be scalable and able to handle the resulting large datasets.

Several fields of study have approached this problem and many different proposals have been made. However, no single accepted method has been adopted by the scientific community thus far. In addition, the software implementations of most of the proposed algorithms are non-trivial, and many of them lack well designed, easy to use software packages. Therefore, many practitioners opt to simply perform manual optimization: evaluating the function on a grid, identifying regions of interest where performance is good, and evaluating on a finer grid in these regions. This simple approach allows for unlimited parallel evaluations and is trivially scalable (since almost no computation is needed to choose parameter sets to test), but it is very inefficient and labor intensive.

Many heuristic optimization algorithms, termed metaheuristics, such as Simulated Annealing, Genetic Algorithms, and Swarm Algorithms, have been proposed as viable solutions for global optimization [1, 2]. These approaches are usually designed for parallel evaluation of the function and are usually scalable. However, the efficiency of their search is unclear, and comparison among them is a very challenging task.

The field of non-smooth optimization has seen major advances with the introduction of the Generalized Pattern Search (GPS) and Mesh Adaptive Direct Search (MADS) algorithms [3, 4]. These are fairly general frameworks for optimization that can easily combine different search approaches, are well founded in theory, come with convergence guarantees under mild assumptions, and can be fairly easily parallelized and scaled. However, combining them with efficient search approaches suitable for parallel environments remains an active field of research [5], with no currently available tools for the non-expert.

Finally, recent years have also seen a flurry of research in the field of Bayesian Optimization (BO), that is, using Bayesian inference to guide the search strategy during function optimization (for a review see [6]). With a proper selection of prior distributions this approach yields a very efficient search strategy. However, Bayesian inference itself is computationally intensive, which is prohibitive in many situations. Research on how to use Bayesian inference in a parallel environment, and how to scale the inference itself to large datasets, is intense and ongoing, and mature algorithms and tools are not yet available.

It seems that no single approach satisfies all the requirements we set for our search strategy. In this report, I would like to examine more closely the Bayesian approach for optimization, and review the possible ways in which it can be parallelized and scaled. In addition, I will present new, relatively simple, heuristic algorithms for parallel and scalable Bayesian optimization. These algorithms combine several recently proposed approaches and some new ideas of my own. Importantly, I also provide an open source software implementation which is designed to be modular and modifiable, and represents an immediately available, working solution for moderate scale, parallel Bayesian optimization. Hopefully, this code can serve as a foundation upon which other systematic, large-scale approaches may be built.

In the following, I will briefly review the state-of-the-art in Bayesian optimization in section 2, present a few modifications that may be advantageous in section 3, and give the details of my proposed algorithms in section 4.

2 Review of Bayesian Inference and Optimization

We are interested in finding a suitable set of parameters, $\mathbf{x}$, that maximizes some unknown function $f(\mathbf{x})$. We assume our function evaluation values, $y$, are distributed according to

$$y \sim \mathcal{N}\left(f(\mathbf{x}), \sigma_n^2\right) \qquad (1)$$

with constant, unknown standard deviation $\sigma_n$.

The goal of our optimization is to find $\mathbf{x}^*$ such that

$$\mathbf{x}^* = \operatorname*{arg\,max}_{\mathbf{x}} f(\mathbf{x}). \qquad (2)$$

In BO we select points to evaluate by performing Bayesian inference based on past observations: We start by adopting a probabilistic model of $f$, which we denote as $\hat{f}$, that models our knowledge of $f$ as a distribution over functions. In particular, we choose a Gaussian Process (GP) [7], denoted as

$$\hat{f} \sim \mathcal{GP}\left(\mu(\cdot), K(\cdot,\cdot)\right), \qquad (3)$$

where $\mu(\mathbf{x})$ is a mean function and $K(\mathbf{x},\mathbf{x}')$ is some positive definite covariance kernel. This notation implies that a set of points $\{\mathbf{x}_i\}_{i=1}^{N}$ induces a multivariate normal distribution on the vector $\mathbf{f} = (f_1, \dots, f_N)$ (with $f_i = \hat{f}(\mathbf{x}_i)$):

$$\mathbf{f} \sim \mathcal{N}\left(\boldsymbol{\mu}, \mathbf{K}\right), \qquad (4)$$

with means $\mu_i = \mu(\mathbf{x}_i)$ and covariance matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$.

To perform Bayesian inference we assume some priors on the model's parameters. First, we assume that $\hat{f}$ has a GP prior with a constant mean function¹ and a covariance kernel with a set of hyper-parameters $\boldsymbol{\theta}$:

$$\hat{f} \sim \mathcal{GP}\left(\mu_0, K_{\boldsymbol{\theta}}(\cdot,\cdot)\right). \qquad (5)$$

Further, we specify priors for the noise magnitude, the GP prior mean, and the kernel's hyper-parameters:

$$\sigma_n \sim p(\sigma_n), \quad \mu_0 \sim p(\mu_0), \quad \boldsymbol{\theta} \sim p(\boldsymbol{\theta}). \qquad (6)$$

The choice of kernel is an important step, as different kernels confer different properties on the inferred functions and may be suitable in different situations [7]. The kernel hyper-parameters, $\boldsymbol{\theta}$, usually represent the typical amplitude and length scales of the inferred functions.

¹ Remember that the posterior distribution will not, in general, conserve the properties of the prior distribution, and therefore the constant mean assumption is not restrictive.

Now, given a set of observation pairs,

$$\mathcal{D} = \left\{(\mathbf{x}_i, y_i)\right\}_{i=1}^{N}, \qquad (7)$$

we use Bayes' rule to infer the posterior distribution of the parameters:

$$p\left(\hat{f}, \sigma_n, \mu_0, \boldsymbol{\theta} \mid \mathcal{D}\right) \propto p\left(\mathcal{D} \mid \hat{f}, \sigma_n\right) p\left(\hat{f} \mid \mu_0, \boldsymbol{\theta}\right) p(\sigma_n)\, p(\mu_0)\, p(\boldsymbol{\theta}). \qquad (8)$$

The full posterior distribution lets us make predictions on $f$ and also provides us with an estimate of the level of certainty of our predictions. One attractive feature of a GP is that, given the parameters $\sigma_n$, $\mu_0$, and $\boldsymbol{\theta}$, the posterior distribution of functions given $\mathcal{D}$ is also a GP, and it is possible to calculate its posterior mean function and covariance kernel analytically, simplifying calculations involving the full posterior distribution (for details see [7]).
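To make this concrete, the following minimal Python sketch computes the posterior mean and covariance of a GP for fixed, known $\mu_0$, $\sigma_n$, and kernel (i.e., one term of the full posterior of Eq. 8). All function and variable names here are illustrative, not taken from any particular library.

```python
import numpy as np

def gp_posterior(X, y, X_star, kernel, mu0, sigma_n):
    """Posterior mean/covariance of f at X_star given observations (X, y),
    a fixed constant mean mu0, noise level sigma_n, and covariance `kernel`."""
    K = kernel(X, X)                         # N x N covariance of observed points
    K_s = kernel(X, X_star)                  # N x M cross-covariance
    K_ss = kernel(X_star, X_star)            # M x M prior covariance at test points
    # Cholesky factorization: the O(N^3) step discussed in Sec. 2.1
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu0))
    mean = mu0 + K_s.T @ alpha               # posterior mean at X_star
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                     # posterior covariance at X_star
    return mean, cov
```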

2.1 Scalability of Bayesian Inference

Due to the required inversion of the observations' covariance matrix, full Bayesian inference for GPs scales as $O(N^3)$ with the number of observations, $N$. To be useful in higher dimensions, when large numbers of observations are needed to explore the parameter space, we must use an approximation method whose cost grows much more slowly, ideally linearly, with $N$. The leading solution in the literature is variational inference (VI) [8], which for GPs amounts to using the kernel method to approximate the posterior mean and covariance functions with a fixed number of kernel elements. A recent proposal [9] presents a VI algorithm that indeed achieves this scaling. The improved scaling comes at the cost of a somewhat reduced expressibility and an overestimation of the predicted variance. An alternative approach for scaling Bayesian inference is to reduce the number of observations used in the inference, either by considering only local Bayesian inference around promising loci in the parameter space [5] or by performing random sub-sampling of the observations [10].

2.2 Bayesian Search Strategies

Given a set of observations, $\mathcal{D}$, a BO search strategy proposes a new point for evaluation according to some criterion. This is usually done by choosing the point that maximizes some function over the search domain, termed the acquisition function.

Although many acquisition functions have been proposed (see [6] and references therein), a common one is the expected improvement (EI), which we define here as

$$\mathrm{EI}(\mathbf{x}) = \mathbb{E}\left[\left(\hat{f}(\mathbf{x}) - \max_i f(\mathbf{x}_i)\right)_+\right], \qquad (9)$$

where the expectation is taken over the posterior distribution given the observations, and $(z)_+ = z$ for $z > 0$ and $(z)_+ = 0$ otherwise. We then choose:

$$\mathbf{x}_{\mathrm{new}} = \operatorname*{arg\,max}_{\mathbf{x}} \mathrm{EI}(\mathbf{x}). \qquad (10)$$

We note that the EI can be high either because the actual value of the posterior mean, $\mu(\mathbf{x})$, is high or because the uncertainty (posterior variance, $\sigma^2(\mathbf{x})$) at point $\mathbf{x}$ is high. Thus, this measure implements a kind of exploration-exploitation balance.

For a GP the EI can be relatively easily calculated following a few simple approximations: First, practitioners remove the sample dependence of the max operation by replacing $\max_i f(\mathbf{x}_i)$ with the posterior mean values, $\max_i \mu(\mathbf{x}_i)$, or with the observations themselves, $\max_i y_i$. Then, the EI can be calculated analytically conditioned on $\sigma_n$, $\mu_0$, and $\boldsymbol{\theta}$, and practitioners usually approximate the expectation over these parameters by using a periodically updated point estimate or by averaging over a set of samples from their posterior obtained with Markov Chain Monte Carlo (MCMC) methods (which usually significantly increases the computational cost of the inference).

Finally, we note that maximizing the EI (or any other acquisition function) is a non-trivial task in itself, and may be computationally expensive and require some form of approximation.
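For reference, the closed-form EI of a Gaussian predictive distribution takes only a few lines; this sketch assumes the max in Eq. 9 has already been replaced by a scalar `y_best`, as described above, and the names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Analytic EI for a Gaussian posterior N(mu, sigma^2) over a current best y_best."""
    sigma = np.maximum(sigma, 1e-12)         # guard against vanishing variance
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```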

2.3 Parallel Bayesian Search Strategies

Since Bayesian inference is computationally intensive, even a relatively short inference time (for example, a few seconds) may create a computational bottleneck if a large number of inferences is required. This may leave some of our computer resources idle while waiting for the new points to evaluate. As proposed in [11], one way to make sure that the inference remains scalable is to perform the Bayesian inference itself in parallel. However, a couple of difficulties arise in this case:

First, computing several points for evaluation in parallel, given the same or very similar sets of known measurements, will typically result in proposing the same, or very similar, new point, which will hinder the exploration of the function's parameter space. One way to avoid this problem is to introduce variability into the proposed new points. Recently, the use of Thompson Sampling [12, 13, 14] has been proposed [11, 15] as a tool for generating this variability in a way that is consistent with the posterior distribution. Given observations, $\mathcal{D}$, and a single sample of $\hat{f}$ and $(\sigma_n, \mu_0, \boldsymbol{\theta})$ from the posterior distribution, Thompson sampling proposes a new point for evaluation according to

$$\mathbf{x}_{\mathrm{new}} = \operatorname*{arg\,max}_{\mathbf{x}} \hat{f}(\mathbf{x}). \qquad (11)$$

Simply put, Thompson Sampling can be regarded as a probability matching search strategy, setting the probability of choosing a point $\mathbf{x}$ to the posterior probability that this point is indeed the maximum of $f$. When several inferences run in parallel, each will sample from the posterior probability of $f$ independently, and may therefore propose different points for evaluation. We note that, although conceptually simple, sampling from the posterior and maximizing over it is an additional computational step that may not be trivial to implement in a scalable way.

Second, performing inference in an asynchronous parallel environment implies that, in addition to the set of observations, there is an additional, possibly significant, set of running experiments/simulations for which the parameter values are known but the results are still pending. Snoek et al. [16] propose to take this second set into account during the inference by 'fantasizing' the results of the evaluations and performing the inference with the fantasized values.

It is simple to incorporate fantasies within the framework of Thompson Sampling. Given observations, $\mathcal{D}$, and pending points $\{\mathbf{x}^{\mathrm{pend}}_j\}$, one first samples fantasized observation values, $\tilde{y}_j \sim \mathcal{N}(\hat{f}(\mathbf{x}^{\mathrm{pend}}_j), \sigma_n^2)$, where $\hat{f}$ and $\sigma_n$ are sampled from the posterior distribution given the observations. Then, another $\hat{f}$ can be sampled from the posterior distribution given the union $\tilde{\mathcal{D}} = \mathcal{D} \cup \{(\mathbf{x}^{\mathrm{pend}}_j, \tilde{y}_j)\}$. While this procedure will not, on average, change the predicted mean of the function at the pending points, it will reduce the uncertainty around points that are pending and make them less likely to be selected again. In addition, as noted in [11], sampling from the posterior distribution does not, on average, change the posterior of the GP hyper-parameters, so there is no need to sample them again conditioned on $\tilde{\mathcal{D}}$.
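A minimal sketch of the fantasizing step, assuming a hypothetical `gp_sample` callable that returns one posterior draw of $\hat{f}$ at given points (all names are illustrative):

```python
import numpy as np

def fantasize(gp_sample, sigma_n, X_pending, rng):
    """Draw fantasized observations for the pending points from one posterior
    sample of f-hat and the sampled noise level sigma_n."""
    f_pend = gp_sample(X_pending)            # sampled latent values at pending points
    return f_pend + sigma_n * rng.standard_normal(len(X_pending))
```

The returned values are appended to the real observations to form $\tilde{\mathcal{D}}$ before the next posterior sample is drawn.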

3 Practical Considerations and Heuristic Modifications for Improved Efficiency

Until now I have reviewed the state-of-the-art of parallel and scalable Bayesian optimization. Here I propose three modifications and improvements that may increase the efficiency of Bayesian optimization. I found these to be helpful in my experiments but did not conduct a systematic quantification of their effect. Such a quantification is not a trivial task and is beyond the scope of this technical report. Nonetheless, I present them here as possible, fruitful avenues of research.

3.1 Sample Improvement

Thompson Sampling has been shown to bound the mean regret when the goal of optimization is to maximize the cumulative value or minimize the accumulated regret over time [17, 18, 19]. As such, it tends to lean towards exploitation once a 'good' set of parameters is found. Indeed, it has been claimed that Thompson sampling asymptotically converges to the global maximizer in exponential time [20]. However, in the context of parameter optimization, there is really no need to evaluate the same point again once we are confident enough about the value of $f$ at this point.

Here I propose the concept of Sample Improvement (SI). We start with the simple idea of approximating/replacing the expected improvement by a single sample of the improvement, and selecting the point that maximizes this sample improvement function. Given observations, $\mathcal{D}$, and a single sample of $\hat{f}$ and $(\sigma_n, \mu_0, \boldsymbol{\theta})$ from the posterior distribution, we define the Sample Improvement of point $\mathbf{x}$ as simply

$$\mathrm{SI}(\mathbf{x}) = \left(\hat{f}(\mathbf{x}) - \max_i y_i\right)_+. \qquad (12)$$

As an approximation, the SI introduces the variability that we can exploit for parallel inference. We therefore choose the new point for evaluation according to

$$\mathbf{x}_{\mathrm{new}} = \operatorname*{arg\,max}_{\mathbf{x}} \mathrm{SI}(\mathbf{x}). \qquad (13)$$

Note that the expected improvement is always positive, so the $\arg\max$ operation in Eq. 10 is well defined. However, in case no point in the sample improves the current result of the function, we are left with $\mathrm{SI}(\mathbf{x}) = 0$ for all $\mathbf{x}$, and the $\arg\max$ in Eq. 13 is not defined. The advantage of using SI over Thompson sampling (Eq. 11) is that we can interpret a sample function for which $\mathrm{SI}(\mathbf{x}) = 0$ (or smaller than some a priori chosen value) for all $\mathbf{x}$ as a sign of possible convergence of the inference to a local maximum. In this case, it might be appropriate to use a more exploratory strategy for the selection of points for evaluation. One possibility is choosing the point that locally maximizes the estimated variance or an upper confidence bound of $\hat{f}$ (this will not generate variability in the selected points, which may be problematic if many inferences are run in parallel). Another is choosing random points around the current estimated maximum to encourage exploration in its vicinity.
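As a sketch, selecting a point by SI amounts to a few lines, given one posterior draw evaluated at a set of candidate points (names are illustrative):

```python
import numpy as np

def choose_by_sample_improvement(X_cand, f_hat_vals, y_best, eps=0.0):
    """Return the candidate maximizing the Sample Improvement of Eq. 12,
    or None if no candidate improves, signaling possible local convergence."""
    si = np.maximum(f_hat_vals - y_best, 0.0)   # Eq. 12 at the candidates
    if si.max() <= eps:                          # no sampled improvement anywhere
        return None                              # caller falls back to exploration
    return X_cand[np.argmax(si)]
```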

3.2 Variance Control

Alternating between search strategies to improve the exploration-exploitation balance based on the current state of the search is important for parameter optimization. Indeed, several meta-strategies have been proposed to select between the sampling proposals of several strategies [21, 22]. However, finding a strategy that will be suitable at every stage of the optimization and that will always prevent oversampling of similar points remains a challenge. Here I propose a simple strategy, termed Variance Control, to ensure that no point is ever oversampled, by explicitly controlling the estimated variance of sampled points. We consider a situation in which we wish to evaluate $f$ at any given point only up to some level of accuracy. Bayesian inference can aid in setting this level: a reasonable limit might be a fraction of the estimated noise level in our observations ($\sigma_n$). We would like our algorithm to avoid sampling points where the estimated variance is below our required accuracy. This can be achieved by excluding these points from our search domain.

Given a set of observations and pending points, $\tilde{\mathcal{D}}$, and a sample of $(\sigma_n, \mu_0, \boldsymbol{\theta})$, we can define our search domain

$$\mathcal{X}_{\tilde{\mathcal{D}}} = \left\{\mathbf{x} : \sigma^2(\mathbf{x}) > \gamma^2 \sigma_n^2\right\}, \qquad (14)$$

where $\sigma^2(\mathbf{x})$ is the posterior estimated variance given $\tilde{\mathcal{D}}$ and $(\sigma_n, \mu_0, \boldsymbol{\theta})$, and $\gamma$ is our desired level of accuracy as a fraction of the estimated noise. In this setting, we can view parameter optimization not as an attempt to find the maximum of $f$ over the entire parameter space but rather only over $\mathcal{X}_{\tilde{\mathcal{D}}}$. Therefore, we choose our next point to evaluate according to our current search strategy only if the proposed point is in $\mathcal{X}_{\tilde{\mathcal{D}}}$. In addition, we also compute improvement measures based only on observations that are in $\mathcal{X}_{\tilde{\mathcal{D}}}$. We will describe two different approaches for implementing variance control in section 4.3.
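The membership test of Eq. 14 is a one-liner given the posterior predictive standard deviation at the candidate points (a sketch with illustrative names):

```python
import numpy as np

def in_search_domain(sigma_post, sigma_n, gamma):
    """Boolean mask of Eq. 14: keep points whose posterior predictive std
    exceeds the accuracy floor gamma * sigma_n."""
    return sigma_post > gamma * sigma_n
```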

3.3 Boundary Avoidance

In many texts, examples of Bayesian inference are presented for a function of one or two parameters. While this is instructive, one has to remember that our intuitive understanding of low-dimensional spaces does not always carry over to higher dimensions. A famous example is the volume of a $d$-dimensional sphere: unlike 2-dim. circles or 3-dim. balls, when $d$ is large most of the volume of a sphere is concentrated near its surface.

In the context of Bayesian optimization, we are usually concerned with finding the best parameters within some range for each parameter. The range is usually picked based on our prior belief about the reasonable values of the parameters. In fact, we usually suspect that the optimal value is not close to the edges of our range. However, when choosing points to sample using Bayesian inference, we are very likely to choose points on the edges because, unless already sampled, at these points the model is extrapolating and we expect it to have large estimated variance and, possibly, a monotonically increasing mean function. For functions of one or two parameters, evaluating the function at the edges of our range will only take a few samples and will quickly reduce the estimated variance there. However, for higher dimensional spaces, sampling the edges well will require many function evaluations in locations that we do not believe are optimal.

Taking the Bayesian approach, we would like to express our knowledge of the parameter range as a prior distribution on the location of the maximum of $f$. However, since we only directly model our knowledge of $f$ and not of the location of its maximum, it is not clear how to incorporate this prior within the GP framework we described. First steps at systematically addressing the edge oversampling problem were taken in [23]. Our practical solution is to simply reject points for evaluation if they are on the edges of our parameter range.

4 Methods and Algorithms

Implementing Bayesian inference and optimization involves many choices and approximations that may affect the results. However, it is very hard to estimate systematically the effect of these choices, and the implementer must rely on choices previously made by others as well as common sense and intuition. In this section, I would like to inform the reader and users of these choices and approximations so that proper review and improvement can be made. In addition, I have tried to design my algorithmic implementation to be as modular as possible, so that various parts can be easily replaced in case some choice or approximation is not to the user’s liking or is unsuitable for the user’s circumstances.

As an immediately available working solution, I provide two simple algorithms that perform parallel Bayesian inference using the above proposed heuristics to generate variability in the selection of points, avoid boundaries, alternate between global and local search strategies, and avoid oversampling of points. An open source software implementation of these algorithms, along with further implementation details, can be found in the GitHub repository: https://github.com/ranr01/miniBOP.

4.1 Parallel Bayesian Optimization

This algorithm assumes that each computational node can be used either to evaluate the function at a single set of parameters, $\mathbf{x}$, or to choose a new point for evaluation using Bayesian inference. Given a set of computational nodes, a simple algorithm to perform the optimization can be defined as follows (a schematic sketch of this loop is given after the list):

  1. Submit a batch of function evaluations on a quasi-random grid.

  2. Wait for a job to finish.

    1. If the finished job is a function evaluation, process the results and submit a Bayesian inference job with the current information on pending and completed simulations.

    2. If the finished job is a Bayesian inference job, submit a function evaluation with the suggested parameters.

  3. Repeat step 2 until the maximal amount of resources is used.
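A schematic sketch of this loop, assuming a hypothetical `cluster` object with `submit`/`wait_for_any` primitives; none of these names are taken from the miniBOP code.

```python
def optimize(cluster, f, initial_points, max_evals):
    """Asynchronous master loop of Sec. 4.1 (schematic sketch)."""
    observations, pending = [], []
    for x in initial_points:                  # 1. quasi-random initial batch
        cluster.submit('eval', f, x)
        pending.append(x)
    while len(observations) < max_evals:      # 3. repeat until the budget is spent
        job = cluster.wait_for_any()          # 2. wait for any job to finish
        if job.kind == 'eval':                # 2.1 an evaluation finished
            pending.remove(job.x)
            observations.append((job.x, job.y))
            cluster.submit('infer', observations, list(pending))
        else:                                 # 2.2 an inference finished
            cluster.submit('eval', f, job.x_new)
            pending.append(job.x_new)
    return max(observations, key=lambda xy: xy[1])
```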

For a more detailed discussion of different resource management strategies I refer the reader to [24, 25].

4.2 Sampling and maximizing a function from the posterior GP

Sampling a function from the posterior GP is done by sampling a multivariate Gaussian with the appropriate mean and covariance for the sampled points. This can be done in batch by computing the Cholesky decomposition of the full covariance matrix of the sampled points. For sequential sampling of points (as is usually required for maximization), the Cholesky decomposition can be performed iteratively in an efficient manner.

Note that this procedure does not scale well with the number of points sampled from $\hat{f}$. Finding an efficient, scalable way of sampling from a high-dimensional multivariate Gaussian is in fact an open question under scientific investigation (see for example [26, 27]). To avoid sampling a large number of points from a single sampled function, during our inference we do not try to find a global maximum of $\hat{f}$. Instead, we find a local maximum by running the Nelder-Mead (Simplex) algorithm starting from a random initial point. We specify an absolute tolerance for $\hat{f}$ as a tunable parameter.
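The following sketch lazily samples a single function path from a GP so that it can be fed to a Nelder-Mead maximizer; for clarity it refactorizes the covariance at every call, whereas the implementation discussed above would update the Cholesky factor iteratively. All names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

class GPSamplePath:
    """One lazily-evaluated sample function from a GP with mean m(X) and kernel k(A, B)."""
    def __init__(self, m, k, rng, jitter=1e-10):
        self.m, self.k, self.rng, self.jitter = m, k, rng, jitter
        self.X, self.f = [], []                 # points/values sampled so far

    def __call__(self, x):
        x = np.atleast_2d(np.asarray(x, dtype=float))
        prior_var = self.k(x, x)[0, 0]
        if not self.X:
            mu, var = self.m(x)[0], prior_var
        else:
            X = np.array(self.X)
            K = self.k(X, X) + self.jitter * np.eye(len(X))
            ks = self.k(X, x)[:, 0]
            w = np.linalg.solve(K, ks)          # condition on earlier samples
            mu = self.m(x)[0] + w @ (np.array(self.f) - self.m(X))
            var = max(prior_var - ks @ w, 0.0)
        val = mu + np.sqrt(var) * self.rng.standard_normal()
        self.X.append(x[0]); self.f.append(val)
        return val

# Local maximization of one sampled path from a random start (Sec. 4.2):
# path = GPSamplePath(post_mean, post_kernel, np.random.default_rng())
# res = minimize(lambda x: -path(x), x0, method='Nelder-Mead', options={'fatol': tol})
```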

4.3 Choosers

The heart of Bayesian optimization is the algorithm to choose new points for evaluation. Here we present two heuristic algorithms that incorporate the principles discussed above.

4.3.1 Bayesian Optimization and Poll steps (BOP)

This algorithm alternates between approximately maximizing the Sample Improvement and taking random steps from the current global maximum. The name 'poll steps' is borrowed from the GPS and MADS algorithms to describe the random exploration around the current estimated maximum.

Given a set of observations, $\mathcal{D}$, and pending points, we choose the next point in the following way (a condensed sketch of the chooser follows the list):

  1. Sample the hyper-parameters ($\sigma_n, \mu_0, \boldsymbol{\theta}$) from the posterior given the observation set $\mathcal{D}$, using MCMC.

  2. Sample the results of the pending points by sampling $\tilde{y}_j \sim \mathcal{N}(\hat{f}(\mathbf{x}^{\mathrm{pend}}_j), \sigma_n^2)$, where $\hat{f}$ is sampled from the posterior GP given the observation set $\mathcal{D}$.

  3. Sample a set of $N_{\mathrm{cand}}$ candidate points, $\mathcal{C}$, each found as a local maximum of a sample function $\hat{f}$, sampled from the posterior GP given the union $\tilde{\mathcal{D}} = \mathcal{D} \cup \{(\mathbf{x}^{\mathrm{pend}}_j, \tilde{y}_j)\}$. To avoid sampling a large number of points from a single sample function, we sample a new $\hat{f}$ for each candidate.

  4. Exclude candidate points in $\mathcal{C}$ that have low predicted variance (i.e., points outside $\mathcal{X}_{\tilde{\mathcal{D}}}$) or that are on the edges of the parameter range.

  5. Calculate the improvement of the remaining points, where improvement is defined as

$$I(\mathbf{x}) = \left(\hat{f}(\mathbf{x}) - \max_{i :\, \mathbf{x}_i \in \mathcal{X}_{\tilde{\mathcal{D}}}} y_i\right)_+. \qquad (15)$$
  6. Exclude candidate points in $\mathcal{C}$ for which the improvement is zero (or smaller than some value $\epsilon$).

  7. If $\mathcal{C}$ is not empty, return the point with the maximal improvement and exit.

  8. Poll step:

    1. Find the best parameters out of the sampled and pending points.

    2. Choose a set of candidate points, $\mathcal{C}'$, by choosing random points around the best point. The random step size is chosen to be proportional to the estimated length scales of the covariance kernel.

    3. Exclude points with low predicted variance.

    4. If $\mathcal{C}'$ is not empty, return the point with the maximal predicted variance and exit.

  9. Default step: return a random point and exit.
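A condensed sketch of this chooser, assuming a hypothetical `posterior` helper object that bundles the sampling routines sketched earlier (`sample_hyperparams`, `fantasize`, `maximize_sample_path`, `predictive_std`) together with hypothetical `on_edge` and `poll_step` helpers; none of these names are taken from the miniBOP code.

```python
def bop_choose(observations, pending, posterior, n_cand, gamma, eps=0.0):
    """One pass of the BOP chooser (schematic; steps refer to the list above).
    `posterior`, `on_edge`, and `poll_step` are hypothetical helpers."""
    theta = posterior.sample_hyperparams(observations)            # step 1 (MCMC)
    fakes = posterior.fantasize(pending, theta)                   # step 2
    data = observations + list(zip(pending, fakes))
    cands = [posterior.maximize_sample_path(data, theta)          # step 3: one new
             for _ in range(n_cand)]                              # f-hat per candidate
    y_best = max(y for _, y in observations)
    kept = [(x, fx) for x, fx in cands                            # step 4
            if posterior.predictive_std(x, data, theta) > gamma * theta.sigma_n
            and not on_edge(x)]
    improving = [(x, fx - y_best) for x, fx in kept
                 if fx - y_best > eps]                            # steps 5-6
    if improving:                                                 # step 7
        return max(improving, key=lambda t: t[1])[0]
    return poll_step(data, posterior, theta)                      # steps 8-9
```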

4.3.2 Function Barrier Variance Control (FuBar VC)

Here, we approximate the requirement of sampling points only inside $\mathcal{X}_{\tilde{\mathcal{D}}}$ by adding a soft barrier function to our sampled function $\hat{f}$. We define

$$\hat{f}_B(\mathbf{x}) = \hat{f}(\mathbf{x}) - B\!\left(\frac{\sigma(\mathbf{x})}{\gamma \sigma_n}\right), \qquad (16)$$

where $\sigma(\mathbf{x})$ is the estimated standard deviation of $\hat{f}$ at the point $\mathbf{x}$ and $B(z)$ is a function which is very large for $z < 1$ and negligible for $z \gg 1$. As a concrete example, we choose a power law function,

$$B(z) = z^{-\alpha}, \qquad (17)$$

with $\alpha \gg 1$. Here, $\alpha$ controls how sharp our barrier function is around the boundary $\sigma(\mathbf{x}) = \gamma \sigma_n$. This parameter can control the rate of exploration around the boundary, as higher values allow sampling of points closer to the boundary.

The resulting algorithm remains the same as the BOP algorithm except for the two following modifications:

  • In step 3 we replace the optimization of $\hat{f}$ with the optimization of $\hat{f}_B$.

  • In step 5 the improvement is defined as

$$I(\mathbf{x}) = \left(\hat{f}_B(\mathbf{x}) - \max_{i :\, \mathbf{x}_i \in \mathcal{X}_{\tilde{\mathcal{D}}}} y_i\right)_+. \qquad (18)$$

We note that using the barrier function renders the estimated variance check in step 4 somewhat redundant, so in practice it can be dropped.
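A sketch of the barrier-modified sample function, following the reconstruction of Eqs. 16-17 above (names are illustrative):

```python
import numpy as np

def barrier(sigma_post, sigma_n, gamma, alpha):
    """Power-law barrier of Eq. 17: large below the variance-control boundary
    sigma_post = gamma * sigma_n, negligible well above it (alpha >> 1)."""
    z = sigma_post / (gamma * sigma_n)
    return z ** (-alpha)

def f_barrier(f_sample, sigma_post, sigma_n, gamma, alpha=20.0):
    """Barrier-modified sample function of Eq. 16."""
    return f_sample - barrier(sigma_post, sigma_n, gamma, alpha)
```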

4.4 Other implementation details

4.4.1 Parameter rescaling

We consider a scenario in which optimization is performed in a hyper-box specified by a minimal and maximal value for each function variable. For the purpose of Bayesian inference, we follow the common practice of rescaling all the variables to lie inside the unit hyper-cube, i.e., inside the range $[0, 1]$ for each variable.
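For concreteness, a minimal sketch of the rescaling and its inverse (names are illustrative):

```python
import numpy as np

def to_unit_cube(x, lo, hi):
    """Map parameters from the hyper-box [lo, hi] to the unit hyper-cube."""
    return (np.asarray(x) - np.asarray(lo)) / (np.asarray(hi) - np.asarray(lo))

def from_unit_cube(u, lo, hi):
    """Map unit hyper-cube coordinates back to the original hyper-box."""
    return np.asarray(lo) + np.asarray(u) * (np.asarray(hi) - np.asarray(lo))
```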

4.4.2 Covariance kernel function

Following [16], we use the Matern 5/2 kernel with an individual length scale for each dimension (sometimes termed Automatic Relevance Determination).

Thus the kernel has $D + 1$ hyper-parameters, $\boldsymbol{\theta} = (\theta_0, \theta_1, \dots, \theta_D)$, and given $\mathbf{x}$ and $\mathbf{x}'$ we have:

$$K_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{x}') = \theta_0 \left(1 + \sqrt{5}\, r + \tfrac{5}{3} r^2\right) e^{-\sqrt{5}\, r}, \qquad (19)$$

with

$$r^2 = \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\theta_d^2}. \qquad (20)$$
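A vectorized sketch of Eqs. 19-20 (names are illustrative):

```python
import numpy as np

def matern52_ard(A, B, theta0, length_scales):
    """Matern 5/2 kernel with per-dimension length scales (Eqs. 19-20).
    A: (N, D) and B: (M, D) arrays of points; returns an (N, M) matrix."""
    diff = A[:, None, :] - B[None, :, :]               # pairwise differences
    r2 = np.sum((diff / length_scales) ** 2, axis=-1)  # squared scaled distance, Eq. 20
    r = np.sqrt(r2)
    return theta0 * (1.0 + np.sqrt(5.0) * r + (5.0 / 3.0) * r2) * np.exp(-np.sqrt(5.0) * r)
```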

4.4.3 MCMC for hyper-parameter sampling

We sample the hyper-parameters ($\sigma_n, \mu_0, \boldsymbol{\theta}$) from the posterior given the observations by performing MCMC steps using Slice Sampling, in the manner done in [16].

4.4.4 Assumed Priors

We specify prior distributions for $\sigma_n$, $\mu_0$, and $\boldsymbol{\theta}$.

For the mean of the GP we take a flat prior

$$\mu_0 \sim U\!\left(\mu_{\min}, \mu_{\max}\right), \qquad (21)$$

where $U(a, b)$ is a uniform distribution between $a$ and $b$. We set $\mu_{\min}$ ($\mu_{\max}$) to be the min (max) over all function value observations.

Following [16], for the observation noise variance we take a probability distribution with scale parameter $s_n$ (Eq. 22; see the source code for its exact form), and, for the GP amplitude, $\theta_0$, we take a log-normal distribution,

$$\theta_0 \sim \ln\mathcal{N}\!\left(0, s_a^2\right). \qquad (23)$$

For each length scale parameter we use an inverse Gamma distribution:

$$\theta_d \sim \Gamma^{-1}\!\left(k_l, s_l\right), \quad d = 1, \dots, D, \qquad (24)$$

with shape $k_l$ and scale $s_l$.

See the source code for the chosen default values of $s_n$, $s_a$, $k_l$, and $s_l$.
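As an illustration, the priors of Eqs. 21, 23, and 24 can be expressed with scipy.stats; the noise prior of Eq. 22 is omitted since its exact form is given only in the source code, and the default values below are placeholders, not those of the implementation.

```python
from scipy import stats

def make_priors(y, s_a=1.0, k_l=2.0, s_l=1.0):
    """Illustrative prior distributions for mu0 (Eq. 21), the GP amplitude
    theta0 (Eq. 23), and the length scales theta_d (Eq. 24)."""
    return {
        'mu0': stats.uniform(loc=y.min(), scale=y.max() - y.min()),  # flat on [min, max]
        'theta0': stats.lognorm(s=s_a),                              # log-normal, scale s_a
        'theta_d': stats.invgamma(a=k_l, scale=s_l),                 # inverse Gamma
    }
```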

5 Discussion

We set out to find a search strategy suitable for global optimization of expensive functions in a massively parallel computing environment. As such, we needed a strategy that is efficient, parallel, and scalable. Unfortunately, most of the available open source software for optimization was not explicitly designed for this parallel scenario (see [6] for a list of some of the available software). One proprietary software solution for massive Bayesian optimization [28] exists. However, its opaque methods and algorithms make it hard to evaluate its performance.

In my opinion, the combination of quick Bayesian search methods (such as variational inference) with parallel versions of mathematically grounded grid search methods such as MADS is a promising way forward for producing robust and scalable methods of Bayesian Optimization for large-scale global optimization. First steps in this direction have been taken with the recently proposed BADS algorithm [5].

The heuristic algorithms I presented here perform efficient Bayesian search in a parallel environment. The scalability of these algorithms is limited only by the scalability of the Bayesian inference and sampling. In the current software implementation, the classic full Bayesian inference method is used, scaling as $O(N^3)$ with the number of observed points. This practically limits the application of the algorithms to moderate size problems where only a few thousand observed points are needed. Once a scalable implementation of Bayesian inference and sampling is available, both algorithms presented here can be scaled as well. As mentioned above, a systematic quantification of the performance of these algorithms and a thorough comparison to other approaches is still lacking. Hopefully, they can serve as a useful tool until such a study can be performed.

Acknowledgments

I thank Larry Abbott, James Murray, Fabio Stefanini and Ari Pakman for helpful discussions and comments. This work was supported by the Center for Theoretical Neuroscience’s NeuroNex award and by the charitable contribution of the Gatsby Foundation.

References

Appendix: Tunable parameters of the BOP and FuBar VC algorithms

The tunable parameters of the algorithms are:

  • $N_{\mathrm{cand}}$: number of local maxima to find in each Bayesian sampling step.
  • Number of random poll point candidates to generate in each poll step.
  • Size of the poll step relative to the estimated length scales.
  • $\gamma$: minimal fraction of the estimated noise level used for variance control of the predicted standard deviation.
  • Minimal absolute level used for variance control of the predicted standard deviation.
  • $\alpha$: exponent of the power law barrier function in the FuBar VC algorithm.
  • Absolute tolerance level for the maximization of sample functions.
  • Number of MCMC steps to take for each GP hyper-parameter sample.
  • $s_n$: scale of the prior distribution of the observation noise ($\sigma_n$).
  • $s_a$: scale of the prior distribution of the GP amplitude ($\theta_0$).
  • $k_l$: shape parameter of the prior distribution of the length scales ($\theta_d$).
  • $s_l$: scale of the prior distribution of the length scales ($\theta_d$).
  • exclude_edge_points: should the algorithm exclude edge points (Boolean).