We propose minimum regret search (MRS), a novel acquisition function for Bayesian optimization. MRS bears similarities with information-theoretic approaches such as entropy search (ES). However, while ES aims in each query at maximizing the information gain with respect to the global maximum, MRS aims at minimizing the expected simple regret of its ultimate recommendation for the optimum. While ES and MRS perform similarly in most cases empirically, MRS produces fewer outliers with high simple regret than ES. We provide empirical results both for a synthetic single-task optimization problem and for a simulated multi-task robotic control problem.
Bayesian optimization (BO, Shahriari et al., 2016) denotes a sequential, model-based, global approach for optimizing black-box functions. It is particularly well-suited for problems which are non-convex, do not necessarily provide derivatives, are expensive to evaluate (computationally, economically, or morally), and are potentially noisy. Under these conditions, there is typically no guarantee of finding the true optimum of the function with a finite number of function evaluations. Instead, one often aims at finding a solution with small simple regret (Bubeck et al., 2009) with regard to the true optimum, where simple regret denotes the difference between the true optimal function value and the function value of the “solution” selected by the algorithm after a finite number of function evaluations. BO aims at finding such a solution of small simple regret while at the same time minimizing the number of evaluations of the expensive target function. For this, BO maintains a probabilistic surrogate model of the objective function and a myopic utility or acquisition function, which defines the “usefulness” of performing an additional function evaluation at a certain input for learning about the optimum.
A critical component for the performance of BO is the acquisition function, which controls the exploratory behavior of the sequential search procedure. Different kinds of acquisition functions have been proposed, ranging from improvement-based acquisition functions over optimistic acquisition functions to information-theoretic acquisition functions (see Section 2). In the latter class, the group of entropy search-based approaches (Villemonteix et al., 2008; Hennig & Schuler, 2012; Hernández-Lobato et al., 2014), which aims at maximizing the information gain regarding the true optimum, has achieved state-of-the-art performance on a number of synthetic and real-world problems. However, performance is often reported as the median over many runs, which bears the risk that the median masks “outlier” runs that perform considerably worse than the rest. In fact, our results indicate that the performance of sampling-based entropy search is not necessarily better than that of traditional and cheaper acquisition functions according to the mean simple regret.
In this work, we propose minimum regret search (MRS), a novel acquisition function that explicitly aims at minimizing the expected simple regret (Section 3). MRS performs well according to both the mean and median performance on a synthetic problem (Section 5.1). Moreover, we discuss how MRS can be extended to multi-task optimization problems (Section 4) and present empirical results on a simulated robotic control problem (Section 5.2).
In this section, we provide a brief overview of Bayesian optimization (BO); we refer to Shahriari et al. (2016) for a recent, more extensive review. BO can be applied to black-box optimization problems, which can be framed as optimizing an objective function $f$ over some bounded set $\mathcal{X}$. In contrast to most other black-box optimization methods, BO is a global method which makes use of all previous evaluations of $f$ rather than using only a subset of the history for approximating a local gradient or Hessian. For this, BO maintains a probabilistic model for $f$, typically a Gaussian process (GP, Rasmussen & Williams, 2006), and uses this model for deciding at which $\mathbf{x} \in \mathcal{X}$ the function will be evaluated next.
Assume we have already queried $N$ datapoints $\{\mathbf{x}_i\}_{i=1}^N$ and observed their (noisy) function values $\{y_i\}_{i=1}^N$. The choice of the next query point is based on a utility function over the GP posterior, the so-called acquisition function $a(\mathbf{x})$, via $\mathbf{x}_{N+1} = \arg\max_{\mathbf{x} \in \mathcal{X}} a(\mathbf{x})$. Since the maximum of the acquisition function cannot be computed directly, a global optimizer is typically used to determine $\mathbf{x}_{N+1}$. A common strategy is to use DIRECT (Jones et al., 1993) to find the approximate global maximum, followed by L-BFGS (Byrd et al., 1995) to refine it.
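This two-phase inner maximization can be sketched as follows; as a simplification, a coarse random search stands in for DIRECT (a shortcut found in many BO implementations), and the acquisition function and bounds below are illustrative placeholders:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acq, bounds, n_starts=10, seed=0):
    """Approximately maximize an acquisition function on a box domain.

    A coarse random-search phase (standing in for DIRECT) picks promising
    starting points, which L-BFGS-B then refines locally.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    # Global phase: score random candidates, keep the best as starts.
    candidates = rng.uniform(lo, hi, size=(100, len(bounds)))
    order = np.argsort([-acq(c) for c in candidates])
    best_x, best_val = None, -np.inf
    # Local phase: refine each start with L-BFGS-B (we minimize -acq).
    for x0 in candidates[order[:n_starts]]:
        res = minimize(lambda x: -acq(x), x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_val:
            best_x, best_val = res.x, -res.fun
    return best_x

# Toy acquisition with a unique maximum at (0.3, 0.7).
x_star = maximize_acquisition(lambda x: -np.sum((x - np.array([0.3, 0.7])) ** 2),
                              [(0.0, 1.0), (0.0, 1.0)])
```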
The first class of acquisition functions are optimistic policies such as the upper confidence bound (UCB) acquisition function, which aims at minimizing the regret during the course of BO and has the form

$a_{\text{UCB}}(\mathbf{x}) = \mu(\mathbf{x}) + \kappa \sigma(\mathbf{x}),$

where $\kappa$ is a tunable parameter which balances exploitation (small $\kappa$) and exploration (large $\kappa$), and $\mu(\mathbf{x})$ and $\sigma(\mathbf{x})$ denote the mean and standard deviation of the GP at $\mathbf{x}$, respectively. Srinivas et al. (2010) proposed GP-UCB, which entails a specific schedule for $\kappa$ that yields provable cumulative regret bounds.
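As a minimal sketch (assuming the posterior mean and standard deviation arrays have already been computed from the GP), UCB is a one-liner:

```python
import numpy as np

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound acquisition: mu(x) + kappa * sigma(x).

    kappa = 0 is pure exploitation of the posterior mean; larger kappa
    increasingly favors points with high posterior uncertainty.
    """
    return mu + kappa * sigma

# Illustrative posterior values at three candidate points.
mu = np.array([0.0, 0.5, 0.2])
sigma = np.array([1.0, 0.1, 0.5])
```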
The second class of acquisition functions are improvement-based policies, such as the probability of improvement (PI, Kushner, 1964) over the current best value, which can be calculated in closed form for a GP model:

$a_{\text{PI}}(\mathbf{x}) = \Phi(z(\mathbf{x})) \quad \text{with} \quad z(\mathbf{x}) = \frac{\mu(\mathbf{x}) - \tau}{\sigma(\mathbf{x})},$

where $\Phi$ denotes the cumulative distribution function of the standard Gaussian and $\tau$ denotes the incumbent, typically the best function value observed so far: $\tau = \max_i y_i$. Since PI exploits quite aggressively (Jones, 2001), a more popular alternative is the expected improvement (EI, Mockus et al., 1978) over the current best value $\tau$, which can again be computed in closed form for a GP model as $a_{\text{EI}}(\mathbf{x}) = (\mu(\mathbf{x}) - \tau)\,\Phi(z(\mathbf{x})) + \sigma(\mathbf{x})\,\phi(z(\mathbf{x}))$, where $\phi$ denotes the standard Gaussian density function. A generalization of EI is the knowledge gradient factor (Frazier et al., 2009), which can better handle noisy observations, which impede the estimation of the incumbent. The knowledge gradient requires defining a set $\mathbb{A}$ from which one would choose the final solution.
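The closed-form expressions for PI and EI can be sketched directly from the Gaussian posterior; `mu` and `sigma` denote the posterior mean and standard deviation at a candidate point and `tau` the incumbent:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, tau):
    """PI: posterior probability that f(x) exceeds the incumbent tau,
    i.e. Phi((mu - tau) / sigma) for a Gaussian posterior."""
    return norm.cdf((mu - tau) / sigma)

def expected_improvement(mu, sigma, tau):
    """EI in closed form: (mu - tau) * Phi(z) + sigma * phi(z),
    with z = (mu - tau) / sigma."""
    z = (mu - tau) / sigma
    return (mu - tau) * norm.cdf(z) + sigma * norm.pdf(z)

# At mu == tau, PI is 0.5 and EI reduces to sigma * phi(0).
pi_at_tau = probability_of_improvement(1.0, 1.0, 1.0)
ei_at_tau = expected_improvement(1.0, 1.0, 1.0)
```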
The third class of acquisition functions are information-based policies, which entail Thompson sampling and entropy search (ES, Villemonteix et al., 2008; Hennig & Schuler, 2012; Hernández-Lobato et al., 2014). Let $p_{\max}(\mathbf{x} \mid \mathcal{D}_N)$ denote the posterior distribution of the unknown optimizer $\mathbf{x}^\star = \arg\max_{\mathbf{x}} f(\mathbf{x})$ after observing $\mathcal{D}_N$. The objective of ES is to select the query point that results in the maximal reduction in the differential entropy of $p_{\max}$. More formally, the entropy search acquisition function is defined as

$a_{\text{ES}}(\mathbf{x}) = H[p_{\max}(\cdot \mid \mathcal{D}_N)] - \mathbb{E}_{y}\big[H[p_{\max}(\cdot \mid \mathcal{D}_N \cup \{(\mathbf{x}, y)\})]\big],$

where $H[\cdot]$ denotes the differential entropy and the expectation is with respect to the predictive distribution of the GP at $\mathbf{x}$, which is a normal distribution. Computing $a_{\text{ES}}$ directly is intractable for continuous spaces $\mathcal{X}$; prior work has discretized $\mathcal{X}$ and used either Monte Carlo sampling (Villemonteix et al., 2008) or expectation propagation (Hennig & Schuler, 2012). While the former may require many Monte Carlo samples to reduce variance, the latter incurs a run time that is quartic in the number of representer points used in the discretization of $p_{\max}$. An alternative formulation is obtained by exploiting the symmetry of the mutual information, which allows rewriting the ES acquisition function as

$a_{\text{PES}}(\mathbf{x}) = H[p(y \mid \mathcal{D}_N, \mathbf{x})] - \mathbb{E}_{\mathbf{x}^\star}\big[H[p(y \mid \mathcal{D}_N, \mathbf{x}, \mathbf{x}^\star)]\big].$
This acquisition function is known as predictive entropy search (PES, Hernández-Lobato et al., 2014). PES does not require discretization and allows a formal treatment of GP hyperparameters.
Contextual Policy Search (CPS) denotes a model-free approach to reinforcement learning in which the (low-level) policy $\pi_{\boldsymbol{\theta}}$ is parametrized by a vector $\boldsymbol{\theta}$. The choice of $\boldsymbol{\theta}$ is governed by an upper-level policy $\pi_u$. For generalizing learned policies to multiple tasks, the task is characterized by a context vector $\mathbf{s}$ and the upper-level policy $\pi_u(\boldsymbol{\theta} \mid \mathbf{s})$ is conditioned on the respective context. The objective of CPS is to learn the upper-level policy such that the expected return over all contexts

$J_u = \int p(\mathbf{s}) \int \pi_u(\boldsymbol{\theta} \mid \mathbf{s})\, R(\boldsymbol{\theta}, \mathbf{s})\, d\boldsymbol{\theta}\, d\mathbf{s}$

is maximized. Here, $p(\mathbf{s})$ is the distribution over contexts and $R(\boldsymbol{\theta}, \mathbf{s})$ is the expected return when executing the low-level policy with parameter $\boldsymbol{\theta}$ in context $\mathbf{s}$. We refer to Deisenroth et al. (2013) for a recent overview of (contextual) policy search approaches in robotics.
Information-based policies for Bayesian optimization such as ES and PES have performed well empirically. However, as we discuss in Section 3.3, their internal objective of minimizing the uncertainty about the location of the optimizer $\mathbf{x}^\star$, i.e., minimizing the differential entropy of $p_{\max}$, is actually different from (albeit related to) the common external objective of minimizing the simple regret of $\tilde{\mathbf{x}}_N$, the recommendation of BO for $\mathbf{x}^\star$ after $N$ trials. We define the simple regret of $\tilde{\mathbf{x}}_N$ as $R_f(\tilde{\mathbf{x}}_N) = \max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) - f(\tilde{\mathbf{x}}_N)$.
Clearly, $\tilde{\mathbf{x}}_N = \mathbf{x}^\star$ has zero and thus minimum simple regret, but a query that results in the maximal decrease in the entropy of $p_{\max}$ is not necessarily the one that also results in the maximal decrease in the expected simple regret. In this section, we propose minimum regret search (MRS), which explicitly aims at minimizing the expected simple regret.
Let $\mathcal{X}$ be some bounded domain and $f: \mathcal{X} \to \mathbb{R}$ be a function. We are interested in finding the maximum of $f$ on $\mathcal{X}$. Let there be a probability measure $p(f)$ over the space of functions, such as a GP. Based on this $p(f)$, we would ultimately like to select an $\tilde{\mathbf{x}}$ which has minimum simple regret $R_f(\tilde{\mathbf{x}})$. We define the expected simple regret ER of selecting the parameter $\tilde{\mathbf{x}}$ under $p$ as

$\text{ER}_p(\tilde{\mathbf{x}}) = \mathbb{E}_{p(f)}\big[\max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) - f(\tilde{\mathbf{x}})\big].$

In $N$-step Bayesian optimization, we are given a budget of $N$ function evaluations and are to choose a sequence of query points $\mathbf{x}_1, \dots, \mathbf{x}_N$ at which we evaluate $f$ to obtain the (noisy) observations $y_1, \dots, y_N$. Based on this, the posterior measure $p_N = p(f \mid \mathcal{D}_N)$ is estimated and a point $\tilde{\mathbf{x}}_N$ is recommended as the estimate of $\mathbf{x}^\star$ such that the expected simple regret under $p_N$ is minimized, i.e., $\tilde{\mathbf{x}}_N = \arg\min_{\tilde{\mathbf{x}}} \text{ER}_{p_N}(\tilde{\mathbf{x}})$. The minimizer of the expected simple regret under a fixed $p_N$ can be approximated efficiently in the case of a GP since it is identical to the maximizer of the GP's mean $\mu(\mathbf{x})$.
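The expected simple regret of a candidate recommendation can be estimated by Monte Carlo over joint posterior samples on a discretized domain. A minimal sketch, using synthetic stand-in samples instead of actual GP draws:

```python
import numpy as np

def expected_simple_regret(f_samples, idx):
    """Monte Carlo estimate of ER for recommending the point with index
    `idx`: mean over posterior samples of (max over the discretized domain
    minus the value at the recommended point).

    `f_samples` has shape (n_samples, n_points): each row is one function
    drawn jointly from the posterior, evaluated on a fixed discretization.
    """
    return np.mean(f_samples.max(axis=1) - f_samples[:, idx])

rng = np.random.default_rng(0)
f_samples = rng.normal(size=(2000, 5))   # stand-in for joint GP draws
f_samples[:, 2] += 1.0                   # point 2 has the highest mean value
er = np.array([expected_simple_regret(f_samples, i) for i in range(5)])
# The regret-minimizing recommendation coincides with the highest-mean point.
```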
However, in general, $p$ depends on the data and it is desirable to select the data such that ER is also minimized with regard to the resulting $p_N$. We are thus interested in choosing a sequence $\mathbf{x}_1, \dots, \mathbf{x}_N$ such that we minimize the expected simple regret of $\tilde{\mathbf{x}}_N$ with respect to $p_N$, where $p_N$ depends on the queries $\mathbf{x}_i$ and the (potentially noisy) observations $y_i$. As choosing the optimal sequence at once is intractable, we follow the common approach in BO and select the queries sequentially in a myopic way. However, as $\tilde{\mathbf{x}}_N$ itself depends on $p_N$ and is thus unknown at step $n < N$, we have to use proxies for it based on the currently available data $\mathcal{D}_n$. One simple choice for a proxy is the point which has minimal expected simple regret under $p_n$, i.e., $\tilde{\mathbf{x}}_n = \arg\min_{\tilde{\mathbf{x}}} \text{ER}_{p_n}(\tilde{\mathbf{x}})$. Let us denote the updated probability measure on $f$ after performing a query at $\mathbf{x}_q$ and observing the function value $y_q$ by $p_n^{[\mathbf{x}_q, y_q]}$. We define the acquisition function $\text{MRS}_{\text{point}}$ as the expected reduction of the minimum expected regret for a query at $\mathbf{x}_q$, i.e.,

$a_{\text{MRS}_{\text{point}}}(\mathbf{x}_q) = \min_{\tilde{\mathbf{x}}} \text{ER}_{p_n}(\tilde{\mathbf{x}}) - \mathbb{E}_{y_q}\big[\min_{\tilde{\mathbf{x}}} \text{ER}_{p_n^{[\mathbf{x}_q, y_q]}}(\tilde{\mathbf{x}})\big],$

where the expectation is with respect to $p_n$'s predictive distribution at $\mathbf{x}_q$ and we drop the implicit dependence on $\mathcal{D}_n$ in the notation. The next query point would thus be selected as the maximizer of $a_{\text{MRS}_{\text{point}}}$.
One potential drawback of $\text{MRS}_{\text{point}}$, however, is that it does not account for the inherent uncertainty about the final recommendation $\tilde{\mathbf{x}}_N$. To address this shortcoming, we propose using the measure $p_{\max}$ as defined in entropy search (see Section 2) as proxy for $\tilde{\mathbf{x}}_N$. We denote the resulting acquisition function by MRS and define it analogously to $a_{\text{MRS}_{\text{point}}}$:

$a_{\text{MRS}}(\mathbf{x}_q) = \mathbb{E}_{\tilde{\mathbf{x}} \sim p_{\max}}\Big[\text{ER}_{p_n}(\tilde{\mathbf{x}}) - \mathbb{E}_{y_q}\big[\text{ER}_{p_n^{[\mathbf{x}_q, y_q]}}(\tilde{\mathbf{x}})\big]\Big].$

MRS can thus be seen as a more Bayesian treatment, where we marginalize our uncertainty about $\tilde{\mathbf{x}}_N$, while $\text{MRS}_{\text{point}}$ is more akin to a point estimate since we use a single point (the minimizer of the expected simple regret) as proxy for $\tilde{\mathbf{x}}_N$.
Since several quantities in MRS cannot be computed in closed form, we resort to similar discretizations and approximations as proposed for entropy search by Hennig & Schuler (2012). We focus here on sampling based approximations; for an alternative way of approximating based on expectation propagation, we refer to Hennig & Schuler (2012).
Figure 1: (Left) Illustration of the GP posterior, the probability of the maximum $p_{\max}$, and the expected regret ER (scale on the right-hand side). (Right) Illustration of the GP posterior and different acquisition functions. Absolute values have been normalized such that all acquisition functions have the same mean value. Best seen in color.
Firstly, we approximate the expected simple regret $\text{ER}_{p_n}$ by taking Monte Carlo samples of $f$ from $p_n$, which is straightforward in the case of GPs. Secondly, we approximate the expectation over $y_q$ by taking Monte Carlo samples from $p_n$'s predictive distribution at $\mathbf{x}_q$. And thirdly, we discretize $\mathcal{X}$ to a finite set of representer points chosen from a non-uniform measure, which turns the expectation over $p_{\max}$ in the definition of $a_{\text{MRS}}$ into a weighted sum. The discretization of $\mathcal{X}$ is discussed by Hennig & Schuler (2012) in detail; we select the representer points as follows: for each representer point, we sample candidate points uniform randomly from $\mathcal{X}$ and select the representer point by Thompson sampling from $p_n$ on the candidate points. Moreover, estimating $p_{\max}$ on the representer points can be done relatively cheaply by reusing the samples used for approximating ER; this incurs a small bias which had, however, a negligible effect in preliminary experiments.
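The Thompson-sampling selection of representer points can be sketched as follows; `sample_posterior` is a hypothetical interface that returns one joint function draw at the given inputs (faked below by a noisy deterministic surface so the behavior is checkable):

```python
import numpy as np

def select_representer_points(sample_posterior, n_repr, n_cand, bounds, rng):
    """Select representer points by Thompson sampling: for each representer
    point, draw candidate locations uniformly from the domain, draw one joint
    function realization at the candidates, and keep the argmax candidate."""
    lo, hi = np.array(bounds).T
    points = []
    for _ in range(n_repr):
        cand = rng.uniform(lo, hi, size=(n_cand, len(bounds)))
        f_draw = sample_posterior(cand, rng)
        points.append(cand[np.argmax(f_draw)])
    return np.array(points)

# Fake "posterior": a peaked surface plus small noise, so selected points
# should concentrate near its maximum at (0.5, 0.5).
rng = np.random.default_rng(0)
def fake_sample(X, rng):
    return -np.sum((X - 0.5) ** 2, axis=1) + 0.01 * rng.normal(size=len(X))

pts = select_representer_points(fake_sample, 20, 200,
                                [(0.0, 1.0), (0.0, 1.0)], rng)
```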
The resulting estimate of $a_{\text{MRS}}$ would have high variance and would require the number of Monte Carlo samples to be chosen relatively large; however, we can reduce the variance considerably by using common random numbers (Kahn & Marshall, 1953) in the estimation of $a_{\text{MRS}}$ for different query points $\mathbf{x}_q$.
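The effect of common random numbers can be illustrated on a toy difference-of-expectations estimate: reusing the same draws for both terms cancels the shared sampling noise, which mirrors reusing the same function samples when comparing regret reductions across query points. The functions below are illustrative, not the paper's estimator:

```python
import numpy as np

def mc_difference(f, g, n, common=True, seed=0):
    """Monte Carlo estimate of E[f(Z)] - E[g(Z)] with Z ~ N(0, 1).

    With common random numbers, the same draws are reused for both
    expectations, so the shared sampling noise cancels in the difference."""
    rng = np.random.default_rng(seed)
    z_f = rng.normal(size=n)
    z_g = z_f if common else rng.normal(size=n)
    return np.mean(f(z_f)) - np.mean(g(z_g))

f = lambda z: np.sin(z)
g = lambda z: np.sin(z + 0.01)  # a slightly perturbed version of f

# Variability of the difference estimate across 50 independent repetitions:
crn = [mc_difference(f, g, 100, common=True, seed=s) for s in range(50)]
ind = [mc_difference(f, g, 100, common=False, seed=s) for s in range(50)]
```

Because `f` and `g` are highly correlated, the common-random-number estimates vary far less across repetitions than the independent ones.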
Figure 1 presents an illustration of different acquisition functions on a simple one-dimensional target function. The left graphic shows a hypothetical GP posterior (illustrated by its mean and standard deviation) for a small length scale, and the resulting probability of a point being the optimum of $f$, denoted by $p_{\max}$. Moreover, the expected simple regret $\text{ER}(\tilde{\mathbf{x}})$ of selecting $\tilde{\mathbf{x}}$ is shown; its minimum coincides with the maximum of the GP's mean, and the expected regret of this minimizer remains non-zero. We additionally plot the contribution to the expected regret conditioned on the location of the true optimum, to shed some light onto situations in which $\tilde{\mathbf{x}}$ would incur a significant regret: this quantity shows that most of the expected regret of $\tilde{\mathbf{x}}$ stems from situations where the “true” optimum is located in a largely unexplored region. This can be explained by the observation that this area has high uncertainty and is at the same time largely uncorrelated with $\tilde{\mathbf{x}}$ because of the small length scale of the GP.
The right graphic compares different acquisition functions on the same posterior with a fixed set of representer points. Since the assumed GP is noise-free, the acquisition value of any parameter that has already been evaluated is approximately zero. The acquisition functions differ considerably in their global shape: EI becomes large for areas with close-to-maximal predicted mean or with high uncertainty. ES becomes large for parameters which are informative with regard to most of the probability mass of $p_{\max}$. In contrast, $\text{MRS}_{\text{point}}$ becomes maximal in the unexplored high-uncertainty region. This can be explained as follows: according to the current GP posterior, $\text{MRS}_{\text{point}}$ selects the minimizer of the expected regret as its proxy $\tilde{\mathbf{x}}_n$. As shown in Figure 1 (left), most of the expected regret for this choice stems from scenarios where the true optimum lies in the unexplored area. Thus, sampling in this parameter range can reduce the expected regret considerably—either by confirming that the true value of $f$ in this area is actually as small as expected, or by switching to this area if $f$ turns out to be large there. The maximum of MRS is similar to that of $\text{MRS}_{\text{point}}$. However, since MRS takes the whole measure $p_{\max}$ into account, its acquisition surface is smoother in general; in particular, it assigns a larger value to regions which do not cause regret for $\tilde{\mathbf{x}}_n$ itself but do so for alternative recommendations with non-zero $p_{\max}$.
Why does ES not assign a large acquisition value to such query points? This is because ES does not take into account the correlation of different (representer) points under $p$. This, however, would be desirable: reducing the uncertainty regarding optimality among two highly correlated points with large $p_{\max}$ (for instance, two neighboring representer points in the example) will not change the expected regret considerably, since both points will have nearly identical values under all $f$ sampled from $p$. On the other hand, the values of two points which are nearly uncorrelated under $p$ and have non-zero $p_{\max}$ might differ considerably under different $f$, and choosing the wrong one as $\tilde{\mathbf{x}}$ might cause considerable regret. Thus, identifying which of the two is actually better would reduce the regret considerably. This is exactly why MRS assigns a large value to such points.
Several extensions of Bayesian optimization for multi-task learning have been proposed, both for a discrete set of tasks (Krause & Ong, 2011) and for a continuous set of tasks (Metzen, 2015). Multi-task BO has been demonstrated to learn efficiently about a set of discrete tasks concurrently (Krause & Ong, 2011), to allow transferring knowledge learned on cheaper tasks to more expensive tasks (Swersky et al., 2013), and to yield state-of-the-art performance on low-dimensional contextual policy search problems (Metzen et al., 2015), in particular when combined with active learning (Metzen, 2015). In this section, we focus on multi-task BO for a continuous set of tasks; a similar extension for discrete multi-task learning would be straightforward.
A continuous set of tasks is encountered, for instance, when applying BO to contextual policy search (see Section 2). We follow the formulation of BO-CPS (Metzen et al., 2015) and adapt it for MRS where required. In BO-CPS, the set of tasks is encoded in a context vector $\mathbf{s}$ and BO-CPS learns a (non-parametric) upper-level policy which selects for a given context the parameters $\boldsymbol{\theta}$ of the low-level policy. The unknown function $f$ corresponds to the expected return $R(\boldsymbol{\theta}, \mathbf{s})$ of executing a low-level policy with parameters $\boldsymbol{\theta}$ in context $\mathbf{s}$. Thus, the probability measure $p$ (typically a GP) is defined over functions on the joint parameter-context space and is conditioned on the trials performed so far. Since $p$ is defined over the joint parameter and context space, experience collected in one context is naturally generalized to similar contexts.
In passive BO-CPS, on which we focus here (please refer to Metzen (2015) for an active learning approach), the context (task) of a trial is determined externally according to the context distribution, for which we assume a uniform distribution in this work. BO-CPS selects the parameters $\boldsymbol{\theta}$ by conditioning the model on the given context $\mathbf{s}$ and finding the maximum of the acquisition function for fixed $\mathbf{s}$. Acquisition functions such as PI and EI are not easily generalized to multi-task problems, as they are defined relative to an incumbent, typically the best function value observed so far in a task. Since there are infinitely many tasks and no task is visited twice with high probability, the notion of an incumbent is not directly applicable. In contrast, the acquisition functions GP-UCB (Srinivas et al., 2010) and ES (Metzen et al., 2015) have been extended straightforwardly, and the same approach applies also to MRS.
In the first experiment, we conduct a similar analysis as Hennig & Schuler (2012, Section 3.1): we compare different algorithms on a number of single-task functions sampled from a generative model, namely from the same GP-based model that is used internally by the optimizer as surrogate model. (Source code for replicating the reported experiments is available at https://github.com/jmetzen/bayesian_optimization; please refer to Appendix A for an analogous experiment with model mismatch.) This precludes model-mismatch issues and unwanted bias which could be introduced by resorting to common hand-crafted test functions. More specifically, we choose the parameter space to be the 2-dimensional unit domain $\mathcal{X} = [0, 1]^2$ and generate test functions by sampling function values jointly from a GP with an isotropic RBF kernel of fixed length scale and unit signal variance. A GP is fitted to these function values and the resulting posterior mean is used as the test function. Moreover, Gaussian noise is added to each observation. The GP used as surrogate model in the optimizer employed the same isotropic RBF kernel with fixed, identical hyperparameters. In order to isolate effects of the different acquisition functions from effects of different recommendation mechanisms, we used the point which maximizes the GP posterior mean as recommendation regardless of the employed acquisition function. All algorithms were tested on the same set of test functions and with identical numbers of Monte Carlo samples and representer points. We do not provide error bars on the estimates as the data have no parametric distribution; however, we provide additional histograms of the regret distribution.
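Sampling such GP-based test functions can be sketched as follows. This is a simplified version: we return joint prior draws at random inputs rather than fitting a second GP to them, and the length scale of 0.1 is illustrative rather than the value used in the paper:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.1):
    """Isotropic RBF (squared-exponential) kernel with unit signal variance."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def sample_test_function(n_points=250, length_scale=0.1, seed=0):
    """Draw one function from a GP prior on the 2-d unit square by sampling
    its values jointly at random input locations (Cholesky of the kernel)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, 2))
    # Small jitter on the diagonal keeps the Cholesky numerically stable.
    K = rbf_kernel(X, X, length_scale) + 1e-6 * np.eye(n_points)
    y = np.linalg.cholesky(K) @ rng.normal(size=n_points)
    return X, y

X, y = sample_test_function()
```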
Figure 2 summarizes the results for a pure exploration setting, where we are only interested in the quality of the algorithm's recommendation for the optimum after $N$ queries but not in the quality of the queries themselves: according to the median of the simple regret (top left), ES, MRS, and $\text{MRS}_{\text{point}}$ perform nearly identically, while EI is about an order of magnitude worse. GP-UCB performs even worse initially but surpasses EI eventually; given enough additional steps, GP-UCB would reach the same level of mean and median simple regret as MRS, with no significant difference according to a Wilcoxon signed-rank test. PI performs the worst as it exploits too aggressively. These results roughly correspond to prior results (Hennig & Schuler, 2012; Hernández-Lobato et al., 2014) on the same task; note, however, that Hernández-Lobato et al. (2014) used a lower noise level and thus absolute values are not comparable. However, according to the mean simple regret, the picture changes considerably (top right): here, MRS, $\text{MRS}_{\text{point}}$, and EI perform roughly on par while ES is about an order of magnitude worse. This can be explained by the distribution of the simple regrets (bottom): while the distributions are fairly non-normal for all acquisition functions, there are considerably more runs with very high simple regret for ES (10) than for MRS (4) or EI (4).
We illustrate one such case where ES incurs high simple regret in Figure 3. The same set of representer points has been used for ES and MRS. While both ES and MRS assign a non-zero density of representer points to the area of the true optimum of the function (bottom center), ES assigns a high acquisition value only to areas with a high density of $p_{\max}$, in order to further concentrate density in these areas. Note that this is not due to the discretization of $\mathcal{X}$ but because of ES' objective, which is to learn about the precise location of the optimum, irrespective of how much correlation there is between the representer points according to $p$. Thus, predictive entropy search (Hernández-Lobato et al., 2014) would likely be affected by the same deficiency. In contrast, MRS focuses first on areas which have not been explored and have a non-zero $p_{\max}$, since those are areas with high expected simple regret (see Section 3.3). Accordingly, MRS is less likely to incur a high simple regret. In summary, the MRS-based acquisition functions are the only acquisition functions that perform well according to both the median and the mean simple regret; moreover, MRS performs slightly better than $\text{MRS}_{\text{point}}$, and we will focus on MRS subsequently.
We present results on the simulated robotic control task used by Metzen (2015), in which the robot arm COMPI (Bargsten & de Gea, 2015) is used to throw a ball at a target on the ground encoded in a two-dimensional context vector. The robot arm is mounted at the origin of the target area's coordinate system, and contexts are sampled uniform randomly from the target area. The low-level policy is a joint-space dynamical movement primitive (DMP, Ijspeert et al., 2013) with a preselected start and goal angle for each joint and all DMP weights set to 0. This DMP results in throwing a ball such that it hits the ground close to the center of the target area. Adaptation to different target positions is achieved by modifying the parameter vector $\boldsymbol{\theta}$: the first component of $\boldsymbol{\theta}$ corresponds to the execution time of the DMP, which determines how far the ball is thrown, and the further components encode the final angles of the controllable joints. We compare the learning performance for different numbers of controllable joints; the non-controlled joints keep the preselected goal angles of the initial throw. The parameter space is bounded in each dimension.
All approaches use a GP with an anisotropic Matérn kernel for representing $f$, and the kernel's length scales and signal variance are selected in each BO iteration as point estimates using maximum marginal likelihood. Based on preliminary experiments, UCB's exploration parameter is set to a constant value. For MRS and ES, we use the same numbers of Monte Carlo samples and representer points as in Section 5.1. Moreover, we add a “greedy” acquisition function, which always selects the parameters that maximize the mean of the GP for the given context (UCB with zero exploration), and a “random” acquisition function that selects parameters randomly. The return is defined in terms of the distance between the position hit by the ball and the target context, minus a penalty term on the sum of squared joint velocities during DMP execution; both quantities depend indirectly on $\boldsymbol{\theta}$.
Figure 4 summarizes the main results of the empirical evaluation for different acquisition functions. Shown is the mean offline performance of the upper-level policy at test contexts on a grid over the context space. Selecting parameters randomly (“Random”) or greedily (“Greedy”) during learning is shown as a baseline and indicates that generalizing experience using a GP model alone does not suffice for quick learning in this task. In general, performance when learning only the execution time and the first joint is better than when learning several joints at once. This is because the execution time and the first joint already allow adapting the throw to different contexts (Metzen et al., 2015); controlling more joints mostly adds additional search dimensions. MRS and ES perform on par for controlling one or two joints and outperform UCB. For higher-dimensional search spaces (three or four controllable joints), MRS performs slightly better than ES (significant according to a Wilcoxon signed-rank test at the end of learning). A potential reason for this might be the increasing number of areas with potentially high regret in higher-dimensional spaces that may remain unexplored by ES; however, this hypothesis requires further investigation in the future.
We have proposed MRS, a new class of acquisition functions for single- and multi-task Bayesian optimization that is based on the principle of minimizing the expected simple regret. We have compared MRS empirically to other acquisition functions on a synthetic single-task optimization problem and a simulated multi-task robotic control problem. The results indicate that MRS performs favorably compared to the other approaches and less often incurs a high simple regret than ES, since its objective is explicitly focused on minimizing the regret. An empirical comparison with PES remains future work; since PES uses the same objective as ES (minimizing the entropy of $p_{\max}$), it will likely show the same deficit of ignoring areas that have small probability of containing the optimum but could nevertheless cause a large potential regret. On the other hand, in contrast to ES and MRS, PES allows a formal treatment of GP hyperparameters, which can make it more sample-efficient. Future research on approaches for addressing GP hyperparameters and on more efficient approximation techniques for MRS would thus be desirable. Additionally, combining MRS with active learning, as done for entropy search by Metzen (2015), would be interesting. Moreover, we consider MRS to be a valuable addition to the set of base strategies in a portfolio-based BO approach (Shahriari et al., 2014). On a more theoretical level, it would be interesting whether formal regret bounds can be proven for MRS.
This work was supported through two grants of the German Federal Ministry of Economics and Technology (BMWi, FKZ 50 RA 1216 and FKZ 50 RA 1217).
We present results for an identical setup as reported in Section 5.1, with the only difference being that the test functions have been sampled from a GP with a rational quadratic kernel (with fixed length scale and scale-mixture parameter). The kernel used in the GP surrogate model is not modified, i.e., an RBF kernel with fixed length scale is used. Thus, since different kinds of kernels govern the test functions and the surrogate model, we have model mismatch, as is the common case on real-world problems. Figure 5 summarizes the results of the experiment. Interestingly, in contrast to the experiment without model mismatch, for this setup there are also considerable differences in the mean simple regret between MRS and ES: while ES performs slightly better initially, it is outperformed by MRS for larger evaluation budgets. We suspect that this is because ES tends to explore more locally than MRS once $p_{\max}$ has mostly settled onto one region of the search space. More local exploration, however, can be detrimental in the case of model mismatch since the surrogate model is more likely to underestimate the function value in regions which have not been sampled. Thus, a more homogeneous sampling of the search space, as done by the more global exploration of MRS, is beneficial. As a second observation, in contrast to the no-model-mismatch scenario, $\text{MRS}_{\text{point}}$ performs considerably worse than MRS when there is model mismatch. This emphasizes the importance of accounting for uncertainty, particularly when there is model mis-specification.
According to the median simple regret, the difference between MRS, $\text{MRS}_{\text{point}}$, ES, and EI is less pronounced in Figure 5. Moreover, the histograms of the regret distribution exhibit fewer outliers (regardless of the method). We suspect that this stems from properties of the test functions sampled from a GP with a rational quadratic kernel rather than from the model mismatch. However, a conclusive answer would require further experiments.