Bayesian optimization (BO) Brochu_2010Tutorial ; Hennig_2012Entropy ; Shahriari_2016Taking is an efficient method for the global optimization of a black-box function. BO has been successfully employed in selecting chemical compounds Hernandez_2017Parallel , in material design Frazier_2016Bayesian ; li2018accelerating_ICDM , and in the search for hyper-parameters of machine learning algorithms Snoek_2012Practical ; klein2017fast ; chen2018bayesian . These results suggest that BO is more efficient than manual, random, or grid search and achieves better overall performance.
Bayesian optimization finds the global maximizer $x^* = \arg\max_{x} f(x)$ of the black-box function $f$ by incorporating prior beliefs about $f$ and updating the prior with samples drawn from $f$ to get a posterior that better approximates $f$. The model used for approximating the black-box function is called the surrogate model. A popular choice for a surrogate model is the Gaussian process (GP) Rasmussen_2006gaussian , although there are alternatives, such as random forests Hutter_2011Sequential , deep neural networks Snoek_2015Scalable , Bayesian neural networks Springenberg_2016Bayesian and Mondrian trees wang2018batched . This surrogate model is then used to define an acquisition function which determines the next query of the black-box function.
Traditional Bayesian optimization approaches do not take into account the known optimum value $f^* = \max_{x} f(x)$, which is available in advance for some applications. For example, the optimal reward is available for common reinforcement learning benchmarks, and the optimum accuracy is known when tuning a classification algorithm on some datasets. The goal is to efficiently find the best hyper-parameters for a deep reinforcement learning algorithm or a classification algorithm, producing the best performance in the fewest iterations.
In this paper, we consider a novel setting in Bayesian optimization where we know what we are looking for, but we do not know where it is. Specifically, we are given the optimum value $f^*$ and aim to search for the unknown optimum location $x^*$ by utilizing this value.
We incorporate the information about $f^*$ into Bayesian optimization in the following ways. First, we use the knowledge of $f^*$ to build a transformed GP surrogate model from the data. Our intuition in transforming a GP is based on the fact that the black-box function value $f(x)$ should not be above the threshold $f^*$ (since $f^* = \max_{x} f(x)$ by definition). As a result, the GP surrogate should also follow this property. Second, we propose two acquisition functions which make an informed decision based on the observed $f^*$ value. For the first acquisition function, we propose confidence bound minimization to select the location where we are certain (low GP variance) that the expected function value (GP mean) is the closest to the known optimum value. For the second acquisition function, we derive expected regret minimization, which minimizes the regret defined by the expected function value of a selected point and the known optimum value $f^*$.
We validate our model using an extensive set of benchmark functions and by tuning a deep reinforcement learning algorithm where we observe the optimum value in advance. These experiments demonstrate that our proposed framework is both intuitive and empirically outperforms the baselines. Our main contributions are summarized as follows:
- a first study of Bayesian optimization that exploits a known optimum value;
- a transformed Gaussian process surrogate using the knowledge of $f^*$;
- two novel acquisition functions to efficiently select the optimum location given $f^*$;
- a demonstration on tuning a deep reinforcement learning algorithm and XGBoost.
2 Exploiting Known Optimum Value for Bayesian Optimization
We present a new approach for Bayesian optimization given the knowledge of the optimum value $f^*$. Our goal is to utilize this knowledge to improve BO performance in finding the unknown optimum location $x^*$. We first encode $f^*$ to build an informed GP surrogate model through a transformation, and then we propose two acquisition functions which exploit the information in $f^*$.
2.1 Transformed Gaussian Process
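We make use of the knowledge about the optimum value to control the GP surrogate model through a transformation. Our transformation starts with the key observation that the function value $f(x)$ should not be greater than the optimal value $f^*$, by definition of $f^*$ being a maximum value. Therefore, the desired GP surrogate should not go above this threshold. Based on this intuition, we propose the GP transformation given as follows

$$f(x) = f^* - \tfrac{1}{2} g^2(x), \qquad g(x) \sim GP(m_0, K).$$

Using this transformation, the desired property about the function always holds, as $f(x) \le f^*$. We have chosen the non-zero prior mean for $g$ as $m_0 = \sqrt{2 f^*}$ so that the prior mean of $f$ is zero, following the common practice in GP modeling where the output is standardized around zero. Given the observations $\{x_i, y_i\}_{i=1}^{t}$, we can compute the observations for $g$ as $\{x_i, g_i\}_{i=1}^{t}$ where $g_i = \sqrt{2 (f^* - y_i)}$. Then, we can write the posterior of $g$ as $\mu_g(x) = m_0 + k_*^T K^{-1} (\mathbf{g} - m_0)$ and $\sigma_g^2(x) = k_{**} - k_*^T K^{-1} k_*$, where $m_0$ is the non-zero prior mean, $K$ is the kernel matrix over the observed inputs, $k_*$ is the kernel vector between $x$ and the observed inputs, and $k_{**} = k(x, x)$ Rasmussen_2006gaussian .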
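The above transformation causes the distribution for any $f$ to become a non-central $\chi^2$ process, making the analysis intractable. In order to tackle this problem and obtain a posterior distribution that is also Gaussian, we employ an approximation technique presented in gunter2014sampling ; ru2018fast . That is, we perform a local linearization of the transformation $h(g) = f^* - \tfrac{1}{2} g^2$ around $g_0$ and obtain $f \approx h(g_0) + h'(g_0)(g - g_0)$ where the gradient is $h'(g_0) = -g_0$. Following gunter2014sampling ; ru2018fast , we set $g_0 = \mu_g$ to the mode of the posterior distribution of $g$ and obtain an expression for $f$ as

$$f(x) \approx \left[ f^* - \tfrac{1}{2} \mu_g^2(x) \right] - \mu_g(x) \left[ g(x) - \mu_g(x) \right] = f^* + \tfrac{1}{2} \mu_g^2(x) - \mu_g(x)\, g(x).$$

Since a linear transformation of a Gaussian process remains Gaussian, the predictive posterior distribution for $f$ now has a closed form, $f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$, where the predictive mean and variance are given by

$$\mu(x) = f^* - \tfrac{1}{2} \mu_g^2(x), \tag{1}$$

$$\sigma^2(x) = \mu_g(x)\, \sigma_g^2(x)\, \mu_g(x). \tag{2}$$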
Eqs. (1) and (2) are the key to computing our acquisition functions in the next sections. As an effect of the transformation, the predictive uncertainty $\sigma(x)$ of the transformed GP becomes larger than that of the vanilla GP at locations where $\mu(x)$ is low. This is because $\mu_g(x)$ is high when $\mu(x)$ is low, and thus $\sigma(x)$ is high in Eq. (2). This property may lead other acquisition functions (e.g., UCB, EI) to explore more aggressively than they should. We further examine these effects in the supplement.
We visualize the property of our transformed GP and compare it with the vanilla GP in Fig. 1 (top row). By transforming the GP using $f^*$, we encode the knowledge about $f^*$ into the surrogate model and are thus able to keep the surrogate model below $f^*$, as desired, while the vanilla GP is not.
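The two mappings used by the transformed GP can be sketched in a few lines. This is a minimal illustration, assuming the transformation $f = f^* - \tfrac{1}{2} g^2$ and the linearized moments of Eqs. (1) and (2); the function names, and the clipping of observations slightly above $f^*$, are choices made for this sketch rather than part of the method as described.

```python
import math

def to_g_observations(ys, f_star):
    """Map observed values y_i to g_i = sqrt(2 * (f_star - y_i))."""
    # Clipping at zero is an assumption of this sketch: it guards against
    # noisy observations that land slightly above f_star.
    return [math.sqrt(2.0 * max(f_star - y, 0.0)) for y in ys]

def transformed_moments(mu_g, sigma_g, f_star):
    """Linearized predictive mean and standard deviation of f at one point."""
    mu = f_star - 0.5 * mu_g ** 2      # Eq. (1): never exceeds f_star
    sigma = abs(mu_g) * sigma_g        # square root of Eq. (2)
    return mu, sigma

f_star = 1.0
print(to_g_observations([1.0, 0.5, -1.0], f_star))
for mu_g in (0.5, 2.0):
    print(mu_g, transformed_moments(mu_g, 0.1, f_star))
```

With the same $\sigma_g = 0.1$, the point with $\mu_g = 2$ (predicted value $-1$) receives four times the standard deviation of the point with $\mu_g = 0.5$ (predicted value $0.875$), which is the inflation of uncertainty at low predicted values discussed above.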
2.2 Confidence Bound Minimization
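In this section, we introduce confidence bound minimization (CBM) to efficiently select the (unknown) optimum location $x^*$ given the value $f^*$. Our idea is based on the underlying concept of GP-UCB Srinivas_2010Gaussian . We consider that the GP surrogate at any location $x$ satisfies, with high probability,

$$\mu(x) - \sqrt{\beta_t}\, \sigma(x) \le f(x) \le \mu(x) + \sqrt{\beta_t}\, \sigma(x), \tag{3}$$

where $\beta_t$ is a hyper-parameter. Given the knowledge of $f^*$, we find the next point $x_t$ to evaluate by minimizing the confidence bound around the location with the estimated value $\mu(x)$ close to the optimum value $f^*$. That is

$$x_t = \arg\min_{x} \alpha^{CBM}(x) = \arg\min_{x} \; |\mu(x) - f^*| + \sqrt{\beta_t}\, \sigma(x), \tag{4}$$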
where $\mu(x)$ and $\sigma^2(x)$ are the GP mean and variance computed from Eq. (1) and Eq. (2) respectively. We follow Srinivas_2010Gaussian to set the parameter $\beta_t$, which controls the level of exploration. With the above objective function, we aim to quickly locate the area which potentially contains an optimum. Since the acquisition function is non-negative, $\alpha^{CBM}(x) \ge 0$, it takes its minimum value at the ideal location where $\mu(x_t) = f^*$ and $\sigma(x_t) = 0$. When these two conditions are met, we can conclude that $f(x_t) = f^*$, and thus $x_t$ is what we are looking for, by the property in Eq. (3).
Because CBM involves a hyper-parameter $\beta_t$ to which performance can be sensitive, we next propose another acquisition function that uses the knowledge of $f^*$ without any hyper-parameter.
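As an illustration of the CBM rule, the following sketch scores a handful of candidate locations and picks the one with the smallest score. It assumes the score $|\mu(x) - f^*| + \sqrt{\beta_t}\,\sigma(x)$ and takes the predictive mean and standard deviation as given; the candidate values and the function names are hypothetical.

```python
import math

def cbm(mu, sigma, f_star, beta):
    """Confidence bound minimization score; smaller is better."""
    return abs(mu - f_star) + math.sqrt(beta) * sigma

def select_next(candidates, f_star, beta):
    """Return the candidate x with the smallest CBM score.

    `candidates` is a list of (x, mu, sigma) tuples, where mu and sigma are
    assumed to come from the surrogate's predictive distribution.
    """
    return min(candidates, key=lambda c: cbm(c[1], c[2], f_star, beta))[0]

# Hypothetical predictions at three locations.
candidates = [(0.1, 0.95, 0.30), (0.4, 0.90, 0.02), (0.8, 0.20, 0.01)]
print(select_next(candidates, f_star=1.0, beta=4.0))
```

Here the first candidate has the mean closest to $f^*$ but a large uncertainty, so the second candidate, which is both close to $f^*$ and confidently predicted, is selected.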
2.3 Expected Regret Minimization
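We next develop the second acquisition function to exploit the known optimum value, called expected regret minimization (ERM). We start with the regret function $r(x) = f^* - f(x)$ where $f^* = \max_{x} f(x)$ is the known global optimum value. The likelihood of the regret $r(x)$ on a normal posterior distribution is as follows

$$p(r) = \frac{1}{\sqrt{2\pi}\, \sigma(x)} \exp\left( -\frac{1}{2} \frac{[f^* - \mu(x) - r(x)]^2}{\sigma^2(x)} \right). \tag{5}$$

As the end goal in optimization is to minimize the regret, we define our acquisition function as the expected regret, $\alpha^{ERM}(x) = \mathbb{E}[r(x)]$. Using the likelihood function in Eq. (5), we write the expected regret minimization acquisition function as

$$\alpha^{ERM}(x) = \mathbb{E}[r(x)] = \int_{0}^{\infty} r \, \frac{1}{\sqrt{2\pi}\, \sigma(x)} \exp\left( -\frac{1}{2} \frac{[f^* - \mu(x) - r]^2}{\sigma^2(x)} \right) dr. \tag{6}$$

Let $z = \frac{f^* - \mu(x)}{\sigma(x)}$; we obtain the closed-form computation as follows

$$\alpha^{ERM}(x) = \sigma(x)\, \phi(z) + [f^* - \mu(x)]\, \Phi(z), \tag{7}$$

where $\phi$ and $\Phi$ are the standard normal pdf and cdf, respectively. To select the next point, we minimize this acquisition function, which is equivalent to minimizing the expected regret, $x_t = \arg\min_{x} \alpha^{ERM}(x)$.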
Our choice in Eq. (7) is the location that minimizes the expected regret. We can see that this acquisition function is always non-negative, $\alpha^{ERM}(x) \ge 0$. It is minimized at the ideal location $x_t$, i.e., $\alpha^{ERM}(x_t) = 0$, when $\mu(x_t) = f^*$ and $\sigma(x_t) = 0$. This case happens at the desired location, where the expected regret is zero.
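The closed form of the ERM acquisition can be evaluated with the standard library alone. The sketch below assumes the expression $\sigma(x)\phi(z) + [f^* - \mu(x)]\Phi(z)$ with $z = (f^* - \mu(x)) / \sigma(x)$; the handling of $\sigma = 0$ as a limiting case and the function names are choices made for this example.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def erm(mu, sigma, f_star):
    """Expected regret at a point with predictive mean mu and std sigma."""
    gap = f_star - mu
    if sigma <= 0.0:
        # Limiting case assumed for this sketch: no uncertainty left.
        return max(gap, 0.0)
    z = gap / sigma
    return sigma * norm_pdf(z) + gap * norm_cdf(z)

print(erm(mu=1.0, sigma=0.0, f_star=1.0))   # ideal location: zero expected regret
print(erm(mu=0.9, sigma=0.05, f_star=1.0))  # close to f_star and fairly certain
print(erm(mu=0.9, sigma=0.50, f_star=1.0))  # same mean, much less certain
```

For a fixed mean, the value grows with the predictive standard deviation, so minimizing it favors locations that are both close to $f^*$ and confidently predicted.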
We summarize all steps in Algorithm 1. Given the original observations $\{x_i, y_i\}_{i=1}^{t}$ and $f^*$, we compute $g_i = \sqrt{2 (f^* - y_i)}$, then build a transformed GP using $\{x_i, g_i\}_{i=1}^{t}$. Using the transformed GP, we can predict the mean $\mu(x)$ and uncertainty $\sigma(x)$ at any location $x$ from Eqs. (1) and (2), which are used to compute the CBM and ERM acquisition functions in Eq. (4) and Eq. (7), respectively. Our formulas are in closed form and the algorithm is easy to implement. In addition, the computational complexity is as low as that of GP-UCB and EI.
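The steps above can be sketched end to end on a one-dimensional toy problem. This is an illustrative sketch, not the authors' implementation: it assumes NumPy, a unit-variance squared exponential kernel with a fixed length-scale of 0.2, a grid search over candidates, a non-negative $f^*$, and a rule that already evaluated grid points are never queried again. The toy objective and all function names are hypothetical.

```python
import math
import numpy as np

def se_kernel(a, b, ell=0.2):
    """Squared exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def tgp_posterior(X, y, Xs, f_star, jitter=1e-6):
    """Predictive mean and standard deviation of f under the transformed GP."""
    m0 = math.sqrt(2.0 * f_star)                      # prior mean of g
    g = np.sqrt(2.0 * np.maximum(f_star - y, 0.0))    # observations of g
    K = se_kernel(X, X) + jitter * np.eye(len(X))
    Ks = se_kernel(X, Xs)
    mu_g = m0 + Ks.T @ np.linalg.solve(K, g - m0)
    var_g = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    sigma_g = np.sqrt(np.maximum(var_g, 0.0))
    # Linearized moments of f, as in Eqs. (1) and (2).
    return f_star - 0.5 * mu_g ** 2, np.abs(mu_g) * sigma_g

def erm(mu, sigma, f_star):
    """Closed-form expected regret, evaluated element-wise."""
    sigma = np.maximum(sigma, 1e-12)
    z = (f_star - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return sigma * pdf + (f_star - mu) * cdf

def run_bo(f, f_star, X_init, grid, n_iter=10):
    X = np.array(X_init, dtype=float)
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        mu, sigma = tgp_posterior(X, y, grid, f_star)
        acq = erm(mu, sigma, f_star)
        acq[np.isin(grid, X)] = np.inf    # assumption: never re-query a point
        x_next = grid[np.argmin(acq)]
        X, y = np.append(X, x_next), np.append(y, f(x_next))
    return X, y

def objective(x):
    """Toy objective with known maximum value 1.0 (attained at x = 0.3)."""
    return 1.0 - (x - 0.3) ** 2

grid = np.linspace(0.0, 1.0, 101)
X, y = run_bo(objective, 1.0, grid[[10, 90]], grid)
print("best value found:", y.max())
```

Swapping `erm` for a CBM score changes only the acquisition line; the surrogate and the loop stay the same. In practice the kernel hyper-parameters would be fitted rather than fixed, and a continuous optimizer would replace the grid.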
Although our ERM is inspired by EI in the way that we define the regret function and take the expectation, the resulting approach differs as follows. The original EI strategy balances exploration and exploitation, i.e., it prefers high GP mean and high GP variance. ERM, on the other hand, does not encourage such a trade-off directly. Instead, ERM selects the point that minimizes the expected regret, with $\mu(x)$ close to the known $f^*$ and with low variance, so that the GP estimate at the chosen location is reliable. Then, if the chosen location turns out worse than expected (e.g., a poor function value), ERM moves to another place which minimizes the expected regret under the updated GP. The behaviors of EI and our ERM are therefore radically different.
Illustration of CBM and ERM
We illustrate our proposed CBM and ERM in comparison with the standard UCB and EI, in both the vanilla GP and transformed GP settings, in Fig. 1. Our acquisition functions make use of the knowledge of $f^*$ to make an informed decision about where we should query. That is, CBM and ERM select the location where the GP mean $\mu(x)$ is close to the optimal value $f^*$ and where we are highly certain about it, i.e., $\sigma(x)$ is low. On the other hand, GP-UCB and EI will always keep exploring, following the explore-exploit principle, without using the knowledge of $f^*$. As a result, GP-UCB and EI cannot identify the unknown location $x^*$ as efficiently as our acquisition functions.
3 Experiments
The main goal of our experiments is to show that we can effectively exploit the known optimum value to improve Bayesian optimization performance. We first demonstrate the efficiency of our model on benchmark functions. Then, we perform hyper-parameter optimization for a deep reinforcement learning task on the CartPole problem and for an XGBoost classifier on the Skin Segmentation dataset, where the optimum values are publicly available. We provide additional experiments in the supplement.
To the best of our knowledge, there is no existing baseline that directly uses the known optimum value for BO. We compare our model with vanilla BO methods that do not know the optimum value, including GP-UCB Srinivas_2010Gaussian and EI Mockus_1978Application . In addition, we create two other baselines which can incorporate the value of $f^*$ into the decision function: EI+ (using $f^*$ as the incumbent) and MES+ (using $f^*$ instead of sampling it with either Thompson or Gumbel sampling).
The experiments are repeated independently. All implementations are in Python. We run the deep reinforcement learning experiment on an NVIDIA GTX 1080 GPU machine. We use the squared exponential kernel, whose hyper-parameter is chosen by maximizing the GP marginal likelihood; the input is scaled and the output is standardized for robustness.
We follow Theorem 3 in Srinivas_2010Gaussian to specify $\beta_t$. Our CBM and ERM use a transformed Gaussian process (Sec. 2.1) in all experiments. We find empirically that using a transformed GP as the surrogate boosts the performance of CBM and ERM significantly compared with using a vanilla GP. We run all baselines using both surrogates and report the best performance of the two settings. We present the details of these experiments in the supplement.
3.1 Comparison on benchmark function given the known optimum value
We perform optimization tasks on common benchmark functions (https://www.sfu.ca/~ssurjano/optimization.html). For these functions, we assume that the optimum value $f^*$ is available in advance and is given to the algorithm. We use the simple regret for comparison, defined as $f^* - \max_{i \le t} f(x_i)$ for a maximization problem.
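For concreteness, the simple regret curve of a run can be computed from the sequence of observed function values as follows. This is a minimal sketch assuming a maximization problem and the definition $f^* - \max_{i \le t} f(x_i)$; the function name and the example values are hypothetical.

```python
def simple_regret(ys, f_star):
    """Simple regret after each evaluation, for a maximization problem."""
    best = float("-inf")
    curve = []
    for y in ys:
        best = max(best, y)           # best value observed so far
        curve.append(f_star - best)
    return curve

# Hypothetical observed values from four evaluations, with f_star = 1.0.
print(simple_regret([0.2, 0.5, 0.4, 0.9], f_star=1.0))
```

The curve is non-increasing by construction and reaches zero only once a point with value $f^*$ has been evaluated.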
The experimental results are presented in Fig. 2, which shows that our proposed CBM and ERM are among the best approaches over all problems considered. This is because our framework utilizes the additional knowledge of $f^*$ to build an informed surrogate model and decision functions. In particular, ERM outperforms all methods by a wide margin. While CBM can be sensitive to the hyper-parameter $\beta_t$, ERM has no hyper-parameter and is thus more robust.
In particular, our approaches exploiting $f^*$ perform significantly better than the baselines on the gSobol and Alpine1 functions. These high-dimensional functions are typically challenging to optimize without the knowledge of $f^*$.
3.2 Tuning machine learning algorithms with optimal values available
A popular application of BO is hyper-parameter tuning of machine learning models. Some machine learning tasks come with an optimal value that is known in advance. We consider tuning (1) a deep reinforcement learning task on the CartPole problem barto1983neuronlike and (2) a classification task using XGBoost on the Skin dataset. Further details of the experiments are described in the supplement.
Deep Reinforcement Learning.
CartPole is a pendulum with a center of gravity above its pivot point. The goal is to keep the cartpole balanced by controlling the pivot point. The reward performance in CartPole is often averaged over 100 consecutive trials. The maximum reward is known from the literature (https://gym.openai.com/envs/CartPole-v0/) as $f^* = 200$.
We then use a deep reinforcement learning (DRL) algorithm to solve the CartPole problem and use Bayesian optimization to tune this DRL algorithm. In particular, we use the advantage actor critic (A2C) algorithm Sutton_1998Reinforcement , which has three sensitive hyper-parameters: the discount factor, the learning rate for the actor model, and the learning rate for the critic model. We choose not to optimize the deep learning architecture for simplicity. We use Bayesian optimization given the known optimum value $f^*$ to find the best hyper-parameters for the A2C algorithm. We present the results in Fig. 3, where our ERM reaches the optimal performance, outperforming all other baselines. In Fig. 3 (left), we visualize the points selected by our ERM acquisition function. Our ERM initially explores several places and then exploits the high-value region (yellow dots).
Table: XGBoost hyper-parameters tuned in this experiment (including min child weight).
We demonstrate a classification task using XGBoost chen2016xgboost on the Skin Segmentation dataset (https://archive.ics.uci.edu/ml/datasets/skin+segmentation), where the best accuracy is known, as shown in Table 1 of Le_2016Nonparametric .
To optimize the integer (ordinal) variables, we optimize in the continuous space and round the resulting scalars to the nearest integer values. We present the result in Fig. 4. Our proposed ERM is the best approach, outperforming all the baselines by a wide margin. This demonstrates the benefit of exploiting the optimum value $f^*$ in Bayesian optimization.
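The rounding step for ordinal hyper-parameters can be sketched as follows. The hyper-parameter names, bounds, and proposed values below are hypothetical and only illustrate the mechanics; rounding halves upward and clipping to the bounds are choices made for this example.

```python
import math

def round_ordinal(value, lower, upper):
    """Round a continuous proposal to the nearest integer inside [lower, upper]."""
    nearest = math.floor(value + 0.5)   # round half up, unlike built-in round()
    return int(min(max(nearest, lower), upper))

# Hypothetical continuous proposals from the optimizer and integer bounds.
proposal = {"min_child_weight": 3.6, "max_depth": 11.4}
bounds = {"min_child_weight": (1, 20), "max_depth": (1, 10)}
rounded = {k: round_ordinal(v, *bounds[k]) for k, v in proposal.items()}
print(rounded)  # {'min_child_weight': 4, 'max_depth': 10}
```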
3.3 What happens if we misspecify the optimum value
We provide further analysis of our model by considering a counterfactual setting. That is, we set $f^*$ to a value which is not the true optimum of the black-box function. Specifically, we both over-specify $f^*$ (larger than the true value) and under-specify it (smaller than the true value) in a maximization problem.
We experiment with our ERM using this misspecified setting of $f^*$ in Fig. 5. The results suggest that our algorithm has the best performance when it uses the true value of $f^*$ for Hartmann and gSobol. Both over-specifying and under-specifying the optimum value lead to worse performance. In particular, under-specifying results in worse performance than over-specifying. This is because our acquisition function will keep exploiting an area once it has wrongly identified that area as optimal. On the other hand, if we over-specify $f^*$, our algorithm will keep searching for the optimum because it cannot find a point where both conditions, $\mu(x) = f^*$ and $\sigma(x) = 0$, are met.
We make the following observations. If we know the true value of $f^*$, ERM gives the best result. If we do not know the exact value of $f^*$ but know an over-specified value, we can use either ERM or vanilla EI. If we do not know the true $f^*$ at all, we should use vanilla EI or other acquisition functions for the best performance.
4 Conclusion and Future Work
In this paper, we have considered a new setting in Bayesian optimization with a known optimum value. We present a transformed Gaussian process surrogate to model the objective function better by exploiting the knowledge of $f^*$. Then, we propose two decision strategies which exploit the optimum value to make informed decisions. Our approaches are intuitively simple and easy to implement. By using the extra knowledge of $f^*$, we demonstrate that our ERM can converge quickly to the optimum on benchmark functions and real-world applications.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
-  Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, (5):834–846, 1983.
-  Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
-  Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
-  Yutian Chen, Aja Huang, Ziyu Wang, Ioannis Antonoglou, Julian Schrittwieser, David Silver, and Nando de Freitas. Bayesian optimization in alphago. arXiv preprint arXiv:1812.06855, 2018.
-  Andreas Damianou and Neil Lawrence. Deep gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
-  Peter I Frazier and Jialei Wang. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pages 45–75. Springer, 2016.
-  Tom Gunter, Michael A Osborne, Roman Garnett, Philipp Hennig, and Stephen J Roberts. Sampling for inference in probabilistic models with fast bayesian quadrature. In Advances in neural information processing systems, pages 2789–2797, 2014.
-  Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
-  José Miguel Hernández-Lobato, James Requeima, Edward O Pyzer-Knapp, and Alán Aspuru-Guzik. Parallel and distributed thompson sampling for large-scale accelerated exploration of chemical space. In International Conference on Machine Learning, pages 1470–1479, 2017.
-  Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.
-  Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528–536, 2017.
-  Trung Le, Vu Nguyen, Tu Dinh Nguyen, and Dinh Phung. Nonparametric budgeted stochastic gradient descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 654–572, 2016.
-  Cheng Li, Rana Santu, Sunil Gupta, Vu Nguyen, Svetha Venkatesh, Alessandra Sutti, David Rubin De Celis Leal, Teo Slezak, Murray Height, Mazher Mohammed, et al. Accelerating experimental design by incorporating experimenter hunches. In IEEE International Conference on Data Mining (ICDM), pages 257–266, 2018.
-  Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of bayesian methods for seeking the extremum. Towards global optimization, 2(117-129):2, 1978.
-  Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
-  Carl Edward Rasmussen. Gaussian processes for machine learning. 2006.
-  Binxin Ru, Michael Osborne, and Mark McLeod. Fast information-theoretic bayesian optimisation. In International Conference on Machine Learning, 2018.
-  Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
-  Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 2171–2180, 2015.
-  Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134–4142, 2016.
-  Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.
-  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
-  Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pages 745–754, 2018.
-  Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In International Conference on Machine Learning, pages 3627–3635, 2017.
Appendix A
In the supplementary material, we provide the derivation for the expected regret minimization and additional details about the experiments.
A.1 Derivation for Expected Regret Minimization
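We are given an optimization problem $x^* = \arg\max_{x \in \mathcal{X}} f(x)$ where $f$ is a black-box function that we can evaluate pointwise. Let $D_t = \{x_i, y_i\}_{i=1}^{t}$ be the observation set consisting of inputs $x_i$ and outcomes $y_i$, and let $\mathcal{X}$ be the bounded search space. We define the regret function $r(x) = f^* - f(x)$ where $f^* = \max_{x \in \mathcal{X}} f(x)$ is the known global optimum value. The likelihood of the regret $r(x)$ on a normal posterior distribution is as follows

$$p(r) = \frac{1}{\sqrt{2\pi}\, \sigma(x)} \exp\left( -\frac{1}{2} \frac{[f^* - \mu(x) - r(x)]^2}{\sigma^2(x)} \right). \tag{8}$$

Using the likelihood function in Eq. (8), the expected regret can be written as

$$\mathbb{E}[r(x)] = \int_{0}^{\infty} r \, \frac{1}{\sqrt{2\pi}\, \sigma(x)} \exp\left( -\frac{1}{2} \frac{[f^* - \mu(x) - r]^2}{\sigma^2(x)} \right) dr.$$

As the ultimate goal in optimization is to minimize the regret, we define our acquisition function as this expected regret, $\alpha^{ERM}(x) = \mathbb{E}[r(x)]$. Let $u = \frac{f^* - \mu(x) - r}{\sigma(x)}$, then $r = f^* - \mu(x) - u\, \sigma(x)$ and $dr = -\sigma(x)\, du$. The lower limit $r = 0$ corresponds to $u = z$ with $z = \frac{f^* - \mu(x)}{\sigma(x)}$, and $r \to \infty$ corresponds to $u \to -\infty$. We write $\alpha^{ERM}(x)$ as

$$\alpha^{ERM}(x) = \int_{-\infty}^{z} [f^* - \mu(x) - u\, \sigma(x)]\, \phi(u)\, du = \sigma(x) \int_{-\infty}^{z} -u\, \phi(u)\, du + [f^* - \mu(x)] \int_{-\infty}^{z} \phi(u)\, du. \tag{9}$$

We compute the first term in Eq. (9), using $\frac{d}{du} \phi(u) = -u\, \phi(u)$, as

$$\sigma(x) \int_{-\infty}^{z} -u\, \phi(u)\, du = \sigma(x) \big[ \phi(u) \big]_{-\infty}^{z} = \sigma(x)\, \phi(z).$$

Next, we compute the second term in Eq. (9) as

$$[f^* - \mu(x)] \int_{-\infty}^{z} \phi(u)\, du = [f^* - \mu(x)]\, \Phi(z).$$

Combining the two terms with $z = \frac{f^* - \mu(x)}{\sigma(x)}$, we obtain the acquisition function as follows

$$\alpha^{ERM}(x) = \sigma(x)\, \phi(z) + [f^* - \mu(x)]\, \Phi(z),$$

where $\phi$ is the standard normal pdf and $\Phi$ is the cdf. To select the next point, we minimize this acquisition function, which is equivalent to minimizing the expected regret:

$$x_t = \arg\min_{x \in \mathcal{X}} \alpha^{ERM}(x).$$

We can see that this acquisition function is minimized when $\mu(x_t) = f^*$ and $\sigma(x_t) = 0$. Our chosen point is the one which offers the smallest expected regret. We aim to find the point with the desired property $f(x_t) = f^*$.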
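The closed form can be checked numerically against the defining integral. The sketch below assumes the regret is integrated over $r \ge 0$ against a normal density with mean $f^* - \mu(x)$ and standard deviation $\sigma(x)$, and compares a trapezoidal estimate with $\sigma(x)\phi(z) + [f^* - \mu(x)]\Phi(z)$. The function names and test values are hypothetical.

```python
import math

def erm_closed_form(mu, sigma, f_star):
    z = (f_star - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * pdf + (f_star - mu) * cdf

def erm_numeric(mu, sigma, f_star, n=20000):
    """Trapezoidal estimate of the integral of r * p(r) over r >= 0."""
    mean_r = f_star - mu
    upper = max(mean_r, 0.0) + 10.0 * sigma   # the tail beyond this is negligible
    h = upper / n
    norm = math.sqrt(2.0 * math.pi) * sigma
    total = 0.0
    for i in range(n + 1):
        r = i * h
        dens = math.exp(-0.5 * ((r - mean_r) / sigma) ** 2) / norm
        total += (0.5 if i in (0, n) else 1.0) * r * dens
    return total * h

for mu, sigma in [(0.3, 0.5), (0.9, 0.2), (1.0, 1.0)]:
    print(mu, sigma, erm_closed_form(mu, sigma, 1.0), erm_numeric(mu, sigma, 1.0))
```

The two columns should agree to within the accuracy of the quadrature, which supports the sign conventions used in the substitution $u = (f^* - \mu(x) - r) / \sigma(x)$.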
A.2 Experiments Appendix
We first describe the two baselines used in the experiments: expected improvement using $f^*$ as the incumbent (EI+) and MES using $f^*$ (MES+). Then, we provide additional information about the deep reinforcement learning experiment in the main paper. Next, we illustrate the experiments of our proposed acquisition function with the vanilla GP and the transformed GP. We show that our acquisition functions perform better with the transformed GP than with the vanilla GP. Although the transformed GP is ideal for our acquisition functions, we show that it may not be useful for EI and GP-UCB.
For completeness, we describe the two baseline acquisition functions which can take the known optimum value $f^*$.
Expected improvement with $f^*$ as the incumbent (EI+).
We have the closed-form acquisition function for EI using $f^*$ as the incumbent to improve from

$$\alpha^{EI+}(x) = [\mu(x) - f^*]\, \Phi(z) + \sigma(x)\, \phi(z), \qquad z = \frac{\mu(x) - f^*}{\sigma(x)},$$

where $\phi$ is the standard normal p.d.f. and $\Phi$ is the standard normal c.d.f. We note that this baseline is radically different from our proposed ERM, as discussed in Sec. 2.3.
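To make the contrast with ERM concrete, the sketch below evaluates both quantities at the same predictive mean and standard deviation. It assumes the standard closed form of expected improvement with the incumbent set to $f^*$, and the ERM closed form from Sec. A.1; function names and test values are hypothetical.

```python
import math

def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ei_plus(mu, sigma, f_star):
    """Expected improvement over the incumbent f_star (to be maximized)."""
    if sigma <= 0.0:
        return max(mu - f_star, 0.0)
    z = (mu - f_star) / sigma
    return (mu - f_star) * _cdf(z) + sigma * _pdf(z)

def erm(mu, sigma, f_star):
    """Expected regret with respect to f_star (to be minimized)."""
    if sigma <= 0.0:
        return max(f_star - mu, 0.0)
    z = (f_star - mu) / sigma
    return sigma * _pdf(z) + (f_star - mu) * _cdf(z)

for mu, sigma in [(0.0, 1.0), (0.9, 0.1), (0.9, 0.5)]:
    print(mu, sigma, ei_plus(mu, sigma, 1.0), erm(mu, sigma, 1.0))
```

Under the Gaussian predictive distribution the two are linked by $\alpha^{ERM}(x) = \alpha^{EI+}(x) + f^* - \mu(x)$, but they are used in opposite directions: EI+ is maximized and rewards large $\sigma(x)$, while ERM is minimized and penalizes both a large gap $f^* - \mu(x)$ and a large $\sigma(x)$.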
Max-value entropy search with $f^*$ (MES+).
MES considers maximizing the information gained about the optimum value $f^*$. When $f^*$ is not known in advance, Wang et al. utilize either Thompson sampling or Gumbel sampling to generate a collection of $f^*$ samples. In our known optimum value setting, because $f^*$ is observed, we can plug the known value directly into the MES acquisition function to obtain

$$\alpha^{MES+}(x) = \frac{\gamma(x)\, \phi[\gamma(x)]}{2\, \Phi[\gamma(x)]} - \log \Phi[\gamma(x)], \qquad \gamma(x) = \frac{f^* - \mu(x)}{\sigma(x)},$$

where $\mu(x)$ and $\sigma^2(x)$ are the GP predictive mean and predictive variance, $f^*$ is the known optimum value, $\Phi$ is the standard normal c.d.f. and $\phi$ is the standard normal p.d.f.
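A sketch of this baseline with a single, known max-value is given below. It assumes the max-value entropy search expression $\gamma\phi(\gamma) / (2\Phi(\gamma)) - \log\Phi(\gamma)$ with $\gamma = (f^* - \mu(x)) / \sigma(x)$ and a strictly positive $\sigma(x)$; the lower clamp on the c.d.f. is a numerical safeguard added for this example, and the function name is hypothetical.

```python
import math

def mes_plus(mu, sigma, f_star):
    """MES acquisition with the known optimum value (to be maximized)."""
    gamma = (f_star - mu) / sigma                     # requires sigma > 0
    pdf = math.exp(-0.5 * gamma * gamma) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))
    cdf = max(cdf, 1e-300)                            # avoid log(0) far below the mean
    return gamma * pdf / (2.0 * cdf) - math.log(cdf)

print(mes_plus(mu=1.0, sigma=1.0, f_star=1.0))  # gamma = 0, value is log(2)
print(mes_plus(mu=0.0, sigma=1.0, f_star=1.0))  # gamma = 1
```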
A.2.2 Details of Advantage Actor Critic on CartPole problem
We use the advantage actor critic (A2C) as the deep reinforcement learning algorithm to solve the CartPole problem. This A2C is implemented in Tensorflow and run on an NVIDIA GTX 2080 GPU machine. In A2C, we use two neural network models to learn the actor and the critic separately, each with a simple neural network architecture. The ranges of the hyper-parameters used in A2C and the optimal parameters found are summarized in Table 2.
We illustrate the reward performance over training episodes using the optimal parameter values found in Fig. 6. In particular, we plot the raw reward and the average reward over 100 consecutive episodes; this average score is used as the evaluation output. With the hyper-parameters found, our A2C reaches the optimum value $f^*$.
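The evaluation score described above, an average of the reward over consecutive episodes, can be computed with a running window. This is a minimal sketch: the function name is hypothetical, and averaging over fewer episodes before the window is full is an assumption of the example.

```python
def average_reward(rewards, window=100):
    """Average reward over the most recent `window` episodes, per episode."""
    averages = []
    total = 0.0
    for i, r in enumerate(rewards):
        total += r
        if i >= window:
            total -= rewards[i - window]   # drop the episode leaving the window
        averages.append(total / min(i + 1, window))
    return averages

# Small window so the effect is visible on a short hypothetical reward trace.
print(average_reward([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```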
Table 2: Ranges and selected values of the A2C hyper-parameters (including the learning rates of the two models).
A.2.3 Comparison using vanilla GP and transformed GP
In this section, we empirically compare the proposed transformed Gaussian process (using the knowledge of $f^*$), presented in Sec. 2.1 of the main paper, and the vanilla Gaussian process as the surrogate model for Bayesian optimization. We then test our ERM and EI on the two surrogate models. From these experiments, we learn that the transformed GP is more suitable for our ERM, while it may not be ideal for EI.
Expected Regret Minimization (ERM).
We perform experiments with the ERM acquisition function using two surrogate models: the vanilla Gaussian process (GP) and the transformed Gaussian process (TGP). Our acquisition function performs better with the transformed GP. The TGP exploits the knowledge about the optimum value $f^*$ to construct the surrogate model. Thus, it is more informative and can be helpful for high-dimensional functions, such as Alpine1 and gSobol, on which ERM on TGP achieves much better performance than ERM on GP. On simpler functions, such as Branin and Hartmann, the transformed GP surrogate achieves performance comparable to the vanilla GP. We visualize all results in Fig. 7.
Expected Improvement (EI).
We then test the EI acquisition function on the two surrogate models, the vanilla Gaussian process and our transformed Gaussian process (using $f^*$), in Fig. 8. In contrast to the case of ERM above, we show that EI performs well on the vanilla GP, but not on the TGP. This can be explained by a side effect of the GP transformation, as follows. From Eq. (1) in the main paper, when a location has a poor (or low) prediction value $\mu(x)$, we will have a large value of $\mu_g(x)$. As a result, this large value of $\mu_g(x)$ makes the uncertainty $\sigma(x)$ larger, from Eq. (2) in the main paper. Therefore, the TGP creates additional uncertainty at locations where $\mu(x)$ is low.
Under this additional uncertainty effect of the TGP, expected improvement may spend more iterations exploring these uncertain areas and take longer to converge than when using the vanilla GP. We note that this effect also applies to GP-UCB and other acquisition functions which rely on the exploration-exploitation trade-off.
On the high-dimensional gSobol function, the TGP makes EI explore aggressively due to the high uncertainty effect described above, which results in worse performance. That is, it keeps exploring poor regions in the early iterations (see the bottom row of Fig. 8).
The transformed Gaussian process (TGP) surrogate takes into account the knowledge of the optimum value $f^*$ to inform the surrogate. However, this transformation may create additional uncertainty in areas where the function value is low. While our proposed acquisition functions ERM and CBM do not suffer from this effect, the existing acquisition functions EI and UCB do. Therefore, we recommend using the TGP only with our acquisition functions for the best optimization performance.