Towards Automatic Bayesian Optimization: A first step involving acquisition functions

03/21/2020 ∙ by Eduardo C. Garrido-Merchán, et al. ∙ Universidad Autónoma de Madrid

Bayesian Optimization is the state-of-the-art technique for the optimization of black boxes, i.e., functions for which we have access neither to the analytical expression nor to the gradients, which are expensive to evaluate and whose evaluations are noisy. The most popular application of Bayesian optimization is the automatic hyperparameter tuning of machine learning algorithms, where we obtain the best configuration of an algorithm by optimizing an estimate of its generalization error. Despite being applied with success, Bayesian optimization methodologies also have hyperparameters that need to be configured, such as the probabilistic surrogate model or the acquisition function used. A bad decision over the configuration of these hyperparameters leads to poor-quality results. Typically, these hyperparameters are tuned by making assumptions about the objective function that we want to evaluate, but there are scenarios where we do not have any prior information about the objective function. In this paper, we propose a first attempt at automatic Bayesian optimization by exploring several heuristics that automatically tune the acquisition function of Bayesian optimization. We illustrate the effectiveness of these heuristics in a set of benchmark problems and a hyperparameter tuning problem of a machine learning algorithm.


1 Introduction

Optimization problems, whose task, assuming minimization, is to retrieve the minimizer $\mathbf{x}^\star = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x})$, are often solved easily when we have access to the gradient of the function that we want to optimize. Nevertheless, there exists a plethora of scenarios where we do not have access to these gradients. Typically, metaheuristics [11] like genetic algorithms [5] are used in this setting. Genetic algorithms and metaheuristics in general are useful when the evaluation of the function is cheap, whether cheap refers to computational time or to other resources such as the budget of the optimization process. This is not always the case. For example, we may consider a scenario where evaluating the function to optimize requires configuring a robot [3] or training a deep neural network [9]. In these scenarios we cannot afford a high number of evaluations. Ideally, we would like a method that suggests an approximation of the optimum of the problem in as few evaluations as possible. An approximate solution to the true minimizer of the problem would be one with low absolute regret at the end of the optimization process, i.e., possibly a local optimum that is not necessarily close, w.r.t. some distance metric in the input space, to the minimizer.

Moreover, we can consider an even more complicated scenario than the one described if the function $f(\mathbf{x})$ that we want to optimize is modelled as a latent variable that we cannot observe directly because it has been contaminated by some random variable, for example, a Gaussian random variable, hence observing $y = f(\mathbf{x}) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_{noise}^2)$ is i.i.d. In other words, repeated evaluations of the same point of the input space follow, without loss of generality, a Gaussian distribution $y \sim \mathcal{N}(f(\mathbf{x}), \sigma_{noise}^2)$. Functions whose analytical expression is unknown, whose evaluations are costly and whose observations are contaminated with noise are often referred to as black boxes. Non-convex black-box optimization has been dealt with successfully by Bayesian Optimization (BO) methodologies [2], which are the current state-of-the-art approach.
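The noisy observation model described above can be illustrated with a short simulation. This is only a sketch: the latent function, the noise level and the helper name `make_noisy_black_box` are hypothetical choices made for the example, not part of the paper.

```python
import random


def make_noisy_black_box(f, noise_std, seed=0):
    """Wrap a latent function f so that each call returns f(x) + Gaussian noise."""
    rng = random.Random(seed)

    def observe(x):
        # y = f(x) + eps, with eps ~ N(0, noise_std^2), i.i.d. across calls
        return f(x) + rng.gauss(0.0, noise_std)

    return observe


def latent(x):
    # Hypothetical latent objective, chosen only for illustration.
    return (x - 0.3) ** 2


observe = make_noisy_black_box(latent, noise_std=0.1)

# Repeated evaluations at the same input scatter around the latent value.
samples = [observe(0.5) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
```

Averaging many noisy evaluations recovers the latent value, but in the black-box setting each evaluation is expensive, which is why a surrogate model is used instead.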

The most popular example of such an optimization is the automatic tuning of the hyperparameters of machine learning algorithms [19], such as the PC algorithm [4], but BO has also been applied to subjective tasks like suggesting cooking recipes [7] and to other applications in robotics, renewable energies and more [18].

Automatic hyperparameter tuning of machine learning algorithms is a desirable process that BO can tackle, but the BO procedure also has hyperparameters that need to be fixed a priori. As we are going to see in more detail in the next section, BO needs to fit a probabilistic surrogate model, such as a Gaussian Process (GP) [17], to the observations in every iteration. This GP, or other model, has a set of hyperparameters associated with it. An acquisition function is then built in every iteration from the GP, or other model, that tries to represent an optimal tradeoff between the prediction of the probabilistic model at every point of the input space and the uncertainty about that prediction. The acquisition function is a free hyperparameter of BO, and a given choice can be a bad one depending on the problem. There are infinitely many acquisition functions in the functional space of possible acquisition functions, and no single acquisition function is the best for every problem. A bad choice of these and other hyperparameters of Bayesian optimization leads to bad results in the optimization process. Hence, we ideally need a process that performs automatic Bayesian optimization without the need to also tune the hyperparameters of the Bayesian optimization algorithm. This work addresses this problem and starts with the automatic decision of which acquisition function to use, by means of different heuristics. We hypothesize that an automatic Bayesian optimization algorithm will deliver better results than manually tuning the hyperparameters of Bayesian optimization in problems where we do not have prior information about them.

This paper is organized as follows. In Section 2 we introduce the fundamental theory of Bayesian optimization and Gaussian processes. Then, in Section 3, we present our proposed approaches for Bayesian optimization. In Section 4 we describe a set of benchmark experiments and a real experiment that show the utility of our approach. Finally, Section 5 summarizes the paper with conclusions and further work.

2 Bayesian Optimization Issues for Automatic Optimization

The Bayesian optimization algorithm is executed in an iterative fashion, where it uses a probabilistic surrogate model as a prior over functions, whose functional space contains all the hypotheses about the objective function that we want to optimize. This model, which has its own set of hyperparameters, is typically a Gaussian Process (GP) [17], but other models such as Bayesian Neural Networks [20] and Random Forests [14] are also used. In order for Bayesian optimization to work, we need to assume that the objective function can be sampled from the model. Hence, depending on the problem, different models may be optimal and some of them may even lead to bad results, so the model and its hyperparameters are themselves a hyperparameter of Bayesian optimization. For example, if we consider the popular GP, the objective function is not stationary, and we do not apply any transformation of the input space to treat this property, then the GP does not serve as a prior for that function and, independently of the other hyperparameters of the Bayesian optimization algorithm and of the number of evaluations, we are going to retrieve bad results.

Even after choosing a probabilistic surrogate model, we need to define the correct hyperparameters for that model. In the typical case of a GP, a wrong choice of kernel can imply that the function that we want to optimize is no longer in the functional space that the GP defines. Even if we optimize the rest of the GP hyperparameters by a maximum likelihood procedure, or take an ensemble of different GPs with hyperparameters sampled from a hyperparameter distribution, these hyperparameters depend on the choice of kernel, so that optimization procedure would be useless, again leading the Bayesian optimization algorithm to bad results.

Bayesian optimization uses the prediction and uncertainty of the probabilistic surrogate model at every point of the input space to build an acquisition function $\alpha(\mathbf{x})$. This acquisition function represents the utility of evaluating every point in order to retrieve the optimum of the objective function. In the standard Bayesian optimization algorithm this utility refers only to the next step of the iteration, which makes it a myopic optimization procedure. The literature contains different acquisition functions that try to represent the optimal tradeoff between exploration of the areas of the space that have not yet been explored and exploitation of previously evaluated good results. Some of these acquisition functions are the following ones:

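Probability of Improvement: $PI(\mathbf{x}) = \Phi(\gamma(\mathbf{x}))$, with $\gamma(\mathbf{x}) = (f(\mathbf{x}_{best}) - \mu(\mathbf{x})) / \sigma(\mathbf{x})$, where $\mu(\mathbf{x})$ and $\sigma(\mathbf{x})$ are the predictive mean and standard deviation of the probabilistic model, $f(\mathbf{x}_{best})$ is the best value observed so far under the minimization convention of the introduction, and $\Phi$ is the standard Gaussian cumulative distribution function. This acquisition function represents, for each point of the space, the probability that the point, if evaluated, is better than the best value observed so far.

Expected Improvement: $EI(\mathbf{x}) = \sigma(\mathbf{x}) \left( \gamma(\mathbf{x}) \Phi(\gamma(\mathbf{x})) + \phi(\gamma(\mathbf{x})) \right)$, where $\phi$ is the standard Gaussian density. Probability of improvement does not take into account, for every point and sample function of the probabilistic model, how much the point improves on the best value found. Expected improvement represents a theoretical improvement over the probability of improvement by considering this quantity.

Lower Confidence Bound: $LCB(\mathbf{x}) = \mu(\mathbf{x}) - \kappa \sigma(\mathbf{x})$. This acquisition function represents a tradeoff between the prediction of the probabilistic model at each point of the space, $\mu(\mathbf{x})$, and exploration of unknown areas, given by the uncertainty of the model at each point of the space, $\sigma(\mathbf{x})$. The parameter $\kappa$ sets the relative weight of the two quantities. Under the minimization convention, points with a low LCB are the promising ones, so its negative plays the role of a utility.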
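The three acquisition functions listed above can be written down in a few lines. The sketch below assumes a Gaussian predictive distribution with mean `mu` and standard deviation `sigma > 0`, a minimization problem, and the usual closed forms for PI, EI and LCB; the function names and the default `kappa` are illustrative choices, not taken from the paper.

```python
import math


def norm_pdf(z):
    """Density of the standard Gaussian."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, best):
    # Probability that the value at this point falls below the incumbent `best`.
    return norm_cdf((best - mu) / sigma)


def expected_improvement(mu, sigma, best):
    # Expected amount by which the value at this point falls below `best`.
    gamma = (best - mu) / sigma
    return sigma * (gamma * norm_cdf(gamma) + norm_pdf(gamma))


def lower_confidence_bound(mu, sigma, kappa=2.0):
    # Optimistic estimate for minimization: lower values are more promising.
    return mu - kappa * sigma


# A point whose predicted mean equals the incumbent has a 50% chance of improving.
pi_value = probability_of_improvement(mu=0.0, sigma=1.0, best=0.0)
ei_value = expected_improvement(mu=0.0, sigma=1.0, best=0.0)
```

Note that PI and EI are utilities to be maximized, whereas LCB as written here is minimized, so any combination of the three needs a consistent sign convention.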

There are many more; in fact, we could generate an infinite number of possible acquisition functions. As in the case of the probabilistic surrogate model, the choice of acquisition function conditions the optimization process. For example, if the function is monotonic, we do not need a heavily exploratory acquisition function, as being exploitative is a better policy in that scenario. On the other hand, if the objective function is contaminated by a high level of noise, the exploitation criterion is practically useless, and a heavily exploratory acquisition function is better suited for that kind of scenario. There is no single best acquisition function for every possible Bayesian optimization scenario, as the no free lunch theorem of optimization states [13].

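for $t = 1, 2, \ldots, T$ do
       1: Find the next point to evaluate by optimizing the acquisition function: $\mathbf{x}_t = \arg\max_{\mathbf{x}} \alpha(\mathbf{x} \mid \mathcal{D}_{1:t-1})$.
       2: Evaluate the black-box objective $f(\cdot)$ at $\mathbf{x}_t$: $y_t = f(\mathbf{x}_t) + \epsilon_t$.
       3: Augment the observed data $\mathcal{D}_{1:t} = \mathcal{D}_{1:t-1} \cup \{(\mathbf{x}_t, y_t)\}$.
       4: Update the Gaussian process model using $\mathcal{D}_{1:t}$.
end for
Result: Optimize the mean of the Gaussian process to find the solution.
Algorithm 1 Bayesian optimization of a black-box objective function.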
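The loop of Algorithm 1 can be sketched end to end in pure Python for a one-dimensional problem. Everything specific here is an assumption made for the example: a zero-mean GP with a fixed squared-exponential kernel (length scale 0.3, no hyperparameter fitting), a small jitter term for numerical stability, the negative LCB as acquisition function, a 201-point grid on [0, 1] as acquisition optimizer, and a noise-free toy objective.

```python
import math


def rbf(a, b, length=0.3):
    """Squared-exponential kernel for 1-D inputs (fixed length scale)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_solve(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(X, Y, jitter=1e-4):
    """Return a function x -> (posterior mean, posterior std) given data (X, Y)."""
    mean_y = sum(Y) / len(Y)  # centre the targets so a zero-mean prior is reasonable
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    weights = backward_solve(L, forward_solve(L, [y - mean_y for y in Y]))

    def predict(x):
        k = [rbf(a, x) for a in X]
        mu = mean_y + sum(ki * wi for ki, wi in zip(k, weights))
        v = forward_solve(L, k)
        var = max(1.0 - sum(vi * vi for vi in v), 1e-12)
        return mu, math.sqrt(var)

    return predict


def bayesian_optimization(f, n_iter=12, kappa=2.0):
    grid = [i / 200 for i in range(201)]
    X = [0.0, 0.5, 1.0]          # initial design
    Y = [f(x) for x in X]
    for _ in range(n_iter):
        predict = fit_gp(X, Y)   # step 4: model updated with all data so far

        def utility(x):
            mu, sigma = predict(x)
            return -(mu - kappa * sigma)  # negative LCB, to be maximized

        candidates = [x for x in grid if x not in X]
        x_next = max(candidates, key=utility)  # step 1: optimize the acquisition
        X.append(x_next)                       # steps 2 and 3: evaluate and augment
        Y.append(f(x_next))
    predict = fit_gp(X, Y)
    # Result: recommend the minimizer of the posterior mean.
    return min(grid, key=lambda x: predict(x)[0]), X, Y


x_rec, X_obs, Y_obs = bayesian_optimization(lambda x: (x - 0.3) ** 2)
```

In the heuristics of the next section, only the `utility` function inside the loop would change from one iteration to the next.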

Bayesian optimization has even more hyperparameters, for example the optimization algorithm of the acquisition function, typically a grid search over the input space followed by a local optimization procedure such as the L-BFGS algorithm [6]; the sampling procedure for the hyperparameter distribution of the probabilistic surrogate model; the number of samples; the type of grid that we use to discretize the input space; the number of grid points; and more. Varying the values of those hyperparameters conditions the quality of the final suggestion of the Bayesian optimization algorithm. We have observed that, although Bayesian optimization is an excellent optimization procedure, it is not automatic, and we need to choose its hyperparameters wisely in order to obtain good results. This is possible if we have prior knowledge about the function that we want to optimize, but this is not always the case.

Hence, if we do not have prior knowledge about the function that we want to optimize, we ideally need a procedure that searches for the best Bayesian optimization hyperparameters, concretely the model and the acquisition function, while the function is being optimized. This work is a first step towards this goal. We explore different simple heuristics to determine whether they affect the optimization behaviour. We have focused only on the acquisition functions, but the selection of a particular probabilistic surrogate model while the optimization is being performed is also an essential issue for automatic Bayesian optimization.

A tutorial with more information about Bayesian optimization can be found in [2]. The next section illustrates the first methods that we can use to perform a simple search over the set of all possible acquisition functions that can be built from a probabilistic surrogate model.

3 Heuristic driven Bayesian Optimization

In this work, we begin to explore the possibilities of combining acquisition functions in order to build criteria that suit the majority of problems or that adapt to the optimization process.

Formally, if we have a set $\mathcal{A} = \{\alpha_1, \ldots, \alpha_N\}$ of acquisition functions, we are going to build criteria that combine these acquisition functions.

We hypothesize that different GP states of an underlying objective function need different acquisition functions in order to discover the optimum of the underlying function. This is in contrast to the typical Bayesian optimization algorithm, which uses the same acquisition function for all the iterations.

We propose, given the same probabilistic surrogate model, to use different acquisition functions, or linear combinations of acquisition functions, within the same Bayesian optimization algorithm. A different acquisition function can be used in every iteration, so that a Bayesian optimization problem no longer defines a single acquisition function, as in standard Bayesian optimization, but an acquisition function generator that produces an acquisition function $\alpha_t$ for every iteration $t$. These generators can use any possible acquisition function as a seed for the generation of acquisition functions in every iteration. We illustrate different approaches for an acquisition function generator, which are basically heuristics that search for the best possible acquisition function.

In practice, we have explored combinations of standard acquisition functions used in the BO literature. We formulate the hyperparameter tuning of acquisition functions for Bayesian optimization as a search problem and start tackling it with heuristics to observe how the global behaviour of Bayesian optimization is conditioned.

We propose the following approaches over the acquisition functions described in the previous section. We could also use an extended set of acquisition functions that includes, for example, PES [12], MES [21] or any other. We also hypothesize that the behaviour of the heuristics will improve with the addition of more, and more diverse, acquisition functions to the seed set that we consider. The first heuristic that we propose is the Random criterion, defined by placing a uniform distribution over the set of seed acquisition functions and sampling from it in every iteration, so that a freshly sampled acquisition function is executed in every iteration. We hypothesize that the optimization process will be enriched by the random execution of different criteria, obtaining good results. In our case, as we only consider the EI, LCB and PI acquisitions, the criterion at iteration $t$ is given by $\alpha_t \sim \mathcal{U}(\{EI, LCB, PI\})$, but in the general case, for a set $\mathcal{A}$ of seed acquisition functions, it would be $\alpha_t \sim \mathcal{U}(\mathcal{A})$.

We could apply the same logic as in the Random case but with a Sequential criterion, $\alpha_t = \alpha_{((t-1) \bmod N) + 1}$. Here we arrange all $N$ acquisitions in an ordered list and take them sequentially, one acquisition for every iteration. We have proposed these two initial strategies in analogy with grid search and random search, hypothesizing that they fully explore the set of seed acquisition functions and enrich the results of the optimization process.
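The Random and Sequential criteria amount to two ways of scheduling the seed acquisition functions over iterations. The following minimal sketch uses plain labels as stand-ins for the seed functions; the generator names are hypothetical.

```python
import itertools
import random


def random_criterion(seeds, rng):
    """Yield one seed per iteration, sampled uniformly from the seed set."""
    while True:
        yield rng.choice(seeds)


def sequential_criterion(seeds):
    """Yield the seeds in a fixed order, one per iteration, cycling forever."""
    return itertools.cycle(seeds)


# Labels stand in for the actual acquisition functions.
seeds = ["EI", "LCB", "PI"]

sequential_schedule = list(itertools.islice(sequential_criterion(seeds), 6))
random_schedule = list(
    itertools.islice(random_criterion(seeds, random.Random(0)), 6)
)
```

In a Bayesian optimization loop, the element produced at iteration t would be the acquisition function optimized at that iteration.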

If we assume that all the acquisition functions can be valid at any time of the optimization process and retrieve different but interesting results, then a logical suggestion is to consider a linear combination of all the considered acquisition functions, that is, the weighted acquisition function criterion, defined by the following expression: $\alpha_w(\mathbf{x}) = \sum_{i=1}^{N} w_i \alpha_i(\mathbf{x})$. In our particular case the weighted criterion function would be $\alpha_w(\mathbf{x}) = w_1 EI(\mathbf{x}) + w_2 LCB(\mathbf{x}) + w_3 PI(\mathbf{x})$.
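A weighted criterion of this kind can be sketched as a higher-order function that combines seed acquisition functions. The sketch assumes that every seed is oriented so that larger values are better (a seed such as LCB, which is minimized, would enter with its sign flipped) and that all seeds take the predictive mean and standard deviation as arguments. The two toy seeds, `exploit` and `explore`, are hypothetical stand-ins.

```python
def weighted_acquisition(seeds, weights):
    """Return the linear combination sum_i w_i * seed_i as a new acquisition."""
    if len(seeds) != len(weights):
        raise ValueError("one weight per seed acquisition function is required")

    def combined(mu, sigma):
        return sum(w * seed(mu, sigma) for w, seed in zip(weights, seeds))

    return combined


def exploit(mu, sigma):
    # Pure exploitation for minimization: prefer a low predicted mean.
    return -mu


def explore(mu, sigma):
    # Pure exploration: prefer high predictive uncertainty.
    return sigma


# With weights (1, kappa) the combination equals the negative LCB with kappa = 2.
acquisition = weighted_acquisition([exploit, explore], [1.0, 2.0])
score = acquisition(0.5, 0.25)
```

Because seeds such as EI and PI live on different scales, the weights in practice also absorb a rescaling between them.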

Lastly, many metaheuristics and machine learning algorithms include mechanisms, such as the mutation probability in genetic algorithms or dropout in deep neural networks, that act as regularizers, enforcing exploration and preventing overfitting, which improves the results. We hypothesize that we can establish an analogy for the acquisition function search, so we introduce a noised criterion, which treats the acquisition as a latent functional variable and contaminates it with i.i.d. Gaussian noise to enforce exploration: $\tilde{\alpha}(\mathbf{x}) = \alpha(\mathbf{x}) + \eta$, with $\eta \sim \mathcal{N}(0, \sigma_\eta^2)$.
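The noised criterion can be sketched as a wrapper around any acquisition function. The wrapper name, the noise level and the toy base acquisition below are assumptions made for illustration; the paper does not specify them here.

```python
import random


def noised_acquisition(base, noise_std, rng):
    """Wrap `base` so every evaluation is perturbed by i.i.d. Gaussian noise."""

    def noisy(*args):
        return base(*args) + rng.gauss(0.0, noise_std)

    return noisy


def base_acquisition(mu, sigma):
    # Hypothetical base utility: negative lower confidence bound with kappa = 2.
    return -mu + 2.0 * sigma


noisy_acquisition = noised_acquisition(base_acquisition, 0.1, random.Random(0))

# Candidates described by (predictive mean, predictive std); the noise can
# change which candidate is ranked first from one iteration to the next.
candidates = [(0.20, 0.05), (0.25, 0.10), (0.60, 0.30)]
chosen = max(candidates, key=lambda c: noisy_acquisition(*c))
```

With `noise_std = 0` the wrapper reduces to the base acquisition, so the noise level controls how much extra exploration is injected.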

All these approaches are heuristic, but they explore a space defined by the set of seed acquisition functions. The weighted acquisition function criterion contains a weight for each acquisition function that measures its importance. This is a generalization of common Bayesian optimization, but it does not solve the automatic Bayesian optimization scenario. If, instead of being hardcoded by the user, these weights were adapted as the problem is being optimized, or as a function of the problem, the optimization would be automatic. As a first attempt towards automatic Bayesian optimization, we propose a metaoptimization of the weights using Bayesian optimization over the weight space. We define a search space of weights that are associated with their respective acquisition functions. Then, we execute a standard Bayesian optimization procedure that gives us the weights that minimize the error predicted by the underlying Bayesian optimization algorithm. By performing this double loop, the weights are optimized and the underlying Bayesian optimization algorithm is automatic. Nevertheless, the upper Bayesian optimization algorithm still needs to be tuned, but we can study several problems to adjust a reasonable prior over the weight space.
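The structure of this double loop can be sketched as an outer search over weights that repeatedly calls an inner optimization run. To keep the example short, plain random search stands in for the outer Bayesian optimization, the weights are assumed to lie on the probability simplex, and `toy_inner_bo` is a hypothetical placeholder for a full inner Bayesian optimization run that returns its final error; none of these choices comes from the paper.

```python
import random


def sample_simplex(n, rng):
    """Draw a weight vector uniformly from the probability simplex."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def meta_optimize_weights(run_inner_bo, n_seeds, n_trials=50, seed=0):
    """Outer loop: search the weight space, scoring each candidate by an inner run.

    Random search is used here as a stand-in for the outer Bayesian optimization.
    """
    rng = random.Random(seed)
    best_weights, best_score = None, float("inf")
    for _ in range(n_trials):
        weights = sample_simplex(n_seeds, rng)
        score = run_inner_bo(weights)  # inner loop: one run with these weights
        if score < best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


def toy_inner_bo(weights):
    # Placeholder for an inner run: pretends the final error is smallest
    # for one particular weighting of three seed acquisition functions.
    target = [0.6, 0.1, 0.3]
    return sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights, best_score = meta_optimize_weights(toy_inner_bo, n_seeds=3)
```

Each outer trial costs a complete inner optimization, which is why the outer loop itself should be sample efficient.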

4 Experiments

We carry out several experiments to evaluate the performance of the heuristics described in the previous section. We also compare the approaches to a pure exploration method based on Random Search [1]. The set of seed acquisition functions and the proposed ones have been implemented in SkOpt [15]. In each experiment carried out in this section we report average results and the corresponding standard deviations. The results reported are averages over 100 repetitions of the corresponding experiment. Means and standard deviations are estimated using 200 bootstrap samples. The hyperparameters of the underlying GPs are estimated by maximum likelihood during the optimization process. The acquisition function of each method is maximized through a grid search.
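One common way to obtain such bootstrap estimates is sketched below. It assumes that the quantity of interest is the mean over repetitions and that its standard deviation is taken from the bootstrap distribution of that mean; the text does not spell out the exact procedure, so this is an interpretation, and the function name is hypothetical.

```python
import random
import statistics


def bootstrap_mean_and_std(values, n_boot=200, seed=0):
    """Estimate the mean and its standard deviation by resampling with replacement."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = [
        statistics.fmean(rng.choices(values, k=n))  # one bootstrap replicate
        for _ in range(n_boot)
    ]
    return statistics.fmean(boot_means), statistics.stdev(boot_means)


# Example with 100 synthetic "repetition" results (illustrative data only).
results = [float(i) for i in range(100)]
mean_estimate, std_estimate = bootstrap_mean_and_std(results)
```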

4.1 Benchmark Experiments

We test the proposed acquisition functions and compare with GP-Hedge over a set of benchmark problems, namely, the Branin, 3-dimensional Hartmann and 3-dimensional Rastrigin functions. We plot the results in Figures 1, 2 and 3.
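For reference, two of these benchmarks can be written in their commonly used forms. The constants of the Branin function below are the standard ones from the optimization literature, and the Rastrigin function uses the usual amplitude of 10; the paper does not state which parameterization or domain it uses, so both are assumptions.

```python
import math


def branin(x1, x2):
    """Branin function in its standard parameterization (global minimum ~0.397887)."""
    a = 1.0
    b = 5.1 / (4.0 * math.pi ** 2)
    c = 5.0 / math.pi
    r = 6.0
    s = 10.0
    t = 1.0 / (8.0 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1.0 - t) * math.cos(x1) + s


def rastrigin(x):
    """Rastrigin function for a point x of any dimension (global minimum 0 at the origin)."""
    return 10.0 * len(x) + sum(
        xi ** 2 - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x
    )


branin_at_optimum = branin(math.pi, 2.275)
rastrigin_at_optimum = rastrigin([0.0, 0.0, 0.0])
```

The many cosine terms of the Rastrigin function create a regular lattice of local optima, which is relevant for the large deviations discussed below.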

Figure 1: Means and standard deviations of the log difference w.r.t the absolute regret of the maximizer of the different considered acquisition functions in the Branin Function.

We can observe that, for the Branin function, the best method is the weighted acquisition function optimized by the metaoptimization process. The GP-Hedge method also delivers good results, tying at the end with the weighted acquisition function. We hypothesize that the good behaviour of the ensemble acquisition functions (weighted and hedge) is a consequence of every seed adding some value in this problem. Taken separately, however, the seeds do not provide good results.

Figure 2: Means and standard deviations of the log difference w.r.t the absolute regret of the maximizer of the different considered acquisition functions in the Hartmann Function.

We observe a different behaviour in the Hartmann function, where only the purely exploitative acquisition functions (EI and PI) report a good result. This happens due to the shape of Hartmann, where exploration is a bad strategy because pure exploitation suffices to reach the optimum. We can observe empirically that EI is better than PI, as it considers the amount of improvement over the incumbent. Ensemble acquisition functions lose performance because they consider exploration or criteria other than EI and PI, but they are not as bad as LCB, which is not a good strategy here. This property of ensemble acquisition functions keeps them from being as bad as the worst seed in a given scenario.

Figure 3: Means and standard deviations of the log difference w.r.t the absolute regret of the maximizer of the different considered acquisition functions in the Rastrigin Function.

In the Rastrigin function, we can observe that the random methods do not perform well, while the others tie with a better result. No acquisition function seems to dominate, perhaps because all of them locate only local optima of Rastrigin. The large standard deviations in the Rastrigin function may be explained by different reasons. The first is the shape of the function, which has many local optima, so that each repetition may end at a different point and hence the deviation is large. Another explanation is that the acquisition function is optimized with a grid search. We would need to perform an L-BFGS optimization starting from the maximum-valued point retrieved by this search to discard the hypothesis that the large deviations are caused by local optima of the acquisition function. Another important consideration is to place a hyperparameter distribution over the GPs and sample from it with an algorithm such as slice sampling, instead of simply optimizing the hyperparameters through maximum likelihood, which can overfit the model because Bayesian optimization performs a small number of evaluations.

4.2 Real Experiment

In this section we tune the learning rate, minimum samples split and maximum tree depth of a Gradient Boosting Ensemble classifier on the Digits dataset. We do not expect the issues of the Rastrigin function in this problem since, typically, the estimate of the generalization error of machine learning algorithms is a smooth function, so we expect that the results retrieved by Bayesian optimization in this case will not show a high standard deviation and will favour the weighted criterion. The results can be seen in Figure 4.

Figure 4: Means and standard deviations of the log difference w.r.t a perfect classification error of the different considered acquisition functions in the Hyperparameter Tuning of a Gradient Boosting Ensemble.

As we can see, the weighted criterion is the best one in this problem, which might contain some local optima and irregularities, as random search also works fairly well, perhaps because certain combinations of parameters generate good results. There is a lot more to do for automatic Bayesian optimization, but the first necessary step towards that goal is to explore the set of all possible acquisition functions through, as in this case, generators of linear combinations of acquisition functions that, on average, produce good results.

5 Conclusions and Further Work

The proposed approaches provide alternatives to the standard acquisition functions for hyperparameter tuning problems. There is still a lot of work to do for automatic Bayesian optimization, such as following an approach similar to this one but with probabilistic graphical models and acquisition function optimizers. In future work, we would like to build a dataset from a plethora of GP states and train a deep neural network that learns to predict which is the best acquisition function to use, or even the best point to consider, given the dataset and the state of the current GP. We would also like to test whether the transformations made in the input space to deal with integer-valued [8] and categorical-valued variables [10] change the behaviour of the given acquisition function heuristics. The final purpose of this research is to employ automatic Bayesian optimization for the optimization of the hyperparameters of the machine learning architecture of creative robots that exhibit human behaviour [16], in order to test machine consciousness hypotheses.

Acknowledgements

The authors gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, grants TIN2016-76406-P and TEC2016-81900-REDT.

References

  • [1] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of machine learning research 13 (Feb), pp. 281–305. Cited by: §4.
  • [2] E. Brochu, V. M. Cora, and N. De Freitas (2010) A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599. Cited by: §1, §2.
  • [3] R. Calandra, N. Gopalan, A. Seyfarth, J. Peters, and M. P. Deisenroth (2014) Bayesian gait optimization for bipedal locomotion. In International Conference on Learning and Intelligent Optimization, pp. 274–290. Cited by: §1.
  • [4] I. Córdoba, E. C. Garrido-Merchán, D. Hernández-Lobato, C. Bielza, and P. Larranaga (2018) Bayesian optimization of the pc algorithm for learning gaussian bayesian networks. In Conference of the Spanish Association for Artificial Intelligence, pp. 44–54. Cited by: §1.
  • [5] L. Davis (1991) Handbook of genetic algorithms. Cited by: §1.
  • [6] G. Gao, A. C. Reynolds, et al. (2004) An improved implementation of the lbfgs algorithm for automatic history matching. In SPE Annual Technical Conference and Exhibition, Cited by: §2.
  • [7] E. C. Garrido-Merchán and A. Albarca-Molina (2018) Suggesting cooking recipes through simulation and bayesian optimization. In International Conference on Intelligent Data Engineering and Automated Learning, pp. 277–284. Cited by: §1.
  • [8] E. C. Garrido-Merchán and D. Hernández-Lobato (2017) Dealing with integer-valued variables in bayesian optimization with gaussian processes. arXiv preprint arXiv:1706.03673. Cited by: §5.
  • [9] E. C. Garrido-Merchán and D. Hernández-Lobato (2019) Predictive entropy search for multi-objective bayesian optimization with constraints. Neurocomputing 361, pp. 50–68. Cited by: §1.
  • [10] E. C. Garrido-Merchán and D. Hernández-Lobato (2020) Dealing with categorical and integer-valued variables in bayesian optimization with gaussian processes. Neurocomputing 380, pp. 20–35. Cited by: §5.
  • [11] F. W. Glover and G. A. Kochenberger (2006) Handbook of metaheuristics. Vol. 57, Springer Science & Business Media. Cited by: §1.
  • [12] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani (2014) Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926. Cited by: §3.
  • [13] Y. Ho and D. L. Pepyne (2002) Simple explanation of the no-free-lunch theorem and its implications. Journal of optimization theory and applications 115 (3), pp. 549–570. Cited by: §2.
  • [14] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown (2017) Auto-weka 2.0: automatic model selection and hyperparameter optimization in weka. The Journal of Machine Learning Research 18 (1), pp. 826–830. Cited by: §2.
  • [15] S. Markov (2017) SKOPT documentation. Cited by: §4.
  • [16] E. C. G. Merchán and M. Molina (2020) A machine consciousness architecture based on deep learning and gaussian processes. arXiv preprint arXiv:2002.00509. Cited by: §5.
  • [17] C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §1, §2.
  • [18] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas (2015) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1.
  • [19] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1.
  • [20] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter (2016) Bayesian optimization with robust bayesian neural networks. In Advances in neural information processing systems, pp. 4134–4142. Cited by: §2.
  • [21] Z. Wang and S. Jegelka (2017) Max-value entropy search for efficient bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3627–3635. Cited by: §3.