Simple and Scalable Parallelized Bayesian Optimization

06/24/2020
by Masahiro Nomura

In recent years, leveraging parallel and distributed computational resources has become essential for solving problems of high computational cost. Bayesian optimization (BO) has shown attractive results on such expensive-to-evaluate problems as hyperparameter optimization of machine learning algorithms. While many parallel BO methods have been developed to search efficiently with these computational resources, they assume synchronous settings or are not scalable. In this paper, we propose a simple and scalable BO method for asynchronous parallel settings. Experiments on a benchmark function and on hyperparameter optimization of multi-layer perceptrons demonstrate the promising performance of the proposed method.


1 Introduction

Black-box optimization is an important class of problems that appears throughout science and industry. In black-box optimization, no algebraic representation of the objective function is given and no gradient information is available; we can only observe the evaluation value of each sampled point.

One of the prominent methods for black-box optimization is Bayesian optimization (BO) (Frazier, 2018; Shahriari et al., 2015; Brochu et al., 2010). There are many successful applications of BO to problems such as hyperparameter optimization of machine learning algorithms (Swersky et al., 2013; Snoek et al., 2012; Bergstra et al., 2011; Feurer & Hutter, 2019) and neural architecture search (Kandasamy et al., 2018b; Liu et al., 2018). BO achieves high sample efficiency by exploiting the observed data and balancing the exploration-exploitation tradeoff.

Now that computing infrastructure is easy to provision through cloud services, it is increasingly important to evaluate the objective function in parallel. In sequential optimization, the next search cannot begin until the evaluation of the previous point has finished. By contrast, parallel optimization allows multiple points to be evaluated at the same time, increasing efficiency depending on the available resources. The extension of BO to parallel environments has been actively studied (Wang et al., 2018; Kathuria et al., 2016; Wu & Frazier, 2016; Shah & Ghahramani, 2015; Desautels et al., 2014; Contal et al., 2013; Kandasamy et al., 2018a; Ginsbourger et al., 2011; Janusevskis et al., 2012; Wang et al., 2016). Most of these studies assume batch evaluations (i.e., synchronous settings), which is problematic when the computational cost varies greatly across evaluation points. For example, in hyperparameter optimization, the training time of a model depends strongly on the hyperparameter configuration. Several BO methods exist for asynchronous parallel settings (Kandasamy et al., 2018a; Ginsbourger et al., 2011; Janusevskis et al., 2012; Wang et al., 2016; Snoek et al., 2012). However, their time complexity is cubic in the number of evaluated points, which becomes an obstacle on larger-scale problems.

In this paper, we propose a simple and scalable BO method for asynchronous parallel settings. The proposed method inherits the scalability of the Tree-structured Parzen Estimator (Bergstra et al., 2011), whose time complexity is linear. In experiments on a benchmark function and on hyperparameter optimization of multi-layer perceptrons, we demonstrate that the proposed method achieves faster convergence within a fixed evaluation budget than asynchronous parallel Thompson Sampling (Kandasamy et al., 2018a), one of the state-of-the-art methods in parallel BO.

2 Background

2.1 Bayesian Optimization

Bayesian optimization (BO) is an efficient method for black-box optimization that repeats the following steps: (1) Based on the data observed so far, BO constructs a surrogate model that captures the uncertainty of the objective function. (2) Using the surrogate model, BO computes an acquisition function that scores candidate points. (3) BO selects the point that optimizes the acquisition function and evaluates the objective function there. (4) BO updates the surrogate model with the newly obtained data and returns to step (2).
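As an illustration only (not the authors' implementation), the following sketch shows this loop for minimization; the `surrogate` and `acquisition` objects are hypothetical placeholders for, e.g., a Gaussian process and EI.

```python
import numpy as np

def bo_loop(objective, surrogate, acquisition, bounds, n_init=5, n_iter=50, seed=0):
    """Generic sequential BO loop (minimization): fit surrogate, score candidates
    with the acquisition function, evaluate the best candidate, and repeat."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))      # initial random design
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        surrogate.fit(X, y)                                   # (1) surrogate with uncertainty
        cand = rng.uniform(lo, hi, size=(1024, len(bounds)))  # (2) candidate points
        scores = acquisition(surrogate, cand, best=y.min())   # (2) acquisition values
        x_next = cand[np.argmax(scores)]                      # (3) next point to evaluate
        y_next = objective(x_next)
        X, y = np.vstack([X, x_next]), np.append(y, y_next)   # (4) update data, repeat
    return X[np.argmin(y)], y.min()
```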

Popular acquisition functions in BO include the expected improvement (EI) (Jones et al., 1998), probability of improvement (Kushner, 1964), upper confidence bound (Srinivas et al., 2010), mutual information (Contal et al., 2014), and knowledge gradient (Frazier et al., 2009). Among them, EI is one of the most common acquisition functions in BO and, for minimization, is calculated as follows:

$$\mathrm{EI}_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid x)\, dy, \qquad (1)$$

where $y^*$ is some threshold. In Gaussian process-based BO (Snoek et al., 2012), the best evaluation value observed so far is typically used as $y^*$.
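When the surrogate is a Gaussian process, the integral in Equation (1) has the well-known closed form sketched below (for minimization, with $y^*$ set to the best observed value); this is a standard formula, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    """EI for minimization: E[max(y* - Y, 0)] with Y ~ N(mu, sigma^2) at each point."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    ei = np.zeros_like(mu)
    mask = sigma > 0
    z = (y_star - mu[mask]) / sigma[mask]
    ei[mask] = (y_star - mu[mask]) * norm.cdf(z) + sigma[mask] * norm.pdf(z)
    return ei
```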

2.2 Tree-structured Parzen Estimator

The Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011, 2013b) is one of the most popular BO methods and is widely used in hyperparameter optimization (Bergstra et al., 2013a; Eggensperger et al., 2013; Falkner et al., 2018; Akiba et al., 2019). In the calculation of EI in Equation (1), instead of modeling $p(y \mid x)$ directly, TPE models $p(x \mid y)$ and $p(y)$ using the following densities:

$$p(x \mid y) = \ell(x) \quad \text{if } y < y^*, \qquad (2)$$
$$p(x \mid y) = g(x) \quad \text{if } y \ge y^*, \qquad (3)$$

where $\ell(x)$ is estimated from the observations whose evaluation values are better than the threshold $y^*$ and $g(x)$ from the remaining observations. Using Equations (2) and (3) together with $\gamma = p(y < y^*)$, TPE calculates EI as follows:

$$\mathrm{EI}_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{\ell(x)} (1 - \gamma) \right)^{-1}. \qquad (4)$$

In each iteration, TPE selects the $x$ that maximizes $\mathrm{EI}_{y^*}(x)$ as the next evaluation point; that is, TPE selects the $x$ that maximizes $\ell(x) / g(x)$. TPE uses a kernel density estimator to model $\ell(x)$ and $g(x)$. Compared to Gaussian process-based BO (Snoek et al., 2012), whose time complexity is cubic in the number of evaluated points, the time complexity of TPE is linear and therefore scalable.
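The sketch below illustrates the TPE selection rule with SciPy's Gaussian KDE standing in for the original Parzen estimator, so it is an approximation of the idea rather than the reference implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_select(X, y, candidates, gamma=0.25):
    """Split observations at the gamma-quantile of y (minimization), fit l(x) on the
    better fraction and g(x) on the rest, and return the candidate maximizing l/g."""
    n_good = max(2, int(np.ceil(gamma * len(y))))   # each KDE needs at least two points
    order = np.argsort(y)
    l = gaussian_kde(X[order[:n_good]].T)           # density of the "good" observations
    g = gaussian_kde(X[order[n_good:]].T)           # density of the remaining observations
    ratio = l(candidates.T) / np.maximum(g(candidates.T), 1e-300)
    return candidates[np.argmax(ratio)]
```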

2.3 Parallelized Bayesian Optimization

Although there are abundant parallel BO methods (Wang et al., 2018; Kathuria et al., 2016; Wu & Frazier, 2016; Shah & Ghahramani, 2015; Desautels et al., 2014; Contal et al., 2013; Kandasamy et al., 2018a; Ginsbourger et al., 2011; Janusevskis et al., 2012; Wang et al., 2016), many of them assume synchronous settings. For problems such as hyperparameter optimization, where the computation time of an evaluation can vary greatly depending on the evaluation point, the optimization method should be adapted to the asynchronous setting. Several BO methods exist for asynchronous settings (Kandasamy et al., 2018a; Ginsbourger et al., 2011; Janusevskis et al., 2012; Wang et al., 2016; Snoek et al., 2012). However, these methods suffer from poor scalability because their time complexity is cubic in the number of evaluated points. In contrast, the proposed method inherits the advantages of TPE and enables optimization with linear time complexity.

3 Proposed Method

We propose a new optimization method that is simple and scalable for asynchronous parallel settings. The proposed method tries to sample more points in regions with a high probability of having an evaluation value better than a threshold $y^\gamma$; in other words, the points are sampled from a distribution proportional to $p(y \le y^\gamma \mid x)$. We note that, unlike typical Bayesian optimization, which selects the $x$ that optimizes the acquisition function, the proposed method samples the points directly from this distribution. By using Equations (2) and (3) and $\gamma = p(y \le y^\gamma)$, $p(y \le y^\gamma \mid x)$ is calculated as follows:

$$p(y \le y^\gamma \mid x) = \frac{\gamma\, \ell(x)}{\gamma\, \ell(x) + (1 - \gamma)\, g(x)}. \qquad (5)$$
Algorithm 1: Proposed method in asynchronous parallel settings
Input: objective function $f$, quantile $\gamma$
1: $D \leftarrow \emptyset$
2: loop (every time one of the workers becomes free)
3:     Sample a point $x$ from the distribution in Equation (5) using rejection sampling
4:     $y \leftarrow$ evaluate $f(x)$
5:     $D \leftarrow D \cup \{(x, y)\}$
6:     Compute $\ell(x)$ and $g(x)$ from $D$ and update the distribution in Equation (5)

Algorithm 1 shows the overall procedure of the proposed method in the asynchronous setting. Because the proposed method samples points probabilistically, it extends naturally to asynchronous parallel settings. We sample from the distribution in Equation (5) using rejection sampling. In the proposed method, one-dimensional kernel density estimation is performed hierarchically to obtain $\ell(x)$ and $g(x)$, as in (Bergstra et al., 2011, 2013b).

While the proposed method is inspired by TPE in its modeling of $\ell(x)$ and $g(x)$, the two methods behave differently in the parallel setting. TPE always selects the $x$ that maximizes $\ell(x) / g(x)$; therefore, it keeps proposing the same or similar points to all workers in parallel settings. In the proposed method, this situation is avoided because the points are sampled probabilistically.
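As a hedged sketch of the sampling step, assuming the target density takes the reconstructed form of Equation (5), rejection sampling with a uniform proposal works because that density is bounded by one; `gaussian_kde` again stands in for the hierarchical one-dimensional estimator used in the paper, so this is an illustrative reading, not the authors' code.

```python
import numpy as np
from scipy.stats import gaussian_kde

def sample_next_point(X, y, bounds, gamma=0.25, rng=None):
    """Accept a uniform proposal x with probability
    gamma*l(x) / (gamma*l(x) + (1-gamma)*g(x)), i.e. the reconstructed Equation (5)."""
    rng = np.random.default_rng(rng)
    lo, hi = np.array(bounds, dtype=float).T
    n_good = max(2, int(np.ceil(gamma * len(y))))
    order = np.argsort(y)
    l = gaussian_kde(X[order[:n_good]].T)        # KDE of the better observations
    g = gaussian_kde(X[order[n_good:]].T)        # KDE of the remaining observations
    while True:
        x = rng.uniform(lo, hi)                  # uniform proposal over the box
        lx, gx = l(x)[0], g(x)[0]
        accept = gamma * lx / (gamma * lx + (1.0 - gamma) * gx + 1e-300)
        if rng.uniform() < accept:               # acceptance ratio is at most one
            return x
```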

4 Experiments

In this section, we assess the performance of the proposed method on a benchmark function and on hyperparameter optimization of a machine learning algorithm. We compared the proposed method with asynchronous parallel Thompson Sampling (Parallel-TS) (Kandasamy et al., 2018a), a Bayesian optimization method that uses a Gaussian process as the surrogate model and Thompson Sampling as the acquisition function. We do not compare with other asynchronous methods such as asynchronous random sampling, the asynchronous upper confidence bound (Desautels et al., 2014; Ginsbourger et al., 2011), and asynchronous expected improvement (Jones et al., 1998), because Parallel-TS shows better performance than these methods (Kandasamy et al., 2018a). For each method, the optimization budget was a fixed amount of wall-clock time in seconds, the same number of parallel workers was used, and multiple independent trials were performed for each problem. Details of the experimental settings are given in Appendix A.

In the first experiment, we optimized the 18-dimensional Hartmann18 function used in (Kandasamy et al., 2018a), which is constructed from the six-dimensional Hartmann function (Surjanovic & Bingham). We modeled the evaluation time as a random variable drawn from a half-normal distribution (Kandasamy et al., 2018a). Further details of the benchmark-function experiment are given in Appendix B.
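The exact construction is not recoverable from this text; a plausible version, following how Kandasamy et al. (2018a) build higher-dimensional Hartmann benchmarks, sums the six-dimensional Hartmann function over disjoint coordinate blocks. Here `hartmann6` is an assumed callable, e.g. from an existing benchmark suite.

```python
import numpy as np

def hartmann18(x, hartmann6):
    """Sum an assumed 6-D Hartmann implementation over three disjoint blocks of an 18-D input."""
    x = np.asarray(x, dtype=float)
    assert x.shape == (18,)
    return sum(hartmann6(x[6 * i: 6 * (i + 1)]) for i in range(3))
```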

In the second experiment, we conducted hyperparameter optimization of multi-layer perceptrons (MLPs). The MLP consists of two fully-connected layers with a softmax at the end. Table 1 shows the six hyperparameters of the MLP we optimized and their respective search spaces; a code sketch of such a search space follows Table 1. We treated the integer-valued hyperparameters as continuous variables and rounded them to integers when evaluating. The maximum number of training epochs and the mini-batch size were fixed. MLPs were trained on the Fashion-MNIST clothing-articles dataset (Xiao et al., 2017), split into training and validation sets. The misclassification rate on the validation dataset was used as the evaluation value.

Hyperparameters Type Range
Learning Rate float
Momentum float
Number of Hidden Nodes (1st Layer) int
Number of Hidden Nodes (2nd Layer) int
Dropout Rate (1st Layer) float
Dropout Rate (2nd Layer) float
Table 1: Details of hyperparameters of MLP on Fashion-MNIST dataset.
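Below is a hedged sketch of how such a search space could be declared with Optuna (Akiba et al., 2019). The ranges are hypothetical placeholders, since the paper's exact ranges are not reproduced in this text, and `train_mlp_and_evaluate` is an assumed helper that trains the MLP and returns the validation misclassification rate.

```python
import optuna

def objective(trial):
    # Hypothetical ranges; the paper's exact search space is not reproduced here.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "momentum": trial.suggest_float("momentum", 0.0, 0.99),
        "n_hidden_1": trial.suggest_int("n_hidden_1", 16, 512),
        "n_hidden_2": trial.suggest_int("n_hidden_2", 16, 512),
        "dropout_1": trial.suggest_float("dropout_1", 0.0, 0.8),
        "dropout_2": trial.suggest_float("dropout_2", 0.0, 0.8),
    }
    return train_mlp_and_evaluate(params)  # assumed helper: validation misclassification rate

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
```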
Figure 1: The sequences of the mean and standard error of the best evaluation values on (a) the Hartmann18 function and (b) MLP on Fashion-MNIST. For each problem, the x-axis denotes the number of evaluations (left) or the computation time (right).

The results of these experiments are shown in Figure 1. To assess the search efficiency and the calculation cost of the proposed method separately, we show two plots per problem: the left plot uses the number of evaluations on the x-axis, and the right plot uses the computation time. In both problems, the proposed method is competitive with Parallel-TS in terms of the number of evaluations; in terms of computation time, however, the proposed method outperforms Parallel-TS. This is because Parallel-TS needs to sample many points from the posterior distribution of the Gaussian process, whereas the proposed method does not require such expensive calculations.

5 Conclusion

In this paper, we proposed a simple and scalable parallelized Bayesian optimization method for black-box optimization. The proposed method enables efficient search by generating more points in regions that are likely to be promising. By inheriting the useful properties of the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011), the proposed method achieves linear time complexity. Furthermore, the proposed method can be naturally extended to asynchronous settings because it is a probabilistic sampling method. The experiments on the benchmark function and hyperparameter optimization have shown the effectiveness of the proposed method.

As future work, we would like to confirm whether the performance of BOHB (Falkner et al., 2018), which combines Hyperband and TPE, improves when its TPE component is replaced with the proposed method. BOHB handles parallelization in an ad-hoc way by reducing the number of samples used to optimize the acquisition function in TPE; we believe its performance may be improved by using the more principled proposed method.

Acknowledgments

The author would like to thank Masashi Shibata for his valuable advice with the experiments.

References

  • Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631, 2019.
  • Bergstra et al. (2013a) James Bergstra, Dan Yamins, and David D Cox. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in science conference, pp. 13–20, 2013a.
  • Bergstra et al. (2013b) James Bergstra, Daniel Yamins, and David Cox. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pp. 115–123, 2013b.
  • Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter Optimization. In Advances in neural information processing systems, pp. 2546–2554, 2011.
  • Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1012.2599, 2010.
  • Contal et al. (2013) Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian Process Optimization with Upper Confidence Bound and Pure Exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 225–240, 2013.
  • Contal et al. (2014) Emile Contal, Vianney Perchet, and Nicolas Vayatis. Gaussian Process Optimization with Mutual Information. In International Conference on Machine Learning, pp. 253–261, 2014.
  • Desautels et al. (2014) Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
  • Eggensperger et al. (2013) Katharina Eggensperger, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger Hoos, and Kevin Leyton-Brown. Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10, pp.  3, 2013.
  • Falkner et al. (2018) Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 1437–1446, 2018.
  • Feurer & Hutter (2019) Matthias Feurer and Frank Hutter. Hyperparameter Optimization. In Automated Machine Learning, pp. 3–33. 2019.
  • Frazier et al. (2009) Peter Frazier, Warren Powell, and Savas Dayanik. The Knowledge-Gradient Policy for Correlated Normal Beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
  • Frazier (2018) Peter I Frazier. A Tutorial on Bayesian Optimization. arXiv preprint arXiv:1807.02811, 2018.
  • Ginsbourger et al. (2011) David Ginsbourger, Janis Janusevskis, and Rodolphe Le Riche. Dealing with asynchronicity in parallel Gaussian Process based global optimization. In Conference of the ERCIM WG on Computing and Statistics, 2011.
  • Janusevskis et al. (2012) Janis Janusevskis, Rodolphe Le Riche, David Ginsbourger, and Ramunas Girdziusas. Expected Improvements for the Asynchronous Parallel Global Optimization of Expensive Functions: Potentials and Challenges. In International Conference on Learning and Intelligent Optimization, pp. 413–418, 2012.
  • Jones et al. (1998) Donald R Jones, Matthias Schonlau, and William J Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global optimization, 13(4):455–492, 1998.
  • Kandasamy et al. (2018a) Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian Optimisation via Thompson Sampling. In International Conference on Artificial Intelligence and Statistics, pp. 133–142, 2018a.
  • Kandasamy et al. (2018b) Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, and Eric P Xing. Neural Architecture Search with Bayesian Optimisation and Optimal Transport. In Advances in Neural Information Processing Systems, pp. 2016–2025, 2018b.
  • Kathuria et al. (2016) Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched Gaussian Process Bandit Optimization via Determinantal Point Processes. In Advances in Neural Information Processing Systems, pp. 4206–4214, 2016.
  • Kushner (1964) Harold J Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering, 86(1):97–106, 1964.
  • Liu et al. (2018) Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive Neural Architecture Search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34, 2018.
  • Shah & Ghahramani (2015) Amar Shah and Zoubin Ghahramani. Parallel Predictive Entropy Search for Batch Global Optimization of Expensive Objective Functions. In Advances in Neural Information Processing Systems, pp. 3330–3338, 2015.
  • Shahriari et al. (2015) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
  • Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015–1022, 2010.
  • Surjanovic & Bingham S. Surjanovic and D. Bingham. Virtual Library of Simulation Experiments: Test Functions and Datasets. Retrieved September 22, 2019, from http://www.sfu.ca/~ssurjano.
  • Swersky et al. (2013) Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-Task Bayesian Optimization. In Advances in neural information processing systems, pp. 2004–2012, 2013.
  • Wang et al. (2016) Jialei Wang, Scott C Clark, Eric Liu, and Peter I Frazier. Parallel Bayesian Global Optimization of Expensive Functions. arXiv preprint arXiv:1602.05149, 2016.
  • Wang et al. (2018) Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. In International Conference on Artificial Intelligence and Statistics, pp. 745–754, 2018.
  • Wu & Frazier (2016) Jian Wu and Peter Frazier. The Parallel Knowledge Gradient Method for Batch Bayesian Optimization. In Advances in Neural Information Processing Systems, pp. 3126–3134, 2016.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.

Appendix A Details of Experimental Setup

For the asynchronous parallel Thompson Sampling (Parallel-TS), we used a squared exponential kernel with a bandwidth for each dimension, a scale parameter of the kernel, and a noise variance, as stated in (Kandasamy et al., 2018a). These kernel hyperparameters were updated by maximizing the data likelihood after each iteration. We sampled a set of points from the posterior distribution of the Gaussian process and selected the point with the smallest sampled value as the next point to evaluate. Following the authors' repository of Parallel-TS (https://github.com/kirthevasank/gp-parallel-ts), we set the number of sampled points as a function of the problem dimension and the number of completed evaluations.
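As a rough illustration of one Thompson-sampling step (using scikit-learn rather than the authors' code, so the details are assumptions), one can fit a GP with a squared exponential kernel, draw a single joint posterior sample at random candidate points, and pick its minimizer:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def thompson_sampling_step(X, y, bounds, n_candidates=1000, seed=0):
    """One TS step: fit a GP (squared exponential kernel with per-dimension bandwidths
    plus a noise term), draw one joint posterior sample over random candidate points,
    and return the candidate that minimizes the sampled function."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X, y)                                  # kernel hyperparameters by maximum likelihood
    cand = rng.uniform(lo, hi, size=(n_candidates, X.shape[1]))
    sample = gp.sample_y(cand, n_samples=1, random_state=seed).ravel()
    return cand[np.argmin(sample)]
```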

We used the kernel density estimator implemented in Optuna (Akiba et al., 2019) to construct $\ell(x)$ and $g(x)$ in the proposed method. We set the quantile parameter $\gamma$ in Equation (5) to a fixed value throughout the experiments.

Appendix B Details of Benchmark Function

In the experiment on the Hartmann18 function, we model the evaluation time as a random variable drawn from a half-normal distribution (Kandasamy et al., 2018a). The probability density function of the half-normal distribution is given by

$$p(z; \sigma) = \frac{\sqrt{2}}{\sigma \sqrt{\pi}} \exp\left( -\frac{z^2}{2 \sigma^2} \right), \quad z \ge 0, \qquad (6)$$

where $\sigma$ is a scale parameter. Because the expectation of a random variable that follows this distribution is $\sigma \sqrt{2 / \pi}$, we set $\sigma$ so that the expectation equals the target mean evaluation time.
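For instance, with scipy.stats.halfnorm the scale can be set from a target mean evaluation time as follows; the target value below is a placeholder, since the paper's value is not reproduced in this text.

```python
import numpy as np
from scipy.stats import halfnorm

target_mean = 10.0                           # hypothetical mean evaluation time in seconds
sigma = target_mean / np.sqrt(2.0 / np.pi)   # since E[Z] = sigma * sqrt(2 / pi)
eval_times = halfnorm(scale=sigma).rvs(size=100_000, random_state=0)
print(eval_times.mean())                     # approaches target_mean as the sample grows
```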