Bayesian optimization is a powerful tool for optimizing objective functions which are very costly or slow to evaluate (Martinez-Cantin et al., 2007; Brochu et al., 2010; Snoek et al., 2012). In particular, we consider problems where the maximum is sought for an expensive function ,
within a domain
which is a bounding box (tensor product of bounded and connected univariate domains). Numerous strategies for modelingin the Bayesian optimization setting have been suggested, including the use of Gaussian processes (Snoek et al., 2012; Martinez-Cantin, 2014)2011b)
, and tree-structured Parzen estimators(Bergstra et al., 2011, 2013b).
Given the variety of alternatives, performance comparisons of various optimization methods is paramount to both researchers and practitioners. Related work summarizing the performance of Bayesian optimization methods includes (Eggensperger et al., 2013; Martinez-Cantin, 2014; Eggensperger et al., 2015). Much of the literature involves the use of potentially inappropriate statistical analysis, and provides little guidance as to how performance on multiple functions
can be analyzed in chorus. Consequently, results often read in the form of a confidence interval (derived from a small sample size) relevant to only a single function and without any means for broader interpretation.
To address these difficulties, Section 2
details a non-parametric strategy for ranking performance between various optimizers on multiple metrics and aggregating that performance across multiple functions. By allowing for multiple metrics, optimizers can be studied in more detail, e.g., considering both the quality of the solution and the speed with which it is attained. Our strategy permits ties during the ranking component, allowing for low significance values and non-parametric hypothesis tests which are less powerful but do not rely on the central limit theorem. An example of the ranking and aggregation process is provided as well as supplementary statistical analysis which motivates the need for non-parametric statistics.
We demonstrate our evaluation framework on a collection of Bayesian optimization methods using an open source suite of benchmark functions. The use of such artificial test functions is well established within the Bayesian optimization community and plays an important role in current research (Snoek et al., 2015; Hernández-Lobato et al., 2015; González et al., 2015). The test functions are stratified by specific attributes, such as being unimodal or non-smooth, as discussed in Section 3. Section 4 outlines our results and provides some interpretation. We also provide an implementation of these functions (McCourt, 2016), hopefully facilitating future empirical insights.
2 Evaluating Optimization Performance
Many metrics exist for describing the quality of an optimizer, and each application values them differently. Here, we consider only two metrics, which is sufficient to demonstrate the hierarchical nature of our ranking algorithm, but others could be included if desired. These metrics are derived from the best seen traces, , which record the best seen objective value after objective function evaluations; a sample of for two different optimization methods tested 30 times is shown in Figure 1
. The shaded region represents the inter-quartile range and reminds us that these optimization strategies are inherently stochastic and that any results should be interpreted statistically.
2.1 Metrics Considered
The simplest comparison to make is between the values observed by each strategy after observing the maximum number of function evaluations, denoted here with . This Best Found metric, , is the first we consider when comparing results.
Because each evaluation of is assumed to be expensive, we also study how quickly these methods improve. To do so, we supplement comparisons based on only the best found metric with the Area Under Curve metric, . A specified lower bound on the function ensures the AUC is always positive. The name AUC reflects the physical interpretation of the metric as an approximate integral of the best seen traces. Figure 1 depicts two best seen traces having different AUC values.
2.2 Performance Ranking and Aggregation
We proceed by using our metrics to establish a partial ranking (allowing for ties) of how multiple optimization methods perform on a given function. Because the foremost goal of optimization is to achieve good Best Found values, we reserve comparisons of Area Under Curve for when methods are seen as comparable with respect to the Best Found metric. Specifically, we combine the two metrics in a hierarchical fashion:
First, we use pairwise Mann-Whitney tests at significance on the Best Found results to determine a partial ranking based only on that statistic,
Any tied results from that step are then subject to additional partial ranking using the same test on the Area Under Curve metric,
Ties remaining after ranking attempts using both metrics are left as ties.
This process is carried out on each function in the test suite which, in effect, allows each function to “cast a ballot” listing the optimization methods in order of performance on that function. These ballots are then aggregated using a Borda ranking system (Dwork et al., 2001) and presented in tables in Section 4 alongside simpler aggregation schemes: the number of first place finishes, and the number of at least third place finishes. This approach generalizes to using additional metrics, provided they are applied in a specified order of importance.
As is always the case during the compilation of pairwise statistical tests, we must consider the “family-wise error”2006). These tests are not independent (the same samples are used for multiple tests,) but even if they were, the probability of at least a single type I error is bounded by
where is the number of algorithms under comparison. In Section 4 we use algorithms, thus can be achieved with .
2.2.1 Example Ranking and Aggregation
Suppose we use Method A, Method B, Method C and Method D to optimize a single function 30 times and record the Best Found and Area Under Curve values for each optimization. The Mann-Whitney tests on the best found value may yield the statistically significant results:
By reverse sorting these results in order of number of losses we observe the partial ranking, and resulting Borda values,
Note that this is just a simple way to isolate the worst performers and certainly not the only mechanism of creating a ranking (Cook et al., 2007), e.g., one could rank by number of wins. Any group of ties, such as the group with two wins in this example , is refined by studying the area under curve statistical test; if that test stated that Method A had a larger value than Method B, that information would be present in the final ranking
If, on the other hand, the area under curve test was statistically insignificant, the original ranking and associated Borda values would be used.
If our benchmark suite consisted of only 6 functions, , and each reported the following rankings:
the proposed aggregation strategies would generate the total rankings found in Table 1. Tables of this form appear in Section 4, where the Top Three criterion is more insightful than in this example; to account for ties, all algorithms at the top of a function’s “ballot” receive credit in the Firsts column, and similarly for the Top Three column.
2.2.2 Statistical Considerations
Some previous empirical analysis of optimization methods prefers parametric statistics using the central limit theorem (Eggensperger et al., 2013; Bergstra et al., 2014) to the non-parametric statistics we employ (Hutter et al., 2011b). Potential non-normality of samples of makes parametric statistics on small sample sizes a treacherous endeavor.
As an example, consider optimization using the simple optimization method random search (Bergstra & Bengio, 2012); each suggestion is chosen i.i.d. from . Observed function values
are therefore i.i.d. realizations of some random variable, whose distribution is determined by and . After random samples, the largest observation becomes the best found value, thus .
The max of these i.i.d. values is the th order statistic
, whose cumulative distribution function can be determined through the CDF of the random observations. A similar result phrased in a different context is presented in(Bergstra & Bengio, 2012).
The viability of a -test for studying samples of is dependent on how closely the distribution of sample means matches a normal. The central limit theorem guarantees this for sufficiently large samples, but for small sample sizes, it seems unlikely matches the CDF of a normal. Additionally, as
increases the distribution becomes increasingly skewed negative because there is a maximum value for: , the value that the optimization hopes to find. The impact of this on an example is displayed graphically in Figure 2
This figure shows that sample means from samples of size less than 30 are in danger of non-normality, and thus in danger of a test or normal-based confidence interval being untrustworthy. Furthermore, as increases, and the quality of the optimization improves, the skewness of further amplifies this danger, as is seen in the generally upward trend of these curves. Of course, random search is almost the worst optimization algorithm from a speed of convergence perspective, and thus this skewness is likely to be exacerbated for more efficient methods, including those tested in Section 4. These results strongly suggest the use of non-parametric tests for comparing optimization results.
2.3 Alternate Metrics
Our metrics are by no means the only criteria by which optimization algorithms can, or should, be judged. Best Found measures proximity to the optimal function value
, but not proximity to the optimal vectorwhich could be more insightful in some circumstances. Although we consider the probabilistic nature of the results during ranking, the metrics themselves could be considered probabilistically; for example, the probability of having at least 2 accurate digits, or the probability of more than 10% from the optimal value (dolan2002benchmarking). Knowledge of permits use of the gap metric, (huang2006global; Brochu et al., 2010) which is cleanly scaled between 0 and 1. Cumulative regret, , is useful for strongly penalizing suggestions which do not improve (Srinivas2010; bull2011convergence).
The Area Under Curve metric measures the speed of convergence towards , however the term in the AUC computation serves only a cosmetic purpose. Its omission would produce an equivalent averaged metric. An averaging of the gap metric could be similarly considered. Some real-world problems have variable computational or physical cost for different values, thus the total resources (e.g., CPU time) expended on function evaluations can also be an important metric. The time required for the entire optimization may also be relevant when comparing software implementations (Martinez-Cantin, 2014).
3 Benchmark Functions
One of the prominent applications of Bayesian optimization is the efficient tuning of hyperparameters of machine learning models(Snoek et al., 2012; Eggensperger et al., 2013). While hyperparameter optimization problems are some of the most important benchmarks, it can be difficult to understand why certain methods perform well on particular problems and others do not.
To try to piece together this puzzle, we use a library of test functions having closed-form expressions which trade the authenticity of tuning real machine learning models for speed and transparency. Our library, built on a core set of functions proposed by (Gavana, 2013), is available for inspection and acquisition (McCourt, 2016). Example test functions are shown in Figure 3.
One goal of this library is to help facilitate the transfer of insights garnered from experimentation on these functions to real-world problems, following the path of (Bergstra et al., 2014; Eggensperger et al., 2015) which used cheap test functions and surrogates of hyperparameter problems to evaluate Bayesian optimization software. From a practical standpoint, their negligible evaluation time and high degree of extensibility make artificial test functions a logical tool for testing Bayesian optimization software.
A valid criticism against using artificial test functions is that they might not be representative of realistic problems or the objective surfaces that occur in practice. While this statement about their limitations is perfectly accurate, test functions afford flexibility in their design and easily allow for the presence or absence of certain properties mimicking those that also appear in relevant circumstances, including hyperparameter optimization problems. For example, recent work has argued that hyperparameter optimization problems may have some level of non-stationarity (Snoek et al., 2014b). We believe that an effective way of comparing the performance of optimization methods on these types of problems would be to develop test functions with carefully imposed non-stationarity.
To this end, we have identified characteristics which we think occur with relative frequency in problems of interest; this list is only a sampling of characteristics discussed in this article and not thought to be comprehensive.
Noisy - The function evaluation is altered by for with noise level between and .
Oscillatory - Functions with a long-range trend (possibly zero) and shorter scale oscillatory behavior ; the oscillations have a wavelength long enough to not act as noise.
Unimodal - Functions with only one (or possibly zero) local maxima, with possible saddle points.
Boundary optima - lies on boundary of .
Mixed integer - Some values in must be integers.
Discrete valued - Functions can only take finitely many values, e.g., for can only take the values of the 201 integers between 0 and 200. This can occur in cross-validation settings.
Mostly boring - Functions such that where denotes a Sobolev space and for some . Essentially, these functions have small gradient in except in a small region . This is a version of non-stationarity.
Nonsmooth - Functions which have discontinuous derivatives on manifolds of at least one dimension while still being in .
A noteworthy caveat when using artificial test functions to evaluate performance is that they may suffer from design biases. One example that hampered the initial iteration of this test suite involved having optima in predictable locations, for example, at the domain midpoint or on integer coordinates. Under default settings, certain optimization software invokes a deterministic initialization strategy involving values on the boundary or midpoint of
, which produced unrealistically perfect results on some functions and peculiarly poor results on others. In this benchmark suite we have made an effort to appropriately classify and segregate functions of this type, though further work is required to identify and resolve less obvious biases.
Other work involved in the development of benchmark functions for testing optimization methods has focused on the development of surrogate models to actual machine learning problems (Eggensperger et al., 2015). Although no such surrogates are present here, this idea fits into the framework that we propose, and the low cost of evaluating these surrogates should permit appropriate classification with the various attributes described above, as well as other attributes not considered here.
4 Ranking Demonstration
We conducted numerical experiments on the functions described in Section 3, which consisted of 30–60 optimizations per algorithm per function. The metrics described in Section 2.1 were recorded and aggregated as described in Section 2.2. All algorithms were terminated at 80 function evaluations, unless the function was in dimension when optimization was terminated after evaluations. This decision was made for simplicity of comparison and, in a real setting, each method may have a preferred stopping criteria which should be observed.
Four Bayesian optimization methods are studied in our evaluation. Spearmint (Snoek et al., 2012, 2014b, 2014a) and SigOpt, which is derived from MOE (Clark et al., 2014) and all use Gaussian processes to model . We used the HIPS implementation of Spearmint (Snoek et al., 2014a). SMAC (Hutter et al., 2011b, a) uses random forests to model and we used the python wrapper pySMAC (Falkner, 2014) as implementation in our experiments. Hyperopt (Bergstra et al., 2013b) uses tree-structured Parzen estimators to facilitate the optimization and we used the standard python library implementation (Bergstra et al., 2013a) in our experiments. In addition to these Bayesian methods, we also consider grid search (Grid), random search (Bergstra & Bengio, 2012) (Random
), and particle swarm optimization (PSO) (Kennedy & Eberhart, 1995) as non-Bayesian baselines.
For grid search, we chose a grid resolution per dimension so that approximately 1 million grid points were generated. We then sampled randomly from this collection of grid points during optimization. For PSO, we chose particles in our evaluations where was the dimension of the problem. We used an pyswarm (Lee, 2014) as our PSO implementation in our experiments. We ignored the ability of the Gaussian process methods to incorporate an estimate of noise into their operation. This was done to eliminate the unfair advantage that GP based methods would benefit from by having prior knowledge about the objective function’s noise. Prior knowledge of the objective function’s process noise is also an unrealistic assumption for true black-box optimization.
When presenting results, optimization algorithms are listed alphabetically. Tables 2–5 analyze performance using only the dimension of the function as classification. The classification of functions by their attributes in tables 6–13, however, provides more valuable insight. Again, the functions we used are available in Python (McCourt, 2016); not all functions provided there appear in these results.
Isolating the results by dimension provides some relevant insights, notably, the emergence of HyperOpt into the top tier for higher dimensional functions. It is helpful to balance the Borda ranking, which rewards consistently ranking above methods, with the Top Three ranking, which is more of an absolute measurement because we allow for ties, i.e., five methods could be in the top three.
The GP backed algorithms SigOpt and Spearmint perform, on average, better on most categories.
Spearmint holds slight leads in functions which are mostly boring and those with a boundary optimum.
SigOpt seems to edge ahead in non-smooth and discrete functions.
PSO performs competitively in several function classes, including oscillatory and noisy, despite its non-Bayesian foundation.
Noisy functions are difficult to study because of the randomness in evaluating which augments randomness in the Bayesian optimization. This contributes to the broad representation in the Top Three ranking.
Several methods consistently share the top three rankings for the mixed integer functions. That is also the case for oscillatory functions, but with a larger Borda gap.
Mixed integer and unimodal functions have some dissonance between the Borda ranking and the Top Three ranking.
SMAC is designed for high dimensional problems with categorical or conditional parameters, and none of these functions fit that description.
It should be noted that function attributes are not exclusive—a unimodal function can also be mostly boring and have a boundary optimum. Overlap is, however, kept to a minimum to better focus on the impact of individual attributes. Future experiments could involve the development of larger groups of multiply attributed functions.
4.1 Comparison with -tests
In Section 2.2.2, we explained our preference for the Mann-Whitney test over traditional -tests for comparing optimization performance. Table 14 reinforces the potential differences between the parametric and non-parametric tests; while the results are not strikingly different, Spearmint’s performance is much clearer. Note that all tests, both the -tests and tests, were still conducted with the significance explained in Section 2.2.
4.2 Comparison without AUC specification
As explained in Section 2.2, the ranking scheme we propose involves a hierarchy of metrics: first the Best Found and second the Area Under Curve. Of course, more metrics could be incorporated into the ranking, but two is sufficient for a demonstration. Had we instead only considered the best found metric we would have seen less refined tables with more ties. Table 15 gives an example of the contrast between ranking with and without the AUC.
Problems such as hyperparameter optimization require efficient algorithms because of the cost of conducting the relevant machine learning tasks. Bayesian optimization continues to be the standard tool for hyperparameter selection, but comparing optimization strategies is complicated by the opaque nature of these hyperparameter optimization problems. We have proposed an extensible framework with which to analyze these methods, including a suite of test functions classified by their relevant attributes and an aggregation scheme for interpreting relative performance.
The results presented here demonstrate that by using test functions with desired properties, e.g., non-smooth functions, we can better decipher the performance of optimization methods. Future work will include identifying the corresponding attributes of hyperparameter selection problems so that these results can be appropriately applied. We welcome contributions and extensions to this initial evaluation set and hope it will be useful in future empirical studies.
- Bergstra & Bengio (2012) Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.
- Bergstra et al. (2013a) Bergstra, James, Yamins, Dan, and Cox, David D. Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms. https://github.com/hyperopt/hyperopt, 2013a.
- Bergstra et al. (2013b) Bergstra, James, Yamins, Daniel, and Cox, David. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of The 30th International Conference on Machine Learning, pp. 115–123, 2013b.
- Bergstra et al. (2014) Bergstra, James, Komer, Brent, Eliasmith, Chris, and Warde-Farley, David. Preliminary evaluation of hyperopt algorithms on HPOLib. In ICML workshop on AutoML, 2014.
- Bergstra et al. (2011) Bergstra, James S, Bardenet, Rémi, Bengio, Yoshua, and Kégl, Balázs. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
- Brochu et al. (2010) Brochu, Eric, Brochu, Tyson, and de Freitas, Nando. A bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 103–112. Eurographics Association, 2010.
- Clark et al. (2014) Clark, Scott, Liu, Eric, Frazier, Peter, Wang, JiaLei, Oktay, Deniz, and Vesdapunt, Norases. MOE: A global, black box optimization engine for real world metric optimization. https://github.com/Yelp/MOE, 2014.
- Cook et al. (2007) Cook, Wade D, Golany, Boaz, Penn, Michal, and Raviv, Tal. Creating a consensus ranking of proposals from reviewers’ partial ordinal rankings. Computers & Operations Research, 34(4):954–965, 2007.
- Demšar (2006) Demšar, Janez. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
- Dwork et al. (2001) Dwork, Cynthia, Kumar, Ravi, Naor, Moni, and Sivakumar, Dandapani. Rank aggregation methods for the web. In Proceedings of the 10th international conference on World Wide Web, pp. 613–622. ACM, 2001.
- Eggensperger et al. (2013) Eggensperger, Katharina, Feurer, Matthias, Hutter, Frank, Bergstra, James, Snoek, Jasper, Hoos, Holger, and Leyton-Brown, Kevin. Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, 2013.
Eggensperger et al. (2015)
Eggensperger, Katharina, Hutter, Frank, Hoos, Holger H, and Leyton-Brown,
Efficient benchmarking of hyperparameter optimizers via surrogates.
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Falkner (2014) Falkner, Stefan. pysmac : simple python interface to SMAC. https://github.com/automl/pysmac, 2014.
- Gavana (2013) Gavana, Andrea. AMPGO global optimization benchmark functions. https://github.com/andyfaff/ampgo, 2013.
- González et al. (2015) González, Javier, Osborne, Michael, and Lawrence, Neil D. GLASSES: Relieving the myopia of bayesian optimisation. In NIPS Workshop on Bayesian Optimization, 2015.
- Hernández-Lobato et al. (2015) Hernández-Lobato, José Miguel, Gelbart, Michael A, Hoffman, Matthew W, Adams, Ryan P, and Ghahramani, Zoubin. Predictive entropy search for bayesian optimization with unknown constraints. Proceedings of The 32nd International Conference on Machine Learning, 2015.
- Hutter et al. (2011a) Hutter, Frank, Hoos, Holger, Leyton-Brown, Kevin, Murphy, Kevin, and Ramage, Steve. SMAC: Sequential model-based algorithm configuration. http://www.cs.ubc.ca/labs/beta/Projects/SMAC/, 2011a.
- Hutter et al. (2011b) Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pp. 507–523. Springer, 2011b.
Kennedy & Eberhart (1995)
Kennedy, James and Eberhart, Russell C.
Particle swarm optimization.
Proceedings of the 1995 IEEE International Conference on Neural Networks, volume 4, pp. 1942–1948, Perth, Australia, IEEE Service Center, Piscataway, NJ, 1995.
- Korolev & Shevtsova (2010) Korolev, Viktor Yu and Shevtsova, Irina G. On the upper bound for the absolute constant in the Berry-Esseen inequality. Theory of Probability & Its Applications, 54(4):638–658, 2010.
- Lee (2014) Lee, Abraham. pyswarm : Particle swarm optimization (PSO) with constraint support. https://github.com/tisimst/pyswarm, 2014.
- Martinez-Cantin (2014) Martinez-Cantin, Ruben. Bayesopt: A bayesian optimization library for nonlinear optimization, experimental design and bandits. The Journal of Machine Learning Research, 15(1):3735–3739, 2014.
- Martinez-Cantin et al. (2007) Martinez-Cantin, Ruben, de Freitas, Nando, Doucet, Arnaud, and Castellanos, José A. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems, pp. 321–328, 2007.
- McCourt (2016) McCourt, Michael. Optimization Test Functions. https://github.com/sigopt/evalset, 2016.
- Snoek et al. (2012) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
- Snoek et al. (2014a) Snoek, Jasper, Swersky, Kevin, Larochelle, Hugo, Gelbart, Michael, Adams, Ryan P, and Zemel, Richard S. Spearmint : Bayesian optimization software. https://github.com/HIPS/Spearmint, 2014a.
- Snoek et al. (2014b) Snoek, Jasper, Swersky, Kevin, Zemel, Richard S., and Adams, Ryan Prescott. Input warping for bayesian optimization of non-stationary functions. In International Conference on Machine Learning, 2014b.
- Snoek et al. (2015) Snoek, Jasper, Rippel, Oren, Swersky, Kevin, Kiros, Ryan, Satish, Nadathur, Sundaram, Narayanan, Patwary, Md. Mostofa Ali, Prabhat, and Adams, Ryan P. Scalable bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.
In Section 2.2.2, we alluded to the skewness of , the result of a random search after function evaluations, where each observed function value , is a random variable with distribution determined by the function of interest and domain . Figure 2 demonstrated the impact of this empirically for the simple maximization problem involving for . In particular, it showed that the term sample mean of samples drawn from failed a Kolmogorov-Smirnov test with greater likelihood for larger and smaller
. The KS test tests whether the random variables are normally distributed, which they must be for a hypothesis test or confidence interval invoking the central limit theorem to be appropriate.
We can determine, at least for this simple problem, the exact nature of the Berry-Esseen inequality (Korolev & Shevtsova, 2010) governing the quality of the normal approximation. For a random variable such that , and , the Berry-Esseen inequality says
where is a constant and .
Our goal is to study the i.i.d. summation , but to apply the Berry-Esseen inequality we will have to subtract out the mean. In Section 2.2.2 we showed , and, for this problem, so ; note that, were we interested instead in a minimization problem, this order statistics analysis would yield . Using this CDF, we can compute
This gives us the random variable which satisfies . Using this, we can compute
When we finally compute this term that governs the quality of the normal approximation in the Berry-Esseen inequality, we see
This quotient is monotonically increasing for , which can be determined with the derivative (albeit with a good deal of work) or graphically as in Figure 4.
Of course, all is well as long as is large because as , but for small , Figure 4 demonstrates that as the quality of the solution improves ( increases) the validity of tests and central limit theorem based confidence intervals actually diminishes. This is likely the cause of the similar behavior for the failed Kolmogorov-Smirnov test probability in Figure 2.