Benchmark for global optimization algorithms on various problems
Motivated by the problem of tuning hyperparameters in machine learning, we present a new approach for gradually and adaptively optimizing an unknown function using estimated gradients. We validate the empirical performance of the proposed idea on both low and high dimensional problems. The experimental results demonstrate the advantages of our approach for tuning high dimensional hyperparameters in machine learning.
Machine learning applications and the design of complex systems usually involve a large number of free parameters. Evaluating a single set of parameters requires computationally expensive numerical simulations and cross-validations, while the choice of parameters influences the performance of a system dramatically. In the machine learning community, this problem is usually referred to as hyperparameter optimization (HPO) (Hutter et al., 2011) and has been extensively studied in recent years, since the early approaches based on grid search become impractical for high dimensional hyperparameters (Franceschi et al., 2017).
In this paper, we focus on continuous parameters, for which gradient-based methods (Maclaurin et al., 2015; Luketina et al., 2016; Pedregosa, 2016; Franceschi et al., 2017; Wu et al., 2017) have attracted attention for their fast convergence. In machine learning applications, the objective of HPO is usually to optimize a validation function evaluated at a stationary point of the training objective, and the gradient of the validation function can be derived from the iterative training procedure (Maclaurin et al., 2015; Luketina et al., 2016). The exact computation of the gradient is the major bottleneck, since it is computationally inefficient and has high space requirements. Pedregosa (2016) proposed approximating the gradient based on the stationary condition of the training procedure, and managed to efficiently optimize a large number of hyperparameters.
However, there are still several issues to be addressed. First of all, the gradient approximation proposed by Pedregosa (2016) relies on regularity conditions such as the stationary condition for a minimizer of the training loss, which obviously does not hold if we apply early stopping. Furthermore, the approximation is based on the assumption of the smoothness of the objective function, which is too strong in practice. Finally, the algorithms proposed by Pedregosa (2016), Maclaurin et al. (2015) and Franceschi et al. (2017) require hyperparameters of their own, such as the learning rate, which become hyper hyperparameters in HPO. In their experiments, those hyperparameters are manually adjusted. In practice, procedures that adapt to them are needed.
In this paper, we formalize HPO as a problem of optimizing the output value of an unknown function. We propose an alternative idea based on the two-point estimation of the gradient (Nesterov and Spokoiny, 2017), graduated optimization (Hazan et al., 2016) and the scale-free online learning algorithm (Orabona and Pál, 2018). Compared to Pedregosa (2016), we do not assume smoothness or any regularity conditions of the objective function. To avoid introducing further hyper hyperparameters, we apply a simple online gradient descent with an adaptive learning rate. We compare our algorithm against state-of-the-art global optimization algorithms on machine learning problems. The rest of the paper is organized as follows. In Section 2, we introduce the problem setting, describe our idea of estimating the gradient and propose the algorithm. In Section 3, we present the empirical performance of our algorithm. Section 4 concludes our work with some future research directions.
Let $f: \mathcal{X} \to \mathbb{R}$ be a function defined on a compact and convex set $\mathcal{X} \subset \mathbb{R}^d$. Finding the global minimum of $f$ is challenging in general due to non-convexity, unknown smoothness and possibly noisy evaluations of the function. In the context of machine learning, $f$ returns the score, such as the cross-validation error, for a given configuration of hyperparameters selected from $\mathcal{X}$. We follow the standard procedure from the literature on global optimization (Hutter et al., 2011; Bubeck et al., 2011; Munos, 2011; Malherbe and Vayatis, 2017), which attempts to minimize the function by sequentially exploring the space using a finite budget of $n$ evaluations. Formally, we wish to find a sequence $x_1, \dots, x_n \in \mathcal{X}$, where each point $x_t$ depends on the previous evaluations $f(x_1), \dots, f(x_{t-1})$, such that the last explored point $x_n$ returns the lowest possible value.
Global optimization methods usually do not or cannot leverage gradient information, which has proven useful for tuning hyperparameters in machine learning (Maclaurin et al., 2015; Pedregosa, 2016; Franceschi et al., 2017). However, deriving the gradients in those works is expensive, requires strong assumptions on $f$, and is not applicable to black-box problems. In contrast, our idea works for more general cases. Assuming $f$ is $L$-Lipschitz, its Gaussian approximation (Nesterov and Spokoiny, 2017) is defined as

$$f_\mu(x) = \mathbb{E}_u\left[f(x + \mu u)\right],$$

where $\mu > 0$ is the smoothing parameter and $u$ is a standard Gaussian random vector. $f_\mu$ is smooth (Nesterov and Spokoiny, 2017) with bounded bias (Hazan et al., 2016). The gradient can be estimated using the two-point feedback (Nesterov and Spokoiny, 2017)

$$g_\mu(x) = \frac{f(x + \mu u) - f(x)}{\mu}\, u.$$

Arguably, one can estimate the gradient with only one expensive function call (Hazan et al., 2016) or use other two-point estimators with lower variance (Shamir, 2017). Our choice is more practical, since the evaluation at each $x_t$ could help us trace the best configuration of the parameters evaluated.
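As a rough illustration of the two-point feedback, the following Python sketch draws a standard Gaussian direction, evaluates the function at the current point and at one perturbed point, and scales the direction by the finite difference, following the forward-difference form of Nesterov and Spokoiny (2017). The function names, the test function `quadratic`, the smoothing value and the sample count are assumptions made for this example and are not taken from the original work.

```python
import random


def two_point_gradient(f, x, mu, rng):
    """One sample of (f(x + mu*u) - f(x)) / mu * u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    shifted = [xi + mu * ui for xi, ui in zip(x, u)]
    scale = (f(shifted) - f(x)) / mu
    return [scale * ui for ui in u]


def averaged_gradient(f, x, mu, samples, seed=0):
    """Average many single-sample estimates to reduce the variance."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        g = two_point_gradient(f, x, mu, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / samples for t in total]


def quadratic(x):
    """Hypothetical test function with known gradient 2 * x."""
    return sum(xi * xi for xi in x)


# The exact gradient at (1, -2) is (2, -4); the average should be close to it.
estimate = averaged_gradient(quadratic, [1.0, -2.0], mu=1e-3, samples=20000)
```

A single sample is very noisy, which is why the example averages many of them before comparing with the exact gradient. An optimizer would typically use one sample per step and rely on the step size to average out the noise.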
Algorithm 1 describes our idea. We can divide it into epochs. In each epoch, we use online gradient descent (Orabona and Pál, 2018) with an adaptive learning rate to optimise the Gaussian-smoothed objective. If the smoothed functions are locally strongly convex and the global optima of the smoothed functions of successive epochs are close, a point close to the global optimum can be ensured (Hazan et al., 2016) for a large enough budget. Otherwise, the algorithm converges to a stationary point of the objective as we gradually decrease the smoothing parameter (Nesterov and Spokoiny, 2017). Compared to the standard gradient descent method used by Pedregosa (2016), our algorithm neither assumes that the objective is smooth nor derives the learning rate from a smoothness constant, which is unknown in most cases. Unlike the methods employed by Maclaurin et al. (2015) and Franceschi et al. (2017), we do not use a momentum term in the gradient descent, in order to avoid additional hyperparameters.
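Since Algorithm 1 is not reproduced in this text, the following Python sketch shows only one possible reading of the description above. It runs several epochs with an equal share of the budget, shrinks the smoothing parameter between epochs, estimates the gradient from two function values per step, and uses a projected gradient step whose learning rate is the domain diameter divided by the square root of twice the accumulated squared gradient norms. The name `grad_opt_sketch`, the halving schedule, the initial smoothing value, the exact learning-rate formula, the running sum kept across epochs and the assumption that the function can be evaluated slightly outside the box are all choices made for this example, not the authors' specification.

```python
import math
import random


def project_box(x, lower, upper):
    """Euclidean projection onto the box [lower, upper]."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def grad_opt_sketch(f, x0, lower, upper, budget, epochs=5, mu0=0.5,
                    shrink=0.5, seed=0):
    """Graduated minimisation with two-point gradients (illustrative only)."""
    rng = random.Random(seed)
    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(lower, upper)))
    steps_per_epoch = max(1, budget // (2 * epochs))  # two evaluations per step
    x = project_box(x0, lower, upper)
    best_x, best_val = list(x), float("inf")
    grad_sq_sum = 0.0  # assumption: accumulated over all epochs
    mu = mu0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            fx = f(x)
            if fx < best_val:  # trace the best evaluated configuration
                best_x, best_val = list(x), fx
            u = [rng.gauss(0.0, 1.0) for _ in x]
            # Assumption: f accepts points slightly outside the box.
            f_shifted = f([xi + mu * ui for xi, ui in zip(x, u)])
            scale = (f_shifted - fx) / mu
            g = [scale * ui for ui in u]
            grad_sq_sum += sum(gi * gi for gi in g)
            if grad_sq_sum > 0.0:
                eta = diameter / math.sqrt(2.0 * grad_sq_sum)
                step = [xi - eta * gi for xi, gi in zip(x, g)]
                x = project_box(step, lower, upper)
        mu *= shrink  # assumption: geometric decrease of the smoothing
    return best_x, best_val


def shifted_quadratic(x):
    """Hypothetical objective with its minimum at (1, -0.5)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2


best_x, best_val = grad_opt_sketch(
    shifted_quadratic, [-1.5, 1.5], [-2.0, -2.0], [2.0, 2.0], budget=2000
)
```

The learning rate depends only on the observed gradient estimates and the size of the domain, so rescaling the objective by a constant leaves the iterates unchanged. This is the practical appeal of a scale-free rule in this setting, because no smoothness constant has to be supplied.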
In this section, we compare the empirical performance of GradOpt with the following global optimisation algorithms:
PRS. The Pure Random Search method samples parameters uniformly at random from the search space.
HOO. The tree-based global optimisation method for Hölder continuous functions (Bubeck et al., 2011).
AdaLipo. The sequential strategy for optimising unknown Lipschitz continuous functions while adaptively estimating the Lipschitz constant (Malherbe and Vayatis, 2017).
AdaLipoTR. The practical method combining AdaLipo (Malherbe and Vayatis, 2017) for global search with a trust region method for local refinement. (The implementation of both AdaLipo and AdaLipoTR is taken from the dlib library, http://blog.dlib.net/2017/12/a-global-optimization-algorithm-worth.html.)
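The simplest of these baselines, PRS, can be sketched in a few lines of Python. The sketch assumes a box-shaped search space and a maximization problem, and it also records the best score after each evaluation. The function name, the toy score and the budget are hypothetical and serve only as an illustration.

```python
import random


def pure_random_search(f, lower, upper, budget, seed=0):
    """Maximise f by uniform sampling over the box [lower, upper]."""
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    trace = []  # best score seen after each evaluation
    for _ in range(budget):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
        trace.append(best_val)
    return best_x, best_val, trace


def toy_score(x):
    """Hypothetical score with its maximum 0 at x = 0.3."""
    return -(x[0] - 0.3) ** 2


prs_x, prs_val, prs_trace = pure_random_search(toy_score, [0.0], [1.0], budget=500)
```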
We adopt the experimental setting of Malherbe et al. (2016) and Malherbe and Vayatis (2017) and consider the problems of tuning both low dimensional and high dimensional hyperparameters in machine learning. For the low dimensional case, we tune the regularization parameter $\lambda$ and the bandwidth $\sigma$ of a Gaussian kernel ridge regression. The objective is to maximize the 10-fold cross-validation score. More specifically, we split the dataset into 10 folds and use as objective function (1) the cross-validation score of the regressor fitted in $\mathcal{H}_\sigma$, where $\mathcal{H}_\sigma$ denotes the Gaussian RKHS of bandwidth $\sigma$ equipped with the norm $\|\cdot\|_{\mathcal{H}_\sigma}$. The goal is to search for the optimal $\lambda$ and $\sigma$ within a bounded search range.
To compare the performance for high dimensional hyperparameters, we consider the task of data cleaning for kernel ridge regression, for which we assign a weight to each data sample. We then tune the hyperparameters and the weights jointly to maximize (1), with the regressor fitted on the weighted samples.
The hyperparameters and the weights each range over a bounded decision space. Note that, for the convenience of applying HOO and AdaLipo, all the problems are implemented as maximization problems. To apply GradOpt, we simply minimise the negative of the objective function.
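To make the two tuning tasks more concrete, the following Python sketch computes a cross-validation score for a Gaussian kernel ridge regression with optional per-sample weights. It rests on several assumptions that are not taken from the source: the score is the negative mean squared error, the weighted fit solves $(WK + \lambda I)\alpha = Wy$ with $W$ the diagonal matrix of weights, folds are formed by interleaving sample indices, and the data is a synthetic sine curve. All function names are hypothetical.

```python
import math


def gaussian_kernel(a, b, sigma):
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fit_weighted_krr(X, y, lam, sigma, weights):
    """Coefficients alpha solving (W K + lam I) alpha = W y."""
    n = len(X)
    A = [[weights[i] * gaussian_kernel(X[i], X[j], sigma)
          + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return solve(A, [weights[i] * y[i] for i in range(n)])


def predict(X_train, alpha, sigma, x):
    return sum(a * gaussian_kernel(x, xt, sigma) for a, xt in zip(alpha, X_train))


def cv_score(X, y, lam, sigma, weights=None, folds=10):
    """Negative mean squared error over interleaved folds (assumed score)."""
    n = len(X)
    weights = weights if weights is not None else [1.0] * n
    total_sq_err = 0.0
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        test = [i for i in range(n) if i % folds == k]
        X_tr = [X[i] for i in train]
        alpha = fit_weighted_krr(X_tr, [y[i] for i in train], lam, sigma,
                                 [weights[i] for i in train])
        total_sq_err += sum((predict(X_tr, alpha, sigma, X[i]) - y[i]) ** 2
                            for i in test)
    return -total_sq_err / n


# Synthetic data: 40 points on one period of a sine curve.
n_points = 40
X = [[2.0 * math.pi * i / (n_points - 1)] for i in range(n_points)]
y = [math.sin(x[0]) for x in X]

# Low dimensional task: the score as a function of (lam, sigma).
good = cv_score(X, y, lam=1e-3, sigma=1.0)
bad = cv_score(X, y, lam=100.0, sigma=1.0)

# Data cleaning task: corrupt one label, then give that sample weight zero.
y_dirty = list(y)
y_dirty[20] += 5.0
sample_weights = [1.0] * n_points
sample_weights[20] = 0.0
uniform = cv_score(X, y_dirty, lam=1e-3, sigma=1.0)
cleaned = cv_score(X, y_dirty, lam=1e-3, sigma=1.0, weights=sample_weights)
```

In the data cleaning variant the optimizer would treat every entry of `sample_weights` as an additional hyperparameter, so the dimension of the search space grows with the number of samples.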
For each of the problems, we perform repeated runs of the algorithms with a fixed budget of function evaluations. Then, for each target value, we record the number of function evaluations required to reach the target value multiplied by the best score found by the algorithms.
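The evaluation criterion can be sketched as follows, assuming a maximization problem with positive scores. The function name, the sequence of scores and the target values 0.9, 0.95 and 0.99 are hypothetical and only illustrate the computation.

```python
def evaluations_to_target(scores, best_score, target):
    """Number of evaluations until a score reaches target * best_score.

    Returns None if the threshold is never reached within the budget.
    """
    threshold = target * best_score
    for count, score in enumerate(scores, start=1):
        if score >= threshold:
            return count
    return None


# Hypothetical scores of one run, in the order they were evaluated.
run_scores = [0.20, 0.50, 0.85, 0.70, 0.96, 0.99]
best_found = max(run_scores)

needed = {t: evaluations_to_target(run_scores, best_found, t)
          for t in (0.9, 0.95, 0.99)}
```

In a full benchmark, `best_found` would be the best score over all algorithms and runs on the same problem, and the counts would be aggregated over the runs.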
Tables 1 and 2 report the experimental results (the source code can be fetched from https://github.com/christiangeissler/gradoptbenchmark). Despite the potentially suboptimal stationary points, GradOpt returns a point close to the optimal solutions found by the other global optimisation algorithms in all experiments. For the low dimensional problems, the combined approach AdaLipoTR outperforms the other algorithms. However, GradOpt takes only a few more steps than AdaLipoTR and outperforms the algorithms relying on global search for all datasets except Yacht. The experimental results for the high dimensional tasks demonstrate the advantage of GradOpt, which obtains most of the best scores for all target values. The other global optimization algorithms do not scale well to the high dimensional problems, which is also suggested by their theoretical analysis (Bubeck et al., 2011; Malherbe and Vayatis, 2017).
We presented an alternative approach for optimizing an unknown, potentially non-convex and non-smooth function, which is based on the estimated gradient, an adaptive learning rate and graduated optimization. As suggested by the theoretical analysis of previous work, our approach converges to a stationary point in general and to a global optimum if certain conditions are fulfilled. The experimental results suggest that, in practice, our approach reaches points close to the best solutions found by global optimization algorithms. For tuning high dimensional hyperparameters, it outperforms the state-of-the-art global optimization algorithms in most of the experiments.
We consider this work a first glimpse of applying graduated optimization to searching for optima of unknown functions. It can be extended and improved in several ways. Firstly, the convergence of our approach is suggested by previous work and shown empirically, yet the actual theoretical performance is unknown. The most important future direction would be to perform a theoretical analysis in an appropriate framework. Furthermore, we assign an equal budget of evaluations to each epoch in this work, which may not be the best option. A strategy for allocating the budget with theoretical guarantees would be needed in practice. Finally, the experimental results show that the combined method outperforms the rest for the low dimensional problems. To apply our approach to tuning hyperparameters in machine learning, we can also combine it with global optimization algorithms. However, this must be thoroughly evaluated on diverse datasets with different models.
This work is supported in part by the German Federal Ministry of Education and Research (BMBF) under the grant number 01IS16046. We would like to thank Dr. Brijnesh Jain for his valuable feedback.