Machine learning algorithms are among the most important tools in many industrial and research areas. Machine learning models such as GBDT, CNN, and DNN are widely used in computer vision, image classification, and speech recognition.
To obtain a satisfactory machine learning model, it is necessary to tune hyperparameters during training. Some hyperparameters strongly affect a model's performance in the training phase, as measured by logistic loss, AUC, or classification error. Such hyperparameters include the L1 and L2 norm penalties in LR and the learning rates in stochastic gradient descent and GBDT.
1.1 The importance of hyperparameter optimization
However, with the development of machine learning and AI technology, machine learning models have become more and more complex and contain a huge number of hyperparameters. Tuning those hyperparameters by hand is tremendous and exhausting work for practitioners and researchers.
To deal with this problem, AutoML tools such as Google's AutoML, Auto-WEKA, auto-sklearn, and H2O have been developed in recent years. These tools use different methods to optimize hyperparameters and reduce the workload of practitioners and researchers.
1.2 Traditional hyperparameter optimization
Besides grid search and random search, Bayesian optimization algorithms, especially SMBO algorithms, are the most widely used hyperparameter optimization methods today. Bayesian optimization is well developed for black-box optimization. It typically works by assuming that the unknown function was sampled from a Gaussian Process, and it maintains a posterior distribution for this function as observations are made. SMAC, which employs random forests and Gaussian Processes, is the state-of-the-art SMBO algorithm for hyperparameter optimization.
1.3 The shortcomings of traditional methods
With the development of black-box optimization and AutoML technology, researchers have proposed different kinds of algorithms in each area. In black-box optimization, how to use gradient information is a widely discussed topic. In AutoML, meta-learning and gradient-based hyperparameter optimization are widely discussed topics.
Although Bayesian optimization algorithms achieve satisfactory convergence speed and output accuracy, they do not use known information, such as gradients and model analysis, in their process. Intuitively, such known information should accelerate convergence.
1.4 Problem: How to use known information to accelerate Bayesian optimization
Some algorithms do use such known information, but they use only part of it.
In Bayesian optimization, some work incorporates gradient information into Gaussian Process regression, but those methods treat the hyperparameter gradient as prior information. In hyperparameter optimization, however, the hyperparameter gradient reflects the character of the real hyperparameter-performance function, so in this application it should be treated as posterior information. Moreover, the computational load of these algorithms is huge.
Model analysis information is mainly used in meta-learning to select the initial configuration, rather than in the search for the optimal hyperparameters. When the model configuration and dataset fall outside the meta-learning dataset, model analysis information fails to accelerate hyperparameter optimization.
In this paper, we design an algorithm based on SMBO and two kinds of known information, machine learning model analysis information and hyperparameter gradients, to accelerate the hyperparameter optimization process. In experiments on the pc4, real-sim, and rcv1 datasets, our methods achieve 200% to 300% of the convergence speed of SMAC.
In section 2, we briefly introduce the SMBO algorithm, Gaussian Processes, and hyperparameter gradients. In section 3, we describe the methods in our algorithm in detail. In section 4, we present experimental results for L2 norm optimization on a small-scale dataset (pc4), a medium-scale dataset (rcv1), and a large-scale dataset (real-sim).
Our main contributions are summarized as follows:
(1) Based on Bayesian optimization, we propose a method that uses known machine learning model analysis information to accelerate the Bayesian optimization process.
(2) Based on Bayesian optimization, we propose a method that uses known hyperparameter gradient information to accelerate the Bayesian optimization process. In this method, we use the mean function of the posterior distribution to fit the gradient information instead of using it as prior information.
(3) Our experimental results show that, in L2 norm experiments, our methods using machine learning model analysis information and hyperparameter gradient information achieve 200% to 300% of the convergence speed of SMAC on datasets of different scales.
2 Background and Related Work
In this section, we briefly introduce the SMBO algorithm and recent techniques for computing hyperparameter gradients.
Hyperparameter optimization focuses on the problem: learning a performance function with finite . A exposes which change the way the learning algorithm , like learning rate in SGD, the number of leaves in GBDT.
Given a learning algorithm and a limited amount of training data, the performance of a hyperparameter configuration is estimated by splitting the data into disjoint training and validation sets, learning functions by applying the algorithm with that configuration to the training set, and evaluating the predictive performance of these functions on the validation set. This allows the hyperparameter optimization problem to be written as the minimization, over hyperparameter configurations, of the validation loss, where the loss (logistic loss or misclassification rate) is the one achieved by the algorithm when trained on the training set and evaluated on the validation set. We use k-fold cross-validation, which splits the training data into k equal-sized partitions.
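The k-fold scheme can be sketched as follows. This is a minimal illustration rather than the authors' code: `k_fold_indices`, `cross_validated_loss`, and the `loss_fn(train, valid)` callback are hypothetical names, and the use of contiguous, unshuffled folds is an assumption.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_loss(loss_fn, n_samples, k):
    """Average loss_fn(train_indices, valid_indices) over the k folds."""
    folds = k_fold_indices(n_samples, k)
    total = 0.0
    for i, valid in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        total += loss_fn(train, valid)
    return total / k
```

Here `loss_fn` would train the model with one fixed hyperparameter configuration on the training indices and return its loss on the validation indices.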
2.2 SMBO and Gaussian Process regression
In SMBO, we do not have complete information about the loss function. We only need to sample different configurations and build a response surface model of the loss in order to find its minimum.
SMBO handles both categorical and continuous hyperparameters and can exploit hierarchical structure stemming from conditional parameters. SMBO first builds a model that captures the dependence of the loss function on hyperparameter settings.
SMBO then iterates the following steps: 1. use the model and the configuration-selection method to determine a promising candidate configuration; 2. evaluate the loss of that configuration; 3. update the model with the new data point.
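The three steps can be sketched as a generic loop. This is a toy illustration under stated assumptions, not SMAC or the paper's Algorithm 1: the surrogate `fit_nearest` (nearest-neighbour prediction), the selection rule `select_candidate` (predicted loss minus a distance bonus weighted by `kappa`), and the finite candidate list are all hypothetical stand-ins for a real model and acquisition function.

```python
def fit_nearest(history):
    """Toy surrogate: predict the loss of the nearest evaluated configuration."""
    def predict(lam):
        near_lam, near_loss = min(history, key=lambda p: abs(p[0] - lam))
        # Return the predicted loss and the distance as a crude uncertainty.
        return near_loss, abs(near_lam - lam)
    return predict


def select_candidate(model, candidates, history, kappa=1.0):
    """Step 1: pick the unevaluated candidate with the best optimistic score."""
    seen = {lam for lam, _ in history}
    fresh = [c for c in candidates if c not in seen]
    if not fresh:
        raise ValueError("all candidates have already been evaluated")

    def score(c):
        predicted, distance = model(c)
        return predicted - kappa * distance  # lower is more promising

    return min(fresh, key=score)


def smbo(loss, candidates, init, n_iter):
    """Run n_iter SMBO iterations and return the best (configuration, loss)."""
    history = [(init, loss(init))]
    for _ in range(n_iter):
        model = fit_nearest(history)                        # step 3: refit on all data
        lam = select_candidate(model, candidates, history)  # step 1: choose candidate
        history.append((lam, loss(lam)))                    # step 2: evaluate the loss
    return min(history, key=lambda p: p[1])
```

In a real system the surrogate would be a random forest or a Gaussian Process and the selection rule would maximize an acquisition function such as expected improvement.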
SMAC is an SMBO algorithm. In SMAC, the model in Algorithm 1 is a random forest, and the selection method is the procedure that yields the next hyperparameter configuration. For TPE, the model is the Tree-structured Parzen Estimator.
The most widely used acquisition function for Gaussian Process SMBO is expected improvement. In this paper, we use the standard expected improvement function.
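For reference, the standard closed form of expected improvement for a minimization problem can be sketched as below. The function name and arguments are illustrative: `mu` and `sigma` are assumed to be the surrogate's predictive mean and standard deviation at a candidate, and `best` the lowest loss observed so far.

```python
import math


def expected_improvement(mu, sigma, best):
    """Expected improvement over `best` for a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        # Without predictive uncertainty the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf
```

The first term rewards candidates whose predicted loss is already below the incumbent, and the second rewards candidates with large predictive uncertainty.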
2.2.2 Gaussian Process
Many hyperparameter optimization frameworks use a Gaussian Process as the model.
The Gaussian Process (GP) is a convenient and powerful prior distribution on functions. The GP is defined by the property that any finite set of m points induces an m-dimensional multivariate Gaussian distribution. The i-th of these points is taken to be the function value at the i-th input, and the elegant marginalization properties of the Gaussian distribution allow us to compute marginals and conditionals in closed form. The support and properties of the resulting distribution on functions are determined by a mean function and a positive definite covariance function. The work offers an overview of Gaussian Processes.
2.3 Hyperparameter gradient
The SMBO algorithm focuses on black-box function optimization. It does not need any detailed information about the objective function.
Hyperparameter gradient methods instead use gradient descent, which distinguishes them from Bayesian optimization methods.
Although Bayesian optimization methods have been well developed in recent years, traditional Bayesian optimization algorithms do not take known information into consideration. In this paper, we design an algorithm that uses such information to accelerate convergence.
3.1 Using Known Parameter direction information to accelerate black-box optimization in SMBO’s step 2
3.1.1 Model information
Black-box optimization methods, such as Bayesian optimization, are designed under the assumption that the user knows nothing about the objective function. In practice, however, users often gain information about the objective function from experience or rigorous mathematical analysis. We list some hyperparameter information as follows:
| Model | Hyperparameter | Preferred value |
|---|---|---|
| GBDT | number of leaves | large |
| Decision tree | number of leaves | large |
In meta-learning, this information is used to choose the initial configuration from the meta-learning dataset. Meta-learning does not influence the hyperparameter optimization process itself. When the meta-learning dataset is not large enough, meta-learning fails to accelerate hyperparameter optimization.
3.1.2 Adjust hyperparameter in SMBO process
In this paper, we propose Algorithm 2. This algorithm uses model analysis and empirical information in the hyperparameter optimization process.
The main difference between Algorithm 1 and Algorithm 2 is step 2. In traditional SMBO, the output of step 2 is the performance of the configuration predicted by the model. Algorithm 2, however, adjusts the configuration during this computation.
In our method, we mainly use trial and error to adjust hyperparameters in the training process, i.e. in step 2. The trial-and-error method differs depending on the hyperparameter being optimized.
In this paper, we divide hyperparameters into two groups. One group can be adjusted during model training, such as the learning rate for SGD and the number of leaves in GBDT or xgboost. Those hyperparameters influence convergence speed. The other group must be set before training the model, such as the L1 and L2 norms and the choice of kernel. Our method handles these groups with different sub-algorithms, shown in Algorithm 3 and Algorithm 4.
3.1.3 Adjust hyperparameter
In this paper, we call the direction in which a hyperparameter should be adjusted the positive direction. For a hyperparameter where larger is better, the positive direction is increasing the hyperparameter. For a hyperparameter where smaller is better, the positive direction is decreasing the hyperparameter.
In practice, the two algorithms above should be used sequentially, i.e. as step 2 and step 3 of Algorithm 2, because hyperparameters adjusted during training, such as the learning rate, often influence convergence speed. Different configurations should be compared at the same degree of convergence.
When we adjust a hyperparameter in the positive direction, by default we double or halve its value. However, for values such as the L1 or L2 norm on a small dataset, we should enlarge or reduce the hyperparameter value with a small step length, such as increasing or decreasing it by 10%.
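The step rule described above can be written as a small helper. This is a sketch only: the function name `step_positive` and its flags are hypothetical, and the factors simply encode the doubling/halving default and the 10% small step mentioned in the text.

```python
def step_positive(value, larger_is_better, small_step=False):
    """Move a hyperparameter one step in its positive direction."""
    if small_step:
        # Fine adjustment: increase or decrease the value by 10%.
        factor = 1.1 if larger_is_better else 0.9
    else:
        # Default adjustment: double or halve the value.
        factor = 2.0 if larger_is_better else 0.5
    return value * factor
```

For example, a number-of-leaves setting would use `larger_is_better=True`, while a regularization strength on a small dataset would use `larger_is_better=False, small_step=True`.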
3.1.4 The improvement of performance and stop condition
The influence of some hyperparameters on performance is monotone. For example, the fixed point of SGD moves closer to the minimum of the objective function as the learning rate decreases. However, improving model performance increases the demand for resources and time. It is not worthwhile to spend too many resources on a minor performance improvement.
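One way to express such a stop condition is to keep stepping in the positive direction only while the relative improvement exceeds a threshold. The sketch below is an assumption about the form of the rule, not the paper's exact criterion: `adjust_until_stalled`, `min_improvement`, and `max_steps` are hypothetical names, and `evaluate` stands for training and scoring one configuration.

```python
def adjust_until_stalled(evaluate, value, step, min_improvement=0.0, max_steps=20):
    """Step a hyperparameter while each step improves the loss enough.

    A candidate is accepted when the loss drops by more than
    min_improvement * |current best loss|; otherwise the search stops.
    """
    best_value, best_loss = value, evaluate(value)
    for _ in range(max_steps):
        candidate = step(best_value)
        loss = evaluate(candidate)
        if best_loss - loss > min_improvement * abs(best_loss):
            best_value, best_loss = candidate, loss
        else:
            break  # the improvement is too small to justify further cost
    return best_value, best_loss
```

With `min_improvement=0.0` the search continues as long as the loss decreases at all; a larger threshold stops earlier and saves evaluations at the price of a possibly worse final configuration.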
3.1.5 Accelerate SMBO process
For the majority of hyperparameters, such as the learning rate for SGD and the number of leaves for GBDT, the character is relatively simple and well studied (Figure 1). It is easy to know the range that contains the best configuration.
Algorithm 2 samples more configurations, together with their performance, in this range by adjusting configurations in the positive direction. The model thus gains more information about this range (Figure 2). It mainly fits this range, which reduces resource cost and yields a better result.
3.2 Using Known Parameter Gradient information to accelerate black-box optimization in SMBO’s
Many researchers have proposed methods for computing hyperparameter gradients. How to use gradient information in Bayesian optimization algorithms is one of the hottest topics in machine learning.
SMBO is a Bayesian optimization algorithm designed for hyperparameter optimization. The most commonly used models are random forests and Gaussian Processes. However, it is hard to incorporate gradient information into a random forest.
3.2.1 Gaussian Process Regression
In this section, we present our gradient-based Gaussian Process method.
In current gradient-based Gaussian Processes, the gradient information is used as prior information to build a complex Gaussian Process. In hyperparameter optimization, however, the gradient describes the local character of the performance function, so it is more logical to treat it as posterior information.
When the gradient information can be computed, the SMBO history is
We introduce some notes and background before introducing our method. We set
is the vector whose dimension is.
is the kernel (covariance) function. We define the vector (dimension : (1*n)) and the vector whose dimension is ( 1 * (d * n)).
We define the matrix whose dimension is (k*n), kernel matrix whose dimension is (n*n) and gradient kernel matrix whose dimension is (n*(d*n)) as
The vector is defined .
In traditional Gaussian Process Regression with noise-free observations, the mean , i.e. . . . The Gaussian Process is .
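For reference, standard noise-free Gaussian Process regression with a zero prior mean can be sketched in a few lines. This illustrates the textbook posterior only, not the paper's gradient-based variant: the one-dimensional inputs, the RBF kernel with unit length scale, and the helper names `rbf`, `solve`, and `gp_posterior` are all assumptions made for the example.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Gaussian (RBF) kernel for scalar inputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_posterior(xs, ys, x_new, kernel=rbf):
    """Posterior mean and variance at x_new given noise-free observations."""
    K = [[kernel(a, b) for b in xs] for a in xs]  # n x n kernel matrix
    k_star = [kernel(x_new, a) for a in xs]       # kernel vector at x_new
    alpha = solve(K, ys)                          # K^{-1} y
    v = solve(K, k_star)                          # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)                    # clamp tiny negative round-off
```

At an observed input the posterior mean reproduces the observation and the variance collapses to zero; far from all observations the mean returns to the prior mean and the variance to the prior variance.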
3.2.2 Gradient based Gaussian Process Regression
Mean function. The basic idea of our gradient-based Gaussian Process is to find the Gaussian Process regression whose mean-function gradient is close to the observed gradients.
In a traditional Gaussian Process, the mean function is a fit that uses kernel functions as a basis. The variance function reflects the distance between the prediction point and the observation points.
To implement the above base, the initial idea is to combine different kernels with different coefficient to fit following equation 4, 3. When the number of kernel is equal to the dimension of , the mean function can reflect all observed information, including point and gradient observed value.
Multi Gaussian Process and variance function The essence of combination of kernel on mean function is the Gaussian Process combination, i.e.
where stands for the Gaussian Process which is regressed by kernel . Above view shows that the observation is produced by the sum of different Gaussian Process. Those Gaussian Process is regressed by different kernel.
Compared with a traditional Gaussian Process regressed with one kernel, the Gaussian Process combination has more degrees of freedom: this method has d+1 variables. These degrees of freedom help our method fit observed information such as gradients and point values.
The variance function of those Gaussian Process reflect the distance of prediction point and observation point . Thus, for , the mean function: , variance function is . For the sum of Gaussian Process is still a Gaussian Process, the is
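The closure property used in the last sentence can be stated explicitly. As a sketch, assuming the d+1 component processes are mutually independent, their sum is a Gaussian Process whose mean and covariance functions are the sums of the component means and covariances. The symbols $f_i$, $m_i$, and $k_i$ are illustrative assumptions rather than the paper's notation, and any kernel coefficients are omitted.

```latex
f_i \sim \mathcal{GP}(m_i, k_i), \quad i = 0, \dots, d
\;\Longrightarrow\;
\sum_{i=0}^{d} f_i \sim \mathcal{GP}\!\Big(\sum_{i=0}^{d} m_i,\; \sum_{i=0}^{d} k_i\Big)
```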
Approximate and reduce computation load
Although RBF (radial basis function) kernels, such as polyharmonic spline functions, satisfy all the requirements in the above section, the computational load is still a concern. Moreover, as Figure 1 shows, the performance curve is not smooth. The gradient reflects local information rather than the general trend of the performance curve. Fitting the gradient information exactly therefore harms the Gaussian Process regression and increases the computational load. An appropriate approximation is needed before running the algorithm.
In our method, we approximate the gradient information by reducing the number of kernels and using the least squares method to handle the gradient equation, i.e. equation 4.
Algorithm 5 shows our method with two kernels. This algorithm solves the following set of equations.
For the final Gaussian Process, we expect the can present the character of observations, i.e. , .
In this section, we evaluate the empirical performance of our method.
We choose a small dataset, pc4 from OpenML; a medium-sized dataset, rcv1; and a large dataset from LIBSVM. Information about the datasets is shown in Table 2.
In all cases, the dataset is randomly split into three parts: a training set containing 50% of the samples, a test set containing 30%, and a validation set containing 20%.
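A random 50/30/20 split of this kind can be sketched as follows. The function name `split_dataset` and the fixed `seed` are assumptions for illustration; the split is expressed over sample indices so that it applies to any dataset representation.

```python
import random


def split_dataset(n_samples, seed=0):
    """Randomly split sample indices into 50% train, 30% test, 20% validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = n_samples // 2              # 50% of the samples
    n_test = (n_samples * 3) // 10        # 30% of the samples
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    valid = indices[n_train + n_test:]    # the remaining samples, about 20%
    return train, test, valid
```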
4.2 Experiment Setting
In our experiments, we optimize the regularization parameter of the L2-regularized logistic regression model, because the hyperparameter gradient of the L2 norm is easy to obtain. In this case, the loss function of the inner optimization problem is the regularized logistic loss. For classification, the outer cost function is the logistic loss, i.e. equation 8. Using the logistic loss avoids the problem that the zero-one loss is non-smooth.
Where is the logistic loss, i.e. .
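The inner and outer objectives described here can be sketched as below. The exact scaling is an assumption: this version averages the logistic loss over samples and adds `(lam / 2) * ||w||^2`, with labels in {-1, +1}; the paper's own parametrization of the regularization parameter may differ. All function names are hypothetical.

```python
import math


def logistic_loss(t):
    """Numerically stable evaluation of log(1 + exp(-t))."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def inner_objective(w, X, y, lam):
    """Mean logistic loss on (X, y) plus the L2 penalty (lam / 2) * ||w||^2."""
    data_term = sum(
        logistic_loss(yi * sum(wj * xj for wj, xj in zip(w, xi)))
        for xi, yi in zip(X, y)
    ) / len(y)
    return data_term + 0.5 * lam * sum(wj * wj for wj in w)


def outer_objective(w, X_valid, y_valid):
    """Unregularized mean logistic loss on the validation set."""
    return inner_objective(w, X_valid, y_valid, 0.0)
```

The inner problem is minimized over the weights `w` on the training set for a fixed `lam`, and the outer objective scores the resulting weights on the validation set.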
The solver used for the inner optimization problem of the logistic regression problem is stochastic gradient descent. In our problem, we set the search range is .
Since the hyperparameter in this paper is one-dimensional, we can choose at most two kernels in our gradient-based Gaussian Process method: the Gaussian radial basis function and the cubic radial basis function.
In our experiment, the inverse matrix calculation is the key in algorithm. However, the accuracy of based on cubic radial basis function would be reduced for . Thus, we choose Gaussian radial basis function as .
4.2.3 Positive direction parameters
In experiments, we set is for the model would be close to convergent model under this setting. To show different performance on different parameters setting, the stop condition is improvement 0% (accept when ), and improvement 20% (accept when ).
For norm is usually small, the positive direction is towards to zero. In our experiments, we use as the methods in which we adjust towards positive direction.
4.2.4 Comparison with other hyperparameter optimization methods
We now compare against other hyperparameter optimization methods. The methods against which we compare are:
state of art method. Sequential model-based optimization using random forest. SMAC is the core algorithm in .
As initialization for this method, the first configuration is . we choose 4 challenger in one epoch, using intensify process choose the best one challenger (the default setting of ). The acquisition functions in this our experiments is expected improvement function.
In this paper, we modified the framework. To show performance under different parameter settings, we use stop conditions of 0% and 20% improvement. In the experiments, the first variant is named GradGP&Pos and the second GradGP&Pos0.2.
As initialization for this method, the first configuration is . we choose 4 challenger in one epoch, and use intensify process choose the best one challenger (the default setting of ). The acquisition functions in this our experiments is expected improvement function.
common methods. The method we present in this paper, with an exponentially decreasing tolerance sequence. In the interval , we sample the 20 configurations uniformly. We compute the performance serially from to .
common method. This is the random search method samples the hyperparameters from a predefined distribution. We choose to samples from a uniform distribution in the interval. To make the all methods have almost the same initial value, the first epoch we compute the performance with .
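The random search baseline can be sketched as follows. The names `random_search`, `first`, and `seed` are illustrative assumptions: `first` lets the first epoch use a fixed configuration, so that all methods start from the same point, and the remaining trials are drawn uniformly from the search interval.

```python
import random


def random_search(loss, low, high, n_trials, first=None, seed=0):
    """Evaluate uniformly sampled configurations and return the best (lam, loss)."""
    rng = random.Random(seed)
    best = None
    for i in range(n_trials):
        # Optionally evaluate a fixed starting configuration in the first trial.
        if i == 0 and first is not None:
            lam = first
        else:
            lam = rng.uniform(low, high)
        value = loss(lam)
        if best is None or value < best[1]:
            best = (lam, value)
    return best
```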
State-of-the-art method. A gradient-based hyperparameter optimization method, similar to gradient descent. In each epoch, the hyperparameter is adjusted along the gradient direction. In this case, we use the modified HOAG framework.
As initialization for this method, the first configuration, , is . To gain the fastest convergence speed in epoch, we set the . To measure the HOAG’s convergence speed, we learned the performance curve before HOAG experiments. Based on the Lipschitz constant in , we set the step length in HOAG is fixed in all experiments.
4.3 Experimental Result and Analysis
Figure 3 shows the experimental results on the pc4 dataset. In this case, our methods achieve the fastest convergence and find the best hyperparameter. The output of SMAC is suboptimal. Because the dataset is small, the performance curve in this case is unstable and multimodal. HOAG performs poorly, and its output does not appear to converge.
Figure 4 shows the experimental results on the rcv1 dataset. In this case, our methods achieve the fastest convergence and find the best hyperparameter. The performance curve in this case is relatively stable and close to unimodal. The output of HOAG is suboptimal. The output of SMAC is almost the same as that of random search, because SMAC needs more sample points to capture the whole curve.
Figure 5 shows the experimental results on the real-sim dataset. In this case, HOAG and our methods achieve the fastest convergence and find the best hyperparameter.
From the analysis of the results of GradGP&Pos and GradGP&Pos0.2, we conclude that when the performance curve is unstable, the early stop condition should be sensitive to changes in performance.
In this paper, we use two kinds of known information, machine learning model analysis information and hyperparameter gradient information, to accelerate the hyperparameter optimization process. In experiments on the pc4, real-sim, and rcv1 datasets, our methods achieve 200% to 300% of the convergence speed of SMAC.
This work was supported by the National Natural Science Foundation of China under Grant No. 61502450, Grant No. 61432018, Grant No. 61521092, and Grant No. 61272136; National Major Research High Performance Computing Program of China under Grant No. 2016YFB0200800.
-  J. Bergstra and Y. Bengio. Algorithms for hyper-parameter optimization. In International Conference on Neural Information Processing Systems, pages 2546–2554, 2011.
-  P. Brazdil. Metalearning: Applications to data mining. Cognitive Technologies, 2009.
-  E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv: Learning, 2010.
-  C. B. Do, C. S. Foo, and A. Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In International Conference on Neural Information Processing Systems, pages 377–384, 2007.
-  J. C. Duchi. Introductory Lectures on Stochastic Convex Optimization. Park City Mathematics Institute, Graduate Summer School Lectures, 2016.
-  M. Feurer, A. Klein, K. Eggensperger, J. T. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. pages 2962–2970, 2015.
-  E. Hazan, A. Klivans, and Y. Yuan. Hyperparameter optimization: A spectral approach. 2017.
-  F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization - International Conference, Lion 5, Rome, Italy, January 17-21, 2011. Selected Papers, pages 507–523, 2012.
-  D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
-  L. Li, K. Jamieson, G. Desalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:1–52, 2016.
-  H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter. Towards automatically-tuned neural networks. pages 58–65, 2016.
-  P. Mineiro and N. Karampatziakis. Loss-proportional subsampling for subsequent erm. In International Conference on Machine Learning, pages 522–530, 2013.
-  F. Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
-  C. E. Rasmussen and H. Nickisch. Gaussian Processes for Machine Learning (GPML) Toolbox. JMLR.org, 2010.
-  R. G. Regis and C. A. Shoemaker. Constrained global optimization of expensive black box functions using radial basis functions. Journal of Global Optimization, 31(1):153–171, 2005.
-  J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In International Conference on Neural Information Processing Systems, pages 2951–2959, 2012.
-  P. Sun, T. Zhang, and J. Zhou. A convergence rate analysis for logitboost, mart and their variant. pages 1251–1259, 2014.
-  C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-weka: combined selection and hyperparameter optimization of classification algorithms. Knowledge Discovery and Data Mining, pages 847–855, 2013.
-  J. Wu, M. Poloczek, A. G. Wilson, and P. I. Frazier. Bayesian optimization with gradients. 2017.
-  M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. pages 2595–2603, 2010.