Fast Hyperparameter Optimization of Deep Neural Networks via Ensembling Multiple Surrogates

11/06/2018, by Yang Li et al.

The performance of deep neural networks (DNNs) crucially depends on good hyperparameter configurations. Bayesian optimization is a powerful framework for optimizing the hyperparameters of DNNs. Such methods need sufficient evaluation data to approximate and minimize the validation error function of the hyperparameters. However, the expensive evaluation cost of DNNs leads to very few evaluation data points within a limited time, which greatly reduces the efficiency of Bayesian optimization. Besides, previous research focuses on using the complete evaluation data to conduct Bayesian optimization, and ignores the intermediate evaluation data generated by early stopping methods. To alleviate the insufficient evaluation data problem, we propose a fast hyperparameter optimization method, HOIST, that utilizes both the complete and intermediate evaluation data to accelerate the hyperparameter optimization of DNNs. Specifically, we train multiple basic surrogates to gather information from the mixed evaluation data, and then combine all basic surrogates using weighted bagging to provide an accurate ensemble surrogate. Our empirical studies show that HOIST outperforms the state-of-the-art approaches on a wide range of DNNs, including feed-forward neural networks, convolutional neural networks, recurrent neural networks, and variational autoencoders.

Introduction

Deep neural networks (DNNs) have achieved great success in many artificial intelligence fields [Goodfellow, Bengio, and Courville2016]. Their performance crucially depends on good hyperparameter configurations, but it is poorly understood how these hyperparameters collaboratively affect the performance of the resulting model. Consequently, practitioners often resort to hand-tuning or automated brute-force methods, such as grid search and random search [Bergstra and Bengio2012], to find a good hyperparameter configuration.

Recently, Bayesian optimization (BO) has become a very efficient framework for hyperparameter optimization [Snoek, Larochelle, and Adams2012, Hutter, Hoos, and Leyton-Brown2011, Bergstra et al.2011, Bergstra, Yamins, and Cox2013, Ilievski et al.2017]. In the traditional setting of BO, the ML algorithm's loss (e.g., validation error) $f(x)$ given a hyperparameter configuration $x$ is treated as a black-box function. The goal is to find $x^{*} = \arg\min_{x} f(x)$, where the only mode of interaction with the objective is to evaluate the given configuration $x$. BO methods use a probabilistic surrogate model $M$ to approximate $f$, which describes the relationship between a hyperparameter configuration and its performance [Močkus1975]. With these basic ingredients, BO methods iterate the following three steps: 1) use the surrogate $M$ to select a promising configuration $x_{new}$ for the next evaluation; 2) evaluate configuration $x_{new}$ to obtain its performance $y_{new}$, and add the resulting data point $(x_{new}, y_{new})$ to the set of evaluation data $D$; 3) update $M$ with the augmented $D$. In order to approximate $f$ accurately, BO methods need sufficient evaluation data to train an accurate surrogate $M$. However, the black-box assumption in BO requires that each configuration be evaluated with a complete training run to obtain its performance. Furthermore, each complete evaluation of a DNN might take several hours or even days. These two factors leave BO with too few evaluation data points to train an accurate surrogate within a limited time. Therefore, there exists the problem of "insufficient evaluation data", which hampers BO's widespread use in optimizing the hyperparameters of DNNs.
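To make the three-step loop above concrete, the following is a minimal sketch of a generic BO loop with a random-forest surrogate; the helpers `evaluate` and `sample_candidates` are hypothetical, and a simple optimistic lower bound stands in for the acquisition function discussed later. This is an illustration of the framework, not the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bayesian_optimization(evaluate, sample_candidates, n_init=5, n_iter=20):
    """Generic BO loop sketch. `evaluate(x)` runs a complete training of the
    DNN for configuration vector `x` and returns its validation error;
    `sample_candidates(k)` returns a 2-D array of k candidate configurations."""
    X = sample_candidates(n_init)
    y = np.array([evaluate(x) for x in X])          # complete evaluations only
    for _ in range(n_iter):
        # Step 3: (re)fit the surrogate M on the augmented evaluation data D.
        surrogate = RandomForestRegressor(n_estimators=50).fit(X, y)
        # Step 1: select a promising configuration via an acquisition function
        # (a simple optimistic lower bound is used here as a stand-in).
        candidates = sample_candidates(1000)
        preds = np.stack([t.predict(candidates) for t in surrogate.estimators_])
        mu, sigma = preds.mean(axis=0), preds.std(axis=0)
        x_new = candidates[np.argmin(mu - sigma)]
        # Step 2: evaluate it with a complete run and augment D.
        X = np.vstack([X, x_new])
        y = np.append(y, evaluate(x_new))
    return X[np.argmin(y)], y.min()
```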

To make BO more efficient, recent extensions of BO methods relax the black-box assumption of $f$ and take inspiration from hand-tuning. Early stopping is a widely used trick in hand-tuning: humans terminate badly-performing evaluations early, which lets them conduct more evaluations in a given time. Some heuristic methods mimic this procedure and stop bad evaluations early by estimating the overall performance from the intermediate performance of short runs [Domhan, Springenberg, and Hutter2015, Klein et al.2017, Li et al.2018]. As a further improvement, several methods combine vanilla BO with early stopping to achieve better performance [Domhan, Springenberg, and Hutter2015, Klein et al.2017, Falkner, Klein, and Hutter2018a]. These methods intrinsically accelerate hyperparameter optimization by evaluating more configurations in a limited time, thus producing more evaluation data. However, they conduct BO using only the complete evaluation data gathered from complete evaluations, and ignore the intermediate evaluation data obtained from the early-stopped evaluations. In other words, they do not fully utilize the generated evaluation data. Therefore, we ask: is the intermediate evaluation data really useless for BO? If not, how can we use it to improve BO?

We observe that the intermediate evaluation data is useful for revealing information about the objective function $f$. There are two situations. If we terminate the evaluation of a nearly converged DNN model with configuration $x$, we can directly treat this intermediate performance as the overall performance $f(x)$. If we terminate the evaluation of a DNN model that has not yet converged and has a relatively bad intermediate performance, we can still use this intermediate performance to predict $f(x)$, at the expense of a certain accuracy loss. In summary, although the intermediate evaluation data can be less accurate than the complete evaluation data, it still contains potential information about $f$.

We then explore how to utilize the intermediate evaluation data to accelerate BO. The most straightforward method is to train a surrogate on it directly. However, due to early stopping, the intermediate evaluation data consists of several groups of data points obtained with different training resources. Because of this diversity, different groups do not conform to the same distribution. Owing to this violation of the i.i.d. criterion, we cannot apply BO methods directly. In order to utilize the intermediate evaluation data effectively, we need to tackle two problems: 1) how to extract useful information from it, and 2) how to exploit this information to speed up BO.

Different from methods that try to produce more evaluation data within a given time, we propose an orthogonal method to accelerate hyperparameter optimization. This method increases the utilization of evaluation data by augmenting the original training data with the intermediate evaluation data, and addresses the two problems above effectively. First, we train multiple basic surrogates on different groups of evaluation data without incurring any additional evaluation cost. Then, we combine these basic surrogates using weighted bagging to obtain an ensemble surrogate, which provides an accurate approximation of $f$.

Our contributions can be summarized as follows:

  • We study the feasibility of using intermediate evaluation data to accelerate BO. To the best of our knowledge, this is the first work to exploit the intermediate evaluation data.

  • We propose a novel ensemble method for BO, HOIST, that combines multiple basic surrogates trained on the mixed evaluation data to provide a more accurate approximation of the objective function $f$.

  • We develop an efficient learning method to update the weight vector in the ensemble model, which determines the contribution made by each basic surrogate to approximating $f$.

  • Extensive experiments demonstrate the superiority of HOIST over the state-of-the-art approaches on a wide range of DNNs. HOIST achieves speedups of 3 to 6 fold in finding a near-optimal configuration with fewer evaluations, and reaches the best performance on the test set.

Related Work

The state-of-the-art performance in hyperparameter optimization is achieved by BO methods, which aim to identify good configurations more quickly than standard baselines like random search. BO methods construct a probabilistic surrogate model to describe the relationship between a hyperparameter configuration and its performance. Among these methods, Spearmint [Snoek, Larochelle, and Adams2012] uses a Gaussian process [Rasmussen2004] to model $f$. SMAC [Hutter, Hoos, and Leyton-Brown2011] uses a modified random forest [Breiman2001] to yield an uncertain estimate of $f$. Besides, TPE [Bergstra et al.2011] is a special instance of BO, which uses tree-structured Parzen density estimators over good and bad configurations to model $f$. An empirical evaluation of these three methods [Eggensperger et al.2013, Eggensperger et al.2015] shows that SMAC performs best on benchmarks with high-dimensional, categorical, and conditional hyperparameters, closely followed by TPE. Spearmint only performs well for low-dimensional continuous hyperparameters, and does not support complex configuration spaces (e.g., conditional hyperparameters). Since DNNs involve high-dimensional hyperparameters of various types, we use SMAC as the basic surrogate model in our study.

Many BO methods relax the traditional black-box assumption and exploit cheaper information about $f$ [Swersky, Snoek, and Adams2013, Swersky, Snoek, and Adams2014, Klein et al.2016, Kandasamy et al.2017, Poloczek, Wang, and Frazier2017]. For example, multi-task BO [Swersky, Snoek, and Adams2013] transfers knowledge between a finite number of correlated tasks that are cheaper to evaluate. FABOLAS [Klein et al.2016] evaluates configurations on subsets of the training data in order to quickly obtain information about good hyperparameter settings. Unlike these methods, our method relies on early stopping to obtain cheaper information about $f$, instead of creating additional tuning tasks.

Human experts can quickly identify and terminate bad evaluations after a short run. Several methods mimic this early termination of bad evaluations to save evaluation overhead. A probabilistic model [Domhan, Springenberg, and Hutter2015] is used to predict the overall performance from the already observed part of the learning curve, enabling bad evaluations to be terminated early. Building on this, the LCNet with a learning curve layer [Klein et al.2017] was developed to improve learning curve prediction. Besides, Hyperband [Li et al.2018] is a bandit-based early stopping method: it dynamically allocates resources to randomly sampled configurations and uses the successive halving algorithm [Jamieson and Talwalkar2016] to drop badly-performing configurations. We describe Hyperband in detail in the Preliminaries section. Despite its simplicity, Hyperband outperforms state-of-the-art BO methods within a limited time. However, due to its random sampling of configurations, Hyperband performs worse than BO methods when given sufficient time.

To accelerate hyperparameter optimization, several methods combine BO with early stopping. The probabilistic learning curve model [Domhan, Springenberg, and Hutter2015] is used to terminate badly-performing evaluations in the setting of BO. Another method [Klein et al.2017] proposes a model-based Hyperband: instead of sampling randomly, it samples configurations based on the LCNet. Besides, BOHB [Falkner, Klein, and Hutter2018b] is also a model-based Hyperband, which combines the benefits of both Hyperband and BO by replacing Hyperband's random sampling with TPE-based sampling. However, these methods do not exploit the intermediate evaluation data generated by early stopping. Therefore, the current methods do not reach the full potential of this framework of combining BO with early stopping.

Preliminaries

HOIST follows the framework of combining BO with early stopping methods. As discussed in the Related Work section, HOIST chooses SMAC as the basic surrogate model and uses Hyperband to carry out early stopping. We now describe SMAC and Hyperband in more detail.

SMAC is the basic surrogate model in HOIST. In each BO iteration, SMAC uses a probabilistic random forest model $M$ to fit the objective function $f$ based on the already observed data points $D$; it then selects a promising configuration by maximizing the acquisition function, a heuristic that uses the posterior mean and variance to balance exploration and exploitation. SMAC uses the expected improvement (EI) criterion [Jones, Schonlau, and Welch1998] as the acquisition function:

$$EI(x) = \mathbb{E}\left[\max(y^{*} - y,\ 0)\right],$$

where $y^{*}$ is the best performance value in $D$. Given a configuration $x$, the random forest surrogate outputs a prediction of $f(x)$ with mean $\mu(x)$ and variance $\sigma^{2}(x)$, so $y$ satisfies $y \sim \mathcal{N}(\mu(x), \sigma^{2}(x))$.
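For concreteness, here is a minimal sketch (not the paper's code) of the closed-form EI for minimization, computed from a surrogate's predictive mean and standard deviation; the function name and the use of SciPy are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization, given the surrogate's predictive
    mean `mu` and standard deviation `sigma` at a configuration, and the
    best observed performance value `y_best`."""
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    z = np.where(sigma > 0, (y_best - mu) / np.maximum(sigma, 1e-12), 0.0)
    ei = sigma * (z * norm.cdf(z) + norm.pdf(z))
    # If the surrogate is certain (sigma == 0), EI reduces to max(y_best - mu, 0).
    return np.where(sigma > 0, ei, np.maximum(y_best - mu, 0.0))
```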

Figure 1: Successive halving loop of Hyperband when tuning LeNet on MNIST; one unit of resource corresponds to 36 epochs.

Figure 2: Validation error of 900 LeNet configurations (30 settings of the dropout probability and 30 settings of the learning rate on a base-10 log scale in [-7, -2]) on the MNIST dataset, using different numbers of training epochs.

Figure 3: The framework of HOIST

Hyperband is a principled early stopping method. It has two components:
Inner Loop: successive halving (SH).  Given a budget of training resource (e.g., a number of iterations or epochs), Hyperband first uniformly samples $n$ configurations, evaluates each configuration with a small number of units of resource, and ranks them by evaluation performance. Then Hyperband drops the badly-performing configurations and, according to the previous rankings, continues only the top $1/\eta$ fraction of configurations, each equipped with $\eta$ times more resource. This operation is repeated until only one configuration is left, evaluated with the maximum resource $R$. We illustrate this procedure in Figure 1.
Outer Loop: grid search over n.  For a fixed budget, there is no prior knowledge of whether we should use a larger $n$ with a small training resource per configuration, or a smaller $n$ with a larger training resource. Hyperband addresses this by performing a grid search over feasible values of $n$ in the outer loop.
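As an illustration of the inner loop, here is a minimal sketch of successive halving under assumed interfaces (`sample_config`, `evaluate`); the default values of `n`, `r`, and `eta` are illustrative rather than the paper's settings. Note that the sketch keeps the results of every early-stopping stage, which is exactly the intermediate evaluation data HOIST reuses.

```python
def successive_halving(sample_config, evaluate, n=27, r=1, eta=3):
    """Sketch of Hyperband's inner successive-halving loop.
    `sample_config()` returns one configuration; `evaluate(config, resource)`
    returns its validation error after training with `resource` units."""
    configs = [sample_config() for _ in range(n)]
    history = []  # (config, resource, validation_error) from every stage
    while True:
        results = [(c, evaluate(c, r)) for c in configs]
        history.extend((c, r, err) for c, err in results)
        if len(configs) == 1:
            break
        results.sort(key=lambda t: t[1])               # lower error is better
        configs = [c for c, _ in results[:max(1, len(configs) // eta)]]
        r *= eta                                       # survivors get eta times more resource
    return history
```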

Anatomy of Intermediate Evaluation Data

Different from previous works under this framework, HOIST utilizes all generated evaluation data to accelerate BO. We first analyze the intermediate evaluation data created by Hyperband, and then specify three properties of this data.

Hyperband produces two types of evaluation data in the SH loop: 1) intermediate evaluation data, obtained with a training resource less than the maximum $R$, and 2) complete evaluation data, gathered with the maximum training resource $R$. We use the example in Figure 1 to illustrate this. There are two early-stopping stages: the first early-stopping stage produces intermediate evaluation data with one unit of training resource, and the second stage creates intermediate evaluation data with 3 times more training resource. Besides, only one complete evaluation result with the maximum training resource is created in each SH loop. As mentioned before, BOHB simply uses the scarce complete evaluation data. Therefore, this method also suffers from the "insufficient evaluation data" problem, which greatly reduces the efficiency of BO.

Properties of Intermediate Evaluation Data

Based on our empirical studies, we summarize three properties of the intermediate evaluation data:

Property 1

A group of intermediate evaluation data $D_i$, obtained at a later early-stopping stage (larger $i$), has a smaller size and a larger training resource.

Property 2

A group of evaluation data $D_i$, obtained at the $i$-th early-stopping stage with the same training resource, conforms to a single distribution $P_i$. The intermediate evaluation data as a whole is thus sampled from several different distributions.

Property 3

The distribution $P_i$ associated with a larger training resource approximates the distribution of the objective function $f$ more accurately.

Property 1 follows directly from the SH loop of Hyperband. Next, we give an intuitive verification of Properties 2 and 3 via an experiment (see A3 in the Supplemental Material for more experiments). We evaluated 900 configurations of LeNet with two hyperparameters using different numbers of training epochs. Figure 2 visualizes the validation error as heat maps, where good configurations with low validation error are marked by the yellow regions. The first three panels illustrate the intermediate performance of the configurations after shorter training, and the last one displays the overall performance after full training. Because the yellow regions differ in shape and area across training budgets, Property 2 holds. Besides, the yellow region becomes more similar in shape to that of the fully trained configurations as the training budget grows. Therefore, Property 3 also holds.

According to Property 2, we cannot train a single surrogate on all of the mixed evaluation data, since doing so violates the i.i.d. criterion. Therefore, we need to design a new BO method to utilize these evaluation data.

HOIST

In this section, we first give an overview of HOIST, and then elaborate on each component.

Overview of HOIST

As shown in Figure 3, HOIST consists of three components:

Multiple basic surrogates

Instead of training a single surrogate, HOIST trains multiple basic surrogates on the mixed evaluation data using SMAC. The number of basic surrogates $K$ is determined by Hyperband's setting, and is usually less than 7.

Ensemble surrogate

Then HOIST combines these basic surrogates using weighted bagging to yield an ensemble surrogate, which provides a more accurate approximation of $f$. Each basic surrogate $M_i$ has a weight $w_i$, which determines the contribution made by $M_i$ to approximating $f$.

Weight vector learning

We design a learning method to learn the weight vector $w$ of the ensemble surrogate. This method updates the weight vector by measuring each basic surrogate's accuracy in approximating $f$.

In each SH loop of Hyperband, HOIST uses the ensemble surrogate to sample configurations, instead of sampling randomly. Concretely, HOIST utilizes the ensemble surrogate and EI to select promising configurations for each SH loop in Hyperband. When each loop finishes, HOIST updates all basic surrogates with the augmented evaluation data and learns a new weight vector to form a new ensemble surrogate.

Basic Components: Multiple Basic Surrogates

In order to take full advantage of the mixed evaluation data, HOIST uses SMAC to train $K$ basic surrogates. Each basic surrogate $M_i$ ($i < K$) models a group of intermediate evaluation data $D_i$ from the $i$-th early-stopping stage, and the surrogate $M_K$ models the complete evaluation data $D_K$. Following SMAC, the probabilistic form of surrogate $M_i$ is $p_i(y \mid x)$, which we abbreviate as $M_i$ for simplicity. After each SH loop in Hyperband, each $D_i$ is augmented with the new evaluation data from the corresponding early-stopping stage, and HOIST then updates the surrogate $M_i$ with the augmented $D_i$.
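A hedged sketch of this step, assuming the evaluation-history format returned by the successive-halving sketch above and using scikit-learn's random forest as a stand-in for SMAC's modified random forest:

```python
from collections import defaultdict
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_basic_surrogates(history):
    """Group the mixed evaluation data by training resource and fit one
    random-forest surrogate per group (one per early-stopping stage).
    `history` is a list of (config_vector, resource, validation_error)."""
    groups = defaultdict(list)
    for x, resource, err in history:
        groups[resource].append((x, err))
    surrogates = {}
    for resource, points in sorted(groups.items()):
        X = np.array([x for x, _ in points])
        y = np.array([err for _, err in points])
        surrogates[resource] = RandomForestRegressor(n_estimators=50).fit(X, y)
    return surrogates  # the group with the largest resource is the complete data
```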

Note that each basic surrogate $M_i$ has a different accuracy when approximating $f$. We summarize their characteristics as follows:

  1. For a larger $i$, $D_i$ has fewer data points, each obtained with a larger training resource (Property 1); hence $D_i$ may be insufficient to train an accurate surrogate $M_i$, but given enough data points, $M_i$ approximates $f$ more accurately (Property 3).

  2. For a smaller $i$, $D_i$ has more data points, each obtained with a smaller training resource (Property 1). $D_i$ is sufficient to train $M_i$; however, $M_i$ approximates $f$ less accurately (Property 3).

Based on the above discussion, no single surrogate can approximate $f$ accurately on its own. Therefore, we investigate how to combine these basic surrogates to provide an accurate approximation of $f$.

Ensemble Surrogate with Weighted Bagging

Inspired by the ensemble learning method of bagging, we combine multiple basic surrogates to obtain an ensemble surrogate, which provides a more accurate approximation of $f$. Instead of simply averaging all basic surrogates' predictions, we form the global surrogate using weighted bagging:

$$M_{ens}(x) = \sum_{i=1}^{K} w_i\, M_i(x), \qquad w_i \in [0, 1], \qquad \sum_{i=1}^{K} w_i = 1.$$

The weight vector $w$ determines the proportion of each surrogate's output in the global surrogate; a more accurate surrogate receives a larger proportion (a larger $w_i$). Given a configuration $x$, the global surrogate outputs a prediction $y$ of $f(x)$, which satisfies $y \sim \mathcal{N}(\mu_{ens}(x), \sigma^{2}_{ens}(x))$.

Here, for simplicity of calculation, we assume that the predictions of the basic surrogates are independent of one another. The mean and variance functions of the ensemble can then be defined as

$$\mu_{ens}(x) = \sum_{i=1}^{K} w_i\, \mu_i(x), \qquad \sigma^{2}_{ens}(x) = \sum_{i=1}^{K} w_i^{2}\, \sigma_i^{2}(x),$$

where the mean $\mu_i$ and variance $\sigma_i^{2}$ of each basic surrogate are given by SMAC.
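The following sketch illustrates the ensemble prediction under the independence assumption, again using scikit-learn random forests as stand-ins for the SMAC surrogates; estimating each surrogate's mean and variance from the spread of its individual trees is a simplification of SMAC's estimator.

```python
import numpy as np

def ensemble_predict(surrogates, weights, X):
    """Weighted-bagging ensemble prediction: the ensemble mean is the weighted
    sum of the basic surrogates' means, and the ensemble variance is the sum
    of their variances weighted by the squared weights. `surrogates` is a list
    of fitted RandomForestRegressor models; `weights` sums to 1."""
    means, variances = [], []
    for model in surrogates:
        tree_preds = np.stack([t.predict(X) for t in model.estimators_])
        means.append(tree_preds.mean(axis=0))      # per-surrogate mean mu_i(x)
        variances.append(tree_preds.var(axis=0))   # per-surrogate variance sigma_i^2(x)
    w = np.asarray(weights)[:, None]
    mu = (w * np.stack(means)).sum(axis=0)
    var = ((w ** 2) * np.stack(variances)).sum(axis=0)
    return mu, var
```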

Overall, due to the lack of training data, standard BO methods cannot approximate $f$ accurately. Our solution is to use an ensemble surrogate to represent $f$, which combines multiple basic surrogates trained on the mixed evaluation data to achieve a more accurate approximation. Next, we specify the learning method for the weight vector.

Weight Vector Learning Method

As described above, the weight $w_i$ should be proportional to the accuracy of $M_i$ in approximating $f$. If $M_i$ approximates $f$ more accurately, the predictions of $M_i$ and the true values of $f$ have a stronger relationship. Because the correlation coefficient effectively captures the degree of relationship between two quantities, we use it to calculate the weight vector. Since $f$ itself is unknown, we utilize the complete evaluation data $D_K$, which consists of samples of $f$, to calculate the correlation coefficient. Concretely, we first use each surrogate $M_i$ to predict the performance of the configurations in $D_K$. Then we calculate the correlation coefficient between the predicted performance from $M_i$ and the true performance recorded in $D_K$. The resulting correlation coefficients form the raw weight vector $w'$.

In addition, two techniques are designed to refine the raw weight vector. First, we use a weight amplification operation to decrease the weights of bad surrogates and amplify the weights of good surrogates. Since each raw weight lies in $[0, 1]$ after the max operation, this operation, namely the normalization of the squared raw weights, has a discriminative scaling effect on different weights. Second, we use the following rule to update $w$:

$$w \leftarrow \alpha\, w_{prev} + (1 - \alpha)\, \hat{w},$$

where $\hat{w}$ is the amplified weight vector and $\alpha$ is the update ratio. This smooth update prevents $w$ from changing drastically in the beginning, which it achieves by retaining a fraction of the weight vector from the previous step. In practice, we set the update ratio $\alpha$ to a fixed value and initialize all entries of $w$ equally.

Input: evaluation data $D_1, \dots, D_K$; surrogates $M_1, \dots, M_K$; current weight vector $w$; update ratio $\alpha$
Output: updated weight vector $w$
1  scale the evaluation performance in each $D_i$ using min-max normalization;
2  for $i = 1$ to $K$ do
3      $\hat{y}_i \leftarrow$ predictions of $M_i$ on $X_K$, the configurations in $D_K$;
4      $y_K \leftarrow$ the true performance values in $D_K$;
5      $w'_i \leftarrow \mathrm{corr}(\hat{y}_i, y_K)$;
6  end for
7  max operation: $w'_i \leftarrow \max(w'_i, 0)$;
8  weight amplification operation: $\hat{w}_i \leftarrow (w'_i)^2 / \sum_j (w'_j)^2$;
9  update weight vector: $w \leftarrow \alpha\, w + (1 - \alpha)\, \hat{w}$;
Algorithm 1: Weight Vector Learning Algorithm

Algorithm 1 shows the pseudocode of the weight vector learning method. In line 1, since the performance values in different $D_i$ have different numerical ranges, we use min-max normalization to rescale the values within each $D_i$ into $[0, 1]$; this linear normalization does not affect the correlation result. In lines 2-6, we calculate the raw weight vector based on the correlation coefficient. In line 7, the max operation ignores inaccurate surrogates with negative correlation by setting their raw weights to 0. In line 8, we conduct the weight amplification operation. Finally, the smooth update rule is applied in line 9.
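For illustration, here is a compact sketch of Algorithm 1 under the assumptions noted in the comments; the value of the update ratio and the exact form of the smooth update are assumptions, not the paper's reported settings.

```python
import numpy as np

def learn_weight_vector(surrogates, datasets, w_prev, alpha=0.5):
    """Sketch of Algorithm 1. `surrogates` is the list of K fitted basic
    surrogates, `datasets` the corresponding list of (X, y) groups with the
    complete evaluation data last, `w_prev` the previous weight vector, and
    `alpha` the update ratio (the value 0.5 is an assumption)."""
    X_K, y_K = datasets[-1]                           # complete evaluation data
    # Line 1: min-max normalize performance values (Pearson correlation is
    # invariant to this linear rescaling, as noted in the text).
    y_K = (y_K - y_K.min()) / (y_K.max() - y_K.min() + 1e-12)
    # Lines 2-6: correlation between each surrogate's predictions on the
    # completely evaluated configurations and their true performance.
    raw = np.array([np.corrcoef(m.predict(X_K), y_K)[0, 1] for m in surrogates])
    # Line 7: ignore inaccurate surrogates with negative correlation.
    raw = np.maximum(raw, 0.0)
    # Line 8: weight amplification: normalize the squared raw weights.
    amplified = raw ** 2 / (np.sum(raw ** 2) + 1e-12)
    # Line 9: smooth update, keeping a fraction of the previous weight vector
    # (the exact form of this rule is an assumption).
    return alpha * np.asarray(w_prev) + (1.0 - alpha) * amplified
```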

Experiments and Results

To evaluate our proposed hyperparameter optimization method, we performed experiments on a broad range of DNN tasks.

Experimental Setup

In our empirical evaluations of HOIST, we focused on the following four tuning tasks:

  • FCNet: We optimized 10 different hyperparameters of a feed-forward neural network (FCNet) on MNIST [LeCun et al.1998]; the maximum number of training epochs for each configuration is 81.

  • CNN: We optimized 25 different hyperparameters of a 6-layer convolutional neural network (CNN) on the CIFAR-10 [Krizhevsky, Sutskever, and Hinton2012] benchmark, with a fixed maximum number of training epochs per configuration.

  • RNN: We optimized 4 different hyperparameters of a recurrent neural network (RNN) on IMDB [Maas et al.2011], with a fixed maximum number of training epochs per configuration.

  • VAE: We optimized 4 different hyperparameters of a variational autoencoder (VAE) [Kingma and Welling2013]. We trained this network on MNIST to optimize the approximation of the variational lower bound, with a fixed maximum number of training epochs per configuration.

Figure 4: Tuning FCNet on MNIST

We compared HOIST with five baselines: 1) an effective vanilla BO method, SMAC (Vanilla BO); 2) a bandit-based early stopping method, Hyperband (HB); 3) a model-based Hyperband (HB-LCNet); 4) the combination of Hyperband with BO (BOHB); and 5) a batch BO method [González et al.2016] (Batch BO), which evaluates several configurations in parallel. Batch BO produces several times more complete evaluation data than vanilla BO within the same time, and thus can achieve better performance.

For each method, we tracked the wall clock time (including the optimization overhead and the cost of evaluations), and stored the smallest validation error after each evaluation. We ran each method 10 times with different random starts, and plotted the validation error averaged across these runs. To test the final performance, we applied the best models found by these methods to the test data and reported their test error. Finally, we compared these methods according to two metrics: 1) the time needed to reach the same validation error, and 2) the final performance on the test data. (See A1 in the Supplemental Material for the detailed hyperparameter descriptions and the implementations of these methods.)

FCNet on MNIST

In the first experiment, we trained FCNet on MNIST. This network contains two fully connected layers, and each layer is followed by a dropout layer [Srivastava et al.2014]. We optimized 10 hyperparameters that control the training procedure (learning rate, momentum, decay, batch size, L2 regularizer, dropout (two values, one per dropout layer), and batch normalization) and the architecture (units per layer).

Results for FCNet on MNIST

Figure 4 illustrates the results of tuning FCNet. In the beginning, with the help of intermediate evaluation data from early stopping, HOIST shows the fastest convergence among all methods. As the complete evaluation data increases, BOHB, Vanilla BO and Batch BO reach better performance than HB and HB-LCNet, but they still do not exceed HOIST. HOIST approximates $f$ accurately with a small evaluation overhead of 0.75 hours (i.e., 2800 seconds). In contrast, the other methods spend more than 5 hours (i.e., 18000 seconds) and still cannot find a configuration as good as the one HOIST finds within 0.75 hours. Therefore, HOIST achieves at least a 6-fold speedup for reaching the same validation error. Besides, the performance on the test data is given in Table 1, showing that HOIST finds a better configuration than the other approaches do.

Method FCNet CNN RNN VAE
Vanilla BO 7.49% 16.34% 15.39% 0.0992
Batch BO 7.47% 16.33% 15.27% 0.0996
HB 7.55% 17.48% 15.42% 0.0992
HB-LCNet 7.52% 17.68% 15.32% 0.0990
BOHB 7.48% 16.76% 15.41% 0.0991
HOIST 7.41% 16.19% 14.99% 0.0986
Table 1: Mean test error with the best hyperparameters found by each method. See A2 in the Supplemental Material for details about the variance.
Figure 5: Tuning CNNs on CIFAR-10

CNN on CIFAR-10

In the second experiment, we verified the effectiveness of HOIST in a high-dimensional hyperparameter space. We trained convolutional neural networks with 25 hyperparameters on CIFAR-10 without data augmentation. The number of convolutional layers is itself a hyperparameter, with a maximum value of 6. For each convolutional layer, its structure and training process are controlled by 3 hyperparameters: the number of filters, the kernel initialization, and the kernel regularization. Dropout is optionally applied to the fully connected layer. Besides, RMSprop with 3 hyperparameters is used to train this network. To sum up, there are 25 hyperparameters in total: 7 training hyperparameters plus 6 layers × 3 hyperparameters per layer.

Results for CNN on CIFAR-10

Figure 5 illustrates the speedups that our method yields in a high-dimensional hyperparameter space. Since the amount of training data needed increases exponentially with the dimension of the hyperparameter space, BO-related methods fail to optimize the hyperparameters well within 3.75 hours (i.e., 13500 seconds). Specifically, HB-LCNet, Batch BO and BOHB achieve a performance similar to the bandit-based search HB, and Vanilla BO obtains the worst performance. In contrast, HOIST achieves a low validation error within 3.75 hours, while it takes the other methods nearly 11.25 hours (i.e., 40500 seconds) to reach a similar validation error. Therefore, HOIST yields a 3-fold speedup in wall clock time, showing that our method can effectively handle scenarios with a high-dimensional hyperparameter space. Finally, Table 1 lists the test performance and shows that HOIST obtains the best result on the test data.

Figure 6: Tuning RNNs on IMDB

RNN on IMDB

In this experiment, although the quality of the intermediate evaluation data is poor, HOIST still achieves competitive performance. We performed sentiment classification with LSTMs [Hochreiter and Schmidhuber1997] on IMDB. The maximum sequence length is set to 250, and the word vectors come from GloVe [Pennington, Socher, and Manning2014]. The input data is fed into an LSTM network, and a dropout layer [Gal and Ghahramani2016] wraps the LSTM cell. Finally, a softmax layer with 2 units produces the label. In this RNN, we optimized the following hyperparameters: 1) the number of LSTM units, 2) the learning rate of the Adam optimizer [Kingma and Ba2014], 3) the batch size, and 4) the keep probability of the dropout layer. Overall, this network has 4 hyperparameters (see A1 in the Supplemental Material for details).

Results for RNN on IMDB

Figure 6 illustrates the performance of the different methods on the RNN task. In the beginning, all baselines except Vanilla BO perform better than HOIST. After analyzing the weight vector in HOIST, we found that only a small part of the intermediate evaluation data is useful for revealing information about $f$; in other words, most entries of the weight vector are zero or near zero. The poor quality of the intermediate evaluation data limits the convergence speed of HOIST in the first 3.47 hours (i.e., 12500 seconds). As the amount of useful intermediate evaluation data increases, HOIST then quickly outperforms all baselines. This shows that HOIST is able to efficiently gather useful information even from intermediate evaluation data of poor quality. Furthermore, HOIST reaches a low validation error within approximately 4.7 hours (i.e., 17000 seconds), whereas the other methods take more than 27.7 hours and still cannot reach the same validation error. Therefore, our method achieves at least a 5.8-fold speedup. In addition, Table 1 shows that HOIST reaches the best performance on the test data.

Figure 7: Tuning VAEs on MNIST

VAE on MNIST

In this experiment, we optimized the variational lower bound of a variational autoencoder with the same architecture as in auto-encoding variational Bayes [Kingma and Welling2013]. In the VAE, the number of hidden units in the encoder/decoder and the dimension of the latent space determine the architecture, while the learning rate of Adam and the batch size control the training procedure. Therefore, the VAE has 4 hyperparameters (see A1 in the Supplemental Material for details).

Results for VAE on MNIST

Figure 7 illustrates the results of tuning the VAE. Similar to the results on FCNet, HOIST shows the fastest convergence in the beginning. Because BO-based methods (e.g., BOHB, Vanilla BO and Batch BO) have very few complete evaluation data points, they converge slowly in this period. As the complete evaluation data increases, the baselines gradually converge, but their results, obtained after more than 28 hours, are still worse than the result HOIST obtains within 5.5 hours (i.e., 20000 seconds). To summarize, HOIST achieves at least a 5-fold speedup for reaching a similar validation performance. Besides, as shown in Table 1, HOIST obtains the best performance on the test dataset. This demonstrates the effectiveness of HOIST in accelerating hyperparameter optimization by exploiting the intermediate evaluation data.

Conclusion

In this paper, we introduced HOIST, a fast hyperparameter optimization method that utilizes both the complete evaluation data and the intermediate evaluation data from early stopping to speed up hyperparameter optimization. We proposed a novel ensemble method, which combines multiple basic surrogates to provide a more accurate approximation of the objective function. We evaluated the performance of HOIST on a broad range of benchmarks and demonstrated its superiority over the state-of-the-art approaches. In addition, HOIST is in fact a general ensemble framework for accelerating hyperparameter optimization, and it also applies to other BO methods and early stopping methods. In future work, we plan to explore 1) embedding HOIST into parallel and distributed computing environments, and 2) using transfer learning techniques to combine the basic surrogates.

References

  • [Bergstra and Bengio2012] Bergstra, J., and Bengio, Y. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13(Feb):281–305.
  • [Bergstra et al.2011] Bergstra, J. S.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, 2546–2554.
  • [Bergstra, Yamins, and Cox2013] Bergstra, J.; Yamins, D.; and Cox, D. D. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.
  • [Breiman2001] Breiman, L. 2001. Random forests. Machine learning 45(1):5–32.
  • [Domhan, Springenberg, and Hutter2015] Domhan, T.; Springenberg, J. T.; and Hutter, F. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, volume 15, 3460–8.
  • [Eggensperger et al.2013] Eggensperger, K.; Feurer, M.; Hutter, F.; Bergstra, J.; Snoek, J.; Hoos, H.; and Leyton-Brown, K. 2013. Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NIPS workshop on Bayesian Optimization in Theory and Practice, volume 10,  3.
  • [Eggensperger et al.2015] Eggensperger, K.; Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2015. Efficient benchmarking of hyperparameter optimizers via surrogates. In AAAI, 1114–1120.
  • [Falkner, Klein, and Hutter2018a] Falkner, S.; Klein, A.; and Hutter, F. 2018a. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the 35th International Conference on Machine Learning, 1436–1445.
  • [Falkner, Klein, and Hutter2018b] Falkner, S.; Klein, A.; and Hutter, F. 2018b. Practical hyperparameter optimization for deep learning.

  • [Gal and Ghahramani2016] Gal, Y., and Ghahramani, Z. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, 1019–1027.
  • [González et al.2016] González, J.; Dai, Z.; Hennig, P.; and Lawrence, N. 2016. Batch bayesian optimization via local penalization. In Artificial Intelligence and Statistics, 648–657.
  • [Goodfellow, Bengio, and Courville2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Hutter, Hoos, and Leyton-Brown2011] Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, 507–523. Springer.
  • [Ilievski et al.2017] Ilievski, I.; Akhtar, T.; Feng, J.; and Shoemaker, C. A. 2017. Efficient hyperparameter optimization for deep learning algorithms using deterministic rbf surrogates. In AAAI, 822–829.
  • [Jamieson and Talwalkar2016] Jamieson, K., and Talwalkar, A. 2016. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, 240–248.
  • [Jones, Schonlau, and Welch1998] Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient global optimization of expensive black-box functions. Journal of Global optimization 13(4):455–492.
  • [Kandasamy et al.2017] Kandasamy, K.; Dasarathy, G.; Schneider, J.; and Poczos, B. 2017. Multi-fidelity bayesian optimisation with continuous approximations. arXiv preprint arXiv:1703.06240.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [Klein et al.2016] Klein, A.; Falkner, S.; Bartels, S.; Hennig, P.; and Hutter, F. 2016. Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079.
  • [Klein et al.2017] Klein, A.; Falkner, S.; Springenberg, J. T.; and Hutter, F. 2017. Learning curve prediction with bayesian neural networks. ICLR.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.
  • [LeCun et al.1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • [Li et al.2018] Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; and Talwalkar, A. 2018. Hyperband: A novel bandit-based approach to hyperparameter optimization. ICLR 1–48.
  • [Maas et al.2011] Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In ACL, 142–150. Association for Computational Linguistics.
  • [Močkus1975] Močkus, J. 1975. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, 400–404. Springer.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
  • [Poloczek, Wang, and Frazier2017] Poloczek, M.; Wang, J.; and Frazier, P. 2017. Multi-information source optimization. In Advances in Neural Information Processing Systems, 4288–4298.
  • [Rasmussen2004] Rasmussen, C. E. 2004. Gaussian processes in machine learning. In Advanced lectures on machine learning. Springer. 63–71.
  • [Snoek, Larochelle, and Adams2012] Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
  • [Swersky, Snoek, and Adams2013] Swersky, K.; Snoek, J.; and Adams, R. P. 2013. Multi-task bayesian optimization. In Advances in neural information processing systems, 2004–2012.
  • [Swersky, Snoek, and Adams2014] Swersky, K.; Snoek, J.; and Adams, R. P. 2014. Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896.