Hyperparameter optimization, a classical problem in model selection, is receiving increasing attention due to the rise of automated machine learning, where not only traditional hyperparameters such as learning rates (Zeiler, 2012), but also neural architectures (Liu et al., 2019), data augmentation policies (Cubuk et al., 2018) and other plausible variables are to be set automatically (Elsken et al., 2019; Zöller and Huber, 2019). This can be viewed as a search problem in a large hyperparameter search space: the goal is to find the hyperparameters that maximize the generalization capability of the model trained with them. There are three main challenges: the initial design (Jones et al., 1998; Konen et al., 2011; Brockhoff et al., 2015; Zhang et al., 2019), the sampling method (Rasmussen, 2003; Hutter et al., 2011; Bergstra et al., 2011; Springenberg et al., 2016; Srinivas et al., 2010; Hennig and Schuler, 2012; Hernández-Lobato et al., 2014; Wang and Jegelka, 2017; Ru et al., 2018), and the evaluation method (Thornton et al., 2013; Karnin et al., 2013; Jamieson and Talwalkar, 2016; Li et al., 2016; Falkner et al., 2018).
In the literature, initial designs including the Latin hypercube design (Jones et al., 1998) and orthogonal arrays (Zhang et al., 2019) are used to replace random initialization in order to improve experimental results and reduce the time required for convergence. One straightforward sampling method is grid search (Wu and Hamada, 2011) or random search (Bergstra and Bengio, 2012). However, the required number of function evaluations, which involve model training and are thus computationally expensive, grows exponentially with the number of hyperparameters. Population-based methods (Hansen, 2016; Jaderberg et al., 2017) such as genetic algorithms, evolutionary strategies, and particle swarm optimization use guided search and usually outperform random search, but evolving the population is often time-consuming. To develop efficient sampling strategies, more attention has been paid to model-based methods, in which a surrogate model of the objective function is built so that the next hyperparameters to evaluate can be chosen in an informed manner. Among them, Bayesian optimization (BO) has become popular with different probabilistic surrogate models, such as Gaussian processes (GPs) (Snoek et al., 2012), random forests (Hutter et al., 2011), or the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011). These model-based methods have been shown to outperform random search in terms of sample efficiency (Thornton et al., 2013; Snoek et al., 2015).
In this paper we focus on the hyperparameter evaluation problem. In many machine learning problems such as Neural Architecture Search (NAS) and Data Augmentation (DA), neural architectures and augmentation policies can be parameterized. In most scenarios, the performance of a hypothesis relies heavily on these hyperparameters, and each evaluation of a hyperparameter configuration is expensive. Thus, an efficient evaluation method is a key component of hyperparameter optimization (HPO). A general solution for evaluating these hyperparameters, neural architectures or augmentation policies from a large search space quickly, accurately and robustly is desirable.
While many previous studies focus on sampling strategies, multi-fidelity methods concentrate on accelerating function evaluation, which can be extremely expensive for big data and complex models (Krizhevsky et al., 2012). These methods use low-fidelity results obtained with a small amount of resources to select configurations quickly; see Feurer and Hutter (2018) for details. The successive halving (SH) algorithm (Jamieson and Talwalkar, 2016) was proposed to identify the best configuration among a given set of configurations. It evaluates all hyperparameter configurations, discards the worse half, doubles the budget and repeats until only one configuration is left. However, SH allocates exponentially more resources to more promising configurations. Besides, the number of configurations is required as an input, but there is no guidance for choosing it. HyperBand (HB) (Li et al., 2016) addresses this trade-off by running the SH algorithm with multiple values of this number, calling each such run of SH a “bracket”. Falkner et al. (2018) further combined Bayesian optimization (BO) and HyperBand (HB) into BOHB, which uses the information in the current bracket to fit a Bayesian surrogate model for sampling configurations in the next bracket. At first sight, this provides a complete solution to the problem of slow evaluation in hyperparameter optimization. However, it is worth noting that the HB algorithm was proposed for identifying the best configuration in the candidate set, not for collecting data to estimate the sampling model in BO. Hence, combining BO and HB directly is not appropriate, an issue that is addressed in our work. These highly relevant works are explained in more detail in Section 2.
Recently, there have been many interesting works on HPO. Paul et al. (2019) proposed an algorithm for tuning hyperparameters of policy gradient methods in RL. Keshtkaran and Pandarinath (2019) tuned hyperparameters of a sequential autoencoder for spiking neural data. Law et al. (2019) learned a joint model on hyperparameters and data representations from many previous tasks. Li et al. (2020) considered hyperparameter selection when fine-tuning from one domain to another. Klein et al. (2019) presented a meta-surrogate model for task generation for HPO, which facilitates the development and benchmarking of HPO algorithms.
In this paper we focus on the need for collecting high-quality data with multi-fidelity methods. We propose a new efficient and general-purpose algorithm for fast evaluation called Sub-Sampling (SS). In contrast to the HB algorithm, SS evaluates the potential of the configurations based on sub-samples of the observations. It does not focus only on finding the best configuration, as SH does, but instead guarantees good overall performance in the sense that the cumulative regret of SS is asymptotically optimal. Section 5 provides the theoretical analysis, and Figure 1 compares SS and SH on one example. SS shows a jump in the figure since its criterion is based on the potential rather than the current performance of the configurations.
Further, we combine BO and SS into an algorithm called BOSS and apply it to many popular machine learning tasks in Section 8. The surrogate model in the BO framework is estimated more accurately from the promising data obtained by SS. In the next bracket, we can then sample better configurations from the model with higher probability. As a result, the trajectory of the search procedure is more reliable. Extensive experiments on various problems, including Neural Architecture Search (NAS), Data Augmentation (DA), bounding box scaling in Object Detection (OD), and tuning the hyperparameters of PPO (Schulman et al., 2017) in Reinforcement Learning (RL), show the superior performance of BOSS.
2 Related Work
In this section, we provide some details on multi-fidelity methods and point out the shortcomings of existing methods.
2.1 Successive Halving
The basic objective of multi-fidelity methods is to identify the best configuration out of a given finite set of configurations based on low-fidelity approximations of their performance. To achieve this objective, hyperparameter configurations can be dropped if they perform badly with small computing resources. Based on this idea, Jamieson and Talwalkar (2016) proposed using the Successive Halving algorithm, originally introduced by Karnin et al. (2013), for HPO. As described in Algorithm 1, SH queries all configurations with a given initial budget each, removes the half that performed worst, doubles the budget, and repeats until only a single configuration is left. The main theorem in Jamieson and Talwalkar (2016) indicates that SH returns the true best configuration when the total budget used is larger than a certain value. SH is an extremely simple yet powerful, and therefore popular, selection policy among multi-fidelity methods. Li et al. (2018) presented the Asynchronous Successive Halving Algorithm (ASHA), a practical parallel version of SH for the setting where configurations typically outnumber the available parallel workers by orders of magnitude.
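As a rough illustration of the halving loop described in this subsection, the following minimal Python sketch keeps the better half of the configurations in each round and doubles the budget. The function name `successive_halving` and the `evaluate(config, budget)` callback are hypothetical names introduced here, not taken from Algorithm 1, and lower returned values are assumed to be better.

```python
def successive_halving(configs, evaluate, min_budget=1):
    """Return the single configuration that survives repeated halving.

    `evaluate(config, budget)` is a hypothetical callback returning a loss
    (lower is better) obtained with the given budget.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Evaluate every survivor at the current budget and rank by loss.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Keep the better half (at least one), then double the budget.
        survivors = ranked[:max(1, len(ranked) // 2)]
        budget *= 2
    return survivors[0]
```

With a noise-free `evaluate`, the sketch returns the configuration with the lowest loss; with noisy low-budget evaluations, a good configuration can be discarded in an early round and is never revisited.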
2.2 HyperBand
As discussed in Section 1, SH suffers from the trade-off between the budget and the number of configurations. Given a total budget, users need to decide the number of configurations. Trying many configurations and assigning a small budget to each can result in prematurely terminating good configurations, while trying only a few and assigning them a larger budget can result in wasting resources on evaluating poor configurations.
HyperBand (Li et al., 2016) was proposed to combat this trade-off in SH. It tries different numbers of configurations and calls the SH algorithm as a subroutine. As shown in Algorithm 2, HB has two components: (1) the inner loop invokes SH for a fixed number of configurations and a fixed initial budget per configuration; (2) the outer loop iterates over different values of these two quantities.
HyperBand begins with the most aggressive bracket, which sets the number of configurations to maximize exploration, subject to the constraint that at least one configuration is allocated the maximum resources. Each subsequent bracket reduces the number of configurations by an approximately constant factor until the final bracket, in which every configuration is allocated the maximum resources (this bracket simply performs classical random search).
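To make the bracket structure concrete, the sketch below lists the initial number of configurations and the initial per-configuration budget of each bracket. The formulas follow the HyperBand algorithm as published by Li et al. and are an assumption here, since Algorithm 2 is not reproduced in this text; the names `hyperband_brackets`, `max_budget` and `eta` are illustrative only, and `eta` is assumed to be an integer.

```python
def hyperband_brackets(max_budget, eta=3):
    """List (s, n, r) per bracket, from the most to the least aggressive.

    s: bracket index, n: initial number of configurations,
    r: initial budget given to each configuration.
    """
    # Largest s with eta**s <= max_budget, computed with integers to avoid
    # floating-point error in a logarithm.
    s_max = 0
    while eta ** (s_max + 1) <= max_budget:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):
        total = (s_max + 1) * eta ** s
        n = -(-total // (s + 1))      # integer ceiling of total / (s + 1)
        r = max_budget / eta ** s     # each halving round multiplies r by eta
        brackets.append((s, n, r))
    return brackets
```

For example, `hyperband_brackets(81, eta=3)` starts with a bracket of 81 configurations at budget 1 and ends with a bracket of 5 configurations that each receive the full budget of 81, which corresponds to plain random search.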
In practice, HB works very well and typically outperforms random search and Bayesian optimization methods for small total budgets. However, the brackets in HB are independent of each other, and the configurations in the next bracket are still sampled randomly without any guidance, which means that the information from previous brackets is wasted.
2.3 Bayesian Optimization and HyperBand
To overcome the limitation that HB does not adapt the configuration sampling to the evaluations, the recent approach BOHB (Falkner et al., 2018) combines Bayesian optimization and HyperBand. Its idea is to link the brackets through a Bayesian model and to replace HB’s random search by BO. In particular, as shown in Algorithm 3, BOHB relies on HB to determine how many resources are used to evaluate configurations, but it replaces the random selection of configurations in each bracket by a model-based search. Once the number of configurations in a bracket is determined, the standard SH algorithm runs with these configurations. The configurations and their performance are then collected to fit a Bayesian surrogate model, and the configurations for the next bracket are sampled from this model. Compared to the original HB, BOHB thus guides the search. Two other works have also attempted to combine Bayesian optimization with HyperBand. Bertrand et al. (2017) used a Gaussian process with a squared exponential kernel as the Bayesian surrogate model instead of the TPE method of BOHB and dealt with model selection tasks. Wang et al. (2018) sampled trial points one by one using BO, building the relation among configurations within a bracket rather than between brackets.
Unfortunately, HB is designed to identify the best configuration. It gives no guarantee on the evaluation of the other configurations, whereas BO requires that all configurations used to estimate the model, not only the best one, be evaluated well. In other words, combining the two directly can lead to a wrong estimation of the surrogate model. Therefore, it is necessary to propose another criterion for fast evaluation.
2.4 Bandit-based Methods concerning Cumulative Regret
In the literature, many bandit policies have been proposed to achieve optimal cumulative regret, which means that they pursue an optimal sequence of configurations rather than only the finally returned configuration. Lai and Robbins (1985) gave an asymptotic lower bound for the regret in the multi-armed bandit problem and proposed an index strategy that achieves this bound. Lai (1987) showed that when the probability distributions belong to a specified exponential family, a policy that pulls the arm with the largest upper confidence bound (UCB) is optimal. The UCB policy is constructed from the Kullback-Leibler (KL) information between the estimated observation distributions of the arms. Agrawal (1995) modified the UCB policy so that it does not require knowing the total sample size. The first work to venture outside the realm of parametric modeling assumptions, which better reflects applications, appeared in Yakowitz and Lowe (1991). As opposed to the traditional multi-armed bandit problem, they proposed non-parametric policies that are not based on KL information but instead rely on some moment conditions. Auer et al. (2002) provided the policy named UCB, which achieves logarithmic regret if the observation distributions are supported on $[0, 1]$. Chan (2019) proposed an efficient non-parametric solution and proved optimal efficiency of the policy. However, the observation distribution there must belong to a one-parameter exponential family, an assumption that is extended in our work. Moreover, that work only handled the standard multi-armed bandit problem, while HPO is the ultimate goal in our work.
3 Multi-Armed Bandit Problem for HPO
In this section, we introduce the notation and the standard multi-armed bandit problem, in which the arms are configurations, in order to discuss another criterion, the cumulative regret.
Recall the traditional setup for the classic multi-armed bandit problem. Let be a given set of configurations. Consider a sequential procedure based on past observations , where represents the observation time from the -th configuration. Let be the number of observations from the -th configuration with different budgets, and is the number of total observations. For each configuration, the observations are assumed to be independent and identically distributed with expectation given by and . In HPO problems, the randomness of the observations comes from the randomness of initialization, because the performance of an under-trained neural network with small budgets strongly depends on its initialization. For simplicity, assume without loss of generality that the best configuration is unique, which is also assumed in Perchet and Rigollet (2013) and Chan (2019).
is a sequence of random variables denoting that at each time , the configuration is selected for evaluation. Note that depends only on previous observations. The objective of a good policy is to minimize the cumulative regret
Note that for a data-driven policy , the regret monotonically increases with respect to . Hence, minimizing the growth rate of is an important criterion, which is considered in the later sections. The successive elimination method (Perchet and Rigollet, 2013) resembles SH in that it eliminates bad configurations successively, and it comes with an upper bound on the cumulative regret for any . However, its asymptotic behavior, which is what matters for big data, is not optimal.
An arm allocation policy is said to be uniformly good if
Moreover, if is uniformly good and some additional regularity conditions are satisfied, Lai and Robbins (1985) provided a lower bound of the growth rate:
is the Kullback-Leibler divergence between density functions and , that is,
where denotes expectation with respect to . Chan (2019) proposed an arm allocation policy that, under a strong assumption on the reward distribution, achieves the lower bound in Equation (1). We say that such a policy has the optimal rate. If the growth of the cumulative regret has the order of , the policy is called nearly optimal.
4 Proposed Sub-Sampling for Efficient Hyperparameter Evaluation
In this section, we propose a novel efficient nonparametric solution called Sub-Sampling (SS) for evaluating a pool of configurations that minimizes the total regret. The main idea is to assign more budget to the configurations with more potential. Let be the validation loss of at the -th evaluation, be the number of evaluations of so far, and be the total number of evaluations for all the configurations. To compare configurations when selecting the next ones to evaluate with the designed budgets, SS uses all available data generated so far from different rounds with different initializations (see footnote 1) to measure the potential of configurations, while SH uses only the data from the current round. Note that in the early stage of network training, the performance relies strongly on the initialization, which is highly random. SS uses data with samples of for each to reduce the impact of this randomness, which reduces the probability of misjudgment in SS.
Footnote 1: is the accuracy of the corresponding network trained with the budget used in this round and the -th initialization.
Specifically, is said to have more potential than , denoted by , if and (a) or (b) and , for some , where . In this definition, sub-samples from samples of are used to compare and , hence the name sub-sampling. The parameter balances exploration and exploitation. It is a nonnegative increasing threshold for SS such that and as .
Case (a) indicates that has been observed too few times for its potential to be judged, so needs to be explored. Case (b) indicates that has the potential to exceed , although its current performance may be worse than that of . Hence, needs to be exploited with more budget.
The proposed SS method is described in Algorithm 4. The sequence of the configurations is a sampling strategy for HPO, where is the number of total observations. This strategy is a sequence of random variables denoting that at each time , the -th configuration is selected for evaluation. Let denote the round number. In the first round, all configurations are evaluated with the minimum budget since there is no information about them. In round , we select the leader, which is the configuration evaluated the most times, and the budgets increase as . If two configurations have the same number of observations , we choose the one with lower as the leader. In each round , every non-leader that has more potential than the leader is evaluated; if no non-leader qualifies, the leader is evaluated.
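The round structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a transcription of Algorithm 4: the comparison rule follows the sub-sampling idea (a challenger with fewer observations is compared against every block of consecutive leader observations of the same length), the threshold is assumed to be the square root of the logarithm of the total number of observations, ties for the leader are assumed to be broken by the lower mean loss, and the names `has_more_potential`, `sub_sampling`, `evaluate` and `budget_schedule` are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def has_more_potential(cand, leader, threshold):
    """Assumed rule: does `cand` (list of losses) challenge `leader`?"""
    n_c, n_l = len(cand), len(leader)
    if n_c >= n_l:
        return False
    if n_c < threshold:            # case (a): observed too few times
        return True
    # case (b): cand is at least as good as some consecutive sub-sample
    # of the leader that has the same number of observations.
    for j in range(n_l - n_c + 1):
        if mean(cand) <= mean(leader[j:j + n_c]):
            return True
    return False


def sub_sampling(num_configs, evaluate, total_evals,
                 budget_schedule=lambda round_no: 1):
    """Return the list of observed losses per configuration."""
    # Round 1: evaluate every configuration once.
    losses = [[evaluate(k, budget_schedule(1))] for k in range(num_configs)]
    round_no = 1
    while sum(len(obs) for obs in losses) < total_evals:
        round_no += 1
        budget = budget_schedule(round_no)
        total = sum(len(obs) for obs in losses)
        threshold = math.sqrt(math.log(total))   # assumed slowly growing
        # Leader: most observations, ties broken by lower mean loss.
        leader = min(range(num_configs),
                     key=lambda k: (-len(losses[k]), mean(losses[k])))
        challengers = [k for k in range(num_configs)
                       if k != leader
                       and has_more_potential(losses[k], losses[leader],
                                              threshold)]
        # Evaluate all challengers, or the leader if there are none.
        for k in challengers or [leader]:
            losses[k].append(evaluate(k, budget))
    return losses
```

Because every configuration keeps its observation history, a configuration that looked poor early on can still be selected again later, either through case (a) as the threshold grows or through case (b) when it matches a sub-sample of the leader.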
SS vs SH
In HPO problems, the randomness of the early observations comes from the randomness of initialization, since the performance of an under-trained neural network with small budgets strongly depends on its initialization. Thus, the comparisons at the first stage of SH are unreliable, and the abandoned configurations have no chance of being evaluated again. In contrast, SS uses data from all stages to weaken the impact of the initial values and always gives each configuration a chance to take part in the comparison of potential. Consequently, SS obtains more reliable results. In Section 8.1, Table 1 shows that SS finds the optimal configuration more accurately than SH, especially when the randomness is high.
Both SS and SH differ from the UCB-based procedures (Burtini et al., 2015), which must know the underlying probability distributions to measure the potential of a configuration by an upper confidence bound of the observation value. However, only SS achieves asymptotically optimal efficiency, which will be discussed in Section 5.
It should be noted that sub-sampling is also used in Chan (2019) for multi-armed bandit (MAB). One key difference is that different comparing criteria for different distributions are needed in Chan (2019) while we can use the same criterion for these distributions and achieve the optimality (Section 5).
5 Theoretical Results
In this section, we prove that the proposed SS method is asymptotically optimal using the tools from multi-armed bandit (MAB) (Agarwal et al., 2012; Sparks et al., 2015; Jamieson and Talwalkar, 2016). Each arm corresponds to a fixed hyperparameter setting, the arm collection corresponds to the set of configurations , pulling an arm corresponds to a fixed number of training iterations, budget corresponds to the number of samples in one pull and the loss corresponds to the validation loss.
Given an arm allocation policy , consider the cumulative regret , where and . Lai and Robbins (1985) provided an asymptotic lower bound of with , where is a density function of , , and the function is the Kullback-Leibler divergence. The cumulative regret is called near optimality if , or optimality if it achieves this optimal growth rate specified by the lower bound.
In the literature, most researchers only considered a one-parameter exponential family (Perchet and Rigollet, 2013; Chan, 2019). For wider applications, we study a general exponential family defined by
where functions and are decided by a specific probability distribution. When are included with . Let for . Let , where and , are the corresponding parameters. Note that and . Let be the inverse function of . The following theorem gives the property of the proposed sub-sampling method (Algorithm 4) for the exponential family. The near optimality is obtained by bounding the tails of distributions. The proof is given in the Appendix.
and is thus nearly optimal.
Since the large deviation rate function is not divergence under the exponential family, this upper bound is not optimal. But, it has the nearly optimal order of . Consider the classical case of the one-parameter exponential family, i.e., for any , the large deviation rate function turns out to be divergence by direct calculation. It means that the sub-sampling method is optimal under the one-parameter exponential family (Chan, 2019). Corollary 5 reveals that the proposed policy is optimal for the one-parameter exponential family. The upper bound of the regret given in Perchet and Rigollet (2013) does not have this optimality.
For the one-parameter exponential family, the SS policy given in Algorithm 4 satisfies
and is thus optimal.
6 Bayesian Optimization via Sub-Sampling
In this section, we propose a novel algorithm called BOSS, which combines Bayesian Optimization (BO) and Sub-Sampling (SS), to search out the optimal configurations efficiently and reliably.
Bayesian optimization is a sequential design strategy for optimizing black-box functions. In hyperparameter optimization problems, the validation performance of a machine learning algorithm can be regarded as a function of hyperparameters and the goal is to find the optimal . In most cases, does not admit an analytic form, so it is approximated in BO by a surrogate model, such as Gaussian processes, random forests, or TPE, based on the data collected on the fly, where and the error term depends on the budget and satisfies . The standard algorithmic procedure of BO is as follows.
1. Assume an initial surrogate model.
2. Compute an acquisition function based on the current model.
3. Sample a batch of hyperparameter configurations based on the acquisition function.
4. Evaluate the configurations and refit the model.
5. Repeat steps 2-4 until the stop condition is reached.
BO methods differ in the surrogate model used for the conditional probability . In this work, we adopt TPE (Bergstra et al., 2011) as the surrogate model, which uses kernel density estimators to model the data densities and can handle both discrete and continuous hyperparameters simultaneously. The specific procedure of TPE with the expected improvement (EI) as the acquisition function is given as follows.
This strategy models instead of and uses two densities below to estimate:
where is a given demarcation point whose value will be discussed later. is the density of the configurations whose observation was less than and is assumed to be independent of the specific value of . It is estimated by a kernel density estimator from the configurations whose corresponding observation was less than . Similarly, is the density formed from the remaining observations.
According to the model, we set an acquisition function to sample next point that is most likely to be optimal. A common acquisition function is the expected improvement (EI):
where is the indicator function.
Note that the observations usually represent the value of a loss function in model training, so we pay more attention to the region of lower losses. Hence, the estimation of is much more important, and a smaller is preferable. Unfortunately, if is too small, there is not enough data for the estimation. In practice, the TPE algorithm chooses to be some quantile of the observed values.
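The TPE procedure described above can be illustrated with a one-dimensional Python sketch. It relies on the standard TPE result of Bergstra et al. that maximizing EI amounts to maximizing the ratio of the density of the good configurations to that of the remaining ones. Everything named here (`kde`, `tpe_suggest`, the symbols `l`, `g`, `y_star`, `gamma`, the fixed Gaussian bandwidth and the candidate count) is an assumption made for illustration and is not the implementation used in this paper; the sketch also assumes at least one observation on each side of the demarcation point.

```python
import math
import random


def kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a function."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                   for p in points) / norm

    return density


def tpe_suggest(xs, ys, gamma=0.25, bandwidth=0.1, n_candidates=64, seed=0):
    """Suggest the next hyperparameter value given losses ys at points xs."""
    rng = random.Random(seed)
    # Demarcation point: an empirical gamma-quantile of the observed losses.
    y_star = sorted(ys)[int(gamma * len(ys))]
    good = [x for x, y in zip(xs, ys) if y < y_star]
    bad = [x for x, y in zip(xs, ys) if y >= y_star]
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from l by perturbing randomly chosen good points.
    candidates = [rng.choice(good) + rng.gauss(0.0, bandwidth)
                  for _ in range(n_candidates)]
    # EI is maximized where l(x) / g(x) is largest.
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))
```

A smaller `gamma` concentrates `l` on the lowest losses but leaves fewer points to estimate it, which is the trade-off noted in the paragraph above.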
In the BO literature, surrogate models (step 1) (Springenberg et al., 2016), acquisition functions (step 2) (Wang and Jegelka, 2017) and batch sampling methods (step 3) (González et al., 2016) are well studied, but the problem of evaluating hyperparameter configurations (step 4) remains largely open. While one could evaluate a hyperparameter configuration by training the corresponding model until convergence, this is quite expensive for complex models and big data, especially when there are many configurations to evaluate. In BOHB, this issue is addressed with SH. However, SH only guarantees the performance of the best configuration, implying that the overall data used to estimate the surrogate model can be extremely poor. In the multi-armed bandit setting, the total regret is a more suitable criterion than the regret of the best configuration alone, which is the key motivation of our work.
The proposed algorithm BOSS is described in Algorithm 5. The key in our design is a sub-sampling method for hyperparameter configurations evaluation, which we theoretically prove to be asymptotically optimal.
During initialization, we calculate the maximum stages through the maximum budget allowed for a single configuration and the ratio . Then, the number of configurations and the minimum budget in each bracket are obtained. For initialization, we sample the configurations from a uniform distribution and call the SS algorithm to evaluate . The collected data helps to refit the model and update the acquisition function for sampling in the next bracket. This BO framework gives the final best configuration.
7 Asynchronous Parallelization
In the BO framework, parallelization cannot happen across brackets since the model needs to be updated between them. Hence, parallel acceleration occurs within each bracket: BOHB and BOSS can only be accelerated inside SH and SS, respectively. The difference is that in SH we know the number of configurations that need to be evaluated in the next round, so good configurations can be promoted in advance without waiting until all configurations are evaluated. This asynchronous parallelization is a simple version of ASHA (Li et al., 2018) and is implemented in BOHB. For SS, we do not know the number of configurations that need to be evaluated in the next round, which makes ASHA-style asynchronous parallelization fail.
For the same reason, the total budget is known in BOHB but not in BOSS. Consequently, the two methods can only be compared under the same maximum budget, not under the same total budget. To allow a fair comparison and asynchronous parallelization, we propose a modified version.
We modify Algorithm 4 to a new version by defining a criterion to sort these configurations in round , where . If the conserved factor is larger, case (a) in Section 4 will have a higher priority than case (b). The modified sub-sampling (MSS) algorithm is described in Algorithm 6.
The procedure of MSS is similar to SH except that the sorting criterion is changed from to . Therefore, it is easy to utilize parallel resources like ASHA. Finally, BOSS can be accelerated in parallel by replacing SS with MSS in step in Algorithm 5.
8 Experimental Results
This section illustrates the benefits of SS and AMSS over SH and ASHA respectively by synthetic experiments. Then, we apply the proposed BOSS to a variety of machine learning problems ranging from Neural Architecture Search (NAS), Data Augmentation (DA), Object Detection (OD) to Reinforcement Learning (RL) which can all be cast as HPO problems. Their common point is that it takes a long time to evaluate a single hyperparameter configuration. Code for BOSS is publicly available at https://github.com/huawei-noah/vega which also contains many other HPO methods.
8.1 Synthetic Experiments
In this subsection, we first compare the proposed SS and SH in synthetic experiments from two aspects: the cumulative regret and the probability of finding the true optimal configuration. Suppose there are configurations, and the response from the -th configuration follows the normal distribution , where , is its standard deviation and . The -th evaluation of the -th configuration with budget means randomly sampling samples from and returning their mean, denoted by . The sequence of configurations and the budgets used to evaluate them are determined by the HPO algorithms.
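The synthetic evaluation described above, in which one evaluation with a given budget returns the mean of that many normal draws, can be sketched as follows. The function name `make_evaluator` and its arguments are hypothetical, and the means and standard deviations used in the experiments are not reproduced here.

```python
import random


def make_evaluator(means, stds, seed=0):
    """Build evaluate(k, budget) for synthetic Gaussian configurations."""
    rng = random.Random(seed)

    def evaluate(k, budget):
        # Mean of `budget` i.i.d. draws from N(means[k], stds[k] ** 2);
        # a larger budget gives a less noisy estimate of means[k].
        draws = [rng.gauss(means[k], stds[k]) for _ in range(budget)]
        return sum(draws) / budget

    return evaluate
```

The standard deviation of the returned value shrinks with the square root of the budget, which is why low-budget comparisons between configurations with large variances are unreliable.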
Figure 2 demonstrates that the average regrets, defined by , of these algorithms are large in the beginning, since they try all configurations in the early iterations. After that, they use the collected information to decide between exploration and exploitation. For configurations with small variances, the differences between them can be easily distinguished with a small budget. Hence, SH converges quickly, while SS, as a more conservative policy, converges more slowly. However, the performance of SH is greatly affected by the instability of the responses. For responses with large variances, SH fails to find the optimal configuration since it wrongly evaluates the configurations with small budgets and has no chance to correct this. In contrast, SS finds the optimal configuration more reliably. This conclusion still holds for different choices of . These figures also reveal that SS gives more robust results than SH with different .
Moreover, Table 1 demonstrates the proportion of selecting the optimal configuration by SH and SS with . Under different settings, SS performs better than SH. The deviation of responses has a great influence on SH. SH gets a high accuracy in the circumstance of small deviations, but the accuracy decreases quickly as the deviation gets larger. When , SS can always find the optimal configuration under different deviations even with a large deviation of . Note that we have compressed the mean value between and , and the deviation of can make it difficult to distinguish between configurations. When , it sometimes fails to get the optimal configuration since it needs more steps to do exploration and exploitation. We also consider the effect of maximum budgets. Figure 3 shows the results with and different maximum budgets. The performances of the algorithms are highly similar. The parameter equals .
Figure 4 illustrates the comparison of SH, ASHA, SS, MSS and AMSS. For the new parameter , we also make a comparison to find its effect.
We can see that MSS and AMSS outperform SH and ASHA in all circumstances. In the cases with small deviations, these algorithms converge to the same point. In the cases with large deviations, the advantages of MSS and AMSS over SH and ASHA are more obvious. As for the effect of the parameter , the results reveal that can control the degree of conservativeness. The algorithms with large behave conservatively and explore more configurations in the initial stages. This phenomenon is especially obvious in the context of small deviations. When , the performances of are the worst at the beginning. The reason is that a large considers the number of observations only and ignores the current performance of the configurations. However, the final performances are highly similar. Note that AMSS obtains the same result with the same trials, and asynchronous parallelization reduces the running time, which makes AMSS more efficient.
8.2 Data Augmentation
Data augmentation is an effective technique that generates more samples from data by rotating, inverting or other operations in order to improve the accuracy of image classifiers. However, most implementations are manually designed, with a few exceptions. Cubuk et al. (2018) proposed a simple procedure called AutoAugment to automatically search for improved data augmentation policies. Unfortunately, it is very time-consuming, e.g., it takes GPU hours in the search procedure for CIFAR100 (Krizhevsky et al., 2009). More recently, Ho et al. (2019) and Lim et al. (2019) designed more efficient algorithms for this particular task.
In their search space, a policy consists of sub-policies, with each sub-policy consisting of two image operations to be applied in sequence. Each operation is associated with two hyperparameters: (1) the probability of applying the operation, and (2) the magnitude of the operation. In total, there are operations in the search space, each with a default range of magnitudes. These settings are described in Cubuk et al. (2018). For this problem, we need to tune two hyperparameters of each sub-policy and choose the best five sub-policies to form a policy. This is a natural HPO problem, so BOSS can be adopted directly without any modification. As for the parameters and in Algorithm 5, we set the ratio following the same setting as SH and BOHB. The maximum budget equals one third of the number of epochs needed for convergence. In our experience, as shown in Figure 3, small changes of have no effect on the performance. The following applications all use this setting.
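The structure of this search space can be made concrete with a short sketch. It is an illustration under assumptions rather than the implementation used in the experiments: `OPERATIONS` lists only a small subset of the AutoAugment operations as placeholders, magnitudes are normalized to [0, 1) instead of using the per-operation default ranges, and the helper names `sample_sub_policy`, `build_policy` and `apply_sub_policy` are hypothetical.

```python
import random

# Placeholder subset of operation names; the full list and the default
# magnitude ranges are those described in Cubuk et al. (2018).
OPERATIONS = ["ShearX", "Rotate", "Invert", "Solarize"]


def sample_sub_policy(rng):
    """Sample one sub-policy: two operations applied in sequence, each with
    a probability of being applied and a magnitude."""
    return [
        {
            "op": rng.choice(OPERATIONS),
            "prob": rng.random(),       # probability of applying the operation
            "magnitude": rng.random(),  # magnitude, normalized to [0, 1) here
        }
        for _ in range(2)
    ]


def build_policy(scored_sub_policies, k=5):
    """Form a policy from the k sub-policies with the highest score.

    scored_sub_policies : list of (sub_policy, score) pairs, where the score
    is whatever the HPO method reports (e.g. validation accuracy).
    """
    ranked = sorted(scored_sub_policies, key=lambda pair: pair[1], reverse=True)
    return [sub_policy for sub_policy, _ in ranked[:k]]


def apply_sub_policy(image, sub_policy, transforms, rng):
    """Apply each operation of a sub-policy with its own probability.

    transforms : dict mapping an operation name to a function
    (image, magnitude) -> image; supplied by the caller.
    """
    for step in sub_policy:
        if rng.random() < step["prob"]:
            image = transforms[step["op"]](image, step["magnitude"])
    return image
```

In this encoding, each sub-policy is one configuration for the HPO method to evaluate, and the final policy is assembled from the five best-scoring ones.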
We search for data augmentation policies on the image classification tasks CIFAR-10 and CIFAR-100. We follow the setting in AutoAugment (Cubuk et al., 2018) and search for the best policy on a smaller data set, which consists of randomly chosen examples, to save time when training child models during the augmentation search. For the child model architecture, we use WideResNet-28-10 (28 layers, widening factor of 10) (Zagoruyko and Komodakis, 2016). The augmentation policy is combined with standard data pre-processing. Each image is processed in the following order: normalization, horizontal flips with probability, zero-padding and random crops, the augmentation policy, and finally Cutout with pixels (DeVries and Taylor, 2017). We run BOSS, BOHB and HB with budgets of epochs using parallel workers for iterations. We run BO, SH and Random Search with budgets of epochs using parallel workers for iterations. Every two workers share one NVIDIA V100 GPU for parallel training. For each sub-task, we use an SGD optimizer with a weight decay of , momentum of , and learning rate of .
We use the found policies to train final models on CIFAR-10 and CIFAR-100 for epochs. All baseline results are replicated in our experiments and match the previously reported results (Falkner et al., 2018; Cubuk et al., 2018). For AutoAugment, instead of rerunning the policy search, which takes a large number of GPU days, we evaluate the searched policy reported in Cubuk et al. (2018).
Figure 5 shows the performance during the search procedure. The budget is the total number of epochs used in the search. The results show that BOHB performs better than BOSS in the beginning, but BOSS converges to better configurations in the end. This is due to the conservativeness and the asymptotic optimality of BOSS. BO, HB, SH and Random Search perform worse than BOHB and BOSS, since they either sample configurations uniformly without using the information from previous trials or evaluate configurations slowly. Note that the error rates of these methods are high, especially on CIFAR100. This is because the budgets used in the search procedure are much smaller than the budget used to train with the searched policy.
The results listed in Table 2 show that the naive HPO methods, including Random Search, SH, HB and BO, achieve results similar to those of the original RL-based DA method, while the proposed efficient search scheme BOSS has the best accuracy, with BOHB close behind.
|Method||CIFAR-10 test accuracy (std)||CIFAR-100 test accuracy (std)|
|AA||97.09 (0.14)||82.42 (0.17)|
|Random||97.01 (0.11)||82.41 (0.23)|
|SH||96.93 (0.14)||81.49 (0.23)|
|HB||97.00 (0.10)||82.07 (0.11)|
|BO||97.11 (0.16)||82.19 (0.25)|
|BOHB||97.25 (0.13)||82.52 (0.15)|
|BOSS||97.32 (0.12)||83.03 (0.22)|
8.3 Neural Architecture Search
Novel neural architectures are a crucial driver of progress in deep learning. Designing architectures manually is a time-consuming and error-prone process, so there is growing interest in automated neural architecture search (NAS). Elsken et al. (2019) provide an overview of existing work in this field. We use the search space of DARTS (Liu et al., 2019) as an example to illustrate HPO methods on NAS. In particular, the goal is to search for a cell that serves as a basic unit. In each cell, there are nodes forming a fixed directed acyclic graph (DAG). Each edge of the DAG represents an operation, such as skip-connection, convolution or max pooling, weighted by the architecture parameter. For the search procedure, the training loss and validation loss are denoted by and , respectively. The architecture parameters are then learned by solving the following bi-level optimization problem:
Here, are the hyperparameters in the HPO framework. To evaluate , we need to optimize its network parameters , which is usually time-consuming. BOSS is designed precisely to make this evaluation fast and efficient.
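For concreteness, the bi-level problem referred to above can be written in the notation of DARTS (Liu et al., 2019). The symbols are taken from that paper and are an assumption about the notation intended here: $\alpha$ denotes the architecture parameters, $w$ the network parameters, and $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ the training and validation losses.

```latex
\min_{\alpha} \; \mathcal{L}_{val}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) = \operatorname*{arg\,min}_{w} \; \mathcal{L}_{train}(w, \alpha)
```

Evaluating a single $\alpha$ in the outer problem requires solving the inner problem for $w^{*}(\alpha)$, which is the expensive step that an HPO evaluation procedure tries to shorten.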
We search for neural architectures on the image classification tasks CIFAR10 and CIFAR100, following the settings of DARTS (Liu et al., 2019). The architecture parameter determines two kinds of basic units: the normal cell and the reduction cell. We run BOHB, BOSS and HB with maximum budgets of epochs using parallel workers for iterations. We run BO, SH and Random Search with the same total budget as BOSS using parallel workers for iterations. A sampled architecture parameter is kept fixed while the model parameters are trained. An SGD optimizer is used with a learning rate of , momentum of , weight decay of , and cosine learning rate decay with an annealing cycle.
After the search, we evaluate the optimal configuration with epochs. Figure 6 shows results highly similar to those for data augmentation.
The results are listed in Table 3. Note that the search space and search procedure of DARTS are designed on CIFAR10. The naive HPO methods including Random Search, SH, HB and BO are worse than DARTS while BOSS and BOHB have comparable accuracy. On CIFAR100, the advantage of these HPO methods increases, and BOSS is significantly better than other search procedures.
|Method||CIFAR-10 test accuracy (std)||CIFAR-100 test accuracy (std)|
|DARTS||97.26 (0.13)||81.71 (0.31)|
|Random||96.86 (0.19)||81.65 (0.34)|
|SH||96.55 (0.20)||81.19 (0.18)|
|HB||96.99 (0.18)||81.63 (0.37)|
|BO||96.87 (0.15)||82.37 (0.24)|
|BOHB||97.23 (0.17)||82.67 (0.27)|
|BOSS||97.29 (0.15)||83.10 (0.23)|
Moreover, when transferred to ImageNet in the mobile setting, the architectures searched on CIFAR10 and CIFAR100 by BOSS achieve high accuracies of and , respectively, while DARTS reported an accuracy of . This reveals a weakness of DARTS: it is prone to finding architectures with many skip-connect operations, which is undesirable. Zela et al. (2020) and Liang et al. (2019) reduced the number of skip-connect operations by early stopping. In BOSS, however, this issue disappears naturally, as depicted in Figures 7 and 8, because DARTS updates architecture parameters and network parameters in turn, whereas BOSS does not change the architecture while the network is being trained.
8.4 Object Detection
In object detection (OD), anchor boxes are used to cover the target objects. Most state-of-the-art OD systems follow an anchor-based paradigm: anchor boxes are densely proposed over the images, and the network is trained to predict each box’s position offset as well as its classification confidence. Existing systems use ad-hoc heuristic adjustments to define the anchor configurations based on predefined anchor box shapes and sizes. However, this might be sub-optimal or even wrong when a new dataset or a new model is adopted. Hence, the parameters of the anchor boxes, including their number, scales and ratios, need to be optimized automatically.
In the literature, the anchor shapes are typically determined by manual selection (Dai et al., 2016; Liu et al., 2016; Ren et al., 2015) or naive clustering methods (Redmon and Farhadi, 2017). Unlike these traditional methods, several works focus on utilizing anchors more effectively and efficiently (Yang et al., 2018; Zhong et al., 2018).
Table 4 compares BOSS with several existing anchor initialization methods as follows.
(a) Set anchor scales and ratios by manual search, as in most detection systems.
(b) Use the k-means method proposed in YOLOv2 (Redmon and Farhadi, 2017) to obtain clusters and treat them as initial anchors.
(c) Use random search to determine anchors.
(d) Use BOHB to determine anchors.
(e) Use BOSS to determine anchors.
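The clustering baseline in the list above can be sketched as follows. This is a minimal illustration of k-means over box sizes with the distance 1 − IoU used in YOLOv2 (Redmon and Farhadi, 2017), not the code used for Table 4; the function names `iou_wh` and `kmeans_anchors`, the iteration cap and the seed are assumptions.

```python
import random


def iou_wh(a, b):
    """IoU of two boxes given as (width, height), aligned at the same corner."""
    intersection = min(a[0], b[0]) * min(a[1], b[1])
    return intersection / (a[0] * a[1] + b[0] * b[1] - intersection)


def kmeans_anchors(boxes, k, n_iter=50, seed=0):
    """Cluster ground-truth box sizes into k anchors using distance 1 - IoU.

    boxes : list of (width, height) tuples
    Returns the k centroids sorted by area.
    """
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(n_iter):
        # Assignment step: the nearest centroid is the one with the largest IoU.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = max(range(k), key=lambda c: iou_wh(box, centroids[c]))
            clusters[nearest].append(box)
        # Update step: each centroid becomes the mean size of its members;
        # an empty cluster keeps its previous centroid.
        updated = []
        for centroid, members in zip(centroids, clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroid)
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])
```

The resulting centroids serve only as initial anchors; unlike the HPO-based schemes, this procedure never looks at detection performance.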
The standard criteria of object detection, mean Average Precision (mAP) and Average Recall (AR), are used to measure the performance. The subscript of denotes the IoU threshold, and the subscript of denotes the number of given objects per image. S, M and L refer to small, medium and large objects, respectively. For a fair comparison of the search schemes, we use the same training procedure for all of them: Faster-RCNN (Ren et al., 2015) combined with FPN (Lin et al., 2017) as the detector and ResNet-50 (He et al., 2016) as the backbone, trained on MSCOCO (Lin et al., 2014).
The results reveal that all three HPO methods outperform the two classical ones. Moreover, BOSS performs uniformly better than the other two HPO methods, which shows its usefulness and effectiveness. In addition, since the training procedures of MetaAnchor (Yang et al., 2018) and Zhong’s method (Zhong et al., 2018) differ from ours, we only compare the improvement obtained from hyperparameter search. BOSS brings about improvement while MetaAnchor and Zhong’s method increase by and , respectively. The advantage of BOSS is that it considers both the sampling strategy and efficient evaluation, while the other two methods consider only the former.
8.5 Reinforcement Learning
In the last few years, several different approaches have been proposed for reinforcement learning with neural network function approximators, e.g., deep Q-learning (Mnih et al., 2015), “vanilla” policy gradient methods (Mnih et al., 2016), trust region policy gradient methods (Schulman et al., 2015), and proximal policy optimization (PPO) (Schulman et al., 2017). These methods have several hyperparameters that need to be determined. As an example, we tune the hyperparameters for the PPO Cartpole task (Falkner et al., 2018), including the numbers of units in layers and , batch size, learning rate, discount, likelihood ratio clipping, and entropy regularization. Unlike in the previous three applications, the budget here is the number of trials used by an agent. We run each configuration for nine trials and report the average number of episodes until PPO has converged, meaning that the learning agent achieves the highest possible reward for five consecutive episodes. For each configuration, we stop training after the agent has either converged or run for a maximum of episodes. For each hyperparameter optimization method, we perform five independent runs.
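The evaluation metric described above can be sketched in a few lines. This is an illustrative reading of the stopping rule, not the benchmark code: the function names are hypothetical, and the per-episode reward lists are assumed to be already truncated at the episode cap.

```python
def episodes_until_converged(rewards, max_reward, patience=5):
    """Number of episodes run before training stops in one trial.

    rewards    : per-episode rewards, truncated at the episode cap
    max_reward : highest possible reward of the task
    patience   : consecutive maximal-reward episodes that count as convergence
    """
    streak = 0
    for episode, reward in enumerate(rewards, start=1):
        streak = streak + 1 if reward >= max_reward else 0
        if streak == patience:
            return episode
    # Never converged: the agent ran for all available episodes.
    return len(rewards)


def average_episodes(trials, max_reward, patience=5):
    """Average stopping episode over the trials of one configuration."""
    counts = [episodes_until_converged(r, max_reward, patience) for r in trials]
    return sum(counts) / len(counts)
```

A lower average means the configuration lets the agent reach the maximal reward sooner, which is the quantity the HPO methods minimize in this task.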
|Min terminated epoch||102.55||80.33||72.77|
Table 5 shows that the agent using the policy obtained by BOSS is the fastest learner. An RL agent needs to learn a stable policy, which matches the conservativeness of BOSS.
In this paper we have proposed BOSS, which combines BO and SS for HPO problems. The major contribution is the development of SS as an efficient and effective evaluation procedure, together with its asynchronously parallel version MSS. The proposed methods enable fast evaluation in hyperparameter search by measuring the potential of hyperparameter configurations, and promise improved robustness compared to algorithms based on successive halving. The main theoretical result is the asymptotic optimality of the cumulative regret of the proposed SS algorithm. This property suits the BO iteration, since it allows high-quality data to be collected for estimating the surrogate model more efficiently. SS can also obtain evaluations with smaller budgets, which makes it a kind of early-stopping method. Experiments show that SS can find the best configuration quickly and correctly and that BOSS works for many popular applications. Future work to improve BOSS may further learn the search space (Perrone et al., 2019) or hyperparameter importance (Hoos and Leyton-Brown, 2014).
On the other hand, this raises the question of when early-stopping techniques are needed. Obviously, if there is no relationship between the performances under different budgets, these techniques will not work. Thus, judging whether a task is suitable for early stopping is an important prerequisite, which this work does not address. We suggest considering different criteria for different goals. For best arm identification, SH-based algorithms have theoretical advantages. For the BO iteration, however, this goal is no longer appropriate, and it is replaced by the cumulative regret in this work.
Proof of Theorem 5
First, we prove a lemma which is an extension of Theorem in Chan (2019),
For this purpose, the next main process is to develop a new Chernoff bound for the exponential family like Lemma in Chan (2019). Now we claim that the following two inequalities hold for the exponential family.
where the function is the large deviation rate function.
The generic Chernoff bound for a random variable with is
When is the mean of i.i.d. random variables , optimizing over , we get
Let , we can simplify by direct calculation that
Hence, when , we have
which means the first inequality holds.
Similarly, when , we have
The second inequality is obtained together with
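For reference, the generic Chernoff bound and its form for the mean of i.i.d. random variables, as invoked in the proof above, can be stated in standard notation. The symbols below ($X$, $t$, $a$, $n$, $\bar{X}_n$, $\mu$, $I$) are assumptions chosen for this statement and need not coincide with the notation of the theorem.

```latex
% Generic Chernoff bound: for any random variable X and any t > 0,
\mathbb{P}(X \ge a) \le e^{-ta}\, \mathbb{E}\bigl[e^{tX}\bigr].

% Mean of n i.i.d. variables X_1, \dots, X_n with \mathbb{E}[X_1] = \mu:
% optimizing over t > 0 gives, for a > \mu,
\mathbb{P}(\bar{X}_n \ge a)
  \le \exp\Bigl(-n \sup_{t > 0} \bigl\{ ta - \log \mathbb{E}\bigl[e^{tX_1}\bigr] \bigr\}\Bigr)
  = \exp\bigl(-n\, I(a)\bigr),

% where I(a) = \sup_{t} \{ ta - \log \mathbb{E}[e^{tX_1}] \} is the large deviation rate function.
```

The lower-tail bound for $a < \mu$ follows in the same way by applying the argument to $-X_1, \dots, -X_n$.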
- Oracle inequalities for computationally adaptive model selection. arXiv preprint arXiv:1208.0129. Cited by: §5.
- SAMPLE mean based index policies with regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078. Cited by: §2.4.
- Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2), pp. 235–256. Cited by: §2.4.
- Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (1), pp. 281–305. Cited by: §1.
- Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554. Cited by: §1, §1, §6, §6.
- Hyperparameter optimization of deep neural networks: combining hyperband with Bayesian model selection. In Conférence sur l’Apprentissage Automatique, Cited by: §2.3.
- The impact of initial designs on the performance of MATSuMoTo on the noiseless BBOB-2015 testbed: a preliminary study. In Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, pp. 1159–1166. Cited by: §1.
- A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757. Cited by: §4.
- The multi-armed bandit problem: an efficient non-parametric solution. Annals of Statistics, pp. To appear. Cited by: Proof of Theorem 5, Proof of Theorem 5, §2.4, §3, §3, §4, §5, §5.
- Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §1, §8.2, §8.2, §8.2, §8.2.
- R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §8.4.
- Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §8.2.
- Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21. Cited by: §1, §8.3.
- BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: §1, §1, §2.3, §8.2, §8.5.
- Hyperparameter optimization. In AutoML: Methods, Systems, Challenges, F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.), pp. 3–37. Note: To appear. Cited by: §1.
- Batch bayesian optimization via local penalization. In Artificial intelligence and statistics, pp. 648–657. Cited by: §6.
- The cma evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772. Cited by: §1.
- Deep residual learning for image recognition. In , pp. 770–778. Cited by: §8.4.
- Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13 (Jun), pp. 1809–1837. Cited by: §1.
- Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926. Cited by: §1.
- Population based augmentation: efficient learning of augmentation policy schedules. arXiv preprint arXiv:1905.05393. Cited by: §8.2.
- An efficient approach for assessing hyperparameter importance. In International conference on machine learning, pp. 754–762. Cited by: §9.
- Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Cited by: §1, §1.
- Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §1.
- Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pp. 240–248. Cited by: §1, §1, §2.1, §5.
- Efficient global optimization of expensive black-box functions. Journal of Global optimization 13 (4), pp. 455–492. Cited by: §1, §1.
- Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246. Cited by: §1, §2.1.
- Enabling hyperparameter optimization in sequential autoencoders for spiking neural data. In Advances in Neural Information Processing Systems, pp. 15911–15921. Cited by: §1.
- Meta-surrogate benchmarking for hyperparameter optimization. arXiv preprint arXiv:1905.12982. Cited by: §1.
- Tuned data mining: a benchmark study on different tuners. In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp. 1995–2002. Cited by: §1.
- Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §8.2.
- Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
- Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22. Cited by: §2.4, §3, §5.
- Adaptive treatment allocation and the multi-armed bandit problem. Annals of Statistics 15 (3), pp. 1091–1114. Cited by: §2.4.
- Hyperparameter learning via distributional transfer. In Advances in Neural Information Processing Systems, pp. 6801–6812. Cited by: §1.
- Rethinking the hyperparameters for fine-tuning. In International Conference on Learning Representations. Cited by: §1.
- Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934. Cited by: §2.1, §7.
- Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560. Cited by: §1, §1, §2.2.
- Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §8.3.
- Fast autoaugment. arXiv preprint arXiv:1905.00397. Cited by: §8.2.
- Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §8.4.
- Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §8.4.
- DARTS: differentiable architecture search. In International Conference on Learning Representations. Cited by: §1, §8.3, §8.3.
- Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §8.4.
- Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §8.5.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §8.5.
- Fast efficient hyperparameter tuning for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 4618–4628. Cited by: §1.
- The multi-armed bandit problem with covariates. Annals of Statistics 41 (2), pp. 693–721. Cited by: §3, §3, §5, §5.
- Learning search spaces for bayesian optimization: another view of hyperparameter transfer learning. In Advances in Neural Information Processing Systems, pp. 12751–12761. Cited by: §9.
- Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §1.
- YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: item (b), §8.4.
- Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §8.4, §8.4.
- Fast information-theoretic bayesian optimisation. In International Conference on Machine Learning, pp. 4384–4392. Cited by: §1.
- Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §8.5.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §8.5.
- Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959. Cited by: §1.
- Scalable Bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §1.
- Tupaq: an efficient planner for large-scale predictive analytic queries. arXiv preprint arXiv:1502.00068. Cited by: §5.
- Bayesian optimization with robust bayesian neural networks. In Advances in neural information processing systems, pp. 4134–4142. Cited by: §1, §6.
- Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning, pp. 1015–1022. Cited by: §1.
- Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 847–855. Cited by: §1, §1.
- Combination of hyperband and Bayesian optimization for hyperparameter optimization in deep learning. arXiv preprint arXiv:1801.01596. Cited by: §2.3.
- Max-value entropy search for efficient bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3627–3635. Cited by: §1, §6.
- Experiments: planning, analysis, and optimization. Vol. 552, John Wiley & Sons. Cited by: §1.
- Nonparametric bandit methods. Annals of Operations Research 28 (1), pp. 297–312. Cited by: §2.4.
- MetaAnchor: learning to detect objects with customized anchors. In Advances in Neural Information Processing Systems, pp. 320–330. Cited by: §8.4, §8.4.
- Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §8.2.
- Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §1.
- Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations. Cited by: §8.3.
- Deep neural network hyperparameter optimization with orthogonal array tuning. In International Conference on Neural Information Processing, pp. 287–295. Cited by: §1, §1.
- Anchor box optimization for object detection. arXiv preprint arXiv:1812.00469. Cited by: §8.4, §8.4.
- Survey on automated machine learning. arXiv preprint arXiv:1904.12054 9. Cited by: §1.