Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While current methods offer efficiencies by adaptively choosing new configurations to train, an alternative strategy is to adaptively allocate resources across the selected configurations. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinitely many armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce Hyperband for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with state-of-the-art methods on a suite of hyperparameter optimization problems. We observe that Hyperband provides five times to thirty times speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems.
In recent years, machine learning models have exploded in complexity and expressibility at the cost of staggering computational costs and a growing number of tuning parameters that are difficult to set by standard optimization techniques. These ‘hyperparameters’ are inputs to machine learning algorithms that govern how the algorithm’s performance generalizes to new, unseen data; examples of hyperparameters include those that impact model architecture, amount of regularization, and learning rates. The quality of a predictive model critically depends on its hyperparameter configuration, but it is poorly understood how these hyperparameters interact with each other to affect the quality of the resulting model. Consequently, practitioners often default to brute-force methods like random search and grid search.
In an effort to develop more efficient search methods, the problem of hyperparameter optimization has recently been dominated by Bayesian optimization methods (Snoek et al., 2012; Hutter et al., 2011; Bergstra et al., 2011) that focus on optimizing hyperparameter configuration selection. These methods aim to identify good configurations more quickly than standard baselines like random search by selecting configurations in an adaptive manner; see Figure 1(a). Existing empirical evidence suggests that these methods outperform random search (Thornton et al., 2013; Eggensperger et al., 2013; Snoek et al., 2015). However, these methods tackle a fundamentally challenging problem of simultaneously fitting and optimizing a high-dimensional, non-convex function with unknown smoothness, and possibly noisy evaluations. To overcome these difficulties, some Bayesian optimization methods resort to heuristics to model the objective function or speed up resource intensive subroutines. In contrast to naive random search, methods that rely on these heuristics are not endowed with any theoretical consistency guarantees.^{1} Moreover, these adaptive configuration selection methods are intrinsically sequential and thus difficult to parallelize.
^{1} Random search will asymptotically converge to the optimal configuration regardless of the smoothness or structure of the function being optimized by a simple covering argument.
Instead, we explore a different direction for hyperparameter optimization that focuses on speeding up configuration evaluation; see Figure 1(b). These approaches are adaptive in computation, allocating more resources to promising hyperparameter configurations while quickly eliminating poor ones. Resources can take various forms, including size of training set, number of features, or number of iterations for iterative algorithms. By adaptively allocating these resources, these approaches aim to examine orders of magnitude more hyperparameter configurations than approaches that uniformly train all configurations to completion, thereby quickly identifying good hyperparameters. Configuration evaluation approaches could ostensibly be coupled with either random search or Bayesian optimization approaches. Random search offers a simple, parallelizable, and theoretically principled launching point, while Bayesian optimization may offer improved empirical accuracy.
To better understand the tradeoff between random and adaptive configuration selection, we revisited a recent empirical study by Feurer et al. (2015a) to extensively compare two state-of-the-art Bayesian optimization methods — SMAC (Hutter et al., 2011) and TPE (Bergstra et al., 2011) — to random search across 117 datasets (see Section 3.2.1 for details of the experimental setup). Figure 2(a) presents a rank plot, which is standard in this field (Feurer et al., 2015a; Dewancker et al., 2016; Feurer et al., 2015b), in which the accuracy of each method is ranked for each dataset, and the average rank across 117 datasets is reported (lower is better). Random search is soundly beaten by SMAC and TPE in these rank plots. However, the test error bar chart in Figure 2(b) of 20 randomly sampled datasets from the 117 tells a different story. While these plots confirm that the Bayesian methods consistently outperform random sampling, they further show that the performance gap is quite small, a subtlety which is lost in rank-based evaluations. In fact, under the same experimental setup, running random search on two machines (denoted as ‘random_2x’ in the plots) yields superior results to these two Bayesian optimization methods. In light of these results, along with the amenability of random search to theoretical analysis, we focus on speeding up random search using our configuration evaluation approach.
We develop a novel configuration evaluation approach by formulating hyperparameter optimization as a pure-exploration adaptive resource allocation problem addressing how to allocate resources among randomly sampled hyperparameter configurations. Our procedure, Hyperband, relies on a principled early-stopping strategy to allocate resources, allowing it to evaluate orders of magnitude more configurations than black-box procedures like Bayesian optimization. Hyperband is a general-purpose technique that makes minimal assumptions, unlike prior configuration evaluation approaches (Domhan et al., 2015; Swersky et al., 2014; György and Kocsis, 2011; Agarwal et al., 2011; Sparks et al., 2015; Jamieson and Talwalkar, 2015). Our theoretical analysis demonstrates the ability of Hyperband to adapt to unknown convergence rates and to the behavior of validation losses as a function of the hyperparameters. In addition, Hyperband is 5× to 30× faster than state-of-the-art Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems. A theoretical contribution of this work is the introduction of the pure-exploration, infinitely-many armed bandit problem in the non-stochastic setting, for which Hyperband is one solution. When Hyperband is applied to the special-case stochastic setting, we show that the algorithm comes within log factors of known lower bounds in both the infinite (Carpentier and Valko, 2015) and finite $K$-armed bandit settings (Kaufmann et al., 2015).
The rest of the paper is organized as follows. Section 2 describes Hyperband and provides intuition for the algorithm through a detailed example. In Section 3 we present a wide range of empirical results comparing Hyperband with state-of-the-art competitors. Section 4 frames the hyperparameter optimization problem as an infinitely many armed bandit problem and summarizes the theoretical results for Hyperband. Section 5 summarizes related work in two areas: (1) hyperparameter optimization, and (2) pure-exploration bandit problems. Finally, Section 6 discusses possible extensions of Hyperband.
In this section, we present the Hyperband algorithm. We provide intuition for the algorithm, highlight the main ideas via a simple example that uses iterations as the adaptive resource, and present a few guidelines on how to deploy Hyperband in practice.
Hyperband extends the SuccessiveHalving algorithm proposed for hyperparameter optimization in Jamieson and Talwalkar (2015) and calls it as a subroutine. The idea behind the original SuccessiveHalving algorithm follows directly from its name: uniformly allocate a budget to a set of hyperparameter configurations, evaluate the performance of all configurations, throw out the worst half, and repeat until one configuration remains. The algorithm allocates exponentially more resources to more promising configurations. Unfortunately, SuccessiveHalving requires the number of configurations $n$ as an input to the algorithm. Given some finite time budget $B$ (e.g. an hour of training time to choose a hyperparameter configuration), $B/n$ resources are allocated on average across the configurations. However, for a fixed $B$, it is not clear a priori whether we should (a) consider many configurations (large $n$) with a small average training time; or (b) consider a small number of configurations (small $n$) with longer average training times.
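The halving loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the budget is split evenly across rounds, that a loss function `loss_fn(config, resource)` is available, and that the worse half is dropped after every round. The names `successive_halving`, `loss_fn`, and `history` are hypothetical.

```python
def successive_halving(configs, budget, loss_fn):
    """Sketch of halving: returns (best_config, history).

    history holds (number of surviving configs, cumulative resource per config)
    for each round. Assumes an even budget split across rounds.
    """
    survivors = list(configs)
    # ceil(log2(n)) rounds are enough to shrink n configurations down to one.
    rounds = max(1, (len(survivors) - 1).bit_length())
    cumulative = 0.0
    history = []
    for _ in range(rounds):
        # Each round spends budget / rounds, shared by the current survivors.
        cumulative += budget / (rounds * len(survivors))
        history.append((len(survivors), cumulative))
        ranked = sorted(survivors, key=lambda c: loss_fn(c, cumulative))
        # Keep the better half (rounded up), never fewer than one.
        survivors = ranked[: max(1, (len(survivors) + 1) // 2)]
    return survivors[0], history


# Toy loss (an assumption): quality depends on distance to 0.3, plus a term
# that shrinks as more resource is spent.
toy_loss = lambda c, r: (c - 0.3) ** 2 + 1.0 / (1.0 + r)
best, history = successive_halving(
    [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.35], budget=24, loss_fn=toy_loss
)
```

With 8 configurations and a budget of 24, the surviving configurations receive 1, 3, and then 7 units of cumulative resource, which illustrates how survivors end up with exponentially more resource than configurations eliminated early.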
We use a simple example to better understand this tradeoff. Figure 3 shows the validation loss as a function of total resources allocated for two configurations with terminal validation losses $\nu_1$ and $\nu_2$. The shaded areas bound the maximum deviation from the terminal validation loss and will be referred to as “envelope” functions. It is possible to differentiate between the two configurations when the envelopes diverge. Simple arithmetic shows that this happens when the width of the envelopes is less than $\nu_2 - \nu_1$, i.e. when the intermediate losses are guaranteed to be less than $\frac{\nu_2 - \nu_1}{2}$ away from the terminal losses. There are two takeaways from this observation: more resources are needed to differentiate between the two configurations when either (1) the envelope functions are wider or (2) the terminal losses are closer together.
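The "simple arithmetic" can be written out explicitly. The notation below is assumed for illustration: $\ell_i(r)$ denotes the intermediate validation loss of configuration $i$ after $r$ resources, $\gamma(r)$ denotes the half-width of the envelope, and the two terminal losses are ordered so that the first is smaller.

```latex
% Assumed notation: \ell_i(r) = intermediate loss, \gamma(r) = envelope half-width.
\begin{align*}
  &|\ell_i(r) - \nu_i| \le \gamma(r), \quad i \in \{1, 2\}, \qquad \nu_1 < \nu_2, \\
  &\gamma(r) < \frac{\nu_2 - \nu_1}{2}
  \;\Longrightarrow\;
  \ell_1(r) \le \nu_1 + \gamma(r) < \nu_2 - \gamma(r) \le \ell_2(r).
\end{align*}
```

Once the envelope half-width drops below half the gap between the terminal losses, the intermediate losses are guaranteed to be ordered the same way as the terminal losses, so the worse configuration can be discarded safely.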
However, in practice, the optimal allocation strategy is unknown because we do not have knowledge of the envelope functions nor the distribution of terminal losses. Hence, if more resources are required before configurations can differentiate themselves in terms of quality (e.g., if an iterative training method converges very slowly for a given dataset or if randomly selected hyperparameter configurations perform similarly well) then it would be reasonable to work with a small number of configurations. In contrast, if the quality of a configuration is typically revealed using minimal resources (e.g., if iterative training methods converge very quickly for a given dataset or if randomly selected hyperparameter configurations are of low-quality with high probability) then $n$ is the bottleneck and we should choose $n$ to be large.
Certainly, if meta-data or previous experience suggests that a certain tradeoff is likely to work well in practice, one should exploit that information and allocate the majority of resources to that tradeoff. However, without this supplementary information, forcing the practitioner to make this tradeoff severely hinders the applicability of existing configuration evaluation methods.
Hyperband, shown in Algorithm 1, addresses this “$n$ versus $B/n$” problem by considering several possible values of $n$ for a fixed $B$, in essence performing a grid search over feasible values of $n$. Associated with each value of $n$ is a minimum resource $r$ that is allocated to all configurations before some are discarded; a larger value of $n$ corresponds to a smaller $r$ and hence more aggressive early stopping. There are two components to Hyperband: (1) the inner loop, which invokes SuccessiveHalving for fixed values of $n$ and $r$ (lines 3-9), and (2) the outer loop, which iterates over different values of $n$ and $r$ (lines 1-2). We will refer to each such run of SuccessiveHalving within Hyperband as a “bracket.” Each bracket is designed to use about $B$ total resources and corresponds to a different tradeoff between $n$ and $B/n$. A single execution of Hyperband takes a finite number of iterations, and we recommend repeating it indefinitely.
Hyperband requires two inputs: (1) $R$, the maximum amount of resource that can be allocated to a single configuration, and (2) $\eta$, an input that controls the proportion of configurations discarded in each round of SuccessiveHalving. The two inputs dictate how many different brackets are considered; specifically, $s_{\max} + 1$ different values for $n$ are considered with $s_{\max} = \lfloor \log_\eta(R) \rfloor$. Hyperband begins with the most aggressive bracket $s = s_{\max}$, which sets $n$ to maximize exploration, subject to the constraint that at least one configuration is allocated $R$ resources. Each subsequent bracket reduces $n$ by a factor of approximately $\eta$ until the final bracket, $s = 0$, in which every configuration is allocated $R$ resources (this bracket simply performs classical random search). Hence, Hyperband performs a geometric search in the average budget per configuration and removes the need to select $n$ for a fixed budget at the cost of approximately $s_{\max} + 1$ times more work than running SuccessiveHalving for a single value of $n$. By doing so, Hyperband is able to exploit situations in which adaptive allocation works well, while protecting itself in situations where more conservative allocations are required.
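As an illustration of how the two inputs determine the brackets, the sketch below computes the starting number of configurations and the starting resource for every bracket, given the maximum resource per configuration `R` and the elimination factor `eta`. The rounding rule used here, $n = \lceil (s_{\max}+1)\,\eta^s/(s+1) \rceil$ with a starting resource of $R\,\eta^{-s}$, is an assumption based on the published form of Algorithm 1, which is not reproduced in this text; exact counts depend on that rounding convention. The function name `bracket_schedule` is hypothetical.

```python
def bracket_schedule(R, eta):
    """Return [(s, n, r)] from the most to the least aggressive bracket.

    Assumes integer R and eta. n is the starting number of configurations
    and r is the starting resource per configuration in bracket s.
    """
    # s_max = floor(log_eta(R)), computed with integers to avoid float error.
    s_max = 0
    while eta ** (s_max + 1) <= R:
        s_max += 1

    schedule = []
    for s in range(s_max, -1, -1):
        numer = (s_max + 1) * eta ** s
        n = (numer + s) // (s + 1)  # integer ceiling of numer / (s + 1)
        r = R / eta ** s            # smaller r means earlier stopping
        schedule.append((s, n, r))
    return schedule


for s, n, r in bracket_schedule(81, 3):
    print(f"bracket s={s}: start with n={n} configurations at r={r} each")
```

The last bracket always starts every configuration at the full resource, which is the random search baseline described above.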
Hyperband requires the following methods to be defined for any given learning problem:
get_hyperparameter_configuration($n$) - a function that returns a set of $n$ i.i.d. samples from some distribution defined over the hyperparameter configuration space. In this work, we assume uniform sampling of hyperparameters from a predefined space (i.e. a hypercube with min and max bounds for each hyperparameter), which immediately yields consistency guarantees. However, the more aligned the distribution is with quality hyperparameters (i.e. a useful prior), the better Hyperband will perform (see Section 6 for further discussion).
run_then_return_val_loss($t$, $r$) - a function that takes a hyperparameter configuration ($t$) and resource allocation ($r$) as input and returns the validation loss after training the configuration for the allocated resources.
top_k(configs, losses, $k$) - a function that takes a set of configurations as well as their associated losses and returns the top $k$ performing configurations.
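To show how these three methods fit together, here is a minimal sketch of the outer and inner loops on a toy one-dimensional problem. Several things are assumptions made for illustration: the toy loss, the extra `rng` argument that makes sampling reproducible, the rounding rule for the number of configurations per bracket, and the choice to return the configuration with the smallest loss seen. It is not the authors' reference implementation.

```python
import math
import random


def get_hyperparameter_configuration(n, rng):
    # Uniform sampling from the hypothetical search space [0, 1].
    return [rng.uniform(0.0, 1.0) for _ in range(n)]


def run_then_return_val_loss(t, r):
    # Toy stand-in for training: the loss improves as resource r grows.
    return (t - 0.3) ** 2 + 1.0 / (1.0 + r)


def top_k(configs, losses, k):
    ranked = sorted(zip(losses, configs))
    return [config for _, config in ranked[:k]]


def hyperband(R, eta, rng):
    """Run every bracket once; R and eta are assumed to be integers."""
    s_max = 0
    while eta ** (s_max + 1) <= R:  # s_max = floor(log_eta(R))
        s_max += 1
    best_loss, best_config = math.inf, None
    for s in range(s_max, -1, -1):  # outer loop over brackets
        n = ((s_max + 1) * eta ** s + s) // (s + 1)  # ceiling division
        r = R / eta ** s
        T = get_hyperparameter_configuration(n, rng)
        for i in range(s + 1):  # inner loop: one SuccessiveHalving run
            r_i = r * eta ** i
            losses = [run_then_return_val_loss(t, r_i) for t in T]
            for loss, t in zip(losses, T):
                if loss < best_loss:
                    best_loss, best_config = loss, t
            # Keep the best 1/eta fraction of the current configurations.
            T = top_k(T, losses, len(T) // eta)
    return best_config, best_loss


config, loss = hyperband(R=81, eta=3, rng=random.Random(0))
```

In a real deployment, `run_then_return_val_loss` would resume or restart training for the given configuration, and the whole procedure would be repeated as long as time allows.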
We next present a concrete example to provide further intuition about Hyperband. We work with the MNIST dataset and optimize hyperparameters for the LeNet convolutional neural network trained using mini-batch SGD.^{2} Our search space includes learning rate, batch size, and number of kernels for the two layers of the network as hyperparameters (details are shown in Table 2 in Appendix A).
^{2} Code and a description of the algorithm used are available at http://deeplearning.net/tutorial/lenet.html.
We further define the number of iterations as the resource to allocate, with one unit of resource corresponding to one epoch or a full pass over the dataset. We set $R$ to 81 and use the default value of $\eta = 3$, resulting in $s_{\max} = 4$ and thus 5 brackets of SuccessiveHalving with different tradeoffs between $n$ and $B/n$. The resources allocated within each bracket are displayed in Table 1.
Figure 1 shows an empirical comparison of the average test error across 70 trials of the different brackets of Hyperband if they were used separately as well as standard Hyperband. In practice we do not know a priori which bracket will be most effective in identifying good hyperparameters, and in this case neither the most ($s = 4$) nor least aggressive ($s = 0$) setting is optimal. But we note that Hyperband does nearly as well as the optimal bracket ($s = 3$) and vastly outperforms the baseline uniform allocation (i.e. random search), which is equivalent to bracket $s = 0$.
While the previous example focused on iterations as the resource, Hyperband naturally generalizes to various types of resources:
Time - Early-stopping in terms of time can be preferred when various hyperparameter configurations differ in training time and the practitioner’s chief goal is to find a good hyperparameter setting in a fixed wall-clock time. For instance, training time could be used as a resource to quickly terminate straggler jobs in distributed computation environments.
Dataset Subsampling - Here we consider the setting of a black-box batch training algorithm that takes a dataset as input and outputs a model. In this setting, we treat the resource as the size of a random subset of the dataset with $R$ corresponding to the full dataset size. Subsampling dataset sizes using Hyperband, especially for problems with super-linear training times like kernel methods, can provide substantial speedups.
Feature Subsampling - Random features or Nyström-like methods are popular methods for approximating kernels for machine learning applications (Rahimi and Recht, 2007). In image processing, especially deep-learning applications, filters are usually sampled randomly with the number of filters having an impact on the performance. Downsampling the number of features is a common tool used when hand-tuning hyperparameters, and Hyperband can formalize this heuristic.
The resource $R$ and $\eta$ (which we address next) are the only required inputs to Hyperband. As mentioned in Section 2.2, $R$ represents the maximum amount of resources that can be allocated to any given configuration. In most cases, there is a natural upper bound on the maximum budget per configuration that is often dictated by the resource type (e.g., training set size for dataset downsampling; limitations based on memory constraint for feature downsampling; rule of thumb regarding number of epochs when iteratively training neural networks). If there is a range of possible values for $R$, a smaller $R$ will give a result faster (since the budget $B$ for each bracket is a multiple of $R$) but a larger $R$ will give a better guarantee of successfully differentiating between the configurations.
Moreover, for settings in which $R$ is either unknown or not desired, we provide an infinite horizon version of Hyperband in Section 4. In this version of the algorithm, we use a budget $B$ that doubles over time, and for each choice of $B$ we consider all possible values of $n$. For each choice of $B$ and $n$, we run an instance of the (infinite horizon) SuccessiveHalving algorithm that grows $R$ with increasing $B$. The main difference between this algorithm and Algorithm 1 is that the number of unique brackets grows over time instead of the same brackets simply being looped over. We will analyze this version of Hyperband in more detail in Section 4 and use it as the launching point for the theoretical analysis of the standard (finite horizon) Hyperband.
Note that $R$ is also the number of configurations evaluated in the bracket that performs the most exploration, i.e. $s = s_{\max}$. In practice one may want to limit overhead associated with training many configurations on a small budget, i.e. costs associated with initialization, loading a model, and validation. In this case, set $s_{\max} = \lfloor \log_\eta(n_{\max}) \rfloor$. Alternatively, one can redefine one unit of resource so that $R$ is artificially smaller (i.e. if the desired maximum iteration is 100k, defining one unit of resource to be 100 iterations will give $R = 1{,}000$, whereas defining one unit to be 1k iterations will give $R = 100$). Thus, one unit of resource can be interpreted as the minimum desired resource and $R$ as the ratio between maximum resource and minimum resource.
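The two ways of limiting overhead can be checked with a few lines of arithmetic. In this sketch, `R` is the maximum resource per configuration, `eta` is the elimination factor, and `n_max` is the optional cap on the number of configurations; the helper names `floor_log`, `resource_units`, and `capped_s_max` are hypothetical and not part of any published code.

```python
def floor_log(x, base):
    """Largest integer s with base**s <= x (exact for integer inputs)."""
    s = 0
    while base ** (s + 1) <= x:
        s += 1
    return s


def resource_units(max_iterations, iterations_per_unit):
    """R expressed in units of the chosen minimum resource."""
    return max_iterations // iterations_per_unit


def capped_s_max(R, eta, n_max=None):
    """Index of the most exploratory bracket, optionally capped by n_max."""
    return floor_log(R if n_max is None else n_max, eta)


# Redefining the unit of resource changes R for the same 100k iterations.
R_fine = resource_units(100_000, 100)      # one unit = 100 iterations
R_coarse = resource_units(100_000, 1_000)  # one unit = 1k iterations

# Capping the number of configurations reduces the number of brackets.
print(capped_s_max(R_fine, 3), capped_s_max(R_fine, 3, n_max=27))
```

Both options reduce the number of brackets; the coarser unit additionally raises the smallest amount of training any configuration receives.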
The value of $\eta$ can be viewed as a knob that can be tuned based on practical user constraints. Larger values of $\eta$ correspond to a more aggressive elimination schedule and thus fewer rounds of elimination; specifically, each round retains $1/\eta$ configurations for a total of $\lfloor \log_\eta(n) \rfloor + 1$ rounds of elimination with $n$ configurations. If one wishes to receive a result faster at the cost of a sub-optimal asymptotic constant, one can increase $\eta$ to reduce the budget per bracket $B = (\lfloor \log_\eta(R) \rfloor + 1)R$. We stress that results are not very sensitive to the choice of $\eta$. If our theoretical bounds are optimized (see Section 4) they suggest choosing $\eta = e \approx 2.718$ but in practice we suggest taking $\eta$ to be equal to 3 or 4 (if you don’t know how to choose $\eta$, use $\eta = 3$).
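A short sketch of how the elimination factor changes the number of brackets and the budget per bracket, assuming the budget per bracket is the number of brackets multiplied by the maximum resource `R`. The value `R = 81` and the helper names are chosen only for illustration.

```python
def floor_log(x, base):
    """Largest integer s with base**s <= x (exact for integer inputs)."""
    s = 0
    while base ** (s + 1) <= x:
        s += 1
    return s


def bracket_summary(R, eta):
    """Number of brackets and nominal budget per bracket for given inputs."""
    s_max = floor_log(R, eta)
    return {"brackets": s_max + 1, "budget_per_bracket": (s_max + 1) * R}


for eta in (2, 3, 4, 9):
    print(eta, bracket_summary(81, eta))
```

Increasing the elimination factor lowers both quantities, which is the sense in which a larger value returns a first result sooner.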
Tuning $\eta$ will also change the number of brackets and consequently the number of different tradeoffs that Hyperband tries. Usually, the possible range of brackets is fairly constrained since the number of brackets is logarithmic in $R$; specifically, there are $\lfloor \log_\eta(R) \rfloor + 1 = s_{\max} + 1$ brackets. However, for large $R$, using $\eta = 3$ or 4 can give more brackets than desired. The number of brackets can be controlled in a few ways. First, as mentioned in the previous section, if $R$ is too large and overhead is an issue, then one may want to control the overhead by limiting the maximum number of configurations to $n_{\max}$, thereby also limiting $s_{\max}$. If overhead is not a concern and aggressive exploration is desired, one can (1) increase $\eta$ to reduce the number of brackets while maintaining $R$ as the maximum number of configurations in the most exploratory bracket or (2) still use $\eta = 3$ or 4 but only try brackets that do a baseline level of exploration, i.e. set $n_{\min}$ and only try brackets from $s_{\max}$ to $s = \lfloor \log_\eta(n_{\min}) \rfloor$. For computationally intensive problems that have long training times and high-dimensional search spaces, we recommend the latter. Intuitively, if the number of configurations that can be trained to completion (i.e. trained using $R$ resources) in a reasonable amount of time is on the order of the dimension of the search space and not exponential in the dimension, then it will be impossible to find a good configuration without using an aggressive exploratory tradeoff between $n$ and $B/n$.
The theoretical properties of Hyperband are best demonstrated through an example. Suppose there are $n$ configurations, each with a given terminal validation error $\nu_i$ for $i = 1, \ldots, n$. Without loss of generality, index the configurations by performance so that $\nu_1$ corresponds to the best performing configuration, $\nu_2$ to the second best, and so on. Now consider the task of identifying the best configuration. The optimal strategy would allocate to each configuration $i$ the minimum resource required to distinguish it from $\nu_1$, i.e. enough so that the envelope functions (see Figure 3) bound the intermediate loss to be less than $\frac{\nu_i - \nu_1}{2}$ away from the terminal value. In contrast, the naive uniform allocation strategy, which allocates $B/n$ to each configuration, has to allocate to every configuration the resource required to distinguish $\nu_2$ from $\nu_1$. Remarkably, the budget required by SuccessiveHalving is only a small factor of the optimal because it capitalizes on configurations that are easy to distinguish from $\nu_1$.
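The gap between the optimal and the uniform allocation can be made concrete with a toy calculation. The sketch assumes an envelope half-width of $1/r$ after $r$ resources, so a configuration whose terminal loss is a given gap above the best one is separated once $r$ reaches $2/\text{gap}$; it also assumes the best configuration needs as much resource as the runner-up. The envelope, the loss values, and the function names are hypothetical.

```python
from fractions import Fraction


def resource_to_separate(gap):
    """Resource at which the assumed envelope half-width 1/r equals gap / 2."""
    return 2 / gap


def budgets(terminal_losses):
    """Return (optimal_budget, uniform_budget) for distinct terminal losses."""
    nus = sorted(terminal_losses)
    best = nus[0]
    # Resource needed by every non-best configuration, easiest gaps last.
    per_config = [resource_to_separate(nu - best) for nu in nus[1:]]
    hardest = per_config[0]  # separating the runner-up from the best
    # Optimal: each configuration gets only what it needs; the best one is
    # assumed to need as much as the runner-up.
    optimal = hardest + sum(per_config)
    # Uniform: every configuration gets the amount needed for the hardest pair.
    uniform = len(nus) * hardest
    return optimal, uniform


losses = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)]
optimal, uniform = budgets(losses)
print(f"optimal budget: {float(optimal)}, uniform budget: {float(uniform)}")
```

Configurations far from the best one are cheap to rule out, which is the saving that adaptive allocation exploits and uniform allocation forgoes.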
The relative size of the budget required for uniform allocation and SuccessiveHalving depends on the envelope functions bounding deviation from terminal losses as well as the distribution from which the $\nu_i$’s are drawn. The budget required for SuccessiveHalving is smaller when the optimal $n$ versus $B/n$ tradeoff discussed in Section 2.1 requires fewer resources per configuration. Hence, if the envelope functions tighten quickly as a function of resource allocated, or the average distance between terminal losses is large, then SuccessiveHalving can be substantially faster than uniform allocation. These intuitions are formalized in Section 4 and associated theorems/corollaries are provided that take into account the envelope functions and the distribution from which the $\nu_i$’s are drawn. Of course we do not have knowledge of either function in practice, so we will hedge our aggressiveness with Hyperband. We show in Section 4.3.3 that Hyperband, despite having no knowledge of the envelope functions nor the distribution of the $\nu_i$’s, requires a budget that is only log factors larger than that of SuccessiveHalving.
In this section, we evaluate the empirical behavior of Hyperband with three different resource types: iterations, dataset subsamples, and feature samples. For all experiments, we compare Hyperband with three state-of-the-art Bayesian optimization algorithms: SMAC, TPE, and Spearmint. Whereas SMAC and TPE are tree-based Bayesian optimization methods, Spearmint (Snoek et al., 2012) uses Gaussian processes to model the problem. We exclude Spearmint from the comparison set when there are conditional hyperparameters in the search space because it does not natively support them (Eggensperger et al., 2013). For the deep learning experiments described in the next section, we also compare against a variant of SMAC named SMAC_early that uses the early termination criterion proposed in Domhan et al. (2015) for deep neural networks. Additionally, we show results for SuccessiveHalving corresponding to repeating the most exploratory bracket of Hyperband. Finally, for all experiments, we benchmark against standard random search and random_2x, which is a variant of random search with twice the budget of other methods (as described in Section 1).
We study a convolutional neural network with the same architecture as that used in Snoek et al. (2012) and Domhan et al. (2015) from cuda-convnet.^{3} The search spaces used in the two previous works differ, and we used a search space similar to that of Snoek et al. (2012) with 6 hyperparameters for stochastic gradient descent and 2 hyperparameters for the response normalization layers (see Appendix A for details). In line with the two previous works, we used a batch size of 100 for all experiments.
^{3} The model specification is available at http://code.google.com/p/cuda-convnet/.
Datasets: We considered three image classification datasets: CIFAR-10 (Krizhevsky, 2009), rotated MNIST with background images (MRBI) (Larochelle et al., 2007), and Street View House Numbers (SVHN) (Netzer et al., 2011). CIFAR-10 and SVHN contain RGB images while MRBI contains grayscale images. Each dataset is split into a training, validation, and test set: (1) CIFAR-10 has 40k, 10k, and 10k instances; (2) MRBI has 10k, 2k, and 50k instances; and (3) SVHN has close to 600k, 6k, and 26k instances for training, validation, and test respectively. For all datasets, the only preprocessing performed on the raw images was demeaning.
Hyperband Configuration: For these experiments, one unit of resource corresponds to 100 mini-batch iterations (10k examples with a batch size of 100). For CIFAR-10 and MRBI, $R$ is set to 300 (or 30k total iterations). For SVHN, $R$ is set to 600 (or 60k total iterations) to accommodate the larger training set. $\eta$ was set to 4 for all experiments, resulting in 5 SuccessiveHalving brackets for Hyperband.
Results: Ten independent trials were performed for each searcher. In each trial, the searcher is given a total budget of $50R$ to return the best possible hyperparameter configuration. For Hyperband, the budget is sufficient to run the outer loop twice (for a total of 10 SuccessiveHalving brackets). For SMAC, TPE, and random search, the budget corresponds to training 50 different configurations to completion. The experiments took the equivalent of over 1 year of GPU hours on NVIDIA GRID K520 cards available on Amazon EC2 g2.8xlarge instances. We set a total budget constraint in terms of iterations instead of compute time to make comparisons hardware independent.^{4} Comparing progress by iterations instead of time ignores overhead costs not associated with training, such as the cost of configuration selection for Bayesian methods and model initialization and validation costs for Hyperband. While overhead is hardware dependent, the overhead for Hyperband is below 5% on EC2 g2.8xlarge machines, so comparing progress by time passed would not impact results significantly.
^{4} Most trials were run on Amazon EC2 g2.8xlarge instances but a few trials were run on different machines due to the large computational demand of these experiments.
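The budget accounting can be verified with a few lines of arithmetic, using the settings stated for these experiments (elimination factor 4, maximum resource of 300 units for CIFAR-10 and MRBI and 600 for SVHN, total budget of 50 times the maximum resource). The sketch assumes each bracket nominally costs the number of brackets multiplied by the maximum resource; the function names are hypothetical.

```python
def floor_log(x, base):
    """Largest integer s with base**s <= x (exact for integer inputs)."""
    s = 0
    while base ** (s + 1) <= x:
        s += 1
    return s


def outer_loops_within(total_budget, R, eta):
    """Number of complete Hyperband outer loops that fit in total_budget."""
    brackets = floor_log(R, eta) + 1
    cost_per_bracket = brackets * R        # nominal budget of one bracket
    cost_per_outer_loop = brackets * cost_per_bracket
    return total_budget // cost_per_outer_loop


def full_trainings_within(total_budget, R):
    """Number of configurations trainable to completion (random search)."""
    return total_budget // R


for R in (300, 600):
    total = 50 * R
    print(R, outer_loops_within(total, R, eta=4), full_trainings_within(total, R))
```

With 5 brackets per outer loop, two outer loops correspond to the 10 SuccessiveHalving brackets mentioned above, while the same budget trains 50 configurations to completion.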
For CIFAR-10, the results in Figure 4(a) show that Hyperband is over an order of magnitude faster than its competitors. For MRBI, Hyperband is over an order of magnitude faster than standard configuration selection approaches and 5× faster than SMAC with early stopping. For SVHN, while Hyperband finds a good configuration faster, Bayesian optimization methods are competitive and SMAC with early stopping outperforms Hyperband. We view SMAC with early stopping as a combination of adaptive configuration selection and configuration evaluation. This result demonstrates that there is merit to incorporating early stopping with configuration selection approaches.
Across the three datasets, Hyperband and SMAC_early are the only two methods that consistently outperform random_2x. On these datasets, Hyperband is over 20× faster than random search while SMAC_early is faster than random search within the evaluation window. In fact, the first result returned by Hyperband after using a budget of $5R$ is often competitive with results returned by other searchers after using $50R$. Additionally, Hyperband is less variable than other searchers across trials, which is highly desirable in practice (see Appendix A for plots with error bars).
As discussed in Section 2.6, for computationally expensive problems in high dimensional search spaces, it may make sense to just repeat the most exploratory brackets. Similarly, if meta-data is available about a problem or it is known that the quality of a configuration is evident after allocating a small amount of resource, then one should just repeat the most exploratory bracket. Indeed, for these experiments, repeating the most exploratory bracket of Hyperband outperforms cycling through all the brackets. In fact, bracket $s = 4$ vastly outperforms all other methods on CIFAR-10 and MRBI and is nearly tied with SMAC_early for first on SVHN.
While we set $R$ for these experiments to facilitate comparison to Bayesian methods and random search, it is also reasonable to not limit the maximum number of iterations and deploy infinite horizon Hyperband. We evaluate infinite horizon Hyperband for CIFAR-10, using $\eta = 4$ and a starting budget of $B = 2R$. Figure 4(a) shows that infinite horizon Hyperband is competitive with other methods but does not perform as well as finite horizon Hyperband within the 50 times max iteration limit on total budget. Hence, we limit our focus to the finite horizon version of Hyperband for the remainder of our empirical studies.
Finally, CIFAR-10 is a very popular dataset and state-of-the-art models achieve much better accuracies than what is shown in Figure 4. The difference in performance is mainly attributable to higher model complexities and data manipulation (i.e. using reflection or random cropping to artificially increase the dataset size). If we limit the comparison to published results that use the same architecture and exclude data manipulation, the best human expert result for the dataset is 18% error and the hyperparameter optimized result is 15.0% for Snoek et al. (2012)^{5} and 17.2% for Domhan et al. (2015). These results are better than our results on CIFAR-10 because they use 25% more data by including the validation set and also train for more epochs. The best model found by Hyperband achieved a test error of 17.0% when trained on the combined training and validation data for 300 epochs.
^{5} We were unable to reproduce this result even after receiving the optimal hyperparameters from the authors through a personal communication.
We studied two different hyperparameter search optimization problems for which Hyperband uses dataset subsamples as the resource. The first adopts an extensive framework presented in Feurer et al. (2015a) that attempts to automate preprocessing and model selection. Due to certain limitations of the framework that fundamentally limit the impact of dataset downsampling, we focus on tuning a kernel classification task in the second experiment.
We use the framework introduced by Feurer et al. (2015a), which explores a structured hyperparameter search space with a total of 110 hyperparameters comprising 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods. Similar to Feurer et al. (2015a), we impose a 3GB memory limit, a 6-minute timeout for each hyperparameter configuration and a one-hour time window to evaluate each searcher on each dataset. Twenty trials of each searcher were performed per dataset and all trials in aggregate took over a year of CPU time on n1-standard-1 instances from Google Cloud Compute. Additional details about our experimental framework are available in Appendix A.
Datasets: Feurer et al. (2015a) used 140 binary and multiclass classification datasets from OpenML, but 23 of them are incompatible with the latest version of the OpenML plugin (Feurer, 2015), so we worked with the remaining 117 datasets. Due to the limitations of the experimental setup (discussed in Appendix A), we also separately considered 21 of these datasets which, based on preliminary evaluation with subsampled datasets, demonstrate at least modest (though still sublinear) training speedups due to subsampling. Specifically, each of these 21 datasets showed on average at least a 3× speedup due to 8× downsampling on 100 randomly selected hyperparameter configurations.
Hyperband Configuration: We run Hyperband with $\eta = 3$, i.e., each run of SuccessiveHalving throws out 2/3 of the arms and keeps the remaining 1/3. $R$ is set to equal the full training set size for each dataset and an upper bound on the maximum number of configurations for any round of SuccessiveHalving is set to . This ensures that the most exploratory bracket of Hyperband will downsample at least twice and the minimum sample size allocated will be . As mentioned in Section 2.6, when is specified, the only difference when running the algorithm is instead of .
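As a rough illustration of the elimination rule just described, the following sketch lists how many configurations survive each round and how much resource each survivor receives. The function name `halving_schedule`, the starting values, and the use of floor division for rounding are assumptions made for this example, not details taken from the experiments.

```python
def halving_schedule(n_configs, min_resource, eta=3):
    """List (configs_evaluated, resource_per_config) for each round.

    Assumed rule for illustration: after every round keep a 1/eta
    fraction of the configurations (floor division) and give each
    survivor eta times more resource.
    """
    schedule = []
    n, r = n_configs, min_resource
    while n >= 1:
        schedule.append((n, r))
        if n == 1:
            break  # a single configuration remains, nothing left to discard
        n = n // eta  # keep the best 1/eta fraction
        r = r * eta   # survivors receive eta times more resource
    return schedule
```

For example, `halving_schedule(27, 1)` starts 27 configurations on one unit of resource and ends with a single configuration trained on 27 units.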
Results: The results on all 117 datasets in Figure 5(a,b) show that Hyperband outperforms random search in test error rank despite performing worse in validation error rank. Bayesian methods, which also exhibit overfitting, still outperform both random search and Hyperband on test error rank. Notably, random_2 outperforms all other methods. However, for the subset of 21 datasets, Figure 5(c) shows that Hyperband outperforms all other searchers on test error rank, including random_2 by a very small margin. While these results are more promising, the effectiveness of Hyperband was restricted in this experimental framework; for smaller datasets, the startup overhead was high relative to total training time, while for larger datasets, only a handful of configurations could be trained within the hour window. Additionally, the results for the most exploratory bracket of Hyperband are not shown due to the aforementioned limitations of the framework.
Hyperband demonstrates modest improvements when using dataset downsampling with sublinear training algorithms, as illustrated in the previous results. We will next show that Hyperband can offer far greater speedups in settings where training time is superlinear in the number of training instances. In this setup, we focus on optimizing hyperparameters for a kernel-based classification task on CIFAR-10. We use the multi-class regularized least squares classification model which is known to have comparable performance to SVMs (Rifkin and Klautau, 2004; Agarwal et al., 2014) but can be trained significantly faster.^{6} The hyperparameters considered in the search space include preprocessing method, regularization, kernel type, kernel length scale, and other kernel specific hyperparameters (see Appendix A for more details). Hyperband is run with and , with each unit of resource representing 100 datapoints. Similar to previous experiments, these inputs result in a total of 5 brackets. Each hyperparameter optimization algorithm is run for ten trials on Amazon EC2 m4.2xlarge instances; for a given trial, Hyperband is allowed to run for two outer loops, bracket is repeated 10 times, and all other searchers are run for 12 hours.
^{6} The default SVM method in Scikit-learn is single core and takes hours to train on CIFAR-10, whereas a block coordinate least squares solver takes less than 10 minutes on an 8 core machine.
Figure 3.2.2 shows that Hyperband returns a good configuration after just the first SuccessiveHalving bracket in approximately 20 minutes; other searchers fail to reach this error rate on average even after the entire 12 hours. Notably, Hyperband was able to evaluate over 250 configurations in this first bracket of SuccessiveHalving, while competitors were able to evaluate only three configurations in the same amount of time. Consequently, Hyperband is over 30× faster than Bayesian optimization methods and 70× faster than random search. Bracket slightly outperforms Hyperband but the terminal performance of the two algorithms is the same. Random_2 is competitive with SMAC and TPE.
We next demonstrate the performance of Hyperband when using features as a resource, focusing on random feature approximations for kernel methods. Features are randomly generated using the method described in Rahimi and Recht (2007) to approximate the RBF kernel, and these random features are then used as inputs to a ridge regression classifier. We consider hyperparameters of a random feature kernel approximation classifier trained on CIFAR-10, including preprocessing method, kernel length scale, and $\ell_2$ penalty. While it may seem natural to use infinite horizon Hyperband since we theoretically improve our approximation as we increase the number of features, in practice the amount of available machine memory imposes a natural upper bound on the number of features. We thus use finite horizon Hyperband with an upper bound of 100k random features, which will comfortably fit into a machine with 60GB of memory. Additionally, we set one unit of resource to be 100 features for an $R = 1000$, which gives 5 different brackets with . Each searcher is run for 10 trials, with each trial lasting 12 hours on an n1-standard-16 machine from Google Cloud Compute. The results in Figure 3.2.2 show that Hyperband is around 6× faster than Bayesian methods and random search. Hyperband performs similarly to bracket . Random_2 outperforms Bayesian optimization algorithms.
For a given , the most exploratory SuccessiveHalving round performed by Hyperband evaluates configurations using a budget of , which gives an upper bound on the potential speedup over random search. If training time scales linearly with the resource, the maximum speedup offered by Hyperband compared to random search is . For the values of and used in our experiments, the maximum speedup over random search is approximately given linear training time. However, we observe a range of speedups from to faster than random search. The differences in realized speedup can be explained by two factors: (1) the scaling properties of total evaluation time as a function of the allocated resource and (2) the difficulty of finding a good configuration.
Total evaluation time consists of the time to train the model on a given amount of resource as well as overhead costs. If training time is superlinear as a function of the resource, then Hyperband can offer higher speedups. More generally, if training scales like a polynomial of degree , the maximum speedup of Hyperband over random search is approximately . In the kernel least squares classifier experiment discussed in Section 3.2.2, the training time scaled quadratically as a function of the resource. This is why the realized speedup of 70× is higher than that expected given linear scaling. However, the maximum speedup of is not realized for two reasons: (1) overhead associated with evaluating many configurations on fewer resources and (2) insufficient difficulty of the search space.
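The effect of training-time scaling can be sketched numerically under a simplified cost model. The model below is an assumption made for illustration, not the paper's derivation: a configuration trained on r units of resource costs r^p, each round retrains from scratch, overhead is ignored, and random search trains every configuration on the full resource. The names `max_speedup`, `eta`, `s`, `p`, and `R` are hypothetical.

```python
def max_speedup(eta, s, p=1.0, R=1.0):
    """Cost of random search divided by cost of one halving bracket.

    Assumed setup: the bracket starts with eta**s configurations at
    resource R / eta**s, keeps a 1/eta fraction per round, and ends
    with one configuration at resource R. Training cost is resource**p.
    """
    n = eta ** s
    bracket_cost = sum(
        (n / eta ** i) * (R * eta ** (i - s)) ** p  # configs * cost each
        for i in range(s + 1)
    )
    random_cost = n * R ** p  # every configuration trained on full resource
    return random_cost / bracket_cost
```

Under these assumptions, `max_speedup(4, 4, p=1)` evaluates to 51.2, and the ratio grows when `p` is larger than 1, which is the qualitative point made above about superlinear training time.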
Overhead costs include costs associated with initializing a model, resuming previously trained models, and calculating validation error. If overhead costs are too high, the total evaluation time can be sublinear, hence giving poor speedups over random search. In the case of the downsampling experiments on 117 datasets presented in Section 3.2.1, Hyperband did not provide significant speedup because many datasets could be trained in a matter of a few seconds and the initialization cost was high relative to training time, hence many datasets had sublinear scaling of training time as a function of sample size.
Whether the full speedup is realized is determined by how hard it is to find a good configuration; if 10 randomly sampled configurations are sufficient to find a good hyperparameter setting then the benefit of Hyperband is muted, whereas if it takes more than a few hundred configurations then Hyperband can offer significant speedup. Generally the difficulty of the problem scales with the dimension of the search space since coverage diminishes with dimensionality. For low dimensional problems, the number of configurations evaluated by random search and Bayesian methods is exponential in the number of dimensions so good coverage can be achieved; i.e. if as in the features subsampling experiment, then . Hence, Hyperband is only 6× faster than random search on the feature subsampling experiment because 256 configurations are not needed to find a good configuration. For the neural network experiments however, we hypothesize that greater speedups are observed for Hyperband because the dimension of the search space is higher.
For all our experiments, Hyperband outperformed random search by a healthy margin because the validation error was for the most part monotonically decreasing with increasing resource, i.e. overfitting did not pose a problem. However, this convergence property can be violated if regularization hyperparameters are themselves a function of the resource. In the experiments shown in Section 3, the regularization terms were either independent of the resource in the case of the neural network experiments or scaled naturally with sample size or number of features as in the kernel experiments. In practice, it is important to choose regularization hyperparameters that are independent of the resource when possible. For example, when training random forests, the optimal maximum tree depth hyperparameter depends on the sample size but the relationship is unknown and can be dataset specific. An alternative regularization hyperparameter is minimum samples per leaf, which would scale better with the size of the training data as the resource.
In this section, we introduce the pure-exploration non-stochastic infinite-armed bandit (NIAB) problem, a very general setting which encompasses our hyperparameter optimization problem of interest. As we will show, Hyperband is in fact applicable far beyond just hyperparameter optimization. We begin by formalizing the hyperparameter optimization problem and then reducing it to the pure-exploration NIAB problem. We subsequently present a detailed analysis of Hyperband in both the infinite and finite horizon settings.
Let denote the space of valid hyperparameter configurations which could include continuous, discrete, or categorical variables that can be constrained with respect to each other in arbitrary ways (i.e. need not be limited to a subset of ). For let be a sequence of loss functions defined over . For any hyperparameter configuration , represents the validation error of the model trained using with units of resources (e.g. iterations). In addition, for some , define and . Note that for all , , and are all unknown to the algorithm a priori. In particular, it is uncertain how quickly varies as a function of for any fixed , and how quickly as a function of for any fixed .
We assume hyperparameter configurations are sampled randomly from a known probability distribution over . If is a random sample from this probability distribution, then is a random variable whose distribution is unknown since is unknown. Since it is unknown how varies as a function of or one cannot necessarily infer anything about given knowledge of for any , . As a consequence, we reduce the hyperparameter optimization problem down to a much simpler problem that ignores all underlying structure of the hyperparameters: we only interact with some through its loss sequence for . With this reduction, the particular value of does nothing more than index or uniquely identify the loss sequence.
Without knowledge of how fast or how is distributed, the goal of Hyperband is to identify a hyperparameter configuration that minimizes by drawing as many random configurations as desired, but using as few total resources as possible.
We now formally define the bandit problem of interest, and relate it to the problem of hyperparameter optimization. Each “arm” in the NIAB game is associated with a sequence that is drawn randomly from a distribution over sequences. If we “pull” the th drawn arm exactly times, we observe a loss . At each time, the player can either draw a new arm (sequence) or pull a previously drawn arm an additional time. There is no limit on the number of arms that can be drawn. We assume the arms are identifiable only by their index (i.e. we have no side-knowledge or feature representation of an arm), and we also make the following two additional assumptions:
Assumption 1: For each the limit exists and is equal to .^{7}
^{7} We can always define so that convergence is guaranteed, i.e. taking the infimum.
The objective of the NIAB problem is to identify an arm with small using as few total pulls as possible. We are interested in characterizing as a function of the total number of pulls from all the arms. Clearly, the hyperparameter optimization problem described above is an instance of the NIAB problem.
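To make the interaction protocol concrete, here is a toy sketch of the game from the player's point of view. Everything specific in it is an assumption for illustration: the class name `NIABGame`, the uniform distribution of the limits, and the loss sequence "limit + 1/k". The only feature taken from the description above is that the player sees an arm's index and its successive losses, nothing else.

```python
import random


class NIABGame:
    """Toy non-stochastic infinitely-many-armed bandit environment."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._limits = []      # hidden limit of each drawn arm
        self._pulls = []       # number of pulls made on each arm
        self.total_pulls = 0   # the quantity the player wants to keep small

    def draw_arm(self):
        """Draw a new arm (loss sequence) and return its index."""
        self._limits.append(self._rng.random())  # assumed: uniform limits
        self._pulls.append(0)
        return len(self._limits) - 1

    def pull(self, i):
        """Pull arm i once more and return the loss after k total pulls."""
        self._pulls[i] += 1
        self.total_pulls += 1
        k = self._pulls[i]
        return self._limits[i] + 1.0 / k  # assumed: deterministic 1/k decay
```

A player alternates between `draw_arm()` and `pull(i)` in any order; in the hyperparameter setting, drawing corresponds to sampling a configuration and pulling corresponds to training it with one more unit of resource.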
In order to analyze the behavior of Hyperband in the NIAB setting, we must define a few additional objects. Define to satisfy
(1) |
and let . Define as the pointwise smallest, monotonically decreasing function satisfying
(2) |
The function is guaranteed to exist by Assumption 1 and bounds the deviation from the limit value as the sequence of iterates increases. For hyperparameter optimization, is the deviation of the validation error of a configuration trained on a subset of resources versus the maximum number of allocatable resources. Define as the first index such that if it exists, otherwise set . For let , using the convention that which we recall can be infinite.
As previously discussed, there are many real-world scenarios in which is finite and known. For instance, if increasing subsets of the full dataset are used as a resource, then the maximum number of resources cannot exceed the full dataset size, and thus for all where is the (known) full size of the dataset. In other cases such as iterative training problems, one might not want to or know how to bound . We separate these two settings into the finite horizon setting where is finite and known, and the infinite horizon setting where no bound on is known and it is assumed to be infinite. While our empirical results suggest that the finite horizon may be more practically relevant for the problem of hyperparameter optimization, the infinite horizon case has natural connections to the literature, and we begin by analyzing this setting.
Consider the Hyperband algorithm of Figure 6. The algorithm uses SuccessiveHalving (Figure 6) as a subroutine that takes a finite set of arms as input and outputs an estimate of the best performing arm in the set. We first analyze SuccessiveHalving (SHA) for a given set of limits and then consider the performance of SHA when are drawn randomly according to . We then analyze the Hyperband algorithm. We note that the algorithm of Figure 6 was originally proposed by Karnin et al. (2013) for the stochastic setting. However, Jamieson and Talwalkar (2015) analyzed it in the non-stochastic setting and also found it to work well in practice. By a simple modification of the proof of Jamieson and Talwalkar (2015) we have the following theorem.
Fix arms. Let and assume . For any let
If the SuccessiveHalving algorithm of Figure 6 is run with any budget then an arm is returned that satisfies . Moreover, .
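A minimal sketch of a budgeted halving procedure of the kind analyzed here, written from the prose description rather than copied from Figure 6. The callback `loss(i, k)` (loss of arm i after k total pulls), the even split of the budget across rounds, and the rounding choices are assumptions for illustration.

```python
import math


def successive_halving(n_arms, budget, loss):
    """Return the index of the arm that survives repeated halving.

    loss(i, k) is assumed to return the loss of arm i after k >= 1 pulls.
    """
    survivors = list(range(n_arms))
    n_rounds = max(1, math.ceil(math.log2(n_arms)))
    pulls = 0  # cumulative pulls received by every surviving arm
    for _ in range(n_rounds):
        # Split the budget evenly over rounds, then over surviving arms.
        pulls += budget // (len(survivors) * n_rounds)
        if pulls == 0:
            raise ValueError("budget too small to pull every arm once")
        ranked = sorted(survivors, key=lambda i: loss(i, pulls))
        # Keep the better half (rounded up) of the current arms.
        survivors = ranked[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

With four arms and a budget of 64, this sketch pulls every arm 8 times, discards two, and gives the remaining two arms 16 further pulls each, so the whole budget is spent while the weaker arms receive only a small share of it.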
The next technical lemma will be used to characterize the problem dependent term when the sequences are drawn from a probability distribution.
Fix . Let . For any define
and so that
If arms are drawn randomly according to whose limits correspond to , then
for any with probability at least .
Setting in Theorem 4.3 and using the result of Lemma 4.3 that , we immediately obtain the following corollary. Fix and . Let where is defined in Lemma 4.3. If the SuccessiveHalving algorithm of Figure 6 is run with the specified and arm configurations drawn randomly according to , then an arm is returned such that with probability at least we have . In particular, if and then with probability at least .
Note that for any fixed we have for any
which implies . That is, needs to be sufficiently large so that it is probable that a good limit is sampled. On the other hand, for any fixed , Corollary 6 suggests that the total resource budget needs to be large enough in order to overcome the rates of convergence of the sequences described by . Next, we relate SHA to a naive approach that uniformly allocates resources to a fixed set of arms.
The non-adaptive uniform allocation strategy takes as inputs a budget and arms, allocates to each of the arms, and picks the arm with the lowest loss. The following results allow us to compare with SuccessiveHalving.
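For comparison, the uniform allocation baseline can be sketched in a few lines. The callback `loss(i, k)` and the function name are assumptions for illustration; the rule itself (equal share of the budget per arm, then pick the lowest loss) follows the description above.

```python
def uniform_allocation(n_arms, budget, loss):
    """Give every arm the same number of pulls and return the best index.

    loss(i, k) is assumed to return the loss of arm i after k >= 1 pulls.
    """
    pulls = budget // n_arms  # equal share of the budget per arm
    if pulls < 1:
        raise ValueError("budget too small to pull every arm once")
    return min(range(n_arms), key=lambda i: loss(i, pulls))
```

Because no arm is ever discarded, every arm, including clearly poor ones, consumes the same share of the budget.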
Suppose we draw random configurations from , train each with iterations, and let . Without loss of generality assume . If
(3) |
then with probability at least we have . In contrast, there exists a sequence of functions that satisfy and such that if
then with probability at least , we have , where is a constant that depends on the regularity of .
For any fixed and sufficiently large , Corollary 6 shows that SuccessiveHalving outputs an that satisfies with probability at least . This guarantee is similar to the result in Proposition 4.3.1. However, SuccessiveHalving achieves its guarantee as long as^{8}
(4)
and this sample complexity may be substantially smaller than the budget required by uniform allocation shown in Equation 3 of Proposition 4.3.1. Essentially, the first term in Equation 4 represents the budget allocated to the constant number of arms with limits while the second term describes the number of times the sub-optimal arms are sampled before being discarded. The next section uses a particular parameterization for and to help better illustrate the difference between the sample complexity of uniform allocation (Equation 3) versus that of SuccessiveHalving (Equation 4).
^{8} We say if there exist constants such that .
To gain some intuition and relate the results back to the existing literature we make explicit parametric assumptions on and . We stress that all of our results hold for general and as previously stated, and this parameterization is simply a tool to provide intuition. First assume that there exists a constant such that
(5) |
Note that a large value of implies that the convergence of is very slow.
We will consider two possible parameterizations of . First, assume there exist positive constants such that
(6) |
Here, a large value of implies that it is very rare to draw a limit close to the optimal value . Fix some . As discussed in the preceding section, if arms are drawn from then with probability at least we have . Predictably, both uniform allocation and SuccessiveHalving output a that satisfies with probability at least provided their measurement budgets are large enough. Thus, if and the measurement budgets of the uniform allocation (Equation 3) and SuccessiveHalving (Equation 4) satisfy
then both also satisfy with probability at least .^{9} SuccessiveHalving’s budget scales like , which can be significantly smaller than the uniform allocation’s budget of . However, because and are unknown in practice, neither method knows how to choose the optimal or to achieve this accuracy. In Section 4.3.3 we show how Hyperband addresses this issue.
^{9} These quantities are intermediate results in the proofs of the theorems of Section 4.3.3.
The second parameterization of is the following discrete distribution:
(7) |
for some set of unique scalars . Note that by letting this discrete CDF can approximate any piecewise-continuous CDF to arbitrary accuracy. In this setting, we have that both uniform allocation and SuccessiveHalving output a that is within the top fraction of the arms with probability at least if their budgets are sufficiently large. Thus, if and the measurement budgets of the uniform allocation (Equation 3) and SuccessiveHalving (Equation 4) satisfy
then an arm that is in the best -fraction of arms is returned, i.e. and , with probability at least . We remark that the value of in Corollary 6 is carefully chosen to make the SuccessiveHalving budget and guarantee work out. Also note that one would never take because is sufficient to return the best arm.
The Hyperband algorithm of Figure 6 addresses the tradeoff between the number of arms versus the average number of times each one is pulled by performing a two-dimensional version of the so-called “doubling trick.” For each fixed , we non-adaptively search a predetermined grid of values of spaced geometrically apart so that the incurred loss of identifying the “best” setting takes a budget no more than times the budget necessary if the best setting of were known ahead of time. Then, we successively double so that the cumulative number of measurements needed to arrive at the necessary is no more than . The idea is that even though we do not know the optimal setting for to achieve some desired error rate, the hope is that by trying different values in a particular order, we will not waste too much effort.
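The search order described above can be sketched as a generator of (budget, number of arms) pairs. This is an illustrative assumption about the loop structure, not the pseudocode of Figure 6: the budget doubles in an outer loop, the number of arms runs over powers of two in an inner loop, and a pair is skipped when the budget could not cover one pull per arm in every halving round. The name `hyperband_schedule` is hypothetical.

```python
def hyperband_schedule(max_k):
    """Yield (budget, n_arms) pairs in the order they would be tried."""
    for k in range(1, max_k + 1):
        budget = 2 ** k  # outer loop: double the budget
        for l in range(k + 1):
            n_arms = 2 ** l        # inner loop: geometric grid of arm counts
            n_rounds = max(1, l)   # halving rounds needed for 2**l arms
            if budget >= n_arms * n_rounds:
                yield budget, n_arms
```

Each yielded pair would then be passed to a halving subroutine with freshly drawn arms. Small budgets are spent on few arms first, and larger arm counts only appear once the budget has grown enough to support them.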
Fix . For all pairs defined in the Hyperband algorithm of Figure 6, let . For all define to be the event that
for some , and
.
By Corollary 6 we have . For any , let be the empirically best-performing arm output from SuccessiveHalving of round of Hyperband of Figure 6 and let be the largest value that makes hold. Then after plugging in the relevant quantities we obtain
Also note that on stage at most total samples have been taken. While this guarantee holds for general , the value of , and consequently the resulting bound, is difficult to interpret. The following corollary considers the parameterizations of and , respectively, of Section 4.3.2 for better interpretation.
Assume that Assumptions 1 and 2 of Section 4.2 hold and that the sampled loss sequences obey the parametric assumptions of Equations 5 and 6. Fix . For any , let be the empirically best-performing arm output from SuccessiveHalving from the last round of Hyperband of Figure 6 after total samples had been taken from all rounds, then
for some constant where .
By a straightforward modification of the proof, one can show that if uniform allocation is used in place of SuccessiveHalving in Hyperband, the uniform allocation version achieves . We apply the above theorem to the stochastic infinite-armed bandit setting in the following corollary.
[Stochastic Infinite-armed Bandits] For any step in the infinite horizon Hyperband algorithm with arms drawn, consider the setting where the th pull of the th arm results in a stochastic loss such that and . If then with probability at least we have ,
Consequently, if after total pulls we define as the mean of the empirically best arm output from the last fully completed round , then with probability at least
The result of this corollary matches the anytime result of Section 4.3 of Carpentier and Valko (2015) whose algorithm was built specifically for the case of stochastic arms and the parameterization of defined in Equation 6. Notably, this result also matches the lower bounds shown in that work up to poly-logarithmic factors, revealing that Hyperband is nearly tight for this important special case. However, we note that this earlier work has a more careful analysis for the fixed budget setting.
Assume that Assumptions 1 and 2 of Section 4.2 hold and that the sampled loss sequences obey the parametric assumptions of Equations 5 and 7. For any , let be the empirically best-performing arm output from SuccessiveHalving from the last round of Hyperband of Figure 6 after total samples had been taken from all rounds. Fix and and let . Once total pulls have been made by Hyperband we have with probability at least where hides factors.
Appealing to the stochastic setting of Corollary 4.3.3 so that , we conclude that the sample complexity sufficient to identify an arm within the best proportion with probability , up to log factors, scales like . One may interpret this result as an extension of the distribution-dependent pure-exploration results of Bubeck et al. (2009); but in our case, our bounds hold when the number of pulls is potentially much smaller than the number of arms . When this implies that the best arm is identified with about which matches known upper bounds (Karnin et al., 2013; Jamieson et al., 2014) and lower bounds (Kaufmann et al., 2015) up to factors. Thus, for the stochastic -armed bandit problem Hyperband recovers many of the known sample complexity results up to factors.
In this section we analyze the algorithm described in Section 2, i.e., finite horizon Hyperband. We present similar theoretical guarantees as in Section 4.3 for infinite horizon Hyperband, and fortunately much of the analysis will be recycled. We state the finite horizon version of the SuccessiveHalving and Hyperband algorithms in Figure 7.