Weighting Is Worth the Wait: Bayesian Optimization with Importance Sampling

02/23/2020 · Setareh Ariafar et al. · Northeastern University

Many contemporary machine learning models require extensive tuning of hyperparameters to perform well. A variety of methods, such as Bayesian optimization, have been developed to automate and expedite this process. However, tuning remains extremely costly as it typically requires repeatedly fully training models. We propose to accelerate the Bayesian optimization approach to hyperparameter tuning for neural networks by taking into account the relative amount of information contributed by each training example. To do so, we leverage importance sampling (IS); this significantly increases the quality of the black-box function evaluations, but also their runtime, and so must be done carefully. Casting hyperparameter search as a multi-task Bayesian optimization problem over both hyperparameters and importance sampling design achieves the best of both worlds: by learning a parameterization of IS that trades-off evaluation complexity and quality, we improve upon Bayesian optimization state-of-the-art runtime and final validation error across a variety of datasets and complex neural architectures.


1 Introduction

The incorporation of more parameters and more data, coupled with faster computing and longer training times, has driven state-of-the-art results across a variety of benchmark tasks in machine learning. However, careful model tuning remains critical in order to find good configurations of hyperparameters, architecture and optimization settings. This tuning requires significant experimentation, training many models, and is often guided by expert intuition, grid search, or random sampling. Such experimentation multiplies the cost of training, and incurs significant financial, computational, and even environmental costs (Strubell et al., 2019).

Bayesian optimization (BO) offers an efficient alternative when the tuning objective can be effectively modeled by a surrogate regression (Bergstra et al., 2011; Snoek et al., 2012), or when one can take advantage of related tasks (Swersky et al., 2013) or strong priors over problem structure (Swersky et al., 2014; Domhan et al., 2015). BO optimizes an expensive function by iteratively building a relatively cheap probabilistic surrogate and evaluating a carefully balanced combination of uncertain and promising regions (exploration vs. exploitation).

In the context of neural network hyperparameter optimization, BO typically involves an inner loop of training a model given a hyperparameter configuration, and then evaluating validation error as the objective to be optimized. This inner loop is expensive and its cost grows with the size of the dataset: querying modern models even once may require training for days or weeks.

One strategy to mitigate the high cost of hyperparameter tuning is to enable the BO algorithm to trade off between the value of the information gained from evaluating a hyperparameter setting and the cost of that evaluation. For example, Swersky et al. (2013) and Klein et al. (2016) allow BO to evaluate models trained on randomly chosen subsets of data to obtain more, but less informative, evaluations. We propose an alternative approach: our method, Importance-based Bayesian Optimization (IBO), dynamically learns when spending additional effort training a network to obtain a higher-fidelity observation is worth the incurred cost. To achieve this, in addition to considering the hyperparameters, IBO takes into account the underlying training data and focuses the computation on more informative training points. Specifically, IBO models a distribution over the location of the optimal hyperparameter configuration, and allocates experimental budget according to cost-adjusted expected reduction in entropy (Hennig and Schuler, 2012). Under this criterion, higher-fidelity observations provide a greater reduction in entropy, albeit at a higher evaluation cost.

Figure 1: Motivating example. (a) GP model of the objective with many noisy points; (b) GP model of the objective with few noiseless points. Compared to querying many points with lower fidelity (left), observing few points with higher fidelity (right) can significantly improve the model's predicted minimum location (here, lower-fidelity queries always over-estimate the target function, reflecting the impact of limiting the number of SGD iterations during training). However, obtaining higher-quality estimates can significantly slow down the overall runtime of Bayesian optimization; by learning to optimize the tradeoff between the value and cost of obtaining high-fidelity estimates, we show in Section 4 that IBO achieves the best of both worlds.

To decide how much effort to allocate to training a network and which training examples to prioritize, IBO leverages both the properties of stochastic gradient descent (SGD) and recent work on importance sampling (Johnson and Guestrin, 2018). At each SGD iteration, IBO estimates how much each training example will impact the model; based on this estimate, IBO either does a normal round of SGD, or a more costly importance-weighted gradient update.

Balancing the cost of the inner loop of neural network training and the outer loop of BO is a non-trivial task; if done naively, the overall hyperparameter tuning procedure becomes substantially slower. To address this issue, we adopt a multi-task Bayesian optimization formulation for IBO and develop an algorithm that dynamically adjusts the trade-off between training a network at higher fidelity and obtaining more, but noisier, evaluations (Fig. 1). This approach allows us to obtain higher-quality black-box function evaluations only when worthwhile, while controlling the average cost of black-box queries. As a consequence, we are able to tune complex network architectures over challenging datasets in less time and with better results than existing state-of-the-art BO methods. Tuning a ResNet on CIFAR-100, IBO improves the validation error over the next best method; other baselines are not able to reach IBO's performance, even with additional computational budget.

Contributions.

We introduce a multi-task Bayesian optimization framework, IBO (Importance-based Bayesian Optimization), which takes into account the contribution of each training point during the evaluation of a candidate hyperparameter. To do so, IBO optimizes the importance sampling tradeoff between quality and runtime while simultaneously searching hyperparameter space. We show in extensive benchmark experiments that the computational burden incurred by importance sampling is more than compensated for by the principled search through hyperparameter space that it enables. We show across these experiments that IBO consistently improves over a variety of baseline Bayesian optimization methods. On more complex datasets, IBO converges significantly faster in wall-clock time than existing methods and furthermore reaches lower validation errors, even as other methods are given larger time budgets.

2 Related work

Several different methods have been proposed to accelerate the hyperparameter tuning process. Swersky et al. (2013) proposed Multi-Task Bayesian Optimization (MTBO), which performs cheap surrogate function evaluations on a small subset of the training data, which are then used to extrapolate performance on the entire training set. Motivated by this work, Klein et al. (2016) introduced Fabolas, which extends MTBO to also learn a sufficient size of the training data. MTBO and Fabolas avoid costly function evaluations by training on small datasets whose data is uniformly chosen at the beginning of each training round.

Another body of related work models the neural network's loss as a function of both the hyperparameters and the inner training iterations; the goal is then to extrapolate the ultimate objective value and stop underperforming training runs early. Work such as (Swersky et al., 2014; Domhan et al., 2015; Dai et al., 2019; Golovin et al., 2017) falls under this category. These methods generally have to deal with the cost of Gaussian processes, which is cubic in the number of observations (here, the number of observed hyperparameter configurations times the number of training iterations per configuration). In practice, these methods typically apply some type of relaxation. For example, the freeze-thaw method (Swersky et al., 2014) assumes that training curves for different hyperparameter configurations are independent conditioned on their prior mean, which is drawn from another global GP.

An alternative to Bayesian optimization solves the hyperparameter tuning problem through enhanced random search. Hyperband (Li et al., 2016) starts from several randomly chosen hyperparameters and trains them on a small subset of data. Following a fixed schedule, the algorithm stops underperforming experiments and then retrains the remaining ones on larger training sets. Hyperband outperforms standard BO in some settings, as it is easily parallelized and not subject to model misspecification. However, Hyperband's exploration is necessarily limited to the initial hyperparameter sampling phase: the best settings chosen by Hyperband inevitably correspond to one of the initial configurations, which were selected uniformly and in an unguided manner. To address this issue, several papers, including (Falkner et al., 2018; Wang et al., 2018; Bertrand et al., 2017), have proposed the use of Bayesian optimization to warm-start Hyperband and perform a guided search during the initial hyperparameter sampling phase.

Finally, IBO belongs to the family of multi-fidelity Bayesian optimization methods (Kandasamy et al., 2016; Forrester et al., 2007; Huang et al., 2006; Klein et al., 2016), which take advantage of cheap approximations to the target black-box function. Of these, Fabolas (Klein et al., 2016) focuses specifically on hyperparameter tuning, and is included as a baseline in all our experiments. Fabolas obtains cheap evaluations of the network validation loss by training the network on a randomly sampled subset of the training dataset. Hence, both IBO and Fabolas rely directly on the training examples to vary the cost of querying the black-box function: Fabolas uses fewer examples for cheaper evaluations, whereas IBO uses the per-example contribution to training to decide when to switch to costlier evaluations.

Existing literature on hyperparameter tuning weighs all training examples equally and does not take advantage of their decidedly unequal influence. To the best of our knowledge, IBO is the first method to exploit the informativeness of training data to accelerate hyperparameter tuning, merging Bayesian optimization with importance sampling.

Terminology.

We refer to one stochastic gradient descent (SGD) update to a neural network as an inner optimization round. Conversely, an outer optimization round designates one iteration of the BO process: fitting a GP, optimizing an acquisition function, and evaluating the black-box function.

3 Importance sampling for BO

Bayesian optimization is a strategy for the global optimization of a potentially noisy, and generally non-convex, black-box function f over a domain X. The function f is presumed to be expensive to evaluate in terms of time, resources, or both.

In the context of hyperparameter tuning, X is the space of hyperparameters, and f(x) is the validation error of a neural network trained with hyperparameters x.

Given a set of hyperparameter configurations and associated function evaluations (which may be subject to observation noise), Bayesian optimization starts by building a surrogate model for f over X. Gaussian processes (GPs), which provide a flexible non-parametric distribution over smooth functions, are a popular choice for this probabilistic model, as they provide tractable closed-form inference and facilitate the specification of a prior over the functional form of f (Rasmussen, 2003).

3.1 Surrogate model quality vs. computational budget

Given a zero-mean GP prior with covariance function k, the GP's posterior belief about the unobserved output f(x) at a new point x after seeing data D_n = {(x_i, y_i)}_{i=1}^n is a Gaussian distribution with mean \mu_n(x) and variance \sigma_n^2(x) such that

    \mu_n(x) = k_n(x)^\top (K_n + \sigma^2 I)^{-1} y_n, \qquad
    \sigma_n^2(x) = k(x, x) - k_n(x)^\top (K_n + \sigma^2 I)^{-1} k_n(x),        (1)

where [k_n(x)]_i = k(x, x_i), [K_n]_{ij} = k(x_i, x_j), and \sigma^2 is the variance of the observation noise, that is, y_i = f(x_i) + \epsilon_i with \epsilon_i \sim N(0, \sigma^2).
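As an illustration (not tied to the paper's implementation), the posterior of Eq. (1) can be computed in a few lines; the RBF kernel, lengthscale, and noise level below are arbitrary choices made for this sketch.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') between two sets of points."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP at query points X_star (Eq. 1)."""
    K = rbf_kernel(X, X)                                  # K_n
    k_star = rbf_kernel(X, X_star)                        # columns are k_n(x) for each query
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_n + sigma^2 I)^{-1} y_n
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = rbf_kernel(X_star, X_star).diagonal() - np.sum(v**2, axis=0)
    return mean, var

# toy usage: 5 observed hyperparameter settings in 2-D, 3 candidate settings
rng = np.random.default_rng(0)
X_obs, y_obs = rng.uniform(size=(5, 2)), rng.uniform(size=5)
mu, sigma2 = gp_posterior(X_obs, y_obs, rng.uniform(size=(3, 2)))
```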

Given this posterior belief over the value of unobserved points, Bayesian optimization selects the next point (hyperparameter set) to query by solving

    x_{n+1} = \arg\max_{x \in X} \alpha(x; D_n),        (2)

where \alpha is the acquisition function, which quantifies the expected added value of querying f at point x, based on the posterior belief on f(x) given by Eq. (1).

Typical choices for the acquisition function include entropy search (ES) (Hennig and Schuler, 2012) and its approximation predictive entropy search (Hernández-Lobato et al., 2014), knowledge gradient (Wu et al., 2017), expected improvement (Močkus, 1975; Jones et al., 1998) and upper/lower confidence bound (Cox and John, 1992, 1997).
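Entropy search, discussed next, requires approximating the distribution over the minimizer and is beyond a few lines of code. A simpler member of the list above, expected improvement, has a closed form given the GP posterior; the sketch below (for minimization, with a hypothetical exploration offset xi) is only illustrative and is not the acquisition function used by IBO.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y, xi=0.01):
    """Closed-form expected improvement for minimization, given the GP posterior
    mean `mu` and standard deviation `sigma` at each candidate point."""
    sigma = np.maximum(sigma, 1e-12)        # avoid division by zero
    improvement = best_y - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

# choose the candidate maximizing EI, e.g. with mu, sigma2 from the GP sketch above:
# next_idx = np.argmax(expected_improvement(mu, np.sqrt(sigma2), y_obs.min()))
```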

Entropy search quantifies how much knowing the observation y = f(x) reduces the entropy of the distribution p(x^* | D_n) over the location x^* of the best hyperparameters:

    \alpha_{ES}(x) = H\big[p(x^* \mid D_n)\big] - \mathbb{E}_y\Big[ H\big[p(x^* \mid D_n \cup \{(x, y)\})\big] \Big],        (3)

where H is the entropy function and the expectation is taken with respect to the posterior distribution over the observation y at hyperparameter x.

The more accurate the observed values y, the more accurate the GP surrogate model (1). A more accurate surrogate model, in turn, defines a better acquisition function (2) and, finally, a more valuable Bayesian optimization outer loop.

Previous work has tackled this trade-off during the BO process by early-stopping training runs that are predicted to yield poor final values (Swersky et al., 2014; Dai et al., 2019). IBO takes the opposite route, detecting when to spend additional effort to acquire a more accurate value of f(x).

Crucially, hyperparameter tuning for neural networks is not an entirely black-box optimization setting, as we know the loss minimization framework in which neural networks are trained. We take advantage of this by allocating computational budget at each SGD iteration; based on the considered training points, IBO switches from standard SGD updates to the more computationally intensive importance sampling updates. This is the focus of the following section.

3.2 Importance sampling for loss minimization

The impact of individual training points on one (batched) SGD iteration has received significant attention in machine learning (Needell et al., 2014; Schmidt et al., 2015; Zhang et al., 2017; Fu and Zhang, 2017). For the purposes of IBO, we focus on importance sampling (IS) (Needell et al., 2014; Zhao and Zhang, 2015). IS minimizes the variance of SGD updates; however, it is parameterized by the per-example gradient norms at the current weights of the network, and as such incurs a significant computational overhead. (IS also benefits SGD with momentum (Johnson and Guestrin, 2018); although we focus our analysis on pure SGD, IBO also extends to certain SGD variants.)

Specifically, let L(w) = (1/N) \sum_{i=1}^N \ell_i(w) be the training loss, where N is the number of training examples and \ell_i is the loss at training point i. To minimize L, SGD with importance sampling iteratively computes an estimate of \nabla L(w) by sampling an index i with probability p_i \propto \|\nabla \ell_i(w)\|, then applying the update

    w \leftarrow w - \eta \, \frac{1}{N p_i} \nabla \ell_i(w),        (4)

where \eta is the learning rate. Update (4) provably minimizes the variance of the gradient estimate, which in turn improves the convergence speed of SGD. (Standard SGD is recovered by setting p_i = 1/N.)
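For intuition, the update of Eq. (4) can be sketched for a toy least-squares model; this is a minimal NumPy illustration (not the paper's code), and computing every per-example gradient norm at each step is precisely the overhead that makes naive IS expensive for neural networks.

```python
import numpy as np

def is_sgd_step(w, X, y, lr=0.1):
    """One importance-sampled SGD step (Eq. 4) for the loss l_i(w) = 0.5 * (x_i.w - y_i)^2."""
    residuals = X @ w - y
    per_example_grads = residuals[:, None] * X                      # gradient of l_i w.r.t. w
    grad_norms = np.linalg.norm(per_example_grads, axis=1) + 1e-12
    p = grad_norms / grad_norms.sum()                               # p_i proportional to ||grad l_i||
    i = np.random.choice(len(y), p=p)
    # importance-weighted (unbiased) update: scale the sampled gradient by 1 / (N * p_i)
    return w - lr * per_example_grads[i] / (len(y) * p[i])

# toy usage
np.random.seed(0)
X, y, w = np.random.normal(size=(100, 5)), np.random.normal(size=100), np.zeros(5)
for _ in range(50):
    w = is_sgd_step(w, X, y)
```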

Various solutions to efficiently leverage importance sampling have been suggested (Zhao and Zhang, 2015). We leverage recent work by Katharopoulos and Fleuret (2018), which speeds up batched SGD with IS via a cheap subroutine that determines whether IS's variance reduction justifies the incurred computational cost at each SGD step.

To achieve efficient IS for batches of size b, Katharopoulos and Fleuret (2018) introduce a pre-sample batch size hyperparameter B > b. At each SGD step, B points are first sampled uniformly at random, from which a batch of size b is then subsampled. These b points are drawn either uniformly or with importance sampling, depending on an upper bound on the variance reduction permitted by IS.
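The control flow of this pre-sample-then-subsample step might look as follows. This is a simplified sketch: the `scores` argument stands in for the cheap upper bound on per-example gradient norms used by Katharopoulos and Fleuret (2018), and the threshold `tau` and the variance-gain heuristic are illustrative assumptions rather than their exact criterion.

```python
import numpy as np

def select_batch(scores, b, tau=1.5):
    """Given importance scores for the B pre-sampled points, pick a batch of size b
    either uniformly or by importance sampling (sketch of the DoSGD-style switch)."""
    B = len(scores)
    p = scores / scores.sum()
    # heuristic spread measure: E[s^2] / E[s]^2 equals 1 when all scores are identical
    # (no benefit from IS) and grows as the scores become more unequal
    gain = B * np.sum(p ** 2)
    if gain < tau:
        idx = np.random.choice(B, size=b, replace=False)        # plain uniform SGD batch
        weights = np.ones(b)                                     # no reweighting needed
    else:
        idx = np.random.choice(B, size=b, replace=False, p=p)    # importance-sampled batch
        weights = 1.0 / (B * p[idx])                             # weights for the update (Eq. 4)
    return idx, weights

# toy usage: B = 512 pre-sampled points, batch size b = 64
np.random.seed(0)
idx, w = select_batch(np.abs(np.random.normal(size=512)), b=64)
```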

3.3 Multi-task BO for importance sampling

In (Katharopoulos and Fleuret, 2018), the authors note that the added value of importance sampling is extremely sensitive to the number of pre-sampled data points; we verify this empirically in §4, showing that naively replacing standard SGD with the IS algorithm of (Katharopoulos and Fleuret, 2018) does not improve upon standard BO hyperparameter tuning. To maximize the utility of importance sampling, we instead opt for a multi-task BO framework, within which the search through hyperparameter space is carried out in parallel with a second task: optimization over the pre-sample size B.

Multi-task Bayesian optimization (MTBO) (Swersky et al., 2013) extends BO to evaluating a point on multiple correlated tasks. To do so, MTBO optimizes an objective function over a target task which, although expensive to evaluate, provides the maximum utility for the downstream task. MTBO exploits cheap evaluations on surrogate tasks to extrapolate performance on the target task; here, the target task evaluates the objective when each batch is sampled from the full training data, whereas a surrogate task evaluates it when subsampling from a super-batch of B datapoints at each SGD iteration. MTBO uses the entropy search acquisition function (Eq. 3), and models an objective function over points and tasks via a multi-task GP (Journel and Huijbregts, 1978; Bonilla et al., 2008). The covariance between two pairs of points x, x' and corresponding tasks t, t' is defined through a Kronecker product kernel:

    k\big((x, t), (x', t')\big) = k_X(x, x') \cdot k_T(t, t'),        (5)

where k_X models the relation between the hyperparameters and k_T describes the correlation between tasks.

In our case, the pre-sample size B is the task variable, and the target task sets B to the size of the entire training set. Let f(x, B) denote the validation error at hyperparameter x after training for T iterations using IS with pre-sample size B. We define the multi-task kernel for the GP that models f as

    k_f\big((x, B), (x', B')\big) = k_X(x, x') \cdot k_B(B, B'),

with the sub-task kernel k_B defined as a linear kernel over a feature map \phi of the pre-sample size,

    k_B(B, B') = \phi(B)^\top \Sigma_\phi \, \phi(B').        (6)
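The product structure of Eqs. (5)-(6) can be sketched as follows. The RBF kernel used for the task dimension here is an illustrative stand-in: the paper's sub-task kernels use the feature maps recommended by Klein et al. (2016), which are not reproduced in this sketch.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """1-D squared-exponential kernel between two vectors of scalar inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def multitask_kernel(x, B, xp, Bp, k_x, k_task):
    """Product kernel k((x, B), (x', B')) = k_X(x, x') * k_B(B, B') over
    (hyperparameter, pre-sample size) pairs, as in Eqs. (5)-(6)."""
    return k_x(x, xp) * k_task(B, Bp)

# toy usage: three 1-D hyperparameter values, tasks indexed by log pre-sample size
x = np.array([0.1, 0.5, 0.9])
logB = np.log(np.array([64.0, 256.0, 1024.0]))
K = multitask_kernel(x, logB, x, logB, k_x=rbf,
                     k_task=lambda s, t: rbf(s, t, ell=2.0))   # 3x3 covariance matrix
```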

Additionally, following Snoek et al. (2012), we penalize the evaluation of any point by the computational cost c(x, B) of training a model for T SGD iterations at hyperparameter x with pre-sample size B. This penalty guides the hyperparameter search towards promising yet relatively inexpensive solutions. We model the training cost using a multi-task GP fitted to the log cost of the observations collected during BO. We choose the covariance function

    k_c\big((x, B), (x', B')\big) = k_X(x, x') \cdot \tilde{k}_B(B, B'),

where this time we modify the kernel on B to reflect that a larger B increases training time:

    \tilde{k}_B(B, B') = \psi(B)^\top \Sigma_\psi \, \psi(B').        (7)

Our choices for the feature maps \phi and \psi follow (Klein et al., 2016), who recommend the associated feature maps.

Our resulting acquisition function is thus:

    \alpha(x, B) = \frac{\alpha_{ES}(x, B)}{\mu_c(x, B)},        (8)

where \mu_c(x, B) is the posterior mean of the GP modeling the training cost; as previously, p(x^* | D_n) is the probability that x^* is the optimal solution at the target task given data D_n.
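One plausible reading of Eq. (8), sketched with made-up numbers: each candidate (x, B) is ranked by its expected entropy reduction divided by its predicted training cost (the cost GP is fitted to log cost, hence the exponentiation). The arrays and the exact normalization here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def cost_adjusted_utility(info_gain, log_cost_mean):
    """Entropy reduction per unit of predicted training cost (structure of Eq. 8)."""
    return info_gain / np.exp(log_cost_mean)

# toy usage: three candidate (x, B) pairs
info_gain     = np.array([0.12, 0.30, 0.25])   # expected entropy reduction (Eq. 3) per candidate
log_cost_mean = np.array([1.0, 3.0, 1.5])      # posterior mean of the log training-cost GP
best = int(np.argmax(cost_adjusted_utility(info_gain, log_cost_mean)))   # -> index 2
```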

Our algorithm is presented in Algorithm 1. The initialization phase follows the MTBO convention: we collect initial data at randomly chosen inputs x, and evaluate each hyperparameter configuration with a randomly selected value for B. DoSGD is the subroutine proposed by (Katharopoulos and Fleuret, 2018); it determines whether the variance reduction enabled by importance sampling is worth the additional cost at the current SGD iteration.

Algorithm 1: Importance-based BO (IBO)
  Obtain initial data D_0
  for n = 1, 2, ... do
     Fit multi-task GPs to the objective f and the training cost c given D_{n-1}
     (x_n, B_n) <- argmax of the cost-adjusted acquisition function (Eq. 8)
     M <- model initialized with hyperparameters x_n
     for t = 1, ..., T do
        S <- B_n uniformly sampled training points
        if DoSGD(S, M) then
           update M with a standard SGD step on a batch drawn uniformly from S
        else
           update M with an importance-weighted SGD step on a batch sampled from S (Eq. 4)
        end if
     end for
     y_n <- validation error of M
     c_n <- time used to train M
     D_n <- D_{n-1} ∪ {((x_n, B_n), y_n, c_n)}
  end for
  return the hyperparameter with the best predicted error at the target task (B equal to the full training set size)
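Putting the pieces together, the structure of Algorithm 1 might be sketched as below. Every problem-specific component (GP fitting, maximization of the cost-adjusted acquisition of Eq. 8, network training with the DoSGD switch) is replaced by a hypothetical stub so that the skeleton runs standalone; none of these stubs reflect the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FULL = 50_000                              # size of the full training set (target task B = N)

def fit_multitask_gps(history):              # stub: would fit GPs for error and log cost
    return history

def propose_candidate(gps):                  # stub: would maximize the acquisition of Eq. (8)
    x = rng.uniform(size=3)                  # three toy hyperparameters in [0, 1]
    B = int(rng.integers(64, N_FULL))        # pre-sample size (the task variable)
    return x, B

def train_and_evaluate(x, B, steps=100):     # stub: inner SGD loop with the IS switch
    for _ in range(steps):
        use_is = rng.random() < 0.5          # stands in for the DoSGD variance-reduction test
        # ...uniform SGD step if not use_is, importance-weighted step otherwise (Eq. 4)
    return float(rng.uniform()), float(rng.uniform(1, 10))   # (validation error, wall-clock cost)

history = []                                 # D_n: tuples ((x, B), error, cost)
for n in range(10):                          # outer BO iterations
    gps = fit_multitask_gps(history)
    x, B = propose_candidate(gps)
    err, cost = train_and_evaluate(x, B)
    history.append(((x, B), err, cost))

# incumbent: the real algorithm uses the GP prediction at the target task B = N_FULL;
# here we simply take the best observed error
best_x = min(history, key=lambda h: h[1])[0][0]
```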
Remark 1.

Whereas Fabolas speeds up the evaluation of the objective by limiting the number of training points used during training, IBO uses the entire training data, reweighting points based on their relevance to the training task. Thus, each IBO iteration is slower than a Fabolas iteration. However, because IBO carries out a more principled search through hyperparameter space and queries higher-fidelity evaluations, IBO requires fewer BO iterations, and hence potentially less total time, to find a good hyperparameter configuration.

4 Experiments

Problem | Time budget | IBO (ours) | Fabolas | Fabolas-IS | ES | ES-IS
CNN (CIFAR-10) | % | 0.28 (0.25, 0.32) | 0.29 (0.26, 0.36) | 0.40 (0.38, 0.90) | 0.38 (0.29, 0.83) | 0.30 (0.26, 0.35)
CNN (CIFAR-10) | % | 0.27 (0.26, 0.29) | 0.26 (0.25, 0.27) | 0.38 (0.27, 0.90) | 0.28 (0.27, 0.37) | 0.25 (0.25, 0.26)
CNN (CIFAR-10) | % | 0.25 (0.24, 0.29) | 0.25 (0.24, 0.27) | 0.38 (0.26, 0.90) | 0.28 (0.26, 0.28) | 0.26 (0.25, 0.26)
CNN (CIFAR-10) | % | 0.23 (0.23, 0.23) | 0.25 (0.24, 0.27) | 0.33 (0.26, 0.38) | 0.28 (0.26, 0.28) | 0.26 (0.25, 0.26)
ResNet (CIFAR-10) | % | 0.11 (0.11, 0.12) | 0.11 (0.11, 0.12) | 0.11 (0.11, 0.20) | 0.12 (0.11, 0.21) | 0.11 (0.11, 0.21)
ResNet (CIFAR-10) | % | 0.10 (0.10, 0.10) | 0.11 (0.10, 0.11) | 0.11 (0.11, 0.11) | 0.11 (0.10, 0.20) | 0.12 (0.11, 0.21)
ResNet (CIFAR-10) | % | 0.09 (0.09, 0.10) | 0.11 (0.10, 0.11) | 0.11 (0.11, 0.11) | 0.11 (0.10, 0.17) | 0.12 (0.11, 0.19)
ResNet (CIFAR-10) | % | 0.09 (0.09, 0.10) | 0.11 (0.10, 0.11) | 0.11 (0.11, 0.11) | 0.11 (0.10, 0.17) | 0.12 (0.11, 0.18)
ResNet (CIFAR-100) | % | 0.38 (0.35, 0.39) | 0.37 (0.37, 0.39) | 0.39 (0.38, 0.44) | 0.38 (0.38, 0.44) | 0.38 (0.37, 0.38)
ResNet (CIFAR-100) | % | 0.33 (0.33, 0.37) | 0.37 (0.36, 0.39) | 0.39 (0.38, 0.44) | 0.38 (0.38, 0.44) | 0.37 (0.36, 0.38)
ResNet (CIFAR-100) | % | 0.33 (0.33, 0.34) | 0.37 (0.36, 0.39) | 0.39 (0.38, 0.44) | 0.38 (0.38, 0.42) | 0.38 (0.36, 0.39)
ResNet (CIFAR-100) | % | 0.32 (0.32, 0.34) | 0.37 (0.36, 0.39) | 0.39 (0.39, 0.45) | 0.38 (0.37, 0.41) | 0.36 (0.34, 0.38)
Table 1: Test error of models trained using hyperparameters found by the different methods. Each method is allocated the same amount of time; results reflect each method's choice of best hyperparameter after the different percentages of time have elapsed. Test error is obtained by a model trained on the full training set using vanilla minibatch SGD. Across all three experiments, IBO reaches the lowest test error, confirming that the computational cost incurred by importance sampling is amortized by the more efficient search over hyperparameter space that IS enables. Notably, IBO also achieves lower test errors earlier than other methods on the more difficult benchmarks.

We evaluate our proposed method, IBO, on four benchmark hyperparameter tuning tasks: a feed-forward network on MNIST, a convolutional neural network (CNN) on CIFAR-10, a residual network on CIFAR-10, and a residual network on CIFAR-100. (The code for IBO will be released upon acceptance.) We include the following baselines:


  • ES: Bayesian optimization with the entropy search acquisition function (Hennig and Schuler, 2012),

  • ES-IS: BO with entropy search; inner optimization is performed using IS. For each black-box query, we draw the pre-sample size B uniformly at random from the range recommended in (Katharopoulos and Fleuret, 2018), expressed as a multiple of the batch size; B is held constant during the rounds of SGD.

  • Fabolas (Klein et al., 2016): BO in which each inner-loop optimization uses a fraction of the training set. The size of this fraction is learned via multi-task BO; the sub-training set does not evolve during the inner SGD iterations.

  • Fabolas-IS: Fabolas, training with SGD-IS. For this method, a fraction of the training data is uniformly chosen as in Fabolas, but training is performed with SGD-IS. The pre-sample size B is drawn uniformly at random as a multiple of the batch size, as in ES-IS.

ES-IS acts as an ablation test for IBO's multi-task framework, as it does not reason about IBO's cost-fidelity tradeoff. Thus, we keep the training procedure for ES-IS and Fabolas-IS similar to IBO, switching to IS only if variance reduction is possible and using IS is advantageous (Alg. 1). We run all methods on a PowerEdge R730 server with NVIDIA Tesla K80 GPUs (experiment 4.2) or on a DGX server with NVIDIA Tesla V100 GPUs (remaining experiments).

4.1 Implementation Details

For IBO, we use the task kernels of Eq. (6) and Eq. (7). Following Snoek et al. (2012), we marginalize out the GPs' hyperparameters using MCMC for all methods.

To set the time budget, we fix a total number of BO iterations for each method; the time at which the fastest method completes its final iteration acts as the maximum amount of time available to any other method. All initial design evaluations also count towards the runtime; this slightly advantages non-IS methods, which have cheaper initializations.

We report the performance of each method as a function of wall-clock time, since the methods differ in per-iteration complexity (App. 6 reports results vs. iteration number).

We measure the performance of each method by taking the predicted best hyperparameter values after each BO iteration, then training a model with those hyperparameters, using the entire training set and vanilla SGD. Recall that for Fabolas, Fabolas-IS, and IBO, the incumbent is the set of hyperparameters with the best predicted objective on the target task (e.g., using the full training data for Fabolas).

We run each method five times unless otherwise stated, and report the median performance with the 25th and 75th percentiles (mean and standard deviation results are included in Appendix C.1 for completeness). ES and the Fabolas variants are run using RoBO (https://github.com/automl/RoBO). For importance sampling, we used the code provided by Katharopoulos and Fleuret (2018) (https://github.com/idiap/importance-sampling).

All methods are initialized with 5 hyperparameter configurations drawn from a Latin hypercube design. For IBO, we evaluate each configuration at the maximum value of the task variable, i.e., the target task. For Fabolas, Klein et al. (2016) suggest initializing by evaluating each hyperparameter on an increasing series of task values; this aims to capture the task variable's effect on the objective. However, we empirically observed that following an initial design strategy similar to IBO's, i.e., evaluating each hyperparameter at the maximum task value, worked better in practice for both Fabolas and Fabolas-IS. This is the scheme we use in our experiments; App. C includes results for both initialization schemes.

For IBO, Fabolas-IS and ES-IS, we reparameterize the pre-sample size B as a multiple of the batch size, and follow the setting recommended by Katharopoulos and Fleuret (2018). For Fabolas-IS, if B is larger than the training subset size, we use the entire subset to compute the importance distribution.

4.2 Feed-forward Neural Network on MNIST

Our first experiment is based on a common Bayesian optimization benchmark problem (Falkner et al., 2018; Domhan et al., 2015; Hernández-Lobato et al., 2016). We tune a fully connected neural network using RMSProp on MNIST (LeCun, 1998); the number of training epochs and the number of BO rounds are fixed across methods. We tune six hyperparameters: number of hidden layers, number of units per layer, batch size, learning rate, decay rate, and dropout rate (see App. A).

Figure 2: Hyperparameter tuning of a CNN on CIFAR-10. After around 9 hours, our method (IBO) outperforms all other methods. The ablation test Fabolas-IS shows the weakest performance, with large uncertainty. The ablation test ES-IS performs slightly better than IBO in the first half of the time horizon; however, IBO overall surpasses ES-IS and achieves the best final performance among all methods with negligible variance, confirming the value of our multi-task formulation.
Figure 3: Hyperparameter tuning of a ResNet on CIFAR-10. IBO outperforms all other baselines at one third of the time budget and keeps improving until the end. Conversely, Fabolas-IS is unable to progress after one third of the time horizon, while Fabolas achieves only a minor improvement (compared to IBO) at around 9 hours. ES-IS shows the weakest performance of all and suffers from large uncertainty. This is further evidence that simply augmenting BO with importance sampling is not robust.

Given the well-known simplicity of the MNIST dataset, we do not expect significant gains from using importance sampling during training. Indeed, we see (Table 2) that all methods perform similarly after exhausting their BO iteration budget, although Fabolas does reach a low test error slightly earlier, since training on few data points is sufficient; see Appendix A for more details.

4.3 CNN on CIFAR-10

We next tune a convolutional neural network (CNN) using RMSProp on the CIFAR-10 dataset (Krizhevsky et al., 2009). We fix an architecture of three convolutional layers with max-pooling, followed by a fully connected layer, in line with previous benchmarks on this problem (Falkner et al., 2018; Klein et al., 2016; Dai et al., 2019). Following Dai et al. (2019), we tune six hyperparameters: number of convolutional filters, number of units in the fully connected layer, batch size, initial learning rate, decay rate, and regularization weight. All methods are run for the same number of BO iterations and trained for the same number of SGD epochs.

IBO, Fabolas and ES-IS exhibit the best performance (Fig. 2), but switch ranking over the course of time. However, after spending roughly half of the budget, IBO outperforms Fabolas and all other baselines, achieving the best final error with the lowest uncertainty.

ES-IS shows that adding IS naively can improve upon base entropy search; however, IBO outperforms both ES and ES-IS, confirming the importance of a multi-task setting that optimizes IS. Furthermore, simply adding importance sampling during SGD is not guaranteed to improve upon any method: Fabolas-IS performs poorly compared to Fabolas.

4.4 Residual Network on CIFAR-10

We next tune a residual network trained on CIFAR-10. We follow the wide ResNet architecture of (Zagoruyko and Komodakis, 2016), and tune four hyperparameters: initial learning rate, decay rate, momentum, and regularization weight. Following Klein et al. (2016), all but the momentum are optimized over a log-scale search space.

Figure 4: Hyperparameter tuning of a ResNet on CIFAR-100. IBO outperforms all other methods both as a function of iterations and of time. The performance over iterations (left plot) in particular shows that IBO achieves a low test error in a limited number of function evaluations, i.e., fewer than 20; this roughly equals spending 20% of the time budget (right plot). Moreover, IBO is able to improve further after 60 hours. Conversely, both Fabolas and Fabolas-IS are unable to progress after an early stage (roughly 10 out of 150 iterations and 15 out of 80 hours). Interestingly, providing a larger iteration budget has not helped these methods. ES-IS exhibits the second best performance after IBO, with a 4-5% margin; moreover, compared to IBO, it shows a noisier performance with larger uncertainty.

We multiply the learning rate by the decay rate after a fixed number of epochs. Experimentally, we saw that this epoch budget is insufficient for the inner (SGD) optimization to converge on the ResNet architecture; this experiment therefore evaluates BO in the setting where the objective is too computationally intensive to compute exactly. We ran Fabolas and Fabolas-IS with a larger BO iteration budget than the remaining methods, to compensate for the different cost of training on a subset of the data versus the entire dataset. (For experiment 4.2, we observed that keeping the BO iteration budget consistent is sufficient since the training costs are not very different. For experiment 4.3, we set this budget to 100, and stopped reporting the results once the first method exhausted its budget. Since ResNet experiments are generally more costly, choosing a large budget for all methods was not feasible.) Results are reported in Fig. 3, and are obtained from runs with random initializations.

Consistent with previous results, Fabolas achieves the lowest error in the very initial stage, thanks to its cheap approximations. However, IBO quickly overtakes all other baselines, and attains a value that the other methods cannot reach even after consuming their entire budget. Fabolas-IS also performs well, but suffers from large variance.

The ablation tests (ES-IS and Fabolas-IS) consistently have high variance, likely because these methods do not learn the optimal batch size for importance sampling and instead opt for a random selection within the recommended range. In contrast, IBO explicitly learns the pre-sample size parameter that controls the cost-benefit trade-off in importance sampling, and hence enjoys better final results and lower variance.

4.5 Residual Network on CIFAR-100

Finally, we tune the hyperparameters of a residual network trained on CIFAR-100. The architecture of the network, the hyperparameters we optimize and their respective ranges are the same as in §4.4. We multiply the learning rate by the decay rate after a fixed number of epochs. For Fabolas and Fabolas-IS, a larger budget of BO iterations is provided than for the rest of the methods, for the same reason as in §4.4.

IBO clearly outperforms the rest of the methods after spending roughly 20% of the time budget (Fig. 4); Fabolas and ES-IS are the second best methods. Similar to experiment 4.3, Fabolas-IS is outperformed by the other baselines, and once again incurs a large variance. Interestingly, for Fabolas and Fabolas-IS, the additional BO budget does not improve their performance. This is further evidence that, for complex datasets, neither a vanilla multi-task framework nor simple importance sampling is sufficient to obtain the advantages of IBO.

By seeking higher-fidelity surrogate models, IBO achieves better results in fewer optimization runs and less runtime than other baselines, despite the incurred cost of using each training example individually during certain SGD rounds.

5 Conclusion

Bayesian optimization offers an efficient and principled framework for hyperparameter tuning. However, finding optimal hyperparameters requires an expensive inner loop which repeatedly trains a model with new hyperparameters. Prior work has scaled BO by using cheap evaluations of the black-box function. IBO takes the opposite approach: by spending more time obtaining higher-fidelity evaluations, IBO requires far fewer outer BO iterations.

Leveraging recent developments in importance sampling, IBO takes into account the contribution of each training point to decide whether to run vanilla SGD or a more complex, time-consuming but higher quality variant. Although this results in costlier neural network training loops, the additional precision obtained for the black-box estimates allows a more principled search through hyperparameter space, significantly decreasing the amount of wall-clock time necessary to obtain a high-quality hyperparameter.

Crucially, the interaction between importance sampling and Bayesian optimization must be approached with care; a naive merging of both methods does not decrease the overall runtime of Bayesian optimization, and does not yield better final hyperparameters. However, by opting for a multi-task parameterization of the problem, IBO learns to dynamically adjust the trade-off between neural network training time and black-box estimate value, producing faster overall runtimes as well as better hyperparameters.

We show on four benchmark tasks of increasing complexity that IBO achieves the lowest error compared to all other baseline methods, and scales gracefully with dataset and neural architecture complexity. When tuning a ResNet on CIFAR-100, IBO outperforms all other baselines and ablation tests by a significant margin, both as a function of wall-clock time and number of outer optimization rounds.

References

  • J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pp. 2546–2554. Cited by: §1.
  • H. Bertrand, R. Ardon, M. Perrot, and I. Bloch (2017) Hyperparameter optimization of deep neural networks: combining hyperband with bayesian model selection. In Conférence sur l’Apprentissage Automatique, Cited by: §2.
  • E. V. Bonilla, K. M. Chai, and C. Williams (2008) Multi-task gaussian process prediction. In Advances in neural information processing systems, pp. 153–160. Cited by: §3.3.
  • D. D. Cox and S. John (1992) A statistical method for global optimization. In [Proceedings] 1992 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1241–1246. Cited by: §3.1.
  • D. D. Cox and S. John (1997) SDO: a statistical method for global optimization. In in Multidisciplinary Design Optimization: State-of-the-Art, pp. 315–329. Cited by: §3.1.
  • Z. Dai, H. Yu, B. K. H. Low, and P. Jaillet (2019) Bayesian optimization meets bayesian optimal stopping. In International Conference on Machine Learning, pp. 1496–1506. Cited by: §2, §3.1, §4.3.
  • T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence. Cited by: §1, §2, §4.2.
  • S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: Appendix A, §2, §4.2, §4.3.
  • A. I.J. Forrester, A. Sóbester, and A. J. Keane (2007) Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A 463 (2088), pp. 3251–3269. Cited by: §2.
  • T. Fu and Z. Zhang (2017) CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, pp. 841–850. Cited by: §3.2.
  • D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. E. Karro, and D. Sculley (2017) Google Vizier: a service for black-box optimization. Cited by: §2.
  • P. Hennig and C. J. Schuler (2012) Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13 (Jun), pp. 1809–1837. Cited by: §1, §3.1, 1st item.
  • D. Hernández-Lobato, J. Hernandez-Lobato, A. Shah, and R. Adams (2016) Predictive entropy search for multi-objective bayesian optimization. In International Conference on Machine Learning, pp. 1492–1501. Cited by: §4.2.
  • J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani (2014) Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926. Cited by: §3.1.
  • D. Huang, T. T. Allen, W. I. Notz, and R. A. Miller (2006) Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization 32, pp. 369–382. Cited by: §2.
  • T. B. Johnson and C. Guestrin (2018) Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, pp. 7265–7275. Cited by: §1, footnote 1.
  • D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global optimization 13 (4), pp. 455–492. Cited by: §3.1.
  • A. G. Journel and C. J. Huijbregts (1978) Mining geostatistics. Vol. 600, Academic press London. Cited by: §3.3.
  • K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Póczos (2016) Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pp. 992–1000. Cited by: §2.
  • A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. arXiv preprint arXiv:1803.00942. Cited by: §3.2, §3.2, §3.3, §3.3, 2nd item, §4.1, §4.1.
  • A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2016) Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079. Cited by: Appendix C, §1, §2, §2, §3.3, 3rd item, §4.1, §4.3, §4.4.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.3.
  • Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.2.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2016) Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560. Cited by: §2.
  • J. Močkus (1975) On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404. Cited by: §3.1.
  • D. Needell, R. Ward, and N. Srebro (2014) Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems 27, pp. 1017–1025. Cited by: §3.2.
  • C. E. Rasmussen (2003) Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §3.
  • M. Schmidt, R. Babanezhad, M. Ahmed, A. Defazio, A. Clifton, and A. Sarkar (2015) Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, pp. 819–828. Cited by: §3.2.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1, §3.3, §4.1.
  • E. Strubell, A. Ganesh, and A. Mccallum (2019) Energy and policy considerations for deep learning in nlp. In Annual Meeting of the Association for Computational Linguistics, Cited by: §1.
  • K. Swersky, J. Snoek, and R. P. Adams (2013) Multi-task bayesian optimization. In Advances in neural information processing systems, pp. 2004–2012. Cited by: §1, §1, §2, §3.3.
  • K. Swersky, J. Snoek, and R. P. Adams (2014) Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896. Cited by: §1, §2, §3.1.
  • J. Wang, J. Xu, and X. Wang (2018) Combination of hyperband and bayesian optimization for hyperparameter optimization in deep learning. arXiv preprint arXiv:1801.01596. Cited by: §2.
  • J. Wu, M. Poloczek, A. G. Wilson, and P. Frazier (2017) Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pp. 5267–5278. Cited by: §3.1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.4.
  • C. Zhang, H. Kjellstrom, and S. Mandt (2017) Determinantal point processes for mini-batch diversification. In UAI 2017, Cited by: §3.2.
  • P. Zhao and T. Zhang (2015) Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, Cited by: §3.2, §3.2.

Appendix A Feed-forward Neural Network on MNIST

Per BO iteration (Fig. 5), IBO is among the best performing methods, gaining higher utility compared to the others. However, since performing importance sampling is expensive, IBO's relative performance degrades when measured in wall-clock time. After spending roughly 30% of the time budget (around two hours), Fabolas outperforms the other methods. This is expected, since Fabolas utilizes cheap approximations by training on data subsets. Although such approximations are noisy, we speculate that this does not significantly harm performance, especially for simpler datasets and models such as a feed-forward network on MNIST.

We tune six hyperparameters: number of hidden layers, number of units per layer, batch size, initial learning rate, decay rate, and dropout rate. Following (Falkner et al., 2018), the batch size, number of units, and learning rate are optimized over a log-scale search space.

Figure 5: Average performance of all methods on MNIST as a function of both iteration budget (left column) and wall-clock time (right column).

All methods are run for the same number of BO iterations. The performance is averaged over five random runs and shown in the last row of Figure 6 (median with 25th and 75th percentiles over time and iteration budget) and Figure 7 (mean with standard deviation over time).


Median test error
Method | % | % | % | %
IBO | 0.20 | 0.09 | 0.07 | 0.06
Fabolas | 0.21 | 0.07 | 0.06 | 0.06
Fabolas-IS | 0.16 | 0.07 | 0.07 | 0.06
ES | 0.17 | 0.08 | 0.07 | 0.07
ES-IS | 0.21 | 0.10 | 0.07 | 0.06

25th-percentile test error
Method | % | % | % | %
IBO | 0.19 | 0.08 | 0.07 | 0.06
Fabolas | 0.15 | 0.07 | 0.06 | 0.05
Fabolas-IS | 0.16 | 0.07 | 0.06 | 0.06
ES | 0.12 | 0.07 | 0.07 | 0.06
ES-IS | 0.19 | 0.10 | 0.07 | 0.06

75th-percentile test error
Method | % | % | % | %
IBO | 0.34 | 0.11 | 0.08 | 0.06
Fabolas | 0.22 | 0.07 | 0.06 | 0.06
Fabolas-IS | 0.19 | 0.08 | 0.07 | 0.06
ES | 0.20 | 0.08 | 0.08 | 0.08
ES-IS | 0.21 | 0.10 | 0.07 | 0.06

Table 2: Test error of models trained using hyperparameters found by the different methods on MNIST. Each method is allocated the same amount of time; results reflect each method's choice of best hyperparameter after the different percentages of time have elapsed. Test error is obtained by a model trained on the full training set using vanilla minibatch SGD. All methods perform roughly similarly, achieving about 6% error at the maximum budget. However, Fabolas starts its progress earlier. Given the simplicity of MNIST, we speculate that the cheap noisy approximations provided by Fabolas via uniform sampling suffice to attain improvement, while importance sampling is unnecessarily costly.

Appendix B IBO scales with dataset and network complexity

IBO improves upon existing BO methods, more so when tuning large, complex datasets and architectures. To illustrate, Figure 6 shows the results of all experiments over iteration budget (left column) and wall-clock time budget (right column). The plots are sorted such that the complexity of the dataset and model architecture decreases along the rows; i.e., the most straightforward problem, the FCN on MNIST, lies in the bottom row and the most challenging experiment, the ResNet on CIFAR-100, is in the top row. In the iteration plots (left column), IBO is consistently among the best methods (a lower curve denotes better performance), achieving high utility per BO iteration. However, since importance sampling is inherently expensive, the advantage of IBO in wall-clock time manifests gradually as the tuning becomes more challenging. Specifically, moving from the bottom row to the top, as the complexity of the tuning problem increases, IBO starts to outperform the rest from an earlier stage and with an increasing margin over wall-clock time (right column).

Figure 6: Average performance of all methods for all experiments as a function of both iteration budget (left column) and wall-clock time (right column). Each row represents one experiment, with the difficulty of tuning increasing from the bottom row to the top; i.e., the most straightforward problem, the FCN on MNIST, lies in the bottom row and the most challenging benchmark, the ResNet on CIFAR-100, is in the top row. In the iteration plots, IBO is consistently among the best methods (a lower curve denotes better performance), achieving high utility per BO iteration. However, since importance sampling is inherently expensive, the advantage of IBO in wall-clock time manifests gradually as the tuning becomes more challenging: the more difficult the benchmark (from the bottom row to the top), the earlier IBO starts to outperform the rest, and with an increasing margin over wall-clock time.

Appendix C Initializing Fabolas

Conventionally, Bayesian optimization starts by evaluating the objective at an initial set of hyperparameters chosen at random. To leverage speedups in Fabolas, Klein et al. (2016) suggest evaluating the initial hyperparameters at different, usually small, subsets of the training data. In our experiments, we randomly selected hyperparameters and evaluated each on randomly selected training subsets covering a range of fractions of the entire training data. However, our experimental results show that Fabolas achieves better results faster if, during the initial design phase, the objective evaluations use the entire training data. Figure 8 illustrates this point for the CNN and ResNet on CIFAR-10. Fabolas with the original initialization scheme evaluates each initial hyperparameter at several budgets, whereas with the new scheme, Fabolas evaluates each initial hyperparameter at a single (full) budget. The plots show the mean results (with standard deviation) averaged over five and three runs for the CNN and ResNet, respectively. Overall, Fabolas with the new initialization achieves better average performance.

Figure 7: Mean performance (with standard deviation) of all methods for all the experiments. IBO consistently achieves among the lowest test errors at the maximum budget. For the CNN on CIFAR-10, IBO suffers one relatively weak run (out of 5 total runs), which affects the mean and standard deviation. For a different perspective, see Fig. 6, which reports the median and the 25th/75th percentiles.

C.1 Mean and Standard Deviation Results

For completeness, we include the plots reporting the mean and standard deviation across all experiments (Figure 7).

Figure 8: Comparison between two initialization schemes of Fabolas for the CNN and ResNet on CIFAR-10. The dashed lines (left column) show the number of initial design evaluations for each method, immediately followed by the start of BO. We observe that with the new initial design scheme, Fabolas can potentially start progressing at an earlier iteration and in less time, and achieves reduced variance.