1 Introduction
The incorporation of more parameters and more data, coupled with faster computing and longer training times, has driven state-of-the-art results across a variety of benchmark tasks in machine learning. However, careful model tuning remains critical in order to find good configurations of hyperparameters, architecture and optimization settings. This tuning requires significant experimentation, training many models, and is often guided by expert intuition, grid search, or random sampling. Such experimentation multiplies the cost of training, and incurs significant financial, computational, and even environmental costs (Strubell et al., 2019).
Bayesian optimization (BO) offers an efficient alternative when the tuning objective can be effectively modeled by a surrogate regression (Bergstra et al., 2011; Snoek et al., 2012), or when one can take advantage of related tasks (Swersky et al., 2013) or strong priors over problem structure (Swersky et al., 2014; Domhan et al., 2015). BO optimizes an expensive function by iteratively building a relatively cheap probabilistic surrogate and evaluating a carefully balanced combination of uncertain and promising regions (exploration vs. exploitation).
In the context of neural network hyperparameter optimization, BO typically involves an inner loop of training a model given a hyperparameter configuration, and then evaluating validation error as the objective to be optimized. This inner loop is expensive and its cost grows with the size of the dataset: querying modern models even once may require training for days or weeks.
One strategy to mitigate the high cost of hyperparameter tuning is to enable the BO algorithm to trade off between the value of the information gained from evaluating a hyperparameter setting and the cost of that evaluation. For example, Swersky et al. (2013) and Klein et al. (2016) allow BO to evaluate models trained on randomly chosen subsets of data to obtain more, but less informative, evaluations. We propose an alternative approach: our method, Importance-based Bayesian Optimization (IBO), dynamically learns when spending additional effort training a network to obtain a higher fidelity observation is worth the incurred cost. To achieve this, in addition to considering the hyperparameters, IBO takes into account the underlying training data and focuses the computation on more informative training points. Specifically, IBO models a distribution over the location of the optimal hyperparameter configuration, and allocates experimental budget according to the cost-adjusted expected reduction in entropy (Hennig and Schuler, 2012). Therefore, higher fidelity observations provide a greater reduction in entropy, albeit at a higher evaluation cost.
the target function, reflecting the impact of limiting the number of SGD iterations during training). However, obtaining higher-quality estimates can significantly slow down the overall runtime of Bayesian optimization; by learning to optimize the tradeoff between the value and cost of obtaining high-fidelity estimates, we show in Section 4 that IBO achieves the best of both worlds.
To decide how much effort to allocate to training a network and which training examples to prioritize, IBO leverages both the properties of stochastic gradient descent (SGD) and recent work on importance sampling (Johnson and Guestrin, 2018). At each SGD iteration, IBO estimates how much each training example will impact the model; based on this estimate, IBO performs either a normal round of SGD or a more costly importance-weighted gradient update.
Balancing the cost of the inner loop of neural network training and the outer loop of BO is a nontrivial task; if done naively, the overall hyperparameter tuning procedure will be substantially slower. To address this issue, we adopt a multitask Bayesian optimization formulation for IBO and develop an algorithm that dynamically adjusts the tradeoff between the cost of training a network at higher fidelity and getting more but noisier evaluations (Fig. 1). This approach allows us to obtain higher quality blackbox function evaluations only when worthwhile, while controlling the average cost of blackbox queries. As a consequence, we are able to tune complex network architectures over challenging datasets in less time and with better results than existing state-of-the-art BO methods. When tuning a ResNet on CIFAR100, IBO improves the validation error over the next best method; other baselines are not able to reach IBO’s performance, even with additional computational budget.
Contributions.
We introduce a multitask Bayesian optimization framework, IBO (Importance-based Bayesian Optimization), which takes into account the contribution of each training point during the evaluation of a candidate hyperparameter. To do so, IBO optimizes the importance sampling tradeoff between quality and runtime while simultaneously searching hyperparameter space. We show on extensive benchmark experiments that the computational burden incurred by importance sampling is more than compensated for by the principled search through hyperparameter space that it enables. We show across these experiments that IBO consistently improves over a variety of baseline Bayesian optimization methods. On more complex datasets, IBO converges significantly faster in wallclock time than existing methods and furthermore reaches lower validation errors, even as other methods are given larger time budgets.
2 Related work
Several different methods have been proposed to accelerate the hyperparameter tuning process. Swersky et al. (2013) proposed Multi-Task Bayesian Optimization (MTBO), which performs cheap surrogate function evaluations on a small subset of the training data; these evaluations are then used to extrapolate the performance on the entire training set. Motivated by this work, Klein et al. (2016) introduced Fabolas, which extends MTBO to also learn the sufficient size of training data. MTBO and Fabolas avoid costly function evaluations by training on small datasets whose examples are chosen uniformly at random at the beginning of each training round.
Another body of related work involves modeling the neural network’s loss as a function of both the hyperparameters and the inner training iterations. Then, the goal is to extrapolate and forecast the ultimate objective value and stop underperforming training runs early. Work such as (Swersky et al., 2014; Domhan et al., 2015; Dai et al., 2019; Golovin et al., 2017) falls under this category. These methods generally have to deal with the cubic cost of Gaussian processes, $\mathcal{O}(n^3 t^3)$ for $n$ observed hyperparameters and $t$ iterations. In practice, these methods typically apply some type of relaxation. For example, the freeze-thaw method (Swersky et al., 2014) assumes that training curves for different hyperparameter configurations are independent conditioned on their prior mean, which is drawn from another global GP.
Moreover, an alternative approach to Bayesian optimization solves the hyperparameter tuning problem through enhanced random search. Hyperband (Li et al., 2016) starts from several randomly chosen hyperparameters and trains them on a small subset of data. Following a fixed schedule, the algorithm stops underperforming experiments and then retrains the remaining ones on larger training sets. Hyperband outperforms standard BO in some settings, as it is easily parallelized and not subject to model misspecification. However, Hyperband’s exploration is necessarily limited to the initial hyperparameter sampling phase: the best settings chosen by Hyperband will inevitably correspond to one of the initial configurations, which were selected uniformly and in an unguided manner. To address this issue, several papers, including (Falkner et al., 2018; Wang et al., 2018; Bertrand et al., 2017), have proposed the use of Bayesian optimization to warm-start Hyperband and perform a guided search during the initial hyperparameter sampling phase.
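The limitation described above can be illustrated with a minimal sketch of successive halving, the elimination schedule at the core of Hyperband. This is a simplification under stated assumptions, not the full Hyperband algorithm: the function names, the toy objective `toy_error`, and the parameters `min_budget` and `eta` are hypothetical and chosen only for illustration.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configurations at each rung, growing the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving configuration at the current (cheap) budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Keep the lowest-error fraction and give the survivors a larger budget.
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]

# Toy objective (hypothetical): error shrinks with budget and is minimized at lr = 0.1.
def toy_error(lr, budget):
    return (lr - 0.1) ** 2 + 1.0 / budget

rng = random.Random(0)
initial = [rng.uniform(0.0, 1.0) for _ in range(16)]
best = successive_halving(initial, toy_error)
```

Whatever the schedule, `best` is always an element of `initial`: the procedure only filters the initial random draw and never proposes new configurations, which is the gap that BO-guided variants aim to close.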
Finally, IBO belongs to the family of multi-fidelity Bayesian optimization methods (Kandasamy et al., 2016; Forrester et al., 2007; Huang et al., 2006; Klein et al., 2016), which take advantage of cheap approximations to the target blackbox function. Of those methods, Fabolas (Klein et al., 2016) focuses specifically on hyperparameter tuning, and is included as a baseline in all our experiments. Fabolas obtains cheap evaluations of the network validation loss by training the network on a randomly sampled subset of the training dataset. Hence, both IBO and Fabolas depend directly on training examples to vary the cost of querying the blackbox function: Fabolas uses fewer examples for cheap evaluations, whereas IBO uses the per-example contribution to training to switch to costlier evaluations.
Existing literature on hyperparameter tuning weighs all training examples equally and does not take advantage of their decidedly unequal influence. To the best of our knowledge, IBO is the first method to exploit the informativeness of training data to accelerate hyperparameter tuning, merging Bayesian optimization with importance sampling.
Terminology.
We refer to one stochastic gradient descent (SGD) update to a neural network as an inner optimization round. Conversely, an outer optimization round designates one iteration of the BO process: fitting a GP, optimizing an acquisition function, and evaluating the blackbox function.
3 Importance sampling for BO
Bayesian optimization is a strategy for the global optimization of a potentially noisy, and generally nonconvex, blackbox function $f: \mathcal{X} \to \mathbb{R}$. The function is presumed to be expensive to evaluate in terms of time, resources, or both.
In the context of hyperparameter tuning, $\mathcal{X}$ is the space of hyperparameters, and $f(x)$ is the validation error of a neural network trained with hyperparameters $x \in \mathcal{X}$.
Given a set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of hyperparameter configurations $x_i$ and associated function evaluations $y_i$ (which may be subject to observation noise), Bayesian optimization starts by building a surrogate model for $f$ over $\mathcal{D}$. Gaussian processes (GPs), which provide a flexible nonparametric distribution over smooth functions, are a popular choice for this probabilistic model, as they provide tractable closed-form inference and facilitate the specification of a prior over the functional form of $f$ (Rasmussen, 2003).
3.1 Surrogate model quality vs. computational budget
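Given a zero-mean prior with covariance function $k$, the GP’s posterior belief about the unobserved output $f(x)$ at a new point $x$ after seeing data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ is a Gaussian distribution with mean $\mu(x)$ and variance $\sigma^2(x)$ such that
$$\mu(x) = k(x)^\top (K + \sigma_n^2 I)^{-1} y, \qquad \sigma^2(x) = k(x, x) - k(x)^\top (K + \sigma_n^2 I)^{-1} k(x), \tag{1}$$
where $k(x) = [k(x, x_i)]_{i=1}^{N}$, $K = [k(x_i, x_j)]_{i,j=1}^{N}$, $y = [y_i]_{i=1}^{N}$, and $\sigma_n^2$ is the variance of the observation noise, that is, $y_i \sim \mathcal{N}(f(x_i), \sigma_n^2)$.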
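As a concrete illustration of the zero-mean GP posterior, the following sketch computes the posterior mean and variance at a new point for one-dimensional inputs. It assumes a squared-exponential kernel and a hand-written linear solver so that it stays dependency-free; the kernel choice, the function names, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential covariance between two scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def gp_posterior(x_new, xs, ys, noise_var=1e-6):
    """Posterior mean and variance at x_new under a zero-mean GP prior."""
    # Gram matrix of the observed inputs, with the noise variance on the diagonal.
    K = [[rbf(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_vec = [rbf(x_new, a) for a in xs]
    alpha = solve(K, ys)      # (K + noise * I)^{-1} y
    beta = solve(K, k_vec)    # (K + noise * I)^{-1} k(x_new)
    mean = sum(k * a for k, a in zip(k_vec, alpha))
    var = rbf(x_new, x_new) - sum(k * b for k, b in zip(k_vec, beta))
    return mean, var

xs, ys = [0.0, 1.0, 2.0], [0.5, 0.2, 0.4]
mean_at_obs, var_at_obs = gp_posterior(1.0, xs, ys)
mean_far, var_far = gp_posterior(10.0, xs, ys)
```

At an observed input the posterior mean stays close to the observation and the variance collapses towards the noise level, whereas far from the data the posterior reverts to the prior (mean near zero, variance near one). This is the uncertainty signal that the acquisition function exploits.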
Given this posterior belief over the value of unobserved points, Bayesian optimization selects the next point (hyperparameter set) to query by solving
$$x_{N+1} = \arg\max_{x \in \mathcal{X}} \alpha(x; \mathcal{D}), \tag{2}$$
where $\alpha$ is the acquisition function, which quantifies the expected added value of querying $f$ at point $x$, based on the posterior belief on $f$ given by Eq. (1).
Typical choices for the acquisition function include entropy search (ES) (Hennig and Schuler, 2012) and its approximation predictive entropy search (Hernández-Lobato et al., 2014), knowledge gradient (Wu et al., 2017), expected improvement (Močkus, 1975; Jones et al., 1998) and upper/lower confidence bound (Cox and John, 1992, 1997).
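Entropy search quantifies how much knowing $f(x)$ reduces the entropy of the distribution $P(x^* \mid \mathcal{D})$ over the location of the best hyperparameters $x^*$:
$$\alpha_{\mathrm{ES}}(x) = H\big[P(x^* \mid \mathcal{D})\big] - \mathbb{E}_{y}\Big[H\big[P(x^* \mid \mathcal{D} \cup \{(x, y)\})\big]\Big], \tag{3}$$
where $H$ is the entropy function and the expectation is taken with respect to the posterior distribution over the observation $y$ at hyperparameter $x$.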
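To make the entropy-reduction idea concrete, the sketch below estimates the distribution over the location of the minimizer on a finite candidate set by counting how often each candidate is the minimizer across sampled functions, and then compares entropies before and after a single hypothetical observation. It is a simplified Monte Carlo illustration: the posterior draws are hard-coded placeholders, and the outer expectation over possible observations that entropy search requires is omitted.

```python
import math
from collections import Counter

def pmin_distribution(function_samples):
    """Estimate P(x* = x_j | D): the fraction of sampled functions minimized at candidate j."""
    n_candidates = len(function_samples[0])
    counts = Counter(min(range(n_candidates), key=lambda j: s[j]) for s in function_samples)
    return [counts[j] / len(function_samples) for j in range(n_candidates)]

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def information_gain(samples_before, samples_after):
    """Entropy reduction in the location of the minimizer after one observation."""
    return entropy(pmin_distribution(samples_before)) - entropy(pmin_distribution(samples_after))

# Hypothetical posterior draws over three candidate configurations.
before = [[0.3, 0.2, 0.4], [0.1, 0.5, 0.3], [0.4, 0.3, 0.2], [0.2, 0.3, 0.5]]
after = [[0.3, 0.2, 0.4], [0.3, 0.1, 0.4], [0.4, 0.2, 0.3], [0.1, 0.2, 0.5]]
gain = information_gain(before, after)
```

In this toy example the draws after the observation agree more often on which candidate is best, so the entropy of the minimizer's location drops and `gain` is positive. A practical implementation would average this quantity over simulated observations drawn from the GP posterior.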
The more accurate the observed function values, the more accurate the GP surrogate model (1). A more accurate surrogate model, in turn, defines a better acquisition function (2), and, finally, a more valuable Bayesian optimization outer loop. More accurate observations, however, require more computation per query.
Previous work has tackled this tradeoff during the BO process by early-stopping training runs that are predicted to yield poor final values (Swersky et al., 2014; Dai et al., 2019). IBO takes the opposite route, detecting when to spend additional effort to acquire a more accurate evaluation of the objective.
Crucially, hyperparameter tuning for neural networks is not an entirely blackbox optimization setting, as we know the loss minimization framework in which neural networks are trained. We take advantage of this by allocating computational budget at each SGD iteration; based on the considered training points, IBO switches from standard SGD updates to the more computationally intensive importance sampling updates. This is the focus of the following section.
3.2 Importance sampling for loss minimization
The impact of training data points on one (batched) SGD iteration has received significant attention in machine learning (Needell et al., 2014; Schmidt et al., 2015; Zhang et al., 2017; Fu and Zhang, 2017). For the purposes of IBO, we focus on importance sampling (IS) (Needell et al., 2014; Zhao and Zhang, 2015). IS minimizes the variance in SGD updates (IS also benefits SGD with momentum (Johnson and Guestrin, 2018); although we focus our analysis on pure SGD, IBO also extends to certain SGD variants). However, IS is parameterized by the per-example gradient norm for the current weights of the network, and as such incurs a significant computational overhead.
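Specifically, let $L(w) = \frac{1}{n} \sum_{i=1}^{n} L_i(w)$ be the training loss, where $n$ is the number of training examples and $L_i(w)$ is the loss at point $i$. To minimize $L$, SGD with importance sampling iteratively computes an estimate $w_{t+1}$ of $w^* = \arg\min_w L(w)$ by sampling $i \in \{1, \dots, n\}$ with probability $p_i \propto \|\nabla L_i(w_t)\|$, then applying the update
$$w_{t+1} = w_t - \eta \, \frac{1}{n p_i} \nabla L_i(w_t), \tag{4}$$
where $\eta$ is the learning rate. Update (4) provably minimizes the variance of the gradient estimate, which in turn improves the convergence speed of SGD. (Standard SGD is recovered by setting $p_i = 1/n$.)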
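The variance-reduction claim can be checked numerically on a small example. The sketch below assumes the training loss is the average of per-example losses, draws an index with probability proportional to the per-example gradient norm, and reweights the sampled gradient by one over (number of examples times sampling probability). The per-example gradients are hypothetical toy values, and the mean and variance of the estimator are computed exactly by enumeration rather than by sampling.

```python
import math

def importance_probabilities(grads):
    """Sampling distribution proportional to the per-example gradient norm."""
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in grads]
    total = sum(norms)
    return [nrm / total for nrm in norms]

def weighted_gradient(grads, probs, i):
    """Single-sample estimate (1 / (n * p_i)) * grad_i of the mean gradient."""
    n = len(grads)
    return [g / (n * probs[i]) for g in grads[i]]

def estimator_stats(grads, probs):
    """Exact mean and total variance of the estimator when the index is drawn from probs."""
    dim = len(grads[0])
    estimates = [weighted_gradient(grads, probs, i) for i in range(len(grads))]
    mean = [sum(p * e[d] for p, e in zip(probs, estimates)) for d in range(dim)]
    var = sum(p * sum((e[d] - mean[d]) ** 2 for d in range(dim))
              for p, e in zip(probs, estimates))
    return mean, var

# Hypothetical per-example gradients in which one example dominates.
grads = [[4.0, 0.0], [0.1, 0.1], [0.0, 0.2], [0.1, 0.0]]
uniform = [1.0 / len(grads)] * len(grads)
mean_is, var_is = estimator_stats(grads, importance_probabilities(grads))
mean_uniform, var_uniform = estimator_stats(grads, uniform)
```

Both sampling schemes yield the same expected gradient, so the importance-weighted update remains unbiased, while the variance under importance sampling is markedly smaller than under uniform sampling for this skewed example. Computing the probabilities requires every per-example gradient norm, which is the overhead discussed in the text.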
Various solutions to efficiently leverage importance sampling have been suggested (Zhao and Zhang, 2015). We leverage recent work by Katharopoulos and Fleuret (2018), which speeds up batched SGD with IS through a cheap subroutine that determines, at each SGD step, whether IS’s variance reduction justifies the incurred computational cost.
To achieve efficient IS for batches of size $b$, Katharopoulos and Fleuret (2018) introduce a presample batch size hyperparameter $B > b$. At each SGD step, $B$ points are first sampled uniformly at random, from which a batch of size $b$ is then subsampled. These $b$ points are sampled either uniformly or with importance sampling, depending on an upper bound on the variance reduction permitted by IS.
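The two-stage sampling scheme can be sketched as follows. This is an illustrative simplification rather than the authors' or Katharopoulos and Fleuret's implementation: the per-example `scores` stand in for whatever importance estimate is available, and the decision between uniform and importance sampling is passed in as a boolean flag (`use_importance`) instead of being derived from the variance-reduction bound described above.

```python
import random

def sample_batch(scores, presample_size, batch_size, use_importance, rng):
    """Draw batch_size example indices (with weights) from a uniform presample."""
    # Stage 1: presample indices uniformly at random from the training set.
    presample = rng.sample(range(len(scores)), presample_size)
    if not use_importance:
        # Plain SGD: uniform subsample, every example gets weight 1.
        batch = rng.sample(presample, batch_size)
        return batch, [1.0] * batch_size
    # Stage 2: resample with probability proportional to the importance scores.
    total = sum(scores[i] for i in presample)
    probs = [scores[i] / total for i in presample]
    picks = rng.choices(range(presample_size), weights=probs, k=batch_size)
    # Weights 1 / (presample_size * p_i) keep the weighted mini-batch gradient unbiased.
    batch = [presample[j] for j in picks]
    weights = [1.0 / (presample_size * probs[j]) for j in picks]
    return batch, weights

rng = random.Random(0)
scores = [0.1 + (i % 10) for i in range(1000)]  # hypothetical importance scores
batch, weights = sample_batch(scores, presample_size=96, batch_size=32,
                              use_importance=True, rng=rng)
uniform_batch, uniform_weights = sample_batch(scores, 96, 32, False, rng)
```

The presample size controls the cost of this step: a larger presample means more examples must be scored before each update, but gives the importance distribution more candidates to choose from.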
3.3 Multitask BO for importance sampling
Katharopoulos and Fleuret (2018) state that the added value of importance sampling is extremely sensitive to the number of presampled data points; we verify this empirically in §4, showing that naively replacing standard SGD with the IS algorithm of (Katharopoulos and Fleuret, 2018) does not improve upon standard BO hyperparameter tuning. To maximize the utility of importance sampling, we instead opt for a multitask BO framework, within which the search through hyperparameter space is done in parallel to a second task: optimization over the presample size.
Multitask Bayesian optimization (MTBO) (Swersky et al., 2013) extends BO to evaluating a point on multiple correlated tasks. To do so, MTBO optimizes an objective function over a target task which, although expensive to evaluate, provides the maximum utility for the downstream task. MTBO exploits cheap evaluations on surrogate tasks to extrapolate performance on the target task; here, the target task evaluates $f$ when sampling a batch from all training data, whereas the surrogate task evaluates $f$ when subsampling from a super-batch of $B$ datapoints at each SGD iteration. MTBO uses the entropy search acquisition function (Eq. 3), and models an objective function over points and tasks via a multitask GP (Journel and Huijbregts, 1978; Bonilla et al., 2008). The covariance between two pairs of points and corresponding tasks is defined through a Kronecker product kernel:
$$k\big((x, t), (x', t')\big) = k_{\mathcal{X}}(x, x') \otimes k_{T}(t, t'), \tag{5}$$
where $k_{\mathcal{X}}$ models the relation between the hyperparameters and $k_{T}$ describes the correlation between tasks.
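The following sketch illustrates why a product kernel over (hyperparameter, task) pairs is called a Kronecker product kernel: when every hyperparameter is paired with every task, the joint Gram matrix equals the Kronecker product of the hyperparameter Gram matrix and the task covariance matrix. The squared-exponential kernel and the two-task covariance below are assumptions chosen for illustration, not the kernels used in the paper.

```python
import math

def k_x(x, x2, lengthscale=1.0):
    """Kernel over hyperparameters (squared exponential, an illustrative choice)."""
    return math.exp(-0.5 * ((x - x2) / lengthscale) ** 2)

def k_t(t, t2, task_cov):
    """Kernel over tasks: a lookup in a positive semi-definite task covariance matrix."""
    return task_cov[t][t2]

def k_multi(pair, pair2, task_cov):
    """Product kernel over (hyperparameter, task) pairs."""
    (x, t), (x2, t2) = pair, pair2
    return k_x(x, x2) * k_t(t, t2, task_cov)

def kronecker(A, B):
    """Kronecker product of two dense matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

xs = [0.0, 0.5, 2.0]
task_cov = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical correlation between two tasks
pairs = [(x, t) for x in xs for t in range(2)]
gram = [[k_multi(p, q, task_cov) for q in pairs] for p in pairs]
K_x = [[k_x(a, b) for b in xs] for a in xs]
gram_from_kron = kronecker(K_x, task_cov)
```

Here `gram` and `gram_from_kron` coincide entry by entry. The off-diagonal entry of `task_cov` determines how much an observation on a cheap task informs predictions on the target task.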
For our case, the subsampling size is the task variable while the optimal task sets to the size of the entire training set. Let denote the validation error value at hyperparameter after training iterations using IS with presample size . We define the multitask kernel for the GP that models as
with the subtask kernels defined as
(6)  
Additionally, following Snoek et al. (2012), we penalize the evaluation of any point by the computational cost of training a model for SGD iterations at hyperparameter with subsampling size . This penalty guides the hyperparameter search towards promising yet relatively inexpensive solutions. We model the training cost using a multitask GP fitted to the log cost of observations that are collected during BO. We choose the covariance function
where this time we modify the kernel on to reflect that larger increases training time:
(7)  
Our choices for and follow (Klein et al., 2016), who recommend the associated feature maps.
Our resulting acquisition function is thus:
(8)  
where is the posterior mean of the GP modeling the training cost; as previously, is the probability that is the optimal solution at the target task given data .
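A cost-adjusted acquisition rule of this kind can be sketched as "information gain per unit of predicted cost". The snippet below is a hedged illustration only: the candidate values are made up, the dictionary keys are hypothetical, and the predicted cost is obtained by exponentiating the posterior mean of a cost model fitted in log space, which is one plausible reading of the text rather than a confirmed detail of the method.

```python
import math

def cost_adjusted_gain(info_gain, log_cost_mean):
    """Information gain per unit of predicted training cost.

    log_cost_mean is the cost model's posterior mean in log space, so the
    predicted cost is recovered by exponentiating it.
    """
    return info_gain / math.exp(log_cost_mean)

def select_query(candidates):
    """Pick the (hyperparameter, presample size) pair with the best gain per cost."""
    return max(candidates, key=lambda c: cost_adjusted_gain(c["gain"], c["log_cost"]))

# Hypothetical candidates: a larger presample size is more informative but slower.
candidates = [
    {"x": 0.01, "presample": 64, "gain": 0.20, "log_cost": math.log(10.0)},
    {"x": 0.01, "presample": 512, "gain": 0.50, "log_cost": math.log(40.0)},
    {"x": 0.10, "presample": 256, "gain": 0.45, "log_cost": math.log(15.0)},
]
chosen = select_query(candidates)
```

In this toy example the most informative candidate is not selected because it is also the most expensive; the rule prefers the candidate whose information gain is largest relative to its predicted cost.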
Our algorithm is presented in Algorithm 1. The initialization phase follows the MTBO convention: we collect initial data at randomly chosen inputs, and evaluate each hyperparameter configuration with a randomly selected value for the task variable. DoSGD is the subroutine proposed by Katharopoulos and Fleuret (2018); it determines whether the variance reduction enabled by importance sampling is worth the additional cost at the current SGD iteration.
Remark 1.
Whereas Fabolas speeds up the evaluation of the objective by limiting the number of training points used during training, IBO uses the entire training data, reweighting points based on their relevance to the training task. Thus, each IBO iteration is slower than a Fabolas iteration. However, because IBO carries out a more principled search through hyperparameter space and queries higher fidelity evaluations, IBO requires fewer BO iterations, and hence potentially less time, to find a good hyperparameter.
4 Experiments
Problem | Time Budget | IBO (ours) | Fabolas | FabolasIS | ES | ESIS
CNN (CIFAR10) | % | 0.28 (0.25, 0.32) | 0.29 (0.26, 0.36) | 0.4 (0.38, 0.9) | 0.38 (0.29, 0.83) | 0.3 (0.26, 0.35)
CNN (CIFAR10) | % | 0.27 (0.26, 0.29) | 0.26 (0.25, 0.27) | 0.38 (0.27, 0.9) | 0.28 (0.27, 0.37) | 0.25 (0.25, 0.26)
CNN (CIFAR10) | % | 0.25 (0.24, 0.29) | 0.25 (0.24, 0.27) | 0.38 (0.26, 0.9) | 0.28 (0.26, 0.28) | 0.26 (0.25, 0.26)
CNN (CIFAR10) | % | 0.23 (0.23, 0.23) | 0.25 (0.24, 0.27) | 0.33 (0.26, 0.38) | 0.28 (0.26, 0.28) | 0.26 (0.25, 0.26)
ResNet (CIFAR10) | % | 0.11 (0.11, 0.12) | 0.11 (0.11, 0.12) | 0.11 (0.11, 0.2) | 0.12 (0.11, 0.21) | 0.11 (0.11, 0.21)
ResNet (CIFAR10) | % | 0.1 (0.1, 0.1) | 0.11 (0.1, 0.11) | 0.11 (0.11, 0.11) | 0.11 (0.1, 0.2) | 0.12 (0.11, 0.21)
ResNet (CIFAR10) | % | 0.09 (0.09, 0.1) | 0.11 (0.1, 0.11) | 0.11 (0.11, 0.11) | 0.11 (0.1, 0.17) | 0.12 (0.11, 0.19)
ResNet (CIFAR10) | % | 0.09 (0.09, 0.1) | 0.11 (0.1, 0.11) | 0.11 (0.11, 0.11) | 0.11 (0.1, 0.17) | 0.12 (0.11, 0.18)
ResNet (CIFAR100) | % | 0.38 (0.35, 0.39) | 0.37 (0.37, 0.39) | 0.39 (0.38, 0.44) | 0.38 (0.38, 0.44) | 0.38 (0.37, 0.38)
ResNet (CIFAR100) | % | 0.33 (0.33, 0.37) | 0.37 (0.36, 0.39) | 0.39 (0.38, 0.44) | 0.38 (0.38, 0.44) | 0.37 (0.36, 0.38)
ResNet (CIFAR100) | % | 0.33 (0.33, 0.34) | 0.37 (0.36, 0.39) | 0.39 (0.38, 0.44) | 0.38 (0.38, 0.42) | 0.38 (0.36, 0.39)
ResNet (CIFAR100) | % | 0.32 (0.32, 0.34) | 0.37 (0.36, 0.39) | 0.39 (0.39, 0.45) | 0.38 (0.37, 0.41) | 0.36 (0.34, 0.38)
We evaluate our proposed method, IBO (the code for IBO will be released upon acceptance), on four benchmark hyperparameter tuning tasks: a feedforward network on MNIST, a convolutional neural network (CNN) on CIFAR10, a residual network on CIFAR10, and a residual network on CIFAR100. We include the following baselines:
– ES: Bayesian optimization with the entropy search acquisition function (Hennig and Schuler, 2012).
– ESIS: BO with entropy search; inner optimization is performed using IS. For each blackbox query, we draw the presample size uniformly at random from a range of multiples of the batch size, as prescribed in (Katharopoulos and Fleuret, 2018); the presample size is constant during the rounds of SGD.
– Fabolas (Klein et al., 2016): BO in which each inner-loop optimization uses a fraction of the training set. The value of this fraction is learned via multitask BO; this sub-training set does not evolve during the inner SGD iterations.
– FabolasIS: Fabolas, training with SGD with IS. For this method, a fraction of the training data is uniformly chosen as in Fabolas, but training is performed with SGD with IS. The presample batch size is drawn uniformly at random from a range of multiples of the batch size.
ESIS acts as an ablation test for IBO’s multitask framework, as it does not reason about the cost-fidelity tradeoff of IBO. Thus, we keep the training procedure for ESIS and FabolasIS similar to IBO, switching to IS only if variance reduction is possible and using IS is advantageous (Alg. 1). We run all methods on a PowerEdge R730 Server with NVIDIA Tesla K80 GPUs (experiment 4.2) or on a DGX server with NVIDIA Tesla V100 GPUs (rest).
4.1 Implementation Details
For IBO, we use task kernels (Eq. 6) and (Eq. 7), with kernel hyperparameters and . Following Snoek et al. (2012), we marginalize out the GPs’ hyperparameters using MCMC for all methods.
To set the time budget, we fix a total number of BO iterations for each method; the time at which the fastest method completes its final iteration acts as the maximum amount of time available to any other method. All initial design evaluations also count towards the runtime; this slightly advantages non-IS methods, which have cheaper initializations.
We report the performance of each method as a function of wallclock time, since the methods differ in per-iteration complexity (App. 6 reports results vs. iteration number).
We measure the performance of each method by taking the predicted best hyperparameter values after each BO iteration, then training a model with those hyperparameters, using the entire training set and vanilla SGD. Recall that for Fabolas, FabolasIS, and IBO, the incumbent is the set of hyperparameters with the best predicted objective on the target task (e.g., using the full training data for Fabolas).
We run each method five times unless otherwise stated, and report the median performance and percentiles (mean and standard deviation results are included in Appendix C.1 for completeness). ES and Fabolas variations are run using RoBO (https://github.com/automl/RoBO). For importance sampling, we used the code provided by Katharopoulos and Fleuret (2018) (https://github.com/idiap/importance-sampling).
All methods are initialized with 5 hyperparameter configurations drawn from a Latin hypercube design. For IBO, we evaluate each configuration on the maximum value of its target task. For Fabolas, Klein et al. (2016) suggest initializing by evaluating each hyperparameter on an increasing series of task values. This aims to capture the task variable’s effect on the objective. However, we empirically observed that following an initial design strategy similar to IBO’s, i.e., evaluating each hyperparameter on the maximum target value, worked better in practice for both Fabolas and FabolasIS. This is the method we use in our experiments; App. C includes results for both initialization schemes.
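For readers unfamiliar with Latin hypercube designs, the following minimal sketch draws five initial configurations so that, along every hyperparameter axis, each of five equal-width strata contains exactly one point. The number of dimensions and the search ranges in `bounds` are hypothetical placeholders, and this is not the initialization code used in the experiments.

```python
import random

def latin_hypercube(n_points, n_dims, rng):
    """Sample n_points in [0, 1)^n_dims with one point per stratum along each axis."""
    columns = []
    for _ in range(n_dims):
        # One draw inside each of the n_points equal-width strata, then shuffle.
        column = [(k + rng.random()) / n_points for k in range(n_points)]
        rng.shuffle(column)
        columns.append(column)
    # Transpose so that each row is one design point.
    return [list(point) for point in zip(*columns)]

def scale(point, bounds):
    """Map a unit-cube point to hyperparameter ranges given as (low, high) pairs."""
    return [low + u * (high - low) for u, (low, high) in zip(point, bounds)]

rng = random.Random(0)
design = latin_hypercube(n_points=5, n_dims=4, rng=rng)
# Hypothetical search ranges, for illustration only.
bounds = [(-6.0, -1.0), (0.0, 1.0), (0.1, 0.99), (-6.0, -1.0)]
configs = [scale(p, bounds) for p in design]
```

Compared with independent uniform sampling, this stratification guarantees that a small initial design still covers the full range of every hyperparameter.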
For IBO, FabolasIS and ESIS, we reparameterize the presample size as . As was recommended by Katharopoulos and Fleuret (2018), we set . For FabolasIS, if is larger than the training subset size, we use the entire subset to compute the importance distribution.
4.2 Feedforward Neural Network on MNIST
Our first experiment is based on a common Bayesian optimization benchmark problem (Falkner et al., 2018; Domhan et al., 2015; Hernández-Lobato et al., 2016). We tune a fully connected neural network using RMSProp on MNIST (LeCun, 1998). The number of training epochs and the number of BO rounds are set to . We tune six hyperparameters: number of hidden layers, number of units per layer, batch size, learning rate, decay rate, and dropout rate (see App. A).
Given the well-known simplicity of the MNIST dataset, we do not expect to see significant gains when using importance sampling during training. Indeed, we see (Table 2) that all methods perform similarly after exhausting their BO iteration budget, although Fabolas does reach a low test error slightly earlier on, since training on few data points is sufficient; see Appendix A for more details.
4.3 CNN on CIFAR10
We next tune a convolutional neural network (CNN) using RMSProp on the CIFAR10 dataset (Krizhevsky et al., 2009). We fix an architecture of three convolutional layers with max-pooling, followed by a fully connected layer, in line with previous benchmarks on this problem (Falkner et al., 2018; Klein et al., 2016; Dai et al., 2019). Following Dai et al. (2019), we tune six hyperparameters: number of convolutional filters, number of units in the fully connected layer, batch size, initial learning rate, decay rate, and regularization weight. All methods are run for BO iterations and trained using SGD epochs.
IBO, Fabolas and ESIS exhibit the best performance (Fig. 3) but switch ranking over the course of time. However, after spending roughly half of the budget, IBO outperforms Fabolas and all other baselines, achieving the best final error with the lowest uncertainty.
ESIS shows that adding IS naively can improve upon base entropy search; however, IBO outperforms both ES and ESIS, confirming the importance of a multitask setting that optimizes IS. Furthermore, simply adding importance sampling during SGD is not guaranteed to improve upon any method: FabolasIS performs poorly compared to Fabolas.
4.4 Residual Network on CIFAR10
We next tune a residual network trained on CIFAR10. We follow the wide ResNet architecture in (Zagoruyko and Komodakis, 2016), and tune four hyperparameters: initial learning rate, decay rate, momentum, and regularization weight. Following Klein et al. (2016), all but the momentum are optimized over a log-scale search space.
We set and multiply the learning rate by the decay rate after epochs. Experimentally, we saw that epochs is insufficient for the inner (SGD) optimization to converge on the ResNet architecture; this experiment evaluates BO in the setting where the objective is too computationally intensive to compute exactly. We ran all the methods using BO iterations for Fabolas and FabolasIS and iterations for the rest. This difference in iteration budget compensates for the different cost of training on a subset of the data versus on the entire data. (For experiment 4.2, we observed that keeping the BO iteration budget consistent is sufficient since the training costs are not very different. For experiment 4.3, we set this budget to 100, and stopped reporting the results once the first method exhausted its budget. Since ResNet experiments are generally more costly, choosing a large budget for all methods was not feasible.) Results are reported in Fig. 3, and are obtained with runs with random initializations.
Consistent with previous results, Fabolas achieves the lowest error in the very initial stage, due to its cheap approximations. However, IBO quickly overtakes all other baselines, and attains a value that other methods cannot achieve even after consuming their entire budget. FabolasIS also performs well, but suffers from a large variance.
The ablation tests (ESIS and FabolasIS) consistently have high variance, likely because these methods do not learn the optimal batch size for importance sampling and opt for a random selection within the recommended range. In contrast, IBO specifically learns the batch size parameter which controls the cost-benefit tradeoff in importance sampling and hence enjoys better final results and lower variance.
4.5 Residual Network on CIFAR100
Finally, we tune the hyperparameters of a residual network trained on CIFAR100. The architecture of the network, the hyperparameters we optimize and their respective ranges are similar to those in §4.4. We set and multiply the learning rate by the decay rate every epochs. For Fabolas and FabolasIS, a budget of BO iterations is provided while the rest of the methods are given iterations.
IBO clearly outperforms the rest of the methods after spending roughly 20% of the time budget (Fig. 4); Fabolas and ESIS are the second best methods. As in experiment 4.3, FabolasIS is outperformed by the other baselines, and once again incurs a large variance. Interestingly, for Fabolas and FabolasIS, the additional BO budget does not improve their performance. This is yet further evidence that for complex datasets, neither vanilla multitask frameworks nor simple importance sampling is sufficient to gain the advantages of IBO.
By seeking higher-fidelity surrogate models, IBO achieves better results in fewer optimization runs and less runtime than other baselines, despite the incurred cost of using each training example individually during certain SGD rounds.
5 Conclusion
Bayesian optimization offers an efficient and principled framework for hyperparameter tuning. However, finding optimal hyperparameters requires an expensive inner loop which repeatedly trains a model with new hyperparameters. Prior work has scaled BO by using cheap evaluations of the blackbox function. IBO takes the opposite approach: by increasing the time spent obtaining higher-fidelity evaluations, IBO requires far fewer outer BO loops.
Leveraging recent developments in importance sampling, IBO takes into account the contribution of each training point to decide whether to run vanilla SGD or a more complex, time-consuming but higher quality variant. Although this results in costlier neural network training loops, the additional precision obtained for the blackbox estimates allows a more principled search through hyperparameter space, significantly decreasing the amount of wallclock time necessary to obtain a high-quality hyperparameter.
Crucially, the interaction between importance sampling and Bayesian optimization must be approached with care; a naive merging of both methods does not decrease the overall runtime of Bayesian optimization, and does not yield better final hyperparameters. However, by opting for a multitask parameterization of the problem, IBO learns to dynamically adjust the tradeoff between neural network training time and blackbox estimate value, producing faster overall runtimes as well as better hyperparameters.
We show on four benchmark tasks of increasing complexity that IBO achieves the lowest error compared to all other baseline methods, and scales gracefully with dataset and neural architecture complexity. When tuning a ResNet on CIFAR100, IBO outperforms all other baselines and ablation tests by a significant margin, both as a function of wallclock time and number of outer optimization rounds.
References
 Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pp. 2546–2554. Cited by: §1.
 Hyperparameter optimization of deep neural networks: combining hyperband with bayesian model selection. In Conférence sur l’Apprentissage Automatique, Cited by: §2.
 Multitask gaussian process prediction. In Advances in neural information processing systems, pp. 153–160. Cited by: §3.3.
 A statistical method for global optimization. In [Proceedings] 1992 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1241–1246. Cited by: §3.1.
 SDO: a statistical method for global optimization. In in Multidisciplinary Design Optimization: StateoftheArt, pp. 315–329. Cited by: §3.1.
 Bayesian optimization meets bayesian optimal stopping. In International Conference on Machine Learning, pp. 1496–1506. Cited by: §2, §3.1, §4.3.
 Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §1, §2, §4.2.
 BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: Appendix A, §2, §4.2, §4.3.
 Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A 463 (2088), pp. 3251–3269. Cited by: §2.
 CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, pp. 841–850. Cited by: §3.2.
 Google vizier: a service for black-box optimization. Cited by: §2.
 Entropy search for information-efficient global optimization. Journal of Machine Learning Research 13 (Jun), pp. 1809–1837. Cited by: §1, §3.1, 1st item.
 Predictive entropy search for multi-objective bayesian optimization. In International Conference on Machine Learning, pp. 1492–1501. Cited by: §4.2.
 Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926. Cited by: §3.1.
 Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization 32, pp. 369–382. Cited by: §2.
 Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, pp. 7265–7275. Cited by: §1, footnote 1.
 Efficient global optimization of expensive black-box functions. Journal of Global optimization 13 (4), pp. 455–492. Cited by: §3.1.
 Mining geostatistics. Vol. 600, Academic press London. Cited by: §3.3.
 Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pp. 992–1000. Cited by: §2.
 Not all samples are created equal: deep learning with importance sampling. arXiv preprint arXiv:1803.00942. Cited by: §3.2, §3.2, §3.3, §3.3, 2nd item, §4.1, §4.1.
 Fast bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079. Cited by: Appendix C, §1, §2, §2, §3.3, 3rd item, §4.1, §4.3, §4.4.
 Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.3.
 The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.2.
 Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560. Cited by: §2.
 On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pp. 400–404. Cited by: §3.1.
 Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems 27, pp. 1017–1025. Cited by: §3.2.
 Gaussian processes in machine learning. In Summer School on Machine Learning, pp. 63–71. Cited by: §3.
 Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, pp. 819–828. Cited by: §3.2.
 Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1, §3.3, §4.1.
 Energy and policy considerations for deep learning in nlp. In Annual Meeting of the Association for Computational Linguistics, Cited by: §1.
 Multi-task bayesian optimization. In Advances in neural information processing systems, pp. 2004–2012. Cited by: §1, §1, §2, §3.3.
 Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896. Cited by: §1, §2, §3.1.
 Combination of hyperband and bayesian optimization for hyperparameter optimization in deep learning. arXiv preprint arXiv:1801.01596. Cited by: §2.
 Bayesian optimization with gradients. In Advances in Neural Information Processing Systems, pp. 5267–5278. Cited by: §3.1.
 Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.4.
 Determinantal point processes for minibatch diversification. In UAI 2017, Cited by: §3.2.
 Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, Cited by: §3.2, §3.2.
Appendix A Feedforward Neural Network on MNIST
Per BO iteration (Fig. 5), IBO is among the best-performing methods, gaining higher utility than the others. However, since importance sampling is expensive, IBO’s performance degrades when measured over wall-clock time. After roughly 30% of the time budget (around two hours), Fabolas outperforms the other methods. This is expected, since Fabolas relies on cheap approximations obtained from training subsets. Although such approximations are noisy, we speculate that the noise does not significantly harm performance, especially for simpler datasets and models such as a feedforward network on MNIST.
We tune six hyperparameters: number of hidden layers, number of units per layer, batch size, initial learning rate, decay rate, and dropout rate. Following Falkner et al. (2018), the batch size, number of units, and learning rate are optimized over a log-scale search space.
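To make the log-scale search space concrete, the following minimal sketch draws one configuration in which batch size, number of units, and learning rate are sampled uniformly in log space. It is an illustration only: the function names and every numeric bound are hypothetical placeholders, not the ranges used in our experiments.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Sample one configuration of the six tuned hyperparameters.

    All bounds are hypothetical placeholders chosen for illustration.
    """
    return {
        "num_layers": rng.randint(1, 5),
        # The three log-scale hyperparameters:
        "num_units": int(round(sample_log_uniform(16, 512, rng))),
        "batch_size": int(round(sample_log_uniform(8, 256, rng))),
        "learning_rate": sample_log_uniform(1e-6, 1e-1, rng),
        # The remaining hyperparameters are sampled on a linear scale here.
        "decay_rate": rng.uniform(0.0, 1.0),
        "dropout_rate": rng.uniform(0.0, 0.5),
    }


rng = random.Random(0)
config = sample_config(rng)
```

Sampling in log space spreads candidates evenly across orders of magnitude, which is why it is commonly preferred for quantities such as the learning rate.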
All methods are run for BO iterations. Performance is summarized over five random runs and shown in the last row of Figure 6 (median with 25th and 75th percentiles over time and iteration budget) and Figure 7 (mean with standard deviation over time).
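The summary statistics above can be computed per step from the individual runs. The sketch below is one possible way to do so with the Python standard library; the function name and the input curves are hypothetical, and the choice of the sample (rather than population) standard deviation is an assumption.

```python
import statistics


def summarize_runs(runs):
    """Aggregate per-run error curves step by step.

    `runs` is a list of equally long lists, where runs[r][t] is the
    error of run r at step t (a BO iteration or a time checkpoint).
    """
    summary = {"q25": [], "median": [], "q75": [], "mean": [], "std": []}
    for values in zip(*runs):  # all runs at one step
        # Quartiles with linear interpolation between order statistics.
        q25, median, q75 = statistics.quantiles(values, n=4, method="inclusive")
        summary["q25"].append(q25)
        summary["median"].append(median)
        summary["q75"].append(q75)
        summary["mean"].append(statistics.fmean(values))
        # Sample standard deviation (assumption; see lead-in).
        summary["std"].append(statistics.stdev(values))
    return summary


# Synthetic curves for five runs, for illustration only (not our results).
synthetic_runs = [
    [0.30, 0.12, 0.08],
    [0.25, 0.10, 0.07],
    [0.28, 0.11, 0.07],
    [0.35, 0.14, 0.09],
    [0.22, 0.09, 0.06],
]
summary = summarize_runs(synthetic_runs)
```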
Metric: Median Error
Method      %     %     %     %
IBO         0.2   0.09  0.07  0.06
Fabolas     0.21  0.07  0.06  0.06
Fabolas-IS  0.16  0.07  0.07  0.06
ES          0.17  0.08  0.07  0.07
ES-IS       0.21  0.1   0.07  0.06

Metric: 25th Percentile Error
Method      %     %     %     %
IBO         0.19  0.08  0.07  0.06
Fabolas     0.15  0.07  0.06  0.05
Fabolas-IS  0.16  0.07  0.06  0.06
ES          0.12  0.07  0.07  0.06
ES-IS       0.19  0.1   0.07  0.06

Metric: 75th Percentile Error
Method      %     %     %     %
IBO         0.34  0.11  0.08  0.06
Fabolas     0.22  0.07  0.06  0.06
Fabolas-IS  0.19  0.08  0.07  0.06
ES          0.2   0.08  0.08  0.08
ES-IS       0.21  0.1   0.07  0.06

Appendix B IBO scales with dataset and network complexity
IBO improves upon existing BO methods, more so when tuning on large, complex datasets and architectures. To illustrate, Figure 6 shows the results of all experiments over iteration budget (left column) and wall-clock time budget (right column). The plots are sorted such that the complexity of the dataset and model architecture decreases along the rows; i.e., the most straightforward problem, FCN on MNIST, lies in the bottom row, and the most challenging experiment, ResNet on CIFAR-100, is in the top row. In the iteration plots (left column), IBO is consistently among the best methods (a lower curve denotes better performance), achieving high utility per BO iteration. However, since importance sampling is inherently expensive, the advantage of IBO over wall-clock time only manifests as the tuning problem becomes more challenging. Specifically, moving from the bottom to the top, as the complexity of tuning increases, IBO starts to outperform the other methods from an earlier stage and with an increasing margin over wall-clock time (right column).
Appendix C Initializing Fabolas
Conventionally, Bayesian optimization starts by evaluating the objective at an initial set of hyperparameters chosen at random. To leverage the speedup in Fabolas, Klein et al. (2016) suggest evaluating the initial hyperparameters on different, usually small, subsets of the training data. In our experiments, we randomly selected hyperparameters and evaluated each on randomly selected training subsets with sizes of the entire training data. However, our experimental results show that Fabolas achieves better results faster if, during the initial design phase, the objective evaluations use the entire training data. Figure 8 illustrates this point for CNN and ResNet on CIFAR-10. Fabolas with the original initialization scheme performs evaluations ( hyperparameters each evaluated at budgets), whereas with the new scheme, Fabolas initializes with evaluations ( hyperparameters each evaluated at budget). The plots show the mean results (with standard deviation) averaged over five and three runs for CNN and ResNet, respectively. Overall, Fabolas with the new initialization achieves better average performance.
C.1 Mean and Standard Deviation Results
For completeness, we include the plots reporting mean and standard deviation throughout the experiments (Figure 7).
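The difference between the two initialization schemes can be sketched as follows. This is an illustration only: the helper name, the configuration labels, and the subset fractions are hypothetical and do not correspond to the counts or budgets used in our experiments.

```python
def build_initial_design(configs, subset_fractions):
    """Pair every initial hyperparameter configuration with every budget.

    Each budget is the fraction of the training data used for one evaluation.
    """
    return [(config, fraction)
            for config in configs
            for fraction in subset_fractions]


# Hypothetical initial configurations (stand-ins for random draws).
configs = ["config_a", "config_b", "config_c"]

# Original scheme: each configuration is evaluated on several small subsets.
# The fractions below are placeholders.
subset_design = build_initial_design(configs, [1 / 16, 1 / 8, 1 / 4])

# New scheme: each configuration is evaluated once on the full training data.
full_design = build_initial_design(configs, [1.0])
```

Under these placeholder settings the original scheme performs more, but cheaper and noisier, initial evaluations than the new scheme, which is the trade-off examined in Figure 8.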