In hyperparameter tuning of machine learning models, we seek to find hyperparametersin some set to minimize the validation error , i.e., to solve
Evaluating can take substantial time and computational power (Bergstra and Bengio, 2012) and may not provide gradient evaluations. Bayesian optimization, which requires relatively few function evaluations, provides a compelling approach to such optimization problems (Jones et al., 1998; Snoek et al., 2012).
As the computational expense of training and testing a modern deep neural network for a single set of hyperparameters has grown, researchers have sought to supplant some evaluations of
with computationally inexpensive low-fidelity approximations. Conceputally, an algorithm can use low-fidelity evaluations to quickly identify a smaller set of promising hyperparameter settings, and then later focus more expensive high-fidelity evaluations within this set to refine its estimates.
Pioneering multi-fidelity approaches focused on hyperparameter tuning for deep neural networks include the Bayesian optimization methods FaBOLAS (Klein et al., 2017a, 2015), Freeze-Thaw Bayesian Optimization (Swersky et al., 2014), BOCA (Kandasamy et al., 2017), predictive entropy search for a single continuous fidelity (McLeod et al., 2017), early-stopping SMAC (Domhan et al., 2015), and Hyperband (Li et al., 2016). This work builds on earlier multi-fidelity optimization approaches (Huang et al., 2006; Lam et al., 2015; Poloczek et al., 2017) focused on low-fidelity approximation of physics-based computational models.
These validation error approximations perform the same training and testing steps as in standard Bayesian optimization, but control fidelity with fewer training iterations than required for convergence, fewer training data points, or fewer validation data points. These approximations present unique opportunities not typically considered in the multifidelity literature, even within the portion focused on hyperparameter tuning. First, we observe a full trace of performance with respect to training iterations, rather than just a single performance value at the chosen fidelity. Indeed, training with iterations produces evaluations of the low-fidelity approximation for all training iterations less than or equal to . Second, by caching state after completing iterations, we can significantly reduce computation time when later evaluating for evaluations. This allows quickly evaluating low-fidelity approximations to the validation error for many hyperparameter settings, then later returning to those most promising hyperparameter settings to cheaply obtain more accurate observations. Third, we may simultaneously alter fidelity along several continuous dimensions (iterations, training data, validation data), rather than modifying one continuous fidelity control or choosing from among a discrete set of ambiguously related fidelities.
In this paper, we propose the trace-aware knowledge gradient (taKG) for Bayesian optimization with multiple fidelities. taKG is distinctive in that it leverages both trace information and multiple fidelity controls at once, efficiently selecting training size, validation size, number of training iterations, and hyperparameters to optimize. Moreover, we provide a provably-convergent method for maximizing this acquisition function. taKG addresses the challenges presented by trace observations by considering the reduced cost of adding iterations at a previously evaluated point, and using an intelligent selection scheme to choose a subset of the observed training iterations to include in inference. Additionally, taKG can be used in either a batch or sequential setting, and can also efficiently leverage gradient information if it is available.
We present two variants of our trace-aware knowledge-gradient acquisition function, one for when the cost of sampling is substantial over the whole fidelity space (even when using few training points or iterations), and the other for when the cost and value of information vanish as fidelity decreases to 0. The first form we refer to simply as taKG, and the second as 0-avoiding taKG () because it avoids the tendency of multi-fidelity other methods to measure repeatedly at near-0 fidelities even when these low fidelities provide almost no useful information. Alternative approaches (McLeod et al., 2017; Klein et al., 2017a) add and tune a fixed cost per sample to avoid this issue, while does not require tuning.
Furthermore, we present a novel efficient method to optimize these acquisition functions, even though they cannot be evaluated in closed form. This method first constructs a stochastic gradient estimator which it then uses within multistart stochastic gradient ascent. We show that our stochastic gradient estimator is unbiased and thus asymptotically consistent, and the resulting stochastic gradient ascent procedure converges to a local stationary point of the acquisition function.
Our numerical experiments demonstrate a significant improvement over state-of-the-art alternatives, including FaBOLAS (Klein et al., 2017a, 2015), Hyperband (Li et al., 2016), and BOCA (Kandasamy et al., 2017). Our approach is also applicable to problems that do not have trace observations, but use continuous fidelity controls, and we additionally show strong performance in this setting.
In general, efficient and flexible multifidelity optimization is of crucial practical importance, as evidenced by growing momentum in this research area. Although Bayesian optimization has shown great promise for tuning hyperparameters of machine learning algorithms, computational bottlenecks have remained a major deterrent to mainstream adoption. With taKG, we leverage crucial trace information, while simultaneously providing support for several fidelity controls, providing remarkably efficient optimization of expensive objectives. This work is intended as a step towards the renewed practical adoption of Bayesian optimization for machine learning.
2 THE taKG AND Acquistion Functions
In this section we define the trace-aware knowledge-gradient acquisition function. §2.1 defines our formulation of multi-fidelity optimization with traces and continuous fidelities, along with our inference procedure. §2.2 describes a measure of expected solution quality possible after observing a collection of fidelities within a trace. §2.3 uses this measure to define the taKG acquisition function, and §2.4 defines an improved version, , appropriate for settings in which the the cost and value of information vanish together (for example, as the number of training iterations declines to 0). §2.5 then presents a computational approach for maximizing the taKG and acquisition functions and theoretical results justifying its use. §2.6 discusses warm-starting previously stopped traces, and §2.7 briefly discusses generalizations to batch and derivative observations.
2.1 Problem Setting
We model our objective function and its inexpensive approximations by a real-valued function where our objective is and denotes the fidelity-control parameters. (Here, in
is a vector ofs.) We assume that our fidelity controls have been re-scaled so that is the highest fidelity and the lowest. can be evaluated, optionally with noise, at a cost depending on and .
We let be the additional fidelities observed for free when observing fidelity . Although our framework can be easily generalized, we assume that is a cross product of sets of the form either (trace fidelities) or (non-trace fidelities). We let denote the number of trace fidelities and the number of non-trace fidelities. We also assume that the cost of evaluation is non-decreasing in each component of the fidelity.
For example, consider hyperparameter tuning with fidelities: first is the number of training iterations; second is the amount of training data. Each is bounded between and some maximum value, and specifies training iterations or number of training data points as a fraction of this maximum value. Then, , because we observe results for the number of training iterations ranging from up to the number evaluated. If the amount of validation data is another trace fidelity, we would have: .
We model using Gaussian process regression jointly over and
, assuming that observations are perturbed by independent normally distributed noise with mean
and variance. Each evaluation consists of , a vector of fidelities , and a noisy observation of for each fidelity in . For computational tractability, in our inference, we will choose to retain and incorporate observations only from a subset of these fidelities with each observation. After such evaluations, we will have a posterior distribution on that will also be a Gaussian process, and whose mean and kernel we refer to by and . We describe this inference framework in more detail in the supplement.
We model the logarithm of the cost of evaluating using a separate Gaussian process, updated after each evaluation, and let be the predicted cost after evaluations. We assume for now that the cost of evaluation does not depend on previous evaluations, and then discuss later in §2.6 an extension to warm-starting evaluation at higher fidelities using past lower-fidelity evaluations.
2.2 Valuing Trace Observations
Before defining the taKG and acquisition functions, we define a function that quantifies the extent to which observing trace information improves the quality of our solution to (1.1).
Let indicate the expectation with respect to the posterior after evaluations. Given any and set of fidelities , we will define a function to be the expected loss (with respect to the time- posterior) of our final solution to (1.1) if we are allowed to first observe at all fidelities in .
To define this more formally, let be a random vector comprised of observations of for all . Then, the conditional expected loss from choosing a solution to (1.1) after this observation is . This quantity is a function of , , , and the first evaluations, and can be computed explicitly using formulas from GP regression given in the supplement.
We would choose the solution for which this is minimized, giving a conditional expected loss of
. This is a random variable under the time-posterior whose value depends on . We finally take the expected value under the time- posterior to obtain :
where the integral is over all .
We compute this quantity using simulation. To create one replication of this simulation we first simulate from the time- posterior. We then add simulated noise to this quantity to obtain a simulated value of . We then update our posterior distribution on using this simulated data, allowing us to compute for any given as a predicted value from GP regression. We then use continuous optimization method designed for inexpensive evaluations with gradients (e.g., multi-start L-BFGS) to optimize this value, giving one replication of . We then average many replications to give an unbiased and asymptotically consistent estimate of .
In a slight abuse of notation, we also define . This is the minimum expected loss we could achieve by selecting a solution without observing any additional information. This is equal to for any .
The need to compute via simulation will present a challenge when optimizing acquisition functions defined in terms of it. Below, in §2.5
we will overcome this challenge via a novel method for simulating unbiased estimators of the gradient ofwith respect to and the components of . First, however, we define the taKG and acqisition functions.
2.3 Trace-aware Knowledge Gradient (taKG)
The taKG acquisition function will value observations of a point and a collection of fidelities according to the ratio of the reduction in expected loss (as measured using ) that it induces, to its computational cost.
While evaluating at a fidelity in principle provides observations of at all , we choose to retain and include in our inference only a subset of the observed fidelities . This reduces computational overhead in GP regression. In our numerical experiments, we take the cardinality of to be either 2 or 3, though the approach also allows increased cardinality.
With this in mind, the taKG acquisition function at a point and set of fidelities at time is
where we also refer to the numerator as the value of information (Howard, 1966), . Thus, taKG quantifies the value of information per unit cost of sampling.
The cost of observing at all fidelities in is taken here to be the cost of evaluating at a vector of fidelities equal to the elementwise maximum, . This is the least expensive fidelity at which we could observe .
The taKG algorithm chooses to sample at the point , fidelity , and additional lower-fidelity point(s) to retain that jointly maximize the taKG acquisition function, among all fidelity sets with limited cardinality .
This is a continuous optimization problem whose decision variable is described by real numbers. describe , describe , and describe .
2.4 0-avoiding Trace-aware Knowledge Gradient ()
The taKG acquisition function uses the value of information per unit cost of sampling. When the value of information and cost become small simultaneously, as when we shrink training iterations or training data to 0 in hyperparameter tuning, this ratio becomes sensitive to misspecification of the GP model on . We first discuss this issue, and then develop a version of taKG for these settings.
To understand this issue, we first observe the value of information for sampling , for any , is strictly positive when the kernel has strictly positive entries.
If the kernel function for any , then for any and any , .
Proposition 1 holds even if , or has some components set to 0. Thus, if the estimated cost at such extremely low fidelities is small relative to the (strictly positive) value of information there, taKG may be drawn to sample them, even though the value of information is small. We may even spend a substantial portion of our budget evaluating at different . This is usually undesirable.
For example, in hyperparameter tuning with training iterations as our fidelity, fidelity corresponds to training a machine learning model with no training iterations. This would return the validation error on initial model parameter estimates. While this likely provides some information about the validation error of a fully trained model, specifying a kernel over that productively uses this information from a large number of hyperparameter sets would be challenging.
This issue becomes even more substantial when considering training iterations and training data together, as we do here, because cost nearly vanishes as either fidelity vanishes. Thus, there are many fidelities at each that we may be drawn to oversample.
This issue is not specific to taKG. It also occurs in previous literature (Klein et al., 2017a; McLeod et al., 2017; Klein et al., 2017b) when using the ratio of information gain to cost in an entropy search or predictive entropy search method based on the same predictive model.
To deal with this issue, Klein et al. (2017a); McLeod et al. (2017) artificially inflate the cost of evaluating at fidelity to penalize low fidelity evaluations. Similarly, Klein et al. (2017b) recommends adding a fixed cost to all evaluations motivated by the overhead of optimizing the acquisition function, but then recommends setting this to the same order of magnitude as a full-fidelity evaluation even though the overhead associated with optimizing a BO acquisition function using well-written code and efficient methodology will usually be substantially smaller. As a result, any fixed cost must be tuned to the application setting to avoid oversampling at excelssively small fidelities while still allowing sampling at moderate fidelities.
Here, we propose an alternate solution that we find works well without tuning, focusing on the setting where the cost of evaluation becomes small as the smallest component in approaches 0.
We first define to be the set of fidelities obtained by replacing one component of by . We then let be the union of these fidelities over . For example, suppose is a trace fidelity (say, training iterations), is a non-trace fidelity (say, training data size), and , corresponding to an evaluation of at and retention of the point from the trace . Then .
Fidelities in (for any ) are extremely inexpensive to evaluate and provide extremely small but strictly positive value of information. These, and ones close to them, are ones we wish to avoid sampling, even when taKG is large.
To accomplish this, we modify our value of information to suppose free observations will be provided of these problematic low-fidelity . Our modified value of information will suppose these free observations will be provided to both the benchmark, previously set to , and to the reduced expected loss, previously set to , achieved through observing at fidelities . The resulting modified value of information is
We emphasize our algorithm will not evaluate at fidelities in . Instead, it will simulate these evaluations according to the algorithm in §2.2.
We define the acquisition function using this modified value of information as
To find the point and fidelity to sample, we optimize over , fidelity , and additional lower-fidelity point(s) as we did in (2.2).
We refer to this VOI and acquisition function as “0-avoiding,” because they place 0 value on fidelities with any component equal to 0. This prevents sampling at these points as long as the cost of sampling is strictly positive.
Indeed, suppose has a component equal to . Then each element in will have one component equal to , and . Then . Moreover, the following proposition shows that if has all components strictly positive and additional regularity conditions hold, then is also strictly positive.
If has all components strictly positive, our kernel is positive definite, and the hypothesis of Proposition 1 is satisfied for given , then is strictly positive.
Thus, maximizing will never choose to sample at a fidelity with a component. Additionally, under other regularity conditions (see Corollary 1 in the supplement), is continuous in , and so the property that when a component of is 0 also discourages sampling at whose smallest component is close to 0.
2.5 Efficiently maximizing taKG and
The taKG and acquisition functions are defined in terms of a hard-to-calculate function . Here, we describe how to efficiently maximize these acquisition functions using stochastic gradient ascent with multiple restarts. The heart of this method is a simulation-based procedure for simulating a stochastic gradient of , i.e., a random variable whose expectation is the gradient of with respect to and the elements of .
To construct this procedure, we first provide a more explicit expression for . Because is the expectation of the minimum over of , we begin with the distribution of this conditional expectation for a fixed under the time- posterior distribution.
This conditional expectation can be calculated with GP regression from previous observations, the new point and fidelities , and the observations . This conditional expectation is linear in .
Moreover, is the sum of (which is multivariate normal under the posterior) and optional observational noise (which is independent and normally distributed), and so is itself multivariate normal. As a multivariate normal random variable, it can be written as the sum of its mean vector and the product of the Cholesky decomposition of its covariance matrix with an independent standard normal random vector, call it . (The coefficients of this mean vector and covariance matrix may depend on , , and previously observed data.) The dimension of is the number of components in the observation, .
Thus, the conditional expected value of the objective is a linear function (through GP regression) of another linear function (through the distribution of the observation) of .
We also have that the mean of this conditional expectation is by iterated conditional expectation.
These arguments imply the existence of a function such that simultaneously for all . In the supplement, we show where , and is the Cholesky factor of the covariance matrix .
When certain regularity conditions hold,
where is a global minimum (over ) of , and the gradient in the last line is taken holding fixed even though in reality its value depends on . Here, the interchange of expectation and the gradient is justified using results from infinitessimal perturbation analysis (L’Ecuyer, 1990) and ignoring the dependence of on is justified using the envelope theorem (Milgrom and Segal, 2002). We formalize this below in Theorem 1.
Suppose is compact, is constant, the kernel is continuously differentiable, and contains a single element almost surely. Then,
With this result in place, we can obtain an unbiased estimator of by simulating , calculating , and then returning
. Using this, together with the chain rule and an exact gradient calculation for, we can then compute stochastic gradients for taKG and .
We then use this stochastic gradient estimator within stochastic gradient ascent (Kushner and Yin, 2003) to solve the optimization problem (2.2) (or the equivalent problem for maximizing ). The following theorem shows that, under the right conditions, a stochastic gradient ascent algorithm converges almost surely to a critical point of . Its proof is in the supplement.
Assume the conditions of Theorem 1, is a compact hyperrectangle, and is continuously differentiable and bounded below by a strictly positive constant. In addition, assume that we optimize using a stochastic gradient ascent method with the stochastic gradient from Theorem 1 whose stepsize sequence satisfies , , and . Then the sequence of points from stochastic gradient ascent converges almost surely to a connected set of stationary points of .
2.6 Warm-starting from partial runs
When tuning hyperparameters using training iterations as a fidelity, we can cache the state of training after iterations, for a warm start, and then continue training later up to a larger number of iterations for less than training iterations would cost at a new .
We assume trace fidelities can be “warm-started” while non-trace fidelities cannot. We also assume the incremental cost of evaluting fidelity vector warm-starting from is the difference in the costs of their evaluations from a “cold start”. We model the cost of cold-start evaluation as in §2.1 with a Gaussian process on log(cost). To obtain training data for this model, costs observed from warm-started evaluations are summed with those of the previous evaluations they continue to approximate what the cold-start cost would be. We set to be the difference in estimated cold-start costs if a previous evaluation would allow warm starting, and to the estimated cold start cost if not.
While our approach to choosing and to evaluate is to optimize as before our computational approach from §2.5 required that be continuously differentiable (in Theorem 2). This requirement is not met. To address this, we modify the way we optimize the acquisition function. (The approach we describe also works for taKG.)
First, we maintain a basket of size at most of previously evaluated point, fidelity pairs, . For each , we optimize letting vary over those sets satisfying two conditions: (1) ; (2) componentwise for each , with equality for non-trace fidelity components. Over this set, is continuously differentiable in and the method from §2.5 can be applied.
We also optimize over all and with , but using the estimated cold-start cost function and the method from §2.5.
Among the solution to these at most optimization problems, we select the and that provide the largest at optimality, and evaluate at and .
We then update our basket. We first add the and produced by the optimization not constraining . If the basket size exceeds , we then remove the and whose optimization over produced the smallest value. In practice, we set .
2.7 Batch and Derivative Evaluations
taKG and generalize naturally following Wu and Frazier (2016) and Wu et al. (2017) to batch settings where we can evaluate multiple point, fidelity pairs simultaneously and derivative-enabled settings where we observe gradients.
The batch version uses the same acquisition functions taKG and defined above, but optimizes over a set of values for , each of which has an associated of limited cardinality.
In the derivative-enabled setting, we incorporate (optionally noisy) gradient observations into our posterior distribution directly through GP regression. We also generalize the taKG and acquisition functions to allow inclusion of gradients of the objective in the set of quantities observed in the definition of .
3 Numerical Experiments
We compare sequential, batch, and derivative-enabled with benchmark algorithms on synthetic optimization problems (Sect. 3.1), hyperparameter optimization of neural networks (Sect. 3.2), and hyperparameter optimization for large-scale kernel learning (Sect. 3.3). The synthetic and neural network benchmarks use fidelities with trace observations, while the large-scale kernel learning benchmark does not. We integrate over GP hyperparameters by sampling sets of values using the emcee package (Foreman-Mackey et al., 2013).
3.1 Optimizing synthetic functions
Here, we compare against both the sequential and batch versions of the single-fidelity algorithms KG (Wu and Frazier, 2016) and EI (Jones et al., 1998; Wang et al., 2016), a derivative-enabled single-fidelity version of KG (Wu et al., 2017), Hyperband (Li et al., 2016), and the multi-fidelity method BOCA (Kandasamy et al., 2017). BOCA is the only previous method of which we are aware that treats multiple continuous fidelities. We do not compare against FaBOLAS (Klein et al., 2017a, 2015) because the kernel it uses is specialized to neural network hyperparameter tuning.
Following experiments in Kandasamy et al. (2017), we augment four synthetic test functions, 2-d Branin, 3-d Rosenbrock, 3-d Hartmann, and 6-d Hartmann, by adding one or two fidelity controls, as described in the supplement. We set the cost of an individual evaluation of at fidelity to . Thinking of as mimicking training data and training iterations, the term mimics a cost of training that is proportional to the number of times a datapoint is visited in training. The term mimics a fixed cost associated with validating a trained model. We set the cost of a batch evaluation to the maximum of the costs of the individual evaluations, to mimic wall-clock time for running evaluations synchronously in parallel.
Fig. 1 shows results. For methods that have both sequential and batch versions, the batch size is indicated with a number before the method name. For example, 8-EI indicates expected improvement performed with batches of size 8. We run versions of using set to (taKG_0 2-points) and (taKG_0 3-points).
We first see that using the larger improves the performance of . We then see that, for both values of , sequential performs well relative to sequential competitors (1-EI, 1-KG, 1-BOCA), and batch with batch size 8 (8-) performs well relative to its batch competitors (8-EI, 8-KG, Hyperband).
Here, we consider Hyperband to be a batch method although the amount of parallelism it can leverage varies through the course of its operation.
3.2 Optimizing hyperparameters of neural networks
Here, we evaluate on hyperparameter optimization of neural networks. Benchmarks include the single-fidelity Bayesian optimization algorithms KG (Wu and Frazier, 2016) and EI (Jones et al., 1998), their batch versions, and the state-of-art hyperparameter tuning algorithms HyperBand and FaBOLAS. uses two fidelity controls: the size of the training set and the number of training iterations.
Following Li et al. (2016), we set the cost to the number of training examples passed during training divided by the number passed in full fidelity. For example, if we have
training points and the maximum number of epochs is, then the cost of evaluating a set of hyperparameters using sub-sampled training points per epoch over epochs is . Complete training has cost .
We show the -log marginal likelihood divided by the number of datapoints for tuning feedforward neural networks on MNIST (each with 20 runs); tuning convolutional neural networks on CIFAR-10 and SVHN (each with 10 runs); and for KISS-GP kernel learning.outperforms competitors in both sequential and batch settings.
Feedforward neural networks on MNIST
We tune a fully connected two-layer neural network on MNIST. The maximum number of epochs allowed is . We optimize hyperparameters: learning rate, dropout rate, batch size and the number of units at each layer.
Fig. 2 shows that sequential performs much better than the sequential methods KG, EI and the multi-fidelity hyperparameter optimization algorithm FaBOLAS. with a batch size substantially improves over batch versions of KG and EI, and also over the batch method Hyperband.
Convolutional neural networks on CIFAR-10 and SVHN
We tune convolution neural networks (CNNs) on CIFAR-10 and SVHN. Our CNN consists of 3 convolutional blocks and a softmax classification layer. Each convolutional block consists of two convolutional layers with the same number of filters followed by a max-pooling layer. There is no dropout or batch-normalization layer. We split the CIFAR-10 dataset into 40000 training samples, 10000 validation samples and 10000 test samples. We split the SVHN training dataset into 67235 training samples and 6000 validation samples, and use the standard 26032 test samples. We apply standard data augmentation: horizontal and vertical shifts, and horizontal flips. We optimizehyperparameters to minimize the classification error on the validation set: the learning rate, batch size, and number of filters in each convolutional block. Hyperband uses the size of the training set as its resource (it can use only one resource or fidelity), using a bracket size of as in Li et al. (2016) and the maximum resource allowed by a single configuration set to 40000. We set the maximum number of training epochs for all algorithms to for CIFAR-10 and for SVHN. Because of the computational expense of training CNNs, we leave out some benchmarks, dropping the single-fidelity method EI in favor of the structurally similar single-fidelity method KG, and performing batch evaluations for only some methods.
Fig. 2 shows that sequential outperforms its competitors (including the batch method Hyperband) on both problems. Using batch evaluations with on CIFAR-10 improves performance even further.
When we train using optimized hyperparameters on the full training dataset for epochs, test data classification error is for CIFAR-10 and for SVHN.
3.3 Optimizing hyperparameters for large-scale kernel learning
We test derivative-enabled (ta-dKG) in a large-scale kernel learning example: the 1-d demo example for KISS-GP (Wilson and Nickisch, 2015) on the GPML website (Rasmussen and Nickisch, 2016). In this example, we optimize 3 hyperparameters (marginal variance, length scale, and variance of the noise) of a GP with an RBF kernel on million training points to maximize the log marginal likelihood. We evaluate both the log marginal likelihood and its gradient using the KISS-GP framework. We use two continuous fidelity controls: the number of training points and the number of inducing points. We set the maximum number of inducing points to .
We compare ta-d-KG to the derivative-enabled knowledge gradient (d-KG) (Wu et al., 2017), using both algorithms in the sequential setting (1-dKG and 1-cf-dKG) and with a batch size of 4 (4-dKG and 4-cf-dKG). We leave out methods that are unable to utilize derivatives, as these are likely to substantially underperform.
Fig. 2 shows that ta-dKG successfully utilizes inexpensive function and gradient evaluations to find a good solution more quickly than d-KG, in both the sequential and batch setting.
We propose a novel multi-fidelity acquisition function, the trace aware knowledge-gradient, which leverages special structure provided by trace observations, is able to handle multiple simultaneous continuous fidelities, and generalizes naturally to batch and derivative settings. This acquisition function uses traces to find good solutions to global optimization problems more quickly than state-of-the-art algorithms in application settings including deep learning and kernel learning.
- Bartle (1966) Bartle, R. G. (1966). The elements of integration. John Wiley & Sons.
- Bergstra and Bengio (2012) Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
- Domhan et al. (2015) Domhan, T., Springenberg, J. T., and Hutter, F. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, pages 3460–3468.
- Foreman-Mackey et al. (2013) Foreman-Mackey, D., Hogg, D. W., Lang, D., and Goodman, J. (2013). emcee: the mcmc hammer. Publications of the Astronomical Society of the Pacific, 125(925):306.
- Howard (1966) Howard, R. (1966). Information Value Theory. Systems Science and Cybernetics, IEEE Transactions on, 2(1):22–26.
- Huang et al. (2006) Huang, D., Allen, T., Notz, W., and Miller, R. (2006). Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369–382.
- Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492.
- Kandasamy et al. (2017) Kandasamy, K., Dasarathy, G., Schneider, J., and Poczos, B. (2017). Multi-fidelity bayesian optimisation with continuous approximations. In ICML.
- Klein et al. (2015) Klein, A., Bartels, S., Falkner, S., Hennig, P., and Hutter, F. (2015). Towards efficient bayesian optimization for big data. In NIPS 2015 workshop on Bayesian Optimization (BayesOpt 2015), volume 134, page 98.
- Klein et al. (2017a) Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. (2017a). Fast bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics. ArXiv preprint arXiv:1605.07079.
- Klein et al. (2017b) Klein, A., Falkner, S., Mansur, N., and Hutter, F. (2017b). Robo: A flexible and robust bayesian optimization framework in python. In NIPS 2017 Bayesian Optimization Workshop.
- Kushner and Yin (2003) Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
- Lam et al. (2015) Lam, R., Allaire, D., and Willcox, K. (2015). Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. In 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, page 0143.
- L’Ecuyer (1990) L’Ecuyer, P. (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383.
- Li et al. (2016) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.
- McLeod et al. (2017) McLeod, M., Osborne, M. A., and Roberts, S. J. (2017). Practical bayesian optimization for variable cost objectives. arXiv preprint arXiv:1703.04335.
- Milgrom and Segal (2002) Milgrom, P. and Segal, I. (2002). Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601.
- Poloczek et al. (2017) Poloczek, M., Wang, J., and Frazier, P. I. (2017). Multi-information source optimization. In Advances in Neural Information Processing Systems.
- Rasmussen and Nickisch (2016) Rasmussen, C. E. and Nickisch, H. (2016). documentation for gpml matlab code version 4.0.
- Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
- Smith (1995) Smith, S. P. (1995). Differentiation of the cholesky algorithm. Journal of Computational and Graphical Statistics, 4(2):134 – 147.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
- Swersky et al. (2014) Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896.
- Wang et al. (2016) Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2016). Parallel bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149.
Wilson and Nickisch (2015)
Wilson, A. G. and Nickisch, H. (2015).
Kernel interpolation for scalable structured gaussian processes (kiss-gp).In ICML, pages 1775–1784.
- Wu and Frazier (2016) Wu, J. and Frazier, P. (2016). The parallel knowledge gradient method for batch bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126–3134.
- Wu et al. (2017) Wu, J., Poloczek, M., Wilson, A. G., and Frazier, P. I. (2017). Bayesian optimization with gradients. In Advances in Neural Information Processing Systems.
5 Supplementary Material
5.1 Background: Gaussian processes
We put a Gaussian process (GP) prior (Rasmussen and Williams, 2006) on the function . The GP prior is defined by its mean function and kernel function . These mean and kernel functions have hyperparameters, whose inference we discuss below.
We assume that evaluations of are subject to additive independent normally distributed noise with common variance . We treat the parameter as a hyperparameter of our model, and also discuss its inference below. Our assumption of normally distributed noise with constant variance is common in the BO literature (Klein et al., 2017a).
The posterior distribution on after observing function values at points with observed values remains a Gaussian process (Rasmussen and Williams, 2006), and with and evaluated at a point (or pair of points , ) given as follows
We emphasize that a single evaluation of provides multiple observaions of , because of trace observations. taKG chooses to retain 2 observations per evaluation, and so will be twice the number of evaluations.
This statistical approach contains several hyperparameters: the variance , and any parameters in the mean and kernel functions. We treat these hyperparameters in a Bayesian way as proposed in Snoek et al. (2012). We analogously train a separate GP on the logarithm of the cost of evaluating .
5.2 Proofs Details
In this section we prove the theorems of the paper. We may assume without loss of generality that , and we denote the number of fidelities by . Observe that the dimension of the vector is . We first show some smoothness properties of , and in the following lemma.
We assume that the domain is compact, is a constant and the kernel is continuously differentiable. We then have that
, and are both continuously differentiable for any vector .
For any , is continuously differentiable respect to .
is continuously differentiable.
is differentiable if .
The posterior parameters of the Gaussian process , and are continuously differentiable if the kernel and are both continuously differentiable. Observe that , and so is continuously differentiable because is continuously differentiable. This proves (1) and (3). (4) follows easily from (3).
To prove (2) we only need to show that is continuously differentiable respect to . This follows from the fact that multiplication, matrix inversion (when the inverse exists), and Cholesky factorization (Smith, 1995) preserve continuous differentiability. This ends the proof. ∎
We now prove Theorem 1.
We now show that we can interchange the gradient and the expectation in . Observe that the domain of is compact, is continuously differentiable respect to by Lemma 1, and so is bounded. Consequently, Corollary 5.9 of Bartle (1966)
implies that we can interchange the gradient and the expectation. The statement of the theorem follows from the strong law of large numbers. ∎
The following corollary follows from the previous proof.
Under the assumptions of the previous theorem, is continuous.
We now prove Theorem 2.
We prove this theorem using Theorem 2.3 of Section 5 of Kushner and Yin (2003), which depends on the structure of the stochastic gradient of the objective function. In addition, we simplify the notation and denote by .
The theorem from Kushner and Yin (2003), requires the following hypotheses:
, , and .
There exist uniformly continuous functions of , and random vectors , such that almost surely and
Furthermore, there exists a continuous function , such that for each ,
for each , where is the unique value of such that , where ,.
There exists a continuously differentiable real-valued function , such that and it is constant on each connected subset of stationary points.
The constraint functions defining are continuously differentiable.
We now prove that our problem satisfy these hypotheses. (1) is true by hypothesis of the lemma.
Let’s prove (2). We first assume that ,
where , which is finite because the domain of the problem is compact and is continuous by Lemma 1. Since is continuously differentiable bounded below by a constant , thus we conclude that the supremum over of is bounded. If , is the average of i.i.d. random vectors, whose squared norm expectation is finite, and so must be finite too.
We now prove (3). For each , define
where , and is a standard normal random vector. Let’s prove that is continuous. In the proof of Theorem 1, we show that is continuous in . Furthermore,
Consequently is continuous by Corollary 5.7 of Bartle (1966). In Theorem 1, we also show that is continuous in . Since is continuously differentiable, we conclude that is continuous. By defining for all , and , we conclude the proof of (3).
Finally, define . Observe that in Lemma 2, we show that we can interchange the expectation and the gradient in , and so In a connected subset of stationary points, we have that , and so is constant. This ends the proof of the theorem. ∎
proof of Proposition 1.
where is a standard normal random variable. By Jensen’s inequality, we have
where . The inequality becomes equal only if is a linear function of for any fixed , i.e the argmin for the inner optimization function doesn’t change as we vary , which is not true if i.e. evaluating at provides value to determine the argmin of the surface . ∎
proof of Proposition 2.
The proof follows a very similar argument than the previous proof. By Jensen’s inequality, we have that
The inequality becomes equal only if the argmin for the inner optimization function doesn’t change as we vary the normal random vector, which is not true under our assumptions.
5.3 GPs for Hyperparameter Optimization
In the context of hyperparameter optimization with two continuous fidelities, i.e. the number of training iterations () and the amount of training data (), we set the kernel function of the GP as
where is a square-exponential kernel. If we assume that the learning curve looks like
then inspired by Swersky et al. (2014), we set the kernel as
where are hyperparameters. We add an intercept compared to the kernel in Swersky et al. (2014) to model the fact that the loss will not diminish. We assume that the kernel has the form
where are hyperparameters.
All the hyperparameters can be treated in a Bayesian way as proposed in Snoek et al. (2012).
5.4 Additional experimental details
5.4.1 Synthetic experiments
Here we define in detail the synthetic test functions on which we perform numerical experiments The test functions are:
5.4.2 Real-world experiments
The range of search domain for feedforward NN experiments: the learning rate in , dropout rate in , batch size in and the number of units at each layer in .
The range of search domain for CNN experiments: the learning rate in , batch size