Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation

06/25/2020
by Victor Picheny, et al.

Many machine learning models require a training procedure based on running stochastic gradient descent. A key element for the efficiency of those algorithms is the choice of the learning rate schedule. While finding good learning rate schedules using Bayesian optimisation has been tackled by several authors, adapting the schedule dynamically in a data-driven way is an open question. This is of high practical importance to users who need to train a single, expensive model. To tackle this problem, we introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-regressive formulation, that flexibly adjusts to abrupt changes of behaviour induced by new learning rate values. As illustrated, this model is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks (in a classical BO setup), as well as warm-starting it for a new task.


1 Introduction

The great recent successes of machine learning generally rely on models with high complexity (e.g. deep models) and extensive datasets. Those models usually require running a training procedure over a (large) set of parameters, which often amounts to minimising a loss function with an iterative algorithm such as stochastic gradient descent (SGD) or the non-linear conjugate gradient method. In the deep learning community, a particular focus has been given to SGD algorithms such as Adagrad [9], RMSProp [39], and in particular Adam [1], which despite recent discussions and improvements [23, 30] can be considered as the state of the art and is widely used in practice [12]. Unfortunately, this training procedure is often extremely time-consuming due to the high model complexity and the amount of data at hand; hence, performing it with the best possible efficiency is paramount for many applications, in particular those for which training is done repeatedly, or continuously, for example in streaming models.

The performance of nearly all SGD variants depends critically on the choice of the learning rate [3], which in short tunes by how much the algorithm should follow the (noisy) gradient signal. The default choice for most algorithms is a constant learning rate, although recent experiments showed that a time-varying value can be extremely beneficial [2, 33, 34]. In any case, learning rate values need to be set, and tuning them by hand can be excessively burdensome. Bayesian optimisation (BO), on the other hand, is now a well-established tool for model selection and parameter tuning [4, 35]. While learning rates are sometimes included in the parameters to be tuned [4], to our knowledge no work has been dedicated to speeding up SGD algorithms on-the-fly using BO: this is the purpose of the present work.

Let $f(\theta, \tau)$ be the objective optimised by SGD; typically in deep learning, $f$ can be a mean squared error or an evidence lower bound (ELBO).¹ Here $\theta$ denotes a set of parameters, while $\tau$ defines the learning task at hand, that is, a particular model structure and a dataset. The task set $\mathcal{T}$ is typically a singleton (when a single model is fitted to a single dataset) or a discrete set, for instance when the same model is used to fit different datasets. Given $\tau$, an SGD algorithm produces a sequence of parameters $\theta_1, \dots, \theta_T$, with $T$ a pre-defined number of iterations (i.e. optimisation steps).

¹ In the following, without loss of generality, we use the convention that $f$ should be maximised.

In our setup we assume that the learning rate varies from one iteration to the next. One way to parametrise it, which avoids working directly with a high-dimensional vector of size $T$, is to use a piecewise-constant form. Defining a sequence of change points $0 = t_0 < t_1 < \dots < t_L = T$ of length $L + 1$, the learning rate curve is defined by constants $\lambda_1, \dots, \lambda_L$ such that the learning rate equals $\lambda_{k+1}$ on $[t_k, t_{k+1})$.

Assuming lower and upper bounds for the learning rate, without loss of generality our tuning parameters can be rescaled so that $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_L) \in [0, 1]^L$. For any task $\tau$, our objective is to seek the best possible learning rate schedule, that is, the one that maximises the objective after $T$ steps:

$$\boldsymbol{\lambda}^*(\tau) = \arg\max_{\boldsymbol{\lambda} \in [0,1]^L} f\big(\theta_T(\boldsymbol{\lambda}, \tau), \tau\big), \qquad (1)$$

with $\theta_T(\boldsymbol{\lambda}, \tau)$ the parameters returned after $T$ iterations (now written as a function of the learning rate schedule and task $\tau$).
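For concreteness, the piecewise-constant parametrisation can be sketched as follows; this is a minimal example, and the function and variable names are ours rather than part of any original implementation:

```python
import numpy as np

def piecewise_constant_schedule(lr_values, total_steps):
    """Expand L constant learning-rate values into one value per SGD iteration.

    Each of the L intervals covers total_steps // L iterations; the remainder,
    if any, keeps the last value.
    """
    n_intervals = len(lr_values)
    steps_per_interval = total_steps // n_intervals
    schedule = np.repeat(lr_values, steps_per_interval)
    if len(schedule) < total_steps:
        pad = np.full(total_steps - len(schedule), lr_values[-1])
        schedule = np.concatenate([schedule, pad])
    return schedule

# Example: L = 3 intervals over T = 3000 iterations.
lrs = piecewise_constant_schedule([1e-2, 3e-3, 1e-3], 3000)
```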

Now, BO classically relies on Gaussian process (GP) surrogate models of the function to optimise. While predicting the trace (that is, the value of $f(\theta_t, \tau)$ for all $t \le T$) has already been addressed in related settings [28, 8, 19], adapting those models to make them dependent on a learning rate that changes over time is often not possible. Moreover, changing the learning rate often induces drastic changes in the trace behaviour, which makes typical parametric models unfit for the task. Our first contribution is the design of an auto-regressive model for the trace of an SGD algorithm that flexibly adjusts to abrupt changes of behaviour. The model, which serves as a base for our sampling strategies, is presented in section 2.

In the classical BO setting [32], an initial set of experiments, typically with space-filling properties, is run first. The gathered data is then used to build a predictive model for $f(\theta_T(\boldsymbol{\lambda}, \tau), \tau)$, which allows us to estimate $\boldsymbol{\lambda}^*(\tau)$. In BO, this estimate is gradually improved by designing a sequence of additional experiments that enhance the prediction capability of the model, following an exploration / exploitation trade-off.

However, oftentimes the user is faced with training a single model, without much prior information on an appropriate choice of the learning rate schedule. In that case, the relevance of BO is somewhat debatable, as it is uncertain that the outcome of many SGD runs would outperform a single but much longer one with an empirically chosen learning rate. To overcome this, [38, 22, 10] used forecasting models to stop unpromising runs early and drastically reduce the computational cost of the BO loop. Yet, such approaches can only provide long-term recommendations at the price of repeated long runs, and are mostly relevant when other parameters are tuned simultaneously with the learning rate.

We propose here an alternative strategy, unexplored in the BO literature, which is to adapt a single run on the fly rather than terminate some runs and start new ones from scratch. In addition, as modern hardware architectures favour parallel computing, we want to leverage the fact that training procedures can be run in parallel (say, one on each available GPU). Section 3 is dedicated to this problem.

In other practical situations, training is done repeatedly over similar tasks: for instance, the same model fitted to different datasets, or different variations of a model fitted to a dataset. In that case, one seeks an optimal mapping [or profile optima, 11] from the space of tasks to the space of learning rates, such that each task is assigned its own optimal schedule $\boldsymbol{\lambda}^*(\tau)$. Transferring information from one task to another allows designing strategies much more efficient than running independent BO loops for each task [37]. This framework is considered in section 4.

2 A GP-based NARX Model For Optimisation Traces

For simplicity, we focus for now on the case where $\mathcal{T}$ is a singleton (i.e. we consider a single dataset and model) and remove the dependence on $\tau$ from our notation. The generalisation to multiple tasks is deferred to section 4.

2.1 Modelling of Optimisation Traces

Define first $y_t = f(\theta_t)$, $0 \le t \le T$, the trace of an optimisation run, and denote by $y_{t_k}$ the trace value at each change point of the learning rate.

There exist many options to fit a parametric model to the trace: for instance, [8] proposes 11 models for traces, including exponential and power forms. While those forms make sense with a constant learning rate, they cannot properly fit a varying one, as changing the learning rate drastically modifies the trace dynamics. This is illustrated in fig. 2 (left), where the visible trace trend abruptly changes every time the learning rate changes (see also fig. 4). Hence, instead of fitting a single parametric model, we propose to use a composite one, based on an auto-regressive formulation.

First, we model $y_{t_0}$ as an i.i.d. Gaussian variable, $y_{t_0} \sim \mathcal{N}(\mu_0, \sigma_0^2)$, to account for randomness in the starting point. Then, we propose to model the trace at the change points using a non-linear auto-regressive model with exogenous inputs [NARX(q), 21], whose general expression is:

$$y_{t_{k+1}} = F\big(y_{t_k}, \dots, y_{t_{k-q+1}}, \lambda_{k+1}, \dots, \lambda_{k-q+2}\big) + \varepsilon_{k+1},$$

where the new state is a non-linear function of the $q$ past states and exogenous inputs, plus some independent noise. We propose a specific form for $F$ as follows:

$$F\big(y_{t_k}, \lambda_{k+1}\big) = y_{t_k} + h\big(t_{k+1} - t_k;\ g(y_{t_k}, \lambda_{k+1})\big),$$

where $g$ is a latent function that modulates the increments according to the current and past trace values and learning rates, and $h$ (non-negative, to ensure the monotonicity of the trace) is a link function that returns the trace increment for any given time according to its parameters. As the trace may be recorded at times $t$ other than the change points, our model for the trace is:

$$y_t = y_{t_k} + h\big(t - t_k;\ g(y_{t_k}, \lambda_{k+1})\big) + \epsilon_t, \qquad t_k \le t < t_{k+1},$$

where $\epsilon_t$ represents the noise in the case where the objective function is not evaluated exactly and/or unaccounted-for deviations between observations and the model.

In the following, we use either piecewise linear or piecewise exponential forms for the traces. With $g_1$ respectively defining the linear slope, or $g_1$ corresponding to a logit offset and $g_2$ to a logit rate, this gives:

$$h(\Delta t;\ g_1) = s(g_1)\,\Delta t, \qquad (2)$$
$$h(\Delta t;\ g_1, g_2) = s(g_1)\,\big(1 - e^{-s(g_2)\,\Delta t}\big), \qquad (3)$$

where $g = (g_1, \dots, g_p)$ is a (multi-output) GP and $s$ is here the softplus function, $s(x) = \log(1 + e^x)$, to ensure monotonicity. Note that any of the parametric models of e.g. [8] may be used here as $h$. In a sense, our model extends those to a time-varying learning rate.

Focusing on the linear case and setting $q = 1$, we have $y_{t_{k+1}} = y_{t_k} + s\big(g_1(y_{t_k}, \lambda_{k+1})\big)\,(t_{k+1} - t_k)$. We directly see the dynamics implied by our model: increments are linear with respect to time, but the slope depends non-linearly on an exogenous input (the learning rate) and on the current state (intuitively, we expect flatter slopes if $y_{t_k}$ is high, as the model is close to convergence). Considering the exponential case (eq. 3) and setting $q = 1$, we show the GP surfaces in fig. 1 for $g_1$ (left) and $g_2$ (right).
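To make the two link functions concrete, here is a minimal numerical sketch. The exact parametrisation of the exponential form is our reading of eq. 3 (a saturating increment with a softplus-transformed offset and rate) and may differ from the actual implementation:

```python
import numpy as np

def softplus(x):
    # s(x) = log(1 + exp(x)): maps the latent GP output to a positive value,
    # which keeps the trace increments non-negative (monotone trace).
    return np.log1p(np.exp(x))

def linear_increment(dt, g_slope):
    # Eq. (2), linear case: the increment grows linearly with the elapsed
    # time dt, with a slope modulated by the latent GP value g_slope.
    return softplus(g_slope) * dt

def exponential_increment(dt, g_offset, g_rate):
    # Eq. (3), exponential case (our reading): a saturating increment of size
    # softplus(g_offset), approached at rate softplus(g_rate).
    return softplus(g_offset) * (1.0 - np.exp(-softplus(g_rate) * dt))
```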

Figure 1: Consider the NARX(1) model of section 2 with the exponential link function of eq. 3. On the left, we plot the mean ± the empirical std. dev. of the GP for the increment factor. On the right, we plot the GP surface which models the decay rate of the exponential. Given the NARX(1) structure, both GPs are only a function of the most recent trace value ('starting point') and the current learning rate ('LR value'). These surfaces are learned as part of the experiment of section 4.3. We notice, for example, that for low starting values of the objective, the response behaves like a step function.

Importantly, the same latent function $g$ is used over all the time intervals and only depends on $(y_{t_k}, \lambda_{k+1})$. Hence, in the NARX(1) case, our model implies that the same initial conditions and learning rates would lead to the same outcome, regardless of how long the algorithm has been run before. In that sense, the model can be considered Markovian, as it is independent of the path followed by the optimiser before $t_k$. While our model is general for higher-order Markov chains (i.e. NARX(q)), we found empirically that the NARX(1) formulation provides the right trade-off between accuracy, robustness and ease of inference, as the latent function $g$ is then only defined over a low-dimensional input space.

Finally, we follow the chained GP framework [31] for $g$, and choose a GP prior with independent components.

2.2 Learning Using Variational Inference

We consider here a fixed link function, so the inference task boils down to estimating the parameters of the latent function $g$ and of $y_{t_0}$ (the starting value of the trace). Assume that we have run the optimiser $N$ times for different learning rate schedules and recorded the trace of each run at a set of times in $[0, T]$. Let $y^i_t$ denote an observation of the trace at time $t$ for a learning rate schedule $\boldsymbol{\lambda}^i$. We assume that each trace is evaluated at the beginning of each time interval (i.e. at the change points $t_k$), which gives explicit observations of the values $y_{t_k}$; we denote the whole dataset henceforth $\mathcal{D}$.

Figure 2: Left: four traces (top) corresponding to the training of an SVGP model using Adam, and the corresponding learning rates (bottom). Middle: notations and model. Right: corresponding data point locations for the latent GP.

The parameters of $y_{t_0}$ can be inferred by maximum likelihood from $\mathcal{D}$. We now focus on learning the latent GP function $g$. Without loss of generality, we assume in the following that our link function only requires a single GP output (i.e. $p = 1$). When a link function depends on more than one GP output, we simply apply the same procedure in parallel and model every component independently.

We follow a fully Bayesian approach to infer $g$ from the given dataset. We place a GP prior on the latent function and assume that $\epsilon_t$ is i.i.d. Gaussian noise with zero mean and variance $\sigma^2$, so that:

$$y_t \mid g \sim \mathcal{N}\Big(y_{t_k} + h\big(t - t_k;\ g(y_{t_k}, \lambda_{k+1})\big),\ \sigma^2\Big),$$

with $k$ such that $t_k \le t < t_{k+1}$. For general link functions, exact inference of the latent function $g$ is not possible due to the non-linear transformation.

A classical solution is to follow the Sparse Variational GP (SVGP) framework [40, 15], which relies on two components. First, it introduces a set of inducing variables $\mathbf{u}$, whose main use is to specify the function value of the posterior GP at a specific set of pseudo-inputs $Z$. The distribution of the inducing variables is specified by a fully parameterised Gaussian with mean $\mathbf{m}$ and covariance $\mathbf{S}$, which are the variational parameters we want to learn. The prior GP on $g$ can then be conditioned on $\mathbf{u}$, which leads to a marginal posterior with mean $\mu(\cdot)$ and variance $\varsigma^2(\cdot)$:

$$\mu(\mathbf{x}) = \mathbf{k}_Z(\mathbf{x})^\top \mathbf{K}_{ZZ}^{-1}\,\mathbf{m}, \qquad
\varsigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}_Z(\mathbf{x})^\top \mathbf{K}_{ZZ}^{-1}\big(\mathbf{K}_{ZZ} - \mathbf{S}\big)\mathbf{K}_{ZZ}^{-1}\,\mathbf{k}_Z(\mathbf{x}), \qquad (4)$$

where $\mathbf{K}_{ZZ} = k(Z, Z)$ and $\mathbf{k}_Z(\mathbf{x}) = k(Z, \mathbf{x})$.

With this approximation in place, we can set up our model's optimisation objective, which is a lower bound on the log marginal likelihood [ELBO, 16], equal to

$$\mathcal{L} = \sum_{i} \mathbb{E}_{q(g_i)}\big[\log p(y_i \mid g_i)\big] - \mathrm{KL}\big[q(\mathbf{u})\,\|\,p(\mathbf{u})\big], \qquad (5)$$

where $q(g_i)$ is the marginal posterior of eq. 4 at the $i$-th data point, and $\mathrm{KL}\big[q(\mathbf{u})\,\|\,p(\mathbf{u})\big]$ is the Kullback-Leibler divergence between the approximate posterior and the prior of the inducing variables $\mathbf{u}$ [24]. It can be calculated analytically given the Gaussianity of both the prior and the posterior on $\mathbf{u}$. The expectation can be estimated in an unbiased way using Monte Carlo, by sampling from the marginal posterior (eq. 4) and propagating the samples through the link function. Optimising eq. 5 with respect to the model parameters can be done by gradient descent thanks to automatic differentiation toolkits.

In our experiments we made use of the GPflow library [25]. More precisely, we used the provided multi-output framework for GPs [41], which is well-suited for implementing and optimising these complex, composite GP models.

Note that as the approximation is sparse (i.e. it relies on a few inducing points), it can handle much larger datasets than classical GP models. This is a decisive advantage here as traces typically contain thousands of datapoints.
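As an illustration of the inference machinery, the following sketch fits a sparse variational GP with GPflow to (previous trace value, learning rate) → increment data. It uses a plain Gaussian likelihood for brevity, whereas the actual model propagates the latent GP through the link function (e.g. via the multi-output framework of [41]); the data arrays are placeholders.

```python
import numpy as np
import gpflow

# Placeholder data: each row of X is (previous trace value, learning rate),
# each row of Y is the observed trace increment over the next interval.
X = np.random.rand(500, 2)
Y = np.random.rand(500, 1)

M = 100  # number of inducing points for the sparse approximation
Z = X[np.random.choice(len(X), M, replace=False)].copy()

model = gpflow.models.SVGP(
    kernel=gpflow.kernels.Matern52(lengthscales=[1.0, 1.0]),
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=Z,
    num_data=len(X),
)

# Maximise the ELBO (eq. 5) with respect to kernel hyperparameters,
# noise variance and variational parameters (m, S) jointly.
gpflow.optimizers.Scipy().minimize(
    model.training_loss_closure((X, Y)), model.trainable_variables
)

mean, var = model.predict_f(X[:5])  # posterior mean/variance of eq. 4
```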

2.3 Generating Trace Predictions

Since the main objective is to maximise the trace after $T$ iterations, we would like to predict $y_T$, which requires access to $y_{t_{L-1}}$. Recursively, we see that predicting $y_T$ is achieved by predicting the corresponding sequence $y_{t_1}, \dots, y_{t_{L-1}}$. Importantly, the distribution of $y_{t_k}$ is not available analytically because of the arbitrary link function. Hence, we must resort to sampling. Given the recursive structure (the value of $y_{t_k}$ is necessary to draw $y_{t_{k+1}}$), we first sample $y_{t_0}$, then $y_{t_1}$ after conditioning on $y_{t_0}$, and recursively sample $y_{t_{k+1}}$ ($k = 1, \dots, L - 1$) after conditioning on $y_{t_k}$. Drawing multiple samples for a given learning rate schedule may be used to provide any statistic of $y_T$, such as mean, variance and quantiles.
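A Monte-Carlo roll-out of this recursive sampling could look as follows. This is only a sketch: `model.predict_f` is assumed to return the posterior mean and variance of the single latent GP output, the linear link of eq. 2 is used, and posterior correlations between inputs are ignored.

```python
import numpy as np

def sample_traces(model, y0, lr_schedule, dt, n_samples=100):
    """Recursively sample trace trajectories at the change points.

    y0 is the (sampled or observed) trace value at t_0; lr_schedule holds one
    learning rate per interval of length dt.
    Returns an array of shape (n_samples, len(lr_schedule) + 1).
    """
    traces = np.full((n_samples, len(lr_schedule) + 1), y0, dtype=float)
    for k, lr in enumerate(lr_schedule):
        X = np.column_stack([traces[:, k], np.full(n_samples, lr)])
        mean, var = model.predict_f(X)
        g = np.random.normal(mean.numpy().ravel(), np.sqrt(var.numpy().ravel()))
        # Propagate each latent sample through the (here, linear) link function.
        traces[:, k + 1] = traces[:, k] + np.log1p(np.exp(g)) * dt
    return traces

# Any statistic of y_T follows from the samples, e.g. a median prediction:
# np.quantile(sample_traces(model, y0, lrs, dt)[:, -1], 0.5)
```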

3 Dynamic Tuning of the Learning Rate

We now address the question of dynamically tuning the learning rate for a single model and task, using the model previously defined. To do so, we depart from the standard BO approach by modifying a small set of runs on the fly instead of repeatedly starting new ones.

3.1 Proposed Strategy

We consider the following framework. We assume that $K$ SGD runs are performed in parallel with different learning rates. For simplicity, we assume synchronicity (all runs progress at the same speed). Each run is carried out independently over the time intervals $[t_k, t_{k+1})$. At each change point $t_k$, the traces are fed to the model, which is then used to schedule new learning rates for the next interval.

The learning rates are chosen as follows. At initialisation, as there is no model to help make decisions, the $K$ learning rates are taken on a uniform grid between the bounds. At any later change point $t_k$ ($k \ge 1$), the current trace observations are first integrated into the model. Then, each new learning rate is chosen to maximise a $\beta$-quantile (computed here empirically by sampling) of the trace at the end of the next interval:

$$\lambda_{k+1}^{j} = \arg\max_{\lambda \in [0, 1]}\ q_{\beta_j}\big(y_{t_{k+1}} \mid y_{t_k}, \lambda\big), \qquad j = 1, \dots, K. \qquad (6)$$

The $\beta_j$'s are used here to balance exploitation and exploration: $\beta$'s close to one may lead to very optimistic choices (learning rates for which the outcome is highly uncertain), while $\beta$'s close to zero result in risk-averse choices (guaranteed immediate performance). Hence, to maximise our diversity of choices, we spread the $\beta_j$'s evenly between zero and one. In the case $K = 1$, this forces the choice $\beta = 0.5$, which is a risk-neutral strategy. Following an optimistic strategy (say, a $\beta$ close to one) instead may enhance exploration and improve long-term performance.
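A possible implementation of the quantile criterion of eq. 6, over a finite grid of candidate learning rates, is sketched below; the one-step sampler mirrors the roll-out of section 2.3, and all names are ours:

```python
import numpy as np

def choose_learning_rate(model, y_current, candidate_lrs, dt, beta, n_samples=200):
    """Return the candidate maximising the beta-quantile of the predicted trace
    value at the end of the next interval (eq. 6), estimated by sampling."""
    def one_step_samples(lr):
        X = np.column_stack([np.full(n_samples, y_current), np.full(n_samples, lr)])
        mean, var = model.predict_f(X)
        g = np.random.normal(mean.numpy().ravel(), np.sqrt(var.numpy().ravel()))
        return y_current + np.log1p(np.exp(g)) * dt  # linear link, as in eq. 2

    scores = [np.quantile(one_step_samples(lr), beta) for lr in candidate_lrs]
    return candidate_lrs[int(np.argmax(scores))]

# With K parallel runs, each run j gets its own quantile order beta_j,
# e.g. np.linspace(0.1, 0.9, K), to diversify the degree of optimism.
```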

In addition, at each change point we greedily select the run with the highest current trace value and duplicate it $K$ times while discarding the others. While this was found to accelerate performance significantly, keeping each run may prove a valid alternative on problems that require more exploration of learning rate schedules.

The pseudo-code of the strategy is given in alg. 1. Note that a relevant choice of $L$ is problem-dependent: a large $L$ allows more changes of the learning rate value, but increases the computational overhead due to model fitting and solving eq. 6. Besides, to facilitate inference, the trace should not be sliced into too many parts (fig. 2). In the experiment reported below, the chosen value of $L$ resulted in a negligible BO overhead.

Choose $K$, $L$, $T$ and the quantile orders $\beta_1, \dots, \beta_K$; set $k = 0$;
Take $\lambda_1^1, \dots, \lambda_1^K$ on a regular grid;
Run the $K$ SGDs for $t_1$ steps;
Gather traces, read the $y_{t_1}$'s, and build the model;
for $k = 1$ to $L - 1$ do
       Get $j^* = \arg\max_j y_{t_k}^j$;
       Duplicate $j^*$'s SGD run $K$ times;
       For $j = 1, \dots, K$, find $\lambda_{k+1}^j$ by solving eq. 6;
       Pursue the duplicated runs for $t_{k+1} - t_k$ iterations with learning rates $\lambda_{k+1}^1, \dots, \lambda_{k+1}^K$, resp.;
       Gather traces, collect the $y_{t_{k+1}}$'s, and update the model;
      
end for
Find $j^* = \arg\max_j y_T^j$, return the corresponding parameters;
Algorithm 1 Single task tuning

3.2 Experiment: Dynamic Tuning of the Learning Rate on CIFAR

We apply our approach to the training of a vanilla ResNet [14] neural network with 56 layers on the classification dataset CIFAR-10 [20], which contains 60,000 32×32 colour images. We use an implementation of the ResNet model for the CIFAR dataset available in Keras [7]. We first split the dataset into 50,000 training and 10,000 testing images. The Adam optimiser is used for 100 epochs to maximise the log cross-entropy for future predictions.

Figure 3: Learning rates (bottom) and corresponding traces (top), following Algorithm 1. Dashed lines highlight 'failed' runs according to our classifier. The baselines (constant learning rates) are shown by dotted lines and the optimal learning rate schedule is given in black.

Our BO setup is as follows: the 100 epochs are divided into $L$ equal intervals and $K$ optimisations are run in parallel. The objective is recorded every 50 iterations. The GP model for $g$ uses a Matérn kernel, a linear link function and a sparse set of inducing points. A practical issue we face is that abruptly increasing the learning rate sometimes causes aberrant behaviour (which can be seen as large peaks in the trace in fig. 3). To avoid this problem, we use a GP classification model [15] to predict which runs are likely to fail based on the observed trace values. The optimisation of eq. 6 is then restricted to the learning rate values for which the predicted probability of failure is below a threshold. We set the failure threshold for a given trace inversely proportional to its quantile order $\beta_j$, as we want traces with larger $\beta_j$ to be more explorative. In addition, we limit the maximum learning rate change to one order of magnitude.

As baselines, we use five constant learning rate schedules uniformly spread in log space between the learning rate bounds, and 12 learning rate schedules with exponential decay (three initial values and four decay rates), such that the learning rate at each epoch equals the initial value multiplied by the decay rate raised to the power of the epoch index.
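The baseline schedules can be generated as follows (a sketch; the actual initial values and decay rates used in the experiment are not reproduced here):

```python
import numpy as np

def constant_schedules(lr_min, lr_max, n=5):
    # n constant learning rates, uniformly spread in log space between the bounds.
    return np.logspace(np.log10(lr_min), np.log10(lr_max), n)

def exponential_decay_schedule(lr0, decay, n_epochs):
    # Learning rate at epoch e is lr0 * decay**e (the baseline form described above).
    return lr0 * decay ** np.arange(n_epochs)
```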

Figure 3 shows the dynamics of our approach. The initial interval shows the large performance differences when using different learning rates. Here, a very large learning rate is best at first, but almost immediately becomes sub-optimal. After a quarter of the optimisation budget, the optimal learning rate always takes values in a narrow range, slowly decreasing over time. The algorithm captures this behaviour, being very exploratory at first and much less so towards the last intervals.

Compared to constant learning rate schedules, our approach largely outperforms any of them. Even in the case where no parallel computation is performed, our approach would still outperform any constant learning rate, as those appear to have already converged to sub-optimal values after 40,000 iterations. Our approach also outperforms all exponential decay schedules but one. For this problem, a properly tuned exponential decay seems like a very efficient solution, and our dynamic tuning captures this solution. Arguably, five runs with different exponential decays might outperform our dynamic approach, but this would critically depend on the chosen bounds for the decay parameters and on luck in the design of experiments. Standard BO (over the decay parameters) might be an alternative, but five observations would be too few to run it.

4 Multi-Task Scheduling

4.1 Multi-Task Learning

We now consider the case where $\mathcal{T}$ is discrete and relatively small, but can be extended when a new task needs to be solved, similarly to [29]. The objective is then to find an optimal set of learning rate schedules, one per task, rather than a single one. However, as an efficient learning rate schedule for one task is often found to perform well for another, we assume that the schedules in this set share some resemblance.

Several sampling strategies have been proposed recently in this context [37, 11, 29, 27]. However, all exploit the fact that posterior distributions are available in closed form. As our model is sampling-based, using those approaches would be either impractical or computationally expensive. Hence, we propose a new principled strategy, adapted from the TruVar algorithm of [6], originally proposed for optimisation and level-set estimation.

We first extend our model to multiple tasks, by indexing the latent GP on $\tau$ on top of the trace value and the learning rate. Then, following [37], we assume a product kernel for $g$:

$$k\big((y, \lambda, \tau), (y', \lambda', \tau')\big) = k_{y,\lambda}\big((y, \lambda), (y', \lambda')\big)\ k_{\tau}\big(\tau, \tau'\big).$$

To facilitate inference, we assume further that the tasks can be embedded in a low-dimensional latent space, $\tau \mapsto \mathbf{w}_\tau \in \mathbb{R}^d$. This results in a small set of parameters to infer (the locations of the tasks in the latent space), independently of the number of runs.
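With GPflow, the product-kernel structure over (trace value, learning rate) and a trainable low-dimensional task embedding could be sketched as follows; the names and dimensions are illustrative, and the embedding would be optimised jointly with the rest of the model:

```python
import numpy as np
import tensorflow as tf
import gpflow

n_tasks, latent_dim = 5, 2
# Trainable latent coordinates of the tasks (the low-dimensional embedding).
task_embedding = tf.Variable(0.1 * np.random.randn(n_tasks, latent_dim))

# Product kernel: one factor over (previous trace value, learning rate),
# one factor over the latent task coordinates.
kernel = (
    gpflow.kernels.Matern52(active_dims=[0, 1], lengthscales=[1.0, 1.0])
    * gpflow.kernels.Matern52(active_dims=[2, 3], lengthscales=[1.0, 1.0])
)

def augment_inputs(X, task_ids):
    """Append each observation's task coordinates to its (trace value, learning rate)
    inputs, so that the product kernel above acts on the concatenated input."""
    return tf.concat(
        [tf.cast(X, tf.float64), tf.gather(task_embedding, task_ids)], axis=1
    )
```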

4.2 Sequential Infill Strategy

In a nutshell, TruVar repeatedly applies two steps: 1) select a set of reference points (e.g., for optimisation, potential maximisers), then 2) find the observation that greedily shrinks the sum of prediction variances at the reference points.

We adapt this strategy here to uncover the profile optima, that is, the optimal schedule $\boldsymbol{\lambda}^*(\tau)$ for each task $\tau \in \mathcal{T}$.

The original algorithm selects as reference points all the points for which an upper confidence bound (UCB) of the objective is higher than a threshold. As we work with continuous design spaces, we decided to simplify this step and consider as reference set $\mathcal{M}_n$ the maximisers of the UCB of the final trace value for each task, that is:

$$\mathcal{M}_n = \Big\{ \big(\arg\max_{\boldsymbol{\lambda}}\ q_{\alpha}\big(y_T \mid \boldsymbol{\lambda}, \tau\big),\ \tau\big),\ \tau \in \mathcal{T} \Big\}, \qquad (7)$$

with $\alpha > 0.5$, so that the quantile defines a UCB for $y_T$. Note that, to ensure theoretical guarantees, UCB strategies generally require quantile orders that increase with time [17]. However, a constant value usually works best in practice [36, 5], so we focus on this case here.
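In practice, the reference set of eq. 7 can be built by scanning a grid of candidate schedules per task and keeping, for each task, the one with the highest UCB quantile. This is a sketch: `model.sample_final` stands for the Monte-Carlo roll-out of section 2.3 restricted to the final trace value.

```python
import numpy as np

def ucb_reference_set(model, tasks, candidate_schedules, alpha=0.9, n_samples=200):
    """Eq. (7), sketched: for each task, keep the candidate learning-rate schedule
    whose alpha-quantile of the predicted final trace value is largest."""
    reference = []
    for task in tasks:
        scores = [
            np.quantile(model.sample_final(lr_schedule, task, n_samples), alpha)
            for lr_schedule in candidate_schedules
        ]
        best = candidate_schedules[int(np.argmax(scores))]
        reference.append((best, task))
    return reference
```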

Due to the lack of data, the performance at $\mathcal{M}_n$ is uncertain, which can be quantified by the mean of the predictive variances at $\mathcal{M}_n$:

$$V_n = \frac{1}{|\mathcal{M}_n|} \sum_{(\boldsymbol{\lambda}, \tau) \in \mathcal{M}_n} \mathrm{Var}\big(y_T \mid \boldsymbol{\lambda}, \tau\big).$$

Note that, as $\mathcal{M}_n$ is chosen using a UCB, it is likely to correspond to values for which the model has a high prediction variance. Hence, $V_n$ may increase monotonically with $\alpha$, which acts as a tuning parameter for the exploration / exploitation trade-off.

Now, we would like to find the run (learning rate schedule and task) that reduces $V_n$ the most. Assume a potential candidate $(\boldsymbol{\lambda}, \tau)$ that would provide, if evaluated, an additional set of observations $Y_{\boldsymbol{\lambda}, \tau}$. Conditioning the model on this new data would reduce the prediction uncertainty at $\mathcal{M}_n$ (by the law of total variance), which we can measure with

$$V_n\big(Y_{\boldsymbol{\lambda}, \tau}\big) = \frac{1}{|\mathcal{M}_n|} \sum_{(\boldsymbol{\lambda}', \tau') \in \mathcal{M}_n} \mathrm{Var}\big(y_T \mid \boldsymbol{\lambda}', \tau', Y_{\boldsymbol{\lambda}, \tau}\big),$$

where the variance is taken conditionally on $Y_{\boldsymbol{\lambda}, \tau}$. In the case of regular GP models, this quantity is actually available in closed form independently of the values of $Y_{\boldsymbol{\lambda}, \tau}$ [6]. This is not the case here, so we replace it by its expectation over the values of $Y_{\boldsymbol{\lambda}, \tau}$, which leads to the following sampling strategy:

$$(\boldsymbol{\lambda}_{n+1}, \tau_{n+1}) = \arg\min_{\boldsymbol{\lambda}, \tau}\ \mathbb{E}_{Y_{\boldsymbol{\lambda}, \tau}}\big[V_n\big(Y_{\boldsymbol{\lambda}, \tau}\big)\big]. \qquad (8)$$

In practice, this criterion is not available in closed form, and must be computed using a double Monte-Carlo loop. However, conditioning on $Y_{\boldsymbol{\lambda}, \tau}$ can be approximated simply, as follows. First, samples of $Y_{\boldsymbol{\lambda}, \tau}$ are obtained recursively, as in section 2.3. Then, conditioning on each of those samples and computing the new conditional variance as in section 2.3 allows us to compute eq. 8.
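The double Monte-Carlo loop behind eq. 8 can be organised as below. This is only a sketch: `model.sample_final`, `model.sample_trace` and `model.condition_on` are hypothetical helpers standing for the recursive sampling and conditioning steps of section 2.3.

```python
import numpy as np

def expected_variance_reduction(model, candidate, reference_set,
                                n_outer=10, n_inner=100):
    """Estimate how much running `candidate` (a schedule/task pair) is expected
    to reduce the mean predictive variance of y_T over the reference set M_n."""
    current_var = np.mean([
        np.var(model.sample_final(lr, task, n_inner)) for lr, task in reference_set
    ])
    reductions = []
    for _ in range(n_outer):
        # Outer loop: fantasise one possible outcome of the candidate run ...
        fantasy = model.sample_trace(*candidate, n_samples=1)
        conditioned = model.condition_on(*candidate, fantasy)
        # ... inner loop: re-estimate the variances at the reference points.
        new_var = np.mean([
            np.var(conditioned.sample_final(lr, task, n_inner))
            for lr, task in reference_set
        ])
        reductions.append(current_var - new_var)
    return np.mean(reductions)

# Eq. (8) selects the candidate with the largest expected reduction
# (equivalently, the smallest expected remaining variance).
```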

Once $(\boldsymbol{\lambda}_{n+1}, \tau_{n+1})$ is obtained, the corresponding experiment is run and the model is updated, which in turn leads to a new set $\mathcal{M}_{n+1}$, etc. Once the budget is exhausted, the final set of schedules may be chosen using a different quantile order (either 0.5 for a risk-neutral solution, or a smaller value for a risk-averse one). The pseudo-code of the strategy is given in alg. 2.

Choose the initial design size $n_0$, the total budget $n_{\max}$, the quantile order $\alpha$ and the model settings;
Select an initial set of experiments $(\boldsymbol{\lambda}_i, \tau_i)_{i = 1, \dots, n_0}$;
Run the $n_0$ SGDs for $T$ steps;
Gather traces, build the trace model;
for $n = n_0$ to $n_{\max} - 1$ do
       Find the optimistic set $\mathcal{M}_n$ by solving eq. 7;
       Find the experiment $(\boldsymbol{\lambda}_{n+1}, \tau_{n+1})$ that reduces the uncertainty related to $\mathcal{M}_n$ by solving eq. 8;
       Run SGD with $(\boldsymbol{\lambda}_{n+1}, \tau_{n+1})$;
       Gather the new trace, update the trace model;
      
end for
Return the set of optimal learning rates $\{\hat{\boldsymbol{\lambda}}^*(\tau)\}_{\tau \in \mathcal{T}}$;
Algorithm 2 Multi-task tuning

Following [42], we use the so-called reparametrisation trick when generating samples, in order to solve eq. 8 with respect to $\boldsymbol{\lambda}$ using gradient-based optimisation. Note that the task $\tau$ is found by exhaustive search.

4.3 Experiment: Multi-Task Setting with SVGP on MNIST

To illustrate our strategy, we consider the following setup. We use the MNIST digit dataset, split into five binary classification problems (0 against 1, 2 against 3 and so on until 8 against 9). Our goal is to fit a sparse GP classification model to each of these datasets, by maximising its ELBO using the Adam algorithm.

We choose here a fixed number $T$ of Adam iterations and $L$ different learning rate values per schedule. The learning rates are bounded. A set of initial runs is performed with learning rates chosen by Latin hypercube sampling (in logarithmic space), and further runs are added sequentially according to our TruVar strategy. The GP model for $g$ uses a Matérn kernel, an exponential link function, a sparse set of inducing points and a low-dimensional latent space of tasks. To ease the resolution of eq. 7, we follow a greedy approach by searching for one learning rate value at a time, starting from the first one, which is in line with the Markov assumption of the model.

Figure 4 shows the learning rate profiles and corresponding runs obtained after running the procedure, as well as some predictions for randomly chosen learning rates. We first observe the flexibility of our model, which is able to capture complex traces while providing relevant uncertainty estimates (top plots). Then, for all tasks, the learning rates found reach the upper bound at first and decrease when the trace reaches a plateau. The optimal way of decreasing the learning rate depends on the task. One can see that the predictions are uncertain, but only on one side (some samples largely overestimate the true trace, but median ones are quite close to the truth).

Figure 4: Actual traces (black), predictions (colour) and learning rates (grey, in log scale between the learning rate bounds). The learning rates are randomly chosen for the top row, and set to their optimal estimates on the bottom right.

Figure 5 shows the estimated proximity between tasks. Here, all the tasks have been found to be relatively similar, as they all lead to close learning rate schedules. One may notice, for instance, that mnist01 is at an edge of the domain, which can be attributed to a different initial trace behaviour.

Figure 5: Latent variable values for each dataset.

4.4 Extension: Warm-Starting SGD for a New Task

Now, assume that a new task $\tau_{\text{new}}$ is added to the current set $\mathcal{T}$. Unfortunately, our model cannot be used directly, as the value of the corresponding latent variables $\mathbf{w}_{\text{new}}$ is unknown. A first solution is to find a "universal" tuning: this can be obtained by maximising the prediction of the final trace value averaged over all possible values of $\mathbf{w}_{\text{new}}$. This average can be calculated by Monte Carlo, assuming a probability measure for $\mathbf{w}_{\text{new}}$, for instance the Lebesgue measure over the convex hull of the existing task embeddings.

Alternatively, one might want to spend some computing budget (say, a few runs of limited length) to learn $\mathbf{w}_{\text{new}}$ and then achieve a better learning rate tuning. Assuming again a measure for $\mathbf{w}_{\text{new}}$, an informative experiment would correspond to a learning rate schedule for which the proportion of the variance of the predictor due to the uncertainty on $\mathbf{w}_{\text{new}}$ is maximal. Averaging over all time steps, we define our sampling strategy (with again a criterion computable by Monte Carlo) as:

$$\boldsymbol{\lambda}_{\text{new}} = \arg\max_{\boldsymbol{\lambda}} \frac{1}{L} \sum_{k=1}^{L} \frac{\mathrm{Var}_{\mathbf{w}}\big(\mathbb{E}[\,y_{t_k} \mid \boldsymbol{\lambda}, \mathbf{w}\,]\big)}{\mathrm{Var}\big(y_{t_k} \mid \boldsymbol{\lambda}\big)},$$

following the law of total variance. Note that once $\mathbf{w}_{\text{new}}$ is estimated, it is possible to apply alg. 1, exploiting the flexibility of dynamic tuning while leveraging information from previous runs. A more integrated approach would use alg. 1 directly while measuring and accounting for uncertainty in $\mathbf{w}_{\text{new}}$; this is left for future work.
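The warm-start criterion amounts to a variance decomposition (law of total variance) estimated by nested Monte Carlo. A sketch follows, with `model.sample_trace(lr_schedule, w, n)` a hypothetical sampler returning trace trajectories for a fixed task embedding `w`:

```python
import numpy as np

def task_uncertainty_score(model, lr_schedule, w_samples, n_inner=100):
    """Share of the predictive variance of the trace explained by the uncertainty
    on the new task's latent coordinates w, averaged over the time steps."""
    cond_means, cond_vars = [], []
    for w in w_samples:                       # outer Monte Carlo over the measure on w
        traj = model.sample_trace(lr_schedule, w, n_inner)   # (n_inner, n_steps)
        cond_means.append(traj.mean(axis=0))  # E[y_t | w]
        cond_vars.append(traj.var(axis=0))    # Var[y_t | w]
    between = np.var(np.stack(cond_means), axis=0)  # Var_w( E[y_t | w] )
    within = np.mean(np.stack(cond_vars), axis=0)   # E_w( Var[y_t | w] )
    return np.mean(between / (between + within))    # averaged over time steps

# The schedule maximising this score is the most informative one to run first
# in order to identify w for the new task.
```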

5 Concluding Comments

We proposed a probabilistic model for the traces of optimisers, whose input parameters (the choice of learning rate values) correspond to particular periods of the optimisation. This allowed us to define a versatile framework to tackle a set of problems: tuning the optimiser for a set of similar tasks, warm-starting it for a new task, or adapting the learning rate on-line for a cold-started run.

A convergence proof for the multi-task strategy has not been considered here. We believe that the results of [6] may be adapted to our case: this is left for future work. Other possible extensions are to apply our framework to other optimisers: for instance, to control the population sizes of evolution strategy algorithms such as CMA-ES [13], for which adaptation mechanisms have been found promising [26]. Finally, additional efficiency could be achieved by leveraging varying dataset sizes, in the spirit of [18, 10] for instance.

References

  • Andrychowicz et al. [2016] Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M.W., Pfau, D., Schaul, T., Shillingford, B., De Freitas, N.: Learning to learn by gradient descent by gradient descent. In: Advances in Neural Information Processing Systems. pp. 3981–3989 (2016)
  • Baydin et al. [2017] Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782 (2017)
  • Bengio [2012] Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In: Neural networks: Tricks of the trade, pp. 437–478. Springer (2012)
  • Bergstra et al. [2011] Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in neural information processing systems. pp. 2546–2554 (2011)
  • Bogunovic et al. [2018] Bogunovic, I., Scarlett, J., Jegelka, S., Cevher, V.: Adversarially robust optimization with Gaussian processes. In: Advances in Neural Information Processing Systems. pp. 5760–5770 (2018)
  • Bogunovic et al. [2016] Bogunovic, I., Scarlett, J., Krause, A., Cevher, V.: Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation. In: Advances in neural information processing systems. pp. 1507–1515 (2016)
  • Chollet [2009] Chollet, F.: Keras implementation of ResNet for CIFAR. https://keras.io/examples/cifar10_resnet/ (2009)
  • Domhan et al. [2015] Domhan, T., Springenberg, J.T., Hutter, F.: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
  • Duchi et al. [2011] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul), 2121–2159 (2011)
  • Falkner et al. [2018] Falkner, S., Klein, A., Hutter, F.: Bohb: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774 (2018)
  • Ginsbourger et al. [2014] Ginsbourger, D., Baccou, J., Chevalier, C., Perales, F., Garland, N., Monerie, Y.: Bayesian adaptive reconstruction of profile optima and optimizers. SIAM/ASA Journal on Uncertainty Quantification 2(1), 490–510 (2014)
  • Gugger and Howard [2018] Gugger, S., Howard, J.: Adamw and super-convergence is now the fastest way to train neural nets (Jul 2018), https://www.fast.ai/2018/07/02/adam-weight-decay/
  • Hansen and Ostermeier [2001] Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies. Evolutionary computation 9(2), 159–195 (2001)
  • He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • Hensman et al. [2015] Hensman, J., Matthews, A.G.d.G., Ghahramani, Z.: Scalable variational Gaussian process classification. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (2015)
  • Hoffman et al. [2013] Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic Variational Inference. Journal of Machine Learning Research (2013)
  • Kaufmann et al. [2012] Kaufmann, E., Cappé, O., Garivier, A.: On Bayesian upper confidence bounds for bandit problems. In: Artificial intelligence and statistics. pp. 592–600 (2012)
  • Klein et al. [2017a] Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian optimization of machine learning hyperparameters on large datasets. In: International Conference on Artificial Intelligence and Statistics (AISTATS 2017). pp. 528–536. PMLR (2017a)
  • Klein et al. [2017b] Klein, A., Falkner, S., Springenberg, J.T., Hutter, F.: Learning curve prediction with Bayesian neural networks. In: ICLR (2017b)
  • Krizhevsky and Hinton [2009] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
  • Leontaritis and Billings [1985] Leontaritis, I., Billings, S.A.: Input-output parametric models for non-linear systems part ii: stochastic non-linear systems. International journal of control 41(2), 329–344 (1985)
  • Li et al. [2018] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research 18(185), 1–52 (2018)
  • Loshchilov and Hutter [2019] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  • Matthews et al. [2016] Matthews, A.G.d.G., Hensman, J., Turner, R., Ghahramani, Z.: On sparse variational methods and the kullback-leibler divergence between stochastic Processes. Journal of Machine Learning Research 51, 231–239 (2016)
  • Matthews et al. [2017] Matthews, A.G.d.G., Van Der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., Hensman, J.: GPflow: A Gaussian process library using TensorFlow. The Journal of Machine Learning Research 18(1), 1299–1304 (2017)
  • Nishida and Akimoto [2018] Nishida, K., Akimoto, Y.: PSA-CMA-ES: CMA-ES with population size adaptation. In: Proceedings of the Genetic and Evolutionary Computation Conference (pp. 865-872) (2018)
  • Pearce and Branke [2018] Pearce, M., Branke, J.: Continuous multi-task Bayesian optimisation with correlation. European Journal of Operational Research 270(3), 1074–1085 (2018)
  • Picheny and Ginsbourger [2013] Picheny, V., Ginsbourger, D.: A nonstationary space-time Gaussian Process model for partially converged simulations. SIAM/ASA Journal on Uncertainty Quantification 1(1), 57–78 (2013)
  • Poloczek et al. [2016] Poloczek, M., Wang, J., Frazier, P.I.: Warm starting Bayesian optimization. In: Proceedings of the 2016 Winter Simulation Conference. pp. 770–781. IEEE Press (2016)
  • Reddi et al. [2018] Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond. In: ICLR (2018)
  • Saul et al. [2016] Saul, A.D., Hensman, J., Vehtari, A., Lawrence, N.D., et al.: Chained Gaussian Processes. In: AISTATS. pp. 1431–1440 (2016)
  • Shahriari et al. [2016] Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., De Freitas, N.: Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104(1), 148–175 (2016)
  • Smith [2017] Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 464–472. IEEE (2017)
  • Smith and Topin [2019] Smith, L.N., Topin, N.: Super-convergence: Very fast training of neural networks using large learning rates. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications. vol. 11006, p. 1100612. International Society for Optics and Photonics (2019)
  • Snoek et al. [2012] Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems. pp. 2951–2959 (2012)
  • Srinivas et al. [2010] Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Gaussian Process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. pp. 1015–1022. Omnipress (2010)
  • Swersky et al. [2013] Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Advances in neural information processing systems. pp. 2004–2012 (2013)
  • Swersky et al. [2014] Swersky, K., Snoek, J., Adams, R.P.: Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014)
  • Tieleman and Hinton [2012] Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4(2), 26–31 (2012)
  • Titsias [2009] Titsias, M.: Variational Learning of Inducing Variables in Sparse Gaussian Processes. Artificial Intelligence and Statistics (2009)
  • van der Wilk et al. [2020] van der Wilk, M., Dutordoir, V., John, S., Artemev, A., Adam, V., Hensman, J.: A framework for interdomain and multioutput Gaussian processes. arXiv:2003.01115 (2020), https://arxiv.org/abs/2003.01115
  • Wilson et al. [2018] Wilson, J., Hutter, F., Deisenroth, M.: Maximizing acquisition functions for Bayesian optimization. In: Advances in Neural Information Processing Systems. pp. 9884–9895 (2018)