1 Introduction
The great recent successes of machine learning generally rely on models with high complexity (e.g. deep models) and extensive datasets. Those models usually require running a training procedure over a (large) set of parameters, which often amounts to minimising a loss function with an iterative algorithm such as stochastic gradient descent (SGD) or the nonlinear conjugate gradient method. In the deep learning community, a particular focus has been given to SGD algorithms such as Adagrad [9], RMSProp [39], and in particular Adam [1], which despite recent discussions and improvements [23, 30] can be considered as the state of the art and is widely used in practice [12]. Unfortunately, this training procedure is often extremely time-consuming due to the high model complexity and the amount of data at hand; hence, performing it with the best possible efficiency is paramount for many applications, in particular those for which training is done repeatedly, or continuously, for example in streaming models.

The performance of nearly all SGD variants depends critically on the choice of the learning rate [3], which in short tunes by how much the algorithm should follow the (noisy) gradient signal. The default choice for most algorithms is a constant learning rate, although recent experiments showed that a time-varying value can be extremely beneficial [2, 33, 34]. In any case, learning rate values need to be set, and tuning them by hand can be excessively burdensome. Bayesian optimisation (BO), on the other hand, is now a well-established tool for model selection and parameter tuning [4, 35]. While learning rates are sometimes included in the parameters to be tuned [4], to our knowledge no work has been dedicated to speeding up SGD algorithms on the fly using BO: this is the purpose of the present work.
Let f(θ; τ) be the objective optimised by SGD; typically in deep learning, f can be a mean squared error or an evidence lower bound (ELBO). (In the following, without loss of generality, we use the convention that f should be maximised.) Here θ ∈ Θ denotes a set of parameters, while τ defines the learning task at hand, that is, a particular model structure and a dataset. The set of tasks 𝒯 is typically a singleton (when a single model is fitted to a single dataset), or a discrete set, for instance when the same model is used to fit different datasets. Given τ, an SGD algorithm produces a sequence of parameters θ_1, …, θ_N, with N a predefined number of iterations (i.e. optimisation steps).
In our setup we assume that the learning rate varies from one iteration to the next. One way to parametrise it, which avoids working directly on a high-dimensional vector of size N, is to use a piecewise constant form. Defining a sequence of change points 0 = t_0 < t_1 < … < t_L = N, the learning rate curve is defined by constants η_1, …, η_L such that η(t) = η_i for t ∈ [t_{i−1}, t_i). Assuming lower and upper bounds for the learning rate, without loss of generality our tuning parameters can be rescaled so that η = (η_1, …, η_L) ∈ [0, 1]^L. For any task τ, our objective is to seek the best possible learning rate schedule, that is, the one that maximises f after N steps:
(1)  η*(τ) = argmax_{η ∈ [0,1]^L} f(θ_N(η, τ); τ),

with θ_N(η, τ) the parameters returned after N iterations (now written as a function of the learning rate schedule η and the task τ).
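The piecewise-constant parametrisation above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation; the names `change_points` and `rates` are ours.

```python
def make_schedule(change_points, rates):
    """Return eta(t): the constant learning rate of the interval containing t.

    change_points: increasing list [t_0, t_1, ..., t_L] with t_0 = 0.
    rates: list [eta_1, ..., eta_L], one constant per interval.
    """
    assert len(change_points) == len(rates) + 1

    def eta(t):
        # eta(t) = eta_i for t in [t_{i-1}, t_i)
        for i in range(len(rates)):
            if change_points[i] <= t < change_points[i + 1]:
                return rates[i]
        return rates[-1]  # convention: keep the last rate for t >= t_L

    return eta

# A schedule with L = 3 intervals over N = 600 iterations.
schedule = make_schedule([0, 100, 300, 600], [1e-2, 3e-3, 1e-3])
```

The tuning problem of eq. 1 then amounts to searching over the vector `rates` (rescaled to the unit cube) for a fixed set of change points.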
Now, BO classically relies on Gaussian process (GP) surrogate models of the function to optimise. While predicting the optimisation trace (that is, the value of f(θ_t; τ) for all t) has already been addressed in related settings [28, 8, 19], adapting those models to make them dependent on a learning rate that changes over time is often not possible. Moreover, changing the learning rate often induces drastic changes in the trace behaviour, which makes typical parametric models unfit for the task. Our first contribution is the design of an autoregressive model for the trace of an SGD algorithm, which flexibly adjusts to abrupt changes of behaviour. The model, which serves as a base for our sampling strategies, is presented in section 2.

In the classical BO setting [32], an initial set of experiments, typically with space-filling properties, is run first. The gathered data is then used to build a predictive model for f(θ_N(η, τ); τ), which allows us to estimate η*(τ). In BO, such an estimate is gradually improved by designing a sequence of additional experiments that enhance the prediction capability of the model, following an exploration / exploitation trade-off.

However, oftentimes the user is faced with training a single model, without much prior information on an appropriate choice of the learning rate schedule. In that case, the relevance of BO is somewhat debatable, as it is uncertain that the outcome of several SGD runs would outperform a single but much longer one with an empirically chosen learning rate. To overcome this, [38, 22, 10] used forecasting models to stop unpromising runs early and drastically reduce the computational cost of the BO loop. Yet, such approaches can only provide long-term recommendations at the price of repeated long runs, and are mostly relevant when other parameters are tuned simultaneously with the learning rate.
We propose here an alternative strategy, unexplored in the BO literature, which is to adapt a single run on the fly rather than terminate some and start new ones from scratch. In addition, as modern hardware architecture favours parallel computing, we want to leverage the fact that training procedures can be run in parallel (say, one on each available GPU). Section 3 is dedicated to this problem.
In other practical situations, training is done repeatedly over similar tasks: for instance the same model fitted to different datasets, or different variations of a model fitted to a dataset. In that case, one seeks an optimal mapping η*: 𝒯 → [0, 1]^L [or profile optima, 11] (from the space of tasks to the space of learning rates), such that η*(τ) maximises f(θ_N(η, τ); τ) for each τ. Transferring information from one task to another allows designing strategies much more efficient than running independent BO loops for each task [37]. This framework is considered in section 4.
2 A GP-based NARX Model For Optimisation Traces
For simplicity, we focus for now on the case where 𝒯 is a singleton (i.e. we consider a single dataset and model) and remove the dependence on τ from our notations. The generalisation to multiple tasks is deferred to section 4.
2.1 Modelling of Optimisation Traces
Define first y(t) = f(θ_t), t = 0, …, N, the trace of an optimisation run, and denote by y_i = y(t_i) the trace value at each change point of the learning rate.
There exist many options to fit a parametric model to y: for instance, in [8], 11 models for traces are proposed, including exponential and power forms. While those forms make sense with a constant learning rate, they cannot properly fit a varying one, as changing the learning rate drastically modifies the trace dynamics. This is illustrated in fig. 2, left, where the visible trace trend abruptly changes every time the learning rate changes (see also fig. 4). Hence, instead of fitting a single parametric model, we propose to use a composite one, based on an autoregressive formulation.
First, we model y_0 as a Gaussian variable, y_0 ∼ N(μ_0, σ_0²), to account for randomness in the starting point. Then, we propose to model the trace using a nonlinear autoregressive model with exogenous inputs [NARX(q), 21], whose general expression is:

y_{i+1} = φ(y_i, …, y_{i−q+1}, η_{i+1}, …, η_{i−q+2}) + ε_{i+1},

where the new state is a nonlinear function φ of the past states and exogenous inputs (here, the learning rates) plus some independent noise ε. We propose a specific form for φ as follows:

φ(y_i, η_{i+1}) = y_i + h(t_{i+1} − t_i; g(y_i, η_{i+1})),

where g is a latent function that modulates the increments according to the current and past trace values and learning rates, and h (monotone in time, to ensure the monotonicity of the trace) is a link function that returns the trace increment for any given time according to its parameters. As the trace may be recorded at times outside {t_0, …, t_L}, our model for the trace is:

y(t) = y_i + h(t − t_i; g(y_i, η_{i+1})) + ε(t),  t ∈ [t_i, t_{i+1}),

where ε(t) represents the noise in the case where the objective function is not evaluated exactly and/or unaccounted-for deviations between observations and the model.
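To make the autoregressive structure concrete, the following sketch simulates a NARX(1) trace with a linear link. The modulating function `g` below is a hand-picked stand-in for the latent GP of the paper (chosen only to mimic flatter slopes near convergence), and all names are illustrative.

```python
import random

def g(y, eta):
    # Stand-in for the latent GP: flatter slopes when y is already high
    # (close to convergence), larger slopes for larger learning rates.
    return eta * max(0.0, 1.0 - y)

def simulate_trace(y0, rates, interval_len, noise=0.0, seed=0):
    """Simulate trace values at the change points t_0, t_1, ...:
    y_{i+1} = y_i + h(t_{i+1} - t_i; g(y_i, eta_{i+1})) + eps."""
    rng = random.Random(seed)
    ys = [y0]
    for eta in rates:
        incr = g(ys[-1], eta) * interval_len  # linear link: slope x duration
        ys.append(ys[-1] + incr + rng.gauss(0.0, noise))
    return ys

trace = simulate_trace(0.0, [0.05, 0.05, 0.01], interval_len=5)
```

Changing one entry of `rates` changes the slope of the corresponding segment only through `g`, which is exactly the abrupt-change behaviour the composite model is designed to capture.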
In the following, we use either piecewise linear or piecewise exponential forms for the traces. With a defining the linear slope, and with b corresponding to a logit offset and c to a logit rate, this gives:

(2)  h(t; a) = Ψ(a) t,

(3)  h(t; b, c) = σ(b + Ψ(c) t) − σ(b),

where the parameters are given by a (multi-output) GP g, Ψ is here the softplus function, Ψ(x) = log(1 + eˣ), and σ the logistic function, ensuring monotonicity. Note that any of the parametric models of e.g. [8] may be used here as h. In a sense, our model extends those to a time-varying learning rate.
Focusing on the linear case, and setting q = 1, we have: y(t_{i+1}) = y_i + Ψ(g(y_i, η_{i+1})) (t_{i+1} − t_i) + ε. We directly see the dynamics implied by our model: increments are linear with respect to time, but the slope depends nonlinearly on an exogenous input (the learning rate) and on the current state (intuitively, we expect flatter slopes if y_i is high, as the model is close to convergence). Considering the exponential case (eq. 3) and setting q = 1, we show GP surfaces in fig. 1 for b (left) and c (right).
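The two link functions can be sketched as below. The linear form follows eq. 2 directly; for the saturating form we use one monotone choice with h(0) = 0, consistent with a logit offset b and logit rate c — the paper's exact parametric form may differ.

```python
import math

def softplus(x):
    # Psi(x) = log(1 + e^x), strictly positive, ensures monotone increments
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def h_linear(t, a):
    # Piecewise-linear link (eq. 2): positive slope Psi(a) times elapsed time.
    return softplus(a) * t

def h_exponential(t, b, c):
    # Saturating link with logit offset b and rate c; monotone in t, zero at
    # t = 0. Assumed form for illustration only.
    return sigmoid(b + softplus(c) * t) - sigmoid(b)
```

In the full model, (a) or (b, c) are the outputs of the latent GP g evaluated at the current trace value and learning rate.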
Importantly, the same g is used over all the time intervals and only depends on (y_i, η_{i+1}). Hence, in the NARX(1) case, our model implies that the same initial conditions and learning rates would lead to the same outcome, regardless of the time the algorithm has been run before t_i. In that sense, the model can be considered as Markovian, as it is independent of the path followed by the optimiser before t_i. While our model is general for higher-order Markov chains (i.e. NARX(q)), we found empirically that the current formulation provides the right trade-off between accuracy, robustness and ease of inference, as the latent function g is then only defined over a low-dimensional space.

Finally, we follow the chained GP framework [31] for g, and choose a GP prior with independent components.
2.2 Learning Using Variational Inference
We consider here a fixed link function, so the inference task boils down to estimating the parameters of the latent function g and of y_0 (the starting value of the trace). Assume that we have run the optimiser R times with different learning rate schedules and recorded the trace of each run at a set of times (in [0, N]). Let y_j(t) denote an observation of the trace at time t for a learning rate schedule η^j. We assume that each trace is evaluated at the beginning of each time interval (i.e. at t_0, …, t_L), which gives (indirect) observations of g, which we denote henceforth 𝒟.
The parameters of y_0 can be inferred by maximum likelihood from the recorded starting values. Now, we focus on learning the latent GP function g. Without loss of generality, we assume in the following that our link function only requires a single GP parameter (i.e. p = 1). When a link function depends on more than one GP, we simply apply the same procedure in parallel and model every function independently.
We follow a fully Bayesian approach to infer g from the given dataset. We place a GP prior on the latent function g and assume that the noise ε is i.i.d. Gaussian with zero mean and variance σ², so that:

y_{i+1} | g ∼ N( y_i + h(t_{i+1} − t_i; g(y_i, η_{i+1})), σ² ).

For general link functions, exact inference of the latent function g is not possible due to the nonlinear transformation.
A classical solution is to follow the Sparse Variational GP (SVGP) framework [40, 15], which relies on two components. First, it introduces a set of inducing variables u, whose main use is to specify the function value of the posterior GP at a specific set of pseudo inputs Z, i.e. u = g(Z). The distribution of the inducing variables is specified by a fully parameterised Gaussian with mean m and covariance S, which are the variational parameters we want to learn. The prior GP on g can then be conditioned on u, which leads to a marginal posterior with mean μ(x) and variance v(x):
(4)  μ(x) = k_x^⊤ K_{ZZ}^{−1} m,   v(x) = k(x, x) − k_x^⊤ K_{ZZ}^{−1} (K_{ZZ} − S) K_{ZZ}^{−1} k_x,

where k_x = k(Z, x) and K_{ZZ} = k(Z, Z).
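The sparse conditional above is standard and can be sketched directly with NumPy; this is a generic SVGP posterior (with an assumed RBF kernel for illustration), not the paper's GPflow implementation.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel between input sets A (n x d) and B (m x d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def svgp_posterior(x, Z, m, S, kern=rbf, jitter=1e-8):
    """Marginal posterior mean/variance of an SVGP at test inputs x, given
    inducing inputs Z and variational distribution q(u) = N(m, S)."""
    Kzz = kern(Z, Z) + jitter * np.eye(len(Z))
    Kzx = kern(Z, x)
    Kxx = kern(x, x)
    A = np.linalg.solve(Kzz, Kzx)          # K_ZZ^{-1} k_x
    mean = A.T @ m
    cov = Kxx - A.T @ (Kzz - S) @ A        # prior cov corrected by q(u)
    return mean, np.diag(cov)
```

Sanity checks: with S ≈ 0 the posterior interpolates the inducing values with near-zero variance, and with q(u) equal to the prior the predictive variance falls back to the prior variance.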
With this approximation in place we can set up our model's optimisation objective, which is a lower bound on the log marginal likelihood [ELBO, 16], equal to

(5)  ℒ = Σ_n E_{q(g_n)}[log p(y_n | g_n)] − KL[q(u) ‖ p(u)],

where q(g_n) is the marginal posterior of eq. 4 at the n-th data point, and KL[q(u) ‖ p(u)] is the Kullback–Leibler divergence between the approximate posterior and the prior of u [24]. It can be calculated analytically given the Gaussianity of both the prior and the posterior on u. The expectation can be estimated in an unbiased way using Monte Carlo, by sampling from the posterior (eq. 4) and propagating the samples through the link function. Optimising eq. 5 with respect to the model parameters can be done by gradient descent thanks to automatic differentiation toolkits. In our experiments we made use of the GPflow library [25]. More precisely, we used the provided multi-output framework for GPs [41], which is well-suited for implementing and optimising these complex, composite GP models.
Note that as the approximation is sparse (i.e. it relies on a few inducing points), it can handle much larger datasets than classical GP models. This is a decisive advantage here as traces typically contain thousands of datapoints.
2.3 Generating Trace Predictions
Since the main objective is to maximise f after N iterations, we would like to predict y(t_L), which requires access to y(t_{L−1}). Recursively, we see that predicting y(t_L) is achieved by predicting the corresponding sequence y_1, …, y_{L−1}. Importantly, the distribution of the y_i's is not available analytically because of the arbitrary link function. Hence, we must resort to sampling. Given the recursive structure (the value of y_i is necessary to draw y_{i+1}), we sample first y_1, then y_2 after conditioning on y_1, and recursively sample y_{i+1} (i ≥ 2) after conditioning on y_1, …, y_i. Drawing multiple samples for a given η may be used to provide any statistic of y(t_L), such as mean, variance and quantiles.
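The recursive sampling scheme can be sketched as below. The function `sample_increment` stands in for a draw from the model's posterior over the next increment; the toy version given here is only illustrative.

```python
import random
import statistics

def sample_final_traces(y0, rates, sample_increment, n_samples=200, seed=0):
    """Draw n_samples of the final trace value y(t_L) by recursively sampling
    each increment conditionally on the current state.
    sample_increment(rng, y, eta) stands in for one posterior draw of the
    next increment given state y and learning rate eta."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_samples):
        y = y0
        for eta in rates:            # one draw per learning-rate interval
            y = y + sample_increment(rng, y, eta)
        finals.append(y)
    return finals

# Illustrative stand-in: noisy, state-dependent increment.
def toy_increment(rng, y, eta):
    return max(0.0, eta * (1.0 - y)) + rng.gauss(0.0, 0.01)

finals = sample_final_traces(0.0, [0.5, 0.3, 0.2], toy_increment)
deciles = statistics.quantiles(finals, n=10)  # e.g. for quantile statistics
```

Any statistic of y(t_L) (mean, variance, quantiles) is then read off the empirical distribution of `finals`.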
3 Dynamic Tuning of the Learning Rate
We address now the question of dynamically tuning the learning rate for a single model and task, using the model previously defined. To do so, we depart from the standard BO approach, by modifying a small set of runs on the fly instead of repeatedly starting new ones.
3.1 Proposed Strategy
We consider the following framework. We assume that K SGD runs are performed in parallel with different learning rates. For simplicity, we assume synchronicity (all runs progress at the same speed). Each run is carried out independently over the time intervals [t_0, t_1], …, [t_{L−1}, t_L]. At each t_i, the traces are fed to the model, which is then used to schedule new learning rates for the next interval.
The learning rates are chosen as follows. At initialisation, as there is no model to help making decisions, the η_1^k's are taken on a uniform grid between bounds. At any t_i (i ≥ 1), the current trace observations are first integrated into the model. Then, each new learning rate is chosen to maximise the α_k-quantile (computed here empirically by sampling) of the trace at the end of the next interval:

(6)  η_{i+1}^k = argmax_η q_{α_k}[ y(t_{i+1}) | η ],  k = 1, …, K.

The α_k's are used here to balance exploitation and exploration: α's close to one may lead to very optimistic choices (learning rates for which the outcome is highly uncertain) while α's close to zero result in risk-averse choices (guaranteed immediate performance). Hence, to maximise our diversity of choices we spread the α_k's evenly over (0, 1). In the case K = 1, this forces us to choose α = 0.5, which is a risk-neutral strategy. Following an optimistic strategy (say, α > 0.5) instead may enhance exploration and improve long-term performance.
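The quantile-based acquisition of eq. 6 can be sketched over a discrete set of candidate rates. Here `simulate_final(rng, eta)` stands in for one model sample of the end-of-interval trace value (section 2.3); all names are illustrative.

```python
import random
import statistics

def choose_rates(candidates, alphas, simulate_final, n_samples=100, seed=0):
    """For each parallel run k, pick the candidate learning rate maximising
    the empirical alpha_k-quantile of the predicted end-of-interval trace
    (eq. 6)."""
    rng = random.Random(seed)
    chosen = []
    for alpha in alphas:
        best_eta, best_q = None, float("-inf")
        for eta in candidates:
            draws = [simulate_final(rng, eta) for _ in range(n_samples)]
            cuts = statistics.quantiles(draws, n=100)  # 99 percentile cuts
            q = cuts[max(0, min(98, int(alpha * 100) - 1))]
            if q > best_q:
                best_eta, best_q = eta, q
        chosen.append(best_eta)
    return chosen
```

Small α's pick rates with good guaranteed outcomes; large α's pick rates whose upper tail is promising, even if uncertain.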
In addition, at each t_i we greedily select the run with the highest current trace and duplicate it K times while discarding the others. While this was found to accelerate the performance significantly, keeping each run may prove a valid alternative on problems that require more exploration of learning rate schedules.
The pseudocode of the strategy is given in alg. 1. Note that a relevant choice of L is problem-dependent: a large L allows more changes of the learning rate value, but increases the computational overhead due to model fitting and to solving eq. 6. Besides, to facilitate inference, the trace should not be sliced into too many parts (fig. 2). In the experiment reported below, our choice of L resulted in a negligible BO overhead.
3.2 Experiment: Dynamic Tuning of the Learning Rate on CIFAR
We apply our approach to the training of a vanilla ResNet [14] neural network with 56 layers on the classification dataset CIFAR-10 [20], which contains 60,000 32×32 colour images. We use an implementation of the ResNet model for the CIFAR dataset available in Keras [7]. We first split the dataset into 50,000 training and 10,000 testing images. The Adam optimiser is used for 100 epochs to maximise the (negative) cross-entropy of future predictions.
Our BO setup is as follows: the 100 epochs are divided into L equal intervals and K optimisations are run in parallel. The objective is recorded every 50 iterations. The GP model for g uses a Matérn kernel, a linear link function and a sparse set of inducing points. A practical issue we face is that abruptly increasing the learning rate sometimes causes aberrant behaviour (which can be seen as large peaks in the trace in fig. 3). To avoid this problem, we use a GP classification model [15] to predict which runs are likely to fail based on past trace and learning rate values. The optimisation of eq. 6 is then restricted to the values of η for which the probability of failure is below a threshold. We set the failure threshold for a given trace inversely proportional to its quantile level α_k, as we want traces with larger α_k to be more explorative. In addition, we limit the maximum change of the learning rate to one order of magnitude.

As baselines, we use five constant learning rate schedules uniformly spread in log space, and 12 learning rates with exponential decay (three initial values and four decay rates), such that the learning rate in each epoch equals the initial value multiplied by the decay rate raised to the epoch number.
Figure 3 shows the dynamics of our approach. The initial interval shows the large performance differences when using different learning rates. Here, a very large learning rate is best at first, but almost immediately becomes suboptimal. After a quarter of the optimisation budget, the optimal learning rate consistently takes values in a narrow range, slowly decreasing over time. The algorithm captures this behaviour, being very exploratory at first and much less so towards the last intervals.
Compared with constant learning rate schedules, our approach largely outperforms any of them. Even in the case where no parallel computation is performed, our approach would still outperform any constant learning rate, as those seem to have already converged to suboptimal values after 40,000 iterations. Our approach also outperforms all exponential decay schedules but one. For this problem, a properly tuned exponential decay seems like a very efficient solution, and our dynamic tuning captures this solution. Arguably, five runs with different exponential decays might outperform our dynamic approach, but this would critically depend on the chosen bounds for the parameters and on luck in the design of experiments. Standard BO (over the decay parameters) might be an alternative, but five observations would be too few to run it.
4 Multi-Task Scheduling
4.1 Multi-Task Learning
We now consider the case where 𝒯 is discrete and relatively small, but can be increased when a new task needs to be solved, similarly to [29]. The objective is then to find an optimal set of learning rate schedules {η*(τ)}_{τ ∈ 𝒯} rather than a single one. However, as an efficient learning rate schedule for a task is often found to perform well for another, we assume that the values of the set share some resemblance.
Several sampling strategies have been proposed recently in this context [37, 11, 29, 27]. However, all exploit the fact that posterior distributions are available in closed form. As our model is sampling-based, using those approaches would be either impractical or overly expensive computationally. Hence, we propose a new principled strategy, adapted from the TruVar algorithm of [6], originally proposed for optimisation and level-set estimation.
We first extend our model to multiple tasks, by indexing the latent GP on τ on top of y and η. Then, following [37], we assume a product kernel for g:

k((y, η, τ), (y′, η′, τ′)) = k_{yη}((y, η), (y′, η′)) × k_τ(τ, τ′).
To facilitate inference, we assume further that the tasks can be embedded in a low-dimensional latent space, i.e. each τ is represented by a latent vector w_τ ∈ ℝ^d. This results in a small set of parameters to infer (the locations of the tasks in the latent space), independently of the number of runs.
4.2 Sequential Infill Strategy
In a nutshell, TruVar repeatedly applies two steps: 1) select a set of reference points (e.g. for optimisation, potential maximisers), then 2) find the observation that greedily shrinks the sum of prediction variances at reference points.
We adapt here this strategy to uncover profile optima, that is:

η*(τ) = argmax_{η ∈ [0,1]^L} f(θ_N(η, τ); τ),  for all τ ∈ 𝒯.

The original algorithm selects as reference points all the points for which an upper confidence bound (UCB) of the objective is higher than a threshold. As we work with continuous design spaces, we decided to simplify this step and consider as reference points the maximisers of the UCB of the final trace value for each task, that is:

(7)  η^{UCB}(τ) = argmax_η q_α[ y(t_L) | η, τ ],  τ ∈ 𝒯,

with α > 0.5, so that the quantile defines a UCB for y(t_L). Note that to ensure theoretical guarantees, UCB strategies generally require quantile orders that increase with time [17]. However, a constant value usually works best in practice [36, 5], so we focus on this case here.
Due to the lack of data, the performance at the η^{UCB}(τ)'s is uncertain, which can be quantified by the mean of the predictive variances at those points:

V = (1/|𝒯|) Σ_{τ ∈ 𝒯} var[ y(t_L) | η^{UCB}(τ), τ ].

Note that as each η^{UCB}(τ) is chosen using a UCB, it is likely to correspond to values for which the model has a high prediction variance. So, V may increase monotonically with α, which acts as a tuning parameter for the exploration / exploitation trade-off.
Now, we would like to find the run (learning rate schedule and task) that reduces V the most. Assume a potential candidate (η, τ), that would provide, if evaluated, an additional set of observations Y_{η,τ}. Conditioning the model on this new data would reduce the prediction uncertainty at the reference points (by the law of total variance), which we can measure with

V(η, τ) = (1/|𝒯|) Σ_{τ′ ∈ 𝒯} var[ y(t_L) | η^{UCB}(τ′), τ′ ; Y_{η,τ} ],

where var[· ; Y_{η,τ}] denotes the variance conditionally on Y_{η,τ}. In the case of regular GP models, V(η, τ) is actually available in closed form, independently of the values of Y_{η,τ} [6]. This is not the case here, so we replace V(η, τ) by its expectation over the values of Y_{η,τ}, which leads to the following sampling strategy:

(8)  (η_{next}, τ_{next}) = argmin_{η, τ} E_{Y_{η,τ}}[ V(η, τ) ].
In practice, this criterion is not available in closed form, and must be computed using a double Monte Carlo loop. However, conditioning on Y_{η,τ} can be approximated simply, as follows. First, samples of Y_{η,τ} are obtained recursively, as in section 2.3. Then, conditioning on each of those samples and computing the new conditional variance as in section 2.3 allows us to compute eq. 8.
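The double Monte Carlo loop can be sketched generically. Here `sample_obs` and `var_given` stand in for, respectively, fantasising an observation set at the candidate and recomputing the posterior variance after conditioning on it — both would be provided by the sampling-based model of section 2.3; all names are illustrative.

```python
import random
import statistics

def expected_posterior_variance(candidate, reference_points, sample_obs,
                                var_given, n_outer=50, seed=0):
    """Double Monte Carlo estimate of the criterion of eq. 8: the expected
    (over fantasised observations at `candidate`) mean posterior variance
    of the final trace at the reference points."""
    rng = random.Random(seed)
    outer = []
    for _ in range(n_outer):                  # outer loop: fantasised data
        obs = sample_obs(rng, candidate)
        vs = [var_given(ref, obs) for ref in reference_points]  # inner loop
        outer.append(statistics.fmean(vs))
    return statistics.fmean(outer)

def select_next(candidates, reference_points, sample_obs, var_given):
    """Pick the (learning rate, task) candidate minimising the criterion."""
    return min(candidates,
               key=lambda c: expected_posterior_variance(
                   c, reference_points, sample_obs, var_given))
```

The candidate that most shrinks the expected uncertainty at the reference points is run next, after which the reference points themselves are recomputed.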
Once η_{next} and τ_{next} are obtained, the corresponding experiment is run and the model is updated, which in turn leads to a new set of reference points, etc. Once the budget is exhausted, the final set of schedules may be chosen using a different α (either 0.5 for a risk-neutral solution or a smaller value for a risk-averse one). The pseudocode of the strategy is given in alg. 2.
4.3 Experiment: Multi-Task Setting with SVGP on MNIST
To illustrate our strategy, we consider the following setup. We use the MNIST digit dataset, split into five binary classification problems (0 against 1, 2 against 3 and so on until 8 against 9). Our goal is to fit a sparse GP classification model to each of these datasets, by maximising its ELBO using the Adam algorithm.
We choose here a fixed number of Adam iterations and of learning rate values per schedule. The learning rates are constrained to lie between lower and upper bounds. A set of initial runs is performed with learning rates chosen by Latin hypercube sampling (in logarithmic space), and further runs are added sequentially according to our TruVar strategy. The GP model for g uses a Matérn kernel, an exponential link function, a sparse set of inducing points and a low-dimensional latent space of tasks. To ease the resolution of eq. 7, we follow a greedy approach by searching for one learning rate value at a time, starting from η_1, which is in line with the Markov assumption of the model.
Figure 4 shows the learning rate profiles and corresponding runs obtained after running the procedure, as well as some predictions for randomly chosen learning rates. We first observe the flexibility of our model, which is able to capture complex traces while providing relevant uncertainty estimates (top plots). Then, for all tasks, the learning rates found reach the upper bound at first and decrease when the trace reaches a plateau. The optimal way of decreasing the learning rate depends on the task. One can see that the predictions are uncertain, but only on one side (some samples largely overestimate the true trace, but median ones are quite close to the truth).
4.4 Extension: Warm-Starting SGD for a New Task
Now, assume that a new task τ_{new} is added to the current set 𝒯. Unfortunately, our model cannot be used directly, as the value of the corresponding latent variable w_{new} is unknown. A first solution is to find a "universal" tuning: this can be obtained by maximising the prediction of the final trace averaged over all possible values for w_{new}. This average can be calculated by Monte Carlo, assuming a probability measure for w_{new}, for instance the Lebesgue measure over the convex hull of the existing task embeddings.
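The "universal" tuning can be sketched as a Monte Carlo average over the latent task variable. Here `predict_final(eta, w)` stands in for the model's posterior mean of the final trace for embedding w; the toy predictor and all names are illustrative.

```python
import random
import statistics

def universal_rate(candidates, w_samples, predict_final):
    """Pick the learning rate maximising the predicted final trace averaged
    over possible latent-task locations w (Monte Carlo over a chosen measure
    on w)."""
    def averaged(eta):
        return statistics.fmean(predict_final(eta, w) for w in w_samples)
    return max(candidates, key=averaged)

# Illustrative setup: w sampled uniformly over the hull of existing task
# embeddings; a toy predictor whose optimum shifts with w.
rng = random.Random(0)
w_samples = [rng.uniform(0.2, 0.8) for _ in range(500)]
predict = lambda eta, w: -(eta - w) ** 2
best = universal_rate([0.1 * i for i in range(11)], w_samples, predict)
```

The selected rate is the one that performs best on average over the tasks the model deems plausible, before any data on the new task is collected.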
Alternatively, one might want to spend some computing budget to learn w_{new} and then achieve a better learning rate tuning. Assuming again a measure for w_{new}, an informative experiment would correspond to a learning rate for which the proportion of the variance of the predictor due to the uncertainty on w_{new} is maximal. Averaging over all time steps, we define our sampling strategy (with again a criterion computable by Monte Carlo) as:

η_{next} = argmax_η (1/L) Σ_{i=1}^{L} var_{w_{new}}( E[ y(t_i) | η, w_{new} ] ) / var( y(t_i) | η ),

with the outer variance and expectation taken with respect to the measure on w_{new}. Note that once w_{new} is estimated, it is possible to apply alg. 1, exploiting the flexibility of dynamic tuning while leveraging information from previous runs. A more integrated approach would use alg. 1 directly while measuring and accounting for uncertainty in w_{new}; this is left for future work.
5 Concluding Comments
We proposed a probabilistic model for the traces of optimisers, whose input parameters (the choice of learning rate values) correspond to particular periods of the optimisation. This allowed us to define a versatile framework to tackle a set of problems: tuning the optimiser for a set of similar tasks, warm-starting it for a new task, or online adaptation of the learning rate for a cold-started run.
A convergence proof for the multi-task strategy has not been considered here. We believe that the results of [6] may be adapted to our case: this is left for future work. Other possible extensions are to apply our framework to other optimisers: for instance, to control the population sizes of evolutionary strategy algorithms such as CMA-ES [13], for which adaptation mechanisms have been found promising [26]. Finally, additional efficiency could be achieved by leveraging the use of varying dataset sizes, in the spirit of [18, 10] for instance.
References
 Andrychowicz et al. [2016] Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M.W., Pfau, D., Schaul, T., Shillingford, B., De Freitas, N.: Learning to learn by gradient descent by gradient descent. In: Advances in Neural Information Processing Systems. pp. 3981–3989 (2016)
 Baydin et al. [2017] Baydin, A.G., Cornish, R., Rubio, D.M., Schmidt, M., Wood, F.: Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782 (2017)
 Bengio [2012] Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In: Neural networks: Tricks of the trade, pp. 437–478. Springer (2012)
 Bergstra et al. [2011] Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyperparameter optimization. In: Advances in neural information processing systems. pp. 2546–2554 (2011)
 Bogunovic et al. [2018] Bogunovic, I., Scarlett, J., Jegelka, S., Cevher, V.: Adversarially robust optimization with Gaussian processes. In: Advances in Neural Information Processing Systems. pp. 5760–5770 (2018)
 Bogunovic et al. [2016] Bogunovic, I., Scarlett, J., Krause, A., Cevher, V.: Truncated variance reduction: A unified approach to Bayesian optimization and levelset estimation. In: Advances in neural information processing systems. pp. 1507–1515 (2016)
 Chollet [2009] Chollet, F.: Keras implementation of ResNet for CIFAR. https://keras.io/examples/cifar10_resnet/ (2009)

 Domhan et al. [2015] Domhan, T., Springenberg, J.T., Hutter, F.: Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
 Duchi et al. [2011] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(Jul), 2121–2159 (2011)
 Falkner et al. [2018] Falkner, S., Klein, A., Hutter, F.: BOHB: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774 (2018)
 Ginsbourger et al. [2014] Ginsbourger, D., Baccou, J., Chevalier, C., Perales, F., Garland, N., Monerie, Y.: Bayesian adaptive reconstruction of profile optima and optimizers. SIAM/ASA Journal on Uncertainty Quantification 2(1), 490–510 (2014)
 Gugger and Howard [2018] Gugger, S., Howard, J.: AdamW and super-convergence is now the fastest way to train neural nets (Jul 2018), https://www.fast.ai/2018/07/02/adamweightdecay/
 Hansen and Ostermeier [2001] Hansen, N., Ostermeier, A.: Completely derandomized selfadaptation in evolution strategies. Evolutionary computation 9(2), 159–195 (2001)

 He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
 Hensman et al. [2015] Hensman, J., Matthews, A.G.d.G., Ghahramani, Z.: Scalable variational Gaussian process classification. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (2015)
 Hoffman et al. [2013] Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic Variational Inference. Journal of Machine Learning Research (2013)
 Kaufmann et al. [2012] Kaufmann, E., Cappé, O., Garivier, A.: On Bayesian upper confidence bounds for bandit problems. In: Artificial intelligence and statistics. pp. 592–600 (2012)
 Klein et al. [2017a] Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast Bayesian optimization of machine learning hyperparameters on large datasets. In: International Conference on Artificial Intelligence and Statistics (AISTATS 2017). pp. 528–536. PMLR (2017a)
 Klein et al. [2017b] Klein, A., Falkner, S., Springenberg, J.T., Hutter, F.: Learning curve prediction with Bayesian neural networks. In: ICLR (2017b)
 Krizhevsky and Hinton [2009] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. rep., Citeseer (2009)
 Leontaritis and Billings [1985] Leontaritis, I., Billings, S.A.: Inputoutput parametric models for nonlinear systems part ii: stochastic nonlinear systems. International journal of control 41(2), 329–344 (1985)
 Li et al. [2018] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research 18(185), 1–52 (2018)
 Loshchilov and Hutter [2019] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
 Matthews et al. [2016] Matthews, A.G.d.G., Hensman, J., Turner, R., Ghahramani, Z.: On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research 51, 231–239 (2016)

 Matthews et al. [2017] Matthews, A.G.d.G., Van Der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., Hensman, J.: GPflow: A Gaussian process library using TensorFlow. The Journal of Machine Learning Research 18(1), 1299–1304 (2017)
 Nishida and Akimoto [2018] Nishida, K., Akimoto, Y.: PSA-CMA-ES: CMA-ES with population size adaptation. In: Proceedings of the Genetic and Evolutionary Computation Conference. pp. 865–872 (2018)
 Pearce and Branke [2018] Pearce, M., Branke, J.: Continuous multitask Bayesian optimisation with correlation. European Journal of Operational Research 270(3), 1074–1085 (2018)
 Picheny and Ginsbourger [2013] Picheny, V., Ginsbourger, D.: A nonstationary space-time Gaussian Process model for partially converged simulations. SIAM/ASA Journal on Uncertainty Quantification 1(1), 57–78 (2013)
 Poloczek et al. [2016] Poloczek, M., Wang, J., Frazier, P.I.: Warm starting Bayesian optimization. In: Proceedings of the 2016 Winter Simulation Conference. pp. 770–781. IEEE Press (2016)
 Reddi et al. [2018] Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond. In: ICLR (2018)
 Saul et al. [2016] Saul, A.D., Hensman, J., Vehtari, A., Lawrence, N.D., et al.: Chained Gaussian Processes. In: AISTATS. pp. 1431–1440 (2016)
 Shahriari et al. [2016] Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., De Freitas, N.: Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104(1), 148–175 (2016)
 Smith [2017] Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 464–472. IEEE (2017)
 Smith and Topin [2019] Smith, L.N., Topin, N.: Super-convergence: Very fast training of neural networks using large learning rates. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications. vol. 11006, p. 1100612. International Society for Optics and Photonics (2019)
 Snoek et al. [2012] Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems. pp. 2951–2959 (2012)
 Srinivas et al. [2010] Srinivas, N., Krause, A., Kakade, S., Seeger, M.: Gaussian Process optimization in the bandit setting: no regret and experimental design. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. pp. 1015–1022. Omnipress (2010)
 Swersky et al. [2013] Swersky, K., Snoek, J., Adams, R.P.: Multitask Bayesian optimization. In: Advances in neural information processing systems. pp. 2004–2012 (2013)
 Swersky et al. [2014] Swersky, K., Snoek, J., Adams, R.P.: Freezethaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014)
 Tieleman and Hinton [2012] Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4(2), 26–31 (2012)
 Titsias [2009] Titsias, M.: Variational Learning of Inducing Variables in Sparse Gaussian Processes. Artificial Intelligence and Statistics (2009)
 van der Wilk et al. [2020] van der Wilk, M., Dutordoir, V., John, S., Artemev, A., Adam, V., Hensman, J.: A framework for interdomain and multioutput Gaussian processes. arXiv:2003.01115 (2020), https://arxiv.org/abs/2003.01115
 Wilson et al. [2018] Wilson, J., Hutter, F., Deisenroth, M.: Maximizing acquisition functions for Bayesian optimization. In: Advances in Neural Information Processing Systems. pp. 9884–9895 (2018)