1 Introduction
Bayesian optimization (BO) is a well-established methodology to optimize expensive black-box functions Shahriari2016. It relies on a probabilistic model of an unknown target function $f$, which is repeatedly queried until one runs out of budget (e.g., time). The queries consist of evaluations of $f$ at hyperparameter configurations $x$ selected according to an explore-exploit trade-off criterion (e.g., expected improvement). The hyperparameter configuration corresponding to the best query is then returned. One popular approach is to impose a Gaussian process (GP) prior over $f$ and, in light of the observed queries, to compute the posterior GP. The GP model maintains posterior mean and posterior variance functions as required by the explore-exploit criterion.
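As an illustration of the explore-exploit criterion mentioned above, the following minimal sketch computes expected improvement for a minimization problem from a posterior mean and standard deviation. The function name `expected_improvement` and its arguments are illustrative assumptions, not part of the original text; only the closed form of expected improvement under a Gaussian posterior is standard.

```python
import math


def expected_improvement(mean, std, best):
    """Expected improvement (minimization) under a Gaussian posterior N(mean, std**2).

    `best` is the lowest value observed so far. All names are illustrative.
    """
    if std <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(best - mean, 0.0)
    u = (best - mean) / std
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))         # standard normal CDF
    return std * (u * cdf + pdf)


# A candidate with the same mean but larger uncertainty gets a larger score,
# which is how the criterion rewards exploration.
ei_certain = expected_improvement(mean=1.0, std=1.0, best=0.0)
ei_uncertain = expected_improvement(mean=1.0, std=2.0, best=0.0)
```

The next query would typically be the configuration maximizing this score over the search space.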
Despite their flexibility, GPs scale cubically with the number of observations Rasmussen2006. Hence, they cannot be applied in situations where $f$ has been or can be queried a very large number of times. In this work, we are interested in such a setting as we would like to warm start BO by, e.g., transferring information obtained from previous runs of the BO routine, or to learn across similar problems (e.g., a given classifier applied across different datasets Bardenet2013; Yogatama2014; Feurer2015; Fusi2017; Golovin2017), which we will call tasks. To tackle the scalability limitation of GPs and ease transfer learning in BO, we propose to fall back to adaptive Bayesian linear regression (BLR) Bishop2006, ABLR for short, which scales linearly with the number of observations and cubically in the dimension of the basis function expansion. Sparse GPs McIntire2016 and multi-task GPs Swersky2013 have been developed, respectively, to scale up GPs and to make them suitable for multi-task learning. ABLR offers a simple alternative combining the strengths of these two approaches.
Our main contribution is to learn a suitable representation of a variety of tasks with a feedforward neural network (NN), provided it is fed with enough data. We consider conditionally independent task-specific BLR models, which share a NN that learns the basis expansion. We compare to random Fourier basis expansions Rahimi2007 as they have already been successfully applied to BO HernandezLobato2017; Jenatton2017b. While more scalable, they are less flexible in learning a useful representation.
Closest to our work is Snoek2015, where BO is scaled up by replacing the GP with an ABLR model. The authors consider a single-task setting, with a two-step inference procedure. First, they train the NN with a squared loss at the output layer to learn a maximum a posteriori estimate of the NN parameters. This requires evaluating a number of candidate queries to feed the NN training algorithm. They then fix the network architecture and replace the output layer by a BLR layer to run the BO routine. Instead, we jointly learn the basis expansion, that is, the NN, and the task-specific BLR models in one go. Our objective function corresponds to a sum of log-marginal likelihood terms, each term corresponding to one of the underlying tasks. As a result, in contrast with Snoek2015, who use the squared loss, we can handle heterogeneous signals, each having its own marginal likelihood. In this sense, we borrow the strength of the likelihood of multi-output GPs while maintaining the scalability of Snoek2015.
Another related model is presented in Springenberg2016. The authors propose Bayesian NNs to sample from the posterior over $f$, and add task-specific embeddings to the NN inputs to handle multiple tasks. While allowing for a principled treatment of uncertainties, fully Bayesian NNs are computationally more expensive and their training can be sensitive to the stochastic gradient MCMC hyperparameters. Our model allows for simpler inference and is more suitable for large-scale deployment.
2 Multiple Adaptive Bayesian Linear Regression Model
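Consider $T$ tasks defined by a set of black-box target functions $\{f_t(\cdot)\}_{t=1}^T$ we would like to optimize. Let $\mathcal{D}_t = \{(x_t^n, y_t^n)\}_{n=1}^{N_t}$ be the set of pairs of inputs and responses associated to task $t$. We further denote the stacked response vector associated to task $t$ by $y_t \in \mathbb{R}^{N_t}$ and the corresponding stacked matrix of inputs by $X_t \in \mathbb{R}^{N_t \times P}$. We assume the task responses are drawn from independent BLR models conditioned on the shared feature map $\phi_z(x): \mathbb{R}^P \to \mathbb{R}^D$, which is parametrized by $z$, and the residual noise parameters $\{\beta_t\}_{t=1}^T$:
$$P(y_t \mid w_t, z, \beta_t) = \mathcal{N}\big(\Phi_z(X_t)\, w_t,\ \beta_t^{-1} I_{N_t}\big),$$
where $\Phi_z(X_t) = [\phi_z(x_t^n)]_n \in \mathbb{R}^{N_t \times D}$ is the feature matrix, $w_t \in \mathbb{R}^D$ a weight vector, and $\beta_t > 0$ a precision (i.e., inverse variance). To complete the model, we impose a zero-mean isotropic Gaussian prior on $w_t$ and denote its precision by $\alpha_t$. In the remainder, we will use $\Phi_t$ for $\Phi_z(X_t)$.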
2.1 Posterior inference
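The posterior distribution over the weight parameters $w_t$ is analytically tractable in this model, as well as the predictive distribution, both of which are multivariate Gaussian distributions Bishop2006. The predictive mean and the predictive variance at a new input $x_t^*$ are respectively given by
$$\mu_t(x_t^*) = \frac{\beta_t}{\alpha_t}\, \phi_z(x_t^*)^\top K_t^{-1} \Phi_t^\top y_t = \frac{\beta_t}{\alpha_t}\, e_t^\top L_t^{-1} \phi_z(x_t^*), \tag{1}$$
$$\sigma_t^2(x_t^*) = \frac{1}{\alpha_t}\, \phi_z(x_t^*)^\top K_t^{-1} \phi_z(x_t^*) + \frac{1}{\beta_t} = \frac{1}{\alpha_t}\, \lVert L_t^{-1} \phi_z(x_t^*) \rVert^2 + \frac{1}{\beta_t}, \tag{2}$$
where $K_t = \frac{\beta_t}{\alpha_t} \Phi_t^\top \Phi_t + I_D$, with $\Phi_t$ the feature matrix of task $t$ and $\alpha_t$, $\beta_t$ the prior and noise precisions. The right-hand side reformulations in (1) and (2) ensure numerical stability. They are obtained by decomposing $K_t$ in terms of its Cholesky factor $L_t$, so that $K_t = L_t L_t^\top$ and $K_t^{-1} = L_t^{-\top} L_t^{-1}$, with $e_t = L_t^{-1} \Phi_t^\top y_t$.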
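The Cholesky-based computation of the predictive mean and variance in Section 2.1 can be sketched in NumPy as follows. This is a minimal illustration of the standard BLR predictive distribution for a single task with an isotropic Gaussian prior, not the authors' implementation; the function name `blr_predict` and its argument names are assumptions made for this example.

```python
import numpy as np


def blr_predict(Phi, y, phi_star, alpha, beta):
    """Predictive mean and variance of Bayesian linear regression.

    Phi: (N, D) feature matrix, y: (N,) responses, phi_star: (D,) features of
    the new input, alpha: prior precision, beta: noise precision.
    """
    D = Phi.shape[1]
    K = (beta / alpha) * Phi.T @ Phi + np.eye(D)  # costs O(N D^2)
    L = np.linalg.cholesky(K)                     # K = L L^T, costs O(D^3)
    # A general solver keeps the sketch dependency-free; a triangular solver
    # would exploit the structure of L.
    e = np.linalg.solve(L, Phi.T @ y)             # e = L^{-1} Phi^T y
    v = np.linalg.solve(L, phi_star)              # v = L^{-1} phi(x*)
    mean = (beta / alpha) * (e @ v)
    var = (v @ v) / alpha + 1.0 / beta
    return mean, var


# Usage on synthetic data (values are arbitrary).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 8))
y = rng.normal(size=100)
mean, var = blr_predict(Phi, y, rng.normal(size=8), alpha=1.0, beta=4.0)
```

The comments on cost reflect the scaling stated in the introduction: linear in the number of observations and cubic in the dimension of the basis expansion.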
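Each task-specific BLR depends on the hyperparameters $\alpha_t$ and $\beta_t$, as well as the set of hyperparameters $z$ defining the feature map. In particular, $z$ will represent the weights of a NN (see Section 2.2). We adopt an empirical Bayes approach and jointly learn all these hyperparameters by optimizing the marginal likelihood of the data MacKay2003. More specifically, we integrate out the model parameters $\{w_t\}_{t=1}^T$ and minimize the sum of the negative log-marginal likelihoods of each task, which reads, up to an additive constant,
$$\rho\big(z, \{\alpha_t, \beta_t\}_{t=1}^T\big) = -\sum_{t=1}^T \Big[ \frac{N_t}{2} \log \beta_t - \frac{\beta_t}{2} \Big( \lVert y_t \rVert^2 - \frac{\beta_t}{\alpha_t} \lVert e_t \rVert^2 \Big) - \sum_{i=1}^{D} \log [L_t]_{ii} \Big], \tag{3}$$
with $N_t$ the number of observations of task $t$ and, as in (1) and (2), $L_t$ the Cholesky factor of $K_t$ and $e_t = L_t^{-1} \Phi_t^\top y_t$.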
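To make the objective (3) concrete, the sketch below evaluates the negative log-marginal likelihood of each task through the Cholesky factor and sums the terms over tasks. It follows the standard BLR evidence and keeps the constant $\frac{N_t}{2}\log 2\pi$, so that each term equals the negative log density of $y_t$ under $\mathcal{N}(0, \alpha_t^{-1}\Phi_t\Phi_t^\top + \beta_t^{-1} I)$. The function names and the list-of-pairs layout for the tasks are assumptions made for illustration.

```python
import numpy as np


def neg_log_marginal_likelihood(Phi, y, alpha, beta):
    """Negative log marginal likelihood of one BLR task (constant included)."""
    N, D = Phi.shape
    K = (beta / alpha) * Phi.T @ Phi + np.eye(D)
    L = np.linalg.cholesky(K)
    e = np.linalg.solve(L, Phi.T @ y)  # e = L^{-1} Phi^T y
    log_evidence = (
        0.5 * N * np.log(beta)
        - 0.5 * beta * (y @ y - (beta / alpha) * (e @ e))
        - np.sum(np.log(np.diag(L)))   # equals 0.5 * log det(K)
        - 0.5 * N * np.log(2.0 * np.pi)
    )
    return -log_evidence


def multi_task_objective(tasks, alphas, betas):
    """Sum of per-task terms; `tasks` is a list of (Phi_t, y_t) pairs."""
    return sum(
        neg_log_marginal_likelihood(Phi, y, a, b)
        for (Phi, y), a, b in zip(tasks, alphas, betas)
    )
```

In the full model the feature matrices depend on the shared parameters of the feature map, so this sum would be minimized jointly over those parameters and the per-task precisions. Tasks may have different numbers of observations, since each term only involves its own data.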
2.2 Learning a joint representation with feedforward neural networks
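We learn the nonlinear map $\phi_z$ with a feedforward NN. For some input vector $x$, we consider the following $L$-layer feedforward transformation parametrized by the weight matrices $\{Z_l\}_{l=1}^L$:
$$\phi_z(x) = a_L\big(Z_L\, a_{L-1}(\cdots Z_2\, a_1(Z_1 x) \cdots)\big).$$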
The parameter vector $z$ is a flattened version of the stacked weight matrices. In practice, the activations $a_l$ are set as tanh functions and $L = 3$ (as in Snoek2015), but any more complex NN architecture can be used. Interestingly, we depart from Snoek2015 regarding the optimization of $z$. While their squared-loss formulation naturally lends itself to stochastic gradient descent (SGD), in a regime with moderate values of $T$, the number of tasks (typically several tens in our settings), the evidence (3) is better suited to batch optimization. In our experiments, L-BFGS Byrd1995 worked well. An important by-product of this choice is that, unlike Snoek2015, we need not find hyperparameters for SGD that should work robustly across a broad set of BO problems.
2.3 Random Fourier representation
An alternative approach is to use random kitchen sinks (RKS) for a random Fourier basis expansion Rahimi2007. Let $U \in \mathbb{R}^{D \times P}$ and $b \in \mathbb{R}^D$ be such that $U_{ij} \sim \mathcal{N}(0, 1)$ and $b_i \sim \mathcal{U}([0, 2\pi])$. For a vector $x \in \mathbb{R}^P$, RKS defines the mapping $\phi_z(x) = \sqrt{2/D}\, \cos\big(\tfrac{1}{\sigma} U x + b\big)$, where $\sigma > 0$ is the bandwidth of the approximated RBF kernel. The parameter vector $z$ here consists of $\sigma$ only. Unlike the NN, the RKS representation contains only one hyperparameter to optimize ($U$ and $b$ are randomly generated). This reduces the complexity of learning the map, but the resulting representation is less expressive, as we show in the following section. To optimize $\sigma$, we proceed as for the weights of the NN (see Section 2.2).
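The two feature maps of Sections 2.2 and 2.3 can be sketched as follows. The NN map stacks tanh layers without bias terms, which is an assumption of this sketch, and the RKS map uses the standard random Fourier construction for an RBF kernel. The function names, the layer sizes, and the test values are illustrative and not taken from the original implementation.

```python
import numpy as np


def nn_features(X, weights):
    """Feedforward map with tanh activations.

    X: (N, P) inputs; weights: list of matrices, each of shape (units_out, units_in).
    """
    H = X
    for Z in weights:
        H = np.tanh(H @ Z.T)
    return H


def rks_features(X, U, b, sigma):
    """Random Fourier features approximating an RBF kernel of bandwidth sigma."""
    D = U.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ U.T / sigma + b)


rng = np.random.default_rng(0)
P, D = 3, 5000
X = rng.normal(size=(4, P))

# Three tanh layers of 50 units each.
weights = [rng.normal(size=(50, P)), rng.normal(size=(50, 50)), rng.normal(size=(50, 50))]
Phi_nn = nn_features(X, weights)

# U and b are drawn once and kept fixed; only sigma would be optimized.
U = rng.normal(size=(D, P))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)
sigma = 1.5
Phi_rks = rks_features(X, U, b, sigma)

# Inner products of the random features approximate the RBF kernel.
approx_kernel = Phi_rks @ Phi_rks.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
exact_kernel = np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Either feature matrix can be plugged into the BLR computations of Section 2.1; the difference lies in how many parameters of the map are learned.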
3 Results
The following subsections illustrate the benefits of multiple ABLR in a variety of settings. Sections 3.1 and 3.2 evaluate its ability to gather knowledge from multiple tasks, respectively on synthetic and OpenML data Vanschoren2014 . Section 3.3 shows how it can also be applied to exploit information from multiple heterogeneous signals. By doing so, we intend to learn more meaningful representations, which can be leveraged to accelerate the hyperparameter optimization. We could further generalize the model to handle multiple tasks and multiple signals at the same time, but leave this for future work.
We implemented multiple ABLR in GPyOpt Gpyopt2016, with a backend in MxNet Chen2015, fully benefiting from the symbolic computation to obtain the derivatives required to optimize the model hyperparameters. In particular, we leverage the backward operator for the Cholesky decomposition Seeger2017. Interestingly, this allows us to jointly optimize all the model hyperparameters and perform exact BLR on top of an arbitrarily complex NN.
3.1 Transfer learning across parametrized quadratic functions
We first consider a set of $T = 30$ tasks. A task $t$ takes the form of a parametrized 3-dimensional quadratic function, and we call its triplet of parameters the context associated to task $t$. In a real-world setting, the contextual information would correspond to metadata, e.g., the data set size or its dimensionality, as we shall see in the next section. We generated the different tasks by drawing these triplets uniformly at random, and evaluated ABLR in a leave-one-task-out fashion. Specifically, we optimized each one of the 30 tasks after warm-starting the optimization with 10 observations for the remaining 29 tasks. We compared single-task ABLR-based and standard GP-based hyperparameter optimization (HPO), both denoted by plain, with their transfer learning counterparts, both denoted by transfer. We perform transfer learning with standard GPs by stacking all observations together and augmenting the input space with the corresponding contextual information Krause2011. For ABLR with transfer, we took our approach, i.e., one marginal likelihood per task, with and without the contextual information.
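The leave-one-task-out protocol and the context augmentation used for the GP baseline can be sketched as follows. The exact quadratic family and its parameter range are not recoverable from the text, so the `quadratic` function, the coefficient range, the input range, and the choice of 10 warm-start points per task are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, N_WARM = 30, 3, 10  # tasks, input dimension, warm-start points per task (assumed)

# Hypothetical contexts: one triplet per task, drawn uniformly at random.
contexts = rng.uniform(0.1, 10.0, size=(T, 3))


def quadratic(x, context):
    """Hypothetical 3-dimensional quadratic parametrized by a triplet (a, b, c)."""
    a, b, c = context
    return 0.5 * a * np.sum(x ** 2, axis=-1) + b * np.sum(x, axis=-1) + c


# Warm-start observations for every task.
X = rng.uniform(-5.0, 5.0, size=(T, N_WARM, P))
Y = np.stack([quadratic(X[t], contexts[t]) for t in range(T)])


def leave_one_task_out(held_out):
    """Warm-start data from every task except `held_out`, kept per task (ABLR)."""
    return [(X[t], Y[t]) for t in range(T) if t != held_out]


def stack_with_context(held_out):
    """One stacked design matrix with the context appended to each input (GP)."""
    others = [t for t in range(T) if t != held_out]
    rows = [np.hstack([X[t], np.tile(contexts[t], (N_WARM, 1))]) for t in others]
    return np.vstack(rows), np.concatenate([Y[t] for t in others])
```

Keeping the data per task matches the one-marginal-likelihood-per-task formulation, whereas the stacked matrix corresponds to the single GP over inputs augmented with context.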
Figure 1(left) shows the current best minimum at each of 50 iterations of HPO. The results are averaged over 10 random initializations and 30 leave-one-task-out runs. When we exploit the information from the related tasks, HPO converges to the minimum much faster than plain ABLR or plain GP. In addition, the RKS representation performs slightly worse than the NN with 3 hidden layers of 50 units each (as advocated in Snoek2015). Including the contextual information did not yield clear improvements; hence, for simplicity, we do not use it in the following experiments. The GP-based HPO with transfer performs slightly better on this toy example, but is not applicable in large-scale settings, such as the one in the next section. Figure 1(right) compares the compute time of HPO with GP and NN-based ABLR, suggesting that the linear scaling of the latter with the number of evaluations allows us to apply ABLR in the large-scale setting. The RKS basis expansion further decreases the computational time (at the expense of performance).
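The "current best minimum" curves described above are running minima averaged over runs. A minimal sketch, assuming the raw evaluations are stored as an array with one row per run and one column per iteration (a layout chosen here for illustration):

```python
import numpy as np


def best_so_far(values):
    """Running minimum along the last axis (one row per HPO run)."""
    return np.minimum.accumulate(np.asarray(values, dtype=float), axis=-1)


def average_curve(runs):
    """Mean best-so-far curve over runs; `runs` has shape (n_runs, n_iterations)."""
    return best_so_far(runs).mean(axis=0)
```

With 10 initializations and 30 held-out tasks, `runs` would have 300 rows and 50 columns under this layout.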
3.2 Transfer learning across OpenML blackbox functions
We consider the OpenML platform Vanschoren2014, which contains a large number of evaluations for a wide range of machine learning algorithms (referred to as flows in OpenML) over different datasets. In particular, we focus on a random forest model (flow_id 6794) and apply ABLR to optimize its hyperparameters. We filtered the most evaluated datasets for this flow_id, with the number of evaluations varying from one dataset to another. In this setting, the linear scaling of ABLR is particularly appealing. As previously, we apply a leave-one-task-out protocol, where each task stands for a dataset. For the left-out task being optimized, we use the surrogate modeling approach from Eggensperger2012. We compare GP plain and ABLR plain, which use evaluations of the left-out task only, with ABLR transfer, which is warm-started with the evaluations of all the other tasks. The results are reported in Figure 2(left), showing that ABLR is able to gather knowledge from different datasets to speed up the convergence.
3.3 Tuning of feedforward neural networks from heterogeneous signals
Finally, we consider the tuning of feedforward NNs for binary classification. We show that our formulation can be seamlessly applied to the orthogonal problem of modeling multiple output signals, possibly of heterogeneous nature, at once. Here, we optimize for the validation accuracy, using the training accuracy and CPU time as side information. Such side signals “come for free” while training machine learning algorithms, but are in general not exploited for efficient HPO. Multi-output GPs inherit the cubic scaling of GPs in the number of observations, whereas ABLR scales linearly in it. The NN hyperparameters to tune are the number of hidden layers, the number of units, the amount of regularization, the learning rate of Adam Kingma2014, and the number of epochs. Figure 2(right) shows the results, which are averaged over 10 random initializations and 5 datasets (w8a, sonar, w1a, phishing, australian) from LIBSVM Chang2011. It can be observed that incorporating side signals in addition to the target signal, namely the validation accuracy of the NN classifier, speeds up the ABLR-based HPO.
References
[1] GPyOpt: A Bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
[2] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In Proceedings of the International Conference on Machine Learning (ICML), pages 199–207, 2013.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
[4] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.
[7] K. Eggensperger, F. Hutter, H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 1114–1120, 2012.
[8] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[9] N. Fusi and H. M. Elibol. Probabilistic matrix factorization for automated machine learning. Technical report, preprint arXiv:1705.05355, 2017.
[10] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1487–1495, 2017.
[11] J. M. Hernández-Lobato, J. Requeima, E. O. Pyzer-Knapp, and A. Aspuru-Guzik. Parallel and distributed Thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[12] R. Jenatton, C. Archambeau, J. Gonzales, and M. Seeger. Bayesian optimization with tree-structured dependencies. In Proceedings of the International Conference on Machine Learning (ICML), 2017.
[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. Technical report, preprint arXiv:1412.6980, 2014.
[14] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2447–2455, 2011.
[15] D. J. C. Mackay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[16] M. McIntire, D. Ratner, and S. Ermon. Sparse Gaussian processes for Bayesian optimization. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2016.
[17] A. Rahimi, B. Recht, et al. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), volume 3, page 5, 2007.
[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] M. Seeger, A. Hetzel, Z. Dai, and N. D. Lawrence. Auto-differentiating linear algebra. Technical report, preprint arXiv:1710.08717, 2017.
[20] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
[21] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams. Scalable Bayesian optimization using deep neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 2171–2180, 2015.
[22] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 4134–4142, 2016.
[23] K. Swersky, J. Snoek, and R. P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2004–2012, 2013.
[24] J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.
[25] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1077–1085, 2014.