1 Introduction
The increasing complexity of machine learning algorithms has driven a large amount of research in the area of hyperparameter optimization (HO); see, e.g., (Hutter et al., 2015) for a review. The core idea is relatively simple: given a measure of interest (e.g. the misclassification error), HO methods use a validation set to construct a response function of the hyperparameters (such as the average loss on the validation set) and explore the hyperparameter space to seek an optimum.
Early approaches based on grid search quickly become impractical as the number of hyperparameters grows and are even outperformed by random search (Bergstra & Bengio, 2012). Given the high computational cost of evaluating the response function, Bayesian optimization approaches provide a natural framework and have been extensively studied in this context (Snoek et al., 2012; Swersky et al., 2013; Snoek et al., 2015). Related and faster sequential model-based optimization methods have been proposed using random forests (Hutter et al., 2011) and tree Parzen estimators (Bergstra et al., 2011), scaling up to a few hundred hyperparameters (Bergstra et al., 2013).
In this paper, we follow an alternative direction, where gradient-based algorithms are used to optimize the performance on a validation set with respect to the hyperparameters (Bengio, 2000; Larsen et al., 1996). In this setting, the validation error should be evaluated at a minimizer of the training objective. However, in many current learning systems such as deep learning, the minimizer is only approximate.
Domke (2012) specifically considered running an iterative algorithm, like gradient descent or momentum, for a given number of steps, and subsequently computing the gradient of the validation error by a back-optimization algorithm. Maclaurin et al. (2015) considered reverse-mode differentiation of the response function. They suggested the idea of reversing parameter updates to achieve space efficiency, proposing an approximation capable of addressing the associated loss of information due to finite-precision arithmetic. Pedregosa (2016) proposed the use of inexact gradients, allowing hyperparameters to be updated before reaching the minimizer of the training objective. Both (Maclaurin et al., 2015) and (Pedregosa, 2016) managed to optimize a number of hyperparameters in the order of one thousand.
In this paper, we illustrate two alternative approaches to compute the hypergradient (i.e., the gradient of the response function), which have different trade-offs in terms of running time and space requirements. One approach is based on a Lagrangian formulation associated with the parameter optimization dynamics. It encompasses the reverse-mode differentiation (RMD) approach used by Maclaurin et al. (2015), where the dynamics corresponds to stochastic gradient descent with momentum. We do not assume reversible parameter optimization dynamics. A well-known drawback of RMD is its space complexity: we need to store the whole trajectory of training iterates in order to compute the hypergradient. An alternative approach that we consider overcomes this problem by computing the hypergradient in forward-mode, and it is efficient when the number of hyperparameters is much smaller than the number of parameters. To the best of our knowledge, the forward-mode has not been studied before in this context.
As we shall see, these two approaches have a direct correspondence to two classic alternative ways of computing gradients for recurrent neural networks (RNN) (Pearlmutter, 1995): the Lagrangian (reverse) way corresponds to back-propagation through time (Werbos, 1990), while the forward way corresponds to real-time recurrent learning (RTRL) (Williams & Zipser, 1989). As RTRL allows one to update parameters after each time step, the forward approach is suitable for real-time hyperparameter updates, which may significantly speed up the overall hyperparameter optimization procedure in the presence of large datasets. We give experimental evidence that the real-time approach is efficient enough to allow for the automatic tuning of crucial hyperparameters in a deep learning model. In our experiments, we also explore constrained hyperparameter optimization, showing that it can be used effectively to detect noisy examples and to discover the relationships between different learning tasks.
The paper is organized in the following manner. In Section 2 we introduce the problem under study. In Section 3.1 we derive the reverse-mode computation. In Section 3.2 we present the forward-mode computation of the hypergradient, and in Section 3.3 we introduce the idea of real-time hyperparameter updates. In Section 4 we discuss the time and space complexity of these methods. In Section 5 we present empirical results with both algorithms. Finally, in Section 6 we discuss our findings and highlight directions of future research.
2 Hyperparameter Optimization
We focus on training procedures based on the optimization of an objective function $J(w)$ with respect to $w$ (e.g. the regularized average training loss for a neural network with weights $w$). We see the training procedure by stochastic gradient descent (or one of its variants like momentum, RMSProp, Adam, etc.) as a dynamical system with a state $s_t \in \mathbb{R}^d$ that collects weights and possibly accessory variables such as velocities and accumulated squared gradients. The dynamics are defined by the system of equations
$$s_t = \Phi_t(s_{t-1}, \lambda), \qquad t = 1, \dots, T, \tag{1}$$
where $T$ is the number of iterations, $s_0$ contains initial weights and initial accessory variables, and, for every $t \in \{1, \dots, T\}$, $\Phi_t : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ is a smooth mapping that represents the operation performed by the $t$-th step of the optimization algorithm (i.e. on minibatch $t$). Finally, $\lambda \in \mathbb{R}^m$ is the vector of hyperparameters that we wish to tune.
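A simple example of these dynamics occurs when training a neural network by gradient descent with momentum (GDM), in which case $s_t = (v_t, w_t)$ and
$$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}), \qquad w_t = w_{t-1} - \eta \left( \mu v_{t-1} + \nabla J_t(w_{t-1}) \right), \tag{2}$$
where $J_t$ is the objective associated with the $t$-th minibatch, $\eta$ is the learning rate and $\mu$ is the momentum. In this example, $\lambda = (\mu, \eta)$.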
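As an illustration of the view of training as a dynamical system, the following minimal sketch writes one GDM step as an update map and unrolls it for a fixed number of iterations. The function names (`gdm_step`, `run_dynamics`) and the toy quadratic objective are assumptions made for this example only; they are not part of the paper's software package.

```python
import numpy as np

def gdm_step(state, lam, grad_fn):
    """One step s_t = Phi_t(s_{t-1}, lambda) of gradient descent with momentum.

    state   : tuple (v, w) of velocity and weights
    lam     : tuple (mu, eta) = (momentum, learning rate)
    grad_fn : callable returning the gradient of the minibatch objective J_t at w
    """
    v, w = state
    mu, eta = lam
    v_new = mu * v + grad_fn(w)
    w_new = w - eta * v_new
    return v_new, w_new

def run_dynamics(s0, lam, grad_fns):
    """Unroll the dynamics over all steps and return the final state s_T."""
    state = s0
    for grad_fn in grad_fns:
        state = gdm_step(state, lam, grad_fn)
    return state

# Hypothetical toy objective: J_t(w) = 0.5 * ||w - c||^2 for every t.
c = np.array([1.0, -2.0])
grad_fns = [lambda w: w - c] * 50
s0 = (np.zeros(2), np.zeros(2))
v_T, w_T = run_dynamics(s0, lam=(0.5, 0.1), grad_fns=grad_fns)
```

Here the state collects both the velocity and the weights, so its dimension is twice the number of weights, and the final iterate depends on the hyperparameters only through the unrolled updates.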
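Note that the iterates $s_1, \dots, s_T$ implicitly depend on the vector of hyperparameters $\lambda$. Our goal is to optimize the hyperparameters according to a certain error function $E$ evaluated at the last iterate $s_T$. Specifically, we wish to solve the problem
$$\min_{\lambda \in \Lambda} f(\lambda), \tag{3}$$
where the set $\Lambda \subset \mathbb{R}^m$ incorporates constraints on the hyperparameters, and the response function $f : \mathbb{R}^m \to \mathbb{R}$ is defined at $\lambda \in \mathbb{R}^m$ as
$$f(\lambda) = E(s_T(\lambda)). \tag{4}$$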
We highlight the generality of the framework. The vector of hyperparameters $\lambda$ may include components associated with the training objective, and components associated with the iterative algorithm. For example, the training objective may depend on hyperparameters used to design the loss function as well as multiple regularization parameters. Yet other components of $\lambda$ may be associated with the space of functions used to fit the training objective (e.g. number of layers and weights of a neural network, parameters associated with the kernel function used within a kernel-based method, etc.). The validation error $E$ can in turn be of different kinds. The simplest example is to choose $E$ as the average of a loss function over a validation set. We may however consider multiple validation objectives, in that the hyperparameters associated with the iterative algorithm ($\mu$ and $\eta$ in the case of momentum mentioned above) may be optimized using the training set, whereas the regularization parameters would typically require a validation set, which is distinct from the training set (in order to avoid overfitting).
3 Hypergradient Computation
In this section, we review the reverse-mode computation of the gradient of the response function (or hypergradient) under a Lagrangian perspective and introduce a forward-mode strategy. These procedures correspond to the reverse-mode and the forward-mode algorithmic differentiation schemes (Griewank & Walther, 2008). We finally introduce a real-time version of the forward-mode procedure.
3.1 Reverse-Mode
The reverse-mode computation leads to an algorithm closely related to the one presented in (Maclaurin et al., 2015). A major difference with respect to their work is that we do not require the mappings $\Phi_t$ defined in Eq. (1) to be invertible. We also note that the reverse-mode calculation is structurally identical to back-propagation through time (Werbos, 1990).
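We start by reformulating problem (3) as the constrained optimization problem
$$\min_{\lambda, s_1, \dots, s_T} E(s_T) \quad \text{subject to} \quad s_t = \Phi_t(s_{t-1}, \lambda), \quad t \in \{1, \dots, T\}. \tag{5}$$
This formulation closely follows a classical Lagrangian approach used to derive the back-propagation algorithm (LeCun, 1988). Furthermore, the framework naturally allows one to incorporate constraints on the hyperparameters.
The Lagrangian of problem (5) is
$$\mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^{T} \alpha_t \left( \Phi_t(s_{t-1}, \lambda) - s_t \right), \tag{6}$$
where, for each $t \in \{1, \dots, T\}$, $\alpha_t \in \mathbb{R}^d$ is a row vector of Lagrange multipliers associated with the $t$-th step of the dynamics.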
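The partial derivatives of the Lagrangian are given by
$$\frac{\partial \mathcal{L}}{\partial \alpha_t} = \Phi_t(s_{t-1}, \lambda) - s_t, \qquad t \in \{1, \dots, T\}, \tag{7}$$
$$\frac{\partial \mathcal{L}}{\partial s_t} = \alpha_{t+1} A_{t+1} - \alpha_t, \qquad t \in \{1, \dots, T-1\}, \tag{8}$$
$$\frac{\partial \mathcal{L}}{\partial s_T} = \nabla E(s_T) - \alpha_T, \tag{9}$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{t=1}^{T} \alpha_t B_t, \tag{10}$$
where for every $t \in \{1, \dots, T\}$, we define the matrices
$$A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}, \qquad B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}. \tag{11}$$
Note that $A_t \in \mathbb{R}^{d \times d}$ and $B_t \in \mathbb{R}^{d \times m}$.
The optimality conditions are then obtained by setting each derivative to zero. In particular, setting the right hand side of Equations (8) and (9) to zero gives
$$\alpha_t = \begin{cases} \nabla E(s_T) & \text{if } t = T, \\ \nabla E(s_T) A_T \cdots A_{t+1} & \text{if } t \in \{1, \dots, T-1\}. \end{cases}$$
Combining these equations with Eq. (10) we obtain that
$$\frac{\partial \mathcal{L}}{\partial \lambda} = \nabla E(s_T) \sum_{t=1}^{T} \left( A_T \cdots A_{t+1} \right) B_t.$$
As we shall see, this coincides with the expression for the gradient of $f$ in Eq. (15) derived in the next section. Pseudocode of Reverse-HG is presented in Algorithm 1.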
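The reverse-mode recursion can be sketched in a few lines of NumPy. The example below is only an illustration under stated assumptions, not the paper's Algorithm 1 verbatim: the dynamics are plain gradient descent on a synthetic ridge-regression objective, the hyperparameters are $\lambda = (\eta, \text{reg})$ (learning rate and L2 coefficient), the Jacobians $A_t$ and $B_t$ are written by hand, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)      # synthetic training data
Xv, yv = rng.normal(size=(10, 3)), rng.normal(size=10)    # synthetic validation data
H, b = X.T @ X / len(y), X.T @ y / len(y)

def phi(w, lam):
    """Update map Phi(s, lambda): one gradient step on the ridge objective."""
    eta, reg = lam
    return w - eta * (H @ w - b + reg * w)

def jacobians(w, lam):
    """A = dPhi/ds (d x d) and B = dPhi/dlambda (d x m), evaluated at (w, lam)."""
    eta, reg = lam
    A = np.eye(len(w)) - eta * (H + reg * np.eye(len(w)))
    B = np.stack([-(H @ w - b + reg * w), -eta * w], axis=1)
    return A, B

def val_error(w):
    return 0.5 * np.mean((Xv @ w - yv) ** 2)

def val_grad(w):
    return Xv.T @ (Xv @ w - yv) / len(yv)

def response(lam, w0, T):
    """Response function f(lambda) = E(s_T(lambda))."""
    w = w0
    for _ in range(T):
        w = phi(w, lam)
    return val_error(w)

def reverse_hg(lam, w0, T):
    # Forward pass: the whole trajectory s_0, ..., s_T must be stored.
    traj = [w0]
    for _ in range(T):
        traj.append(phi(traj[-1], lam))
    # Backward pass: alpha_T = grad E(s_T), then alpha_{t-1} = alpha_t A_t.
    alpha = val_grad(traj[-1])
    hypergrad = np.zeros(len(lam))
    for t in range(T, 0, -1):
        A, B = jacobians(traj[t - 1], lam)
        hypergrad += alpha @ B      # accumulate alpha_t B_t, as in Eq. (10)
        alpha = alpha @ A
    return hypergrad

lam = np.array([0.1, 0.01])
g = reverse_hg(lam, np.zeros(3), T=30)
```

The stored list `traj` makes the memory cost of the reverse pass explicit: it grows linearly with the number of iterations.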
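3.2 Forward-Mode
The second approach to compute the hypergradient appeals to the chain rule for the derivative of composite functions, to obtain that the gradient of $f$ at $\lambda$ satisfies (recall that the gradient of a scalar function is a row vector)
$$\nabla f(\lambda) = \nabla E(s_T) \frac{d s_T}{d \lambda}, \tag{12}$$
where $\frac{d s_T}{d \lambda}$ is the $d \times m$ matrix formed by the total derivative of the components of $s_T$ (regarded as rows) with respect to the components of $\lambda$ (regarded as columns).
Recall that $s_t = \Phi_t(s_{t-1}, \lambda)$. The operators $\Phi_t$ depend on the hyperparameter $\lambda$ both directly by its expression and indirectly through the state $s_{t-1}$. Using again the chain rule we have, for every $t \in \{1, \dots, T\}$, that
$$\frac{d s_t}{d \lambda} = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}} \frac{d s_{t-1}}{d \lambda} + \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}. \tag{13}$$
Defining $Z_t = \frac{d s_t}{d \lambda}$ for every $t \in \{1, \dots, T\}$ and recalling Eq. (11), we can rewrite Eq. (13) as the recursion
$$Z_t = A_t Z_{t-1} + B_t, \qquad t \in \{1, \dots, T\}. \tag{14}$$
Using Eq. (14), with $Z_0 = 0$ when the initial state $s_0$ does not depend on $\lambda$, we obtain that
$$\nabla f(\lambda) = \nabla E(s_T) Z_T = \nabla E(s_T) \left( A_T Z_{T-1} + B_T \right) = \dots = \nabla E(s_T) \sum_{t=1}^{T} \left( A_T \cdots A_{t+1} \right) B_t. \tag{15}$$
Note that the recurrence (14) on the Jacobian matrix is structurally identical to the recurrence in the RTRL procedure described in (Williams & Zipser, 1989, eq. (2.10)).
From the above derivation it is apparent that $\nabla f(\lambda)$ can be computed by an iterative algorithm which runs in parallel to the training algorithm. Pseudocode of Forward-HG is presented in Algorithm 2. At first sight, the computation of the terms in the right hand side of Eq. (14) seems prohibitive. However, in Section 4 we observe that if $m$ is much smaller than $d$, the computation can be done efficiently.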
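The forward recursion of Eq. (14) can be sketched as follows for the GDM dynamics of Eq. (2) with $\lambda = (\mu, \eta)$. This is a minimal illustration under stated assumptions rather than the paper's Algorithm 2 verbatim: the training objective is a synthetic quadratic, the Jacobians are written out by hand, $Z_0 = 0$ is assumed, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3                                                      # number of weights
X, y = rng.normal(size=(20, k)), rng.normal(size=20)       # synthetic training data
Xv, yv = rng.normal(size=(10, k)), rng.normal(size=10)     # synthetic validation data
H, b = X.T @ X / len(y), X.T @ y / len(y)
I = np.eye(k)

def phi(s, lam):
    """GDM update on the state s = (v, w), with lam = (mu, eta)."""
    v, w = s[:k], s[k:]
    mu, eta = lam
    v_new = mu * v + (H @ w - b)
    return np.concatenate([v_new, w - eta * v_new])

def jacobians(s, lam):
    """A_t (2k x 2k) and B_t (2k x 2) of the GDM update, evaluated at (s, lam)."""
    v, w = s[:k], s[k:]
    mu, eta = lam
    v_new = mu * v + (H @ w - b)
    A = np.block([[mu * I, H], [-eta * mu * I, I - eta * H]])
    B = np.block([[v[:, None], np.zeros((k, 1))],
                  [-eta * v[:, None], -v_new[:, None]]])
    return A, B

def val_error(s):
    return 0.5 * np.mean((Xv @ s[k:] - yv) ** 2)

def val_grad(s):
    # The validation error depends on the weights only, not on the velocity.
    return np.concatenate([np.zeros(k), Xv.T @ (Xv @ s[k:] - yv) / len(yv)])

def forward_hg(lam, s0, T):
    s, Z = s0, np.zeros((len(s0), len(lam)))   # Z_0 = 0: s_0 does not depend on lambda
    for _ in range(T):
        A, B = jacobians(s, lam)
        Z = A @ Z + B                          # Eq. (14)
        s = phi(s, lam)
    return val_grad(s) @ Z, s                  # Eq. (12)

lam = np.array([0.5, 0.1])
hypergrad, s_T = forward_hg(lam, np.zeros(2 * k), T=30)
```

Only the current state and the matrix `Z` are kept in memory, so the storage does not grow with the number of iterations.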
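3.3 Real-Time Forward-Mode
For every $t \in \{1, \dots, T\}$ let $f_t : \mathbb{R}^m \to \mathbb{R}$ be the response function at time $t$: $f_t(\lambda) = E(s_t(\lambda))$. Note that $f_T$ coincides with the definition of the response function in Eq. (4). A major difference between Reverse-HG and Forward-HG is that the partial hypergradients
$$\nabla f_t(\lambda) = \frac{d E(s_t)}{d \lambda} = \nabla E(s_t) Z_t \tag{16}$$
are available in the second procedure at each time step $t$ and not only at the end.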
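Because the partial hypergradient of Eq. (16) is available at every step, the hyperparameters can be adjusted while training is still running. The sketch below illustrates this for a single hyperparameter, the learning rate $\eta$ of plain gradient descent on a synthetic least-squares problem. It is an assumption-laden illustration: the hyper-update is a projected gradient step rather than the Adam update used in the experiments, the update period `delta` and step size `hyper_lr` are arbitrary choices, and `Z` is simply carried over after each change of $\eta$.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([1.0, -1.0, 0.5, 0.0])                   # hypothetical ground truth
X = rng.normal(size=(200, 4)); y = X @ w_true + 0.1 * rng.normal(size=200)
Xv = rng.normal(size=(100, 4)); yv = Xv @ w_true + 0.1 * rng.normal(size=100)
H, b = X.T @ X / len(y), X.T @ y / len(y)

def val_error(w):
    return 0.5 * np.mean((Xv @ w - yv) ** 2)

def val_grad(w):
    return Xv.T @ (Xv @ w - yv) / len(yv)

def rtho(eta0, w0, T, delta, hyper_lr):
    """Real-time tuning of a single learning rate eta with forward-mode hypergradients."""
    eta, w = eta0, w0
    Z = np.zeros_like(w0)                  # Z_t = d s_t / d eta (d x 1, stored flat)
    for t in range(1, T + 1):
        grad = H @ w - b                   # training gradient at s_{t-1}
        Z = (Z - eta * (H @ Z)) - grad     # Z_t = A_t Z_{t-1} + B_t
        w = w - eta * grad                 # s_t = Phi_t(s_{t-1}, eta)
        if t % delta == 0:
            partial_hg = val_grad(w) @ Z   # partial hypergradient, Eq. (16)
            eta = max(eta - hyper_lr * partial_hg, 0.0)   # projection keeps eta >= 0
    return eta, w

eta, w = rtho(eta0=0.0, w0=np.zeros(4), T=200, delta=5, hyper_lr=0.005)
```

Starting from a learning rate of zero, the first partial hypergradients push $\eta$ upward, after which ordinary training proceeds with the adapted value.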
The availability of partial hypergradients is significant since we are allowed to update hyperparameters several times in a single optimization epoch, without having to wait until time $T$. This is reminiscent of the real-time updates suggested by Williams & Zipser (1989) for RTRL. The real-time approach may be suitable in the case of a data stream (i.e. $T = \infty$), where Reverse-HG would be hardly applicable. Even in the case of finite (but large) datasets it is possible to perform one hyperparameter update after a hyperbatch of data (i.e. a set of minibatches) has been processed. Algorithm 2 can be easily modified to yield a partial hypergradient when $t \bmod \Delta = 0$ (for some hyperbatch size $\Delta$) and letting $t$ run from $1$ to $\infty$, reusing examples in a circular or random way. We use this strategy in the phone recognition experiment reported in Section 5.3.
4 Complexity Analysis
We discuss the time and space complexity of Algorithms 1 and 2. We begin by recalling some basic results from the algorithmic differentiation (AD) literature.
Let $F : \mathbb{R}^n \to \mathbb{R}^p$ be a differentiable function and suppose it can be evaluated in time $c(n, p)$ and requires space $s(n, p)$. Denote by $J_F$ the $p \times n$ Jacobian matrix of $F$. Then the following facts hold true (Griewank & Walther, 2008) (see also Baydin et al. (2015) for a shorter account):
(i) For any vector $r \in \mathbb{R}^n$, the product $J_F r$ can be evaluated in time $O(c(n, p))$ and requires space $O(s(n, p))$ using forward-mode AD.
(ii) For any vector $q \in \mathbb{R}^p$, the product $J_F^\top q$ has time and space complexities $O(c(n, p))$ using reverse-mode AD.
(iii) As a corollary of item (i), the whole $J_F$ can be computed in time $O(n \, c(n, p))$ and requires space $O(s(n, p))$ using forward-mode AD (just use unitary vectors $r = e_i$ for $i = 1, \dots, n$).
(iv) Similarly, $J_F$ can be computed in time $O(p \, c(n, p))$ and requires space $O(c(n, p))$ using reverse-mode AD.
Let $g(d, m)$ and $h(d, m)$ denote time and space, respectively, required to evaluate the update map $\Phi_t$ defined by Eq. (1). Then the response function $f : \mathbb{R}^m \to \mathbb{R}$ defined in Eq. (4) can be evaluated in time $O(T g(d, m))$ (assuming the time required to compute the validation error $E$ does not affect the bound; this is indeed realistic since the number of validation examples is typically lower than the number of training iterations) and requires space $O(h(d, m))$ since variables may be overwritten at each iteration. Then, a direct application of Fact (i) above shows that Algorithm 2 runs in time $O(T m \, g(d, m))$ and space $O(m \, h(d, m))$. The same results can also be obtained by noting that in Algorithm 2 the product $A_t Z_{t-1}$ requires $m$ Jacobian-vector products, each costing $O(g(d, m))$ (from Fact (i)), while computing the Jacobian $B_t$ takes time $O(m \, g(d, m))$ (from Fact (iii)).
Similarly, a direct application of Fact (ii) shows that Algorithm 1 has both time and space complexities $O(T g(d, m))$. Again the same results can be obtained by noting that $\alpha_{t+1} A_{t+1}$ and $\alpha_t B_t$ are transposed-Jacobian-vector products that in reverse-mode take both time $O(g(d, m))$ (from Fact (ii)). Unfortunately in this case variables cannot be overwritten, explaining the much higher space requirement.
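To make the difference in space requirements concrete, the following back-of-the-envelope sketch counts the floating point numbers that each procedure must keep in memory. The sizes are hypothetical, and the count ignores constants and the temporary storage needed to evaluate a single update.

```python
def reverse_hg_floats(T, d):
    """Reverse mode keeps the whole trajectory s_0, ..., s_{T-1}."""
    return T * d

def forward_hg_floats(d, m):
    """Forward mode keeps only the current state s_t and the d x m matrix Z_t."""
    return d + d * m

k, T = 1_000_000, 10_000        # hypothetical number of weights and of iterations
d, m = 2 * k, 2                 # GDM state (weights and velocities), two hyperparameters
ratio = reverse_hg_floats(T, d) / forward_hg_floats(d, m)
```

With these sizes the reverse pass stores $T \cdot d$ numbers against $d (m + 1)$ for the forward pass, a ratio of $T / (m + 1)$.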
As an example, consider training a neural network with $k$ weights (this includes linear SVM and logistic regression as special cases), using classic iterative optimization algorithms such as SGD (possibly with momentum) or Adam, where the hyperparameters are just learning rate and momentum terms. In this case, $d = O(k)$ and $m = O(1)$. Moreover, $g(d, m)$ and $h(d, m)$ are both $O(k)$. As a result, Algorithm 1 runs in time and space $O(T k)$, while Algorithm 2 runs in time $O(T k)$ and space $O(k)$, which would typically make a dramatic difference in terms of memory requirements.
5 Experiments
In this section, we present numerical simulations with the proposed methods. All algorithms were implemented in TensorFlow and the software package used to reproduce our experiments is available at https://github.com/lucfra/RFHO (a newer version of the package is available at https://github.com/lucfra/FARHO). In all the experiments, hypergradients were used inside the Adam algorithm (Kingma & Ba, 2014) in order to minimize the response function.
5.1 Data Hypercleaning
The goal of this experiment is to highlight one potential advantage of constraints on the hyperparameters. Suppose we have a dataset with label noise and, due to time or resource constraints, we can only afford to clean up (by checking and correcting the labels) a subset of the available data. Then we may use the cleaned data as the validation set, the rest as the training set, and assign one hyperparameter to each training example. By putting a sparsity constraint on the vector of hyperparameters $\lambda$, we hope to bring to zero the influence of noisy examples, in order to generate a better model. While this is the same kind of data sparsity observed in support vector machines (SVM), our setting aims to get rid of erroneously labeled examples, in contrast to SVM which puts zero weight on redundant examples. Although this experimental setup does not necessarily reflect a realistic scenario, it aims to test the ability of our HO method to effectively make use of constraints on the hyperparameters (we note that a related approach based on reinforcement learning is presented in (Fan et al., 2017)).
We instantiated the above setting with a balanced subset of examples from the MNIST dataset, split into three subsets: a training set $D_{tr}$, a validation set $V$ and a test set containing the remaining samples. Finally, we corrupted the labels of a random subset $D_f \subset D_{tr}$ of the training examples.
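We considered a plain softmax regression model with parameters $W$ (weights) and $b$ (bias). The error of a model $(W, b)$ on an example $x$ was evaluated by using the cross-entropy $\ell(W, b, x)$ both in the training objective function, $E_{tr}$, and in the validation one, $E_{val}$. We added in $E_{tr}$ a hyperparameter vector $\lambda \in [0, 1]^{N_{tr}}$ (with $N_{tr}$ the number of training examples) that weights each example in the training phase, i.e. $E_{tr}(W, b) = \frac{1}{N_{tr}} \sum_{i \in D_{tr}} \lambda_i \ell(W, b, x_i)$.
According to the general HO framework, we fit the parameters $(W, b)$ to minimize the training loss and the hyperparameters $\lambda$ to minimize the validation error. The sparsity constraint was implemented by bounding the $L_1$-norm of $\lambda$, resulting in the optimization problem
$$\min_{\lambda \in \Lambda} E_{val}(W_T, b_T),$$
where $\Lambda = \{ \lambda : \lambda \in [0, 1]^{N_{tr}}, \|\lambda\|_1 \le R \}$ and $W_T$ and $b_T$ are the parameters obtained after $T$ iterations of gradient descent on the training objective. Given the high dimensionality of $\lambda$, we solved this problem iteratively, computing the hypergradients with the Reverse-HG method and projecting Adam updates on the set $\Lambda$.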
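The projection step is not spelled out in the text. One possible way to compute the Euclidean projection onto a set of the form $\{\lambda \in [0, 1]^N : \|\lambda\|_1 \le R\}$ is sketched below. It relies on the fact that the projection has the form $\mathrm{clip}(v - \tau, 0, 1)$ for some shift $\tau \ge 0$, which is found here by bisection. The function name and the bisection tolerance are assumptions of this sketch, and this is not necessarily the procedure used in the experiments.

```python
import numpy as np

def project_box_l1(v, radius, tol=1e-10):
    """Euclidean projection of v onto {x : 0 <= x <= 1, sum(x) <= radius}."""
    x = np.clip(v, 0.0, 1.0)
    if x.sum() <= radius:
        return x                      # the box projection is already feasible
    # Otherwise find the shift tau > 0 with sum(clip(v - tau, 0, 1)) = radius.
    lo, hi = 0.0, float(np.max(v))    # at tau = max(v) the clipped sum is zero
    while hi - lo > tol:
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, 1.0).sum() > radius:
            lo = tau
        else:
            hi = tau
    return np.clip(v - hi, 0.0, 1.0)  # hi always yields a feasible point

lam = project_box_l1(np.array([1.5, 0.8, 0.3, -0.2]), radius=1.5)
```

Since the entries of the projected vector are nonnegative, their sum equals the $L_1$-norm, so the returned point satisfies both constraints.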
We are interested in comparing the following three test set accuracies:
Oracle: the accuracy of the minimizer of $E_{tr}$ trained on clean examples only; this setting is effectively taking advantage of an oracle that tells which examples have a wrong label;
Baseline: the accuracy of the minimizer of $E_{tr}$ trained on all available data;
DH-R: the accuracy of the data hypercleaner with a given value of the $L_1$ radius, $R$. In this case, we first optimized the hyperparameters and then constructed a cleaned training set (keeping examples with $\lambda_i > 0$); we finally trained on the cleaned data.
Table 1
Method | Accuracy % | F1
Oracle | 90.46 | 1.0000
Baseline | 87.74 | -
DH-R | 90.07 | 0.9137
DH-R | 90.06 | 0.9244
DH-R | 90.00 | 0.9211
DH-R | 90.09 | 0.9217
We are also interested in evaluating the ability of the hypercleaner to detect noisy samples. Results are shown in Table 1. The data hypercleaner is robust with respect to the choice of $R$ and is able to identify corrupted examples, recovering a model that has almost the same accuracy as a model produced with the help of an oracle.
Figure 1 shows how the accuracy of the data hypercleaner improves with the number of hyper-iterations and the progression of the amount of discarded examples. The data hypercleaner starts by discarding mainly corrupted examples, and while the optimization proceeds, it begins to remove also a portion of the clean ones. Interestingly, the test set accuracy continues to improve even when some of the clean examples are discarded.
5.2 Learning Task Interactions
This second set of experiments is in the multitask learning (MTL) context, where the goal is to simultaneously learn models for multiple related tasks. Many MTL methods require that a task interaction matrix is given as input to the learning algorithm. However, in real applications, this matrix is often unknown and it is interesting to learn it from data. Below, we show that our framework can be naturally applied to learning the task relatedness matrix.
We used CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), two object recognition datasets with 10 and 100 classes, respectively. As features we employed the pre-activation of the second last layer of the Inception-V3 model trained on ImageNet (available at tinyurl.com/h2x8wws). From CIFAR-10, we extracted a small training set and a disjoint validation set, and used the remaining examples for testing. From CIFAR-100, we likewise selected a training set and a validation set, and used the remaining examples for testing. Finally, we used a one-hot encoder of the labels, obtaining a set of labels in $\{0, 1\}^K$ ($K = 10$ or $K = 100$).
The choice of small training set sizes is due to the strong discriminative power of the selected features. In fact, using larger sample sizes would not allow one to appreciate the advantage of MTL. In order to leverage information among the different classes, we employed a multitask learning (MTL) regularizer (Evgeniou et al., 2005)
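$$\Omega_{C, \rho}(W) = \sum_{j, k = 1}^{K} C_{j,k} \|w_j - w_k\|_2^2 + \rho \sum_{k=1}^{K} \|w_k\|^2,$$
where $w_k$ are the weights for class $k$, $K$ is the number of classes, and the symmetric nonnegative matrix $C$ models the interactions between the classes/tasks. We used a regularized training error defined as $E_{tr}(W) = \sum_{i \in D_{tr}} \ell(W x_i + b, y_i) + \Omega_{C, \rho}(W)$, where $\ell(\cdot, \cdot)$ is the categorical cross-entropy and $b = (b_1, \dots, b_K)$ is the vector of thresholds associated with each linear model. We wish to solve the following optimization problem:
$$\min \left\{ E_{val}(W_T, b_T) \ \text{subject to} \ \rho \ge 0, \ C = C^\top, \ C \ge 0 \right\},$$
where $(W_T, b_T)$ is the $T$-th iteration obtained by running gradient descent with momentum (GDM) on the training objective. We solve this problem using Reverse-HG and optimizing the hyperparameters by projecting Adam updates on the set $\{ (\rho, C) : \rho \ge 0, C = C^\top, C \ge 0 \}$. We compare the following methods: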
STL: single task learning, i.e. $C = 0$, using a validation set to tune the optimal value of $\rho$ for each task;
NMTL: we considered the naive MTL scenario in which the tasks are equally related, that is $C_{j,k} = a$ for every $1 \le j, k \le K$. In this case we learn the two nonnegative hyperparameters $a$ and $\rho$;
HMTL: our hyperparameter optimization method Reverse-HG to tune $C$ and $\rho$;
HMTL-S: learning the matrix $C$ with only few examples per class could lead to the discovery of spurious relationships. We try to remove this effect by imposing the constraint that $\sum_{j,k} C_{j,k} \le R$ for a fixed bound $R$ (we observed that a different value of the bound yielded very similar results). In this case, Adam updates are projected onto the set $\{ (\rho, C) : \rho \ge 0, C = C^\top, C \ge 0, \sum_{j,k} C_{j,k} \le R \}$.
Table 2
Method | CIFAR-10 | CIFAR-100
STL | 67.47 | 18.99
NMTL | 69.41 | 19.19
HMTL | 70.85 | 21.15
HMTL-S | 71.62 | 22.09
Results of five repetitions with different splits are presented in Table 2. Note that HMTL gives a visible improvement in performance, and adding the constraint on the sum of the entries of $C$ further improves performance on both datasets. The matrix $C$ can be interpreted as an adjacency matrix of a graph, highlighting the relationships between the classes. Figure 2 depicts the graph for CIFAR-10 extracted from the algorithm HMTL-S. Although this result is strongly influenced by the choice of the data representations, we can note that animals tend to be more related to themselves than to vehicles and vice versa.
5.3 Phone Classification
The aim of the third set of experiments is to assess the efficacy of the real-time Forward-HG algorithm (RTHO). We run experiments on phone recognition in the multitask framework proposed in (Badino, 2016, and references therein). Data for all experiments was obtained from the TIMIT phonetic recognition dataset (Garofolo et al., 1993). The dataset contains 5040 sentences corresponding to around 1.5 million speech acoustic frames. Training, validation and test sets contain respectively 73%, 23% and 4% of the data. The primary task is a frame-level phone state classification with 183 classes and it consists in learning a mapping $f_P$ from acoustic speech vectors to hidden Markov model monophone states. Each 25 ms speech frame is represented by a 123-dimensional vector containing 40 Mel frequency scale cepstral coefficients and energy, augmented with their deltas and delta-deltas. We used a window of eleven frames centered around the prediction target to create the 1353-dimensional input to $f_P$. The secondary (or auxiliary) task consists in learning a mapping $f_S$ from acoustic vectors to 300-dimensional real vectors of context-dependent phonetic embeddings defined in (Badino, 2016).
Table 3
Method | Accuracy | Time (min)
Vanilla | 59.81 | 12
RS | |
RTHO | 61.97 | 164
RTHO-NT | 61.38 | 289
As in previous work, we assume that the two mappings $f_P$ and $f_S$ share inputs and an intermediate representation, obtained by four layers of a feed-forward neural network with 2000 units on each layer. We denote by $W$ the parameter vector of these four shared layers. The network has two different output layers with parameter vectors $W^P$ and $W^S$, each relative to the primary and secondary task. The network is trained to jointly minimize $E_{tot}(W, W^P, W^S) = E_P(W, W^P) + \rho E_S(W, W^S)$, where the primary error $E_P$ is the average cross-entropy loss on the primary task, the secondary error $E_S$ is given by mean squared error on the embedding vectors and $\rho \ge 0$ is a design hyperparameter. Since we are ultimately interested in learning $f_P$, we formulate the hyperparameter optimization problem as
$$\min \left\{ E(W_T, W^P_T) \ \text{subject to} \ \rho, \eta \ge 0, \ 0 \le \mu \le 1 \right\},$$
where $E$ is the cross-entropy loss computed on a validation set after $T$ iterations of stochastic GDM, and $\eta$ and $\mu$ are defined in (2). In all the experiments we fix a minibatch size of 500. We compare the following methods:
Vanilla: the secondary target is ignored ($\rho = 0$); $\eta$ and $\mu$ are set to 0.075 and 0.5 respectively, as in (Badino, 2016).
RS: random search (Bergstra & Bengio, 2012) over $\rho$, $\eta$ and $\mu$, with $\eta$ drawn from an exponential distribution with scale parameter 0.1.
RTHO: real-time hyperparameter optimization with initial learning rate and momentum factor as in Vanilla and initial $\rho$ set to 1.6 (best value obtained by grid search in Badino (2016)).
RTHO-NT: RTHO with "null teacher", i.e. when the initial values of $\rho$, $\eta$ and $\mu$ are set to 0. We regard this experiment as particularly interesting: this initial setting, while clearly not optimal, does not require any background knowledge on the task at hand.
We also tried to run Forward-HG for a fixed number of epochs, not in real-time mode. Results are not reported since the method could not make any appreciable progress after running 24 hours on a Titan X GPU.
Test accuracies and execution times are reported in Table 3. Figure 3 shows learning curves and hyperparameter evolutions for RTHO-NT. In Experiments 1 and 2 we employ a standard early stopping procedure on the validation accuracy, while in Experiments 3 and 4 a natural stopping time is given by the decay to 0 of the learning rate (see Figure 3, bottom-left plot). In Experiments 3 and 4 we used a fixed hyperbatch size $\Delta$ (see Eq. (16)) and a hyper-learning rate of 0.005.
The best results in Table 3 are very similar to those obtained in state-of-the-art recognizers using multitask learning (Badino, 2016, 2017). In spite of the small number of hyperparameters, random search yields results only slightly better than the vanilla network (the results reported in Table 3 are an average over 5 trials, with a minimum and maximum accuracy of 59.93 and 60.86, respectively). Within the same time budget of 300 minutes, RTHO-NT is able to find hyperparameters yielding a substantial improvement over the vanilla version, thus effectively exploiting the auxiliary task. Note that the model trained has more than $15 \times 10^6$ parameters (as follows from the layer sizes given above), for a corresponding state of more than $30 \times 10^6$ variables once the momentum velocities are included. To the best of our knowledge, reverse-mode (Maclaurin et al., 2015) or approximate (Pedregosa, 2016) methods have not been applied to models of this size.
6 Discussion
We studied two alternative strategies for computing the hypergradients of any iterative differentiable learning dynamics. Previous work has mainly focused on the reverse-mode computation, attempting to deal with its space complexity, which becomes prohibitive for very large models such as deep networks.
Our first contribution is the definition and the application of forward-mode computation to HO. Our analysis suggests that for large models the forward-mode computation may be a preferable alternative to reverse-mode if the number of hyperparameters is small. Additionally, forward-mode is amenable to real-time hyperparameter updates, which we showed to be an effective strategy for large datasets (see Section 5.3). We showed experimentally that even starting from a far-from-optimal value of the hyperparameters (the null teacher), our RTHO algorithm finds good values at a reasonable cost, whereas other gradient-based algorithms could not be applied in this context.
Our second contribution is the Lagrangian derivation of the reverse-mode computation. It provides a general framework to tackle hyperparameter optimization problems involving a wide class of response functions, including those that take into account the whole parameter optimization dynamics. We have also presented in Sections 5.1 and 5.2 two non-standard learning problems where we specifically take advantage of a constrained formulation of the HO problem.
We close by highlighting some potential extensions of our framework and directions for future research. First, the relatively low cost of our RTHO algorithm suggests that it could become a standard tool for the optimization of real-valued critical hyperparameters (such as learning rates, regularization factors and error function design coefficients), in contexts where no previous or expert knowledge is available (e.g. novel domains). Yet, RTHO must be thoroughly validated on diverse datasets and with different models and settings to empirically assess its robustness and its ability to find good hyperparameter values. Second, in order to perform gradient-based hyperparameter optimization, it is necessary to set a descent procedure over the hyperparameters. In our experiments we have always used Adam with a manually adjusted value for the hyper-learning rate. Devising procedures which are adaptive in these hyper-hyperparameters is an important direction of future research. Third, extensions of gradient-based HO techniques to integer or nominal hyperparameters (such as the depth and the width of a neural network) require additional design efforts and may not arise naturally in our framework. Future research should instead focus on the integration of gradient-based algorithms with Bayesian optimization and/or with emerging reinforcement learning hyperparameter optimization approaches (Zoph & Le, 2016). A final important problem is to study the convergence properties of RTHO. Results in Pedregosa (2016) may prove useful in this direction.
References
 Badino (2016) Badino, Leonardo. Phonetic context embeddings for DNN-HMM phone recognition. In Proceedings of Interspeech, pp. 405–409, 2016.
 Badino (2017) Badino, Leonardo. Personal communication, 2017.
 Baydin et al. (2015) Baydin, Atilim Gunes, Pearlmutter, Barak A., Radul, Alexey Andreyevich, and Siskind, Jeffrey Mark. Automatic differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767, 2015.
 Bengio (2000) Bengio, Yoshua. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
 Bergstra & Bengio (2012) Bergstra, James and Bengio, Yoshua. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
 Bergstra et al. (2013) Bergstra, James, Yamins, Daniel, and Cox, David D. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. ICML, 28:115–123, 2013.
 Bergstra et al. (2011) Bergstra, James S., Bardenet, Rémi, Bengio, Yoshua, and Kégl, Balázs. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, pp. 2546–2554, 2011.
 Dinuzzo et al. (2011) Dinuzzo, Francesco, Ong, Cheng S, Pillonetto, Gianluigi, and Gehler, Peter V. Learning output kernels with block coordinate descent. In ICML, pp. 49–56, 2011.
 Domke (2012) Domke, Justin. Generic methods for optimization-based modeling. In AISTATS, volume 22, pp. 318–326, 2012.
 Evgeniou et al. (2005) Evgeniou, Theodoros, Micchelli, Charles A, and Pontil, Massimiliano. Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6:615–637, 2005.
 Fan et al. (2017) Fan, Yang, Tian, Fei, Qin, Tao, Bian, Jiang, and Liu, Tie-Yan. Learning what data to learn. arXiv preprint arXiv:1702.08635, 2017.
 Garofolo et al. (1993) Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathon G., and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report, 93, 1993.
 Griewank & Walther (2008) Griewank, Andreas and Walther, Andrea. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, second edition, 2008.
 Hutter et al. (2011) Hutter, Frank, Hoos, Holger H., and Leyton-Brown, Kevin. Sequential model-based optimization for general algorithm configuration. In Int. Conf. on Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.
 Hutter et al. (2015) Hutter, Frank, Lücke, Jörg, and Schmidt-Thieme, Lars. Beyond Manual Tuning of Hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337, 2015.
 Jawanpuria et al. (2015) Jawanpuria, Pratik, Lapin, Maksim, Hein, Matthias, and Schiele, Bernt. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, pp. 1189–1197, 2015.
 Kang et al. (2011) Kang, Zhuoliang, Grauman, Kristen, and Sha, Fei. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 521–528, 2011.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
 Krizhevsky & Hinton (2009) Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.
 Larsen et al. (1996) Larsen, Jan, Hansen, Lars Kai, Svarer, Claus, and Ohlsson, M. Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing, pp. 62–71. IEEE, 1996.
 LeCun (1988) LeCun, Yann. A Theoretical Framework for Back-Propagation. In Hinton, Geoffrey and Sejnowski, Terrence (eds.), Proc. of the 1988 Connectionist models summer school, pp. 21–28. Morgan Kaufmann, 1988.
 Maclaurin et al. (2015) Maclaurin, Dougal, Duvenaud, David K, and Adams, Ryan P. Gradient-based hyperparameter optimization through reversible learning. In ICML, pp. 2113–2122, 2015.
 Pearlmutter (1995) Pearlmutter, Barak A. Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural networks, 6(5):1212–1228, 1995.
 Pedregosa (2016) Pedregosa, Fabian. Hyperparameter optimization with approximate gradient. In ICML, pp. 737–746, 2016.
 Snoek et al. (2012) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
 Snoek et al. (2015) Snoek, Jasper, Rippel, Oren, Swersky, Kevin, Kiros, Ryan, Satish, Nadathur, Sundaram, Narayanan, Patwary, Md Mostofa Ali, Prabhat, Mr, and Adams, Ryan P. Scalable Bayesian Optimization Using Deep Neural Networks. In ICML, pp. 2171–2180, 2015.
 Swersky et al. (2013) Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 2004–2012, 2013.
 Werbos (1990) Werbos, Paul J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 Williams & Zipser (1989) Williams, Ronald J. and Zipser, David. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
 Zoph & Le (2016) Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A Empirical Validation Of Complexity Analysis
To complement the complexity analysis in Section 4, we empirically study the running time per hyper-iteration and the space requirements of the Reverse-HG and Forward-HG algorithms. We trained three-layer feed-forward neural networks on the MNIST dataset with SGDM for a fixed number of iterations. In a first set of experiments (Figure 4, left), we fixed the number of weights at 199,210 and optimized the learning rate, the momentum factor, and a varying number of example weights in the training error, similar to the experiment in Section 5.1. As expected, the running time of Reverse-HG is essentially constant in the number of hyperparameters, while that of Forward-HG increases linearly. On the other hand, when the hyperparameters are fixed to the learning rate and the momentum factor, the space complexity of Reverse-HG grows linearly with the number of parameters (Figure 4, right), while that of Forward-HG remains constant.
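The trade-off described above can be illustrated with a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not the experimental code: a scalar weight, a quadratic training error, a quadratic validation error, a heavy-ball form of SGDM, and the function names `forward_hg` and `reverse_hg`. The forward variant carries a matrix with one column per hyperparameter and stores no trajectory, whereas the reverse variant stores all iterates and then sweeps backwards once.

```python
def step(v, w, eta, mu, a=2.0, c=1.0):
    """One SGDM step on the toy training error J(w) = 0.5 * a * (w - c)**2."""
    v_new = mu * v + a * (w - c)
    return v_new, w - eta * v_new

def jacobians(v, w, eta, mu, a=2.0, c=1.0):
    """A = d(step)/d(v, w) and B = d(step)/d(eta, mu), evaluated at the previous state."""
    v_new = mu * v + a * (w - c)
    A = [[mu, a], [-eta * mu, 1.0 - eta * a]]
    B = [[0.0, v], [-v_new, -eta * v]]
    return A, B

def val_grad(w, d=0.5):
    """Gradient of the toy validation error E(w) = 0.5 * (w - d)**2 w.r.t. (v, w)."""
    return [0.0, w - d]

def response(eta, mu, T, w0=0.0, d=0.5):
    """Validation error after T training steps (the response function)."""
    v, w = 0.0, w0
    for _ in range(T):
        v, w = step(v, w, eta, mu)
    return 0.5 * (w - d) ** 2

def forward_hg(eta, mu, T, w0=0.0):
    """Forward mode: propagate Z = d(state)/d(eta, mu); no trajectory is stored."""
    v, w = 0.0, w0
    Z = [[0.0, 0.0], [0.0, 0.0]]  # one column per hyperparameter
    for _ in range(T):
        A, B = jacobians(v, w, eta, mu)
        Z = [[A[i][0] * Z[0][j] + A[i][1] * Z[1][j] + B[i][j] for j in range(2)]
             for i in range(2)]
        v, w = step(v, w, eta, mu)
    g_e = val_grad(w)
    return [g_e[0] * Z[0][j] + g_e[1] * Z[1][j] for j in range(2)]

def reverse_hg(eta, mu, T, w0=0.0):
    """Reverse mode: store the whole trajectory, then back-propagate an adjoint."""
    traj = [(0.0, w0)]
    for _ in range(T):
        traj.append(step(*traj[-1], eta, mu))
    alpha = val_grad(traj[-1][1])  # adjoint of the final state
    g = [0.0, 0.0]
    for t in range(T, 0, -1):
        A, B = jacobians(*traj[t - 1], eta, mu)
        g = [g[j] + alpha[0] * B[0][j] + alpha[1] * B[1][j] for j in range(2)]
        alpha = [alpha[0] * A[0][j] + alpha[1] * A[1][j] for j in range(2)]
    return g

if __name__ == "__main__":
    print(forward_hg(0.1, 0.5, 10))
    print(reverse_hg(0.1, 0.5, 10))
```

Both functions should return the same hypergradient up to rounding. In this sketch the memory used by `reverse_hg` grows with the number of stored iterates, while the work done by `forward_hg` grows with the number of columns of `Z`.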
Appendix B Experiments
Learning Task Interactions
In Table 4, we report comparative results obtained with two state-of-the-art multi-task learning methods (Dinuzzo et al. (2011) and Jawanpuria et al. (2015)) on the CIFAR-10 dataset.
Method                    CIFAR-10
Dinuzzo et al. (2011)     69.96
Jawanpuria et al. (2015)  70.30
Jawanpuria et al. (2015)  70.96
HMTL-S                    71.62
Both methods improve over STL and NMTL but perform slightly worse than HMTL-S. These algorithms treat the task interaction matrix as a model parameter, which may lead to overfitting on such a small training set, further highlighting the advantage of treating it as a hyperparameter. Computation times are comparable, on the order of 23 hours.
Other approaches (e.g., Kang et al., 2011) tackle the same problem in a similar framework, but a complete analysis of MTL is beyond the scope of this paper.
Phone Classification
In this section, we present additional experimental results on the phone classification task discussed in Section 5.3. We considered sequential model-based optimization with Gaussian processes, using the Python package BayesianOptimization found at https://github.com/fmfn/BayesianOptimization/. We restricted each hyperparameter to a fixed definition interval, used expected improvement as the acquisition function, and initialized the method with 5 randomly chosen points. In Table 5 we report the validation accuracies at different times for this experiment (SMBO) as well as for those presented in Section 5.3.
Method   50 min  100 min  300 min  Final TA
RS       60.64   60.86    61.23    60.36
SMBO     60.83   60.83    61.39    60.91
RTHO-NT  56.51   60.91    62.11    61.38
RTHO     59.45   61.21    62.88    61.97
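The SMBO baseline above uses expected improvement as its acquisition function. As a rough illustration of what that criterion computes, here is a minimal sketch of the standard closed form for maximization under a Gaussian posterior. The function name `expected_improvement`, the exploration margin `xi`, and the candidate values are hypothetical choices for this sketch; it is not the implementation used by the BayesianOptimization package.

```python
import math

def expected_improvement(mean, std, best, xi=0.0):
    """Expected improvement over `best` when the posterior is N(mean, std**2).

    `xi` is an optional exploration margin (an assumption of this sketch).
    """
    gain = mean - best - xi
    if std <= 0.0:
        # Without predictive uncertainty, the improvement is deterministic.
        return max(gain, 0.0)
    z = gain / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + std * pdf

if __name__ == "__main__":
    # Hypothetical posterior (mean, std) at three candidate hyperparameter settings.
    candidates = [(0.60, 0.01), (0.58, 0.05), (0.61, 0.00)]
    best_so_far = 0.605
    scores = [expected_improvement(m, s, best_so_far) for m, s in candidates]
    print(max(range(len(scores)), key=scores.__getitem__))
```

The next point to evaluate is the candidate with the largest score. A point with a lower predicted mean can still be selected when its predictive uncertainty is large.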
Appendix C On learning rate initialization in RTHO-NT
In the RTHO-NT setting presented in Section 5.3, the hyperparameters are initially set to zero. While zero is, in general, far from an optimal point in the hyperparameter space, this choice has the advantage of not requiring any prior knowledge of the task at hand.
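Here we provide some insight into this particular initialization strategy. For simplicity, we consider the case in which the learning dynamics is stochastic gradient descent, that is, $w_t = \Phi_t(w_{t-1}, \eta) = w_{t-1} - \eta \nabla J_t(w_{t-1})$, where $\eta$ is the learning rate and $J_t$ is the training error on the $t$-th mini-batch. In the following, we observe that the first update of the learning rate performed by RTHO-NT is proportional to the scalar product between the gradient of the validation error at $w_0$ and the average of the stochastic gradients of the training error over the mini-batches composing the first hyper-batch. Indeed, the partial derivatives of the learning dynamics (see Equation (11)) are given by
$$A_t = \frac{\partial \Phi_t(w_{t-1}, \eta)}{\partial w_{t-1}} = I - \eta H_t(w_{t-1}), \qquad B_t = \frac{\partial \Phi_t(w_{t-1}, \eta)}{\partial \eta} = -\nabla J_t(w_{t-1})^\top,$$
where $H_t$ is the Hessian of $J_t$ and the gradient is regarded as a row vector. Consequently, since $\eta = 0$ implies $A_t = I$ and $w_t = w_0$ throughout the first hyper-batch,
$$Z_t = A_t Z_{t-1} + B_t = -\sum_{j=1}^{t} \nabla J_j(w_0)^\top, \qquad Z_0 = 0,$$
and the first hypergradient w.r.t. $\eta$ at time step $t = \Delta$, where $\Delta$ is the hyper-batch size, is
$$\nabla E(w_\Delta)\, Z_\Delta = -\nabla E(w_0) \sum_{j=1}^{\Delta} \nabla J_j(w_0)^\top.$$
Thus, since the learning rate moves in the direction opposite to the hypergradient, the smaller the angle between the validation error gradient and the (unnormalized) stochastic training error gradient, the bigger the update. In particular, if the scalar product is negative (that is, the angle exceeds 90 degrees), the learning rate would become negative, suggesting that $w_0$ may be a bad parameter initialization point.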
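The observation above can be checked numerically with a small sketch. All specifics here are assumptions made for illustration: quadratic per-mini-batch training errors centred at hypothetical points `centers`, a quadratic validation error centred at `d`, and the function name `first_hypergradient`. The code propagates the derivative of the weights with respect to the learning rate in forward mode through the first hyper-batch, starting from a learning rate of zero.

```python
def first_hypergradient(w0, centers, d, eta=0.0):
    """Forward-mode hypergradient w.r.t. the learning rate after one hyper-batch.

    Toy setting (an assumption of this sketch):
    J_j(w) = 0.5 * ||w - centers[j]||^2, so the Hessian is the identity,
    and E(w) = 0.5 * ||w - d||^2.
    """
    w = list(w0)
    z = [0.0] * len(w0)  # derivative of w with respect to eta
    for c in centers:
        grad = [wi - ci for wi, ci in zip(w, c)]  # training gradient at the current w
        # Z_t = (I - eta * H) Z_{t-1} - grad, with H = I in this toy problem.
        z = [(1.0 - eta) * zi - gi for zi, gi in zip(z, grad)]
        w = [wi - eta * gi for wi, gi in zip(w, grad)]  # SGD step
    val_grad = [wi - di for wi, di in zip(w, d)]
    return sum(vg * zi for vg, zi in zip(val_grad, z))

if __name__ == "__main__":
    w0 = [0.0, 0.0]
    centers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one centre per mini-batch
    d = [1.0, 1.0]
    hg = first_hypergradient(w0, centers, d)
    # Closed form: minus the scalar product of the validation gradient at w0
    # with the sum of the mini-batch training gradients at w0.
    sum_grads = [sum(w0[i] - c[i] for c in centers) for i in range(len(w0))]
    closed = -sum((w0[i] - d[i]) * sum_grads[i] for i in range(len(w0)))
    print(hg, closed)
```

In this example the validation and training gradients point in similar directions, so the hypergradient is negative and a gradient descent step on the learning rate makes it positive.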