1 Introduction and framework
Hyperparameter optimization (see, e.g., Moore et al., 2011; Bergstra et al., 2011; Bergstra and Bengio, 2012; Maclaurin et al., 2015; Bergstra et al., 2013; Hutter et al., 2015; Franceschi et al., 2017) is the problem of tuning the values of certain parameters that control the behavior of a learning algorithm. This is typically done by minimizing the expected error w.r.t. the hyperparameters, using the empirical loss on a validation set as a proxy. Meta-learning (see, e.g., Thrun and Pratt, 1998; Baxter, 1998; Maurer, 2005; Maurer et al., 2016; Vinyals et al., 2016; Santoro et al., 2016; Ravi and Larochelle, 2017; Mishra et al., 2017; Finn et al., 2017) is the problem of inferring a learning algorithm from a collection of datasets in order to obtain good performance on unseen datasets. Although hyperparameter optimization and meta-learning are different and apparently unrelated problems, they can both be formulated as special cases of a wider framework that we will introduce. This connection and our observations on learning-to-learn represent the main contribution of this work.
We start by considering bilevel optimization problems (see, e.g., Colson et al., 2007) of the form

$$\min_{\lambda} f(\lambda) = E(w_\lambda, \lambda) \qquad (1.1)$$

$$\text{such that} \quad w_\lambda \in \arg\min_{u} L_\lambda(u). \qquad (1.2)$$

We will call the function $E$ the outer objective (or outer loss), and, for every $\lambda$, $L_\lambda$ is called the inner objective (or inner loss). Note that $\{L_\lambda\}$ is a class of objectives parameterized by $\lambda$. As a prototypical example of (1.2), consider the case where $L_\lambda$ is a regularized empirical error for supervised learning, $E$ is an (unregularized) validation error, $\lambda$ a regularization parameter and $w$ the parameters of the model.
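Following (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017) we approximate the solutions of problem (1.1) by replacing the “argmin” in problem (1.2) by the $T$-th iterate of a dynamical system of the form

$$s_t = \Phi_t(s_{t-1}, \lambda), \quad t \in [T], \qquad s_0 = \Phi_0(\lambda), \qquad (1.3)$$

where $T$ is the number of iterations, $\Phi_0$ is a smooth initialization mapping and, for every $t \in [T] = \{1, \dots, T\}$, $\Phi_t$ is a smooth mapping that represents the operation performed by the $t$-th step of an optimization algorithm. Since the algorithm might involve auxiliary variables $v$, e.g. velocities when using stochastic gradient descent with momentum (SGDM), we replace $w$ with a state vector $s = (w, v)$. Using this notation, we formulate the following constrained optimization problem

$$\min_{\lambda, s_1, \dots, s_T} E(s_T, \lambda) \quad \text{subject to} \quad s_0 = \Phi_0(\lambda), \quad s_t = \Phi_t(s_{t-1}, \lambda), \quad t \in [T]. \qquad (1.4)$$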
This reformulation of the original problem allows for an efficient computation of the gradient of the outer objective with respect to the hyperparameters (the hypergradient), either in time or in memory (Maclaurin et al., 2015; Franceschi et al., 2017), by making use of reverse-mode or forward-mode algorithmic differentiation (Griewank and Walther, 2008). Moreover, by explicitly considering the learning dynamics, it is possible to compute the hypergradient with respect to the hyperparameters that appear inside the optimization dynamics (e.g. step size or momentum factor if the dynamics is SGDM), as opposed to other methods that compute the hypergradient at the minimizer of the inner objective (Pedregosa, 2016). This key fact allows for the inclusion of learning-to-learn, more specifically learning-to-optimize, into the framework. In the next two sections we show that gradient-based hyperparameter optimization and learning-to-learn share this same underlying mathematical formulation.
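The forward-mode computation mentioned above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a scalar weight, a quadratic inner loss with an L2 penalty, a quadratic outer loss, and plain gradient descent as the learning dynamics. All function names and the constants `a`, `b`, `w0` are hypothetical choices made for illustration. It propagates the derivatives of the iterate with respect to a regularization coefficient and with respect to the step size, the latter being a hyperparameter that only appears inside the dynamics.

```python
def inner_grad(w, lam, a):
    """Gradient in w of the toy inner loss 0.5*(w - a)**2 + 0.5*lam*w**2."""
    return (w - a) + lam * w


def outer_loss(w, b):
    """Toy outer objective 0.5*(w - b)**2, standing in for a validation error."""
    return 0.5 * (w - b) ** 2


def unrolled_objective(lam, eta, a=1.0, b=0.5, w0=0.0, T=20):
    """Outer loss at the T-th iterate of gradient descent on the inner loss."""
    w = w0
    for _ in range(T):
        w = w - eta * inner_grad(w, lam, a)
    return outer_loss(w, b)


def forward_hypergradient(lam, eta, a=1.0, b=0.5, w0=0.0, T=20):
    """Forward-mode hypergradient with respect to (lam, eta).

    z_lam and z_eta track the derivatives of the current iterate with
    respect to lam and eta; they are updated alongside the iterate itself.
    """
    w, z_lam, z_eta = w0, 0.0, 0.0  # w0 does not depend on the hyperparameters
    for _ in range(T):
        g = inner_grad(w, lam, a)
        dstep_dw = 1.0 - eta * (1.0 + lam)   # derivative of the update w.r.t. w
        z_lam = dstep_dw * z_lam - eta * w   # direct derivative w.r.t. lam is -eta*w
        z_eta = dstep_dw * z_eta - g         # direct derivative w.r.t. eta is -g
        w = w - eta * g
    dE_dw = w - b                            # derivative of the outer loss at w_T
    return dE_dw * z_lam, dE_dw * z_eta
```

A quick sanity check is to compare the returned values with central finite differences of `unrolled_objective`. Forward mode keeps only the current iterate and one derivative per hyperparameter, so its memory cost does not grow with the number of iterations, at the price of a cost that grows with the number of hyperparameters.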
2 Gradient-based hyperparameter optimization
In the context of hyperparameter optimization, we are interested in minimizing the generalization error of a model $g_w$, parametrized by a vector $w$, with respect to $\lambda$. The outer optimization variables are in this context called hyperparameters and the outer objective is generally an empirical validation loss. Specifically, a set of labeled examples $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $(x_i, y_i) \in X \times Y$, is split into training and validation sets $D_{\mathrm{tr}}$, $D_{\mathrm{val}}$. The inner objective is computed on (mini-batches of) examples from $D_{\mathrm{tr}}$ while the outer objective, which represents a proxy for the generalization error of $g_w$, is computed on $D_{\mathrm{val}}$. Assuming, for simplicity, that the optimization dynamics is given by stochastic gradient descent, and thus that the state is $s = w$, problem (1.4) becomes

$$\min_{\lambda, w_1, \dots, w_T} E(w_T; D_{\mathrm{val}}) \quad \text{subject to} \quad w_0 = \Phi_0(\lambda), \quad w_t = w_{t-1} - \eta \nabla L_\lambda(w_{t-1}; B_t), \quad t \in [T], \qquad (2.1)$$

where $B_t \subset D_{\mathrm{tr}}$ is a mini-batch of samples at the $t$-th iteration, $\eta$ is a learning rate (a component of $\lambda$) and where we made explicit the dependence of the loss functions on the examples. In this setting, the outer loss $E$ does not depend explicitly on the hyperparameters $\lambda$. The above formulation allows for the computation of the hypergradient of any real-valued hyperparameter, so that hyperparameters can be optimized with a gradient descent procedure. Having access to hypergradients makes it feasible to optimize a number of hyperparameters of the same order as the number of parameters, a situation which arises in the setting of learning-to-learn.
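As an illustration of the reverse-mode alternative for the mini-batch setting described above, here is a minimal sketch under stated assumptions: a scalar linear model, squared losses, an L2 penalty with coefficient `lam`, and a learning rate `eta`, both treated as hyperparameters. The data in `TRAIN` and `VAL` are made-up toy values, and all names are hypothetical. The forward pass stores the iterates; the backward pass propagates a scalar adjoint through the stored trajectory.

```python
TRAIN = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]   # toy (x, y) pairs
VAL = [(1.5, 3.0), (2.5, 5.1)]                              # toy validation pairs
BATCHES = [TRAIN[0:2], TRAIN[2:4]] * 5                      # 10 mini-batch steps


def batch_grad(w, lam, batch):
    """Gradient in w of the mean squared error on the batch plus 0.5*lam*w**2."""
    return sum((w * x - y) * x for x, y in batch) / len(batch) + lam * w


def batch_curvature(lam, batch):
    """Second derivative in w of the same regularized batch loss."""
    return sum(x * x for x, _ in batch) / len(batch) + lam


def val_loss(w, val):
    """Outer objective: mean squared error on the validation set."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in val) / len(val)


def sgd_trajectory(lam, eta, batches, w0=0.0):
    """Run the dynamics and keep every iterate w_0, ..., w_T."""
    ws = [w0]
    for batch in batches:
        ws.append(ws[-1] - eta * batch_grad(ws[-1], lam, batch))
    return ws


def reverse_hypergradient(lam, eta, batches, val, w0=0.0):
    """Reverse-mode hypergradient of val_loss(w_T) with respect to (lam, eta)."""
    ws = sgd_trajectory(lam, eta, batches, w0)
    # Adjoint of the final iterate: derivative of the outer loss at w_T.
    alpha = sum((ws[-1] * x - y) * x for x, y in val) / len(val)
    g_lam, g_eta = 0.0, 0.0
    for t in range(len(batches), 0, -1):
        w_prev, batch = ws[t - 1], batches[t - 1]
        g_lam += alpha * (-eta * w_prev)                    # direct effect of lam at step t
        g_eta += alpha * (-batch_grad(w_prev, lam, batch))  # direct effect of eta at step t
        alpha *= 1.0 - eta * batch_curvature(lam, batch)    # adjoint of w_{t-1}
    return g_lam, g_eta
```

In this sketch the whole trajectory is stored, so memory grows with the number of iterations, while a single backward sweep yields the hypergradient for all hyperparameters at once.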
Since in this context the total number of iterations is often high due to large datasets or complex models, it is possible to speed up the optimization and reduce memory requirements by computing partial hypergradients at intermediate iterations, either in reverse or forward mode, and updating the hyperparameters online several times before reaching the final iteration (Franceschi et al., 2017).
3 Learning-to-learn
The aim of meta-learning is to learn an algorithm capable of solving ground learning problems originated by an (unknown) distribution $P$. A meta-dataset $\mathcal{D}$ is thus a collection of datasets, or episodes, sampled from $P$, where each dataset $D^j$ is linked to a specific task. We are interested in learning an algorithm capable of “producing” ground models $g^j$, which we assume identified by parameter vectors $w^j$. The algorithm itself can be thought of as a meta-model $m_\lambda$, or meta-learner, parametrized by a vector $\lambda$, so that $w^j = m_\lambda(D^j)$. The meta-learner is viewed as a function which maps datasets to models (or weights), effectively making it a (non-standard, usually highly parametrized) learning algorithm. As a learning dynamics, in general, the meta-model can act in an iterative way, so that $w^j_t = m_\lambda(w^j_{t-1}, D^j)$. Moreover, as in the case of a standard optimization algorithm, the meta-learner can make use of auxiliary variables $v^j$, forming state vectors $s^j = (w^j, v^j)$. Since the ground models should exhibit good generalization performance on their specific task, each dataset can be split into training and validation sets $D^j_{\mathrm{tr}}$, $D^j_{\mathrm{val}}$ (note that some authors, e.g. Ravi and Larochelle, 2017, refer to the latter set as the test set), and $m_\lambda$ can be trained to minimize the average validation error over tasks, which constitutes a natural outer objective in this setting. For each task, the meta-learner produces a sequence of states $(s^j_t)_{t=1}^{T}$.
We can thus formulate problem (1.4) for learning-to-learn as follows:

$$\min_{\lambda} \sum_{D^j \in \mathcal{D}} E^j(s^j_T, \lambda; D^j_{\mathrm{val}}) \quad \text{subject to} \quad s^j_0 = \Phi_0(\lambda), \quad s^j_t = m_\lambda(s^j_{t-1}, D^j_{\mathrm{tr}}), \quad t \in [T], \qquad (3.1)$$

where the functions $E^j$ are task-specific losses. The meta-model $m_\lambda$ plays the role of the mapping $\Phi_t$ in (1.3), thus reducing the problem of learning-to-learn to that of learning a training dynamics, or its associated parameters $\lambda$. The meta-learner parameters mirror the hyperparameters in the context of hyperparameter optimization in Section 2 and can be optimized with a gradient descent procedure on the outer objective. The inner objective does not appear explicitly in problem (3.1), but we assume that the meta-learner has access to task-specific inner objectives $L^j$.
While in principle the meta-learner could be implemented by any parametrized mapping, the design of meta-learning models can follow three non-exclusive natural directions:
Learning-to-optimize: the meta-learner can replace a gradient-based optimization algorithm (Andrychowicz et al., 2016; Wichrowska et al., 2017), acting at each step on the weights of ground models on the basis of a mini-batch of examples. The meta-model is often interpreted (Ravi and Larochelle, 2017) as a recurrent neural network, whose hidden states are the analog of the auxiliary variables in Section 1. Alongside the update rule, it is possible to learn an initialization for the ground model weights, described by the initialization mapping of Section 1. For instance, Finn et al. (2017) learn a common initialization, assuming that all the input and output spaces of the tasks in the meta-dataset have the same dimensionality, and use gradient descent for the following steps;
Learning meta-representations: the meta-learner is composed of a gradient descent procedure and a mapping from ground task instances to intermediate representations. In this case the ground models take the intermediate representations as input, and their weights are updated by gradient descent. This approach can prove particularly useful in cases where the instance spaces are structurally different among tasks. It differs from standard representation learning in deep learning (Bengio et al., 2009; Goodfellow et al., 2016) since the meta-training loss is specifically designed for promoting generalization across tasks;
Learning ground loss functions: the meta-learner can be a gradient descent algorithm that optimizes a learned inner objective. For example, we may directly parametrize the training error (which in a standard supervised learning setting is usually a mean squared error for regression or a cross-entropy loss for classification), or learn a multitask regularizer which provides a coupling among different learning tasks in the meta-dataset.
In the next section we present experiments that explore the second design pattern. For experiments on gradient-based hyperparameter optimization we refer to (Franceschi et al., 2017).
4 Experiments
We report preliminary results on the problem of few-shot learning, using MiniImagenet (Vinyals et al., 2016), a subset of ImageNet (Deng et al., 2009) that contains 60000 downsampled images from 100 different classes. As in (Ravi and Larochelle, 2017), we build meta-datasets by sampling ground classification problems with 5 classes, where each episode is constructed so that its training set contains 1 (one-shot learning) or 5 (5-shot learning) examples per class and its validation set contains 15 examples per class. Out of 100 classes, 64 classes are included in a training meta-dataset from which we sample datasets for solving problem (3.1); 16 classes form a validation meta-dataset, which is used to tune meta-learning hyperparameters, while a third meta-dataset with the remaining 20 classes is held out for testing. We use the same split and images proposed by (Ravi and Larochelle, 2017). The code is available at https://github.com/lucfra/FAR-HO.
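The episode construction described above can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout `images_by_class` and the toy image identifiers are assumptions, not the data pipeline of the linked repository. The class split itself is fixed in the experiments, so only the per-episode sampling is shown.

```python
import random


def sample_episode(images_by_class, n_way=5, k_shot=1, n_val_per_class=15, rng=None):
    """Sample one episode as a (training set, validation set) pair.

    images_by_class maps a class id to the list of its images (hypothetical layout).
    Labels are re-indexed to 0..n_way-1 within the episode.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), n_way)
    d_tr, d_val = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint training and validation examples for this class.
        picks = rng.sample(images_by_class[cls], k_shot + n_val_per_class)
        d_tr += [(img, label) for img in picks[:k_shot]]
        d_val += [(img, label) for img in picks[k_shot:]]
    return d_tr, d_val


# Toy stand-in for the 64 meta-training classes with 600 images each.
meta_train = {c: [f"img_{c}_{i}" for i in range(600)] for c in range(64)}
d_tr, d_val = sample_episode(meta_train, k_shot=1, rng=random.Random(0))
```

With `k_shot=1` each episode has 5 training examples and 75 validation examples; passing `k_shot=5` gives the 5-shot variant.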
Our meta-model design involves the learning of a cross-episode intermediate representation. We design a meta-representation as a four-layer convolutional neural network, where each layer is composed of a convolution with 32 filters, a batch normalization followed by a ReLU activation, and a 2x2 max-pooling. The ground models are logistic regressors that take as input the output of the meta-representation. Ground model parameters are initialized and then optimized by a few gradient descent steps on the cross-entropy loss computed on the training set of the episode (note that, for a fixed representation, the inner loss is convex with respect to the ground model parameters). The step size is also learned. For each task the final classification model is thus given by the composition of the meta-learner with the ground learner, so that the prediction for an input sample is obtained by applying the ground model to the representation of that sample. We highlight that, unlike in (Finn et al., 2017), the weights of the representation mapping are kept constant for each episode, and learned across datasets by minimizing the outer objective in (3.1). We compute a stochastic gradient of the outer objective by sampling mini-batches of 4 episodes and use Adam with decaying learning rate as optimization method for the meta-model variables. Finally, we perform early stopping during meta-training and optimize the number of gradient descent updates (see Figure 2) based on the mean accuracy on the test sets of episodes in the validation meta-dataset. We report results in Table 2. The proposed method, called Hyper-Representation, achieves a competitive result despite its simplicity, highlighting the relative importance of learning a good representation independent from specific tasks, on top of which simple logistic classifiers can perform and generalize well. Figure 3 provides a visual example of the quality of the learned representation, showing that examples (the first from the training, the second from the testing meta-datasets) from similar classes (different dog breeds) are mapped nearby and, conversely, samples from dissimilar classes are mapped far apart. In Appendix A we empirically show the importance of learning with the proposed framework.
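One way to read the method described above is as an instance of problem (3.1) in which the learning dynamics is plain gradient descent on the ground model weights and the representation is shared across episodes. The following is a sketch under assumptions: the symbols $h_\lambda$ (representation), $g^j_w$ (ground classifier of episode $j$ with weights $w$), $\eta$ (learned step size), $\ell$ (cross-entropy) and $w^j_0$ (initial ground weights) are notation chosen here for illustration, and the sum over episodes stands in for the average validation error.

```latex
\begin{aligned}
\min_{\lambda,\,\eta}\quad
  & \sum_{j} \sum_{(x,y)\in D^j_{\mathrm{val}}}
    \ell\big(g^j_{w^j_T}(h_\lambda(x)),\, y\big) \\
\text{subject to}\quad
  & w^j_t = w^j_{t-1} - \eta\, \nabla_{w} L^j(w^j_{t-1}, \lambda),
    \qquad t = 1, \dots, T, \\
  & L^j(w, \lambda) = \sum_{(x,y)\in D^j_{\mathrm{tr}}}
    \ell\big(g^j_{w}(h_\lambda(x)),\, y\big).
\end{aligned}
```

Under this reading, the hypergradient with respect to $\lambda$ accounts both for the direct dependence of the validation predictions on $h_\lambda$ and for the dependence of $w^j_T$ on $\lambda$ through the $T$ gradient steps.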
|5-classes accuracy %||1-shot||5-shots|
Table 2: Mean accuracy scores with 95% confidence intervals, computed on episodes from the test meta-dataset, of various methods on 1-shot and 5-shot classification problems on MiniImagenet.
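For reference, a mean accuracy with a 95% confidence interval over episodes can be computed as in the sketch below. The text does not state how the intervals were obtained, so the normal approximation used here (half-width 1.96 times the standard error) is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def mean_with_ci95(episode_accuracies):
    """Return (mean, half-width) of a normal-approximation 95% confidence interval.

    episode_accuracies holds one accuracy value per test episode.
    """
    n = len(episode_accuracies)
    mean = statistics.fmean(episode_accuracies)
    # Standard error of the mean, using the sample standard deviation.
    std_err = statistics.stdev(episode_accuracies) / math.sqrt(n)
    return mean, 1.96 * std_err
```

The result would be reported as mean ± half-width; the interval narrows as more test episodes are evaluated.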
[Figure 2: … of gradient descent steps on ground model parameters for one-shot learning.]
Ongoing experiments aim at combining the first and the second design patterns outlined in Section 3 both in depth (lower-layer weights are hyperparameters and higher-layer weights are initial points) and in width (a portion of the filters constitutes the meta-representation, while the weights relative to the rest of the filters are considered initialization), and at experimenting with the third pattern. Moreover, we plan to explore settings in which different datasets come from various domains (e.g. visual, natural language, speech), are linked to diverse tasks (e.g. classification, localization, segmentation, generation) and have structurally different instance spaces.
We observed that hyperparameter optimization and learning-to-learn share the same mathematical structure, captured by a bilevel programming problem, which consists in minimizing an outer objective function whose value implicitly depends on the solution of an inner problem. The objective function of this latter problem, whose optimization variables are identified with the parameters of (ground) models, is, in turn, parametrized by the outer problem variables, identified either as hyperparameters or as parameters of a meta-model, depending on the context. Since the solution of the inner optimization problem does not have, in general, a closed-form expression, we formulate a related constrained optimization problem by explicitly considering an optimization dynamics for the inner problem (e.g. gradient descent). In this way we are able to (A) compute the outer objective and optimize it by gradient descent and (B) also optimize variables that parametrize the learning dynamics. We discussed examples of the framework and presented experiments on few-shot learning, introducing a method for learning a shared, cross-episode representation.
References
- Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989.
- Baxter (1995) Baxter, J. (1995). Learning internal representations. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 311–320. ACM.
- Baxter (1998) Baxter, J. (1998). Theoretical models of learning to learn. Learning to learn, pages 71–94.
- Bengio et al. (2009) Bengio, Y. et al. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1):1–127.
- Bergstra and Bengio (2012) Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
- Bergstra et al. (2013) Bergstra, J., Yamins, D., and Cox, D. D. (2013). Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. ICML (1), 28:115–123.
- Bergstra et al. (2011) Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554.
- Colson et al. (2007) Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of operations research, 153(1):235–256.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE.
- Domke (2012) Domke, J. (2012). Generic Methods for Optimization-Based Modeling. In AISTATS, volume 22, pages 318–326.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 1126–1135.
- Franceschi et al. (2017) Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pages 1165–1173.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
- Griewank and Walther (2008) Griewank, A. and Walther, A. (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM.
- Hutter et al. (2015) Hutter, F., Lücke, J., and Schmidt-Thieme, L. (2015). Beyond Manual Tuning of Hyperparameters. KI - Künstliche Intelligenz, 29(4):329–337.
- Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning.
- Maurer (2005) Maurer, A. (2005). Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994.
- Maurer et al. (2016) Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32.
- Mishra et al. (2017) Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. (2017). Meta-Learning with Temporal Convolutions. arXiv:1707.03141 [cs, stat].
- Moore et al. (2011) Moore, G., Bergeron, C., and Bennett, K. P. (2011). Model selection for primal SVM. Machine Learning, 85(1-2):175–208.
- Pedregosa (2016) Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 737–746. PMLR.
- Ravi and Larochelle (2017) Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. ICLR.
- Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850.
- Thrun and Pratt (1998) Thrun, S. and Pratt, L. (1998). Learning to learn. Springer.
- Vinyals et al. (2016) Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., and Wierstra, D. (2016). Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, pages 3630–3638.
- Wichrowska et al. (2017) Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and Sohl-Dickstein, J. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3751–3760.
Appendix A On variants of representation learning methods
We report in Table 1 additional results on a series of experiments for one-shot learning on MiniImagenet, with the aim of comparing our method for learning a meta-representation, outlined in Sections 3 and 4, with other methods for learning representations that involve the factorization of a classifier as the composition of a ground model with a representation mapping. The representation mapping is either pretrained on the classification problem with all the images in the training meta-dataset or learned with different meta-learning algorithms. In all the experiments, for each episode the ground model is a multinomial logistic regressor learned with a few iterations of gradient descent as described in Section 4.
|Method||Accuracy 1-shot||Method||Accuracy 1-shot|
In the experiments in the left column we use as representation mapping the outputs of different layers of two distinct neural networks (denoted NN and Multiclass in the table) trained with a standard multiclass supervised learning approach on the totality of examples contained in the training meta-dataset (600 examples for each of the 64 classes; we hold out 3840 uniformly drawn samples to form a small test set). The first network NN, which has 64 filters per layer, achieves a test accuracy of . It is the same network used to reproduce the Nearest-neighbor baseline in Table 2 and it has been trained with an early stopping procedure on the nearest-neighbor classification accuracy computed on episodes sampled from the validation meta-dataset. The network Multiclass, which has 32 filters per layer, has instead been trained with an early stopping procedure on the accuracy on a small held-out validation set. Achieving a test accuracy of , this second model is superior on the (standard) multiclass classification problem. For each network we report experiments using different layers as representation. Specifically:
conv: we use the output of the last convolutional layer as representation, that is for NN and for Multiclass;
linear: we use as representation the linear output layer (before applying the softmax operation), so that .
softmax: the representation is given by the probability distribution output of the network.
The linear representation yields the best result for both of the networks and in the case of Multiclass achieves comparable results with previously proposed meta-learning methods.
The experiments in the right column, where the representation mapping is learned with meta-learning techniques, explore two directions: the first is to verify the impact of various approximations on the computation of the hypergradient, and the second is to empirically assess the importance of the training/validation splitting of each training episode. In the experiments denoted Bilevel-train, we use a bilevel approach but, unlike in Section 4, we optimize the parameter vector of the representation mapping by minimizing the loss on the training sets. The outer objective is thus given by
We consider episodes with training sets composed of 1 and 16 examples per class, denoted (1x5) and (16x5) respectively. In these cases the outer objective goes quickly to 0 and the learning ceases after a few hundred iterations. In Approx experiments we consider an approximation of the hypergradient obtained by disregarding the optimization dynamics of the inner objectives (i.e. we set ). We also run this experiment considering the training/validation splitting, obtaining a final test accuracy of . In the experiments denoted as Classic we jointly minimize
and treat the problem as a standard multitask learning problem as suggested in (Baxter, 1995) (with the exception that we evaluate mini-batches of 4 episodes, randomly sampled every 4 gradient descent iterations).
This series of experiments suggests that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good meta-representation. Nevertheless, provided that the training sets contain a sufficient number of examples, the joint optimization method also achieves decent results, while learning the representation using only the training sets of one-shot episodes (experiments train 1x5) proves unsuccessful in every tested setting, a result in line with the theoretical analysis in (Baxter, 1995) (it remains interesting to explore, both theoretically and empirically, how the size of the validation sets of meta-training episodes impacts the generalization performance of meta-learning algorithms). On the other hand, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context, some classes in the training and testing meta-datasets are rather similar (e.g. various dog breeds) and thus ground classifiers can leverage very specific representations.