Artificial Neural Networks have proven to be highly successful function approximators when (1) trained on large datasets and (2) trained until convergence using IID sampling. Without large datasets and IID sampling, however, they are prone to over-fitting and catastrophic forgetting (French, 1991, 1999) respectively. Gradient-based meta-learning has recently been shown to be highly successful at extracting the high-level stationary structure of a problem from a meta-dataset (a dataset of datasets), allowing few-shot generalization without over-fitting (Finn, 2018). More recently, it has also been shown to mitigate forgetting for better continual learning (Nagabandi et al., 2019; Javed and White, 2019).
A gradient-based meta-learner has two important components: (1) the meta-objective, the objective function that the algorithm minimizes during meta-training, and (2) the meta-parameters, the parameters updated during meta-training to minimize the selected meta-objective. One of the most popular realizations of such a meta-learning framework is MAML (Finn et al., 2017). MAML addresses few-shot learning by using fast adaptation and generalization as its meta-objective and a model initialization (a set of weights used to initialize the parameters of a neural network) as its meta-parameters. The idea is to encode the stationary structure of tasks coming from a fixed task distribution in the weights used for initializing a model, such that regular SGD updates starting from this initialization are effective for few-shot learning.
While the choices made by MAML for the meta-objective and meta-parameters are reasonable, there are many alternatives. For instance, instead of learning a model initialization, we could learn a representation (Javed and White, 2019; Bengio et al., 2019), that is, an encoder that transforms input data into a vector representation more conducive to learning; learning rates (Li et al., 2017); an update rule (Bengio et al., 1990; Metz et al., 2019); a causal structure (Bengio et al., 2019); or even the complete learning algorithm (Ravi and Larochelle, 2016). Similarly, instead of using the few-shot objective, it is possible to define a meta-objective that minimizes other second-order metrics, such as catastrophic forgetting (Javed and White, 2019; Riemer et al., 2019).
In this work, we investigate whether incorporating robustness to interference in the meta-objective improves performance on incremental learning benchmarks at meta-test time. Recently, Javed and White (2019) introduced an objective, MRCL, that learns a representation by minimizing interference, and showed that such representations drastically improve performance on incremental learning benchmarks. However, they do not compare their method to representations learned by the few-shot learning objective. Nagabandi et al. (2019), on the other hand, found that incorporating effects of incremental learning, such as interference, at meta-train time did not improve performance on their continual learning benchmark at meta-test time. It is a fair question, then, whether the new objective introduced by Javed and White (2019) is necessary for effective incremental learning; it is possible that fast adaptation alone would be sufficient for meta-learning non-interfering representations.
2 Problem Formulation
To compare the two objectives, we propose learning Continual Learning Prediction (CLP) tasks online; this problem setting requires both fast adaptation and robustness to interference. We define a CLP task $\mathcal{T}$ as a tuple consisting of an initial observation and target $(X_1, Y_1)$, a loss function $\mathcal{L}$ evaluated on the predictions of our parametrized model $f$, transition dynamics, an episode length $H$, and sets $\mathcal{X}$ and $\mathcal{Y}$ such that $X_t \in \mathcal{X}$ and $Y_t \in \mathcal{Y}$. A sample $S$ from a CLP task consists of a stream of potentially highly correlated samples of length $H$, obtained by starting from $(X_1, Y_1)$ and following the transition dynamics to get $(X_1, Y_1), (X_2, Y_2), \dots, (X_H, Y_H)$.
Furthermore, we write $\mathcal{L}(S)$ for the loss over a sample $S$. The learning objective of the CLP task is to minimize the expected loss of a task, i.e. $\mathbb{E}_{S \sim \mathcal{T}}[\mathcal{L}(S)]$, from a single sample by seeing one data point at a time. Standard neural networks, without any meta-learning, would do poorly when applied to the CLP task, as they struggle to learn online from a highly correlated stream of data in a single pass.
3 Comparing the Two Objectives
To apply neural networks to the CLP task, we propose meta-learning a function $\phi_\theta$, a deep neural network parametrized by $\theta$, from the input set $\mathcal{X}$ to $\mathbb{R}^d$. We then learn another function $g_W$ from $\mathbb{R}^d$ to the target set $\mathcal{Y}$. By composing the two functions we get $f_{\theta,W}(X) = g_W(\phi_\theta(X))$, which constitutes our model for the CLP tasks. We treat $\theta$ as meta-parameters that are learned by minimizing the meta-objective and then fixed at meta-test time. After learning $\theta$, we learn $g_W$ for a CLP task from a single trajectory using fully online SGD updates in a single pass.
For meta-training, we assume a distribution over CLP tasks given by $p(\mathcal{T})$. We consider two meta-objectives for updating the meta-parameters $\theta$: (1) a MAML-like few-shot learning objective, and (2) MRCL, an objective that also minimizes interference in addition to maximizing fast adaptation. The two objectives can be implemented as Algorithms 1 and 2 respectively, with the primary difference between the two highlighted in red. Note that MAML uses the complete batch of data for its inner updates, whereas MRCL uses one data point from the sampled trajectory for each update. This allows MRCL to take the effects of incremental learning, such as catastrophic forgetting, into account.
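The contrast between the two inner loops can be sketched in a few lines. This is a minimal illustration and not Algorithm 1 or 2: it assumes a toy linear model with a squared-error loss, all function names are hypothetical, and the outer meta-gradient step on the meta-parameters is omitted entirely.

```python
def predict(w, x):
    """Linear prediction w . x for a toy model (an assumption of this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def grad_squared_error(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    err = predict(w, x) - y
    return [err * xi for xi in x]

def sgd_step(w, grad, lr):
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def batch_inner_updates(w, batch, lr, steps):
    """MAML-style inner loop: every step uses the mean gradient of the whole batch."""
    for _ in range(steps):
        grads = [grad_squared_error(w, x, y) for x, y in batch]
        mean_grad = [sum(g) / len(batch) for g in zip(*grads)]
        w = sgd_step(w, mean_grad, lr)
    return w

def online_inner_updates(w, trajectory, lr):
    """MRCL-style inner loop: one step per data point, in stream order."""
    for x, y in trajectory:
        w = sgd_step(w, grad_squared_error(w, x, y), lr)
    return w

# Two data points that share an input feature, so updates can interfere.
data = [([1.0, 1.0], 2.0), ([1.0, 0.0], 0.0)]
w0 = [0.0, 0.0]

batch_w = batch_inner_updates(w0, data, lr=0.5, steps=1)
online_w = online_inner_updates(w0, data, lr=0.5)

# After the online pass, the fit to the first point has been partly undone
# by the update on the second point.
first_x, first_y = data[0]
print("batch weights:", batch_w)
print("online weights:", online_w)
print("online error on first point:", predict(online_w, first_x) - first_y)
```

In the online loop, the first update fits the first point exactly and the second update then moves the shared weight, so the prediction on the first point degrades. A meta-objective evaluated after such a loop is exposed to this kind of interference, whereas a loop that averages gradients over the full batch is not.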
4 Dataset, Implementation Details, and Results
4.1 CLP tasks using Omniglot
Omniglot is a dataset of 1,623 characters from 50 different alphabets (Lake et al., 2015). Each character has 20 hand-written images. The dataset is divided into two parts: the first 963 classes constitute the meta-training dataset, whereas the remaining 660 constitute the meta-testing dataset. To define a CLP task on these datasets, we sample an ordered set of 200 classes $(C_1, C_2, \dots, C_{200})$. The input set $\mathcal{X}$ and target set $\mathcal{Y}$ then consist of all images of these classes and their class labels. A sample from such a task is a trajectory of images, five images per class, in which we see all five images of $C_1$, followed by five images of $C_2$, and so on. This makes the episode length $H = 1000$. Note that the sampling operation defines a distribution over tasks, which we use for meta-training.
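The construction of one class-ordered trajectory can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout, and the placeholder image strings are hypothetical, and choosing the five images of each class uniformly at random is an assumption, since the text does not say which five are used.

```python
import random

def sample_clp_trajectory(images_by_class, num_classes=200, shots=5, seed=None):
    """Return (trajectory, class_order) for one class-ordered CLP sample.

    images_by_class maps a class label to the list of its images.
    """
    rng = random.Random(seed)
    # Ordered set of classes that defines the task.
    class_order = rng.sample(sorted(images_by_class), num_classes)
    trajectory = []
    for label in class_order:
        # Assumption: the `shots` images of a class are drawn at random.
        for image in rng.sample(images_by_class[label], shots):
            trajectory.append((image, label))
    return trajectory, class_order

# Toy stand-in for the 963 meta-training classes with 20 images each.
toy_data = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(963)}
trajectory, class_order = sample_clp_trajectory(toy_data, seed=0)

print(len(trajectory))  # 200 classes * 5 images per class
```

Because all images of one class appear before any image of the next, consecutive samples in the stream are highly correlated, which is the property that makes the task hard for ordinary online SGD.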
We learn an encoder, a deep CNN with six convolution layers and two FC layers, using the MAML and the MRCL objectives. We treat the convolution parameters as the meta-parameters $\theta$ and the FC layer parameters as $W$. Because optimizing the MRCL objective is computationally expensive for $H = 1000$ (it involves unrolling the computation graph for 1,000 steps), we approximate the two objectives. For MAML, we learn the encoder by maximizing fast adaptation for a 5-shot 5-way classifier. For MRCL, instead of taking one inner gradient step per sample in the trajectory as described in Algorithm 2, we go over five steps at a time. For five steps in the inner loop, we accumulate our meta-loss, and we update our meta-parameters using these accumulated gradients at the end, as explained in Algorithm 4 in the Appendix. This allows us to never unroll our computation graphs for more than five steps (similar to truncated back-propagation through time) and still take into account the effects of interference at meta-training.
Finally, both MAML and MRCL use 5 inner gradient steps and similar network architectures for a fair comparison. Moreover, for both methods, we try multiple values for the inner learning rate and report the results for the best parameter. For more details about hyper-parameters see the Appendix.
SR-NN (Liu et al., 2019) does not use gradient-based meta-learning; instead, it uses the meta-training dataset to learn a sparse representation by regularizing the activations in the representation layer and serves as a baseline.
At meta-test time, we sample 50 CLP tasks from the meta-test set. For each task, we learn the classifier parameters $W$ from a single trajectory using Algorithm 3 and compute accuracy on that trajectory (train accuracy). We also measure accuracy on multiple other samples from the task and report it as test accuracy.
More concretely, we transform all the images in a task to a vector representation using our meta-learned encoder and learn a classifier (up to 200 classes), parametrized by $W$, fully online (seeing all the data of one class before moving to the next) in a single pass. We report the train and test accuracy in Fig. 1 (a) and (b) respectively. At every point on the x-axis, we only report accuracy for the classes seen so far (this is why accuracy drops for all methods as we learn more and more classes). We can see from Fig. 1 (a) that representations learned by MRCL are significantly more robust to catastrophic interference than those learned by MAML. Moreover, from Fig. 1 (b), we see that the higher training accuracy also results in better generalization performance (i.e. MRCL is not just memorizing the training samples).
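The evaluation protocol can be illustrated with a small sketch. It is not the paper's Algorithm 3: it assumes a single linear softmax layer instead of the FC layers, a hypothetical one-hot toy encoder standing in for the frozen meta-learned encoder, and an argmax taken over all classes when scoring the classes seen so far.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, feat):
    return [sum(w * f for w, f in zip(row, feat)) for row in W]

def accuracy_on_seen(encode, W, data, seen):
    """Accuracy restricted to examples whose class has already been seen."""
    pairs = [(x, y) for x, y in data if y in seen]
    hits = 0
    for x, y in pairs:
        scores = class_scores(W, encode(x))
        hits += max(range(len(W)), key=scores.__getitem__) == y
    return hits / len(pairs)

def online_pass(encode, trajectory, num_classes, dim, lr=1.0):
    """One online pass over the stream; the encoder stays fixed throughout."""
    W = [[0.0] * dim for _ in range(num_classes)]
    seen, curve = [], []
    for i, (x, y) in enumerate(trajectory):
        feat = encode(x)
        probs = softmax(class_scores(W, feat))
        # Cross-entropy gradient step on the classifier weights only.
        for c in range(num_classes):
            coeff = probs[c] - (1.0 if c == y else 0.0)
            W[c] = [w - lr * coeff * f for w, f in zip(W[c], feat)]
        if y not in seen:
            seen.append(y)
        # Record accuracy once the last example of the current class is seen.
        if i + 1 == len(trajectory) or trajectory[i + 1][1] != y:
            curve.append(accuracy_on_seen(encode, W, trajectory, seen))
    return W, curve

# Toy setup: the "image" is just its class id, and the hypothetical encoder
# maps each class to its own one-hot vector (perfectly non-interfering).
def toy_encode(x):
    return [1.0 if j == x else 0.0 for j in range(3)]

toy_trajectory = [(c, c) for c in range(3) for _ in range(5)]
W, curve = online_pass(toy_encode, toy_trajectory, num_classes=3, dim=3)
print(curve)
```

With this idealized encoder, updates for one class touch only that class's feature, so accuracy on earlier classes is unaffected by later updates. An encoder whose features overlap across classes would not have this guarantee, which is the effect the accuracy curves are meant to expose.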
As a sanity check, we also trained classifiers by sampling data IID for three epochs and report the results in Fig. 1 (c) and (d). The fact that MAML and MRCL do equally well with IID sampling indicates that the quality of the representations ($\phi_\theta$) learned by both objectives is comparable, and that the higher performance of MRCL is indeed because its representations are more suitable for incremental learning.
Intuition Behind the Difference Between MRCL and MAML:
At an intuitive level, the primary difference between MRCL and MAML is in the inner gradient steps. For MAML, the inner gradient steps consist of SGD updates on a batch of data from all the classes. As a result, the objective is only maximizing fast adaptation and generalization. For MRCL, on the other hand, the inner gradient steps involve online SGD updates on a highly correlated stream of data. Consequently, the model not only has to adapt to the task from a single trajectory, but it also has to prevent subsequent inner updates from interfering with the earlier updates. This motivates the model to learn a representation that prevents forgetting of past knowledge.
Why Learn an Encoder as Opposed to a Network Initialization?
In this work, we meta-learned a representation given by the encoder $\phi_\theta$ as opposed to a network initialization. We empirically found that for online learning on highly correlated data streams, a network initialization is an ineffective inductive bias. This is especially true when learning long trajectories involving thousands of SGD updates. For a more detailed explanation with some empirical results, see Fig. 2 in the Appendix.
In this paper, we compared two meta-learning objectives for learning representations conducive to incremental learning. We found that MRCL, an objective that directly minimizes interference, is significantly better at learning such representations than MAML, an objective that only maximizes generalization and fast adaptation. This is contrary to what Nagabandi et al. (2019) found in their work. One explanation of why they did not see the benefit of incorporating online learning in meta-training is that their method also has a mechanism for detecting changes in tasks. Based on the detected task, an agent might choose to use a different neural network as its model. Such a task selection mechanism may make reducing interference less important. This explanation is further supported by continued adaptation with meta-learning, one of the baselines in their paper, which uses a single model for continuous adaptation. For this baseline, they did observe that an initialization learned by optimizing the MAML objective was ineffective at preventing forgetting.
- Bengio et al. (1990). Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche ….
- Bengio et al. (2019). A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912.
- Finn et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning.
- Finn (2018). Learning to learn with gradients. Ph.D. Thesis, EECS Department, University of California, Berkeley.
- French (1991). Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In Annual cognitive science society conference.
- French (1999). Catastrophic forgetting in connectionist networks. Trends in cognitive sciences.
- Javed and White (2019). Meta-learning representations for continual learning. Advances in Neural Information Processing Systems.
- Lake et al. (2015). Human-level concept learning through probabilistic program induction. Science.
- Li et al. (2017). Meta-sgd: learning to learn quickly for few-shot learning. arXiv:1707.09835.
- Liu et al. (2019). The utility of sparse representations for control in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4384–4391.
- Metz et al. (2019). Meta-learning update rules for unsupervised representation learning. International Conference on Learning Representations.
- Nagabandi et al. (2019). Deep online learning via meta-learning: continual adaptation for model-based RL. International Conference on Learning Representations.
- Ravi and Larochelle (2016). Optimization as a model for few-shot learning. International Conference on Learning Representations.
- Riemer et al. (2019). Learning to learn without forgetting by maximizing transfer and minimizing interference. International Conference on Learning Representations.