1 Introduction
Image classification models and algorithms often require an enormous number of labelled examples for training to achieve state-of-the-art performance. Labelled examples can be expensive and time-consuming to acquire. Human visual systems, on the other hand, are able to recognise new classes after being shown only a few labelled examples. Few-shot classification (Miller et al., 2000; Li et al., 2004, 2006; Lake et al., 2011) tackles this issue by learning to adapt to unseen classes (known as novel classes) with very few labelled examples from each class. Recent works show that meta-learning provides promising approaches to few-shot classification problems (Santoro et al., 2016; Finn et al., 2017; Li et al., 2017; Ravi and Larochelle, 2017; Sung et al., 2018). Meta-learning or learning-to-learn (Schmidhuber, 1987; Thrun and Pratt, 1998) takes the learning process a level deeper: instead of learning from the labelled examples in the training classes (known as base classes), meta-learning learns the example-learning process. The training process in meta-learning that utilises the base classes is called the meta-training stage, and the evaluation process that reports the few-shot performance on the novel classes is known as the meta-testing or meta-evaluation stage.
Despite being a promising solution to few-shot classification problems, meta-learning methods suffer from a limitation: a meta-learned model loses its few-shot classification ability on previous datasets as new ones arrive for meta-training. Popular few-shot classification datasets include Omniglot (Lake et al., 2011), CIFAR-FS (Bertinetto et al., 2019) and miniImageNet (Vinyals et al., 2016). A meta-learned model is restricted to performing few-shot classification on a specific dataset, in the sense that the base and novel classes have to originate from the same dataset distribution. The current practice for few-shot classifying novel classes from different datasets is to meta-learn a separate model for each dataset
(Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018; Bertinetto et al., 2019). This paper considers meta-learning a single model for few-shot classification on multiple datasets that arrive sequentially for meta-training.
Creating an agent that performs well on multiple tasks has long been an important goal, as this is a major step towards artificial general intelligence. Recent works exhibit the importance of excelling at multiple tasks in domains such as vision (Yuan and Yan, 2010; Zhang et al., 2014), speech (Seltzer and Droppo, 2013; Kurata and Audhkhasi, 2019) and translation (Dong et al., 2015; Johnson et al., 2017). In real-life applications, an agent often encounters new tasks and knowledge in a sequential manner. The goal is to create an agent that accumulates knowledge and experiences over time, in a similar fashion to how humans learn. Recent advances in human-robot interaction (Nikolaidis et al., 2015, 2016) develop adaptation models that allow robots to adapt to human behaviours over time and react accordingly to perform well on human-robot collaborative tasks. They demonstrate the importance of being able to process new information over time and to adapt accordingly for an optimal action based on the new knowledge and previous experiences in various robotic applications. We extend this goal to the few-shot classification problem and seek an agent that can perform few-shot classification across multiple datasets and continuously incorporate knowledge acquired from newly-arrived datasets.
In this paper, we introduce a Bayesian online meta-learning framework to train a model that is applicable to a broader scope of few-shot classification datasets by overcoming catastrophic forgetting. To the best of our knowledge, this is the first attempt to prevent catastrophic forgetting in few-shot classification problems using such a framework. We extend the Bayesian online learning (BOL) framework (Opper, 1998) to a Bayesian online meta-learning framework using the Model-Agnostic Meta-Learning (MAML) algorithm (Finn et al., 2017). MAML finds a good model parameter initialisation (called the meta-parameters) that can quickly adapt to novel classes using very few labelled examples, while BOL provides a principled framework for finding the posterior of the model parameters. Our framework aims to combine BOL and MAML to find the posterior of the meta-parameters. This paper builds on Ritter et al. (2018a), which combines the BOL framework and the Laplace approximation with a block-diagonal Kronecker-factored Fisher approximation to overcome catastrophic forgetting in large-scale supervised classification.
Below are the contributions we make in this paper:

We develop a Bayesian online meta-learning algorithm called Bayesian Online Meta-Learning with Laplace Approximation (BOMLA) for sequential few-shot classification problems.

We propose a simple approximation to the Fisher corresponding to the BOMLA framework that carries over the desirable block-diagonal Kronecker-factored structure from the Fisher approximation in the non-meta-learning setting.

We demonstrate that BOMLA can overcome catastrophic forgetting in comparison to running MAML sequentially in various few-shot classification settings.
2 Meta-Learning
Most meta-learning algorithms comprise an inner loop for example-learning and an outer loop that learns the example-learning process. Such algorithms often require sampling a batch of tasks at each iteration, where a task is formed by sampling a subset of classes from the pool of base classes or novel classes during meta-training or meta-evaluation respectively.
An offline meta-learning algorithm learns a few-shot classification model only for a specific dataset. For notational convenience, we drop the dataset subscript in this section, as there is only one dataset involved in offline meta-learning. The dataset is divided into a set of base classes and a set of novel classes for meta-training and meta-evaluation respectively. Upon completing meta-training on the base classes, the goal of few-shot classification is to perform well on an unseen task sampled from the novel classes after a quick adaptation on a small labelled subset of the task (known as the support set). The performance on this unseen task is evaluated on the query set, which is disjoint from the support set. Since the novel classes are not accessible during meta-training, this support-query split is mimicked on the base classes for meta-training.
2.1 Model-Agnostic Meta-Learning
In this paper, we are interested in the well-known meta-learning algorithm MAML (Finn et al., 2017). Each updating step of MAML aims to improve the ability of the meta-parameters to act as a good model initialisation for a quick adaptation on unseen tasks.
Each iteration of the MAML algorithm samples a meta-batch of $B$ tasks from the base class set and runs a few steps of stochastic gradient descent (SGD) for the inner-loop task-specific learning. The number of tasks $B$ sampled per iteration is known as the meta-batch size. For task $i$, the inner loop outputs the task-specific parameters $\phi_i$ from a $k$-step SGD quick adaptation on the loss $\mathcal{L}$ with the support set $S_i$, initialised at the meta-parameters $\theta$:

$\phi_i^{(j)} = \phi_i^{(j-1)} - \alpha\, \nabla_{\phi}\, \mathcal{L}\big(\phi_i^{(j-1)}; S_i\big), \quad j = 1, \ldots, k$ (1)

where $\phi_i^{(0)} = \theta$, $\phi_i \coloneqq \phi_i^{(k)}$ and $\alpha$ is the inner-loop learning rate. The outer loop gathers all task-specific adaptations to update the meta-parameters $\theta$ using the loss on the query set $Q_i$.
The overall MAML optimisation objective is
$\min_{\theta}\; \sum_{i=1}^{B} \mathcal{L}\big(\phi_i(\theta); Q_i\big)$ (2)
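To make the inner and outer loops concrete, the following is a minimal PyTorch sketch of one MAML iteration. It assumes a functional `loss_fn(params, batch)` that evaluates the model with the supplied parameter list and returns the loss; the names `inner_adapt`, `maml_step` and the `(support, query)` task format are illustrative choices, not the authors' code.

```python
import torch

def inner_adapt(theta, support, loss_fn, inner_lr=0.4, n_steps=1):
    """Inner loop of Eq. (1): k steps of SGD on the support set, initialised at theta."""
    phi = [p.clone() for p in theta]
    for _ in range(n_steps):
        loss = loss_fn(phi, support)
        grads = torch.autograd.grad(loss, phi, create_graph=True)  # keep graph for the outer loop
        phi = [p - inner_lr * g for p, g in zip(phi, grads)]
    return phi

def maml_step(theta, tasks, loss_fn, meta_opt):
    """Outer loop of Eq. (2): sum the query losses of the adapted parameters over a meta-batch."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for support, query in tasks:                  # meta-batch of sampled tasks
        phi = inner_adapt(theta, support, loss_fn)
        meta_loss = meta_loss + loss_fn(phi, query)
    meta_loss.backward()                          # differentiate through the inner adaptation
    meta_opt.step()
    return meta_loss.item()
```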
Like most meta-learning algorithms, MAML assumes a stationary task distribution during meta-training and meta-evaluation. Under this assumption, a meta-learned model is only applicable to a specific dataset distribution. When the model encounters a sequence of datasets, it loses the few-shot classification ability on previous datasets as new ones arrive for meta-training. Our work aims to meta-learn a single model for few-shot classification on multiple datasets that arrive sequentially for meta-training. We achieve this goal by incorporating MAML into the BOL framework to give a Bayesian online meta-learning framework that finds the posterior of the meta-parameters.
2.2 Overview of Our Bayesian Online Meta-Learning Approach
Our central contribution is to extend the benefits of meta-learning to the Bayesian online scenario, thereby training models that can generalise across tasks whilst dealing with parameter uncertainty in the setting of sequentially arriving datasets. Online meta-training occurs sequentially on the datasets $\mathcal{D}_1, \mathcal{D}_2, \ldots$. A newly-arrived $\mathcal{D}_t$ is separated into a base class set and a novel class set for meta-training and meta-evaluation respectively. Notationally, let $S_t$ and $Q_t$ denote the collections of support sets and query sets respectively from all possible tasks in the base class set of $\mathcal{D}_t$. We are interested in a MAP estimate of the meta-parameters, $\theta_t^{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid S_{1:t}, Q_{1:t})$. Using Bayes' rule on the posterior gives the recursive formula

$p(\theta \mid S_{1:t}, Q_{1:t}) \propto p(S_t, Q_t \mid \theta)\, p(\theta \mid S_{1:t-1}, Q_{1:t-1})$ (3)

$= p(S_t \mid \theta)\, \Big[ \int p(Q_t \mid \phi_t)\, p(\phi_t \mid S_t, \theta)\, \mathrm{d}\phi_t \Big]\, p(\theta \mid S_{1:t-1}, Q_{1:t-1})$ (4)

$\propto \Big[ \int p(Q_t \mid \phi_t)\, p(\phi_t \mid S_t, \theta)\, \mathrm{d}\phi_t \Big]\, p(\theta \mid S_{1:t-1}, Q_{1:t-1})$ (5)

where Eq. (3) follows from the assumption that each dataset is independent given $\theta$, and Eq. (5) follows by dropping the likelihood term $p(S_t \mid \theta)$.
From the meta-learning perspective, the parameters $\phi_t$ introduced in Eq. (5) can be viewed as the task-specific parameters in MAML. There are various choices for the distribution $p(\phi_t \mid S_t, \theta)$ in Eq. (5). In particular, if we choose to set it as the deterministic function of taking several steps of SGD on the loss with the support set collection $S_t$, initialised at $\theta$, we have

$p(\phi_t \mid S_t, \theta) = \delta\big(\phi_t - \mathrm{SGD}_k(\theta, S_t)\big)$ (6)

where $\delta(\cdot)$ denotes the Dirac delta function and $\mathrm{SGD}_k(\theta, S_t)$ the result of the $k$-step SGD adaptation of Eq. (1) on $S_t$; this recovers the MAML inner loop with SGD quick adaptation in Eq. (1). The recursion given by Eq. (5) forms the basis of our approach, and the remainder of this paper explains how we implement it. In order to do so, we give a mini tutorial on Bayesian online learning in the following section.
3 Background
This section provides background on using BOL to find the posterior of the model parameters and overcome catastrophic forgetting in large-scale supervised classification. We will then apply this approach to our recursion in Eq. (5). The posterior is typically intractable due to the enormous size of modern neural network architectures, which creates the need for a good approximation to the posterior of the meta-parameters. A particularly suitable candidate for this purpose in meta-learning is the Laplace approximation (MacKay, 1992; Ritter et al., 2018b), as it simply adds a quadratic regulariser to the training objective. Its approximate posterior belongs to the Gaussian family and requires computing the Hessian of the log-posterior for the precision matrix, which is computationally prohibitive for large neural networks. The block-diagonal Kronecker-factored Fisher approximation takes into account the parameter interactions within a layer of a neural network and provides an efficient approximation to the Hessian (Martens and Grosse, 2015; Grosse and Martens, 2016; Botev et al., 2017). We extend the BOL framework with the Laplace approximation for large-scale supervised classification (Ritter et al., 2018a) to a Bayesian online meta-learning framework that prevents forgetting in few-shot classification problems.
3.1 Bayesian Online Learning
Upon the arrival of the $t$-th dataset $\mathcal{D}_t$ for large-scale supervised classification, we are interested in a MAP estimate for the parameters $\theta$ of a neural network. Using Bayes' rule on the posterior gives the recursive formula

$p(\theta \mid \mathcal{D}_{1:t}) \propto p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1})$ (7)

where Eq. (7) follows from the assumption that each dataset is independent given $\theta$. As the normalised posterior is usually intractable, it may be approximated by a parametric distribution $q_t(\theta)$. The BOL framework consists of an update step and a projection step (Opper, 1998). The update step uses the approximate posterior $q_{t-1}$ obtained from the previous step for an update in the form of Eq. (7):

$\tilde{p}_t(\theta) \propto p(\mathcal{D}_t \mid \theta)\, q_{t-1}(\theta)$ (8)

The new posterior $\tilde{p}_t$ might not belong to the same parametric family as $q_{t-1}$. In this case, the new posterior has to be projected into the same parametric family to obtain $q_t$. Opper (1998) performs this projection by minimising the KL-divergence between the new posterior and the parametric $q_t$, while Ritter et al. (2018a) use the Laplace approximation instead.
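The two steps can be summarised structurally as follows. This is a conceptual Python sketch in which `q_prev`, `likelihood` and `project` are placeholder callables, not an implementation from the paper.

```python
def bol_step(q_prev, likelihood, dataset, project):
    """One Bayesian online learning step (update + projection).

    q_prev:     previous approximate posterior, a callable theta -> density
    likelihood: callable (dataset, theta) -> p(dataset | theta)
    project:    maps an unnormalised density back into the parametric family
    """
    def unnormalised_posterior(theta):
        # update step, Eq. (8): multiply the new likelihood into the old approximation
        return likelihood(dataset, theta) * q_prev(theta)

    # projection step: e.g. KL minimisation (Opper, 1998) or a Laplace
    # approximation (Ritter et al., 2018a)
    return project(unnormalised_posterior)
```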
3.2 Laplace Approximation
We find that the Laplace approximation method provides a well-fitted meta-training framework for Bayesian online meta-learning in Eq. (5). Each updating step in the approximation procedure can be modified to correspond to the meta-parameters for few-shot classification, instead of the model parameters for large-scale supervised classification. The Laplace approximation rationalises the use of a Gaussian approximate posterior by Taylor-expanding the log-posterior around a mode up to the second order; the second-order term corresponds to the log-probability of a Gaussian distribution.
For large-scale supervised classification, we consider finding a MAP estimate following from Eq. (7):

$\theta_t^{\mathrm{MAP}} = \arg\max_{\theta}\; \log p(\theta \mid \mathcal{D}_{1:t})$ (9)

$= \arg\max_{\theta}\; \big[ \log p(\mathcal{D}_t \mid \theta) + \log p(\theta \mid \mathcal{D}_{1:t-1}) \big]$ (10)

Since the posterior of a neural network is intractable except for small architectures, the unnormalised posterior is considered instead. Performing a Taylor expansion of the logarithm of the unnormalised posterior $p^{*}$ around a mode $\mu_t$ gives

$\log p^{*}(\theta \mid \mathcal{D}_{1:t}) \approx \log p^{*}(\mu_t \mid \mathcal{D}_{1:t}) - \tfrac{1}{2}\, (\theta - \mu_t)^{\top} \Lambda_t\, (\theta - \mu_t)$ (11)

where $\Lambda_t$ denotes the Hessian matrix of the negative log-posterior evaluated at $\mu_t$ (the first-order term vanishes at a mode). The expansion in Eq. (11) suggests using a Gaussian approximate posterior $q_t(\theta) = \mathcal{N}(\theta; \mu_t, \Lambda_t^{-1})$. Given the parameters $(\mu_{t-1}, \Lambda_{t-1})$ of the previous approximation, a mean $\mu_t$ for step $t$ can be obtained by finding a mode of the approximate posterior via standard gradient-based optimisation:

$\mu_t = \arg\max_{\theta}\; \log p(\mathcal{D}_t \mid \theta) - \tfrac{1}{2}\, (\theta - \mu_{t-1})^{\top} \Lambda_{t-1}\, (\theta - \mu_{t-1})$ (12)

The precision matrix is updated as $\Lambda_t = \Lambda_{t-1} + H_t$, where $H_t$ is the Hessian matrix of the negative log-likelihood for $\mathcal{D}_t$ evaluated at $\mu_t$, with entries

$[H_t]_{mn} = -\dfrac{\partial^2 \log p(\mathcal{D}_t \mid \theta)}{\partial \theta_m\, \partial \theta_n}\Big|_{\theta = \mu_t}$ (13)
For a neural network model, gradient-based optimisation methods such as SGD (Robbins and Monro, 1951) and Adam (Kingma and Ba, 2015) are the standard choices for finding a mode for the Laplace approximation in Eq. (12). We show in Section 4 that this mode-seeking optimisation procedure provides a well-suited skeleton for implementing Bayesian online meta-learning in Eq. (5).
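As an illustration, the following Python sketch writes out the regularised objective of Eq. (12) and checks on a toy quadratic likelihood that the mode is pulled between the new data and the previous Gaussian mean; the function names are ours, not the authors'.

```python
import numpy as np

def neg_log_posterior(theta, neg_log_lik, mu_prev, precision_prev):
    """Objective minimised in Eq. (12): the new dataset's negative log-likelihood
    plus the quadratic penalty from the previous Gaussian approximation."""
    diff = theta - mu_prev
    return neg_log_lik(theta) + 0.5 * diff @ precision_prev @ diff

# toy check: a quadratic likelihood centred at 1 and a standard-normal prior term;
# the mode sits between the two, illustrating the anchoring effect
toy_nll = lambda th: np.sum((th - 1.0) ** 2)
mu_prev, precision_prev = np.zeros(2), np.eye(2)
grid = np.linspace(-1.0, 2.0, 301)
vals = [neg_log_posterior(np.array([g, g]), toy_nll, mu_prev, precision_prev) for g in grid]
print(grid[int(np.argmin(vals))])   # approximately 2/3
```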
3.3 Block-Diagonal Hessian
Since the full Hessian matrix in Eq. (13) is intractable for large neural networks, we seek an efficient and reasonably accurate approximation to it. Diagonal approximations (Denker and LeCun, 1991; Kirkpatrick et al., 2017) are memory- and computationally efficient, but sacrifice approximation accuracy as they ignore the interactions between parameters. Consider instead separating the Hessian matrix into blocks, where each diagonal block corresponds to the Hessian for one layer of the neural network. The block-diagonal Kronecker-factored approximation (Martens and Grosse, 2015; Grosse and Martens, 2016; Botev et al., 2017) utilises the fact that each diagonal block of the Hessian is Kronecker-factored for a single data point. This provides a better Hessian approximation, as it takes the parameter interactions within a layer into consideration.
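For a fully-connected layer, the two Kronecker factors can be estimated from the layer's inputs and pre-activation gradients, as in the hypothetical sketch below (following the factorisation of Martens and Grosse (2015); the function name and shapes are illustrative).

```python
import numpy as np

def kfac_factors(activations, pre_act_grads):
    """Kronecker factors of one fully-connected layer's Fisher block.

    activations:   (N, d_in)  layer inputs over N samples
    pre_act_grads: (N, d_out) log-likelihood gradients w.r.t. the layer's pre-activations
    The layer's Fisher block is approximated as kron(A, G), so only the two small
    factors need to be stored.
    """
    n = activations.shape[0]
    A = activations.T @ activations / n      # (d_in, d_in)
    G = pre_act_grads.T @ pre_act_grads / n  # (d_out, d_out)
    return A, G

# storing A and G costs d_in^2 + d_out^2 entries instead of (d_in * d_out)^2
A, G = kfac_factors(np.random.randn(128, 64), np.random.randn(128, 10))
print(A.shape, G.shape)   # (64, 64) (10, 10)
```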
Ritter et al. (2018a) use a hyperparameter $\lambda$ as a multiplier to the Hessian when updating the precision:

$\Lambda_t = \Lambda_{t-1} + \lambda H_t$ (14)
In the large-scale supervised classification setting, this hyperparameter has a regularising effect on the Gaussian posterior approximation, balancing good performance on a new dataset against maintaining performance on previous datasets (Ritter et al., 2018a). A large $\lambda$ results in a sharply peaked Gaussian posterior and is therefore unable to learn new datasets well, but can prevent forgetting previously learned datasets. A small $\lambda$, on the other hand, gives a dispersed Gaussian posterior and allows better performance on new datasets by sacrificing performance on the previous datasets.
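A one-line sketch of Eq. (14) makes the role of $\lambda$ explicit; the numbers below are purely illustrative.

```python
import numpy as np

def precision_update(precision_prev, hessian_new, lam):
    """Eq. (14): the new dataset's (approximate) Hessian enters the precision scaled
    by lambda. A large lambda sharpens the posterior around previous solutions
    (less forgetting, slower learning); a small lambda does the opposite."""
    return precision_prev + lam * hessian_new

prior = np.eye(2)
hess = np.diag([10.0, 0.1])
print(np.diag(precision_update(prior, hess, lam=1.0)))   # [11.   1.1]
print(np.diag(precision_update(prior, hess, lam=0.01)))  # [1.1   1.001]
```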
4 BOMLA Implementation
This section demonstrates how we arrive at the Bayesian Online Meta-Learning with Laplace Approximation (BOMLA) framework by applying the Laplace approximation to the posterior of the Bayesian online meta-learning framework in Eq. (5). BOMLA provides a grounded framework for online training on sequential few-shot classification datasets.
The Bayesian online meta-learning framework in Section 2.2 with a Gaussian approximate posterior of mean $\mu_{t-1}$ and precision $\Lambda_{t-1}$ from the Laplace approximation gives a MAP estimate

$\theta_t^{\mathrm{MAP}} = \arg\max_{\theta}\; \log \int p(Q_t \mid \phi_t)\, p(\phi_t \mid S_t, \theta)\, \mathrm{d}\phi_t \;-\; \tfrac{1}{2}\, (\theta - \mu_{t-1})^{\top} \Lambda_{t-1}\, (\theta - \mu_{t-1})$ (15)

Using the deterministic $p(\phi_t \mid S_t, \theta)$ in Eq. (6) and sampling $B$ tasks per iteration as in MAML for the optimisation in Eq. (15) leads to minimising the objective

$\sum_{i=1}^{B} \mathcal{L}\big(\phi_i; Q_{t,i}\big) \;+\; \tfrac{1}{2}\, (\theta - \mu_{t-1})^{\top} \Lambda_{t-1}\, (\theta - \mu_{t-1})$ (16)

where $\phi_i = \mathrm{SGD}_k(\theta, S_{t,i})$ for $i = 1, \ldots, B$, with $S_{t,i}$ and $Q_{t,i}$ the support and query sets of the $i$-th sampled task from $\mathcal{D}_t$. The first term of the objective in Eq. (16) corresponds to the MAML objective in Eq. (2) with a cross-entropy loss, and the second term can be seen as a regulariser.
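The objective of Eq. (16) can be sketched in PyTorch as follows, again assuming a functional `loss_fn(params, batch)`; for clarity the regulariser uses a dense precision matrix, whereas BOMLA stores it in block-diagonal Kronecker-factored form.

```python
import torch

def bomla_objective(theta, tasks, loss_fn, mu_prev, precision_prev,
                    inner_lr=0.4, n_steps=1):
    """Eq. (16): MAML query loss over the sampled tasks plus the quadratic
    regulariser that anchors theta to the previous Gaussian posterior."""
    meta_loss = 0.0
    for support, query in tasks:
        phi = [p.clone() for p in theta]
        for _ in range(n_steps):                      # inner-loop quick adaptation, Eq. (1)
            grads = torch.autograd.grad(loss_fn(phi, support), phi, create_graph=True)
            phi = [p - inner_lr * g for p, g in zip(phi, grads)]
        meta_loss = meta_loss + loss_fn(phi, query)   # query loss at adapted parameters
    flat = torch.cat([p.reshape(-1) for p in theta])
    diff = flat - mu_prev
    return meta_loss + 0.5 * diff @ precision_prev @ diff   # Laplace regulariser
```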
4.1 Hessian Approximation
The Hessian matrix corresponding to the first term of the BOMLA objective in Eq. (16) is

$H_t = -\nabla_{\theta}^{2}\, \sum_{i=1}^{B} \log p(Q_{t,i} \mid \phi_i)\, \Big|_{\theta = \mu_t}$ (17)
It is worth noting that the BOMLA Hessian deviates from the original BOL Hessian in Eq. (13). This requires deriving an adjusted approximation to the Hessian with some further assumptions.
The BOL Hessian in Eq. (13) for a single data point $(x, y)$ can be approximated using the Fisher information matrix to ensure positive semi-definiteness (Martens and Grosse, 2015):

$H \approx F = \mathbb{E}_{y \sim p(y \mid x, \theta)}\big[ \nabla_{\theta} \log p(y \mid x, \theta)\, \nabla_{\theta} \log p(y \mid x, \theta)^{\top} \big]$ (18)
In the BOMLA framework, each data pair $(x, y)$ for the Fisher is associated with a task $i$. The Fisher information matrix corresponding to the BOMLA Hessian in Eq. (17) is

$F_t = \mathbb{E}\Big[ \Big(\dfrac{\partial \phi_i}{\partial \theta}\Big)^{\!\top} \nabla_{\phi_i} \log p(y \mid x, \phi_i)\, \nabla_{\phi_i} \log p(y \mid x, \phi_i)^{\top}\, \dfrac{\partial \phi_i}{\partial \theta} \Big]$ (19)
The additional Jacobian matrix $\partial \phi_i / \partial \theta$ breaks the Kronecker-factored structure described by Martens and Grosse (2015) for the original Fisher in Eq. (18).
Since $\phi_i$ is a quick adaptation from $\theta$, $\phi_i$ remains relatively close to $\theta$. This leads to a Jacobian matrix that is approximately equal to the identity matrix. Imposing this assumption gives the approximation

$F_t \approx \mathbb{E}\big[ \nabla_{\phi_i} \log p(y \mid x, \phi_i)\, \nabla_{\phi_i} \log p(y \mid x, \phi_i)^{\top} \big]$ (20)
where the Kronecker-factored structure applies. The Fisher in Eq. (20) is an approximation to the Hessian adjusted from Eq. (17) for a single data point:

$-\nabla_{\theta}^{2} \log p(y \mid x, \phi_i) \approx \mathbb{E}\big[ \nabla_{\phi_i} \log p(y \mid x, \phi_i)\, \nabla_{\phi_i} \log p(y \mid x, \phi_i)^{\top} \big]$ (21)
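As a simplified illustration of Eq. (20), the sketch below accumulates a diagonal, empirical-Fisher estimate from query-set gradients taken at the adapted parameters $\phi_i$, relying on the identity-Jacobian assumption; BOMLA itself uses the richer block-diagonal Kronecker-factored estimator rather than a diagonal, and this code is ours, not the authors'.

```python
import torch

def fisher_diag_at_adapted_params(theta, tasks, loss_fn, inner_lr=0.4, n_steps=1):
    """Diagonal empirical-Fisher sketch: gradients of the query negative
    log-likelihood are taken at the adapted parameters phi_i and their squares
    accumulated, treating d(phi)/d(theta) as the identity."""
    fisher = [torch.zeros_like(p) for p in theta]
    for support, query in tasks:
        phi = [p.clone() for p in theta]
        for _ in range(n_steps):                  # quick adaptation on the support set
            grads = torch.autograd.grad(loss_fn(phi, support), phi, create_graph=True)
            phi = [p - inner_lr * g for p, g in zip(phi, grads)]
        grads = torch.autograd.grad(loss_fn(phi, query), phi)   # gradients at phi_i
        fisher = [f + g ** 2 for f, g in zip(fisher, grads)]
    return [f / len(tasks) for f in fisher]
```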
Algorithm 1 gives the pseudocode of the BOMLA framework. The algorithm consists of three main elements: meta-training (lines 4–11), updating the Gaussian mean (line 12) and updating the Gaussian precision matrix (lines 13–16) with the block-diagonal Kronecker-factored approximation (BDKFA) to the Fisher in Eq. (20).
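Since Algorithm 1 is not reproduced here, the following high-level Python sketch indicates how the three elements fit together; `maml_train` and `estimate_fisher` are assumed helper routines standing in for the meta-training loop of Eq. (16) and the BDKFA Fisher of Eq. (20), and the precision is kept dense only for readability.

```python
import torch

def bomla(datasets, theta, maml_train, estimate_fisher, lam=0.1):
    """Sketch of the BOMLA outer loop over sequentially arriving datasets.

    maml_train(theta, dataset, mu, precision) -> theta meta-trained on Eq. (16)
    estimate_fisher(theta, dataset)           -> Fisher approximation of Eq. (20)
    """
    d = sum(p.numel() for p in theta)
    mu = torch.cat([p.detach().reshape(-1) for p in theta])
    precision = torch.eye(d)                                    # Gaussian prior on theta
    for dataset in datasets:                                    # datasets arrive sequentially
        theta = maml_train(theta, dataset, mu, precision)       # meta-training (lines 4-11)
        mu = torch.cat([p.detach().reshape(-1) for p in theta]) # update Gaussian mean (line 12)
        precision = precision + lam * estimate_fisher(theta, dataset)  # update precision (lines 13-16)
    return theta, mu, precision
```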
4.2 Online-Within-Batch Setting
The BOMLA framework is developed to handle online meta-learning on sequential datasets. It requires all base classes within a dataset to be readily available in batch prior to meta-training. This is known as the online-within-batch setting (Denevi et al., 2019), as the datasets are sequential but the tasks and examples within a dataset are processed in batch.
Recent works (Javed and White, 2019; Denevi et al., 2019) show interest in the online setting where the tasks within a dataset and the examples in a task arrive sequentially for training. This is the online-within-online setting (Denevi et al., 2019), which requires a positive forward transfer (Lopez-Paz and Ranzato, 2017) on the sequential tasks or examples for a good overall performance on the dataset. A Bayesian framework in such a setting requires a powerful posterior approximation: the approximation has to be flexible enough to capture uncertainty about future knowledge for a performance improvement on sequential tasks, yet close enough to the true posterior to avoid a performance degradation due to the approximation deviating from previous experiences. This remains a difficult and unsolved problem in the Bayesian framework. Without such a posterior approximation, running Bayesian online meta-learning on sequential tasks within a dataset may not give a good few-shot classification performance.
This paper focuses on the online-within-batch framework (Denevi et al., 2019) to avoid negative backward transfer (Lopez-Paz and Ranzato, 2017), also known as catastrophic forgetting, on the sequential datasets. A future research direction would be a Bayesian online meta-learning framework with positive forward transfer ability that handles sequential tasks within a dataset while giving a good few-shot classification performance.
5 Related Work
Online meta-learning:
Our work is closely related to Denevi et al. (2019), Finn et al. (2019) and Zhuang et al. (2019) in terms of problem setting. The BOMLA framework corresponds to the online-within-batch setting (Denevi et al., 2019), where the datasets arrive sequentially but the tasks and examples within a dataset are in a batch. Recent online meta-learning methods (Finn et al., 2019; Zhuang et al., 2019) accumulate the data as they arrive and meta-learn using all data acquired so far. This is undesirable, as the algorithmic complexity of training grows with the number of datasets and the training time increases as new datasets arrive; the agent will eventually run out of memory for a long sequence of datasets. The BOMLA framework, on the other hand, only takes the posterior of the meta-parameters into consideration during optimisation, giving a framework whose algorithmic complexity is independent of the length of the dataset sequence. He et al. (2019) and Harrison et al. (2019) look into continual meta-learning for non-stationary task distributions where the task boundaries are unknown to the model. These works interpret continual meta-learning from both ends: from the meta-learning perspective, they attempt to accumulate knowledge at the meta-level, and from the continual learning perspective, they infer the tasks continually as the tasks arrive with unidentified task boundaries. The tasks of concern in these papers are primarily tasks within a single dataset, while our work focuses on the problem of continual adaptation for few-shot classification on a sequence of dissimilar datasets.
Offline meta-learning:
Previous meta-learning works attempt to solve few-shot classification problems in an offline setting, under the assumption of a stationary task distribution during meta-training and meta-testing. A single meta-learned model aims to few-shot classify one specific dataset, with all base classes of the dataset readily available in a batch for meta-training. Gradient-based meta-learning (Finn et al., 2017; Nichol et al., 2018; Rusu et al., 2019) updates the meta-parameters by accumulating the gradients of a meta-batch of task-specific inner-loop updates; the meta-parameters are then used as a model initialisation for a quick adaptation on the novel classes. The MAML algorithm can be cast as a probabilistic inference problem in a hierarchical Bayesian model (Grant et al., 2018), and Grant et al. (2018) also discuss the use of a Laplace approximation in the task-specific inner loop to improve MAML using curvature information. Metric-based meta-learning (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017) utilises the metric distance between labelled examples. Such methods assume that base and novel classes are from the same dataset distribution, so that the metric distance estimations generalise to the novel classes upon meta-learning the base classes.
Continual learning:
Modern continual learning works (Goodfellow et al., 2013; Lee et al., 2017; Zenke et al., 2017) focus primarily on large-scale supervised learning, in contrast to our work that looks into continual few-shot classification across sequential datasets. Wen et al. (2018) utilise few-shot learning to overcome catastrophic forgetting via logit matching on a small sample from the previous tasks. The online learning element of our paper is closely related to works that overcome catastrophic forgetting for large-scale supervised classification (Kirkpatrick et al., 2017; Zenke et al., 2017; Ritter et al., 2018a). In particular, our work builds on the online Laplace approximation method of Ritter et al. (2018a), which we extend to the meta-learning scenario to avoid forgetting in few-shot classification problems. Nguyen et al. (2018) provide an alternative that uses variational inference instead of the Laplace approximation for approximating the posterior; it would be reasonable to adapt such variational methods to approximate the posterior of the meta-parameters by adjusting the KL-divergence objective.
6 Experiments
Our experiments compare BOMLA to applying MAML continually on sequential datasets in terms of the ability to overcome catastrophic forgetting in various few-shot classification settings. We begin with an artificially generated toy dataset sequence, Rainbow Omniglot, and proceed to a more challenging dataset sequence. In both experiments we consider the 1-shot 5-way few-shot classification setting. The model architecture and other few-shot details can be found in Appendix A.
6.1 Rainbow Omniglot
The artificial dataset sequence Rainbow Omniglot is generated analogously to Rainbow MNIST (Finn et al., 2019). We run the experiment on a sequence of 10 different datasets randomly chosen from Rainbow Omniglot. The details for generating Rainbow Omniglot with the 10 chosen datasets, and the implementation details of this experiment, can be found in Appendix B.1.
Some datasets in the sequence are harder and more dissimilar than the others, particularly the ones with rescaled images, as the characters lose pixels in their strokes due to rescaling. Figure 1 shows that MAML exhibits high fluctuation in performance, especially when transitioning between rescaled datasets and datasets with the original scaling. When MAML encounters a more dissimilar dataset, the meta-parameters diverge from previous experiences to optimise performance on the current dataset. BOMLA, on the other hand, has a relatively stable performance across dissimilar datasets, as it takes previous experiences into account when obtaining a posterior of the meta-parameters with the new dataset included.
6.2 A More Challenging Few-Shot Classification Dataset Sequence
We apply BOMLA to a more challenging few-shot classification dataset sequence:
Omniglot:
The Omniglot dataset (Lake et al., 2011) comprises 1623 characters from 50 alphabets, and each character has 20 instances. New classes are formed by rotations in multiples of 90° after splitting the classes for meta-training and meta-evaluation.
miniQuickDraw:
The miniQuickDraw dataset is formed by randomly sampling 100 classes and 600 instances in each class from the QuickDraw dataset (Ha and Eck, 2017). The QuickDraw dataset comprises 345 categories of drawings collected from players of the game "Quick, Draw!".
CIFAR-FS:
The CIFAR-FS dataset (Bertinetto et al., 2019) is a variation of CIFAR-100 (Krizhevsky, 2009) for few-shot classification, with 100 classes of objects, each comprising 600 images of size 32×32. We rescale the images for this experiment.
Figure 2 shows some image instances from each dataset in the sequence. The implementation details of this experiment and the 100 classes for miniQuickDraw can be found in Appendix B.2. The transition from Omniglot to miniQuickDraw is less drastic than the transition from miniQuickDraw to CIFAR-FS. When transitioning from Omniglot to miniQuickDraw, we observe only a small amount of forgetting for both BOMLA and MAML in Figure 3. Continuing the MAML run to the subsequent transition from miniQuickDraw to CIFAR-FS gives a noticeable drop in the few-shot performance on Omniglot and miniQuickDraw. The result in Figure 3 shows that BOMLA is able to prevent catastrophic forgetting in both transitions: BOMLA proceeds with learning miniQuickDraw with almost no forgetting on Omniglot, and there is only a small trade-off in the few-shot performance on CIFAR-FS as BOMLA avoids catastrophically forgetting Omniglot and miniQuickDraw.
Tuning the hyperparameter $\lambda$ in Eq. (14) corresponds to balancing a smaller performance trade-off on a new dataset against less forgetting on previous datasets. The value $\lambda = 1$ results in a more concentrated Gaussian posterior and is therefore unable to learn new datasets well, but better retains the performance on previously learned datasets. The value $\lambda = 0.01$, on the other hand, gives a widespread Gaussian posterior and learns better on new datasets by sacrificing performance on the previous datasets. The results show that an intermediate value of $\lambda$ gives the best balance between old and new datasets. For the best performance, we can set different values of $\lambda$ for different datasets; the best combination in this experiment uses separate values for miniQuickDraw and CIFAR-FS.
7 Conclusion
We introduced the Bayesian Online Meta-Learning with Laplace Approximation (BOMLA) framework to overcome catastrophic forgetting in few-shot classification problems. BOMLA merges the BOL framework and the MAML algorithm via an adjusted objective obtained from the log-posterior, with the first term corresponding to the MAML objective and the second term acting as a regulariser. We proposed the necessary adjustments in the Hessian and Fisher approximations, as we optimise the meta-parameters for few-shot classification instead of the usual model parameters in large-scale supervised classification. The experiments show that BOMLA retains the few-shot classification ability when trained on sequential datasets, resulting in the ability to perform few-shot classification on multiple datasets with a single meta-learned model.
References
Bertinetto et al. (2019) L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi. Meta-Learning with Differentiable Closed-Form Solvers. In International Conference on Learning Representations, 2019.

Botev et al. (2017) A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton Optimisation for Deep Learning. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Denevi et al. (2019) G. Denevi, D. Stamos, C. Ciliberto, and M. Pontil. Online-Within-Online Meta-Learning. In Advances in Neural Information Processing Systems 32, 2019.

Denker and LeCun (1991) J. S. Denker and Y. LeCun. Transforming Neural-Net Output Levels to Probability Distributions. In Advances in Neural Information Processing Systems 3, 1991.
Dong et al. (2015) D. Dong, H. Wu, W. He, D. Yu, and H. Wang. Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.
Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
Finn et al. (2019) C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online Meta-Learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
Goodfellow et al. (2013) I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. arXiv preprint, arXiv:1312.6211, 2013.
Grant et al. (2018) E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting Gradient-Based Meta-Learning as Hierarchical Bayes. In International Conference on Learning Representations, 2018.
Grosse and Martens (2016) R. Grosse and J. Martens. A Kronecker-Factored Approximate Fisher Matrix for Convolution Layers. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
 Ha and Eck (2017) D. Ha and D. Eck. A Neural Representation of Sketch Drawings. arXiv preprint, arXiv:1704.03477, 2017.
Harrison et al. (2019) J. Harrison, A. Sharma, C. Finn, and M. Pavone. Continuous Meta-Learning without Tasks. arXiv preprint, arXiv:1912.08866, 2019.
 He et al. (2019) X. He, J. Sygnowski, A. Galashov, A. A. Rusu, Y. Teh, and R. Pascanu. Task Agnostic Continual Learning via Meta Learning. arXiv preprint, arXiv:1906.05201, 2019.
Javed and White (2019) K. Javed and M. White. Meta-Learning Representations for Continual Learning. In Advances in Neural Information Processing Systems 32, 2019.

Johnson et al. (2017) M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 2017.
Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
 Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. GrabskaBarwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, 2017.
Koch et al. (2015) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese Neural Networks for One-Shot Image Recognition. In 32nd International Conference on Machine Learning Deep Learning Workshop, 2015.
 Krizhevsky (2009) A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
Kurata and Audhkhasi (2019) G. Kurata and K. Audhkhasi. Multi-Task CTC Training with Auxiliary Feature Reconstruction for End-to-End Speech Recognition. In 20th Annual Conference of the International Speech Communication Association, 2019.
 Lake et al. (2011) B. Lake, R. Salakhutdinov, J. Gross, and J.B. Tenenbaum. One Shot Learning of Simple Visual Concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.

Lee et al. (2017) S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang. Overcoming Catastrophic Forgetting by Incremental Moment Matching. In Advances in Neural Information Processing Systems 30, 2017.
Li et al. (2004) F. Li, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2004.
Li et al. (2006) F. Li, R. Fergus, and P. Perona. One-Shot Learning of Object Categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
Li et al. (2017) Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. arXiv preprint, arXiv:1707.09835, 2017.
Lopez-Paz and Ranzato (2017) D. Lopez-Paz and M. A. Ranzato. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems 30, 2017.

MacKay (1992) D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 1992.
Martens and Grosse (2015) J. Martens and R. Grosse. Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
 Miller et al. (2000) E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from One Example Through Shared Densities on Transforms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
 Nguyen et al. (2018) C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational Continual Learning. In International Conference on Learning Representations, 2018.
Nichol et al. (2018) A. Nichol, J. Achiam, and J. Schulman. On First-Order Meta-Learning Algorithms. arXiv preprint, arXiv:1803.02999, 2018.
Nikolaidis et al. (2015) S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. Shah. Efficient Model Learning from Joint-Action Demonstrations for Human-Robot Collaborative Tasks. In Proceedings of the 10th International Conference on Human-Robot Interaction, 2015.
Nikolaidis et al. (2016) S. Nikolaidis, A. Kuznetsov, D. Hsu, and S. Srinivasa. Formalizing Human-Robot Mutual Adaptation: A Bounded Memory Model. In Proceedings of the 11th International Conference on Human-Robot Interaction, 2016.
 Opper (1998) M. Opper. A Bayesian Approach to Online Learning. Cambridge University Press, 1998.
Ravi and Larochelle (2017) S. Ravi and H. Larochelle. Optimization as a Model for Few-Shot Learning. In International Conference on Learning Representations, 2017.
 Ritter et al. (2018a) H. Ritter, A. Botev, and D. Barber. Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting. In Advances in Neural Information Processing Systems 31, 2018a.
 Ritter et al. (2018b) H. Ritter, A. Botev, and D. Barber. A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations, 2018b.
 Robbins and Monro (1951) H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 1951.
Rusu et al. (2019) A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-Learning with Latent Embedding Optimization. In International Conference on Learning Representations, 2019.
Santoro et al. (2016) A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
Schmidhuber (1987) J. Schmidhuber. Evolutionary Principles in Self-Referential Learning. On Learning How to Learn: The Meta-Meta-Meta...Hook. Diploma thesis, Institut für Informatik, Technische Universität München, 1987.
 Seltzer and Droppo (2013) M. L. Seltzer and J. Droppo. Multitask Learning in Deep Neural Networks for Improved Phoneme Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
Snell et al. (2017) J. Snell, K. Swersky, and R. Zemel. Prototypical Networks for Few-Shot Learning. In Advances in Neural Information Processing Systems 30, 2017.
Sung et al. (2018) F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 Thrun and Pratt (1998) S. Thrun and L. Pratt. Learning to Learn: Introduction and Overview. Springer, Boston, MA, 1998.
 Vinyals et al. (2016) O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29, 2016.
Wen et al. (2018) J. Wen, Y. Cao, and R. Huang. Few-Shot Self Reminder to Overcome Catastrophic Forgetting. arXiv preprint, arXiv:1812.00543, 2018.
Yuan and Yan (2010) X. Yuan and S. Yan. Visual Classification with Multi-Task Joint Sparse Representation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
 Zenke et al. (2017) F. Zenke, B. Poole, and S. Ganguli. Continual Learning through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 Zhang et al. (2014) Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial Landmark Detection by Deep Multitask Learning. In Proceedings of the 13th European Conference on Computer Vision, 2014.
Zhuang et al. (2019) Z. Zhuang, Y. Wang, K. Yu, and S. Lu. Online Meta-Learning on Non-Convex Setting. arXiv preprint, arXiv:1910.10196, 2019.
Appendix A Few-Shot Details
We average over 100 tasks sampled from the novel classes when reporting the meta-evaluation accuracy in all of the experiments. Each task in both of the 1-shot 5-way experiments consists of a support set with 1 sample per class and a query set with 15 samples per class. We use the model architecture proposed by Vinyals et al. (2016), which stacks 4 modules of 64 filters of size 3×3, each followed by batch normalisation, a ReLU activation and 2×2 max-pooling. A fully-connected layer is appended to the final module before obtaining the class probabilities with softmax.
Appendix B Implementation Details
B.1 Rainbow Omniglot
Analogous to Rainbow MNIST (Finn et al., 2019), our Rainbow Omniglot sequence is generated by transforming the character images from Omniglot in the following ways: scaling, rotating and changing the background colour of the images. To generate one of the sequential datasets, we apply a combined transformation formed by randomly selecting a scaling (to a reduced size or the original size), a rotation degree (0°, 90°, 180° or 270°) and a background colour out of seven different colours. Rainbow MNIST scales the images to either half size or the original size, but we find the factor of half to be too small to generate reasonably interpretable images in Omniglot.
The sequence of the 10 Rainbow Omniglot datasets in this experiment is as follows:
1. Original scale, 0° rotation and red background
2. Original scale, 270° rotation and blue background
3. Reduced scale, 0° rotation and red background
4. Reduced scale, 270° rotation and green background
5. Reduced scale, 180° rotation and red background
6. Original scale, 180° rotation and white background
7. Original scale, 0° rotation and indigo background
8. Reduced scale, 270° rotation and cyan background
9. Original scale, 0° rotation and yellow background
10. Original scale, 270° rotation and green background
Each dataset is meta-trained for 3000 iterations using Adam with learning rate 0.01 and meta-batch size 32 for the outer-loop optimisation. We perform a one-step SGD adaptation with learning rate 0.4 for the inner-loop update on each task. We sample 1000 tasks to approximate the Hessian when updating the Gaussian precision matrix.
B.2 More Challenging Experiment
For each of the Omniglot, miniQuickDraw and CIFAR-FS datasets, we update the meta-parameters in the outer loop using Adam with learning rate 0.001 and meta-batch size 32. Similar to Rainbow Omniglot, the inner loop for Omniglot in this sequence takes a one-step SGD with learning rate 0.4. The miniQuickDraw dataset uses a three-step SGD with learning rate 0.2 as the inner task-specific update. For CIFAR-FS, we perform a five-step SGD with learning rate 0.1 for the inner-loop update. For each dataset, we sample 2000 tasks to approximate the Hessian when updating the Gaussian precision matrix.
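For reference, the per-dataset settings above can be summarised as a small configuration; the variable names below are chosen here for illustration only.

```python
# Per-dataset inner-loop settings quoted above; the outer loop uses Adam with
# learning rate 0.001 and meta-batch size 32 for every dataset in the sequence.
inner_loop_config = {
    "Omniglot":      {"sgd_steps": 1, "inner_lr": 0.4},
    "miniQuickDraw": {"sgd_steps": 3, "inner_lr": 0.2},
    "CIFAR-FS":      {"sgd_steps": 5, "inner_lr": 0.1},
}
hessian_tasks_per_dataset = 2000   # tasks sampled to approximate the Hessian
```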
Below are the 100 classes in the miniQuickDraw dataset: ’mailbox’, ’whale’, ’peanut’, ’vase’, ’octagon’, ’dumbbell’, ’hockey puck’, ’chandelier’, ’ocean’, ’tennis racquet’, ’bush’, ’potato’, ’tent’, ’lobster’, ’pool’, ’squirrel’, ’megaphone’, ’bucket’, ’golf club’, ’jacket’, ’computer’, ’keyboard’, ’basket’, ’underwear’, ’asparagus’, ’cactus’, ’arm’, ’oven’, ’elephant’, ’moon’, ’giraffe’, ’couch’, ’clock’, ’suitcase’, ’snowflake’, ’scorpion’, ’skyscraper’, ’paint can’, ’dragon’, ’windmill’, ’skateboard’, ’fish’, ’wristwatch’, ’calculator’, ’cat’, ’hammer’, ’sheep’, ’necklace’, ’bear’, ’anvil’, ’bulldozer’, ’scissors’, ’skull’, ’syringe’, ’zebra’, ’helmet’, ’bench’, ’harp’, ’river’, ’monkey’, ’bread’, ’donut’, ’train’, ’flamingo’, ’drill’, ’peas’, ’shorts’, ’book’, ’mushroom’, ’brain’, ’fireplace’, ’tshirt’, ’horse’, ’cell phone’, ’hexagon’, ’zigzag’, ’strawberry’, ’sock’, ’rainbow’, ’crocodile’, ’tree’, ’bird’, ’spreadsheet’, ’teddybear’, ’The Mona Lisa’, ’bracelet’, ’flying saucer’, ’tractor’, ’bathtub’, ’cruise ship’, ’car’, ’parachute’, ’grass’, ’guitar’, ’The Eiffel Tower’, ’ear’, ’drums’, ’circle’, ’compass’, ’bandage’