Bayesian Online Meta-Learning with Laplace Approximation

04/30/2020 ∙ by Pau Ching Yap, et al. ∙ UCL

Neural networks are known to suffer from catastrophic forgetting when trained on sequential datasets. While there have been numerous attempts to solve this problem for large-scale supervised classification, little has been done to overcome catastrophic forgetting for few-shot classification problems. We demonstrate that the popular gradient-based few-shot meta-learning algorithm Model-Agnostic Meta-Learning (MAML) indeed suffers from catastrophic forgetting and introduce a Bayesian online meta-learning framework that tackles this problem. Our framework incorporates MAML into a Bayesian online learning algorithm with Laplace approximation. This framework enables few-shot classification on a range of sequentially arriving datasets with a single meta-learned model. The experimental evaluations demonstrate that our framework can effectively prevent forgetting in various few-shot classification settings compared to applying MAML sequentially.

1 Introduction

Image classification models and algorithms often require an enormous amount of labelled examples for training to achieve state-of-the-art performance. Labelled examples can be expensive and time-consuming to acquire. Human visual systems, on the other hand, are able to recognise new classes after being shown a few labelled examples. Few-shot classification (Miller et al., 2000; Li et al., 2004, 2006; Lake et al., 2011) tackles this issue by learning to adapt to unseen classes (known as novel classes) with very few labelled examples from each class. Recent works show that meta-learning provides promising approaches to few-shot classification problems (Santoro et al., 2016; Finn et al., 2017; Li et al., 2017; Ravi and Larochelle, 2017; Sung et al., 2018). Meta-learning or learning-to-learn (Schmidhuber, 1987; Thrun and Pratt, 1998) takes the learning process a level deeper – instead of learning from the labelled examples in the training classes (known as base classes), meta-learning learns the example-learning process. The training process in meta-learning that utilises the base classes is called the meta-training stage, and the evaluation process that reports the few-shot performance on the novel classes is known as the meta-testing or meta-evaluation stage.

Despite being a promising solution to few-shot classification problems, meta-learning methods suffer from a limitation: a meta-learned model loses its few-shot classification ability on previous datasets as new ones arrive for meta-training. Popular few-shot classification datasets include Omniglot (Lake et al., 2011), CIFAR-FS (Bertinetto et al., 2019) and miniImageNet (Vinyals et al., 2016). A meta-learned model is restricted to few-shot classification on a specific dataset, in the sense that the base and novel classes have to originate from the same dataset distribution. The current practice for few-shot classification of novel classes from different datasets is to meta-learn a separate model for each dataset (Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018; Bertinetto et al., 2019). This paper considers meta-learning a single model for few-shot classification on multiple datasets that arrive sequentially for meta-training.

Creating an agent that performs well on multiple tasks has long been an important goal, as this is a major step towards artificial general intelligence. Recent works demonstrate the importance of excelling at multiple tasks in domains such as vision (Yuan and Yan, 2010; Zhang et al., 2014), speech (Seltzer and Droppo, 2013; Kurata and Audhkhasi, 2019) and translation (Dong et al., 2015; Johnson et al., 2017). In real-life applications, an agent often encounters new tasks and knowledge in a sequential manner. The goal is to create an agent that accumulates knowledge and experience over time in a similar fashion to how humans learn. Recent advances in human-robot interaction (Nikolaidis et al., 2015, 2016) develop adaptation models that allow robots to adapt to human behaviours over time and react accordingly to perform well on human-robot collaborative tasks. They demonstrate the importance, in various robotic applications, of processing new information over time and choosing an optimal action based on both the new knowledge and previous experiences. We extend this goal to the few-shot classification problem and seek an agent that can perform few-shot classification across multiple datasets and continuously incorporate knowledge acquired from newly-arrived datasets.

In this paper, we introduce a Bayesian online meta-learning framework to train a model that is applicable to a broader scope of few-shot classification datasets by overcoming catastrophic forgetting. To the best of our knowledge, this is the first attempt to prevent catastrophic forgetting for few-shot classification problems using this newly-introduced framework. We extend the Bayesian online learning (BOL) framework (Opper, 1998) to a Bayesian online meta-learning framework using the Model-Agnostic Meta-Learning (MAML) algorithm (Finn et al., 2017). MAML finds a good model parameter initialisation (called meta-parameters) that can quickly adapt to novel classes using very few labelled examples, while BOL provides a principled framework for finding the posterior of the model parameters. Our framework aims to combine both BOL and MAML to find the posterior of the meta-parameters. This paper builds on (Ritter et al., 2018a) which combines the BOL framework and Laplace approximation with block-diagonal Kronecker-factored Fisher approximation to overcome catastrophic forgetting in large-scale supervised classification.

Below are the contributions we make in this paper:

  • We develop a Bayesian online meta-learning algorithm called Bayesian Online Meta-learning with Laplace Approximation (BOMLA) for sequential few-shot classification problems.

  • We propose a simple approximation to the Fisher corresponding to the BOMLA framework that carries over the desirable block-diagonal Kronecker-factored structure from the Fisher approximation in the non-meta-learning setting.

  • We demonstrate that BOMLA can overcome catastrophic forgetting in comparison to running MAML sequentially in various few-shot classification settings.

2 Meta-Learning

Most meta-learning algorithms comprise an inner loop for example-learning and an outer loop that learns the example-learning process. Such algorithms often require sampling a batch of tasks at each iteration, where a task is formed by sampling a subset of classes from the pool of base classes or novel classes during meta-training or meta-evaluation respectively.

An offline meta-learning algorithm learns a few-shot classification model only for a specific dataset $\mathfrak{D}_t$. For notational convenience, we drop the subscript $t$ in this section, as there is only one dataset involved in offline meta-learning. The dataset $\mathfrak{D}$ is divided into the set of base classes $\mathcal{D}$ and novel classes $\widehat{\mathcal{D}}$ for meta-training and meta-evaluation respectively. Upon completing meta-training on the base class set $\mathcal{D}$, the goal of few-shot classification is to perform well on an unseen task $\mathcal{D}^{*}$ sampled from the novel class set $\widehat{\mathcal{D}}$ after a quick adaptation on a small subset $\mathcal{D}^{*,S}$ (known as the support set) of $\mathcal{D}^{*}$. The performance of this unseen task is evaluated on the query set $\mathcal{D}^{*,Q}$, where $\mathcal{D}^{*,Q} = \mathcal{D}^{*} \setminus \mathcal{D}^{*,S}$. Since $\widehat{\mathcal{D}}$ is not accessible during meta-training, this support-query split is mimicked on the base class set $\mathcal{D}$ for meta-training.

2.1 Model-Agnostic Meta-Learning

In this paper, we are interested in the well-known meta-learning algorithm MAML (Finn et al., 2017). Each updating step of MAML aims to improve the ability of the meta-parameters to act as a good model initialisation for a quick adaptation on unseen tasks.

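Each iteration of the MAML algorithm samples $M$ tasks from the base class set $\mathcal{D}$ and runs a few steps of stochastic gradient descent (SGD) for an inner loop task-specific learning. The number of tasks sampled per iteration is known as the meta-batch size. For task $m$, the inner loop outputs the task-specific parameters $\tilde{\theta}^{m}$ from a $k$-step SGD quick adaptation on the objective $\mathcal{L}(\theta, \mathcal{D}^{m,S})$ with the support set $\mathcal{D}^{m,S}$ and initialised at $\theta$:

$$\tilde{\theta}^{m} = \mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{m,S})\big) \tag{1}$$

where $m = 1, \ldots, M$. The outer loop gathers all task-specific adaptations to update the meta-parameters $\theta$ using the loss $\mathcal{L}(\tilde{\theta}^{m}, \mathcal{D}^{m,Q})$ on the query set $\mathcal{D}^{m,Q}$.

The overall MAML optimisation objective is

$$\arg\min_{\theta} \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}\Big(\mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{m,S})\big), \mathcal{D}^{m,Q}\Big) \tag{2}$$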
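The inner and outer loops described above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes scalar linear-regression tasks with a squared-error loss and a single inner SGD step, so that the derivative of the adapted parameter with respect to the meta-parameter is available in closed form. All function names and hyperparameter values are hypothetical.

```python
import numpy as np

def loss(theta, x, y):
    # Mean squared error of the scalar model y = theta * x.
    return np.mean((theta * x - y) ** 2)

def grad(theta, x, y):
    # Derivative of the mean squared error with respect to theta.
    return 2.0 * np.mean(x * (theta * x - y))

def inner_update(theta, x_s, y_s, alpha):
    # One SGD step on the support set (k = 1).
    return theta - alpha * grad(theta, x_s, y_s)

def maml_step(theta, tasks, alpha, beta):
    # Outer update: average the query-set gradients, differentiated
    # through the inner step via the chain rule.
    meta_grad = 0.0
    for x_s, y_s, x_q, y_q in tasks:
        theta_m = inner_update(theta, x_s, y_s, alpha)
        jac = 1.0 - alpha * 2.0 * np.mean(x_s ** 2)  # d theta_m / d theta
        meta_grad += grad(theta_m, x_q, y_q) * jac
    return theta - beta * meta_grad / len(tasks)

def sample_task(rng, n_support=5, n_query=15):
    # A task is a regression problem with its own slope (toy assumption).
    slope = rng.uniform(1.0, 3.0)
    x = rng.normal(size=n_support + n_query)
    y = slope * x
    return x[:n_support], y[:n_support], x[n_support:], y[n_support:]

def adapted_query_loss(theta, tasks, alpha):
    # Average query loss after quick adaptation from theta.
    return np.mean([loss(inner_update(theta, xs, ys, alpha), xq, yq)
                    for xs, ys, xq, yq in tasks])

rng = np.random.default_rng(0)
theta = -5.0
eval_tasks = [sample_task(rng) for _ in range(200)]
before = adapted_query_loss(theta, eval_tasks, alpha=0.1)
for _ in range(500):
    batch = [sample_task(rng) for _ in range(4)]  # meta-batch size M = 4
    theta = maml_step(theta, batch, alpha=0.1, beta=0.05)
after = adapted_query_loss(theta, eval_tasks, alpha=0.1)
```

In this toy setting the meta-parameter moves towards the range of task slopes, so a single inner step from it gives a lower query loss than the same step from the initial value.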

Like most meta-learning algorithms, MAML assumes a stationary task distribution during meta-training and meta-evaluation. Under this assumption, a meta-learned model is only applicable to a specific dataset distribution. When the model encounters a sequence of datasets, it loses the few-shot classification ability on previous datasets as new ones arrive for meta-training. Our work aims to meta-learn a single model for few-shot classification on multiple datasets that arrive sequentially for meta-training. We achieve this goal by incorporating MAML into the BOL framework to give a Bayesian online meta-learning framework that finds the posterior of the meta-parameters.

2.2 Overview of Our Bayesian Online Meta-Learning Approach

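Our central contribution is to extend the benefits of meta-learning to the Bayesian online scenario, thereby training models that can generalise across tasks whilst dealing with parameter uncertainty in the setting of sequentially arriving datasets. Online meta-training occurs sequentially on the datasets $\mathfrak{D}_1, \ldots, \mathfrak{D}_T$. A newly-arrived $\mathfrak{D}_{t+1}$ is separated into the base class set $\mathcal{D}_{t+1}$ and novel class set $\widehat{\mathcal{D}}_{t+1}$ for meta-training and meta-evaluation respectively. Notationally, let $\mathcal{D}^{S}_{t+1}$ and $\mathcal{D}^{Q}_{t+1}$ denote the collection of support sets and query sets respectively from all possible tasks in the base class set $\mathcal{D}_{t+1}$, so that $\mathcal{D}_{t+1} = \mathcal{D}^{S}_{t+1} \cup \mathcal{D}^{Q}_{t+1}$. We are interested in a MAP estimate $\theta^{*} = \arg\max_{\theta} p(\theta \mid \mathcal{D}_{1:t+1})$. Using Bayes’ rule on the posterior gives the recursive formula

$$p(\theta \mid \mathcal{D}_{1:t+1}) \propto p(\mathcal{D}_{t+1} \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t}) \tag{3}$$
$$= \left\{ \int p(\mathcal{D}^{Q}_{t+1} \mid \tilde{\theta})\, p(\tilde{\theta} \mid \theta, \mathcal{D}^{S}_{t+1})\, d\tilde{\theta} \right\} p(\mathcal{D}^{S}_{t+1} \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t}) \tag{4}$$
$$\approx \left\{ \int p(\mathcal{D}^{Q}_{t+1} \mid \tilde{\theta})\, p(\tilde{\theta} \mid \theta, \mathcal{D}^{S}_{t+1})\, d\tilde{\theta} \right\} p(\theta \mid \mathcal{D}_{1:t}) \tag{5}$$

where Eq. (3) follows from the assumption that each dataset is independent given $\theta$, Eq. (4) factorises the likelihood over the support and query collections and introduces the parameters $\tilde{\theta}$, and Eq. (5) follows by dropping the likelihood term $p(\mathcal{D}^{S}_{t+1} \mid \theta)$.

From the meta-learning perspective, the parameters $\tilde{\theta}$ introduced in Eq. (5) can be viewed as the task-specific parameters in MAML. There are various choices for the distribution $p(\tilde{\theta} \mid \theta, \mathcal{D}^{S}_{t+1})$ in Eq. (5). In particular, if we set it as the deterministic function of taking several steps of SGD on the loss $\mathcal{L}$ with the support set collection $\mathcal{D}^{S}_{t+1}$ and initialised at $\theta$, we have

$$p(\tilde{\theta} \mid \theta, \mathcal{D}^{S}_{t+1}) = \delta\Big(\tilde{\theta} - \mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{S}_{t+1})\big)\Big) \tag{6}$$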

and this recovers the MAML inner loop with SGD quick adaptation in Eq. (1). The recursion given by Eq. (5) forms the basis of our approach and the remainder of this paper explains how we implement this. In order to do so we give a mini tutorial in the following section on Bayesian Online Learning.
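To make the step explicit, the following short derivation shows how a deterministic (delta) choice for the inner distribution collapses the integral in the recursion. The symbols are assumptions chosen for illustration: $\theta$ for the meta-parameters, $\tilde{\theta}$ for the task-specific parameters, $\mathcal{D}^{S}_{t+1}$ and $\mathcal{D}^{Q}_{t+1}$ for the support and query collections, and $\mathrm{SGD}_k$ for $k$ steps of SGD.

```latex
\begin{align*}
\int p(\mathcal{D}^{Q}_{t+1} \mid \tilde{\theta})\,
     \delta\Big(\tilde{\theta} - \mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{S}_{t+1})\big)\Big)\, d\tilde{\theta}
  &= p\Big(\mathcal{D}^{Q}_{t+1} \,\Big|\, \mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{S}_{t+1})\big)\Big), \\
\log p(\theta \mid \mathcal{D}_{1:t+1})
  &\approx \log p\Big(\mathcal{D}^{Q}_{t+1} \,\Big|\, \mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{S}_{t+1})\big)\Big)
     + \log p(\theta \mid \mathcal{D}_{1:t}) + \text{const}.
\end{align*}
```

The first term is a query-set log-likelihood evaluated at the adapted parameters, which is the MAML outer objective up to sign, and the second term carries the knowledge from the previous datasets.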

3 Background

This section provides a background explanation of using BOL to find the posterior of the model parameters and overcome catastrophic forgetting for large-scale supervised classification. We will then apply this approach to our recursion in Eq. (5). The posterior is typically intractable due to the enormous size of modern neural network architectures. This leads to the requirement for a good approximation of the posterior of the meta-parameters. A particularly suitable candidate for this purpose in meta-learning is the Laplace approximation (MacKay, 1992; Ritter et al., 2018b), as it simply adds a quadratic regulariser to the training objective. Its approximate posterior belongs to the Gaussian family and it requires computing the Hessian of the log-posterior for the precision matrix. This is computationally prohibitive for large neural networks. The block-diagonal Kronecker-factored Fisher approximation takes into account the parameter interactions within a layer of a neural network and provides an efficient approximation for the Hessian (Martens and Grosse, 2015; Grosse and Martens, 2016; Botev et al., 2017). We extend the BOL with Laplace approximation framework for large-scale supervised classification (Ritter et al., 2018a) into a Bayesian online meta-learning framework that prevents forgetting in few-shot classification problems.

3.1 Bayesian Online Learning

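Upon the arrival of the $(t+1)$-th dataset $\mathcal{D}_{t+1}$ for large-scale supervised classification, we are interested in a MAP estimate $\theta^{*} = \arg\max_{\theta} p(\theta \mid \mathcal{D}_{1:t+1})$ for the parameters $\theta$ of a neural network. Using Bayes’ rule on the posterior gives the recursive formula

$$p(\theta \mid \mathcal{D}_{1:t+1}) \propto p(\mathcal{D}_{t+1} \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t}) \tag{7}$$

where Eq. (7) follows from the assumption that each dataset is independent given $\theta$. As the normalised posterior $p(\theta \mid \mathcal{D}_{1:t})$ is usually intractable, it may be approximated by a parametric distribution $q$ with parameter $\phi_t$. The BOL framework consists of the update step and the projection step (Opper, 1998). The update step uses the approximate posterior $q(\theta \mid \phi_t)$ obtained from the previous step for an update in the form of Eq. (7):

$$p(\theta \mid \mathcal{D}_{1:t+1}) \approx \frac{p(\mathcal{D}_{t+1} \mid \theta)\, q(\theta \mid \phi_t)}{\int p(\mathcal{D}_{t+1} \mid \theta')\, q(\theta' \mid \phi_t)\, d\theta'} \tag{8}$$

The new posterior $p(\theta \mid \mathcal{D}_{1:t+1})$ might not belong to the same parametric family as $q(\theta \mid \phi_t)$. In this case, the new posterior has to be projected into the same parametric family to obtain $q(\theta \mid \phi_{t+1})$. Opper (1998) performs this projection by minimising the KL-divergence between the new posterior and the parametric $q$, while Ritter et al. (2018a) use the Laplace approximation instead.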
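As a small illustration of the update step, the sketch below uses a conjugate toy model in which no projection is needed: a Gaussian likelihood with known noise variance and a Gaussian approximate posterior over a scalar mean. This model is an assumption made purely for illustration and is not the neural network setting of the paper. In this case the recursion is exact, so processing datasets one at a time gives the same posterior as processing them together.

```python
import numpy as np

def online_update(mean, precision, data, noise_var):
    # Posterior over a scalar mean after observing `data`, starting from
    # the Gaussian N(mean, 1 / precision). Conjugacy keeps it Gaussian.
    n = len(data)
    new_precision = precision + n / noise_var
    new_mean = (precision * mean + np.sum(data) / noise_var) / new_precision
    return new_mean, new_precision

rng = np.random.default_rng(0)
datasets = [rng.normal(loc=2.0, scale=1.0, size=20) for _ in range(3)]

# Sequential (online) processing: yesterday's posterior is today's prior.
mean, precision = 0.0, 1.0
for d in datasets:
    mean, precision = online_update(mean, precision, d, noise_var=1.0)

# Batch processing of all data at once, from the same initial prior.
batch_mean, batch_precision = online_update(
    0.0, 1.0, np.concatenate(datasets), noise_var=1.0)
```

For a neural network the updated posterior leaves the Gaussian family, which is where the projection step becomes necessary.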

3.2 Laplace Approximation

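We find that the Laplace approximation method provides a well-fitted meta-training framework for Bayesian online meta-learning in Eq. (5). Each updating step in the approximation procedure can be modified to correspond to the meta-parameters for few-shot classification, instead of the model parameters for large-scale supervised classification. Laplace approximation rationalises the use of a Gaussian approximate posterior by Taylor expanding the log-posterior around a mode up to the second order. The second order term corresponds to the log-probability of a Gaussian distribution.

For large-scale supervised classification, we consider finding a MAP estimate following from Eq. (7):

$$\theta^{*} = \arg\max_{\theta}\, p(\theta \mid \mathcal{D}_{1:t+1}) \tag{9}$$
$$= \arg\max_{\theta} \big\{ \log p(\mathcal{D}_{t+1} \mid \theta) + \log p(\theta \mid \mathcal{D}_{1:t}) \big\} \tag{10}$$

Since the posterior of a neural network is intractable except for small architectures, the unnormalised posterior is considered instead. Performing Taylor expansion on the logarithm of the unnormalised posterior around a mode $\theta^{*}$ gives

$$\log p(\theta \mid \mathcal{D}_{1:t+1}) \approx \log p(\theta^{*} \mid \mathcal{D}_{1:t+1}) - \frac{1}{2} (\theta - \theta^{*})^{\top} A\, (\theta - \theta^{*}) \tag{11}$$

where $A$ denotes the Hessian matrix of the negative log-posterior evaluated at $\theta^{*}$. The expansion in Eq. (11) suggests using a Gaussian approximate posterior. Given the parameter $\phi_t = \{\mu_t, \Lambda_t\}$, a mean $\mu_{t+1}$ for step $t+1$ can be obtained by finding a mode of the approximate posterior as follows via standard gradient-based optimisation:

$$\mu_{t+1} = \arg\max_{\theta} \Big\{ \log p(\mathcal{D}_{t+1} \mid \theta) - \frac{1}{2} (\theta - \mu_t)^{\top} \Lambda_t\, (\theta - \mu_t) \Big\} \tag{12}$$

The precision matrix is updated as $\Lambda_{t+1} = H_{t+1} + \Lambda_t$, where $H_{t+1}$ is the Hessian matrix of the negative log likelihood for $\mathcal{D}_{t+1}$ evaluated at $\mu_{t+1}$ with entries

$$H^{ij}_{t+1} = -\frac{\partial^{2}}{\partial \theta^{(i)}\, \partial \theta^{(j)}} \log p(\mathcal{D}_{t+1} \mid \theta)\, \Big|_{\theta = \mu_{t+1}} \tag{13}$$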

For a neural network model, gradient-based optimisation methods such as SGD (Robbins and Monro, 1951) and Adam (Kingma and Ba, 2015) are the standard choices for finding a mode for the Laplace approximation in Eq. (12). We show in Section 4 that this provides a well-suited skeleton to implement Bayesian online meta-learning in Eq. (5) with the mode-seeking optimisation procedure.

3.3 Block-Diagonal Hessian

Since the full Hessian matrix in Eq. (13) is intractable for large neural networks, we seek an efficient and relatively close approximation to the Hessian matrix. Diagonal approximations (Denker and LeCun, 1991; Kirkpatrick et al., 2017) are memory and computationally efficient, but sacrifice approximation accuracy as they ignore the interaction between parameters. Consider instead separating the Hessian matrix into blocks where different blocks are associated with different layers of a neural network. A particular diagonal block corresponds to the Hessian for a particular layer of the neural network. The block-diagonal Kronecker-factored approximation (Martens and Grosse, 2015; Grosse and Martens, 2016; Botev et al., 2017) utilises the fact that each diagonal block of the Hessian is Kronecker-factored for a single data point. This provides a better Hessian approximation as it takes the parameter interactions within a layer into consideration.
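The Kronecker-factored structure for a single data point can be checked numerically. The sketch below assumes a single fully-connected layer with weight matrix `W`, input activations `a` and a gradient `g` of the log-likelihood with respect to the layer's pre-activations; these names and sizes are illustrative only. For one data point the gradient with respect to `W` is the outer product of `g` and `a`, so the outer product of the vectorised gradient with itself factorises exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=4)  # layer input activations (4 inputs)
g = rng.normal(size=3)  # gradient w.r.t. pre-activations (3 outputs)

# Gradient w.r.t. the 3x4 weight matrix for this single data point.
grad_W = np.outer(g, a)

# Column-major vectorisation, so that vec(g a^T) = a (kron) g.
v = grad_W.flatten(order="F")

# Full 12x12 block for this layer, built from the vectorised gradient.
full_block = np.outer(v, v)

# The same block from two small factors: a 4x4 and a 3x3 matrix.
kron_block = np.kron(np.outer(a, a), np.outer(g, g))
```

Storing the two factors takes 16 + 9 numbers instead of 144 here. Over many data points the factorisation is no longer exact, and the Kronecker-factored approximation replaces the expectation of the Kronecker product by the Kronecker product of the expectations.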

Ritter et al. (2018a) use a hyperparameter $\lambda$ as a multiplier to the Hessian when updating the precision:

$$\Lambda_{t+1} = \lambda H_{t+1} + \Lambda_t \tag{14}$$

In the large-scale supervised classification setting, this hyperparameter has a regularising effect on the Gaussian posterior approximation for a balance between having a good performance on a new dataset and maintaining the performance on previous datasets (Ritter et al., 2018a). A large $\lambda$ results in a sharply peaked Gaussian posterior and is therefore unable to learn new datasets well, but can prevent forgetting previously learned datasets. A small $\lambda$ on the other hand gives a dispersed Gaussian posterior and allows better performance on new datasets by sacrificing the performance on the previous datasets.
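The trade-off controlled by the Hessian multiplier can be seen in a one-dimensional sketch. The example assumes each dataset has a quadratic negative log-likelihood with curvature `curvature` and optimum `c`, so the regularised mode has a closed form; the names, the multiplier `lam` and all numbers are hypothetical and chosen only to make the effect visible.

```python
def run_sequence(optima, curvature, lam, prior_mean=0.0, prior_precision=1e-3):
    # Online Laplace updates for scalar quadratic tasks:
    #   mode of  -curvature/2 (theta - c)^2 - precision/2 (theta - mean)^2,
    #   followed by  precision <- lam * curvature + precision.
    mean, precision = prior_mean, prior_precision
    means = []
    for c in optima:
        mean = (curvature * c + precision * mean) / (curvature + precision)
        precision = lam * curvature + precision
        means.append(mean)
    return means

optima = [1.0, 5.0]  # optimum of the first and second dataset
means_small = run_sequence(optima, curvature=1.0, lam=0.01)
means_large = run_sequence(optima, curvature=1.0, lam=100.0)
```

With the small multiplier the mean after the second dataset sits close to 5, the new optimum, whereas with the large multiplier it stays close to 1, the optimum of the first dataset.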

4 BOMLA Implementation

This section demonstrates how we arrive at the Bayesian Online Meta-Learning with Laplace Approximation (BOMLA) framework by applying the Laplace approximation to the posterior of the Bayesian online meta-learning framework in Eq. (5). BOMLA provides a grounded framework for online training on sequential few-shot classification datasets.

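The Bayesian online meta-learning framework in Section 2.2 with a Gaussian approximate posterior of mean $\mu_t$ and precision $\Lambda_t$ from the Laplace approximation gives a MAP estimate

$$\theta^{*} = \arg\max_{\theta} \Big\{ \log \int p(\mathcal{D}^{Q}_{t+1} \mid \tilde{\theta})\, p(\tilde{\theta} \mid \theta, \mathcal{D}^{S}_{t+1})\, d\tilde{\theta} - \frac{1}{2} (\theta - \mu_t)^{\top} \Lambda_t\, (\theta - \mu_t) \Big\} \tag{15}$$

Using the deterministic $p(\tilde{\theta} \mid \theta, \mathcal{D}^{S}_{t+1})$ in Eq. (6) and sampling $M$ tasks per iteration as in MAML for the optimisation in Eq. (15) leads to minimising the objective

$$f^{\mathrm{BOMLA}}_{t+1}(\theta, \mu_t, \Lambda_t) = -\frac{1}{M} \sum_{m=1}^{M} \log p(\mathcal{D}^{m,Q}_{t+1} \mid \tilde{\theta}^{m}) + \frac{1}{2} (\theta - \mu_t)^{\top} \Lambda_t\, (\theta - \mu_t) \tag{16}$$

where $\tilde{\theta}^{m} = \mathrm{SGD}_k\big(\mathcal{L}(\theta, \mathcal{D}^{m,S}_{t+1})\big)$ for $m = 1, \ldots, M$. The first term of the objective in Eq. (16) corresponds to the MAML objective in Eq. (2) with a cross-entropy loss, and the second term can be seen as a regulariser.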
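The two-term structure of this objective can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the paper's implementation: it swaps the cross-entropy loss for a unit-variance Gaussian negative log-likelihood on linear-regression tasks, uses a dense precision matrix instead of a Kronecker-factored one, and all function names are hypothetical.

```python
import numpy as np

def nll(theta, x, y):
    # Negative log-likelihood under a unit-variance Gaussian, up to a constant.
    return 0.5 * np.sum((x @ theta - y) ** 2)

def nll_grad(theta, x, y):
    return x.T @ (x @ theta - y)

def sgd_k(theta, x_s, y_s, alpha, k):
    # k steps of gradient descent on the support set, starting from theta.
    for _ in range(k):
        theta = theta - alpha * nll_grad(theta, x_s, y_s)
    return theta

def bomla_objective(theta, mu, precision, tasks, alpha=0.01, k=1):
    # First term: average query NLL at the adapted parameters (MAML-style).
    query_term = np.mean([nll(sgd_k(theta, xs, ys, alpha, k), xq, yq)
                          for xs, ys, xq, yq in tasks])
    # Second term: quadratic penalty from the previous Gaussian posterior.
    diff = theta - mu
    return query_term + 0.5 * diff @ precision @ diff

rng = np.random.default_rng(0)
tasks = []
for _ in range(4):  # meta-batch of 4 toy regression tasks
    w = rng.normal(size=3)
    x = rng.normal(size=(20, 3))
    y = x @ w
    tasks.append((x[:5], y[:5], x[5:], y[5:]))

theta = np.ones(3)
mu = np.zeros(3)
value = bomla_objective(theta, mu, 2.0 * np.eye(3), tasks)
maml_only = bomla_objective(theta, mu, np.zeros((3, 3)), tasks)
```

Setting the precision to zero removes the regulariser and leaves only the MAML-style term, which is how the sequential MAML baseline relates to this objective.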

4.1 Hessian Approximation

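The Hessian matrix corresponding to the first term of the BOMLA objective in Eq. (16) is

$$\tilde{H}^{ij}_{t+1} = -\frac{\partial^{2}}{\partial \theta^{(i)}\, \partial \theta^{(j)}} \left\{ \frac{1}{M} \sum_{m=1}^{M} \log p(\mathcal{D}^{m,Q}_{t+1} \mid \tilde{\theta}^{m}) \right\} \Bigg|_{\theta = \mu_{t+1}} \tag{17}$$

It is worth noting that the BOMLA Hessian deviates from the original BOL Hessian in Eq. (13). This requires deriving an adjusted approximation to the Hessian with some further assumptions.

The BOL Hessian in Eq. (13) for a single data point $(x, y)$ can be approximated using the Fisher information matrix to ensure its positive semi-definiteness (Martens and Grosse, 2015):

$$F = \mathbb{E}_{p_{\theta}(x, y)} \Big[ \nabla_{\theta} \log p_{\theta}(y \mid x)\, \nabla_{\theta} \log p_{\theta}(y \mid x)^{\top} \Big] \tag{18}$$

In the BOMLA framework, each pair $(x, y)$ for the Fisher is associated with a task $m$. The Fisher information matrix corresponding to the BOMLA Hessian in Eq. (17) is

$$\tilde{F} = \mathbb{E}_{p_{\tilde{\theta}}(x, y)} \left[ \Big( \frac{\partial \tilde{\theta}}{\partial \theta} \Big)^{\top} \nabla_{\tilde{\theta}} \log p_{\tilde{\theta}}(y \mid x)\, \nabla_{\tilde{\theta}} \log p_{\tilde{\theta}}(y \mid x)^{\top} \Big( \frac{\partial \tilde{\theta}}{\partial \theta} \Big) \right] \tag{19}$$

The additional Jacobian matrix $\partial \tilde{\theta} / \partial \theta$ breaks the Kronecker-factored structure described by Martens and Grosse (2015) for the original Fisher in Eq. (18).

Since $\tilde{\theta}$ is a quick adaptation from $\theta$, $\tilde{\theta}$ is relatively close to $\theta$. This leads to a Jacobian matrix $\partial \tilde{\theta} / \partial \theta$ that is approximately equal to the identity matrix. Imposing this assumption gives the approximation

$$\hat{F} = \mathbb{E}_{p_{\tilde{\theta}}(x, y)} \Big[ \nabla_{\tilde{\theta}} \log p_{\tilde{\theta}}(y \mid x)\, \nabla_{\tilde{\theta}} \log p_{\tilde{\theta}}(y \mid x)^{\top} \Big] \tag{20}$$

where the Kronecker-factored structure applies. The Fisher $\hat{F}$ in Eq. (20) is an approximation to the Hessian $\hat{H}$ adjusted from $\tilde{H}$ in Eq. (17) for a single data point:

$$\hat{H}^{ij} = -\frac{\partial^{2}}{\partial \tilde{\theta}^{(i)}\, \partial \tilde{\theta}^{(j)}} \log p_{\tilde{\theta}}(y \mid x)\, \Big|_{\theta = \mu_{t+1}} \tag{21}$$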
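The identity-Jacobian assumption can be examined in a setting where the Jacobian is known exactly. The sketch below assumes a quadratic support loss with Hessian `A`, for which `k` steps of gradient descent with step size `alpha` have Jacobian `(I - alpha * A)^k`; a random positive semi-definite matrix `F` stands in for the Fisher with respect to the adapted parameters. Both are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def adaptation_jacobian(A, alpha, k):
    # Jacobian of k gradient-descent steps on a quadratic loss with Hessian A.
    return np.linalg.matrix_power(np.eye(A.shape[0]) - alpha * A, k)

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T / 4.0  # Hessian of the toy support loss (positive semi-definite)
G = rng.normal(size=(4, 4))
F = G @ G.T        # stand-in Fisher w.r.t. the adapted parameters

def relative_error(alpha, k=1):
    # Compare the Fisher with the Jacobian included against the
    # approximation that treats the Jacobian as the identity.
    J = adaptation_jacobian(A, alpha, k)
    exact = J.T @ F @ J
    return np.linalg.norm(exact - F) / np.linalg.norm(F)

err_small = relative_error(alpha=0.001)
err_large = relative_error(alpha=0.1)
```

In this toy case the error of the approximation grows with the inner step size, which suggests the assumption is most reasonable when the quick adaptation uses few steps and a small learning rate.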

Algorithm 1 gives the pseudo-code of the BOMLA framework. The algorithm is formed of three main elements: meta-training (lines 4–11), updating the Gaussian mean (line 12) and updating the Gaussian precision matrix (lines 13–16) with a block-diagonal Kronecker-factored approximation (BD-KFA) to the Fisher approximation in Eq. (20).

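1:  Require: sequential base class sets $\mathcal{D}_1, \ldots, \mathcal{D}_T$, learning rate $\alpha$, posterior regulariser $\lambda$, meta-batch size $M$
2:  Initialise: $\mu_0$, $\Lambda_0$, $\theta$
3:  for $t = 1$ to $T$ do
4:     for $i = 1, 2, \ldots$ do
5:        for $m = 1$ to $M$ do
6:           Sample task $\mathcal{D}^{m}_{t}$ from $\mathcal{D}_t$
7:           Inner update $\tilde{\theta}^{m} = \mathrm{SGD}_k(\mathcal{L}(\theta, \mathcal{D}^{m,S}_{t}))$
8:        end for
9:        Evaluate loss $f^{\mathrm{BOMLA}}_{t}(\theta, \mu_{t-1}, \Lambda_{t-1})$ in Eq. (16)
10:        Outer update $\theta \leftarrow \theta - \alpha \nabla_{\theta} f^{\mathrm{BOMLA}}_{t}(\theta, \mu_{t-1}, \Lambda_{t-1})$
11:     end for
12:     Update mean $\mu_t \leftarrow \theta$
13:     Sample tasks $\mathcal{D}^{m}_{t}$ from $\mathcal{D}_t$ for $m = 1, \ldots, M$
14:     Run inner update in line 7 for each task
15:     Approximate $\tilde{H}_t$ with BD-KFA to $\hat{F}$ in Eq. (20)
16:     Update precision $\Lambda_t \leftarrow \lambda \tilde{H}_t + \Lambda_{t-1}$
17:  end for
Algorithm 1 Bayesian Online Meta-Learning with Laplace Approximation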
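The three elements of Algorithm 1 can be traced in a scalar toy version. This is a sketch under strong simplifying assumptions and not the authors' implementation: each "dataset" is a family of linear-regression tasks whose slopes cluster around `slope_mean`, the loss is a unit-variance Gaussian negative log-likelihood, the inner loop takes one SGD step, and with a single parameter the Fisher is a scalar so no Kronecker factorisation is involved. All names and hyperparameter values are hypothetical.

```python
import numpy as np

def grad_nll(theta, x, y):
    # Gradient of 0.5 * sum((theta * x - y)^2) with respect to theta.
    return np.sum(x * (theta * x - y))

def inner(theta, x_s, y_s, alpha):
    # Line 7: one SGD step on the support set.
    return theta - alpha * grad_nll(theta, x_s, y_s)

def sample_task(rng, slope_mean, n_s=5, n_q=15):
    slope = slope_mean + 0.1 * rng.normal()
    x = rng.normal(size=n_s + n_q)
    y = slope * x
    return x[:n_s], y[:n_s], x[n_s:], y[n_s:]

def train(slope_means, lam, alpha=0.02, lr=0.01, M=4, iters=400, seed=0):
    rng = np.random.default_rng(seed)
    mu, precision, theta = 0.0, 0.0, 0.0
    for slope_mean in slope_means:                  # line 3: datasets in sequence
        for _ in range(iters):                      # lines 4-11: meta-training
            g = 0.0
            for _ in range(M):
                x_s, y_s, x_q, y_q = sample_task(rng, slope_mean)
                theta_m = inner(theta, x_s, y_s, alpha)
                jac = 1.0 - alpha * np.sum(x_s ** 2)   # d theta_m / d theta
                g += grad_nll(theta_m, x_q, y_q) * jac / M
            g += precision * (theta - mu)           # gradient of the regulariser
            theta -= lr * g                         # line 10: outer update
        mu = theta                                  # line 12: update mean
        # Lines 13-16: curvature w.r.t. the adapted parameter (Jacobian
        # treated as identity), then the precision update.
        fisher = np.mean([np.sum(sample_task(rng, slope_mean)[2] ** 2)
                          for _ in range(M)])
        precision = lam * fisher + precision
    return theta

# lam = 0 keeps the precision at zero, i.e. sequential MAML.
theta_maml = train([1.0, 5.0], lam=0.0)
theta_bomla = train([1.0, 5.0], lam=1.0)
```

In this toy run the sequential MAML baseline ends near the second dataset's slope, while the regularised run ends between the two, retaining part of what was learned on the first dataset.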

4.2 Online-Within-Batch Setting

The BOMLA framework is developed to handle online meta-learning on sequential datasets. It still requires all base classes within a dataset to be readily available in a batch prior to meta-training. This is known as the online-within-batch setting (Denevi et al., 2019), as the datasets are sequential but the tasks and examples within a dataset are processed in batch.

Recent works (Javed and White, 2019; Denevi et al., 2019) show interest in the online setting where tasks within a dataset and the examples in a task arrive sequentially for training. This is the online-within-online setting (Denevi et al., 2019), which requires a positive forward transfer (Lopez-Paz and Ranzato, 2017) on the sequential tasks or examples for a good overall performance on the dataset. A Bayesian framework in such a setting requires a powerful posterior approximation. The posterior approximation has to be flexible enough to capture uncertainty in future knowledge for a performance improvement on sequential tasks, and has to be close to the true posterior to avoid a performance degradation due to the approximation deviating from previous experiences. This remains a difficult and unsolved problem in the Bayesian framework. Without such a posterior approximation, running Bayesian online meta-learning for sequential tasks within a dataset may not give a good few-shot classification performance.

This paper focuses on the online-within-batch framework (Denevi et al., 2019) to avoid negative backward transfer (Lopez-Paz and Ranzato, 2017) also known as catastrophic forgetting in the sequential datasets. A future research direction would be a Bayesian online meta-learning framework with positive forward transfer ability to handle sequential tasks within a dataset that gives a good few-shot classification performance.

5 Related Work

Online meta-learning:

Our work is closely related to (Denevi et al., 2019; Finn et al., 2019; Zhuang et al., 2019) in terms of problem settings. The BOMLA framework corresponds to the online-within-batch setting (Denevi et al., 2019), where the datasets arrive sequentially but the tasks and examples in the dataset are in a batch. Recent online meta-learning methods (Finn et al., 2019; Zhuang et al., 2019) accumulate the data as they arrive and meta-learn using all data acquired so far. This is not desirable as the algorithmic complexity of training grows with the number of datasets and training time increases as new datasets arrive. The agent will eventually run out of memory for a long sequence of datasets. The BOMLA framework on the other hand is advantageous, as it only takes the posterior of the meta-parameters into consideration during optimisation. This gives a framework with an algorithmic complexity independent of the length of the dataset sequence. He et al. (2019) and Harrison et al. (2019) look into continual meta-learning for non-stationary task distributions where the task boundaries are unknown to the model. These works interpret continual meta-learning from both ends: from the meta-learning perspective, they attempt to accumulate knowledge at the meta-level, and from the continual learning perspective, they infer the tasks continually as the tasks arrive with unidentified task boundaries. The tasks of concern in these papers are primarily tasks within a single dataset, while our work focuses on the problem of continual adaptation for few-shot classification on a sequence of dissimilar datasets.

Offline meta-learning:

Previous meta-learning works attempt to solve few-shot classification problems in an offline setting, under the assumption of having a stationary task distribution during meta-training and meta-testing. A single meta-learned model is intended for few-shot classification on one specific dataset, with all base classes of the dataset readily available in a batch for meta-training. Gradient-based meta-learning (Finn et al., 2017; Nichol et al., 2018; Rusu et al., 2019) updates the meta-parameters by accumulating the gradients of a meta-batch of task-specific inner loop updates. The meta-parameters are then used as a model initialisation for a quick adaptation on the novel classes. The MAML algorithm can be cast into a probabilistic inference problem in a hierarchical Bayesian model (Grant et al., 2018). Grant et al. (2018) also discuss the use of a Laplace approximation in the task-specific inner loop to improve MAML using the curvature information. Metric-based meta-learning (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017) utilises the metric distance between labelled examples. Such methods assume that base and novel classes are from the same dataset distribution, and that the metric distance estimations can be generalised to the novel classes upon meta-learning the base classes.

Continual learning:

Modern continual learning works (Goodfellow et al., 2013; Lee et al., 2017; Zenke et al., 2017) focus primarily on large-scale supervised learning, in contrast to our work that looks into continual few-shot classification across sequential datasets. Wen et al. (2018) utilise few-shot learning to improve on overcoming catastrophic forgetting via logit matching on a small sample from the previous tasks. The online learning element in this paper is closely related to (Kirkpatrick et al., 2017; Zenke et al., 2017; Ritter et al., 2018a) that overcome catastrophic forgetting for large-scale supervised classification. In particular, our work builds on the online Laplace approximation method in (Ritter et al., 2018a). We extend this to the meta-learning scenario to avoid forgetting in few-shot classification problems. Nguyen et al. (2018) provide an alternative of using variational inference instead of the Laplace approximation for approximating the posterior. It is a reasonable approach to adapt variational approximation methods to approximate the posterior of the meta-parameters by adjusting the KL-divergence objective.

6 Experiments

Our experiments compare BOMLA to applying MAML continuously on sequential datasets in terms of their ability to overcome catastrophic forgetting in various few-shot classification settings. We begin with an artificially generated toy dataset Rainbow Omniglot, and proceed to a more challenging dataset sequence. In both experiments we consider the 1-shot 5-way few-shot classification setting. The model architecture and other few-shot details can be found in Appendix A.

6.1 Rainbow Omniglot

The artificial dataset sequence Rainbow Omniglot is generated analogously to Rainbow MNIST (Finn et al., 2019). We run the experiment on a sequence of 10 different datasets randomly chosen from Rainbow Omniglot. The details of generating Rainbow Omniglot with the 10 chosen datasets and the implementation details of this experiment can be found in Appendix B.1.

Figure 1: The statistics of the meta-evaluation accuracy over all datasets up to (and including) the current meta-training dataset index. Higher accuracy values with more stable lines within each statistic category indicate better results. BOMLA with outperforms MAML and other values in this context.

Some datasets in the sequence are harder and more different than the others, particularly the ones with images rescaled to , as the images lose pixels in the character writings due to rescaling. Figure 1 shows that MAML has a high fluctuation in the performance, especially when transitioning between datasets rescaled to and datasets with original scaling. When MAML encounters a more different dataset, the meta-parameters divert from previous experiences for an optimised performance on the current dataset. BOMLA on the other hand, has a relatively stable performance across dissimilar datasets, as it takes previous experiences into account when obtaining a posterior of the meta-parameters with the new dataset included.

6.2 A More Challenging Few-Shot Classification Dataset Sequence

We apply BOMLA to a more challenging few-shot classification dataset sequence: Omniglot, miniQuickDraw and CIFAR-FS.

Figure 2: Examples of image instances from Omniglot (left), miniQuickDraw (centre) and CIFAR-FS (right).

Figure 3: Meta-evaluation accuracy on Omniglot, miniQuickDraw and CIFAR-FS along meta-training. Higher accuracy values indicate better results with less forgetting as we proceed to new datasets. MAML catastrophically forgets the previously meta-learned Omniglot and miniQuickDraw when meta-training on CIFAR-FS, whereas BOMLA gives a good performance balance between the previous and current datasets.
Omniglot:

The Omniglot dataset (Lake et al., 2011) comprises 1623 characters from 50 alphabets and each character has 20 instances. New classes with rotations in multiples of $90^{\circ}$ are formed after splitting the classes for meta-training and meta-evaluation.

miniQuickDraw:

The miniQuickDraw dataset is formed by randomly sampling 100 classes and 600 instances in each class from the QuickDraw dataset (Ha and Eck, 2017). The QuickDraw dataset comprises 345 categories of drawings collected from the players in the game “Quick, Draw!”

CIFAR-FS:

The CIFAR-FS dataset (Bertinetto et al., 2019) is a variation on CIFAR100 (Krizhevsky, 2009) for the few-shot classification purpose, with 100 classes of objects and each class comprises 600 images of size . We rescale the images to in this experiment.

Figure 2 shows some image instances from each dataset in the sequence. The implementation details of this experiment and the 100 classes for miniQuickDraw can be found in Appendix B.2. The transition from Omniglot to miniQuickDraw is less drastic, compared to the transition from miniQuickDraw to CIFAR-FS. When transitioning from Omniglot to miniQuickDraw, we observe only a small amount of forgetting for both BOMLA and MAML in Figure 3. Continuing the MAML run to the subsequent transition from miniQuickDraw to CIFAR-FS gives a noticeable drop in the few-shot performance on Omniglot and miniQuickDraw. The result in Figure 3 shows that BOMLA is able to prevent catastrophic forgetting in both transitions. BOMLA is able to proceed with learning miniQuickDraw with almost no forgetting on Omniglot. There is a small trade-off in the few-shot performance for CIFAR-FS as BOMLA avoids catastrophically forgetting Omniglot and miniQuickDraw.

Tuning the hyperparameter in Eq. (14) corresponds to balancing between a smaller performance trade-off on a new dataset and less forgetting on previous datasets. The value 1 results in a more concentrated Gaussian posterior and is therefore unable to learn new datasets well, but can better retain the performances on previously learned datasets. The value 0.01 on the other hand gives a widespread Gaussian posterior and learns better on new datasets by sacrificing the performance on the previous datasets. The result shows that the value gives the best balance between old and new datasets. For the best performance, we can set different values for different datasets. The best hyperparameter combination in this experiment is for miniQuickDraw and for CIFAR-FS.

7 Conclusion

We introduced the Bayesian Online Meta-learning with Laplace Approximation (BOMLA) framework to overcome catastrophic forgetting in few-shot classification problems. BOMLA merged the BOL framework and the MAML algorithm via an adjusted objective obtained from the log-posterior, with the first term corresponding to the MAML objective and the second term acting as a regulariser. We proposed the necessary adjustments in the Hessian and Fisher approximation, as we are optimising the meta-parameters for few-shot classification instead of the usual model parameters in large-scale supervised classification. The experiments show that BOMLA is able to retain the few-shot classification ability when trained on sequential datasets, resulting in the ability to perform few-shot classification on multiple datasets with a single meta-learned model.

References

  • Bertinetto et al. (2019) L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi. Meta-Learning with Differentiable Closed-Form Solvers. In International Conference on Learning Representations, 2019.
  • Botev et al. (2017) A. Botev, H. Ritter, and D. Barber. Practical Gauss-Newton Optimisation for Deep Learning. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • Denevi et al. (2019) G. Denevi, D. Stamos, C. Ciliberto, and M. Pontil. Online-Within-Online Meta-Learning. In Advances in Neural Information Processing Systems 32, 2019.
  • Denker and LeCun (1991) J. S. Denker and Y. LeCun. Transforming Neural-Net Output Levels to Probability Distributions. In Advances in Neural Information Processing Systems 3, 1991.
  • Dong et al. (2015) D. Dong, H. Wu, W. He, D. Yu, and H. Wang. Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015.
  • Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • Finn et al. (2019) C. Finn, A. Rajeswaran, S. Kakade, and S. Levine. Online Meta-Learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Goodfellow et al. (2013) I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. arXiv preprint, arXiv:1312.6211, 2013.
  • Grant et al. (2018) E. Grant, C. Finn, S. Levine, T. Darrell, and T. Griffiths. Recasting Gradient-Based Meta-Learning as Hierarchical Bayes. In International Conference on Learning Representations, 2018.
  • Grosse and Martens (2016) R. Grosse and J. Martens. A Kronecker-Factored Approximate Fisher Matrix for Convolution Layers. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • Ha and Eck (2017) D. Ha and D. Eck. A Neural Representation of Sketch Drawings. arXiv preprint, arXiv:1704.03477, 2017.
  • Harrison et al. (2019) J. Harrison, A. Sharma, C. Finn, and M. Pavone. Continuous Meta-Learning without Tasks. arXiv preprint, arXiv:1912.08866, 2019.
  • He et al. (2019) X. He, J. Sygnowski, A. Galashov, A. A. Rusu, Y. Teh, and R. Pascanu. Task Agnostic Continual Learning via Meta Learning. arXiv preprint, arXiv:1906.05201, 2019.
  • Javed and White (2019) K. Javed and M. White. Meta-Learning Representations for Continual Learning. In Advances in Neural Information Processing Systems 32, 2019.
  • Johnson et al. (2017) M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 2017.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
  • Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, 2017.
  • Koch et al. (2015) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese Neural Networks for One-Shot Image Recognition. In 32nd International Conference on Machine Learning Deep Learning Workshop, 2015.
  • Krizhevsky (2009) A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
  • Kurata and Audhkhasi (2019) G. Kurata and K. Audhkhasi. Multi-Task CTC Training with Auxiliary Feature Reconstruction for End-to-End Speech Recognition. In 20th Annual Conference of the International Speech Communication Association, 2019.
  • Lake et al. (2011) B. Lake, R. Salakhutdinov, J. Gross, and J.B. Tenenbaum. One Shot Learning of Simple Visual Concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.
  • Lee et al. (2017) S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang. Overcoming Catastrophic Forgetting by Incremental Moment Matching. In Advances in Neural Information Processing Systems 30, 2017.
  • Li et al. (2004) F. Li, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2004.
  • Li et al. (2006) F. Li, R. Fergus, and P. Perona. One-Shot Learning of Object Categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
  • Li et al. (2017) Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. arXiv preprint, arXiv:1707.09835, 2017.
  • Lopez-Paz and Ranzato (2017) D. Lopez-Paz and M. A. Ranzato. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems 30, 2017.
  • MacKay (1992) D. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Computation, 1992.
  • Martens and Grosse (2015) J. Martens and R. Grosse. Optimizing Neural Networks with Kronecker-Factored Approximate Curvature. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • Miller et al. (2000) E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from One Example Through Shared Densities on Transforms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
  • Nguyen et al. (2018) C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational Continual Learning. In International Conference on Learning Representations, 2018.
  • Nichol et al. (2018) A. Nichol, J. Achiam, and J. Schulman. On First-Order Meta-Learning Algorithms. arXiv preprint, arXiv:1803.02999, 2018.
  • Nikolaidis et al. (2015) S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. Shah. Efficient Model Learning from Joint-Action Demonstrations for Human-Robot Collaborative Tasks. In Proceedings of the 10th International Conference on Human-Robot Interaction, 2015.
  • Nikolaidis et al. (2016) S. Nikolaidis, A. Kuznetsov, D. Hsu, and S. Srinivasa. Formalizing Human-Robot Mutual Adaptation: A Bounded Memory Model. In Proceedings of the 11th International Conference on Human-Robot Interaction, 2016.
  • Opper (1998) M. Opper. A Bayesian Approach to Online Learning. Cambridge University Press, 1998.
  • Ravi and Larochelle (2017) S. Ravi and H. Larochelle. Optimization as a Model for Few-Shot Learning. In International Conference on Learning Representations, 2017.
  • Ritter et al. (2018a) H. Ritter, A. Botev, and D. Barber. Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting. In Advances in Neural Information Processing Systems 31, 2018a.
  • Ritter et al. (2018b) H. Ritter, A. Botev, and D. Barber. A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations, 2018b.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 1951.
  • Rusu et al. (2019) A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-Learning with Latent Embedding Optimization. In International Conference on Learning Representations, 2019.
  • Santoro et al. (2016) A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-Learning with Memory-Augmented Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • Schmidhuber (1987) J. Schmidhuber. Evolutionary Principles in Self-Referential Learning. On Learning How to Learn: The Meta-Meta-Meta…-Hook. Diploma thesis, Institut für Informatik, Technische Universität München, 1987.
  • Seltzer and Droppo (2013) M. L. Seltzer and J. Droppo. Multi-task Learning in Deep Neural Networks for Improved Phoneme Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
  • Snell et al. (2017) J. Snell, K. Swersky, and R. Zemel. Prototypical Networks for Few-Shot Learning. In Advances in Neural Information Processing Systems 30, 2017.
  • Sung et al. (2018) F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to Compare: Relation Network for Few-Shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Thrun and Pratt (1998) S. Thrun and L. Pratt. Learning to Learn: Introduction and Overview. Springer, Boston, MA, 1998.
  • Vinyals et al. (2016) O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems 29, 2016.
  • Wen et al. (2018) J. Wen, Y. Cao, and R. Huang. Few-Shot Self Reminder to Overcome Catastrophic Forgetting. arXiv preprint, arXiv:1812.00543, 2018.
  • Yuan and Yan (2010) X. Yuan and S. Yan. Visual Classification with Multi-Task Joint Sparse Representation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
  • Zenke et al. (2017) F. Zenke, B. Poole, and S. Ganguli. Continual Learning through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • Zhang et al. (2014) Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial Landmark Detection by Deep Multi-task Learning. In Proceedings of the 13th European Conference on Computer Vision, 2014.
  • Zhuang et al. (2019) Z. Zhuang, Y. Wang, K. Yu, and S. Lu. Online Meta-Learning on Non-Convex Setting. arXiv preprint, arXiv:1910.10196, 2019.

Appendix A Few-Shot Details

We average over 100 tasks sampled from the novel classes when reporting the meta-evaluation accuracy in all of the experiments. Each task in both of the 1-shot 5-way experiments consists of a support set with 1 sample per class and a query set with 15 samples per class. We use the model architecture proposed by Vinyals et al. (2016), which stacks 4 modules, each with 64 filters of size 3 × 3 followed by a batch normalisation, a ReLU activation and a 2 × 2 max-pooling. A fully-connected layer is appended to the final module before getting the class probabilities with softmax.
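The evaluation protocol above (5-way 1-shot tasks with 15 query samples per class, averaged over 100 tasks) can be sketched as follows. This is an illustrative sketch only: `sample_task`, `meta_evaluate` and the `adapt_and_predict` callback are hypothetical names, and the data is assumed to be a dictionary mapping each novel class to a list of its examples.

```python
import random


def sample_task(class_to_examples, n_way=5, k_shot=1, n_query=15, rng=random):
    """Draw one few-shot task: a support set and a disjoint query set."""
    classes = rng.sample(sorted(class_to_examples), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample without replacement so support and query never overlap.
        examples = rng.sample(class_to_examples[cls], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query


def meta_evaluate(class_to_examples, adapt_and_predict, n_tasks=100, seed=0):
    """Average query accuracy over n_tasks sampled tasks."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_tasks):
        support, query = sample_task(class_to_examples, rng=rng)
        # adapt_and_predict stands in for inner-loop adaptation plus inference.
        predictions = adapt_and_predict(support, [x for x, _ in query])
        correct = sum(p == y for p, (_, y) in zip(predictions, query))
        accuracies.append(correct / len(query))
    return sum(accuracies) / len(accuracies)
```

Each class needs at least 16 examples for the default settings, since one support sample and 15 query samples are drawn per class.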

Appendix B Implementation Details

B.1 Rainbow Omniglot

Analogous to Rainbow MNIST (Finn et al., 2019), our Rainbow Omniglot sequence is generated by transforming the character images from Omniglot in the following ways: scaling, rotating and changing the background colour of the images. To generate one of the sequential datasets, we apply a combined transformation formed by randomly selecting a scaling (to size or original size), a rotation degree (0°, 90°, 180° or 270°) and a background colour out of seven different colours. Rainbow MNIST scales the images to either half size or original size, but we find the factor of half to be too small to generate reasonably interpretable images in Omniglot.
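The random selection of a combined transformation can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the RGB triples are illustrative guesses for the seven colour names that appear in this appendix, rotation is taken to be clockwise, strokes are assumed to stay black, and the resizing step is left out because the reduced scale factor is not given here.

```python
import random

ROTATIONS = (0, 90, 180, 270)
SCALES = ("reduced", "original")  # the reduced factor is not specified here
# Colour names follow this appendix; the RGB values are assumptions.
BACKGROUNDS = {
    "red": (255, 0, 0),
    "blue": (0, 0, 255),
    "green": (0, 128, 0),
    "white": (255, 255, 255),
    "indigo": (75, 0, 130),
    "cyan": (0, 255, 255),
    "yellow": (255, 255, 0),
}


def sample_transformation(rng):
    """Pick one scale, one rotation and one background colour at random."""
    return {
        "scale": rng.choice(SCALES),
        "rotation": rng.choice(ROTATIONS),
        "background": rng.choice(sorted(BACKGROUNDS)),
    }


def rotate(image, degrees):
    """Rotate a 2-D list clockwise by a multiple of 90 degrees."""
    for _ in range(degrees // 90):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def colourise(image, background):
    """Paint background pixels (0) with the colour; strokes (1) become black."""
    colour = BACKGROUNDS[background]
    return [[(0, 0, 0) if pixel else colour for pixel in row] for row in image]


rng = random.Random(0)
spec = sample_transformation(rng)
glyph = [[0, 1], [0, 0]]  # tiny binary stand-in for a character image
transformed = colourise(rotate(glyph, spec["rotation"]), spec["background"])
```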

The sequence of the 10 Rainbow Omniglot datasets in this experiment is as follows:

  1. Original scale, 0° rotation and red background

  2. Original scale, 270° rotation and blue background

  3. scale, 0° rotation and red background

  4. scale, 270° rotation and green background

  5. scale, 180° rotation and red background

  6. Original scale, 180° rotation and white background

  7. Original scale, 0° rotation and indigo background

  8. scale, 270° rotation and cyan background

  9. Original scale, 0° rotation and yellow background

  10. Original scale, 270° rotation and green background

Each dataset is meta-trained for 3000 iterations using Adam with learning rate 0.01 and meta-batch size 32 for the outer loop optimisation. We perform a one-step SGD adaptation with learning rate 0.4 for the inner loop update on each task. We sample 1000 tasks to approximate the Hessian when updating the Gaussian precision matrix.
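The role of the sampled tasks in the precision update can be illustrated with a scalar toy example. This sketch makes several assumptions that are not stated in this appendix: the Hessian is estimated as a plain Monte Carlo average of per-task second derivatives, the precision is updated additively as is common in Laplace-based online learning, and any further scaling of the estimate is ignored. All names and the uniform curvature distribution are hypothetical.

```python
import random


def estimate_curvature(sample_task_curvature, n_tasks, rng):
    """Monte Carlo average of per-task second derivatives (scalar toy case)."""
    return sum(sample_task_curvature(rng) for _ in range(n_tasks)) / n_tasks


def update_precision(old_precision, curvature_estimate):
    """Additive precision update (assumed form, scalar toy case)."""
    return old_precision + curvature_estimate


def toy_task_curvature(rng):
    # Hypothetical per-task curvature; a real model would compute it from the loss.
    return rng.uniform(0.5, 1.5)


rng = random.Random(0)
curvature = estimate_curvature(toy_task_curvature, n_tasks=1000, rng=rng)
new_precision = update_precision(old_precision=2.0, curvature_estimate=curvature)
```

Using more sampled tasks reduces the variance of the curvature estimate at the cost of extra computation per dataset.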

B.2 More Challenging Experiment

For each of the Omniglot, miniQuickDraw and CIFAR-FS datasets, we update the meta-parameters in the outer loops using Adam with learning rate 0.001 and meta-batch size 32. Similar to the Rainbow Omniglot, the inner loop for Omniglot in this sequence does a one-step SGD with learning rate 0.4. The miniQuickDraw dataset uses a three-step SGD with learning rate 0.2 as an inner task-specific update. For CIFAR-FS, we perform a five-step SGD with learning rate 0.1 for an inner loop update. For each dataset, we sample 2000 tasks to approximate the Hessian when updating the Gaussian precision matrix.
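The per-dataset inner-loop settings above can be collected in a small configuration and applied with a generic multi-step SGD routine. This is a sketch only: the step counts and learning rates are those stated above, while `INNER_LOOP`, `adapt` and the one-dimensional quadratic toy loss are hypothetical stand-ins for the task-specific update of a real model.

```python
# Inner-loop settings as stated for each dataset in this sequence.
INNER_LOOP = {
    "Omniglot": {"steps": 1, "lr": 0.4},
    "miniQuickDraw": {"steps": 3, "lr": 0.2},
    "CIFAR-FS": {"steps": 5, "lr": 0.1},
}


def adapt(weight, grad_fn, steps, lr):
    """Run `steps` plain SGD updates starting from the meta-parameter `weight`."""
    for _ in range(steps):
        weight = weight - lr * grad_fn(weight)
    return weight


def toy_grad(weight, target=1.0):
    """Gradient of the toy task loss 0.5 * (weight - target) ** 2."""
    return weight - target


for name, cfg in INNER_LOOP.items():
    adapted = adapt(0.0, toy_grad, cfg["steps"], cfg["lr"])
    print(f"{name}: adapted weight = {adapted:.5f}")
```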

Below are the 100 classes in the miniQuickDraw dataset: ’mailbox’, ’whale’, ’peanut’, ’vase’, ’octagon’, ’dumbbell’, ’hockey puck’, ’chandelier’, ’ocean’, ’tennis racquet’, ’bush’, ’potato’, ’tent’, ’lobster’, ’pool’, ’squirrel’, ’megaphone’, ’bucket’, ’golf club’, ’jacket’, ’computer’, ’keyboard’, ’basket’, ’underwear’, ’asparagus’, ’cactus’, ’arm’, ’oven’, ’elephant’, ’moon’, ’giraffe’, ’couch’, ’clock’, ’suitcase’, ’snowflake’, ’scorpion’, ’skyscraper’, ’paint can’, ’dragon’, ’windmill’, ’skateboard’, ’fish’, ’wristwatch’, ’calculator’, ’cat’, ’hammer’, ’sheep’, ’necklace’, ’bear’, ’anvil’, ’bulldozer’, ’scissors’, ’skull’, ’syringe’, ’zebra’, ’helmet’, ’bench’, ’harp’, ’river’, ’monkey’, ’bread’, ’donut’, ’train’, ’flamingo’, ’drill’, ’peas’, ’shorts’, ’book’, ’mushroom’, ’brain’, ’fireplace’, ’t-shirt’, ’horse’, ’cell phone’, ’hexagon’, ’zigzag’, ’strawberry’, ’sock’, ’rainbow’, ’crocodile’, ’tree’, ’bird’, ’spreadsheet’, ’teddy-bear’, ’The Mona Lisa’, ’bracelet’, ’flying saucer’, ’tractor’, ’bathtub’, ’cruise ship’, ’car’, ’parachute’, ’grass’, ’guitar’, ’The Eiffel Tower’, ’ear’, ’drums’, ’circle’, ’compass’, ’bandage’