1 Introduction
A common assumption made by many machine learning algorithms is that the observations in a dataset are independent and identically distributed (i.i.d.). However, there are many scenarios where this assumption is violated because the underlying data distribution is nonstationary. For instance, in reinforcement learning (RL), the observations depend on the current policy of the agent, which may change over time. In addition, the environments with which the agent interacts are usually themselves nonstationary. In supervised learning tasks, computational or legal constraints may force one to retrain a deployed model only on recently collected data, which might come from a different distribution than the previous data. In all these scenarios, blindly assuming i.i.d. data leads not only to an inefficient learning procedure, but also to catastrophic interference
(McCloskey and Cohen, 1989).
One research area that addresses this problem is continual learning, where the nonstationarity of the data is usually described as a sequence of distinct tasks. The desiderata for continual learning (Schwarz et al., 2018) include the ability to not forget, positive forward transfer (learning new tasks faster by leveraging previously acquired knowledge), positive backward transfer (improvement on previous tasks because of new skills learned), a bounded memory budget regardless of the number of tasks, and so forth. Since these desiderata often compete with each other, most continual learning methods aim for some of them instead of all, and to simplify the problem, they usually assume that the task labels or the boundaries between different tasks are known.
In this work, we aim to develop algorithms that can continually learn a sequence of tasks without knowing their labels or boundaries. Furthermore, we argue that in the more challenging scenario where the tasks are not only different but also in conflict with each other, most existing approaches fail. To overcome these challenges, we propose a framework that applies meta-learning techniques to continual learning problems, and we shift our focus from less forgetting to faster remembering: rapidly recalling a previously learned task, given the right context as a cue.
2 Problem Statement
We consider the online learning scenario studied by Hochreiter et al. (2001); Vinyals et al. (2016); Nagabandi et al. (2019), where at each time step $t$, the network receives an input $x_t$ and gives a prediction $\hat{y}_t$ using a model parametrised by $\theta_t$. It then receives the ground truth $y_t$, which can be used to adapt its parameters and to improve its performance on future predictions. If the data distribution is nonstationary (e.g., $(x_t, y_t)$ might be sampled from some task $\mathcal{T}_1$ for a while, then the task switches to $\mathcal{T}_2$), then training on the new data might lead to catastrophic forgetting: the new parameters can solve task $\mathcal{T}_2$ but not task $\mathcal{T}_1$ anymore.
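The online protocol above can be sketched in a few lines. The scalar model, squared-error loss, and learning rate here are illustrative assumptions, not the paper's architecture; the point is only the predict-then-adapt loop:

```python
def predict(w, x):
    # toy scalar model: y_hat = w * x
    return w * x

def adapt(w, x, y, lr=0.1):
    # one SGD step on the squared error (w*x - y)^2
    return w - lr * 2.0 * (predict(w, x) - y) * x

def run_online(stream, w):
    # predict first, then receive the target and adapt
    losses = []
    for x, y in stream:
        losses.append((predict(w, x) - y) ** 2)
        w = adapt(w, x, y)
    return w, losses

# a stationary stream sampled from a single task, y = 2x
stream = [(1.0, 2.0)] * 50
w, losses = run_online(stream, 0.0)
```

On a stationary stream the online loss decreases over time; the failure mode discussed next appears only when the stream switches tasks.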
Many continual learning methods have been proposed to alleviate the problem of catastrophic forgetting. However, most of these approaches require either the task index or at least the moment when the task switches. Only recently has the continual learning community started to focus on task-agnostic methods
(Zeno et al., 2018; Aljundi et al., 2019). However, all these methods rely on the underlying assumption that, no matter what tasks it has been learning, at any time $t$ it is possible to find parameters $\theta_t$ that fit all previous observations with high accuracy. This assumption is, however, not valid when the target depends not only on the observation $x$ but also on some hidden task (or context) variable $c$: $y = f(x, c)$, a common scenario in partially observable environments (Monahan, 1982; Cassandra et al., 1994). In this case, when the context has changed ($c_1 \neq c_2$), even if the observation remains the same ($x_1 = x_2$), the targets may differ ($y_1 \neq y_2$). As a result, it is impossible to find a single parameter vector that fits both mappings. It follows that, in this case, catastrophic forgetting cannot be avoided without inferring the task variable $c$.
3 What & How Framework
Here we propose a framework for task-agnostic continual learning that explicitly infers the current task from some context data and predicts targets based on both the inputs and the inferred task representation. The framework consists of two modules: an encoder, or task inference network, that predicts the current task representation from the context data (What task is it?), and a decoder that maps the task representation to a task-specific model, which makes predictions conditional on the current task (How to solve it?).
Under this framework, even when the inputs at two different time steps are the same, the corresponding predictions can differ, depending on the contexts. In this work, we choose the most recent observations as the context dataset. This choice is reasonable in an environment that is piecewise stationary or changes smoothly. An overview of the framework is illustrated in Figure 1(a).
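A minimal sketch of the two modules follows; the closed-form "encoder" (an average slope estimated from the context) stands in for the learned task inference network and is purely illustrative:

```python
def what_encoder(context):
    # infer a task representation z from context pairs (x, y);
    # here simply the average slope y/x (illustrative, not learned)
    return sum(y / x for x, y in context) / len(context)

def how_decoder(z):
    # map the task representation z to a task-specific model f_z
    return lambda x: z * x

# same input x, different contexts -> different predictions
ctx_a = [(1.0, 3.0), (2.0, 6.0)]    # data from task y = 3x
ctx_b = [(1.0, -1.0), (2.0, -2.0)]  # data from task y = -x
f_a = how_decoder(what_encoder(ctx_a))
f_b = how_decoder(what_encoder(ctx_b))
```

Note that `f_a(4.0)` and `f_b(4.0)` disagree even though the query input is identical: the context alone disambiguates the task.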
3.1 Meta Learning as Task Inference
A similar separation of concerns can be found in the meta-learning literature. In fact, many recently proposed meta-learning methods can be seen as instances of this framework. For example, Conditional Neural Processes (CNP) (Garnelo et al., 2018) embed each observation-target pair in the context data with an encoder. The embeddings are then aggregated by a commutative operation (such as the mean) to obtain a single embedding of the context. At inference time, this context embedding is passed as an additional input to a decoder to produce the conditional outputs.
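The permutation invariance of the CNP context embedding comes entirely from the commutative aggregation, which a tiny numeric sketch makes concrete; the 2-d linear embedding and decoder weights below are arbitrary stand-ins for the learned networks:

```python
def embed(x, y, w=(1.0, 0.5, -0.5, 2.0)):
    # stand-in encoder h(x_i, y_i) -> 2-d embedding
    return (w[0] * x + w[1] * y, w[2] * x + w[3] * y)

def aggregate(embeddings):
    # commutative mean aggregation over the context
    n = len(embeddings)
    return tuple(sum(e[j] for e in embeddings) / n for j in range(2))

def decode(r, x, v=(0.3, -0.2, 1.0)):
    # stand-in decoder conditioned on the context embedding r
    return v[0] * r[0] + v[1] * r[1] + v[2] * x

ctx = [(0.0, 1.0), (1.0, 2.0), (2.0, 5.0)]
r = aggregate([embed(x, y) for x, y in ctx])
r_shuffled = aggregate([embed(x, y) for x, y in reversed(ctx)])
```

Because the mean is order-independent, `r` and `r_shuffled` coincide, so the decoder's output does not depend on how the context is presented.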
Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) infers the current task by applying one or a few steps of gradient descent on the context data. The resulting task-specific parameters can be considered a high-dimensional representation of the current task returned by a What encoder, whose meta parameters are the initial weights and the inner-loop learning rate. The How decoder of MAML returns the task-specific model simply by parametrizing the network with the adapted parameters.
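Viewed this way, MAML's "What" encoder is just the inner-loop update itself. A one-parameter sketch (the scalar model and learning rate are illustrative assumptions) shows the adapted parameter serving as the task representation:

```python
def inner_update(theta, context, lr=0.5):
    # What encoder of MAML: z = theta - lr * grad of the context MSE
    grad = sum(2.0 * (theta * x - y) * x for x, y in context) / len(context)
    return theta - lr * grad

meta_theta = 0.0                            # meta-learned initialization
z = inner_update(meta_theta, [(1.0, 2.0)])  # context from task y = 2x
# the adapted parameter z IS the (high-dimensional) task representation;
# the How decoder simply plugs it back into the network
```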
Rusu et al. (2019) proposed Latent Embedding Optimization (LEO), which combines the encoder/decoder structure with the idea of inner-loop fine-tuning from MAML. The latent task embedding is first sampled from a Gaussian distribution whose mean and variance are generated by averaging the outputs of a relation network applied to encoded pairs of context points. Task-dependent weights can then be sampled from a decoder conditioned on this embedding. The final task representation is obtained by a few steps of gradient descent in the latent space, and the final task-specific weights are decoded from it.
In Fast Context Adaptation via Meta-Learning (CAVIA) (Zintgraf et al., 2019), a neural network model takes a context vector as an additional input. The context vector is inferred from the context data by a few steps of gradient descent from a fixed initialization, while the network parameters are held fixed. A context-dependent model is then returned by the How decoder.
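A sketch of CAVIA-style context adaptation: only the low-dimensional context vector is updated in the inner loop while the shared parameters stay fixed. The scalar model, learning rate, and step count are illustrative assumptions:

```python
def predict(theta, phi, x):
    # the network takes the context vector phi as an additional input
    return theta * x + phi

def infer_phi(theta, context, steps=5, lr=0.25):
    phi = 0.0  # fixed initialization phi_0
    for _ in range(steps):
        grad = sum(2.0 * (predict(theta, phi, x) - y)
                   for x, y in context) / len(context)
        phi -= lr * grad  # only phi adapts; theta is frozen
    return phi

theta = 1.0                                        # shared parameter
phi = infer_phi(theta, [(1.0, 3.0), (2.0, 4.0)])   # context from y = x + 2
```

Keeping the adapted state low-dimensional is the design point: the per-task "memory" is just `phi`, while all capacity lives in the shared, task-agnostic parameters.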
Table 1 in Appendix A summarizes how these methods can be seen as instances of the What & How framework. Under this framework, we can separate the task-specific parameters (the task representations produced by the What encoder) from the task-agnostic meta parameters of the encoder and decoder.
3.2 Continual Meta Learning
In order to train these meta learning models, one normally has to sample data from multiple tasks at the same time during training. However, this is not feasible in a continual learning scenario, where tasks are encountered sequentially and only a single task is presented to the agent at any moment. As a result, the meta models (What & How functions) themselves are prone to catastrophic forgetting. Hence, the second necessary component of our framework is to apply continual learning methods to stabilize the learning of meta parameters. In general, any continual learning method that can be adapted to consolidate memory at every iteration instead of at every task switch can be applied in our framework, such as Online EWC (Schwarz et al., 2018)
and Memory Aware Synapses (MAS)
(Aljundi et al., 2018). In order to highlight the effect of explicit task inference for task-agnostic continual learning, we choose a particular method called Bayesian Gradient Descent (BGD) (Zeno et al., 2018) to implement our framework. We show that by applying BGD to the meta-level models (the What and How functions), the network can continually learn a sequence of tasks that are impossible to learn when BGD is applied to the bottom-level model directly.
Formally, let $\theta$ be the vector of meta parameters (for instance, the initial weights in MAML). We model its distribution by a factorized Gaussian $q(\theta) = \prod_i \mathcal{N}(\theta_i; \mu_i, \sigma_i^2)$. Given a context dataset and the current observations, the meta loss $L$ is defined as the loss of the task-specific model on the current observations. With the meta loss defined, it is then possible to optimize $(\mu, \sigma)$ using the BGD update rules, derived from the online variational Bayes rule and the reparametrization trick ($\theta = \mu + \sigma \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$):
$$\mu_i \leftarrow \mu_i - \eta\,\sigma_i^2\,\mathbb{E}_{\epsilon}\!\left[\frac{\partial L}{\partial \theta_i}\right], \qquad \sigma_i \leftarrow \sigma_i\sqrt{1 + \left(\frac{\sigma_i}{2}\,\mathbb{E}_{\epsilon}\!\left[\frac{\partial L}{\partial \theta_i}\epsilon_i\right]\right)^2} - \frac{\sigma_i^2}{2}\,\mathbb{E}_{\epsilon}\!\left[\frac{\partial L}{\partial \theta_i}\epsilon_i\right]$$ (1)
where $\partial L / \partial \theta_i$ is the gradient of the meta loss with respect to the sampled parameters and $\eta$ is a learning rate. The expectations are computed via the Monte Carlo method:
$$\mathbb{E}_{\epsilon}\!\left[\frac{\partial L}{\partial \theta_i}\right] \approx \frac{1}{K}\sum_{k=1}^{K}\left.\frac{\partial L}{\partial \theta_i}\right|_{\theta = \mu + \sigma \epsilon_k}, \qquad \epsilon_k \sim \mathcal{N}(0, I)$$ (2)
An intuitive interpretation of the BGD learning rules is that weights with smaller uncertainty are more important for the knowledge accumulated so far, and thus should change more slowly in the future in order to preserve the learned skills.
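A scalar sketch of the BGD update in Eqs. (1)-(2); the quadratic loss, learning rate, and number of Monte Carlo samples are illustrative assumptions:

```python
import random

def bgd_step(mu, sigma, grad_fn, eta=1.0, k=10, rng=random):
    # Monte Carlo estimates of E[dL/dtheta] and E[dL/dtheta * eps]
    g_sum, ge_sum = 0.0, 0.0
    for _ in range(k):
        eps = rng.gauss(0.0, 1.0)
        g = grad_fn(mu + sigma * eps)  # reparametrization theta = mu + sigma*eps
        g_sum += g
        ge_sum += g * eps
    e_g, e_ge = g_sum / k, ge_sum / k
    # Eq. (1): mean and std updates
    mu = mu - eta * sigma ** 2 * e_g
    sigma = (sigma * (1.0 + (sigma * e_ge / 2.0) ** 2) ** 0.5
             - sigma ** 2 * e_ge / 2.0)
    return mu, sigma

random.seed(0)
grad = lambda theta: 2.0 * (theta - 1.0)  # gradient of L = (theta - 1)^2
mu, sigma = 0.0, 0.1
for _ in range(300):
    mu, sigma = bgd_step(mu, sigma, grad)
```

Two properties are visible after training: `mu` drifts toward the loss minimum at 1, and `sigma` shrinks, which in turn shrinks the effective learning rate `eta * sigma**2`; this is exactly the "certain weights move slowly" behavior described above. Note also that the update keeps `sigma` strictly positive by construction.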
3.3 Instantiation of the Framework
Using the What & How framework, one can compose arbitrarily many continual meta-learning methods. To show that the framework is independent of any particular implementation, we propose two such instances by adapting previous continual learning methods to this meta-learning setting.
MetaCoG
Context-dependent gating of subspaces (He and Jaeger, 2018), parameters (Mallya and Lazebnik, 2018), or units (Serra et al., 2018) of a single network has proven effective at alleviating catastrophic forgetting. Recently, Masse et al. (2018) showed that combining context-dependent gating with a synaptic stabilization method can achieve even better performance than using either method alone. Therefore, we explore the use of context-dependent masks as our task representations, and define the task-specific model as the subnetwork selected by these masks.
At every time step $t$, we infer the latent masks $\phi_t$ from the context dataset $C_t$ by one or a few steps of gradient descent on an inner-loop loss function $\ell$ with respect to $\phi$:
$$\phi_t = \phi_0 - \alpha \left.\nabla_{\phi}\,\ell\big(C_t, \theta \odot \sigma(\phi)\big)\right|_{\phi = \phi_0}$$ (3)
where $\phi_0$ is a fixed initial value of the mask variables, $\sigma$ is an elementwise sigmoid function that ensures the masks lie in $(0, 1)$, and $\odot$ denotes elementwise multiplication. In general, $\ell$ can be any objective function. For instance, for a regression task, one can use the mean squared error with an $L_1$ regularization that enforces sparsity of the masks:
$$\ell\big(C_t, \theta \odot \sigma(\phi)\big) = \frac{1}{|C_t|}\sum_{(x, y) \in C_t}\big(f_{\theta \odot \sigma(\phi)}(x) - y\big)^2 + \lambda \lVert \sigma(\phi) \rVert_1$$ (4)
The resulting masks are then used to gate the base network parameters $\theta$ in order to make a context-dependent prediction: $\hat{y}_t = f_{\theta \odot \sigma(\phi_t)}(x_t)$. Once the ground truth $y_t$ is revealed, we can define the meta loss as the loss of the masked network on the current data and optimize the distribution of the task-agnostic meta variable $\theta$ by BGD.
The intuition here is that the parameters $\theta$ of the base network should allow fast adaptation of the masks $\phi$. Since the context-dependent gating mechanism is trained in a meta-learning fashion, we call this particular instance of our framework Meta Context-dependent Gating (MetaCoG). We note that while we drew our inspiration from the idea of selecting a subnetwork using the masks, in the resulting algorithm $\phi$ rather plays the role of modulating the parameters (in practice, we noticed that the entries of $\sigma(\phi)$ do not necessarily converge to 0 or 1).
Note that the inner-loop loss $\ell$ used to infer the context variable does not have to be the same as the meta loss $L$. In fact, one can choose any auxiliary loss function for $\ell$, as long as it is informative about the current task.
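A toy sketch of mask inference with two gated parameters; gradients are taken by finite differences to keep the example dependency-free, and the base parameters, learning rate, and context below are illustrative assumptions:

```python
import math

def sig(v):
    # elementwise sigmoid, keeping each mask in (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def masked_predict(theta, phi, x):
    # gate each base parameter elementwise: theta_i * sigmoid(phi_i)
    g0, g1 = theta[0] * sig(phi[0]), theta[1] * sig(phi[1])
    return g0 * x + g1

def context_loss(theta, phi, context):
    return sum((masked_predict(theta, phi, x) - y) ** 2
               for x, y in context) / len(context)

def infer_masks(theta, context, steps=5, lr=0.5, eps=1e-5):
    phi = [0.0, 0.0]  # fixed initialization phi_0
    for _ in range(steps):
        grad = []
        for j in range(len(phi)):
            bumped = list(phi)
            bumped[j] += eps
            grad.append((context_loss(theta, bumped, context)
                         - context_loss(theta, phi, context)) / eps)
        phi = [p - lr * g for p, g in zip(phi, grad)]  # inner-loop step
    return phi

theta = [2.0, 2.0]              # shared base parameters (fixed here)
ctx = [(1.0, 2.0), (2.0, 4.0)]  # context from task y = 2x
phi = infer_masks(theta, ctx)
```

Only `phi` is adapted per context; in the full method, `theta` would be the meta variable whose distribution is consolidated with BGD.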
MetaELLA
The second instance of the framework is based on the GO-MTL model (Kumar and Daume III, 2012) and the Efficient Lifelong Learning Algorithm (ELLA) (Ruvolo and Eaton, 2013). In a multitask learning setting, ELLA solves each task $t$ with a task-specific parameter vector $\theta^{(t)}$ obtained by linearly combining a shared dictionary of latent model components $L$ with a task-specific coefficient vector $s^{(t)}$: $\theta^{(t)} = L s^{(t)}$, where $L$ and $s^{(t)}$ are learned by minimizing the objective
$$\min_{L,\, \{s^{(t)}\}} \frac{1}{T}\sum_{t=1}^{T}\left\{\frac{1}{n_t}\sum_{i=1}^{n_t}\mathcal{L}\big(f_{L s^{(t)}}(x_i^{(t)}),\, y_i^{(t)}\big) + \mu \lVert s^{(t)} \rVert_1\right\} + \lambda \lVert L \rVert_F^2$$ (5)
Instead of directly optimizing this objective, we adapt ELLA to the What & How framework by considering $s^{(t)}$ as the task representation returned by a What encoder and $L$ as the parameters of a How decoder. The objective can then be minimized in a continual meta-learning fashion. At time $t$, the current task representation $s_t$ is obtained by minimizing the inner-loop loss with one or a few steps of gradient descent from a fixed initial value $s_0$.
Similar to MetaCoG, the parametric distribution over the meta variable $L$ can then be optimized with respect to the meta loss using BGD.
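The decomposition $\theta = L s$ can be sketched directly: a shared dictionary is reused across tasks, and only the small coefficient vector is inferred per task by inner-loop gradient descent. The dictionary, linear model form, and step sizes below are illustrative assumptions:

```python
def combine(L, s):
    # How decoder: theta = L @ s (dictionary columns mixed by s)
    return [sum(L[i][j] * s[j] for j in range(len(s)))
            for i in range(len(L))]

def infer_s(L, context, steps=50, lr=0.1):
    # What encoder: gradient descent on the context loss from s_0 = 0
    s = [0.0] * len(L[0])
    for _ in range(steps):
        for x, y in context:
            theta = combine(L, s)
            err = theta[0] * x + theta[1] - y  # linear model theta_0*x + theta_1
            grad = [2.0 * err * (L[0][j] * x + L[1][j]) for j in range(len(s))]
            s = [sj - lr * gj for sj, gj in zip(s, grad)]
    return s

L = [[1.0, 1.0],
     [1.0, -1.0]]                          # shared dictionary, 2 components
s = infer_s(L, [(0.0, 0.0), (1.0, 2.0)])   # context from task y = 2x
theta = combine(L, s)                      # recovered task-specific weights
```

In the full method, `L` is the meta variable consolidated with BGD, so the dictionary accumulates reusable components while each task only needs its own small `s`.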
4 Related Work
Continual learning has seen a surge in popularity in the last few years, with multiple approaches proposed to address the problem of catastrophic forgetting. These approaches can be largely categorized into the following types (Parisi et al., 2019). Rehearsal-based methods rely on solving the multitask objective, where the performance on all previous tasks is optimized concurrently. They focus on techniques either to efficiently store data points from previous tasks (Robins, 1995; Lopez-Paz et al., 2017) or to train a generative model that produces pseudo-examples (Shin et al., 2017); the stored and generated data can then be used to approximate the losses of previous tasks. Structure-based methods exploit modularity to reduce interference, localizing updates to a subset of weights. Rusu et al. (2016) proposed to learn a new module for each task with lateral connections to previous modules, which prevents catastrophic forgetting and maximizes forward transfer. In Golkar et al. (2019), pruning techniques are used to minimize the growth of the model with each observed task. Finally, regularization-based methods draw inspiration from Bayesian learning and can be seen as using the posterior obtained after learning a sequence of tasks as a prior that regularizes learning of the new task. These methods differ in how the prior, and implicitly the posterior, is parametrized and approximated. For instance, Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) relies on a Gaussian approximation with a diagonal covariance, estimated using a Laplace approximation. Variational Continual Learning (VCL) (Nguyen et al., 2017) learns the parameters of the Gaussian directly, relying on the reparametrization trick. Ritter et al. (2018) achieved a better approximation with a block-diagonal covariance.
While effective at preventing forgetting, the above-mentioned methods either rely on knowledge of task boundaries or require task labels to select a submodule for adaptation and prediction, and hence cannot be directly applied in the task-agnostic scenario considered here. To circumvent this issue, Kirkpatrick et al. (2017) used the Forget-Me-Not (FMN) process (Milan et al., 2016) to detect task boundaries and combined it with EWC to consolidate memory when the task switches. However, FMN requires a generative model that computes exact data likelihoods, which prevents it from scaling to complex tasks. More recently, Aljundi et al. (2019) proposed a rehearsal-based method that selects a finite number of data points representative of all data seen so far. This method, like BGD, assumes that it is possible to learn one model that fits all previous data, neglecting the scenario where different tasks conflict with each other, and hence does not allow task-specific adaptation.
Meta-learning, or learning to learn (Schmidhuber, 1987), assumes simultaneous access to multiple tasks during meta-training and focuses on the ability of the agent to quickly learn a new task at meta-test time. As with continual learning, different families of approaches exist for meta-learning. Memory-based methods (Santoro et al., 2016) rely on a recurrent model (the optimizer), such as an LSTM, to learn a history-dependent update function for the lower-level learner (the optimizee). Andrychowicz et al. (2016) trained an LSTM to replace the stochastic gradient descent algorithm by minimizing the sum of the losses of the optimizees on multiple prior tasks.
Ravi and Larochelle (2016) use an LSTM-based meta-learner to transform the gradients and losses of the base learners on every new example into the final updates of the model parameters. Metric-based methods learn an embedding space in which other tasks can be solved efficiently. Koch (2015) trained siamese networks to tell whether two images are similar by converting the distance between their feature embeddings into the probability that they belong to the same class.
Vinyals et al. (2016) proposed the matching network, which improves the embeddings of a test image and the support images by taking the entire support set as context input. The approaches discussed in Section 3.1 instead belong to the family of optimization-based meta-learning methods. In this domain, the most relevant work is that of Nagabandi et al. (2019), who studied fast adaptation in a nonstationary environment by learning an ensemble of networks, one for each task. Unlike our work, they used MAML for the initialization of new networks in the ensemble rather than for task inference. A drawback of this approach is that the size of the ensemble grows over time and is unbounded, and hence can become memory-consuming when there are many tasks.
5 Experiments
To demonstrate the effectiveness of the proposed framework, we compare BGD and Adam (Kingma and Ba, 2014) to three instances of this framework on a range of task-agnostic continual learning experiments. The first instance simply applies BGD to the meta variable of MAML instead of to the task-specific parameters; we refer to this method as MetaBGD. The other two are MetaCoG and MetaELLA, introduced in Section 3.3. In all experiments, we present tasks consecutively, and each task lasts for a fixed number of iterations. At every iteration, a minibatch of samples from the training set of the current task is presented to the learners, and the context data used for task inference is simply the previous minibatch together with its corresponding targets. At the end of the entire training process, we test the learners' performance on the test set of every task, given a minibatch of training data from that task as context data. Since the meta learners take five gradient steps in the inner loop for task inference, we also allow BGD and Adam to take five gradient steps on the context data before testing their performance. We focus on analyzing the main results in this section; experimental details are provided in Appendix B.
5.1 Sine Curve Regression
We start with a regression problem commonly used in the meta-learning literature, where each task corresponds to a sine curve to be fitted. In this experiment, we randomly generate 10 sine curves and present them sequentially to a 3-layer MLP. Figure 2 shows the mean squared error (MSE) on each task after the entire training process. Adam and BGD perform significantly worse than the meta learners, even though they take the same number of gradient steps on the context data. The reason for this large gap in performance becomes evident in Figure 3, which shows the learners' predictions on test data of the last task and the third task, given the corresponding context data. All learners can solve the last task almost perfectly, but when the context data of the third task is provided, the meta learners quickly remember it, while BGD and Adam are unable to adapt to a task they have previously learned.
5.2 Label-Permuted MNIST
A classical experiment for continual learning is permuted MNIST (Goodfellow et al., 2013; Kirkpatrick et al., 2017), where a new task is created by shuffling the pixels of all images in MNIST by a fixed permutation. In this experiment, however, we shuffle the classes in the labels instead of the pixels in the images. The reason for this change is to ensure that it is not possible to guess the current task simply based on the images. In this way, we can test whether our framework is able to quickly adapt its behavior according to the current context. Five tasks are created with this method and are presented sequentially to an MLP.
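Constructing a label-permuted task is a one-liner; the sketch below (the class count and seeds are arbitrary choices) makes explicit that the inputs are untouched and only the label mapping changes between tasks, so the task cannot be guessed from the images:

```python
import random

def make_task(num_classes, seed):
    # each task is a fixed random permutation applied to the labels
    rng = random.Random(seed)
    perm = list(range(num_classes))
    rng.shuffle(perm)
    return perm

def relabel(labels, perm):
    # images stay the same; only the targets are remapped
    return [perm[y] for y in labels]

tasks = [make_task(10, seed) for seed in range(5)]  # five tasks
labels = [3, 1, 4, 1, 5]
relabeled = relabel(labels, tasks[0])
```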
We test the learners' classification accuracy on each task at the end of the entire learning process, using a minibatch from the training set as context data. As can be seen in Figure 4, all learners perform well on the last task. However, BGD and Adam show chance-level accuracy on previous tasks because they cannot infer the task from the context data, while the meta learners are able to recall those tasks within 5 inner-loop updates on the context data.
Figure 5 displays the accuracy curve when we play the tasks again, for 10 iterations each, after the first training process. The tasks are presented in the same order as they were learned for the first time. It is clear that one iteration after the task changes, when the correct context data is available, the meta learners are able to recall the current task to almost perfection, while Adam and BGD have to relearn each task from scratch.
5.3 Omniglot
We have seen in the previous two experiments that when the task information is hidden from the network, continual learning is impossible without task inference. In this experiment, we show that our framework is advantageous even when the task identity is reflected in the inputs. To this end, we test our framework and BGD on sequential learning of handwritten characters from the Omniglot dataset (Lake et al., 2015), which consists of 50 alphabets with varying numbers of characters per alphabet. Considering every alphabet a task, we present 10 alphabets sequentially to a convolutional neural network and train it to classify 20 characters from each alphabet.
Most continual learning methods (including BGD) require a multi-head output in order to overcome catastrophic forgetting in this setup. The idea is to use a separate output layer per task, to compute the error only on the current head during training, and to make predictions only from the current head during testing. The task index therefore has to be available in order to select the correct head.
Unlike these previous works, we evaluate our framework with a single head of 200 output units. Figure 6 summarizes the results of this experiment. For every task, we measure its test accuracy twice: once immediately after the task is learned (before any forgetting), and once after all ten tasks are learned. Our framework with a single head achieves results comparable to BGD with multiple heads, whereas BGD with a single head completely forgets previous tasks.
6 Conclusions
In this work, we showed that when the target of a learning problem depends on both the inputs and the context, catastrophic forgetting is inevitable unless the model is conditioned on the context. A framework that infers the task information explicitly from context data was proposed to resolve this problem. The framework separates the inference process into two components: one for representing What task is expected to be solved, and the other for describing How to solve the given task. In addition, our framework unifies many meta-learning methods and thus establishes a connection between continual learning and meta-learning, leveraging the advantages of both.
There are two ways of viewing the proposed framework. From the meta-learning perspective, it addresses the continual meta-learning problem by applying continual learning techniques to the meta variables, allowing meta knowledge to accumulate over an extended period. From the continual learning perspective, it addresses the task-agnostic continual learning problem by explicitly inferring the task when task information is not available; this allows us to shift the focus of continual learning from less forgetting to faster remembering, given the right context.
For future work, we would like to test this framework for reinforcement learning tasks in partially observable environments, where the optimal policy has to depend on the hidden task or context information.
References

 Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
 Aljundi et al. [2019] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Online continual learning with no task boundaries. arXiv preprint arXiv:1903.08671, 2019.
 Andrychowicz et al. [2016] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
 Cassandra et al. [1994] Anthony R Cassandra, Leslie Pack Kaelbling, and Michael L Littman. Acting optimally in partially observable stochastic domains. In AAAI, 1994.
 Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Modelagnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
 Garnelo et al. [2018] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
 Golkar et al. [2019] Siavash Golkar, Michael Kagan, and Kyunghyun Cho. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019.
 Goodfellow et al. [2013] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradientbased neural networks. arXiv preprint arXiv:1312.6211, 2013.

 He and Jaeger [2018] Xu He and Herbert Jaeger. Overcoming catastrophic interference using conceptor-aided backpropagation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1al7jg0b.
 Hochreiter et al. [2001] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In Proceedings of the International Conference on Artificial Neural Networks, ICANN '01, pages 87–94, London, UK, 2001. Springer-Verlag. ISBN 3540424865. URL http://dl.acm.org/citation.cfm?id=646258.684281.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka GrabskaBarwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

 Koch [2015] Gregory Koch. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.
 Kumar and Daume III [2012] Abhishek Kumar and Hal Daume III. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.
 Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Lopez-Paz et al. [2017] David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.

 Mallya and Lazebnik [2018] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
 Masse et al. [2018] Nicolas Y. Masse, Gregory D. Grant, and David J. Freedman. Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44):E10467–E10475, 2018. ISSN 00278424. doi: 10.1073/pnas.1803839115. URL https://www.pnas.org/content/115/44/E10467.
 McCloskey and Cohen [1989] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
 Milan et al. [2016] Kieran Milan, Joel Veness, James Kirkpatrick, Michael Bowling, Anna Koop, and Demis Hassabis. The forgetmenot process. In Advances in Neural Information Processing Systems, pages 3702–3710, 2016.

 Monahan [1982] George E Monahan. State of the art—a survey of partially observable Markov decision processes: theory, models, and algorithms. Management Science, 28(1):1–16, 1982.
 Nagabandi et al. [2019] Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyxAfnA5tm.
 Nguyen et al. [2017] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
 Parisi et al. [2019] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
 Ravi and Larochelle [2016] Sachin Ravi and Hugo Larochelle. Optimization as a model for fewshot learning. International Conference on Learning Representations, 2016.
 Ritter et al. [2018] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems, pages 3738–3748, 2018.
 Robins [1995] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
 Rusu et al. [2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 Rusu et al. [2019] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Metalearning with latent embedding optimization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJgklhAcK7.
 Ruvolo and Eaton [2013] Paul Ruvolo and Eric Eaton. Ella: An efficient lifelong learning algorithm. In International Conference on Machine Learning, pages 507–515, 2013.
 Santoro et al. [2016] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Metalearning with memoryaugmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
 Schmidhuber [1987] Juergen Schmidhuber. Evolutionary principles in selfreferential learning. Diploma thesis, TU Munich, 1987.
 Schwarz et al. [2018] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka GrabskaBarwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pages 4535–4544, 2018.
 Serra et al. [2018] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pages 4555–4564, 2018.
 Shin et al. [2017] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
 Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
 Zeno et al. [2018] Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Bayesian gradient descent: Online variational bayes learning with increased robustness to catastrophic forgetting and weight pruning. arXiv preprint arXiv:1803.10123, 2018.
 Zintgraf et al. [2019] Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. CAVIA: Fast context adaptation via metalearning, 2019.
Appendix A Meta Learning as Task Inference
Methods  What encoder (task representation)  How decoder (task-specific model)
MAML  parameters adapted by inner-loop gradient steps on the context data  network parametrized by the adapted parameters
CNP  mean-aggregated embedding of the context pairs  decoder taking the context embedding as an additional input
LEO  latent code sampled from the relation-network output and refined by gradient steps in latent space  weight decoder applied to the final latent code
CAVIA  context vector adapted by gradient steps from a fixed initialization  network taking the context vector as an additional input
Appendix B Experiment Details
B.1 Model Configurations
In all experiments, the number of Monte Carlo samples $K$ in Eq. 2 is set to 10. In MetaCoG, the initial value of the masks is 0. In MetaELLA, the dictionary has a fixed number of components, and the initial value of the latent code is a fixed constant. The Adam baseline was trained with the default hyperparameters recommended by Kingma and Ba [2014]. The hyperparameters of the other methods were tuned by a Bayesian optimization algorithm and are summarized in Table 2. Error bars for all experiments are standard deviations computed from 10 trials with different random seeds.
B.2 Sine Curve Regression
The amplitudes and phases of the sine curves are sampled uniformly from fixed ranges. For both training and testing, input data points are sampled uniformly from a fixed interval. The training and test sets for each task contain 5000 and 100 points, respectively. Each sine curve is presented for a fixed number of iterations, and a minibatch of 128 data points is provided at every iteration for training. The 3-layer MLP has 50 units with a nonlinearity in each hidden layer.
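The task generator can be sketched as follows. The amplitude and phase ranges shown are the ones commonly used in the meta-learning literature and are assumptions here, since the exact values were lost from this copy of the text:

```python
import math
import random

def sample_sine_task(rng, amp_range=(0.1, 5.0), phase_range=(0.0, math.pi)):
    # assumed ranges; each task is a sine curve y = A * sin(x + b)
    amp = rng.uniform(*amp_range)
    phase = rng.uniform(*phase_range)
    return amp, phase

def sample_batch(amp, phase, rng, n=128, x_range=(-5.0, 5.0)):
    # a training minibatch of (x, y) pairs from one task
    xs = [rng.uniform(*x_range) for _ in range(n)]
    ys = [amp * math.sin(x + phase) for x in xs]
    return xs, ys

rng = random.Random(0)
amp, phase = sample_sine_task(rng)
xs, ys = sample_batch(amp, phase, rng)
```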
B.3 Label-Permuted MNIST
All tasks are presented for 1000 iterations and the minibatch size is 128. The network used in this experiment is an MLP with 2 hidden layers of 300 ReLU units.
B.4 Omniglot
We use 20 characters from each alphabet for classification. Out of the 20 images of each character, 15 are used for training and 5 for testing. Each alphabet is trained for 200 epochs with minibatch size 128. The CNN used in this experiment has two convolutional layers, both with 40 channels and kernel size 5. ReLU and max pooling are applied after each convolutional layer, and the output is passed to a fully connected layer of size 300 before the final layer.
Hyperparameters  Sine Curve  Label-Permuted MNIST  Omniglot 

MetaBGD  0.0419985  0.45  0.207496 
0.0368604  0.050  0.0341916  
5.05646  1.0  15.8603  
MetaCoG  0.849212  10.000  5.53639 
0.0426860  0.034  0.0133221  
1.48236e6  1.000e5  3.04741e6  
38.6049  1.0  80.0627  
MetaELLA  0.0938662  0.400  0.346027 
0.0298390  0.010  0.0194483  
0.0216156  0.010  0.0124128  
42.6035  1.0  24.7476  
BGD  0.0246160  0.060  0.0311284 
20.3049  1.0  16.2192 