1 Solving the Continual Learning Problem
A long-held goal of AI is to build agents capable of operating autonomously for long periods. Such agents must incrementally learn and adapt to a changing environment while maintaining memories of what they have learned before, a setting known as lifelong learning (Thrun, 1994, 1996). In this paper we explore a variant called continual learning (Ring, 1994; Lopez-Paz & Ranzato, 2017). Continual learning assumes that the learner is exposed to a sequence of tasks, where each task is a sequence of experiences from the same distribution. We would like to develop a solution in this setting by discovering notions of tasks without supervision while learning incrementally after every experience. This is challenging because in standard offline single-task and multi-task learning (Caruana, 1997) it is implicitly assumed that the data is drawn from an i.i.d. stationary distribution. Neural networks tend to struggle whenever this is not the case (Goodrich, 2015).
Over the years, solutions to the continual learning problem have been largely driven by prominent conceptualizations of the issues faced by neural networks. One popular view is catastrophic forgetting (interference) (McCloskey & Cohen, 1989), in which the primary concern is the lack of stability in neural networks, and the main solution is to limit the extent of weight sharing across experiences by focusing on preserving past knowledge (Kirkpatrick et al., 2017; Zenke et al., 2017; Lee et al., 2017). Another popular and more complex conceptualization is the stability-plasticity dilemma (Carpenter & Grossberg, 1987). In this view, the primary concern is the balance between network stability (to preserve past knowledge) and plasticity (to rapidly learn the current experience). Recently proposed techniques focus on balancing limited weight sharing with some mechanism to ensure fast learning (Li & Hoiem, 2016; Riemer et al., 2016; Lopez-Paz & Ranzato, 2017; Rosenbaum et al., 2018; Lee et al., 2018; Serrà et al., 2018). In this paper, we extend this view by noting that for continual learning over an unbounded number of distributions, we need to consider the stability-plasticity trade-off in both the forward and backward directions in time (Figure 1A).
The transfer-interference trade-off proposed in this paper (section 2) makes explicit the connection between controlling the extent of weight sharing and managing the continual learning problem. The key difference in perspective from past conceptualizations of continual learning is that we are not just concerned with current interference and transfer with respect to past examples, but also with the dynamics of transfer and interference moving forward as we learn. This new view of the problem leads to a natural meta-learning perspective on continual learning: we would like to learn to change our parameters so that we shape the dynamics of transfer and interference in a general sense, ideally not just in the past but in the future as well. To the extent that our meta-learning into the future generalizes, this should effectively make it easier for our model to perform continual learning in non-stationary settings. We achieve this by building off past work on experience replay (Murre, 1992; Lin, 1992; Robins, 1995), which has been a mainstay for solving non-stationary problems with neural networks (Mnih et al., 2015). We argue that the reason lies in replay's ability to stabilize learning without sacrificing weight sharing. In this work, we propose a novel meta-experience replay (MER) algorithm that combines experience replay with optimization-based meta-learning (section 3) as a first step towards modeling this perspective. Moreover, our experiments (sections 4, 5, and 6) confirm our theory. MER shows great promise across a variety of supervised continual learning and continual reinforcement learning settings.
In contrast to past work on meta-learning for few-shot learning (Santoro et al., 2016; Vinyals et al., 2016; Ravi & Larochelle, 2016; Finn et al., 2017) and reinforcement learning across successive tasks (Al-Shedivat et al., 2018), we are not simply trying to improve the speed of learning on new data, but to do so in a way that preserves knowledge of past data and generalizes to future data. Critically, our approach does not rely on any notion of tasks provided by humans to the model, and in most of the settings we explore it must detect the concept of new tasks without supervision.
2 The Transfer-Interference Trade-off for Continual Learning
At an instant in time with parameters θ and loss L, we can define operational measures of transfer and interference between two arbitrary distinct examples (x1, y1) and (x2, y2) while training with SGD (throughout our paper we discuss ideas in terms of the supervised learning problem formulation; extensions to the reinforcement learning formulation are straightforward, and we provide more details in the appendix). We can say that transfer occurs when:
∂L(x1, y1)/∂θ · ∂L(x2, y2)/∂θ > 0,    (1)
where · is the dot product operator. This implies that learning example (x1, y1) will, without repetition, improve performance on example (x2, y2), and vice versa (Figure 1B). On the other hand, we can say that interference occurs when:
∂L(x1, y1)/∂θ · ∂L(x2, y2)/∂θ < 0.    (2)
Here, in contrast, learning example (x1, y1) will lead to unlearning (i.e. forgetting) of example (x2, y2), and vice versa (Figure 1C). The potential for transfer is maximized when weight sharing is maximized, while the potential for interference is minimized when weight sharing is minimized (Appendix A).
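To make these definitions concrete, the sign test in equations 1 and 2 can be checked directly for any differentiable model. The following is a minimal sketch of our own (not code from the paper), using a linear model with squared loss; all names are illustrative:

```python
import numpy as np

def loss_grad(w, x, y):
    # Gradient of the squared loss L(x, y) = 0.5 * (w . x - y)^2 w.r.t. w
    return (w @ x - y) * x

def gradient_alignment(w, ex1, ex2):
    """Dot product of per-example gradients:
    > 0 means transfer (eq. 1), < 0 means interference (eq. 2)."""
    return float(loss_grad(w, *ex1) @ loss_grad(w, *ex2))

w = np.zeros(2)
# Two examples whose gradients point the same way: learning one helps the other.
aligned = ((np.array([1.0, 0.0]), 1.0), (np.array([2.0, 0.0]), 1.0))
# Same input, conflicting labels: gradients oppose, so learning one hurts the other.
opposed = ((np.array([1.0, 0.0]), 1.0), (np.array([1.0, 0.0]), -1.0))
print(gradient_alignment(w, *aligned))  # positive: transfer
print(gradient_alignment(w, *opposed))  # negative: interference
```

Any model with per-example gradients admits the same test; the linear model is used only to keep the sketch short.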
Past solutions for the stability-plasticity dilemma in continual learning operate in a simplified temporal context whereby learning is divided into two phases: all past experiences are lumped together as old memories, and the data that is currently being learned qualifies as new learning. In this setting, the goal is to minimize the interference projecting backward in time, and many of the approaches that succeed in this respect do so by reducing the degree of weight sharing in one form or another. In the appendix we explain how our baseline approaches (Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017) each fit within this paradigm.
The important issue with this perspective, however, is that the system is not done learning: more learning will occur in the future, and what the future may bring is largely unknown. This makes it incumbent upon us to do nothing that would undermine the network's ability to learn effectively in an uncertain future. This consideration leads us to extend the temporal horizon of the stability-plasticity problem forward, turning it, more generally, into a continual learning problem that we label as solving the Transfer-Interference Trade-off (Figure 1A). Specifically, it is important not only to reduce backward interference from our current point in time, but to do so in a manner that does not limit our ability to learn in the future. This more general perspective acknowledges a subtlety in the problem: the issue of weight sharing across tasks arises both backward and forward in time. With this temporally symmetric perspective, the transfer-interference trade-off becomes clear. The weight sharing across tasks that enables transfer to improve future performance must not disrupt performance on what has come previously. This brings us to our central point: if we mitigate interference by reducing weight sharing, we are likely to worsen the network's capacity for transfer into the future, since transfer depends on precisely that same weight sharing. What is now the future will eventually become the past, and performance must be good at all points in time. As such, our work adopts a meta-learning perspective on the continual learning problem: we would like to learn to learn each example in a way that generalizes to other examples from the overall distribution.
3 A System for Learning to Learn without Forgetting
In typical offline supervised learning, we can express our optimization objective over the stationary distribution of (x, y) pairs within the dataset D:
θ = arg min_θ E_(x,y)∼D [ L(x, y) ],    (3)
where L is the loss function, which can be selected to fit the problem. If we would like to maximize transfer and minimize interference, we can imagine it would be useful to add an auxiliary loss to the objective to bias the learning process in that direction. Considering equations 1 and 2, one obviously beneficial choice would be to also directly consider the gradients of the loss function evaluated at randomly chosen datapoints. If we could maximize the dot products between gradients at these different points, it would directly encourage the network to share parameters where gradient directions align and keep parameters separate where interference is caused by gradients in opposite directions. So, ideally we would like to optimize for the following objective:

θ = arg min_θ E_[(x1,y1),(x2,y2)]∼D [ L(x1, y1) + L(x2, y2) − α (∂L(x1, y1)/∂θ) · (∂L(x2, y2)/∂θ) ],    (4)
where (x1, y1) and (x2, y2) are randomly sampled unique data points and α controls the strength of the gradient-alignment term. In our work we attempt to design a continual learning system that optimizes for this objective. However, multiple problems must be addressed to implement this kind of learning process in practical settings. The first is that continual learning deals with learning over a continuous non-stationary stream of data. We address this by following past work and implementing an experience replay module that augments online learning so that we can approximately optimize over the stationary distribution of all examples seen so far. Another practical problem is that the gradient of this loss depends on the second derivative of the loss function, which is expensive to compute. We address this issue by indirectly approximating the objective through a first-order Taylor series expansion, using an online meta-learning algorithm with minimal computational overhead.
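For intuition, the regularized objective of equation 4 can be evaluated directly on a pair of examples. This is a toy sketch of our own (not the paper's code) using a linear model with squared loss; the value of α and all names are illustrative assumptions:

```python
import numpy as np

def loss(w, x, y):
    return 0.5 * (w @ x - y) ** 2

def grad(w, x, y):
    return (w @ x - y) * x

def pair_objective(w, ex1, ex2, alpha=0.1):
    """Equation 4 for one sampled pair: sum of the two losses minus the
    alpha-weighted gradient dot product (which rewards transfer)."""
    align = float(grad(w, *ex1) @ grad(w, *ex2))
    return loss(w, *ex1) + loss(w, *ex2) - alpha * align

w = np.zeros(2)
ex1 = (np.array([1.0, 0.0]), 1.0)
ex2 = (np.array([2.0, 0.0]), 1.0)
# Aligned gradients lower the objective relative to the plain summed loss.
print(pair_objective(w, ex1, ex2))  # 0.5 + 0.5 - 0.1 * 2.0 = 0.8
```

Note that differentiating this objective with respect to w would involve second derivatives of the loss, which is exactly the cost the first-order approximation in the text avoids.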
3.1 Experience Replay
Learning objective: The continual lifelong learning setting poses a challenge for the optimization of neural networks as examples arrive one by one in a non-stationary stream. Instead, we would like our network to optimize over the stationary distribution of all examples seen so far. Experience replay (Lin, 1992; Murre, 1992) is an old technique that remains a central component of deep learning systems attempting to learn in non-stationary settings, and we adopt here the conventions of recent work (Zhang & Sutton, 2017; Riemer et al., 2017b) leveraging this approach. The central feature of experience replay is keeping a memory M of examples seen, which is interleaved with the training of the current example with the goal of making training more stable. As a result, experience replay approximates the objective in equation 3 to the extent that M approximates D:

θ = arg min_θ E_(x,y)∼M [ L(x, y) ].    (5)
M has a current size M_size and maximum size M_max. In our work, we update the buffer with reservoir sampling (Appendix D). This ensures that at every timestep, each of the N examples seen so far has an equal probability M_max/N of being in the buffer. The content of the buffer resembles a stationary distribution over all examples seen to the extent that the items stored capture the variation of past examples. Following standard practice in offline learning, we train by randomly sampling a batch B from the distribution captured by M.

Prioritizing the current example: the variant of experience replay we explore differs from offline learning in that the current example plays a special role: it is always interleaved with the examples sampled from the replay buffer. This is because, before we proceed to the next example, we want to make sure our algorithm has the ability to optimize for the current example (particularly if it is not added to the memory). Over N examples seen, this still implies that each example has been trained on as the current example with equal probability 1/N per step. We provide an algorithm further detailing how experience replay is used in this work in the appendix (algorithm 4).
Concerns about storing examples: Obviously, it is not scalable to store every experience seen in memory. As such, in this work we focus on showing that we can achieve greater performance than baseline techniques when each approach is provided with only a small memory buffer.
3.2 Combining Experience Replay with Optimization-Based Meta-Learning
First-order meta-learning: One of the most popular meta-learning algorithms to date is Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). MAML is an optimization-based meta-learning algorithm with nice properties such as the ability to approximate any learning algorithm and the ability to generalize well to learning data outside of the previous distribution (Finn & Levine, 2017). One aspect of MAML that limits its scalability is the need to explicitly compute second derivatives. The authors proposed a variant called first-order MAML (FOMAML), defined by ignoring the second-derivative terms, to address this issue, and surprisingly found that it achieved very similar performance. Recently, this phenomenon was explained by Nichol & Schulman (2018), who noted through a Taylor expansion that the two algorithms approximately optimize the same loss function. Nichol & Schulman (2018) also proposed an algorithm, Reptile, that efficiently optimizes approximately the same objective while not requiring the data to be split into training and testing sets for each task learned, as MAML does. Reptile is implemented by optimizing across s batches of data sequentially with an SGD-based optimizer and learning rate α. After training on these batches, we take the initial parameters θ0 from before training and update them to θ0 ← θ0 + β(θs − θ0), where θs are the parameters after training on the s batches and β is the learning rate for the meta-learning update. The process repeats for each series of s batches (algorithm 2). As shown in terms of gradients in (Nichol & Schulman, 2018), Reptile approximately optimizes for the following objective over a set of s batches:
θ = arg min_θ E_B1,…,Bs∼D [ Σ_{i=1}^{s} ( L(Bi) − Σ_{j=1}^{i−1} α (∂L(Bi)/∂θ) · (∂L(Bj)/∂θ) ) ],    (6)
where B1, …, Bs are batches sampled from D. This is similar to our motivation in equation 4 to the extent that gradients produced on these batches approximate samples from the stationary distribution.
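Concretely, one Reptile meta-step under this convention can be sketched as follows (a toy version of our own with a linear squared-loss model; the function names and model are illustrative, not a reference implementation):

```python
import numpy as np

def sgd_on_batch(w, batch, alpha):
    # Inner loop: plain SGD over the examples of one batch
    for x, y in batch:
        w = w - alpha * (w @ x - y) * x  # gradient of 0.5*(w.x - y)^2
    return w

def reptile_update(w0, batches, alpha=0.1, beta=0.5):
    """Train through the batches sequentially from w0, then move w0
    part of the way toward the final weights: w0 + beta*(w_s - w0)."""
    w = w0.copy()
    for batch in batches:
        w = sgd_on_batch(w, batch, alpha)
    return w0 + beta * (w - w0)

w0 = np.zeros(2)
batches = [[(np.array([1.0, 0.0]), 1.0)]] * 3
w1 = reptile_update(w0, batches)
print(w1)  # first coordinate has moved toward the solution w[0] = 1
```

Because the batches are processed sequentially, later gradient steps implicitly depend on earlier ones, which is where the gradient dot-product terms of equation 6 come from in the Taylor-expansion analysis.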
The MER learning objective: In this work, we modify the Reptile algorithm to properly integrate it with an experience replay module, facilitating continual learning while maximizing transfer and minimizing interference. As we describe in more detail during the derivation in the appendix, achieving the Reptile objective in an online setting where examples are provided sequentially is nontrivial and is in part only achievable because of our sampling strategies for both the buffer and batch. Following our remarks about experience replay from the prior section, this allows us to optimize for the following objective in a continual learning setting using our proposed MER algorithm:
θ = arg min_θ E_B1,…,Bs∼M [ Σ_{i=1}^{s} Σ_{j=1}^{k} ( L(x_ij, y_ij) − Σ_{(q,r)≺(i,j)} α (∂L(x_ij, y_ij)/∂θ) · (∂L(x_qr, y_qr)/∂θ) ) ],    (7)

where each batch Bi consists of k examples (x_i1, y_i1), …, (x_ik, y_ik) and (q, r) ≺ (i, j) ranges over the examples processed earlier in the sequence.
The MER algorithm: MER maintains an experience replay style memory M with reservoir sampling and at each time step draws s batches, each including k random samples from the buffer, to be trained alongside the current example. Each of the examples within each batch is treated as its own Reptile batch of size 1, with an inner-loop Reptile meta-update after that batch is processed. We then apply the Reptile meta-update again in an outer loop across the s batches. We provide further details for MER in algorithm 1. This procedure approximates the objective of equation 7 when the within-batch meta-learning rate is set to 1, so that the s batches together behave as one long sequence of Reptile updates over individual examples.
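A compact sketch of one MER step under these conventions might look as follows. This is our own simplified rendering on a toy squared-loss model, not algorithm 1 verbatim; alpha, beta, and gamma play the inner, within-batch, and across-batch learning-rate roles, and all other names and values are assumptions:

```python
import random
import numpy as np

def mer_step(w, current_example, buffer, s=2, k=3,
             alpha=0.1, beta=1.0, gamma=0.3):
    """One MER step: draw s batches of up-to-k buffer samples plus the
    current example; treat every example as a Reptile batch of size 1,
    apply a within-batch meta-update (rate beta), then an across-batch
    meta-update (rate gamma)."""
    w_outer = w.copy()
    for _ in range(s):
        batch = random.sample(buffer, min(k, len(buffer))) + [current_example]
        w_inner = w.copy()
        for x, y in batch:
            w = w - alpha * (w @ x - y) * x      # SGD on a single example
        w = w_inner + beta * (w - w_inner)       # within-batch Reptile update
    return w_outer + gamma * (w - w_outer)       # across-batch Reptile update

random.seed(0)
buffer = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 1.0)]
current = (np.array([1.0, 1.0]), 2.0)
w = np.zeros(2)
before = 0.5 * (w @ current[0] - current[1]) ** 2
w = mer_step(w, current, buffer)
after = 0.5 * (w @ current[0] - current[1]) ** 2
print(before, after)
```

Interleaving the current example into every batch is what gives it the extra priority discussed below, since it is revisited s times per step while each buffer sample is typically seen once.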
Controlling the degree of regularization: In light of our ideal objective in equation 4, we can see that using an SGD batch size of 1 has an advantage over larger batches because it allows the second-derivative information conveyed to the algorithm to be fine-grained at the example level. Another reason to use example-level effective batches is that, for a given number of samples drawn from the buffer, we maximize the effective s from equation 6. In equation 6, the typical offline learning loss has a weighting proportional to s, while the regularizer term that maximizes transfer and minimizes interference has a weighting proportional to s(s − 1)/2. This implies that by maximizing the effective s we can put more weight on the regularization term. We found that for a fixed number of examples drawn from M, we consistently performed better converting them to a long list of individual samples than we did using proper batches as in (Nichol & Schulman, 2018) for few-shot learning.
Prioritizing current learning: To ensure strong regularization, we would like the number of batches s processed in a Reptile update to be large, large enough that experience replay alone would start to overfit to the contents of M. As such, we also need to make sure we provide enough priority to learning the current example, particularly because we may not store it in M. To achieve this in algorithm 1, we sample s separate batches from M that are processed sequentially, each interleaved with the current example. In the appendix we also outline two additional variants of MER with very similar properties, in that they effectively approximate the same objective. In one, we process a single big batch containing the sampled memories along with multiple copies of the current example (algorithm 5). In the other, we process a single memory batch with a special, larger learning rate for the current example (algorithm 6).
Unique properties: In the end, our approach amounts to an easy-to-implement and computationally efficient extension of SGD, applied to an experience replay buffer by leveraging the machinery of past work on optimization-based meta-learning. However, the emergent regularization on learning is quite different from those previously considered. Past work on optimization-based meta-learning has enabled fast learning on incoming data without considering past data. Meanwhile, past work on experience replay only focused on stabilizing learning by approximating stationary conditions, without altering model parameters to change the dynamics of transfer and interference.
4 Evaluation for Supervised Continual Lifelong Learning
To test the efficacy of MER, we compare it to relevant baselines for continual learning of many supervised tasks from (Lopez-Paz & Ranzato, 2017) (see the appendix for in-depth descriptions):

Online: represents the online learning performance of a model trained straightforwardly one example at a time on the incoming non-stationary training data by simply applying SGD.

Independent: an independent predictor per task, with the number of hidden units reduced in proportion to the number of tasks. When useful, it can be initialized by cloning the last predictor.

Task Input: has the same architecture as Online, but with a dedicated input layer per task.

EWC: Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) modifies online learning by regularizing the loss to avoid catastrophic forgetting.

GEM: Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) is an approach for making efficient use of episodic storage by following gradients on incoming examples to the maximum extent while altering them so that they do not interfere with past memories. An independent ad hoc analysis is performed to alter each incoming gradient. In contrast to MER, nothing generalizable is learned across examples about how to alter gradients.
We follow (Lopez-Paz & Ranzato, 2017) and consider final retained accuracy across all tasks after training sequentially on all tasks as our main metric for comparing approaches. Moving forward, we refer to this metric as retained accuracy (RA). In order to reveal more characteristics of the learning behavior, we also report the learning accuracy (LA), the average accuracy for each task directly after it is learned. Additionally, we report the backward transfer and interference (BTI), the average change in accuracy from when a task is learned to the end of training. A highly negative BTI reflects catastrophic forgetting. Forward transfer and interference (Lopez-Paz & Ranzato, 2017) is only applicable to one task we explore, so we provide details in Appendix I.
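These three metrics can be computed from a single task-by-task accuracy matrix. The sketch below assumes the common convention acc[i, j] = accuracy on task j after finishing training on task i (our own illustration of the definitions above, not the paper's evaluation code):

```python
import numpy as np

def continual_metrics(acc):
    """RA: mean final accuracy over all tasks; LA: mean accuracy on each
    task right after learning it; BTI: mean change from just-learned to
    final. With these definitions, BTI = RA - LA."""
    T = acc.shape[0]
    ra = float(acc[-1].mean())
    la = float(np.mean([acc[i, i] for i in range(T)]))
    bti = float(np.mean([acc[-1, j] - acc[j, j] for j in range(T)]))
    return ra, la, bti

# Toy 2-task run: task 0 is partly forgotten while task 1 is learned.
acc = np.array([[0.9, 0.1],
                [0.5, 0.8]])
print(continual_metrics(acc))  # (0.65, 0.85, -0.2)
```

The identity BTI = RA − LA is a useful sanity check when reading the result tables below.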
Question 1 How does MER perform on supervised continual learning benchmarks?
To address this question we consider two continual learning benchmarks from (Lopez-Paz & Ranzato, 2017). MNIST Permutations is a variant of MNIST first proposed in (Kirkpatrick et al., 2017) where each task is transformed by a fixed permutation of the MNIST pixels. As such, the input distribution of each task is unrelated. MNIST Rotations is another variant of MNIST proposed in (Lopez-Paz & Ranzato, 2017) where each task contains digits rotated by a fixed angle between 0 and 180 degrees. We follow the standard benchmark setting from (Lopez-Paz & Ranzato, 2017), using a modest memory buffer of size 5120 to learn 1000 sampled examples across each of 20 tasks. We provide detailed information about our architectures and hyperparameters in the appendix.
In Table 1 we report results on these benchmarks in comparison to our baseline approaches. Clearly GEM outperforms our other baselines, but our approach adds significant value over GEM in terms of retained accuracy on both benchmarks. MER achieves this by striking a superior balance between transfer and interference with respect to past and future data. MER displays the best adaptation to incoming tasks while also providing very strong retention of knowledge when learning future tasks. EWC and a task-specific input layer also lead to gains over standard online learning in terms of retained accuracy. However, they fall quite far below the performance of approaches that make use of episodic storage. While EWC does not store examples, by storing the Fisher information for each task it accrues more incremental resources than the episodic storage approaches.
Table 1:

| Model       | Rotations RA | Rotations LA | Rotations BTI | Permutations RA | Permutations LA | Permutations BTI |
|-------------|--------------|--------------|----------------|------------------|------------------|-------------------|
| Online      | 46.40 ± 0.78 | 78.70 ± 0.52 | −32.30 ± 0.97  | 55.42 ± 0.65     | 69.18 ± 0.99     | −13.76 ± 1.19     |
| Independent | 60.74 ± 4.55 | 60.74 ± 4.55 | —              | 55.80 ± 4.79     | 55.80 ± 4.79     | —                 |
| Task Input  | 79.98 ± 0.66 | 81.04 ± 0.34 | −1.06 ± 0.59   | 80.46 ± 0.80     | 81.20 ± 0.52     | −0.74 ± 1.10      |
| EWC         | 57.96 ± 1.33 | 78.38 ± 0.72 | −20.42 ± 1.60  | 62.32 ± 1.34     | 75.64 ± 1.18     | −13.32 ± 2.24     |
| GEM         | 87.12 ± 0.44 | 86.34 ± 0.09 | +0.80 ± 0.42   | 82.50 ± 0.42     | 80.30 ± 0.80     | +2.20 ± 0.57      |
| MER         | 89.56 ± 0.11 | 87.62 ± 0.16 | +1.94 ± 0.18   | 85.50 ± 0.16     | 86.36 ± 0.21     | −0.86 ± 0.21      |
Question 2 How do the performance gains from MER vary as a function of the buffer size?
To make progress towards the greater goals of lifelong learning, we would like our algorithm to make the most of even a modest buffer. This is because in extremely large scale settings it is unrealistic to assume a system can store a large percentage of previous examples in memory. As such, we would like to compare MER to GEM, which is known to perform well with an extremely small memory buffer (Lopez-Paz & Ranzato, 2017). We consider a buffer size of 500, which is over 10 times smaller than the standard setting on these benchmarks. Additionally, we consider a buffer size of 200, matching the smallest setting explored in (Lopez-Paz & Ranzato, 2017). This setting corresponds to an average storage of 1 example for each combination of task and class. We report our results in Table 2. The benefits of MER grow as the buffer becomes smaller. In the smallest setting, MER provides more than a 10% boost in retained accuracy on both benchmarks.
Table 2:

| Model | Buffer | Rotations RA | Rotations LA | Rotations BTI | Permutations RA | Permutations LA | Permutations BTI |
|-------|--------|--------------|--------------|----------------|------------------|------------------|-------------------|
| GEM   | 5120   | 87.12 ± 0.44 | 86.34 ± 0.09 | +0.80 ± 0.42   | 82.50 ± 0.42     | 80.30 ± 0.80     | +2.20 ± 0.57      |
|       | 500    | 72.08 ± 1.29 | 82.92 ± 0.49 | −10.82 ± 1.22  | 69.26 ± 0.66     | 80.28 ± 0.52     | −11.02 ± 0.71     |
|       | 200    | 66.88 ± 0.72 | 85.46 ± 0.38 | −18.58 ± 0.48  | 55.42 ± 1.10     | 79.84 ± 1.01     | −24.42 ± 1.10     |
| MER   | 5120   | 89.56 ± 0.11 | 87.62 ± 0.16 | +1.94 ± 0.18   | 85.50 ± 0.16     | 86.36 ± 0.21     | −0.86 ± 0.21      |
|       | 500    | 81.82 ± 0.52 | 87.28 ± 0.50 | −5.46 ± 0.13   | 77.40 ± 0.38     | 83.64 ± 0.18     | −6.24 ± 0.34      |
|       | 200    | 77.24 ± 0.47 | 84.82 ± 0.25 | −7.58 ± 0.26   | 72.74 ± 0.46     | 82.18 ± 0.23     | −9.44 ± 0.34      |
Table 3:

| Model  | Buffer | Many Perm. RA | Many Perm. LA | Many Perm. BTI | Omniglot RA  | Omniglot LA  | Omniglot BTI |
|--------|--------|----------------|----------------|-----------------|---------------|---------------|---------------|
| Online | 0      | 32.62 ± 0.43   | 51.68 ± 0.65   | −19.06 ± 0.86   | 4.36 ± 0.37   | 5.38 ± 0.18   | −1.02 ± 0.33  |
| EWC    | 0      | 33.46 ± 0.46   | 51.30 ± 0.81   | −17.84 ± 1.15   | 4.63 ± 0.14   | 9.43 ± 0.63   | −4.80 ± 0.68  |
| GEM    | 5120   | 56.76 ± 0.29   | 59.66 ± 0.46   | −2.92 ± 0.52    | 18.03 ± 0.15  | 3.86 ± 0.09   | +14.19 ± 0.19 |
|        | 500    | 32.14 ± 0.50   | 55.66 ± 0.53   | −23.52 ± 0.87   | —             | —             | —             |
| MER    | 5120   | 61.84 ± 0.25   | 60.40 ± 0.36   | +1.44 ± 0.21    | 75.23 ± 0.52  | 69.12 ± 0.83  | +6.11 ± 0.62  |
|        | 500    | 47.40 ± 0.35   | 65.18 ± 0.20   | −17.78 ± 0.39   | 32.05 ± 0.69  | 28.78 ± 0.91  | +3.27 ± 1.04  |
Question 3 How effective is MER at dealing with increasingly non-stationary settings?
Another larger goal of lifelong learning is to enable continual learning with only relatively few examples per task. This setting is particularly difficult because we have less data to characterize each class and our distribution is increasingly non-stationary over a fixed amount of training. We would like to explore how various models perform in this kind of setting. To do this we consider two new benchmarks. Many Permutations is a variant of MNIST Permutations that has 5 times more tasks (100 total) and 5 times fewer training examples per task (200 each). Meanwhile, we also explore the Omniglot (Lake et al., 2011) benchmark, treating each of the 50 alphabets as a task (see the appendix for experimental details). Following multi-task learning conventions, 90% of the data is used for training and 10% for testing (Yang & Hospedales, 2017). Overall there are 1623 different character classes. We learn each class and task sequentially.
We report continual learning results using these new datasets in Table 3. The effect of efficiently using episodic storage on Many Permutations becomes even more pronounced as the setting becomes more non-stationary: GEM and MER both achieve nearly double the performance of EWC and online learning. We also see that increasingly non-stationary settings lead to a larger performance gain for MER over GEM. Gains are quite significant for Many Permutations and remarkable for Omniglot, which is even more non-stationary, with slightly fewer examples per task; there MER nearly quadruples the performance of baseline techniques. Considering the poor performance of online learning and EWC, it is natural to question whether the examples were learned in the first place. We experimented with using as many as 100 gradient descent steps per incoming example to ensure each example is learned when first seen; however, due to the extremely non-stationary setting, no run of any variant we tried surpassed 5.5% retained accuracy. GEM also has major deficits for learning on Omniglot that are resolved by MER, which achieves far better performance when it comes to quickly learning the current task. GEM maintains a buffer using a recent-item-based sampling strategy and thus cannot deal with non-stationarity within the task nearly as well as reservoir sampling. Additionally, we found that the optimization based on the buffer was significantly less effective and less reliable, as the quadratic program fails for many hyperparameter values that lead to non-positive-definite matrices. Unfortunately, we could not get GEM to consistently converge on Omniglot for a memory size of 500 (significantly less than the number of classes), while MER handles it well. In fact, MER greatly outperforms GEM with an order of magnitude smaller buffer.
5 Evaluation for Continual Reinforcement Learning
Question 4 Can MER improve a DQN with ER in continual reinforcement learning settings?
We considered the evaluation of MER in a continual reinforcement learning setting where the environment is highly non-stationary. In order to produce these non-stationary environments in a controlled way suitable for our experimental purposes, we utilized different arcade games provided by Tasfi (2016). In our experiments we used Catcher and Flappy Bird, two simple but interesting enough environments (see Appendix K.1 for a detailed description). For the purposes of our explanation, we will call each set of fixed game-dependent parameters a task (agents are not provided task information, forcing them to identify changes in game play on their own). The multi-task setting is then built by introducing changes in these parameters, resulting in non-stationarity across tasks. As in our other experiments, each agent is evaluated based on its performance over time on all tasks. Therefore, the reported performances provide a measurement of adaptation to the current task as well as of retention of knowledge acquired from other tasks. Our model uses a standard DQN setting, originally developed for Atari (Mnih et al., 2015). We refer to Appendix K.2 for the details of the implementation.
In Catcher, we obtain different tasks by incrementally increasing the pellet velocity a total of 6 times during training. In Flappy Bird, the different tasks are obtained by incrementally reducing the separation between the upper and lower pipes a total of 6 times during training. In Figure 3 we show the performance in Catcher when trained sequentially on 6 different tasks for 25k frames each, to a maximum of 150k frames, evaluated at each point in time on all 6 tasks. Under these non-stationary conditions, a DQN using MER performs consistently better than the standard DQN with an experience replay buffer. Figure 3.1 shows the average score for the first task. Taking inspiration from how humans perform, in the last stages of training we hope that a player that obtains good results on later tasks will also obtain good results on the first tasks, insofar as the first tasks are subsumed in the later ones. For example, in Catcher, later tasks have the pellet move faster, so an agent that masters them should also do well on the first task. However, DQN significantly forgets how to perform the first task, i.e., catching slowly moving pellets. In marked contrast, DQN-MER exhibits minimal or no forgetting on the first set of episodes after being trained on the rest of the tasks. This behavior is intuitive in the sense that we would expect forward transfer to happen naturally in this setting, as seems to be the case for human players. Similar behavior appears in Flappy Bird: DQN-MER becomes a Platinum player on the first task during the period in which it is learning the third task, a more difficult environment in which the pipe gap is noticeably smaller (see Appendix K.4). We find that DQN-MER exhibits the kind of natural learning patterns we would expect from humans in a curriculum learning setting for these games, while a standard DQN struggles both to generalize as the game changes and to retain what it has learned over time.
6 Further Analysis of the Approach
In this section we would like to dive deeper into how MER works. To do so we run additional detailed experiments across our three MNIST-based continual learning benchmarks.
Question 5 Does MER lead to a shift in the distribution of gradient dot products?
We would like to directly verify that MER achieves our motivation in equation 7
and results in significant changes in the distribution of gradient dot products between new incoming examples and past examples over the course of learning when compared to experience replay (ER). For these experiments, we maintain a history of all examples seen that is totally separate from our notion of memory buffers that only include a partial history of examples. Every time we receive a new example we use the current model to extract a gradient direction and we also randomly sample five examples from the previous history. We save the dot products of the incoming example gradient with these five past example gradients and consider the mean of the distribution of dot products seen over the course of learning for each model. We run this experiment on the best hyperparamater setting for both our ER model and our MER model with one batch per example for fair comparison. Each model is evaluated five times over the course of learning. We report mean and standard deviations of the mean gradient dot product across runs in Table
4. We can thus verify that a very significant and reproducible difference in the mean gradient encountered is seen for MER in comparison to ER alone. This difference alters the learning process making incoming examples on average result in slight transfer rather than significant interference. This analysis confirms the desired effect of the objective function in equation 7. For these tasks there are enough similarities that our metalearning generalizes very well into the future. We should also expect it to perform well in the case of drastic domain shifts like other metalearning algorithms driven by SGD alone (Finn & Levine, 2017).Model  MNIST Permutations  MNIST Rotations  Many Permutations 

ER  0.569 0.077  1.652 0.082  1.280 0.078 
MER  +0.042 0.017  +0.017 0.007  +0.131 0.027 
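The probing procedure described above can be sketched as follows. Here `grad_fn`, the toy loss, and the example format are hypothetical stand-ins for the model's per-example gradient computation, not the paper's actual code:

```python
import random

def mean_grad_dot(grad_fn, theta, new_example, history, n_samples=5, rng=None):
    """Mean dot product between the incoming example's gradient and the
    gradients of randomly sampled past examples (the probe behind Table 4)."""
    rng = rng or random.Random(0)
    g_new = grad_fn(theta, new_example)
    past = rng.sample(history, min(n_samples, len(history)))
    dots = []
    for ex in past:
        g_old = grad_fn(theta, ex)
        dots.append(sum(a * b for a, b in zip(g_new, g_old)))
    return sum(dots) / len(dots)
```

A positive mean indicates transfer on average between new and past examples; a negative mean indicates interference.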
Question 6: What components of MER are most important?
We would like to further analyze our proposed MER model to understand which components add the most value and when. In particular, we want to understand how powerful our proposed variant of ER is on its own and how much value is added by combining it with meta-learning. In the appendix we provide detailed results for ablated baselines on the MNIST lifelong learning benchmarks. Our version of ER consistently provides gains over GEM on its own, but the techniques perform very comparably when we also maintain the GEM buffer with reservoir sampling. Additionally, we see that adding meta-learning to ER consistently creates additional value in our experiments. In fact, meta-learning appears to provide increasing value with smaller buffer sizes.
7 Conclusion
In this paper we have cast a new perspective on the problem of continual learning in terms of a fundamental tradeoff between transfer and interference. Exploiting this perspective, we have developed a new algorithm, Meta-Experience Replay (MER), that is well suited to general purpose continual learning problems. We have demonstrated that MER regularizes the objective of experience replay so that gradients on incoming examples are more likely to transfer to, and less likely to interfere with, past examples. The result is a general purpose solution to continual learning that outperforms strong baselines on both supervised continual learning benchmarks and continual learning in non-stationary reinforcement learning environments. Techniques for continual learning have largely been driven by different conceptualizations of the fundamental problem encountered by neural networks. We hope that the transfer-interference tradeoff can be a useful problem view for future work to exploit, with MER as a first successful example.
Acknowledgments
We would like to thank Pouya Bashivan, Christopher Potts, Dan Jurafsky, and Joshua Greene for their input and support of this work. This research was supported by the MIT-IBM Watson AI Lab, and is based in part upon work supported by the Stanford Data Science Initiative and by the NSF under Grant No. BCS-1456077 and NSF Award IIS-1514268.
References
 Al-Shedivat et al. (2018) Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in non-stationary and competitive environments. ICLR, 2018.
 Bengio et al. (2015) Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.

 Carpenter & Grossberg (1987) Gail A. Carpenter and Stephen Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37(1):54–115, 1987.
 Caruana (1997) Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997. doi: 10.1023/A:1007379606734. URL http://dx.doi.org/10.1023/A:1007379606734.
 Finn & Levine (2017) Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Goodrich (2015) Benjamin Frederick Goodrich. Neuron clustering for mitigating catastrophic forgetting in supervised and reinforcement learning. 2015.
 Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.
 Lake et al. (2011) Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the Cognitive Science Society, volume 33, 2011.
 Lee et al. (2018) Jeongtae Lee, Jaehong Yun, Sungju Hwang, and Eunho Yang. Lifelong learning with dynamically expandable networks. ICLR, 2018.

 Lee et al. (2017) Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, pp. 4652–4662, 2017.
 Li & Hoiem (2016) Zhizhong Li and Derek Hoiem. Learning without forgetting. In European Conference on Computer Vision, pp. 614–629. Springer, 2016.
 Lin (1992) Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3–4):293–321, 1992.
 Lopez-Paz & Ranzato (2017) David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. NIPS, 2017.
 McCloskey & Cohen (1989) Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation, 24:109–165, 1989.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Murre (1992) Jacob MJ Murre. Learning and categorization in modular neural networks. 1992.
 Nichol & Schulman (2018) Alex Nichol and John Schulman. Reptile: a scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999, 2018.
 Ravi & Larochelle (2016) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.

 Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H. Lampert. iCaRL: Incremental classifier and representation learning. CVPR, 2017.
 Riemer et al. (2015) Matthew Riemer, Sophia Krasikov, and Harini Srinivasan. A deep learning and knowledge transfer based architecture for social media user characteristic determination. In Proceedings of the Third International Workshop on Natural Language Processing for Social Media, pp. 39–47, 2015.
 Riemer et al. (2016) Matthew Riemer, Elham Khabiri, and Richard Goodwin. Representation stability as a regularizer for improved text analytics transfer learning. arXiv preprint arXiv:1704.03617, 2016.
 Riemer et al. (2017a) Matthew Riemer, Michele Franceschini, Djallel Bouneffouf, and Tim Klinger. Generative knowledge distillation for general purpose function compression. NIPS 2017 Workshop on Teaching Machines, Robots, and Humans, 5:30, 2017a.
 Riemer et al. (2017b) Matthew Riemer, Tim Klinger, Michele Franceschini, and Djallel Bouneffouf. Scalable recollections for continual lifelong learning. arXiv preprint arXiv:1711.06761, 2017b.
 Ring (1994) Mark Bishop Ring. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, 1994.
 Robins (1995) Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
 Rosenbaum et al. (2018) Clemens Rosenbaum, Tim Klinger, and Matthew Riemer. Routing networks: Adaptive selection of non-linear functions for multi-task learning. ICLR, 2018.
 Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850, 2016.
 Serrà et al. (2018) Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.
 Tasfi (2016) Norman Tasfi. PyGame Learning Environment. https://github.com/ntasfi/PyGame-Learning-Environment, 2016.
 Thrun (1994) Sebastian Thrun. Lifelong learning perspective for mobile robot control. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems, volume 1, pp. 23–30, 1994.
 Thrun (1996) Sebastian Thrun. Is learning the nth thing any easier than learning the first? Advances in neural information processing systems, pp. 640–646, 1996.
 Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.
 Vitter (1985) Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.

 Yang & Hospedales (2017) Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. ICLR, 2017.
 Zenke et al. (2017) Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pp. 3987–3995, 2017.
 Zhang & Sutton (2017) Shangtong Zhang and Richard S Sutton. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.
Appendix A The Connection Between Weight Sharing and the Transfer-Interference Tradeoff
In this section we would like to generalize our interpretation of a large set of different weight sharing schemes, including (Riemer et al., 2015; Bengio et al., 2015; Rosenbaum et al., 2018; Serrà et al., 2018), and explain how the concept of weight sharing impacts the dynamics of transfer (equation 1) and interference (equation 2). We will assume that we have a total parameter space θ that can be used by our network at any point in time. However, it is not a requirement that all parameters are actually used at all points in time. So, we can consider two specific instances in time: one where we receive data point (x1, y1) and leverage parameters θ1, and another where we receive data point (x2, y2) and leverage parameters θ2. θ1 and θ2 are both subsets of θ, and critically the overlap between these subsets influences the possible extent of transfer and interference when training on either data point.
First let us consider two extremes. In the first extreme, imagine θ1 and θ2 are entirely non-overlapping, so that θ1 ∩ θ2 = ∅. On the positive side, this means that our solution has no potential for interference between the examples. On the other hand, there is no potential for transfer either. At the other extreme, we can imagine that θ1 = θ2 = θ. In this case, the potential for both transfer and interference is maximized, as gradients with respect to every parameter have the possibility of a non-zero dot product with each other.
From this discussion it is clear that both the extreme of full weight sharing and the extreme of no weight sharing have value depending on the relationship between data points. What we would really like for continual learning is to have a system that learns when to share weights and when not to on its own. To the extent that our learning about weight sharing generalizes, this should allow us to find an optimal solution to the transferinterference tradeoff.
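This relationship between parameter overlap and gradient interaction can be illustrated with a minimal sketch (the masks and gradient values below are illustrative, not from the paper): when two examples use disjoint parameter subsets, their effective gradient dot product is zero, so neither transfer nor interference is possible; with full sharing, the dot product can take any sign.

```python
def masked_grad_dot(g1, g2, mask1, mask2):
    """Gradient interaction between two examples that each use only a subset
    of the full parameter vector (mask = 1 where the parameter is used).
    A zero result means no potential for transfer or interference."""
    return sum(a * m1 * b * m2 for a, b, m1, m2 in zip(g1, g2, mask1, mask2))
```

With disjoint masks the result is always 0; with identical full masks it is the ordinary gradient dot product from equations 1 and 2.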
Appendix B Further Descriptions and Comparisons with Baseline Algorithms
Independent: originally reported in (Lopez-Paz & Ranzato, 2017), this is the performance of an independent predictor per task, which has the same architecture but with the number of hidden units reduced in proportion to the number of tasks. The independent predictor can either be initialized randomly or clone the last trained predictor, depending on which leads to better performance.
EWC: Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is an algorithm that modifies online learning by regularizing the loss to avoid catastrophic forgetting based on the importance of parameters in the model, as measured by their Fisher information. EWC follows the catastrophic forgetting view of the continual learning problem by promoting less sharing, for new learning, of parameters that were deemed important for performance on old memories. We utilize the code provided by (Lopez-Paz & Ranzato, 2017) in our experiments. The only difference in our setting is that we provide the model one example at a time, to test true continual learning, rather than providing a batch of 10 examples at a time.
GEM: Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) is an algorithm meant to enhance the effectiveness of episodic storage based continual learning techniques by allowing the model to adapt to incoming examples using SGD as long as the gradients do not interfere with examples from each task stored in a memory buffer. If gradients interfere, leading to a decrease in the performance of a past task, a quadratic program is used to solve for the closest gradient to the original that does not have a negative dot product with the aggregate memories of any previous task. GEM is known to achieve superior performance in comparison to other recently proposed techniques that use episodic storage, such as (Rebuffi et al., 2017), making superior use of small memory buffer sizes. GEM follows similar motivation to our approach in that it also considers the intelligent use of gradient dot product information to improve supervised continual learning. As a result, it is a very strong and interesting baseline to compare with our approach. We modify the original code and benchmarks provided by (Lopez-Paz & Ranzato, 2017). Once again, the only difference in our setting is that we provide the model one example at a time to test true continual learning rather than providing a batch of 10 examples at a time.
We can consider the GEM algorithm as tailored to the stability-plasticity dilemma conceptualization of continual learning in that it looks to preserve performance on past tasks while allowing for maximal plasticity on the new task. To achieve this, GEM (Lopez-Paz & Ranzato, 2017) solves a quadratic program to find an approximate gradient g̃ that closely matches the proposed gradient g while ensuring that the following constraint holds for the gradient g_k computed on the stored memories of every previously seen task k:
(8) ⟨g̃, g_k⟩ ≥ 0 for all k < t
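For intuition, the single-constraint special case of this quadratic program has a closed-form solution: if the proposed gradient conflicts with the memory gradient, project it onto the nearest gradient whose dot product with the memory gradient is zero. This is a simplified sketch of the idea only; GEM itself solves a QP with one constraint per past task.

```python
def project_gradient(g, g_ref):
    """Single-constraint sketch of GEM's projection: if the proposed gradient
    g would increase the loss on past memories (negative dot product with the
    memory gradient g_ref), return the closest gradient with g . g_ref = 0;
    otherwise leave g unchanged."""
    dot = sum(a * b for a, b in zip(g, g_ref))
    if dot >= 0:
        return list(g)
    scale = dot / sum(b * b for b in g_ref)
    return [a - scale * b for a, b in zip(g, g_ref)]
```

The projected gradient satisfies the constraint in equation 8 by construction while staying as close as possible (in Euclidean distance) to the original proposal.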
Appendix C Reptile Algorithm
We detail the standard Reptile algorithm from (Nichol & Schulman, 2018) in algorithm 2. The sampling function randomly samples batches of a fixed size from the dataset. The SGD function applies mini-batch stochastic gradient descent over a batch of data, given a set of current parameters and a learning rate.
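A minimal sketch of one outer Reptile step, where `grad_fn` is a hypothetical stand-in for computing the gradient of the loss on a batch:

```python
def reptile_step(theta, batches, alpha, beta, grad_fn):
    """One outer step of Reptile (Nichol & Schulman, 2018): run SGD through
    the sampled batches starting from theta, then interpolate the initial
    parameters toward the adapted ones with meta-learning rate beta."""
    phi = list(theta)
    for batch in batches:
        g = grad_fn(phi, batch)                       # batch gradient
        phi = [p - alpha * gi for p, gi in zip(phi, g)]
    # Reptile meta-update: move theta toward the adapted parameters phi
    return [t + beta * (p - t) for t, p in zip(theta, phi)]
```

With beta = 1 this reduces to plain sequential SGD through the batches; beta < 1 interpolates only part of the way, which is what produces the gradient-alignment terms discussed in appendix G.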
Appendix D Details on Reservoir Sampling
Throughout this paper we refer to updates to our memory buffer. We would like to now provide details on how we update the buffer using reservoir sampling as outlined in (Vitter, 1985) (algorithm 3). Reservoir sampling solves the problem of keeping a fixed-size uniform sample over all items seen so far with equal probability, when the total number of items is not known in advance. The random integer function used by the algorithm draws an integer inclusively between the provided minimum and maximum values.
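The buffer update can be sketched as follows (function and argument names are ours, not the paper's):

```python
import random

def reservoir_update(buffer, item, n_seen, capacity, rng=random):
    """Reservoir sampling (Vitter, 1985): after n_seen items have arrived
    (counting this one), every item seen so far occupies a slot in the buffer
    with probability capacity / n_seen."""
    if len(buffer) < capacity:
        buffer.append(item)
    else:
        j = rng.randint(0, n_seen - 1)   # inclusive on both ends
        if j < capacity:
            buffer[j] = item             # evict a uniformly chosen slot
    return buffer
```

This is what allows the buffer to approximate a uniform sample of the full non-stationary stream, the property relied on in appendix G.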
Appendix E Experience Replay Algorithm
We detail our variant of experience replay in algorithm 4. This procedure closely follows recent enhancements discussed in (Zhang & Sutton, 2017; Riemer et al., 2017b;a). The sampling function randomly samples examples from the memory buffer and interleaves them with the current example to form a single batch. The SGD function applies mini-batch stochastic gradient descent over a batch of data, given a set of current parameters and a learning rate.
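A minimal sketch of one such update, with `grad_fn` a hypothetical per-example gradient routine:

```python
import random

def er_step(theta, example, buffer, k, alpha, grad_fn, rng=random):
    """One experience-replay update: interleave the incoming example with
    k - 1 examples sampled from the buffer, then take one SGD step on the
    averaged gradient of the combined batch."""
    batch = [example] + rng.sample(buffer, min(k - 1, len(buffer)))
    grads = [grad_fn(theta, ex) for ex in batch]
    avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(theta))]
    return [t - alpha * a for t, a in zip(theta, avg)]
```

Mixing old and new examples in every batch is what keeps the effective training distribution closer to i.i.d. despite the non-stationary stream.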
Appendix F The Variants of MER
We detail two additional variants of MER (algorithm 1) in algorithms 5 and 6. The sampling function takes on a slightly different meaning in each variant of the algorithm. In algorithm 1 it is used to produce batches consisting of random examples from the memory buffer together with the current example. In algorithm 5 it is used to produce one batch consisting of examples from the memory buffer and multiple copies of the current example. In algorithm 6 it is used to produce one batch consisting only of examples from the memory buffer. In contrast, the SGD function carries a common meaning across algorithms, applying stochastic gradient descent over a particular input and output, given a set of current parameters and a learning rate.
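The structure of algorithm 1 (per-example SGD nested inside within-batch and across-batch Reptile updates) can be sketched as follows; names and the batch format are ours, and buffer maintenance is omitted:

```python
def mer_step(theta, batches, alpha, beta, gamma, grad_fn):
    """Sketch of MER (algorithm 1): for each of s batches, take per-example
    SGD steps with rate alpha, then a within-batch Reptile update with rate
    beta; after all batches, an across-batch Reptile update with rate gamma."""
    theta_a = list(theta)                        # across-batch running params
    for batch in batches:
        theta_w = list(theta_a)                  # within-batch start point
        phi = list(theta_w)
        for x in batch:                          # per-example SGD steps
            g = grad_fn(phi, x)
            phi = [p - alpha * gi for p, gi in zip(phi, g)]
        # within-batch Reptile meta-update
        theta_a = [w + beta * (p - w) for w, p in zip(theta_w, phi)]
    # across-batch Reptile meta-update
    return [t + gamma * (a - t) for t, a in zip(theta, theta_a)]
```

With beta = gamma = 1 this collapses to plain SGD through every example in order, matching the special case analyzed in appendix G.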
Appendix G Deriving the Effective Objective of MER
We would like to derive the objective that Meta-Experience Replay (algorithm 1) approximates, and show that it is approximately the same objective as in algorithms 5 and 6. We follow conventions from (Nichol & Schulman, 2018) and first demonstrate what happens to the effective gradients computed by the algorithm in the most trivial case. As in (Nichol & Schulman, 2018), this allows us to extrapolate an effective gradient that is a function of the number of steps taken. We can then consider the effective loss function that results in this gradient. Before we begin, let us define the following terms from (Nichol & Schulman, 2018):
(9) gi = L′i(φi) (gradient on the i-th example, evaluated at φi)
(10) φi+1 = φi − α gi (sequence of parameter vectors produced by SGD)
(11) ḡi = L′i(φ1) (gradient of the i-th example at the initial point)
(12) H̄i = L′′i(φ1) (Hessian of the i-th example at the initial point)
(13) gi = ḡi + H̄i(φi − φ1) + O(α²) (Taylor expansion of gi about φ1)
(14) gi = ḡi − α H̄i Σj<i ḡj + O(α²) (substituting the SGD updates for φi − φ1)
In (Nichol & Schulman, 2018) they consider the effective gradient across one loop of Reptile of size k. As we have both an outer loop of Reptile applied across batches and an inner loop applied within each batch to consider, we start with a setting where the number of batches s = 2 and the number of examples per batch k = 2. Let us recall from the original paper that the gradient of Reptile with k = 2 was:
(15) gReptile = g1 + g2 = ḡ1 + ḡ2 − α H̄2 ḡ1 + O(α²)
So, we can also consider the gradient of Reptile if we had 4 examples in one big batch (algorithm 5) as opposed to 2 batches of 2 examples:
(16) gReptile = ḡ1 + ḡ2 + ḡ3 + ḡ4 − α (H̄2 ḡ1 + H̄3 (ḡ1 + ḡ2) + H̄4 (ḡ1 + ḡ2 + ḡ3)) + O(α²)
Now we can consider the case for MER, where, extending the notation of algorithm 1, we define the parameter values as follows:
(17) 
(18) 
(19) 
(20) 
(21) 
(22) 
(23) 
(24) 
The gradient of Meta-Experience Replay can thus be defined analogously to the gradient of Reptile as:
(25) 
By simply applying Reptile from equation 15, we can derive the value of the parameters after updating with Reptile within the first batch in terms of the original parameters:
(26) 
(27) 
We can express the parameters after the second batch in terms of the initial point by considering a Taylor expansion following the Reptile paper:
(28) 
Then, substituting, we can express the gradients of the second batch in terms of the initial parameters:
(29) 
We can then rewrite these terms by taking Taylor expansions with respect to the initial parameters:
(30) 
Taking another Taylor expansion we find that we can transform our expression for the Hessian:
(31) 
We can analogously transform our expression for the remaining gradient term:
(32) 
Substituting these expressions back in terms of the initial parameters:
(33) 
(34) 
Finally, we have all of the terms we need, and we can then derive an expression for the MER gradient:
(35) 
This equation is quite interesting and very similar to equation 16. As we would like to approximate the same objective, we can remove one hyperparameter from our model by setting γ = 1. This yields:
(36) 
Indeed, with γ set equal to 1, we have shown that the gradient of MER is the same as one loop of Reptile with a number of steps equal to the total number of examples in all batches of MER (algorithm 5), if the current example is mixed in with the same proportion. If we include the current example once in each of the s meta-replay batches, it gets the same overall priority in both cases, which is s times larger than that of a random example drawn from the buffer. As such, we can also optimize an equivalent gradient using algorithm 6, because it uses a scaling factor to increase the priority of the gradient given to the current example.
While γ = 1 is an interesting special case of MER in algorithm 1, in general we find it can be useful to set β to a value smaller than 1. In fact, in our experiments we consider the case when β is smaller than 1 and γ = 1. The success of this approach makes sense because the higher order terms in the Taylor expansion, which reflect the mismatch between parameters across replay batches, create variance in the learning process. By setting β to a value below 1, we can reduce our comparative weighting on promoting inter-batch gradient similarities relative to intra-batch gradient similarities. It was noted in the Reptile paper that the following equality holds if the examples and their order are random:
(37) E[H̄2 ḡ1] = E[H̄1 ḡ2] = (1/2) E[∂(ḡ1 · ḡ2)/∂φ1]
In our work, to make sure this equality holds in an online setting, we must take multiple precautions as noted in the main text. The issue is that examples are received in a non-stationary sequence, so when applied in a continual learning setting the order is not totally random or arbitrary as in the original Reptile work. We address this by maintaining our buffer with reservoir sampling, which ensures that every example seen so far has an equal probability of occupying any particular slot in the buffer. We also randomly select over these elements to form a batch. As this makes the order largely arbitrary, to the extent that our buffer includes all examples seen, we approximate the random offline setting from the original Reptile paper. As such, we can view the gradients in equation 16 and equation 36 as leading to approximately the following objective function:
(38) 
This is precisely equation 7 in the main text.
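As a concrete sanity check of this correspondence under simplifying assumptions: for one-dimensional quadratic losses the Taylor expansions above are exact (no higher order terms), and averaging the two-step Reptile gradient over both example orderings recovers exactly the gradient of L1 + L2 − (α/2) L1′ L2′, the pairwise form of the gradient-alignment objective. The coefficient α/2 per ordered pair corresponds to taking the expectation over random orderings, as in equation 37.

```python
def reptile_grad(theta, c1, c2, alpha):
    """Exact two-step Reptile gradient (equation 15) for quadratic losses
    L_i(theta) = 0.5 * (theta - c_i) ** 2, for which the expansion truncates."""
    g1 = theta - c1                  # gradient of L1 at theta
    phi2 = theta - alpha * g1        # inner SGD step
    g2 = phi2 - c2                   # gradient of L2 at the adapted point
    return g1 + g2

def aligned_objective_grad(theta, c1, c2, alpha):
    """Gradient of L1 + L2 - (alpha / 2) * L1' * L2' at theta for the same
    quadratic losses."""
    gbar1 = theta - c1
    gbar2 = theta - c2
    # d/dtheta [gbar1 * gbar2] = gbar1 + gbar2 for these quadratics
    return gbar1 + gbar2 - (alpha / 2) * (gbar1 + gbar2)
```

Averaging `reptile_grad` over both orderings of the two examples matches `aligned_objective_grad` exactly in this setting, which mirrors the expectation argument used to pass from equation 36 to equation 38.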
Appendix H Supervised Continual Lifelong Learning
For the supervised continual learning benchmarks on MNIST Rotations and MNIST Permutations, following conventions, we use a two-layer MLP architecture for all models with 100 hidden units in each layer. We also model our hyperparameter search after (Lopez-Paz & Ranzato, 2017).
For Omniglot, following (Vinyals et al., 2016), we scale the images to 28x28 and use an architecture that consists of a stack of 4 modules before a fully connected softmax layer. Each module includes a 3x3 convolution with 64 filters, a ReLU nonlinearity, and 2x2 max-pooling.
H.1 Hyperparameter Search
Here we report the hyperparameter grids that we searched over in our experiments. The best values for MNIST Rotations (ROT) at each buffer size (ROT5120, ROT500, ROT200), MNIST Permutations (PERM) at each buffer size (PERM5120, PERM500, PERM200), Many Permutations (MANY) at each buffer size (MANY5120, MANY500), and Omniglot (OMNI) at each buffer size (OMNI5120, OMNI500) are noted accordingly in parentheses.

Online Learning
- learning rate: [0.0001, 0.0003, 0.001 (PERM, ROT), 0.003 (MANY), 0.01, 0.03, 0.1 (OMNI)]

EWC
- learning rate: [0.001 (OMNI), 0.003 (MANY), 0.01, 0.03 (ROT, PERM), 0.1, 0.3, 1.0]
- regularization: [1, 3 (MANY), 10 (OMNI), 30, 100, 300 (ROT, PERM), 1000, 3000, 10000, 30000]

GEM
- learning rate: [0.001, 0.003 (MANY500), 0.01 (ROT, PERM, OMNI, MANY5120), 0.03, 0.1, 0.3, 1.0]
- memory strength (γ): [0.0 (PERM500, MANY500), 0.1 (PERM200, MANY200), 0.5 (OMNI), 1.0 (ROT5120, ROT500, ROT200, PERM5120)]

Experience Replay
- learning rate: [0.00003, 0.0001, 0.0003, 0.001, 0.003 (MANY), 0.01 (ROT, PERM), 0.03 (OMNI), 0.1]
- batch size: [5, 10 (ROT500, PERM200), 25 (ROT5120, PERM5120, PERM500), 50 (OMNI, MANY5120, ROT200), 100, 250]

Meta-Experience Replay
- learning rate (α): [0.01 (OMNI5120), 0.03 (ROT, PERM, MANY), 0.1 (OMNI500)]
- across batch meta-learning rate (γ): 1.0
- within batch meta-learning rate (β): [0.03 (ROT5120, ROT200, PERM5120, PERM200, MANY), 0.1 (ROT500, PERM500), 0.3, 1.0 (OMNI)]
- batch size: [5 (MANY500, OMNI500), 10, 25 (PERM500, PERM200, OMNI5120), 50 (ROT5120, PERM5120, MANY5120), 100 (ROT200, ROT500)]
- number of batches per example: [1, 2 (PERM500, OMNI500), 5 (ROT500, ROT200, PERM200, OMNI5120, MANY5120), 10 (ROT5120, PERM5120, MANY500)]
Appendix I Forward Transfer and Interference
Forward transfer was a metric defined in (LopezPaz & Ranzato, 2017) based on the average increased performance on a task relative to performance at random initialization before training on that task. Unfortunately, this metric does not make much sense for tasks like MNIST Permutations where inputs are totally uncorrelated across tasks or Omniglot where outputs are totally uncorrelated across tasks. As such, we only provide performance for this metric on MNIST Rotations in Table 5.
Model  FTI
Online  58.22 ± 2.03
Task Input  1.62 ± 0.87
EWC  58.26 ± 1.98
GEM  65.96 ± 1.67
MER  66.74 ± 1.41
Appendix J Ablation Experiments
In order to consider a version of GEM that uses reservoir sampling, we maintain our buffer the same way that we do for experience replay and MER. We consider everything in the buffer to be old data and solve the GEM quadratic program so that the loss is not increased on this data. We found that considering the task level gradient directions did not lead to improvements.
Model  Buffer Size  Rotations  Permutations  Many Permutations
ER (algorithm 4)  5120  88.30 ± 0.57  83.90 ± 0.21  59.78 ± 0.22
ER (algorithm 4)  500  76.58 ± 0.89  74.02 ± 0.33  42.36 ± 0.42
ER (algorithm 4)  200  70.32 ± 0.86  67.62 ± 0.27  —
MER (algorithm 1)  5120  89.56 ± 0.11  85.50 ± 0.16  61.84 ± 0.25
MER (algorithm 1)  500  81.82 ± 0.52  77.40 ± 0.38  47.40 ± 0.35
MER (algorithm 1)  200  77.24 ± 0.47  72.74 ± 0.46  —
GEM (Lopez-Paz & Ranzato, 2017)  5120  87.12 ± 0.44  82.50 ± 0.42  56.76 ± 0.29
GEM (Lopez-Paz & Ranzato, 2017)  500  72.08 ± 1.29  69.26 ± 0.66  32.14 ± 0.50
GEM (Lopez-Paz & Ranzato, 2017)  200  66.88 ± 0.72  55.42 ± 1.10  —
GEM with Reservoir Sampling  5120  87.16 ± 0.41  83.68 ± 0.40  58.94 ± 0.53
GEM with Reservoir Sampling  500  77.26 ± 2.09  74.82 ± 0.29  42.24 ± 0.48
GEM with Reservoir Sampling  200  69.00 ± 0.84  68.90 ± 0.71  —
Appendix K Continual Reinforcement Learning
We detail the application of MER to deep Q-learning in algorithm 7, using notation from Mnih et al. (2015).
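For reference, the replayed transitions are regressed toward the standard DQN target from Mnih et al. (2015). The helper below is an illustrative sketch of that target computation, with `q_next_max` assumed to be precomputed by the target network; it is not the paper's code:

```python
def q_targets(transitions, q_next_max, gamma):
    """Standard DQN regression targets: y = r for terminal transitions,
    otherwise y = r + gamma * max_a' Q_target(s', a')."""
    return [r if done else r + gamma * qn
            for (r, done), qn in zip(transitions, q_next_max)]
```

In algorithm 7 these targets are computed for batches drawn from the reservoir-sampled buffer, with MER's meta-updates wrapped around the resulting Q-learning steps.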
K.1 Description of Catcher and Flappy Bird
In Catcher, the agent controls a segment that lies horizontally at the bottom of the screen, i.e. a basket, and can move right, move left, or stay still. The goal is to move the basket to catch as many pellets as possible. Missing a pellet results in losing one of the three available lives. Pellets emerge one by one from the top of the screen, with a descending velocity that is fixed for each task.
In the case of the very popular game Flappy Bird, the agent has to navigate a bird through an environment full of pipes by deciding whether or not to flap its wings. The pipes always appear in pairs, one from the bottom of the screen and one from the top, with a gap that allows the bird to pass through them. Flapping the wings makes the bird ascend; otherwise the bird descends naturally. Both ascending and descending velocities are preset by the physics engine of the game. The goal is to pass through as many pairs of pipes as possible without hitting a pipe, as this results in losing the game. The scoring scheme awards a point each time a pipe is crossed. Despite its very simple mechanics, Flappy Bird has proven challenging for many humans. According to the original game scoring scheme, players with a score of 10 receive a Bronze medal; 20 points, a Silver medal; 30, a Gold medal; and any score above 40 is rewarded with a Platinum medal.
K.2 DQN with Meta-Experience Replay
The DQN used to train on both games follows the classic architecture from (Mnih et al., 2015): a CNN consisting of 3 layers, the first with 32 filters and an 8x8 kernel, the second with 64 filters and a 4x4 kernel, and the final layer with 64 filters and a 3x3 kernel, followed by two fully connected layers. A ReLU nonlinearity is applied after each layer. We limited the memory buffer size for our models to 50k transitions, roughly the same proportion of the total experiences seen as in the benchmark setting for our supervised learning tasks.
K.3 Parameters for Continual Reinforcement Learning Experiments
For the continual reinforcement learning setting, we set the parameters using results from the experiments in the supervised setting as guidance. Both Catcher and Flappy Bird used the same hyperparameters, detailed below, with the obvious exception of the game-dependent parameter that defines each task. Models were trained for a maximum of 150k frames over 6 total tasks, switching tasks every 25k frames. Runs used different random seeds for initialization, as stated in the figures.

Game Parameters
- Catcher: vertical velocity of the pellet (increased from the default of 0.608).
- Flappy Bird: pipe gap (decreased by 10 from the default of 100).

Experience Replay
- learning rate: 0.0001
- batch size: 16

Meta-Experience Replay
- learning rate (α): 0.0001
- within batch meta-learning rate (β): 1
- across batch meta-learning rate (γ): 0.3
- batch size: 16
- number of steps: 1
- buffer size: 50000

K.4 Continual Reinforcement Learning Evaluation for Flappy Bird
Performance during training for continual learning on a non-stationary version of Flappy Bird is shown in Figure 4. Graphs show values averaged over three validation episodes across three different seed initializations. Vertical grid lines on the frame axis indicate task switches.