Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference

10/29/2018 ∙ by Matthew Riemer, et al.

Lack of performance when it comes to continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a trade-off between transfer and interference. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment gets more non-stationary and as the fraction of the total experiences stored gets smaller.




1 Solving the Continual Learning Problem

A long-held goal of AI is to build agents capable of operating autonomously for long periods. Such agents must incrementally learn and adapt to a changing environment while maintaining memories of what they have learned before, a setting known as lifelong learning (Thrun, 1994, 1996). In this paper we explore a variant called continual learning (Ring, 1994; Lopez-Paz & Ranzato, 2017). Continual learning assumes that the learner is exposed to a sequence of tasks, where each task is a sequence of experiences from the same distribution. We would like to develop a solution in this setting by discovering notions of tasks without supervision while learning incrementally after every experience. This is challenging because in standard offline single task and multi-task learning (Caruana, 1997) it is implicitly assumed that the data is drawn from an i.i.d. stationary distribution. Neural networks tend to struggle whenever this is not the case (Goodrich, 2015).

Over the years, solutions to the continual learning problem have been largely driven by prominent conceptualizations of the issues faced by neural networks. One popular view is catastrophic forgetting (interference) (McCloskey & Cohen, 1989), in which the primary concern is the lack of stability in neural networks, and the main solution is to limit the extent of weight sharing across experiences by focusing on preserving past knowledge (Kirkpatrick et al., 2017; Zenke et al., 2017; Lee et al., 2017). Another popular and more complex conceptualization is the stability-plasticity dilemma (Carpenter & Grossberg, 1987). In this view, the primary concern is the balance between network stability (to preserve past knowledge) and plasticity (to rapidly learn the current experience). Recently proposed techniques focus on balancing limited weight sharing with some mechanism to ensure fast learning (Li & Hoiem, 2016; Riemer et al., 2016; Lopez-Paz & Ranzato, 2017; Rosenbaum et al., 2018; Lee et al., 2018; Serrà et al., 2018). In this paper, we extend this view by noting that for continual learning over an unbounded number of distributions, we need to consider the stability-plasticity trade-off in both the forward and backward directions in time (Figure 1A).

The transfer-interference trade-off proposed in this paper (section 2) makes explicit the connection between controlling the extent of weight sharing and managing the continual learning problem. The key difference in perspective with past conceptualizations of continual learning is that we are not just concerned with current interference and transfer with respect to past examples, but also with the dynamics of transfer and interference moving forward as we learn. This new view of the problem leads to a natural meta-learning perspective on continual learning: we would like to learn to change our parameters so that we affect the dynamics of transfer and interference in a general sense, ideally not just in the past but in the future as well. To the extent that our meta-learning into the future generalizes, this should effectively make it easier for our model to perform continual learning in non-stationary settings. We achieve this by building off past work on experience replay (Murre, 1992; Lin, 1992; Robins, 1995) that has been a mainstay for solving non-stationary problems with neural networks (Mnih et al., 2015). We argue that the reason for its success lies in replay's ability to stabilize learning without sacrificing weight sharing. In this work, we propose a novel meta-experience replay (MER) algorithm that combines experience replay with optimization based meta-learning (section 3) as a first step towards modeling this perspective. Moreover, our experiments (sections 4, 5, and 6) confirm our theory. MER shows great promise across a variety of supervised continual learning and continual reinforcement learning settings.
In contrast to past work on meta-learning for few shot learning (Santoro et al., 2016; Vinyals et al., 2016; Ravi & Larochelle, 2016; Finn et al., 2017) and reinforcement learning across successive tasks (Al-Shedivat et al., 2018), we are not only trying to simply improve the speed of learning on new data, but also doing so in a way that preserves knowledge of past data and generalizes to future data. Critically, our approach is not reliant on any notion of tasks provided by humans to the model and in most of the settings we explore must detect the concept of new tasks without supervision.

2 The Transfer-Interference Trade-off for Continual Learning

Figure 1: A) The stability-plasticity dilemma considers plasticity with respect to the current learning and how it degrades old learning. It is important to address this problem without reducing the network’s performance in the future. The transfer-interference trade-off is the stability-plasticity dilemma considered in both forward and backward directions. This bi-directional view is crucial since solutions to the stability-plasticity dilemma that reduce the degree of weight-sharing are unlikely to function well into the future. B) A depiction of an example of transfer in weight space. C) A depiction of an example of interference in weight space.

At an instant in time with parameters $\theta$ and loss $L$, we can define operational measures of transfer and interference between two arbitrary distinct examples $(x_i, y_i)$ and $(x_j, y_j)$ while training with SGD (footnote 1). We can say that transfer occurs when:

$$\frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} > 0, \tag{1}$$

where $\cdot$ is the dot product operator. This implies that learning example $i$ will, without repetition, improve performance on example $j$ and vice versa (Figure 1B). On the other hand, we can say that interference occurs when:

$$\frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} < 0. \tag{2}$$

Here, in contrast, learning example $i$ will lead to unlearning (i.e. forgetting) of example $j$ and vice versa (Figure 1C). The potential for transfer is maximized when weight sharing is maximized while potential for interference is minimized when weight sharing is minimized (Appendix A).

Footnote 1: Throughout our paper we will discuss ideas in terms of the supervised learning problem formulation. Extensions to the reinforcement learning formulation are straightforward and we provide more details in the appendix.
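The sign test on gradient dot products can be illustrated with a small sketch. It is not taken from the paper: the linear model, the squared-error loss, and the helper names `grad_squared_error` and `classify_pair` are assumptions chosen only to keep the example self-contained.

```python
# Illustrative sketch (assumed setup): a linear model theta . x with
# squared-error loss L = 0.5 * (theta . x - y)^2.

def grad_squared_error(theta, x, y):
    """Gradient of 0.5 * (theta . x - y)^2 with respect to theta."""
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [err * xi for xi in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def classify_pair(theta, ex_i, ex_j):
    """Label a pair of examples by the sign of their gradient dot product."""
    g_i = grad_squared_error(theta, *ex_i)
    g_j = grad_squared_error(theta, *ex_j)
    d = dot(g_i, g_j)
    if d > 0:
        return "transfer"
    if d < 0:
        return "interference"
    return "neither"

theta = [0.0, 0.0]
a = ([1.0, 0.0], 1.0)    # wants the first weight to increase
b = ([1.0, 1.0], 2.0)    # also wants the first weight to increase
c = ([1.0, 0.0], -1.0)   # wants the first weight to decrease

print(classify_pair(theta, a, b))  # gradients align
print(classify_pair(theta, a, c))  # gradients oppose
```

In this toy setting an SGD step on `a` also lowers the loss on `b`, while a step on `a` raises the loss on `c`.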

Past solutions for the stability-plasticity dilemma in continual learning operate in a simplified temporal context whereby learning is divided into two phases: all past experiences are lumped together as old memories and the data that is currently being learned qualifies as new learning. In this setting, the goal is to minimize the interference projecting backward in time, and many of the approaches that succeed in this respect do so by reducing the degree of weight sharing in one form or another. In the appendix we explain how our baseline approaches (Kirkpatrick et al., 2017; Lopez-Paz & Ranzato, 2017) each fit within this paradigm.

The important issue with this perspective, however, is that the system is not done with learning that will occur in the future, and what the future may bring is largely unknown. This makes it incumbent upon us to do nothing to potentially undermine the network's ability to effectively learn in an uncertain future. This consideration leads us to extend the temporal horizon of the stability-plasticity problem forward, turning it, more generally, into a continual learning problem that we label as solving the Transfer-Interference Trade-off (Figure 1A). Specifically, it is important not only to reduce backward interference from our current point in time, but also to do so in a manner that does not limit our ability to learn in the future. This more general perspective acknowledges a subtlety in the problem: the issue of weight sharing across tasks arises both backward and forward in time. With this temporally symmetric perspective, the transfer-interference trade-off becomes clear. The weight sharing across tasks that enables transfer to improve future performance must not disrupt performance on what has come previously. This brings us to our central point: if we mitigate interference by reducing weight sharing, we are likely to worsen the network's capacity for transfer into the future, since transfer depends precisely on weight sharing as well. What is the future will eventually become the past, and performance must be good at all points in time. As such, our work adopts a meta-learning perspective on the continual learning problem. We would like to learn to learn each example in a way that generalizes to other examples from the overall distribution.

3 A System for Learning to Learn without Forgetting

In typical offline supervised learning, we can express our optimization objective over the stationary distribution of $(x, y)$ pairs within the dataset $D$:

$$\theta = \arg\min_{\theta} \mathbb{E}_{(x,y) \sim D}\left[L(x, y)\right], \tag{3}$$

where $L$ is the loss function, which can be selected to fit the problem. If we would like to maximize transfer and minimize interference, we can imagine it would be useful to add an auxiliary loss to the objective to bias the learning process in that direction. Considering equations 1 and 2, one obviously beneficial choice would be to also directly consider the gradients with respect to the loss function evaluated at randomly chosen datapoints. If we could maximize the dot products between gradients at these different points, it would directly encourage the network to share parameters where gradient directions align and keep parameters separate where interference is caused by gradients in opposite directions. So, ideally we would like to optimize for the following objective:

$$\theta = \arg\min_{\theta} \mathbb{E}_{[(x_i,y_i),(x_j,y_j)] \sim D}\left[L(x_i, y_i) + L(x_j, y_j) - \alpha \frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta}\right], \tag{4}$$
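procedure Train($D$, $\theta$, $\alpha$, $\beta$, $\gamma$, $s$, $k$)
      $M \leftarrow \{\}$
      for $t = 1, \ldots, T$ do
            for $(x, y)$ in $D_t$ do
                 // Draw batches from buffer:
                 $B_1, \ldots, B_s \leftarrow$ sample($x$, $y$, $s$, $k$, $M$)
                 $\theta^A_0 \leftarrow \theta$
                 for $i = 1, \ldots, s$ do
                       $\theta^W_{i,0} \leftarrow \theta$
                       for $j = 1, \ldots, k$ do
                             $x_c, y_c \leftarrow B_i[j]$
                             $\theta^W_{i,j} \leftarrow$ SGD($x_c$, $y_c$, $\theta^W_{i,j-1}$, $\alpha$)
                       end for
                       // Within batch Reptile meta-update:
                       $\theta \leftarrow \theta^W_{i,0} + \beta(\theta^W_{i,k} - \theta^W_{i,0})$
                       $\theta^A_i \leftarrow \theta$
                 end for
                 // Across batch Reptile meta-update:
                 $\theta \leftarrow \theta^A_0 + \gamma(\theta^A_s - \theta^A_0)$
                 // Reservoir sampling memory update:
                 $M \leftarrow M \cup \{(x, y)\}$ (algorithm 3)
            end for
      end for
      return $\theta$, $M$
end procedure
Algorithm 1 Meta-Experience Replay (MER)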

where $(x_i, y_i)$ and $(x_j, y_j)$ are randomly sampled unique data points. In our work we will attempt to design a continual learning system that optimizes for this objective. However, there are multiple problems that must be addressed to implement this kind of learning process in practical settings. The first problem is that continual learning deals with learning over a continuous non-stationary stream of data. We address this by following past work and implementing an experience replay module that augments online learning so that we can approximately optimize over the stationary distribution of all examples seen so far. Another practical problem is that the gradients of this loss depend on the second derivative with respect to the loss function, which is expensive to compute. We address this issue by indirectly approximating this objective to a first order Taylor series expansion using an online meta-learning algorithm with minimal computational overhead.
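To see why second derivatives appear, it may help to differentiate the dot-product term directly. The shorthand below is introduced here for illustration and is not the paper's notation: $g_i = \partial L(x_i, y_i) / \partial \theta$ is the gradient for example $i$ and $H_i = \partial^2 L(x_i, y_i) / \partial \theta^2$ is the corresponding Hessian.

```latex
\frac{\partial}{\partial \theta}\left( g_i \cdot g_j \right)
  = \left(\frac{\partial g_i}{\partial \theta}\right)^{\top} g_j
  + \left(\frac{\partial g_j}{\partial \theta}\right)^{\top} g_i
  = H_i \, g_j + H_j \, g_i
```

Each gradient step on a dot-product regularizer would therefore need Hessian-vector products for both examples, which is the cost that a first-order approximation avoids.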

3.1 Experience Replay

Learning objective: The continual lifelong learning setting poses a challenge for the optimization of neural networks as examples come one by one in a non-stationary stream. Instead, we would like our network to optimize over the stationary distribution of all examples seen so far. Experience replay (Lin, 1992; Murre, 1992) is an old technique that remains a central component of deep learning systems attempting to learn in non-stationary settings, and we will adopt here conventions from recent work (Zhang & Sutton, 2017; Riemer et al., 2017b) leveraging this approach. The central feature of experience replay is keeping a memory of examples seen $M$ that is interleaved with the training of the current example with the goal of making training more stable. As a result, experience replay approximates the objective in equation 3 to the extent that $M$ approximates $D$:

$$\theta = \arg\min_{\theta} \mathbb{E}_{(x,y) \sim M}\left[L(x, y)\right]. \tag{5}$$

$M$ has a current size $M_{size}$ and maximum size $M_{max}$. In our work, we update the buffer with reservoir sampling (Appendix D). This ensures that at every time-step the probability that any of the $N$ examples seen has of being in the buffer is equal to $M_{size}/N$. The content of the buffer resembles a stationary distribution over all examples seen to the extent that the items stored capture the variation of past examples. Following the standard practice in offline learning, we train by randomly sampling a batch $B$ from the distribution captured by $M$.
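A reservoir-sampled buffer can be sketched as follows. This is the textbook reservoir sampling procedure (often called Algorithm R), offered as an assumption about what the appendix algorithm looks like rather than a copy of it; the function name `reservoir_update` and its arguments are hypothetical.

```python
import random

def reservoir_update(memory, example, n_seen, max_size, rng=random):
    """Standard reservoir sampling update (sketch, not the paper's code).

    n_seen counts the examples seen so far, including `example`.
    After the update, every example seen so far is in `memory`
    with equal probability len(memory) / n_seen.
    """
    if len(memory) < max_size:
        # Buffer not yet full: always store the example.
        memory.append(example)
    else:
        # Pick a slot uniformly from [0, n_seen); replace only if it
        # falls inside the buffer, i.e. with probability max_size / n_seen.
        j = rng.randrange(n_seen)
        if j < max_size:
            memory[j] = example
    return memory

# Usage: stream 1000 integer "examples" through a buffer of size 10.
memory = []
for n, example in enumerate(range(1000), start=1):
    reservoir_update(memory, example, n, max_size=10)
print(len(memory))
```

The buffer never grows beyond `max_size`, and no task boundaries are needed to decide what to keep.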

Prioritizing the current example: the variant of experience replay we explore differs from offline learning in that the current example has a special role ensuring that it is always interleaved with the examples sampled from the replay buffer. This is because before we proceed to the next example, we want to make sure our algorithm has the ability to optimize for the current example (particularly if it is not added to the memory). Over $N$ examples seen, this still implies that we have trained with each example as the current example with probability per step of $1/N$. We provide an algorithm further detailing how experience replay is used in this work in the appendix (algorithm 4).

Concerns about storing examples: Obviously, it is not scalable to store every experience seen in memory. As such, in this work we focus on showing that we can achieve greater performance than baseline techniques when each approach is provided with only a small memory buffer.

3.2 Combining Experience Replay with Optimization Based Meta-Learning

First order meta-learning: One of the most popular meta-learning algorithms to date is Model Agnostic Meta-Learning (MAML) (Finn et al., 2017). MAML is an optimization based meta-learning algorithm with nice properties such as the ability to approximate any learning algorithm and the ability to generalize well to learning data outside of the previous distribution (Finn & Levine, 2017). One aspect of MAML that limits its scalability is the need to explicitly compute second derivatives. The authors proposed a variant called first-order MAML (FOMAML), which is defined by ignoring the second derivative terms to address this issue and surprisingly found that it achieved very similar performance. Recently, this phenomenon was explained by (Nichol & Schulman, 2018) who noted through Taylor expansion that the two algorithms were approximately optimizing for the same loss function. Nichol & Schulman (2018) also proposed an algorithm, Reptile, that efficiently optimizes for approximately the same objective while not requiring that the data be split into training and testing splits for each task learned as MAML does. Reptile is implemented by optimizing across $s$ batches of data sequentially with an SGD based optimizer and learning rate $\alpha$. After training on these batches, we take the initial parameters before training $\theta_0$ and update them to $\theta_0 \leftarrow \theta_0 + \beta(\theta_s - \theta_0)$, where $\theta_s$ denotes the parameters after the $s$ batches and $\beta$ is the learning rate for the meta-learning update. The process repeats for each series of $s$ batches (algorithm 2). Shown in terms of gradients in (Nichol & Schulman, 2018), Reptile approximately optimizes for the following objective over a set of $s$ batches:

$$\theta = \arg\min_{\theta} \mathbb{E}_{B_1, \ldots, B_s \sim D}\left[2 \sum_{i=1}^{s} \left[L(B_i) - \sum_{j=1}^{i-1} \alpha \frac{\partial L(B_i)}{\partial \theta} \cdot \frac{\partial L(B_j)}{\partial \theta}\right]\right], \tag{6}$$

where $B_1, \ldots, B_s$ are batches within $D$. This is similar to our motivation in equation 4 to the extent that gradients produced on these batches approximate samples from the stationary distribution.
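The Reptile update described above can be sketched in a few lines. The linear model, squared-error loss, and function names `sgd_step` and `reptile_update` are assumptions made for a self-contained illustration; only the structure (sequential SGD over batches, then interpolation from the starting parameters) follows the description in the text.

```python
def sgd_step(theta, batch, lr):
    """One SGD step on the mean squared error of a linear model (assumed)."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for d, xi in enumerate(x):
            grad[d] += err * xi / len(batch)
    return [t - lr * g for t, g in zip(theta, grad)]

def reptile_update(theta0, batches, alpha, beta):
    """Train sequentially on the batches, then move theta0 toward the result."""
    theta = list(theta0)
    for batch in batches:
        theta = sgd_step(theta, batch, alpha)
    # Meta-update: interpolate between the initial and final parameters.
    return [t0 + beta * (t - t0) for t0, t in zip(theta0, theta)]

# Usage: two single-example batches, inner learning rate 0.5.
batches = [[([1.0], 1.0)], [([1.0], 1.0)]]
print(reptile_update([0.0], batches, alpha=0.5, beta=1.0))
print(reptile_update([0.0], batches, alpha=0.5, beta=0.5))
```

With `beta = 1` the update reduces to plain sequential SGD over the batches, and with `beta = 0` the parameters do not move.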

The MER learning objective: In this work, we modify the Reptile algorithm to properly integrate it with an experience replay module, facilitating continual learning while maximizing transfer and minimizing interference. As we describe in more detail during the derivation in the appendix, achieving the Reptile objective in an online setting where examples are provided sequentially is non-trivial and is in part only achievable because of our sampling strategies for both the buffer and batch. Following our remarks about experience replay from the prior section, this allows us to optimize for the following objective in a continual learning setting using our proposed MER algorithm:

$$\theta = \arg\min_{\theta} \mathbb{E}_{[(x_{11},y_{11}), \ldots, (x_{sk},y_{sk})] \sim M}\left[2 \sum_{i=1}^{s} \sum_{j=1}^{k} \left[L(x_{ij}, y_{ij}) - \sum_{q=1}^{i-1} \sum_{r=1}^{j-1} \alpha \frac{\partial L(x_{ij}, y_{ij})}{\partial \theta} \cdot \frac{\partial L(x_{qr}, y_{qr})}{\partial \theta}\right]\right]. \tag{7}$$

The MER algorithm: MER maintains an experience replay style memory $M$ with reservoir sampling and at each time step draws $s$ batches including $k - 1$ random samples from the buffer to be trained alongside the current example. Each of the $k$ examples within each batch is treated as its own Reptile batch of size 1 with an inner loop Reptile meta-update after that batch is processed. We then apply the Reptile meta-update again in an outer loop across the $s$ batches. We provide further details for MER in algorithm 1. This procedure approximates the objective of equation 7 when $\beta = 1$.
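The two nested Reptile updates can be sketched for a single incoming example. This is a simplified illustration under stated assumptions, not the authors' implementation: the model is linear with squared-error loss, the current example is placed last in each batch, the memory update is left to the caller, and the names `mer_step`, `sgd_example`, and `interpolate` are hypothetical.

```python
import random

def sgd_example(theta, example, lr):
    """One SGD step on a single example for a linear model (assumed)."""
    x, y = example
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [t - lr * err * xi for t, xi in zip(theta, x)]

def interpolate(start, end, rate):
    """Reptile-style meta-update: move from start toward end."""
    return [a + rate * (b - a) for a, b in zip(start, end)]

def mer_step(theta, current, memory, alpha, beta, gamma, s, k, rng=random):
    """Process one incoming example with within- and across-batch updates."""
    theta_before_all = list(theta)
    for _ in range(s):
        # Up to k - 1 memories plus the current example (order is an assumption).
        n_mem = min(k - 1, len(memory))
        batch = rng.sample(memory, n_mem) + [current]
        theta_before_batch = list(theta)
        theta_w = list(theta)
        for example in batch:  # each example acts as a batch of size 1
            theta_w = sgd_example(theta_w, example, alpha)
        # Within-batch Reptile meta-update.
        theta = interpolate(theta_before_batch, theta_w, beta)
    # Across-batch Reptile meta-update.
    return interpolate(theta_before_all, theta, gamma)

# Usage: identical memories make the result independent of the sampling.
memory = [([1.0], 1.0)] * 3
print(mer_step([0.0], ([1.0], 1.0), memory, 0.5, 1.0, 1.0, s=2, k=2))
```

Setting `gamma = 1` and `beta = 1` recovers ordinary replay-style SGD over the sampled examples, which makes the effect of the two interpolation rates easy to isolate in experiments.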

Controlling the degree of regularization: In light of our ideal objective in equation 4, we can see that using a SGD batch size of 1 has an advantage over larger batches because it allows for the second derivative information conveyed to the algorithm to be fine grained on the example level. Another reason to use sample level effective batches is that for a given number of samples drawn from the buffer we maximize $s$ from equation 6. In equation 6, the typical offline learning loss has a weighting proportional to $s$ and the regularizer term to maximize transfer and minimize interference has a weighting proportional to $\alpha s(s-1)/2$. This implies that by maximizing the effective $s$ we can put more weight on the regularization term. We found that for a fixed number of examples drawn from $M$, we consistently performed better converting to a long list of individual samples than we did using proper batches as in (Nichol & Schulman, 2018) for few shot learning.
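As a worked illustration of this weighting argument, assume a fixed budget of $n = 10$ samples split into batches of size $b$, so that the Reptile objective sees $s = n/b$ batches, one loss term per batch, and one gradient dot-product term per pair of batches. The numbers below are an example chosen here, not results from the paper.

```latex
\begin{aligned}
b = 10:&\quad s = 1,  &\text{pairs} &= \tfrac{1 \cdot 0}{2} = 0,   &\tfrac{\text{regularizer weight}}{\text{loss weight}} &= \tfrac{\alpha (s-1)}{2} = 0 \\
b = 5:&\quad  s = 2,  &\text{pairs} &= \tfrac{2 \cdot 1}{2} = 1,   &\tfrac{\alpha (s-1)}{2} &= 0.5\,\alpha \\
b = 1:&\quad  s = 10, &\text{pairs} &= \tfrac{10 \cdot 9}{2} = 45, &\tfrac{\alpha (s-1)}{2} &= 4.5\,\alpha
\end{aligned}
```

For the same ten samples, treating each one as its own batch gives the dot-product regularizer nine times the relative weight it has with two batches of five, and a single batch of ten gives it none.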

Prioritizing current learning: To ensure strong regularization, we would like the number of batches processed in a Reptile update to be large, large enough that experience replay alone would start to overfit to $M$. As such, we also need to make sure we provide enough priority to learning the current example, particularly because we may not store it in $M$. To achieve this in algorithm 1, we sample $s$ separate batches from $M$ that are processed sequentially and each interleaved with the current example. In the appendix we also outline two additional variants of MER with very similar properties in that they effectively approximate the same objective. In one we choose one big batch of size $sk - s$ memories and $s$ copies of the current example (algorithm 5). In the other we choose one memory batch of size $k - 1$ with a special current item learning rate of $s\alpha$ (algorithm 6).

Unique properties: In the end, our approach amounts to an easy to implement and computationally efficient extension of SGD, which is applied to an experience replay buffer by leveraging the machinery of past work on optimization based meta-learning. However, the emergent regularization on learning is quite different from the forms of regularization previously considered. Past work on optimization based meta-learning has enabled fast learning on incoming data without considering past data. Meanwhile, past work on experience replay only focused on stabilizing learning by approximating stationary conditions without altering model parameters to change the dynamics of transfer and interference.

4 Evaluation for Supervised Continual Lifelong Learning

To test the efficacy of MER we compare it to relevant baselines for continual learning of many supervised tasks from (Lopez-Paz & Ranzato, 2017) (see the appendix for in-depth descriptions):

  • Online: represents online learning performance of a model trained straightforwardly one example at a time on the incoming non-stationary training data by simply applying SGD.

  • Independent: an independent predictor per task, with fewer hidden units per predictor in proportion to the number of tasks. When useful, it can be initialized by cloning the last predictor.

  • Task Input: has the same architecture as Online, but with a dedicated input layer per task.

  • EWC: Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is an algorithm that modifies online learning where the loss is regularized to avoid catastrophic forgetting.

  • GEM: Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) is an approach for making efficient use of episodic storage by following gradients on incoming examples to the maximum extent while altering them so that they do not interfere with past memories. An independent ad hoc analysis is performed to alter each incoming gradient. In contrast to MER, nothing generalizable is learned across examples about how to alter gradients.

We follow (Lopez-Paz & Ranzato, 2017) and consider final retained accuracy across all tasks after training sequentially on all tasks as our main metric for comparing approaches. Moving forward we will refer to this metric as retained accuracy (RA). In order to reveal more characteristics of the learning behavior, we also report the learning accuracy (LA) which is the average accuracy for each task directly after it is learned. Additionally, we report the backward transfer and interference (BTI) as the average change in accuracy from when a task is learned to the end of training. A highly negative BTI reflects catastrophic forgetting. Forward transfer and interference (Lopez-Paz & Ranzato, 2017) is only applicable for one task we explore, so we provide details in Appendix I.
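The three metrics can be computed from a matrix of per-task accuracies, as in the sketch below. The matrix layout, where `acc[i][j]` is the accuracy on task `j` measured right after training on task `i`, and the function name `continual_metrics` are assumptions for illustration. BTI is averaged over all tasks here, so it equals RA minus LA, which matches the reported tables up to rounding.

```python
def continual_metrics(acc):
    """Return (RA, LA, BTI) from a square accuracy matrix.

    acc[i][j]: accuracy on task j after training sequentially through task i
    (an assumed layout, not specified in the text).
    """
    n = len(acc)
    final = acc[n - 1]
    # Retained accuracy: average final accuracy across all tasks.
    ra = sum(final[j] for j in range(n)) / n
    # Learning accuracy: average accuracy on each task right after learning it.
    la = sum(acc[j][j] for j in range(n)) / n
    # Backward transfer and interference: average change since learning.
    bti = sum(final[j] - acc[j][j] for j in range(n)) / n
    return ra, la, bti

# Usage with two tasks: task 0 drops from 0.9 to 0.6 after learning task 1.
acc = [[0.9, 0.1],
       [0.6, 0.8]]
ra, la, bti = continual_metrics(acc)
print(ra, la, bti)
```

A negative BTI, as in this toy example, indicates that accuracy on earlier tasks degraded during later training.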

Question 1 How does MER perform on supervised continual learning benchmarks?

To address this question we consider two continual learning benchmarks from (Lopez-Paz & Ranzato, 2017). MNIST Permutations is a variant of MNIST first proposed in (Kirkpatrick et al., 2017) where each task is transformed by a fixed permutation of the MNIST pixels. As such, the input distributions of the tasks are unrelated to one another. MNIST Rotations is another variant of MNIST proposed in (Lopez-Paz & Ranzato, 2017) where each task contains digits rotated by a fixed angle between 0 and 180 degrees. We follow the standard benchmark setting from (Lopez-Paz & Ranzato, 2017) using a modest memory buffer of size 5120 to learn 1000 sampled examples across each of 20 tasks. We provide detailed information about our architectures and hyperparameters in the appendix.
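The construction of permutation tasks can be sketched as follows. It illustrates the idea of one fixed pixel permutation per task and is not the benchmark's actual data pipeline; the function names, the seed, and the flattened 784-pixel image format are assumptions.

```python
import random

def make_permutation_tasks(num_tasks, num_pixels, seed=0):
    """Draw one fixed pixel permutation per task (illustrative sketch)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_tasks):
        perm = list(range(num_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def apply_permutation(image, perm):
    """Reorder a flattened image according to a task's permutation."""
    return [image[p] for p in perm]

# Usage: 20 tasks over flattened 28x28 images (784 pixels).
perms = make_permutation_tasks(num_tasks=20, num_pixels=784)
image = [float(i) for i in range(784)]  # stand-in for a real MNIST digit
task3_view = apply_permutation(image, perms[3])
print(len(perms), len(task3_view))
```

Every image in a task is shuffled the same way, so each task is as learnable as the original, but pixel positions carry no shared meaning across tasks.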

In Table 1 we report results on these benchmarks in comparison to our baseline approaches. Clearly GEM outperforms our other baselines, but our approach adds significant value over GEM in terms of retained accuracy on both benchmarks. MER achieves this by striking a superior balance between transfer and interference with respect to the past and future data. MER displays the best adaptation to incoming tasks, while also providing very strong retention of knowledge when learning future tasks. EWC and using a task specific input layer both also lead to gains over standard online learning in terms of retained accuracy. However, they are quite far below the performance of approaches that make use of episodic storage. While EWC does not store examples, in storing the Fisher information for each task it accrues more incremental resources than the episodic storage approaches.

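| Model | Rotations RA | Rotations LA | Rotations BTI | Permutations RA | Permutations LA | Permutations BTI |
|---|---|---|---|---|---|---|
| Online | 46.40 ± 0.78 | 78.70 ± 0.52 | -32.30 ± 0.97 | 55.42 ± 0.65 | 69.18 ± 0.99 | -13.76 ± 1.19 |
| Independent | 60.74 ± 4.55 | 60.74 ± 4.55 | - | 55.80 ± 4.79 | 55.80 ± 4.79 | - |
| Task Input | 79.98 ± 0.66 | 81.04 ± 0.34 | -1.06 ± 0.59 | 80.46 ± 0.80 | 81.20 ± 0.52 | -0.74 ± 1.10 |
| EWC | 57.96 ± 1.33 | 78.38 ± 0.72 | -20.42 ± 1.60 | 62.32 ± 1.34 | 75.64 ± 1.18 | -13.32 ± 2.24 |
| GEM | 87.12 ± 0.44 | 86.34 ± 0.09 | +0.80 ± 0.42 | 82.50 ± 0.42 | 80.30 ± 0.80 | +2.20 ± 0.57 |
| MER | 89.56 ± 0.11 | 87.62 ± 0.16 | +1.94 ± 0.18 | 85.50 ± 0.16 | 86.36 ± 0.21 | -0.86 ± 0.21 |

Table 1: Performance on continual lifelong learning 20 tasks benchmarks from (Lopez-Paz & Ranzato, 2017). RA: retained accuracy; LA: learning accuracy; BTI: backward transfer and interference.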

Question 2 How do the performance gains from MER vary as a function of the buffer size?

To make progress towards the greater goals of lifelong learning, we would like our algorithm to make the most use of even a modest buffer. This is because in extremely large scale settings it is unrealistic to assume a system can store a large percentage of previous examples in memory. As such, we would like to compare MER to GEM, which is known to perform well with an extremely small memory buffer (Lopez-Paz & Ranzato, 2017). We consider a buffer size of 500, which is over 10 times smaller than the standard setting on these benchmarks. Additionally, we consider a buffer size of 200, matching the smallest setting explored in (Lopez-Paz & Ranzato, 2017). This setting corresponds to an average storage of 1 example for each combination of task and class. We report our results in Table 2. The benefits of MER seem to grow as the buffer becomes smaller. In the smallest setting, MER improves retained accuracy by more than 10 percentage points on both benchmarks.

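| Model | Buffer | Rotations RA | Rotations LA | Rotations BTI | Permutations RA | Permutations LA | Permutations BTI |
|---|---|---|---|---|---|---|---|
| GEM | 5120 | 87.12 ± 0.44 | 86.34 ± 0.09 | +0.80 ± 0.42 | 82.50 ± 0.42 | 80.30 ± 0.80 | +2.20 ± 0.57 |
| GEM | 500 | 72.08 ± 1.29 | 82.92 ± 0.49 | -10.82 ± 1.22 | 69.26 ± 0.66 | 80.28 ± 0.52 | -11.02 ± 0.71 |
| GEM | 200 | 66.88 ± 0.72 | 85.46 ± 0.38 | -18.58 ± 0.48 | 55.42 ± 1.10 | 79.84 ± 1.01 | -24.42 ± 1.10 |
| MER | 5120 | 89.56 ± 0.11 | 87.62 ± 0.16 | +1.94 ± 0.18 | 85.50 ± 0.16 | 86.36 ± 0.21 | -0.86 ± 0.21 |
| MER | 500 | 81.82 ± 0.52 | 87.28 ± 0.50 | -5.46 ± 0.13 | 77.40 ± 0.38 | 83.64 ± 0.18 | -6.24 ± 0.34 |
| MER | 200 | 77.24 ± 0.47 | 84.82 ± 0.25 | -7.58 ± 0.26 | 72.74 ± 0.46 | 82.18 ± 0.23 | -9.44 ± 0.34 |

Table 2: Performance varying the buffer size on continual learning benchmarks (Lopez-Paz & Ranzato, 2017).

| Model | Buffer | Many Permutations RA | Many Permutations LA | Many Permutations BTI | Omniglot RA | Omniglot LA | Omniglot BTI |
|---|---|---|---|---|---|---|---|
| Online | 0 | 32.62 ± 0.43 | 51.68 ± 0.65 | -19.06 ± 0.86 | 4.36 ± 0.37 | 5.38 ± 0.18 | -1.02 ± 0.33 |
| EWC | 0 | 33.46 ± 0.46 | 51.30 ± 0.81 | -17.84 ± 1.15 | 4.63 ± 0.14 | 9.43 ± 0.63 | -4.80 ± 0.68 |
| GEM | 5120 | 56.76 ± 0.29 | 59.66 ± 0.46 | -2.92 ± 0.52 | 18.03 ± 0.15 | 3.86 ± 0.09 | +14.19 ± 0.19 |
| GEM | 500 | 32.14 ± 0.50 | 55.66 ± 0.53 | -23.52 ± 0.87 | - | - | - |
| MER | 5120 | 61.84 ± 0.25 | 60.40 ± 0.36 | +1.44 ± 0.21 | 75.23 ± 0.52 | 69.12 ± 0.83 | +6.11 ± 0.62 |
| MER | 500 | 47.40 ± 0.35 | 65.18 ± 0.20 | -17.78 ± 0.39 | 32.05 ± 0.69 | 28.78 ± 0.91 | +3.27 ± 1.04 |

Table 3: Performance on many task non-stationary continual lifelong learning benchmarks.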

Question 3 How effective is MER at dealing with increasingly non-stationary settings?

Another larger goal of lifelong learning is to enable continual learning with only relatively few examples per task. This setting is particularly difficult because we have less data with which to characterize each class and our distribution is increasingly non-stationary over a fixed amount of training. We would like to explore how various models perform in this kind of setting. To do this we consider two new benchmarks. Many Permutations is a variant of MNIST Permutations that has 5 times more tasks (100 total) and 5 times fewer training examples per task (200 each). Meanwhile we also explore the Omniglot (Lake et al., 2011) benchmark, treating each of the 50 alphabets as a task (see the appendix for experimental details). Following multi-task learning conventions, 90% of the data is used for training and 10% is used for testing (Yang & Hospedales, 2017). Overall there are 1623 different character classes. We learn each class and task sequentially.

We report continual learning results using these new datasets in Table 3. On Many Permutations, the effect of efficiently using episodic storage becomes even more pronounced as the setting becomes more non-stationary. GEM and MER both achieve nearly double the performance of EWC and online learning. We also see that increasingly non-stationary settings lead to a larger performance gain for MER over GEM. Gains are quite significant for Many Permutations and remarkable for Omniglot. Omniglot is even more non-stationary, including slightly fewer examples per task, and MER nearly quadruples the performance of baseline techniques. Considering the poor performance of online learning and EWC, it is natural to question whether or not examples were learned in the first place. We experimented with using as many as 100 gradient descent steps per incoming example to ensure each example is learned when first seen. However, due to the extremely non-stationary setting, no run of any variant we tried surpassed 5.5% retained accuracy. GEM also has major deficits for learning on Omniglot that are resolved by MER, which achieves far better performance when it comes to quickly learning the current task. GEM maintains a buffer using a recent item based sampling strategy and thus cannot deal with non-stationarity within the task nearly as well as reservoir sampling. Additionally, we found that the optimization based on the buffer was significantly less effective and less reliable, as the quadratic program fails for many hyperparameter values that lead to non-positive definite matrices. Unfortunately, we could not get GEM to consistently converge on Omniglot for a memory size of 500 (significantly less than the number of classes), whereas MER handles it well. In fact, MER greatly outperforms GEM with an order of magnitude smaller buffer.

Figure 2: A, B: sequence of frames for the games Catcher and Flappy Bird respectively. The goal in Catcher is to capture the falling pellet by horizontally moving the racket on the bottom of the screen. In Flappy Bird, the goal is to navigate the bird through as many pipes as possible by making it go up or letting it fall. C, D: average score in Catcher and Flappy Bird respectively for the evaluation on the first task, corresponding to a slower falling pellet in Catcher and a larger gap in Flappy Bird.

5 Evaluation for Continual Reinforcement Learning

Question 4 Can MER improve a DQN with ER in continual reinforcement learning settings?

We considered the evaluation of MER in a continual reinforcement learning setting where the environment is highly non-stationary. In order to produce these non-stationary environments in a controlled way suitable for our experimental purposes, we utilized different arcade games provided by Tasfi (2016). In our experiments we used Catcher and Flappy Bird, two simple but sufficiently interesting environments (see Appendix K.1 for a detailed description of the environments). For the purposes of our explanation, we will call each set of fixed game-dependent parameters a task (footnote 2). The multi-task setting is then built by introducing changes in these parameters, resulting in non-stationarity across tasks. As in our other experiments, each agent is evaluated based on its performance over time on all tasks. Therefore, the reported performances provide a measurement of adaptation to the current task as well as of retention of knowledge acquired from other tasks. Our model uses a standard DQN setting, originally developed for Atari (Mnih et al., 2015). We refer to Appendix K.2 for the details on the implementation.

Footnote 2: Agents are not provided task information, forcing them to identify changes in game play on their own.

Figure 3: Continual learning performance for a non-stationary version of Catcher. Graphs show averaged values over ten validation episodes across five different seeds. Vertical grid lines on the x-axis indicate a task switch.

In Catcher, we obtain different tasks by incrementally increasing the pellet velocity a total of 6 times during training. In Flappy Bird, the different tasks are obtained by incrementally reducing the separation between upper and lower pipes a total of 6 times during training. In Figure 3, we show the performance in Catcher when trained sequentially on 6 different tasks for 25k frames each to a maximum of 150k frames, evaluated at each point in time on all 6 tasks. Under these non-stationary conditions, a DQN using MER performs consistently better than the standard DQN with an experience replay buffer. Figure 3.1 shows the average score for the first task. If we take as inspiration how humans perform, in the last stages of training we hope that a player that obtains good results in later tasks will also obtain good results in the first tasks, insofar as the first tasks are subsumed in the latter ones. For example, in Catcher, later tasks have the pellet move faster, and thus we expect an agent that handles them to also do well in the first task. However, DQN significantly forgets how to perform the first task, i.e., catching slowly moving pellets. In marked contrast, DQN-MER exhibits minimal or no forgetting for the first set of episodes after being trained on the rest of the tasks. This behavior is intuitive in the sense that we would expect forward transfer to happen naturally in this setting, as seems to be the case with human players. Similar behavior appears in Flappy Bird. DQN-MER becomes a Platinum player on the first task during a period in which it is learning the third task. This is a more difficult environment in which the pipe gap is noticeably smaller (see Appendix K.4). We find that DQN-MER exhibits the kind of natural learning patterns we would expect from humans in a curriculum learning setting for these games, while a standard DQN struggles both to generalize as the game changes and to retain what it has learned over time.

6 Further Analysis of the Approach

In this section we would like to dive deeper into how MER works. To do so we run additional detailed experiments across our three MNIST based continual learning benchmarks.

Question 5 Does MER lead to a shift in the distribution of gradient dot products?

We would like to directly verify that MER achieves our motivation in equation 7 and results in significant changes in the distribution of gradient dot products between new incoming examples and past examples over the course of learning when compared to experience replay (ER). For these experiments, we maintain a history of all examples seen that is totally separate from our notion of memory buffers that only include a partial history of examples. Every time we receive a new example we use the current model to extract a gradient direction and we also randomly sample five examples from the previous history. We save the dot products of the incoming example gradient with these five past example gradients and consider the mean of the distribution of dot products seen over the course of learning for each model. We run this experiment on the best hyperparameter setting for both our ER model and our MER model with one batch per example for fair comparison. Each model is evaluated five times over the course of learning. We report the mean and standard deviation of the mean gradient dot product across runs in Table 4.

We can thus verify that a very significant and reproducible difference in the mean gradient dot product encountered is seen for MER in comparison to ER alone. This difference alters the learning process, making incoming examples on average result in slight transfer rather than significant interference. This analysis confirms the desired effect of the objective function in equation 7. For these tasks there are enough similarities that our meta-learning generalizes very well into the future. We should also expect it to perform well in the case of drastic domain shifts like other meta-learning algorithms driven by SGD alone (Finn & Levine, 2017).

| Model | MNIST Permutations | MNIST Rotations | Many Permutations |
|---|---|---|---|
| ER | -0.569 ± 0.077 | -1.652 ± 0.082 | -1.280 ± 0.078 |
| MER | +0.042 ± 0.017 | +0.017 ± 0.007 | +0.131 ± 0.027 |
Table 4: Analysis of the mean dot product across the period of learning between gradients on incoming examples and gradients on randomly sampled past examples across 5 runs on MNIST based benchmarks.
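The measurement procedure described for Question 5 can be sketched as follows. This is a minimal illustration, not the code used for Table 4: the linear model with squared error, the plain online SGD learner, and all function names are assumptions made for the example. Only the bookkeeping (a full history kept separately from any replay buffer, and five randomly sampled past examples per incoming example) follows the text.

```python
import random


def grad(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w (toy linear model)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_gradient_dot_product(stream, w, lr=0.01, n_past=5, seed=0):
    rng = random.Random(seed)
    history = []  # every example seen so far, separate from any replay buffer
    dots = []
    for x, y in stream:
        g_new = grad(w, x, y)
        # Compare against up to n_past randomly chosen earlier examples.
        for xp, yp in rng.sample(history, min(n_past, len(history))):
            dots.append(dot(g_new, grad(w, xp, yp)))
        # Plain online SGD stands in for the learner under analysis.
        w = [wi - lr * gi for wi, gi in zip(w, g_new)]
        history.append((x, y))
    return sum(dots) / len(dots) if dots else 0.0
```

A positive mean indicates that incoming gradients tend to agree with gradients on past examples (transfer), while a negative mean indicates interference.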

Question 6 What components of MER are most important?

We would like to further analyze our proposed MER model to understand what components add the most value and when. We want to understand how powerful our proposed variant of ER is on its own and how much is added by adding meta-learning to ER. In the appendix we provide detailed results considering ablated baselines for our experiments on the MNIST lifelong learning benchmarks. Our version of ER consistently provides gains over GEM on its own, but the techniques perform very comparably when we also maintain the GEM buffer with reservoir sampling. Additionally, we see that adding meta-learning to ER consistently creates additional value in our experiments. In fact, meta-learning appears to provide increasing value with smaller buffer sizes.

7 Conclusion

In this paper we have cast a new perspective on the problem of continual learning in terms of a fundamental trade-off between transfer and interference. Exploiting this perspective, we have in turn developed a new algorithm Meta-Experience Replay (MER) that is well suited for application to general purpose continual learning problems. We have demonstrated that MER regularizes the objective of experience replay so that gradients on incoming examples are more likely to have transfer and less likely to have interference with respect to past examples. The result is a general purpose solution to continual learning problems that outperforms strong baselines for both supervised continual learning benchmarks and continual learning in non-stationary reinforcement learning environments. Techniques for continual learning have been largely driven by different conceptualizations of the fundamental problem encountered by neural networks. We hope that the transfer-interference trade-off can be a useful problem view for future work to exploit with MER as a first successful example.


We would like to thank Pouya Bashivan, Christopher Potts, Dan Jurafsky, and Joshua Greene for their input and support of this work. This research was supported by the MIT-IBM Watson AI Lab, and is based in part upon work supported by the Stanford Data Science Initiative and by the NSF under Grant No. BCS-1456077 and the NSF Award IIS-1514268.


Appendix A The Connection Between Weight Sharing and the Transfer-Interference Trade-off

In this section we would like to generalize our interpretation of a large set of different weight sharing schemes including (Riemer et al., 2015; Bengio et al., 2015; Rosenbaum et al., 2018; Serrà et al., 2018) and how the concept of weight sharing impacts the dynamics of transfer (equation 1) and interference (equation 2). We will assume that we have a total parameter space $\theta$ that can be used by our network at any point in time. However, it is not a requirement that all parameters are actually used at all points in time. So, we can consider two specific instances in time. One where we receive data point $(x_i, y_i)$ and leverage parameters $\theta_i$. Then, at the other instance in time, we receive data point $(x_j, y_j)$ and leverage parameters $\theta_j$. $\theta_i$ and $\theta_j$ are both subsets of $\theta$, and critically the overlap between these subsets influences the possible extent of transfer and interference when training on either data point.

First let us consider two extremes. In the first extreme imagine $\theta_i$ and $\theta_j$ are entirely non-overlapping. As such $\frac{\partial L(x_i, y_i)}{\partial \theta} \cdot \frac{\partial L(x_j, y_j)}{\partial \theta} = 0$. On the positive side, this means that our solution has no potential for interference between the examples. On the other hand, there is no potential for transfer either. On the other extreme, we can imagine that $\theta_i = \theta_j$. In this case, the potential for both transfer and interference is maximized as gradients with respect to every parameter have the possibility of a non-zero dot product with each other.

From this discussion it is clear that both the extreme of full weight sharing and the extreme of no weight sharing have value depending on the relationship between data points. What we would really like for continual learning is to have a system that learns when to share weights and when not to on its own. To the extent that our learning about weight sharing generalizes, this should allow us to find an optimal solution to the transfer-interference trade-off.
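The two extremes can be illustrated numerically. The sketch below is only a toy under stated assumptions: a linear model in which each example uses the subset of parameters selected by a binary mask, with squared error loss. The masks and the function names are hypothetical and do not correspond to any specific weight sharing scheme cited above.

```python
def masked_grad(theta, mask, x, y):
    """Gradient w.r.t. the full parameter vector when only masked weights are used."""
    pred = sum(t * m * xi for t, m, xi in zip(theta, mask, x))
    err = pred - y
    # Parameters outside the mask receive exactly zero gradient.
    return [err * m * xi for m, xi in zip(mask, x)]


def grad_dot(theta, mask_i, ex_i, mask_j, ex_j):
    g_i = masked_grad(theta, mask_i, *ex_i)
    g_j = masked_grad(theta, mask_j, *ex_j)
    return sum(a * b for a, b in zip(g_i, g_j))


theta = [0.5, -0.2, 0.1, 0.3]
x = [1.0, 1.0, 1.0, 1.0]

# No weight sharing: disjoint subsets, so neither transfer nor interference.
no_sharing = grad_dot(theta, [1, 1, 0, 0], (x, 0.0), [0, 0, 1, 1], (x, 2.0))

# Full weight sharing: the sign of the dot product depends on the data.
full = [1, 1, 1, 1]
transfer = grad_dot(theta, full, (x, 0.0), full, (x, 0.0))
interference = grad_dot(theta, full, (x, 0.0), full, (x, 2.0))
```

Here `no_sharing` is zero regardless of the labels, whereas under full sharing the same pair of inputs yields a positive dot product when the targets agree and a negative one when they conflict.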

Appendix B Further Descriptions and Comparisons with Baseline Algorithms

Independent: originally reported in (Lopez-Paz & Ranzato, 2017) is the performance of an independent predictor per task which has the same architecture but with fewer hidden units, proportional to the number of tasks. The independent predictor can be initialized randomly or clone the last trained predictor depending on what leads to better performance.

EWC: Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is an algorithm that modifies online learning where the loss is regularized to avoid catastrophic forgetting by considering the importance of parameters in the model as measured by their Fisher information. EWC follows the catastrophic forgetting view of the continual learning problem by promoting less sharing of parameters for new learning that were deemed to be important for performance on old memories. We utilize the code provided by (Lopez-Paz & Ranzato, 2017) in our experiments. The only difference in our setting is that we provide the model one example at a time to test true continual learning rather than providing a batch of 10 examples at a time.

GEM: Gradient Episodic Memory (GEM) (Lopez-Paz & Ranzato, 2017) is an algorithm meant to enhance the effectiveness of episodic storage based continual learning techniques by allowing the model to adapt to incoming examples using SGD as long as the gradients do not interfere with examples from each task stored in a memory buffer. If gradients interfere, leading to a decrease in the performance of a past task, a quadratic program is used to solve for the closest gradient to the original that does not have negative gradient dot products with the aggregate memories from any previous tasks. GEM is known to achieve superior performance in comparison to other recently proposed techniques that use episodic storage like (Rebuffi et al., 2017), making superior use of small memory buffer sizes. GEM follows similar motivation to our approach in that it also considers the intelligent use of gradient dot product information to improve the use case of supervised continual learning. As a result, it is a very strong and interesting baseline to compare with our approach. We modify the original code and benchmarks provided by (Lopez-Paz & Ranzato, 2017). Once again the only difference in our setting is that we provide the model one example at a time to test true continual learning rather than providing a batch of 10 examples at a time.

We can consider the GEM algorithm as tailored to the stability-plasticity dilemma conceptualization of continual learning in that it looks to preserve performance on past tasks while allowing for maximal plasticity to the new task. To achieve this, GEM (Lopez-Paz & Ranzato, 2017) solves a quadratic program to find an approximate gradient $g_{new}$ that closely matches $\frac{\partial L(x_{new}, y_{new})}{\partial \theta}$ while ensuring that the following constraint holds:

$$g_{new} \cdot \frac{\partial L(x_{old}, y_{old})}{\partial \theta} \geq 0$$
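To make the constraint concrete, the sketch below handles the special case of a single memory gradient, for which the closest gradient with a non-negative dot product has a closed form (a projection). This is an illustration only: GEM itself solves a quadratic program with one constraint per previous task, and the function name and list-based vectors are assumptions made for the example.

```python
def project_gradient(g, g_mem):
    """Closest vector to g (in L2 distance) whose dot product with g_mem is >= 0.

    Single-constraint special case only; GEM uses a quadratic program to
    satisfy one such constraint per previous task simultaneously.
    """
    d = sum(a * b for a, b in zip(g, g_mem))
    if d >= 0:
        return list(g)  # no interference, keep the original gradient
    norm_sq = sum(b * b for b in g_mem)
    coef = d / norm_sq
    # Remove the component of g that points against g_mem.
    return [a - coef * b for a, b in zip(g, g_mem)]
```

When the incoming gradient already agrees with the memory gradient it is returned unchanged, so plasticity to the new example is only reduced when interference would otherwise occur.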


Appendix C Reptile Algorithm

We detail the standard Reptile algorithm from (Nichol & Schulman, 2018) in algorithm 2. The sample function randomly samples $s$ batches of size $k$ from dataset $D$. The SGD function applies mini-batch stochastic gradient descent over a batch of data given a set of current parameters and learning rate.

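procedure Train($D, \theta, \alpha, \gamma, s, k$)
     while not done do
         // Draw batches from data:
         $B_1, \dots, B_s \leftarrow$ sample($D, s, k$)
         $\theta_0 \leftarrow \theta$
         for $i = 1, \dots, s$ do
             $\theta_i \leftarrow$ SGD($B_i, \theta_{i-1}, \alpha$)
         end for
         // Reptile meta-update:
         $\theta \leftarrow \theta_0 + \gamma(\theta_s - \theta_0)$
     end while
     return $\theta$
end procedure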
Algorithm 2 Reptile for Stationary Data
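As a runnable illustration of the Reptile loop, the following sketch applies it to a toy scalar least-squares model $y \approx \theta x$. The model, the fixed iteration count in place of "while not done", and all names (`sgd`, `reptile`, `gamma` for the meta-learning rate) are assumptions made for the example rather than details taken from (Nichol & Schulman, 2018).

```python
import random


def sgd(batch, theta, lr):
    """One mini-batch SGD step on the toy model y ~ theta * x (squared error)."""
    g = sum((theta * x - y) * x for x, y in batch) / len(batch)
    return theta - lr * g


def reptile(data, theta, lr, gamma, s, k, iterations, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        # Draw s batches of size k from the (stationary) dataset.
        batches = [rng.sample(data, k) for _ in range(s)]
        theta_0 = theta
        for batch in batches:
            theta = sgd(batch, theta, lr)
        # Reptile meta-update: move only a fraction gamma towards the
        # parameters reached after the s inner SGD steps.
        theta = theta_0 + gamma * (theta - theta_0)
    return theta
```

Setting `gamma` to 1 recovers ordinary sequential SGD over the sampled batches, and setting it to 0 leaves the parameters unchanged.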

Appendix D Details on Reservoir Sampling

Throughout this paper we refer to updates to our memory as $M \leftarrow M \cup \{(x, y)\}$. We would like to now provide details on how we update our memory buffer using reservoir sampling as outlined in (Vitter, 1985) (algorithm 3). Reservoir sampling solves the problem of keeping some limited number of the total items seen before, each with equal probability, when you don't know what the total number will be in advance. The randomInt function randomly draws an integer inclusively between the provided minimum and maximum values.

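procedure Reservoir($M, N, x, y$)
     // $|M|$ is the buffer capacity, $N$ the number of items seen before $(x, y)$
     if $N < |M|$ then
         $M[N] \leftarrow (x, y)$
     else
         $j \leftarrow$ randomInt(min $= 0$, max $= N$)
         if $j < |M|$ then
             $M[j] \leftarrow (x, y)$
         end if
     end if
     return $M$
end procedure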
Algorithm 3 Reservoir Sampling with Algorithm R
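Reservoir sampling with Algorithm R can be sketched in a few lines of Python. The function signature below is an assumption made for illustration: `n_seen` counts the items observed before the current one, and `capacity` is the fixed buffer size.

```python
import random


def reservoir_update(memory, capacity, n_seen, item, rng=random):
    """Algorithm R update; n_seen is the number of items seen before this one."""
    if len(memory) < capacity:
        # The buffer is not yet full, so every item is stored.
        memory.append(item)
    else:
        # randint is inclusive on both ends, giving n_seen + 1 outcomes.
        j = rng.randint(0, n_seen)
        if j < capacity:
            memory[j] = item
    return memory
```

After processing $N$ items in this way, each of them has the same probability of being present in the buffer, without $N$ having been known in advance.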

Appendix E Experience Replay Algorithm

We detail our variant of experience replay in algorithm 4. This procedure closely follows recent enhancements discussed in (Zhang & Sutton, 2017; Riemer et al., 2017b, a). The sample function randomly samples $k-1$ examples from the memory buffer and interleaves them with the current example to form a single size $k$ batch. The SGD function applies mini-batch stochastic gradient descent over a batch of data given a set of current parameters and learning rate.

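procedure Train($D, \theta, \alpha, k$)
     $M \leftarrow \{\}$
     for $t = 1, \dots, T$ do
         for $(x, y)$ in $D_t$ do
              // Draw batch from buffer:
              $B \leftarrow$ sample($x, y, k, M$)
              // Update parameters with mini-batch SGD:
              $\theta \leftarrow$ SGD($B, \theta, \alpha$)
              // Reservoir sampling memory update:
              $M \leftarrow M \cup \{(x, y)\}$ (algorithm 3)
         end for
     end for
     return $\theta, M$
end procedure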
Algorithm 4 Experience Replay (ER) with Reservoir Sampling
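The structure of this procedure can be sketched as follows, under the simplifying assumptions of a single flattened example stream and a toy scalar model $y \approx \theta x$ trained with squared error. The function and argument names are hypothetical; `k` is the total batch size including the current example.

```python
import random


def er_train(stream, theta, lr, k, capacity, seed=0):
    rng = random.Random(seed)
    memory, n_seen = [], 0
    for x, y in stream:
        # Batch of size at most k: replayed examples plus the current one.
        batch = rng.sample(memory, min(k - 1, len(memory))) + [(x, y)]
        # One mini-batch SGD step on the toy model.
        g = sum((theta * xb - yb) * xb for xb, yb in batch) / len(batch)
        theta -= lr * g
        # Reservoir sampling memory update (Algorithm R).
        if len(memory) < capacity:
            memory.append((x, y))
        else:
            j = rng.randint(0, n_seen)
            if j < capacity:
                memory[j] = (x, y)
        n_seen += 1
    return theta, memory
```

Early in training the buffer holds fewer than `k - 1` examples, so the sketch simply replays whatever is available.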

Appendix F The Variants of MER

We detail two additional variants of MER (algorithm 1) in algorithms 5 and 6. The sample function takes on a slightly different meaning in each variant of the algorithm. In algorithm 1 sample is used to produce $s$ batches consisting of $k-1$ random examples from the memory buffer and the current example. In algorithm 5 sample is used to produce one batch consisting of $sk-s$ examples from the memory buffer and $s$ copies of the current example. In algorithm 6 sample is used to produce one batch consisting of $k-1$ examples from the memory buffer. In contrast, the SGD function carries a common meaning across algorithms, applying stochastic gradient descent over a particular input and output given a set of current parameters and learning rate.

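procedure Train($D, \theta, \alpha, \gamma, s, k$)
     $M \leftarrow \{\}$
     for $t = 1, \dots, T$ do
         for $(x, y)$ in $D_t$ do
              // Draw batch from buffer:
              $B \leftarrow$ sample($x, y, s, k, M$)
              $\theta_0 \leftarrow \theta$
              for $i = 1, \dots, sk$ do
                  $x_c, y_c \leftarrow B[i]$
                  $\theta_i \leftarrow$ SGD($x_c, y_c, \theta_{i-1}, \alpha$)
              end for
              // Reptile meta-update:
              $\theta \leftarrow \theta_0 + \gamma(\theta_{sk} - \theta_0)$
              // Reservoir sampling memory update:
              $M \leftarrow M \cup \{(x, y)\}$ (algorithm 3)
         end for
     end for
     return $\theta, M$
end procedure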
Algorithm 5 Meta-Experience Replay (MER) - One Big Batch
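procedure Train($D, \theta, \alpha, \gamma, s, k$)
     $M \leftarrow \{\}$
     for $t = 1, \dots, T$ do
         for $(x, y)$ in $D_t$ do
              // Draw batch from buffer:
              $B \leftarrow$ sample($k, M$)
              $\theta_0 \leftarrow \theta$
              // SGD on individual samples from batch:
              for $i = 1, \dots, k-1$ do
                  $x_c, y_c \leftarrow B[i]$
                  $\theta_i \leftarrow$ SGD($x_c, y_c, \theta_{i-1}, \alpha$)
              end for
              // High learning rate SGD on current example:
              $\theta_k \leftarrow$ SGD($x, y, \theta_{k-1}, s\alpha$)
              // Reptile meta-update:
              $\theta \leftarrow \theta_0 + \gamma(\theta_k - \theta_0)$
              // Reservoir sampling memory update:
              $M \leftarrow M \cup \{(x, y)\}$ (algorithm 3)
         end for
     end for
     return $\theta, M$
end procedure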
Algorithm 6 Meta-Experience Replay (MER) - Current Example Learning Rate
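The current example learning rate variant can be sketched in Python as follows. This is a simplified illustration, not the implementation used in our experiments: it assumes a single flattened example stream, a toy scalar model $y \approx \theta x$ with squared error, and that the higher learning rate on the current example is the base learning rate multiplied by a factor `s`. All function and argument names are hypothetical.

```python
import random


def sgd_step(theta, x, y, lr):
    """Single-example SGD step on the toy model y ~ theta * x."""
    return theta - lr * (theta * x - y) * x


def mer_current_example_lr(stream, theta, lr, gamma, s, k, capacity, seed=0):
    rng = random.Random(seed)
    memory, n_seen = [], 0
    for x, y in stream:
        # Draw up to k - 1 past examples from the buffer.
        batch = rng.sample(memory, min(k - 1, len(memory)))
        theta_0 = theta
        # SGD on individual samples from the batch.
        for xb, yb in batch:
            theta = sgd_step(theta, xb, yb, lr)
        # Higher learning rate on the current example (assumed factor s).
        theta = sgd_step(theta, x, y, s * lr)
        # Reptile meta-update.
        theta = theta_0 + gamma * (theta - theta_0)
        # Reservoir sampling memory update (Algorithm R).
        if len(memory) < capacity:
            memory.append((x, y))
        else:
            j = rng.randint(0, n_seen)
            if j < capacity:
                memory[j] = (x, y)
        n_seen += 1
    return theta, memory
```

Processing examples one at a time, rather than as a single mini-batch, is what distinguishes this loop from plain experience replay: each step is taken at the parameters produced by the previous step before the interpolation is applied.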

Appendix G Deriving the Effective Objective of MER

We would like to derive what objective Meta-Experience Replay (algorithm 1) approximates and show that it is approximately the same objective from algorithms 5 and 6. We follow conventions from (Nichol & Schulman, 2018) and first demonstrate what happens to the effective gradients computed by the algorithm in the most trivial case. As in (Nichol & Schulman, 2018), this allows us to extrapolate an effective gradient that is a function of the number of steps taken. We can then consider the effective loss function that results in this gradient. Before we begin, let us define the following terms from (Nichol & Schulman, 2018):


In (Nichol & Schulman, 2018) they consider the effective gradient across one loop of Reptile with size $k = 2$. As we have both an outer loop of Reptile applied across batches and an inner loop applied within the batch to consider, we start with a setting where the number of batches $s = 2$ and the number of examples per batch $k = 2$. Let's recall from the original paper that the gradient of Reptile with $k = 2$ was:


So, we can also consider the gradients of Reptile if we had 4 examples in one big batch (algorithm 5) as opposed to 2 batches of 2 examples:


Now we can consider the case for MER where we define the parameter values as follows extending algorithm 1:


the gradient of Meta-Experience Replay can thus be defined analogously to the gradient of Reptile as:


By simply applying Reptile from equation 15 we can derive the value of the parameters after updating with Reptile within the first batch in terms of the original parameters :


By subbing equation 26 into equation 23 we can see that:


We can express in terms of the initial point, by considering a Taylor expansion following the Reptile paper:


Then substituting in for we express in terms of :


We can then rewrite by taking a Taylor expansion with respect to :


Taking another Taylor expansion we find that we can transform our expression for the Hessian:


We can analogously also transform our expression for :


Substituting for in terms of


We then substitute equation 31, equation 33, and equation 29 into equation 34:


Finally, we have all of the terms we need to express and we can then derive an expression for the MER gradient :


This equation is quite interesting and very similar to equation 16. As we would like to approximate the same objective, we can remove one hyperparameter from our model by setting $\beta = 1$. This yields:


Indeed, with $\beta$ set to equal 1, we have shown that the gradient of MER is the same as one loop of Reptile with a number of steps equal to the total number of examples in all batches of MER (algorithm 5) if the current example is mixed in with the same proportion. If we include the current example for $s$ of $sk$ examples in our meta-replay batch, it gets the same overall priority in both cases, which is $s$ times larger than that of a random example drawn from the buffer. As such, we can also optimize an equivalent gradient using algorithm 6 because it uses a factor $s$ to increase the priority of the gradient given to the current example.
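The equivalence with a single Reptile loop can also be checked numerically. The sketch below uses a toy scalar model $y \approx \theta x$ and hypothetical names (`beta` for the within batch meta-learning rate, `gamma` for the across batch one); it is an illustration of the special case discussed above rather than code from our experiments.

```python
def sgd_step(theta, example, lr):
    """Single-example SGD step on the toy model y ~ theta * x."""
    x, y = example
    return theta - lr * (theta * x - y) * x


def mer_update(theta, batches, lr, beta, gamma):
    theta_a0 = theta
    for batch in batches:
        theta_w0 = theta
        for example in batch:
            theta = sgd_step(theta, example, lr)
        # Within batch Reptile meta-update.
        theta = theta_w0 + beta * (theta - theta_w0)
    # Across batch Reptile meta-update.
    return theta_a0 + gamma * (theta - theta_a0)


def reptile_update(theta, examples, lr, gamma):
    theta_0 = theta
    for example in examples:
        theta = sgd_step(theta, example, lr)
    return theta_0 + gamma * (theta - theta_0)
```

With `beta` equal to 1 the within batch interpolation leaves the parameters where the inner SGD steps put them, so `mer_update` on $s$ batches of $k$ examples matches `reptile_update` on the same $sk$ examples in order, up to floating point error. For any other value of `beta` the two updates generally differ.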

While $\beta = 1$ is an interesting special case of MER in algorithm 1, in general we find it can be useful to set $\beta$ to be a value smaller than 1. In fact, in our experiments we consider the case when $\beta$ is smaller than 1 and $\gamma = 1$. The success of this approach makes sense because the higher order terms in the Taylor expansion that reflect the mismatch between parameters across replay batches create variance in the learning process. By setting $\beta$ to a value below 1 we can reduce our comparative weighting on promoting inter batch gradient similarities rather than intra batch gradient similarities.

It was noted in the Reptile paper that the following equality holds if the examples and order are random:


In our work, to make sure this equality holds in an online setting, we must take multiple precautions as noted in the main text. The issue is that examples are received in a non-stationary sequence, so when applied in a continual learning setting the order is not totally random or arbitrary as in the original Reptile work. We address this by maintaining our buffer using reservoir sampling, which ensures that any example seen before has an equal probability of being a particular element in the buffer. We also randomly select over these elements to form a batch. As this makes the order largely arbitrary to the extent that our buffer includes all examples seen, we are approximating the random offline setting from the original Reptile paper. As such we can view the gradients in equation 16 and equation 36 as leading to approximately the following objective function:


This is precisely equation 7 in the main text.

Appendix H Supervised Continual Lifelong Learning

For the supervised continual learning benchmarks on MNIST Rotations and MNIST Permutations, following conventions, we use a two layer MLP architecture for all models with 100 hidden units in each layer. We also model our hyperparameter search after (Lopez-Paz & Ranzato, 2017).

For Omniglot, following (Vinyals et al., 2016), we scale the images to 28x28 and use an architecture that consists of a stack of 4 modules before a fully connected softmax layer. Each module includes a 3x3 convolution with 64 filters, a ReLU non-linearity and 2x2 max-pooling.

H.1 Hyperparameter Search

Here we report the hyperparameter grids that we searched over in our experiments. The best values for MNIST Rotations (ROT) at each buffer size (ROT-5120, ROT-500, ROT-200), MNIST Permutations (PERM) at each buffer size (PERM-5120, PERM-500, PERM-200), Many Permutations (MANY) at each buffer size (MANY-5120, MANY-500), and Omniglot (OMNI) at each buffer size (OMNI-5120, OMNI-500) are noted accordingly in parentheses.

  • Online Learning

    • learning rate: [0.0001, 0.0003, 0.001 (PERM, ROT), 0.003 (MANY), 0.01, 0.03, 0.1 (OMNI)]

  • EWC

    • learning rate: [0.001 (OMNI), 0.003 (MANY), 0.01, 0.03 (ROT, PERM), 0.1, 0.3, 1.0]

    • regularization: [1, 3 (MANY), 10 (OMNI), 30, 100, 300 (ROT, PERM), 1000, 3000, 10000, 30000]

  • GEM

    • learning rate: [0.001, 0.003 (MANY-500), 0.01 (ROT, PERM, OMNI, MANY-5120), 0.03, 0.1, 0.3, 1.0]

    • memory strength ($\gamma$): [0.0 (PERM-500, MANY-500), 0.1 (PERM-200, MANY-200), 0.5 (OMNI), 1.0 (ROT-5120, ROT-500, ROT-200, PERM-5120)]

  • Experience Replay

    • learning rate: [0.00003, 0.0001, 0.0003, 0.001, 0.003 (MANY), 0.01 (ROT, PERM), 0.03 (OMNI), 0.1]

    • batch size ($k-1$): [5, 10 (ROT-500, PERM-200), 25 (ROT-5120, PERM-5120, PERM-500), 50 (OMNI, MANY-5120, ROT-200), 100, 250]

  • Meta-Experience Replay

    • learning rate ($\alpha$): [0.01 (OMNI-5120), 0.03 (ROT, PERM, MANY), 0.1 (OMNI-500)]

    • across batch meta-learning rate ($\gamma$): 1.0

    • within batch meta-learning rate ($\beta$): [0.03 (ROT-5120, ROT-200, PERM-5120, PERM-200, MANY), 0.1 (ROT-500, PERM-500), 0.3, 1.0 (OMNI)]

    • batch size ($k-1$): [5 (MANY-500, OMNI-500), 10, 25 (PERM-500, PERM-200, OMNI-5120), 50 (ROT-5120, PERM-5120, MANY-5120), 100 (ROT-200, ROT-500)]

    • number of batches per example: [1, 2 (PERM-500, OMNI-500), 5 (ROT-500, ROT-200, PERM-200, OMNI-5120, MANY-5120), 10 (ROT-5120, PERM-5120, MANY-500)]

Appendix I Forward Transfer and Interference

Forward transfer was a metric defined in (Lopez-Paz & Ranzato, 2017) based on the average increased performance on a task relative to performance at random initialization before training on that task. Unfortunately, this metric does not make much sense for tasks like MNIST Permutations where inputs are totally uncorrelated across tasks or Omniglot where outputs are totally uncorrelated across tasks. As such, we only provide performance for this metric on MNIST Rotations in Table 5.

| Model | FTI |
|---|---|
| Online | 58.22 ± 2.03 |
| Task Input | 1.62 ± 0.87 |
| EWC | 58.26 ± 1.98 |
| GEM | 65.96 ± 1.67 |
| MER | 66.74 ± 1.41 |
Table 5: Forward transfer and interference (FTI) experiments on MNIST Rotations.

Appendix J Ablation Experiments

In order to consider a version of GEM that uses reservoir sampling, we maintain our buffer the same way that we do for experience replay and MER. We consider everything in the buffer to be old data and solve the GEM quadratic program so that the loss is not increased on this data. We found that considering the task level gradient directions did not lead to improvements.

| Model | Buffer Size | Rotations | Permutations | Many Permutations |
|---|---|---|---|---|
| ER (algorithm 4) | 5120 | 88.30 ± 0.57 | 83.90 ± 0.21 | 59.78 ± 0.22 |
| | 500 | 76.58 ± 0.89 | 74.02 ± 0.33 | 42.36 ± 0.42 |
| | 200 | 70.32 ± 0.86 | 67.62 ± 0.27 | - |
| MER (algorithm 1) | 5120 | 89.56 ± 0.11 | 85.50 ± 0.16 | 61.84 ± 0.25 |
| | 500 | 81.82 ± 0.52 | 77.40 ± 0.38 | 47.40 ± 0.35 |
| | 200 | 77.24 ± 0.47 | 72.74 ± 0.46 | - |
| GEM (Lopez-Paz & Ranzato, 2017) | 5120 | 87.12 ± 0.44 | 82.50 ± 0.42 | 56.76 ± 0.29 |
| | 500 | 72.08 ± 1.29 | 69.26 ± 0.66 | 32.14 ± 0.50 |
| | 200 | 66.88 ± 0.72 | 55.42 ± 1.10 | - |
| GEM with Reservoir Sampling | 5120 | 87.16 ± 0.41 | 83.68 ± 0.40 | 58.94 ± 0.53 |
| | 500 | 77.26 ± 2.09 | 74.82 ± 0.29 | 42.24 ± 0.48 |
| | 200 | 69.00 ± 0.84 | 68.90 ± 0.71 | - |
Table 6: Retained accuracy ablation experiments on MNIST based lifelong learning benchmarks.

Appendix K Continual Reinforcement Learning

We detail the application of MER to deep Q-learning in algorithm 7, using notation from Mnih et al. (2015).

procedure DQN-MER()
     // Initialize action-value function with parameters :
     // Initialize action-value function with the same parameters :
     // Initialize experience replay buffer:
     while  do
         // Begin new episode:
         // Initialize the state with the initial observation:
         while episode not done do
              // Select with probability an action from set of possible actions:
              // Perform the action in the environment:
              // Store current transition with reward :
               (algorithm 3)
              // Store current weights:
              for  do
                  for  do
                       // Sample one set of processed sequences, actions, and rewards from :
                       // Optimize the Huber loss with respect to the parameters :
                  end for
                  // Within batch Reptile meta-update:
              end for
              // Across batch Reptile meta-update:
              // Reset target action-value network to every number of episodes:
         end while
     end while
end procedure
Algorithm 7 Deep Q-learning with Meta-Experience Replay (MER)

K.1 Description of Catcher and Flappy Bird

In Catcher, the agent controls a segment that lies horizontally at the bottom of the screen, i.e. a basket, and can move right or left, or stay still. The goal is to move the basket to catch as many pellets as possible. Missing a pellet results in losing one of the three available lives. Pellets emerge one by one from the top of the screen, and have a descending velocity that is fixed for each task.

In the case of the very popular game Flappy Bird, the agent has to navigate a bird in an environment full of pipes by deciding whether to flap or not flap its wings. The pipes always appear in pairs, one from the bottom of the screen and one from the top of the screen, and have a gap that allows the bird to pass through them. Flapping the wings results in the bird ascending; otherwise the bird naturally descends to the ground. Both ascending and descending velocities are preset by the physics engine of the game. The goal is to pass through as many pairs of pipes as possible without hitting a pipe, as this results in losing the game. The scoring scheme in this game awards a point each time a pipe is crossed. Despite very simple mechanics, Flappy Bird has proven to be challenging for many humans. According to the original game scoring scheme, players with a score of 10 receive a Bronze medal; with 20 points, a Silver medal; 30 results in a Gold medal, and any score better than 40 is rewarded with a Platinum medal.

K.2 DQN with Meta-Experience Replay

The DQN used to train on both games follows the classic architecture from (Mnih et al., 2015): it has a CNN consisting of 3 layers, the first with 32 filters and an 8x8 kernel, the second layer with 64 filters and a 4x4 kernel, and a final layer with 64 filters and a 3x3 kernel. The CNN is followed by two fully connected layers. A ReLU non-linearity was applied after each layer. We limited the memory buffer size for our models to 50k transitions, which is roughly the proportion of the total memories used in the benchmark setting for our supervised learning tasks.

K.3 Parameters for Continual Reinforcement Learning Experiments

For the continual reinforcement learning setting we set the parameters using results from the experiments in the supervised setting as guidance. Both Catcher and Flappy Bird used the same hyperparameters as detailed below, with the obvious exception of the game-dependent parameter that defines each task. Models were trained with a maximum number of frames of 150k and 6 total tasks, switching every 25k frames. Runs used different random seeds for the initialization as stated in the figures.

  • Game Parameters

    • Catcher: : (vertical velocity of pellet increased from default 0.608).

    • Flappy Bird: : (pipe gap decreased 10 from default 100).

  • Experience Replay

    • learning rate: 0.0001

    • batch size ($k-1$): 16

  • Meta-Experience Replay

    • learning rate ($\alpha$): 0.0001

    • within batch meta-learning rate ($\beta$): 1

    • across batch meta-learning rate ($\gamma$): 0.3

    • batch size ($k-1$): 16

    • number of steps: 1

    • buffer size: 50000

K.4 Continual Reinforcement Learning Evaluation for Flappy Bird

Figure 4: Continual learning for a non-stationary version of Flappy Bird.

Performance during training in continual learning for a non-stationary version of Flappy Bird is shown in Figure 4. Graphs show averaged values over three validation episodes across three different seed initializations. Vertical grid lines on the frame axis indicate a task switch.