Official implementation of the Averaged Gradient Episodic Memory (A-GEM) in Tensorflow
Learning with less supervision is a major challenge in artificial intelligence. One sensible approach to decrease the amount of supervision is to leverage prior experience and transfer knowledge from tasks seen in the past. However, a necessary condition for a successful transfer is the ability to remember how to perform previous tasks. The Continual Learning (CL) setting, whereby an agent learns from a stream of tasks without seeing any example twice, is an ideal framework to investigate how to accrue such knowledge. In this work, we consider supervised learning tasks and methods that leverage a very small episodic memory for continual learning. Through an extensive empirical analysis across four benchmark datasets adapted to CL, we observe that a very simple baseline, which jointly trains on both examples from the current task as well as examples stored in the memory, outperforms state-of-the-art CL approaches with and without episodic memory. Surprisingly, repeated learning over tiny episodic memories does not harm generalization on past tasks, as joint training on data from subsequent tasks acts like a data dependent regularizer. We discuss and evaluate different approaches to write into the memory. Most notably, reservoir sampling works remarkably well across the board, except when the memory size is extremely small. In this case, writing strategies that guarantee an equal representation of all classes work better. Overall, these methods should be considered as a strong baseline candidate when benchmarking new CL approaches.
Arguably, the objective of continual learning (CL) is to rapidly learn new skills from a sequence of tasks leveraging the knowledge accumulated in the past. Catastrophic forgetting (McCloskey and Cohen, 1989), i.e. the inability of a model to recall how to perform tasks seen in the past, makes such rapid or efficient adaptation extremely difficult.
This decades-old problem of CL (Ring, 1997; Thrun, 1998) is now seeing a surge of interest in the research community. Recently, several works have attempted to reduce forgetting by adding a regularization term to the objective function. In some of these works (Kirkpatrick et al., 2016; Zenke et al., 2017; Chaudhry et al., 2018; Aljundi et al., 2018), the regularization term discourages change in parameters that were important to solve past tasks. In other works (Li and Hoiem, 2016; Rebuffi et al., 2017), regularization is used to penalize feature drift on already learned tasks. Yet another approach is to use an episodic memory storing data from past tasks (Rebuffi et al., 2017; Chaudhry et al., 2018); one effective approach to leverage such episodic memory is to use it to constrain the optimization such that the loss on past tasks can never increase (Lopez-Paz and Ranzato, 2017).
In this work, we conduct a quantitative study of CL methods on four benchmark datasets under the assumptions that i) each task is fully supervised, ii) each example from a task can be seen only once, following the learning protocol proposed by Chaudhry et al. (2019) (see §3), and iii) the learner has access to a very small episodic memory to store and replay examples from the past. Restricting the size of the episodic memory is important because it makes the learning problem more realistic and more distinct from multi-task learning.
While Lopez-Paz and Ranzato (2017) and Chaudhry et al. (2019) used the memory as a constraint, here we drastically simplify the optimization problem and directly train on the memory, resulting in better performance and more efficient learning. Earlier works (Isele and Cosgun, 2018) explored a similar usage of episodic memory, dubbed Experience Replay (ER) (for consistency with prior work in the literature, we will refer to this approach, which trains on the episodic memory, as ER, although its usage for supervised learning tasks is far less established), but for RL tasks where the learner does multiple passes over the data using a very large episodic memory. Our work is instead most similar to Riemer et al. (2019), who also considered the same single-pass-through-the-data learning setting and trained directly on the episodic memory. Our contribution is to extend their study by a) considering much smaller episodic memories, b) investigating different strategies to write into the memory and c) analyzing why training on tiny memories does not lead to overfitting.
Our extensive empirical analysis shows that when the size of the episodic buffer is reasonably large, ER with reservoir sampling outperforms current state-of-the-art CL methods (Kirkpatrick et al., 2016; Chaudhry et al., 2019; Riemer et al., 2019). However, when the episodic buffer is very small, reservoir sampling-based ER suffers a performance loss because there is no guarantee that each class has at least one representative example stored in the memory. In this regime, other sampling strategies that sacrifice perfect randomization for balancing the population across classes work better. This observation motivated us to introduce a simple hybrid approach which combines the best of both worlds and does not require prior knowledge of the total number of tasks: it starts by using reservoir sampling but then switches to a balanced strategy when it detects that at least one class has too few examples stored in the memory. Importantly and counter intuitively, repetitive learning on the same tiny episodic memory still yields a significant boost of performance thanks to the regularization effect brought by training on subsequent tasks, a topic which we investigate at length in §4.5. In conclusion, ER on tiny episodic memories offers very strong performance at a very small additional computational cost. We believe that this approach will serve as a stronger baseline for the development of future CL approaches.
Recent works (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2019) have shown that methods relying on episodic memory achieve superior performance to regularization-based approaches (e.g., Kirkpatrick et al., 2016; Zenke et al., 2017) when using a "single pass through the data" protocol; see §3 for details. While Lopez-Paz and Ranzato (2017) and Chaudhry et al. (2019) used the episodic memory as a means to project gradients, more recently Riemer et al. (2019) proposed an approach that achieves even better generalization by training directly with the examples stored in the memory. In this work, we build upon Riemer et al. (2019) and further investigate the use of memory as a source of training data when the memory size is very small. Moreover, we compare various heuristics for writing into the memory.
The overall training procedure is given in Alg. 1. Compared to the simplest baseline model, which merely tunes the parameters on the new task starting from the previous task's parameter vector, ER makes two modifications. First, it has an episodic memory which is updated at every time step (line 8). Second, it doubles the size of the minibatch used to compute the gradient descent parameter update by stacking the actual minibatch of examples from the current task with a minibatch of examples taken at random from the memory (line 7). As we shall see in our empirical validation, these two simple modifications yield much better generalization and substantially limit forgetting, while incurring a negligible additional computational cost on modern GPU devices.
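The two modifications above can be sketched in a few lines of framework-agnostic Python. This is a minimal illustration of the ER update, not the official TensorFlow implementation; the names `er_step`, `sample_from_memory` and the toy `grad_fn` callback are ours, and the memory-writing step of Alg. 1 is assumed to happen separately.

```python
import random

def sample_from_memory(memory, batch_size):
    """Draw a random replay mini-batch; if the memory holds fewer
    examples than batch_size, just use everything it contains."""
    if len(memory) <= batch_size:
        return list(memory)
    return random.sample(memory, batch_size)

def er_step(train_batch, memory, grad_fn, params, lr=0.1, mem_batch_size=10):
    """One ER update (sketch of Alg. 1): stack the current task's
    mini-batch with a replay mini-batch drawn from the episodic
    memory, then take a single SGD step on the joint batch."""
    replay_batch = sample_from_memory(memory, mem_batch_size)
    joint_batch = train_batch + replay_batch   # the "doubled" mini-batch
    grad = grad_fn(params, joint_batch)        # user-supplied gradient oracle
    return [p - lr * g for p, g in zip(params, grad)]
```

In a real model `grad_fn` would backpropagate through the network; here it is any callable returning per-parameter gradients, which keeps the replay logic itself visible.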
Next, we describe various strategies to write into the memory. All these methods assume access to a continuous stream of data and only a small additional temporary memory, which rules out approaches relying on the temporary storage of all the examples seen so far. This restriction is consistent with our definition of CL: a learning experience through a stream of data under the constraint of a fixed, small memory and a limited compute budget.
Similarly to Riemer et al. (2019), reservoir sampling (Vitter, 1985) takes as input a stream of data of unknown length and returns a random subset of items from that stream. If $n$ is the number of points observed so far and $M$ is the size of the reservoir (sampling buffer), this selection strategy samples each data point with probability $M/n$. The routine to update the memory is given in Alg. 2.
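A minimal Python sketch of the reservoir update (our illustration of Alg. 2, not the paper's code):

```python
import random

def reservoir_update(memory, mem_size, example, n_seen):
    """Reservoir sampling (Vitter, 1985). `n_seen` is the number of
    stream examples observed so far, *including* `example`. Each
    example ends up stored with probability mem_size / n_seen."""
    if len(memory) < mem_size:
        memory.append(example)           # memory not yet full: always store
    else:
        j = random.randrange(n_seen)     # uniform index in {0, ..., n_seen - 1}
        if j < mem_size:
            memory[j] = example          # overwrite a uniformly random slot
    return memory
```

Note the memory never exceeds `mem_size`, and no knowledge of the stream length or task boundaries is needed, which is what makes this strategy attractive for CL.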
Similarly to Lopez-Paz and Ranzato (2017), for each task, the ring buffer strategy allocates as many equally sized FIFO buffers as there are classes. If $C$ is the total number of classes across all tasks and $M$ is the memory size, each class has a buffer of size $M/C$. As shown in Alg. 3, the memory stores the last few observations from each class. Unlike reservoir sampling, in this strategy the samples stored for older tasks do not change throughout training, leading to potentially stronger overfitting. Also, at early stages of training the memory is not fully utilized, since each buffer has constant size throughout training. However, this simple sampling strategy guarantees equal representation of all classes in the memory, which is particularly important when the memory is very small.
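The per-class FIFO behavior maps directly onto bounded deques. A small sketch (our illustration of Alg. 3; helper names are ours, and we assume the memory size divides evenly by the number of classes):

```python
from collections import deque

def make_ring_buffers(mem_size, total_classes):
    """One FIFO buffer per class, each of fixed size
    mem_size // total_classes (assumes an even split)."""
    per_class = mem_size // total_classes
    return {c: deque(maxlen=per_class) for c in range(total_classes)}

def ring_buffer_update(buffers, example, label):
    """Store the newest example of `label`; the deque's maxlen
    evicts the oldest stored example automatically once full."""
    buffers[label].append(example)
    return buffers
```
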
For each class, we use online k-Means to estimate k centroids in feature space, using the representation before the last classification layer. We then store in the memory the input examples whose feature representations are closest to those centroids, see Alg. 4. This memory writing strategy has similar benefits and drawbacks to ring buffer, except that it has potentially better coverage of the feature space in the L2 sense.
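The core of this strategy is the streaming centroid update. Below is a minimal sketch of one online k-Means step for a single class (our illustration, not Alg. 4 itself); in the full method the example whose feature lies closest to each centroid would then be the one kept in memory.

```python
def online_kmeans_update(centroids, counts, feature):
    """Move the centroid nearest to `feature` toward it with a
    1/count step size (the standard online k-Means rule).
    `centroids` is a list of feature vectors, `counts` tracks how
    many points each centroid has absorbed. Returns the index of
    the centroid that was updated."""
    dists = [sum((c - f) ** 2 for c, f in zip(cen, feature))
             for cen in centroids]
    i = dists.index(min(dists))
    counts[i] += 1
    lr = 1.0 / counts[i]
    centroids[i] = [c + lr * (f - c) for c, f in zip(centroids[i], feature)]
    return i
```
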
Similarly to Rebuffi et al. (2017), for each class we compute a running estimate of the average feature vector just before the classification layer and store in the memory examples whose feature representations are closest to that average feature vector, see details in Alg. 5. This writing strategy has the same balancing guarantees as ring buffer and k-Means, but it populates the memory differently: instead of populating the memory at random or via k-Means centroids, it keeps the examples that are closest to the mode in feature space.
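A simplified single-slot sketch of the Mean-of-Features idea (our illustration, not Alg. 5): maintain a running mean of the feature vectors of one class and remember the example whose feature was closest to the mean at insertion time. The real algorithm compares all stored candidates against the evolving mean; this greedy single-pass version only approximates that.

```python
def mof_update(state, example, feature):
    """Update the running feature mean for one class and track the
    example whose feature was nearest (L2) to the mean when seen.
    `state` is a plain dict mutated in place."""
    n = state["n"] = state.get("n", 0) + 1
    mean = state.get("mean", [0.0] * len(feature))
    mean = [m + (f - m) / n for m, f in zip(mean, feature)]   # incremental mean
    state["mean"] = mean
    d = sum((f - m) ** 2 for f, m in zip(feature, mean))
    if state.get("best_dist") is None or d < state["best_dist"]:
        state["best_dist"], state["best_example"] = d, example
    return state
```
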
We use the same learning framework proposed by Chaudhry et al. (2019). There are two streams of tasks, $\mathcal{D}^{CV}$ and $\mathcal{D}^{EV}$. The former contains only a handful of tasks and is used only for cross-validation purposes; tasks from $\mathcal{D}^{CV}$ can be replayed as many times as needed and have various degrees of similarity to tasks in the training and evaluation stream $\mathcal{D}^{EV}$. The latter stream can be played only once: the learner observes examples in sequence and is tested throughout the learning experience. The final performance is reported on the held-out test sets drawn from $\mathcal{D}^{EV}$.
The $k$-th task in any of these streams consists of $\mathcal{D}_k = \{(x_i^k, t_i^k, y_i^k)\}_{i=1}^{n_k}$, where each triplet constitutes an example defined by an input ($x_i^k \in \mathcal{X}$), a task descriptor ($t_i^k \in \mathbb{N}$), which is an integer id in this work, and a target vector ($y_i^k \in \mathcal{Y}_k$), where $\mathcal{Y}_k$ is the set of labels specific to task $k$ and $\mathcal{Y}_k \subset \mathcal{Y}$.
Let $a_{k,j}$ be the performance (test accuracy) of the model on the held-out test set of task $j$ after the model is trained on task $k$. The average accuracy at task $k$ is then defined as: $A_k = \frac{1}{k} \sum_{j=1}^{k} a_{k,j}$.
Let $f_j^k$ be the forgetting on task $j$ after the model is trained on task $k$, which is computed as: $f_j^k = \max_{l \in \{1, \ldots, k-1\}} a_{l,j} - a_{k,j}$.
The average forgetting measure at task $k$ is then defined as: $F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} f_j^k$.
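Both metrics are simple reductions over the accuracy matrix $a$. A minimal sketch (function names ours; 0-indexed, so `a[i][j]` is the accuracy on task $j{+}1$ after training on task $i{+}1$):

```python
def average_accuracy(a, k):
    """A_k: mean test accuracy over tasks 1..k after training on task k."""
    return sum(a[k - 1][j] for j in range(k)) / k

def forgetting(a, j, k):
    """f_j^k: best accuracy ever achieved on task j before task k,
    minus the accuracy on task j after training on task k."""
    return max(a[l][j] for l in range(k - 1)) - a[k - 1][j]

def average_forgetting(a, k):
    """F_k: mean forgetting over the first k-1 tasks."""
    return sum(forgetting(a, j, k) for j in range(k - 1)) / (k - 1)
```
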
In this section, we review the benchmark datasets we used in our evaluation, as well as the architectures and the baselines we compared against. We then report the results we obtained using episodic memory and ER, and we conclude with a brief analysis investigating generalization when using ER on tiny memories.
We consider four benchmarks commonly used in the CL literature. Permuted MNIST (Kirkpatrick et al., 2016) is a variant of the MNIST (LeCun, 1998) dataset of handwritten digits in which each task applies a fixed random permutation of the input pixels to all images of that task. Our Permuted MNIST benchmark consists of a sequence of such tasks.
Split CIFAR (Zenke et al., 2017) consists of splitting the original CIFAR-100 dataset (Krizhevsky and Hinton, 2009) into disjoint subsets, each of which is considered a separate task. Each task contains a set of classes randomly sampled without replacement from the 100 classes of CIFAR-100.
Similarly to Split CIFAR, Split miniImageNet is constructed by splitting miniImageNet (Vinyals et al., 2016), a subset of ImageNet with a total of 100 classes and 600 images per class, into disjoint subsets.
For MNIST, we use a fully-connected network with two hidden layers of 256 ReLU units each. For CIFAR and miniImageNet, a reduced ResNet18, similar to Lopez-Paz and Ranzato (2017), is used, and a standard ResNet18 pretrained on ImageNet is used for CUB. The input integer task id is used to select a task-specific classifier head, and the network is trained via cross-entropy loss.
For a given dataset stream, all models use the same architecture, and all models are optimized via stochastic gradient descent with a mini-batch size equal to 10. The size of the mini-batch sampled from the episodic memory is also set to 10 irrespective of the size of the episodic buffer.
We compare against the following baselines:
finetune, a model trained continually without any regularization or episodic memory, with the parameters of a new task initialized from the parameters of the previous task.
ewc (Kirkpatrick et al., 2016), a regularization-based approach that avoids catastrophic forgetting by limiting learning of parameters critical to the performance of past tasks, as measured by the Fisher information matrix (FIM). In particular, we compute the FIM as a moving average similar to ewc++ in Chaudhry et al. (2018) and online EWC in Schwarz et al. (2018).
mer (Riemer et al., 2019), a model that also leverages an episodic memory and uses a loss that approximates the dot products of the gradients of current and previous tasks to avoid forgetting. To make the experimental setting more comparable to the other methods (in terms of SGD updates), we limit the number of inner gradient steps for each outer Reptile (Nichol and Schulman, 2018) meta-update and use a mini-batch size of 10.
In the first experiment, we measured average accuracy at the end of the learning experience as a function of the size of the memory. From the results in Fig. 1 we can make several observations. First, and not surprisingly, average accuracy increases with the memory size, and all memory-based methods achieve much better average accuracy than the finetune and ewc baselines, showing that CL methods with episodic memories are indeed very effective. For instance, on CIFAR the improvement brought by an episodic memory storing a single example per class is at least 10% (the difference between the performance of MER, the worst performing method using episodic memory, and ewc, the best performing baseline not relying on episodic memory), a gain that further increases to 20% when the memory stores 10 examples per class.
Second, methods using ER outperform not only the baseline approaches that do not have episodic memory (finetune and ewc) but also state-of-the-art approaches relying on an episodic memory of the same size (a-gem and mer). For instance, on CIFAR the gain over a-gem brought by ER is 1.7% when the memory only stores 1 example per class, and more than 5% when the memory stores 13 examples per class. This finding might seem quite surprising and will be investigated in more depth in §4.5.
Third, experience replay based on reservoir sampling works the best across the board except when the memory size is very small (less than 3 samples per class). Empirically we observed that as more and more tasks arrive and the size of the memory per class shrinks, reservoir sampling often ends up evicting some of the earlier classes from the memory, thereby inducing higher forgetting.
Fourth, when the memory is tiny, sampling methods that by construction guarantee a balanced number of samples per class work the best (even better than reservoir sampling). All methods that have this property, ring buffer, k-Means and Mean of Features, have rather similar performance, which is substantially better than reservoir sampling. For instance, on CIFAR, with one example per class in the memory, ER with reservoir sampling is 3.5% worse than ER K-Means, but ER K-Means, ER Ring Buffer and ER MoF are all within 0.5% of each other (see Tab. 3 in Appendix for numerical values). These findings are further confirmed by looking at the evolution of the average accuracy (Fig. 2 left) as new tasks arrive when the memory can store at most one example per class.
The better performance of strategies like ring buffer for tiny episodic memories, and of reservoir sampling for bigger episodic memories, suggests a hybrid approach, whereby the writing strategy relies on reservoir sampling until some classes have too few samples stored in the memory. At that point, the writing strategy switches to the ring buffer scheme, which guarantees a minimum number of examples for each class. For instance, in the experiment of Fig. 3 the memory budget is so tight that it amounts to an average of one sample per class by the end of the learning experience. The learner switches from reservoir sampling to ring buffer once it observes that some class seen in the past has only one sample left in the memory. When the switch happens (marked by a red vertical line in the figure), the learner keeps only $\min(n_c, M/C_t)$ randomly picked examples of each class in the memory, where $n_c$ is the number of examples of class $c$ currently in the memory, $M$ is the memory size and $C_t$ is the total number of classes observed so far. The overwriting happens opportunistically, removing examples from over-represented classes as new classes are observed. Fig. 3 shows that when the number of tasks is small, the hybrid version enjoys the high accuracy of reservoir sampling. As more tasks arrive and the memory per task shrinks, the hybrid scheme achieves performance superior to reservoir sampling (and at least similar to ring buffer).
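The switch condition and the opportunistic rebalancing can be sketched as follows. This is our illustrative reading of the hybrid scheme, not the paper's code; `hybrid_should_switch` and `rebalance` are hypothetical helper names, and the per-class quota assumes an even split of the memory.

```python
import random

def hybrid_should_switch(class_counts, threshold=1):
    """Switch from reservoir sampling to ring buffer once any class
    already seen has at most `threshold` examples left in memory."""
    return any(c <= threshold for c in class_counts.values())

def rebalance(memory, labels_seen, mem_size):
    """After the switch: keep at most mem_size // len(labels_seen)
    randomly chosen examples per class, freeing slots so that newly
    observed classes can be represented. `memory` holds (x, y) pairs."""
    quota = mem_size // len(labels_seen)
    by_class = {}
    for x, y in memory:
        by_class.setdefault(y, []).append((x, y))
    kept = []
    for items in by_class.values():
        random.shuffle(items)
        kept.extend(items[:quota])   # evict surplus from over-represented classes
    return kept
```
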
Table 1: Training time [s] of each method.
Finally, experience replay methods outperform all other approaches not only in terms of accuracy (and lower forgetting, as reported in Appendix Tab. 2, 3, 4 and 5), but also in terms of compute time. Tab. 1 reports training time on both Permuted MNIST and Split CIFAR, using ring buffer as a representative, since all ER methods have the same computational complexity. We observe that ER adds only a slight overhead compared to the finetune baseline, but it is much cheaper than stronger baseline methods like a-gem and mer, particularly when the model is bigger, as on the CIFAR dataset.
The strong performance of experience replay methods which directly learn using the examples stored in the memory may be surprising. In fact, Lopez-Paz and Ranzato (2017) discounted this loss design option by saying: “Obviously, minimizing the loss at the current example together with [the loss on the episodic memory] results in overfitting to the examples stored in [the memory]”. How can the repeated training over the same very small handful of examples possibly generalize?
To investigate this matter we conducted an additional experiment. For simplicity, let us assume there are only two tasks, $T_1$ and $T_2$, and let us study the generalization performance on $T_1$ as we train on $T_2$. We denote by $\mathcal{D}_1$ the training set of $T_1$ and by $\mathcal{M}_1$ the memory storing examples from $T_1$'s training set. One hypothesis is that although direct training on the examples in $\mathcal{M}_1$ (in addition to those coming from $T_2$) does indeed lead to strong memorization of $\mathcal{M}_1$ (as measured by nearly zero cross-entropy loss on $\mathcal{M}_1$), such training is still overall beneficial in terms of generalization on the original task, because the joint learning over the examples of the current task acts as a strong, albeit implicit and data dependent, regularizer for $T_1$.
To validate this hypothesis, we conducted experiments on the MNIST Rotations dataset (Lopez-Paz and Ranzato, 2017), where each task has digits rotated by a certain degree. We chose this dataset because it enables fine control over the relatedness between the tasks. We considered the same architecture used for Permuted MNIST, with only 10 memory slots, one for each class of $T_1$. First, we verified that the loss on the memory quickly drops to nearly zero as the model is trained using both $\mathcal{M}_1$ and $T_2$'s data. As expected, the model achieves perfect performance on the examples in the memory, which is not true for methods like a-gem that make less direct use of the memory (see Tab. 6 in Appendix). We then verified that training only on $\mathcal{M}_1$ without $T_2$'s data yields strong overfitting to the examples in the memory and poor generalization performance, with the average accuracy on $T_1$ dropping far below the value obtained just after training on $T_1$. If we train only on $T_2$ without using $\mathcal{M}_1$ (same as the finetune baseline), we also observe overfitting to $T_2$ as long as $T_1$ and $T_2$ are sufficiently unrelated, Fig. 4(b) and 4(c). When the two tasks are closely related instead (difference of rotation angles less than 20 degrees), we observe that even without the memory, generalization on $T_1$ improves as we train on $T_2$ because of positive transfer from the related task, see the red curve in Fig. 4(a). However, when we train on both $\mathcal{M}_1$ and $T_2$, generalization on $T_1$ is better than the finetune baseline, i.e., training with $T_2$ only, regardless of the degree of relatedness between the two tasks, as shown by the green curves in Fig. 4.
These findings suggest that while the model essentially memorizes the examples in the memory, this does not necessarily have a detrimental effect on generalization as long as such learning is performed in conjunction with the examples of $T_2$. Moreover, there are two major axes controlling this regularizer: the number of examples in $\mathcal{M}_1$ and the relatedness between the tasks. The former sets the strength of the regularizer. The latter, as measured by the accuracy on $T_1$ when training only on $T_2$, controls its effectiveness. When $T_1$ and $T_2$ are closely related, Fig. 4(a), training on $T_2$ prevents overfitting to $\mathcal{M}_1$ by providing a data dependent regularization that, even by itself, produces positive transfer. When $T_1$ and $T_2$ are somewhat related, Fig. 4(b), training on $T_2$ still improves generalization on $T_1$, albeit to a much lesser extent. However, when the tasks are almost adversarial to each other, as an upside-down 2 may look like a 5, the resulting regularization becomes even harmful, Fig. 4(c). In this case, training on both $\mathcal{M}_1$ and $T_2$ yields lower accuracy on $T_1$ than training on $\mathcal{M}_1$ alone.
In this work we studied ER methods for CL. Our empirical analysis on several benchmark streams of data shows that ER methods offer the best performance at a very marginal increase in computational cost. We tested several baseline approaches to populate the episodic memory and found that reservoir sampling works best when the memory is relatively large. However, when the memory is tiny, methods sacrificing randomness for balancing among classes yield better performance. Based on these observations we proposed a hybrid approach that offers the best of both worlds: the high performance of reservoir sampling at early stages of training, and the high performance of the second class of methods when more tasks arrive and the memory of each task shrinks.
Our study also sheds light on a very interesting phenomenon: memorization (zero cross-entropy loss) of tiny memories is useful for generalization, because training on subsequent tasks acts like a data dependent regularizer. To the best of our knowledge this is the first investigation of this kind, as memory replay has mostly been used in past literature for reinforcement learning settings in conjunction with very large memories (Wang et al., 2017; Isele and Cosgun, 2018). Overall, we hope the CL community will adopt experience replay methods as a baseline, given their strong empirical performance, efficiency and simplicity of implementation.
There are several avenues for future work. For instance, we would like to investigate which inputs best mitigate expected forgetting, and which strategies are optimal for removing samples from the memory once it is entirely filled up.
This work was supported by the Rhodes Trust, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1 and EPSRC/MURI grant EP/N019474/1. We would also like to acknowledge the Royal Academy of Engineering and FiveAI.
Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., and Tuytelaars, T. Memory aware synapses: Learning what (not) to forget. In ECCV, 2018.
Kirkpatrick, J., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 2016.
LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Li, Z. and Hoiem, D. Learning without forgetting. In European Conference on Computer Vision, pages 614–629, 2016.