Recent success in deep learning primarily relies on training powerful models on fixed datasets in stationary environments. However, in the non-stationary setting, where the data distribution changes over time, artificial neural networks (ANNs) fail to match the efficiency of human learning. In this setting, humans can learn incrementally from sequential experiences while leveraging and maintaining past knowledge. In contrast, standard training algorithms (Goodfellow et al., 2016) for ANNs overwrite the representations of past tasks upon exposure to a new task. This leads to rapid performance degradation on the past tasks, a phenomenon known as 'Catastrophic Forgetting' McCloskey and Cohen (1989); Ratcliff (1990). Continual Learning (CL) (Ring, 1998; Thrun and Mitchell, 1995) aims to mitigate forgetting while sequentially updating the model on a stream of tasks.
To overcome catastrophic forgetting, an active line of research in continual learning stores a few training samples from the past tasks as experiences in an episodic memory. Some of these memory-based methods Lopez-Paz and Ranzato (2017); Chaudhry et al. (2019a); Guo et al. (2020) use examples from the episodic memory to constrain the optimization process for the new task so that the loss on the past tasks does not increase. Several works have proposed variants of experience replay Chaudhry et al. (2019b, 2021); Riemer et al. (2019); Buzzega et al. (2020), where the model is jointly optimized on samples from both the episodic memory and the new task. Such methods provide simple yet effective solutions to catastrophic forgetting, especially in the online CL setting where each example is seen only once during training. However, the performance of these methods strongly depends on the size of the episodic memory. The authors in Knoblauch, Husain, and Diethe (2020) argued that for optimal performance one needs to store all the past examples in the memory. While experience replay with a large memory would yield higher performance, it would essentially mimic the joint optimization process in the independent and identically distributed (IID) data setting, which puts the effectiveness of CL algorithms into question Prabhu, Torr, and Dokania (2020). Therefore, recent works Chaudhry et al. (2019b, 2021) have explored the idea of designing effective experience replay with a tiny episodic memory. However, these methods suffer from high forgetting, mainly due to overfitting Verwimp, Lange, and Tuytelaars (2021) to the small set of memory samples, and thus show suboptimal performance.
In this paper, we propose a continual learning algorithm that trains a fixed-capacity model on an online stream of data using a small episodic memory. Our method, referred to as Experience Packing and Replay (EPR), packs the memory with a more informative summary of the past experiences, which improves the performance of memory replay by reducing overfitting. To this end, we leverage the tools developed in the field of explainable artificial intelligence (XAI) Simonyan, Vedaldi, and Zisserman (2014); Zhou et al. (2016); Adadi and Berrada (2018) that shed light on the internal reasoning process of ANNs. Among various explainability techniques, saliency methods Selvaraju et al. (2017); Zhang et al. (2018) highlight the part of the input data (image) that the model considers important for its final decision. Such analyses reveal that ANNs tend to make predictions based on localized features or objects belonging to a part of the image, whereas the rest of the image appears as background information, irrelevant for prediction. We therefore hypothesize that storing and replaying only these important parts of the images would be effective in reminding the network about the past tasks and hence would reduce forgetting.
Therefore, in EPR, after learning each task, instead of storing full images, we identify important patches from different images belonging to each class with a saliency method Selvaraju et al. (2017) and store them in the episodic memory. We introduce the Experience Packing Factor (EPF) to set the number of patches kept per class and to determine the size of these patches. With these patches, we create composite images (for each class) that have higher diversity and capture richer summaries of past data distributions without increasing the memory size. While learning a new task, we retrieve these patches from the memory, zero-pad them to match the original image dimensions, and use them for experience replay. We evaluate our algorithm in standard and directly comparable settings Chaudhry et al. (2019b, 2021) on diverse image classification datasets including CIFAR-100, miniImageNet, and CUB. We compare EPR with the state-of-the-art methods for varying memory sizes and report better accuracy with the least amount of forgetting. With detailed analyses, we show that the quality of the memory patches and their effectiveness in replay depend on the quality of feature localization obtained from the saliency method. Moreover, we show that the EPR buffer summarizes the past distributions better, which helps improve the performance of experience replay. Overall, our method provides a simple yet effective solution to catastrophic forgetting in continual learning, especially with tiny episodic memory.
2 Related Work
Methods for continual learning can be broadly divided into three categories Delange et al. (2021). Regularization-based methods penalize changes in parameters important for the past tasks to prevent forgetting. Elastic Weight Consolidation (EWC) Kirkpatrick et al. (2017) computes such importance from the Fisher information matrix, Zenke, Poole, and Ganguli (2017) compute it from the sensitivity of the loss with respect to the parameters, whereas Aljundi et al. (2018) measure it from the sensitivity of the model outputs to the inputs. Other works use knowledge distillation Li and Hoiem (2018) and the variational inference framework Nguyen et al. (2018) for model regularization in continual learning. However, such methods suffer under longer task sequences and perform poorly Chaudhry et al. (2021) in the single-epoch (online) CL setup considered in this paper.
Parameter isolation methods allocate different subsets of network parameters for each task to overcome forgetting. Some methods Rusu et al. (2016); Yoon et al. (2018) under this category expand the network to accommodate new tasks, whereas in other methods Mallya and Lazebnik (2018); Serrà et al. (2018); Saha et al. (2020) a task-specific sub-network is selected by masking out the parameters. Unlike these methods, we train our model on an online stream of tasks in a single-epoch setting without increasing the network size.
Memory based methods mitigate forgetting by either storing a subset of old examples in the episodic memory for rehearsal Robins (1995); Rebuffi et al. (2017), or storing important gradient spaces from the past tasks for constrained optimization Farajtabar et al. (2020); Saha, Garg, and Roy (2021), or synthesizing old data from generative models for pseudo-rehearsal Shin et al. (2017). Experience Replay (ER) Robins (1995); Chaudhry et al. (2019b) jointly trains the model on the samples from the new tasks and episodic memory. Several recent methods expand on this idea: Meta-Experience Replay (MER) Riemer et al. (2019) combines episodic memory with meta-learning to maximize knowledge transfer and minimize forgetting; Aljundi et al. (2019b) stores examples in the memory for rehearsal based on the gradients; Maximal Interfered Retrieval (MIR) Aljundi et al. (2019a) selects a minibatch from the episodic memory for experience replay that incurs maximum change in loss; Hindsight Anchor Learning (HAL) Chaudhry et al. (2021) improves replay by adding an objective term to minimize forgetting on the meta-learned anchor data-points; Dark Experience Replay (DER++) Buzzega et al. (2020)
improves ER by replaying network logits along with the ground-truth labels of the memory samples. Gradient Episodic Memory (GEM) Lopez-Paz and Ranzato (2017) and Averaged-GEM (A-GEM) Chaudhry et al. (2019a) use samples from the memory to compute gradient constraints so that the loss on the past tasks does not increase. Guo et al. (2020) improved such methods by proposing loss-balancing update rules in MEGA. Ebrahimi et al. (2021) store a saliency map corresponding to each episodic memory sample and complement replay with a regularization objective so that model explanations for the past tasks have minimal drift. Our method, EPR, also uses episodic memory for experience replay. However, unlike these methods, we store neither full images nor saliency maps. Rather, leveraging the network's reasoning process, we store only a part (patch) of each image in the memory and use these patches in experience replay (with zero-padding) to remind the network about its past decisions. With qualitative and quantitative analyses, we show that our method achieves better performance than the state-of-the-art (SOTA) with a small episodic memory.
3 Background and Notations
Continual Learning Protocol. We follow the online continual learning protocol introduced by Chaudhry et al. (2019a, b). In this setup, a continual learner learns from an ordered sequence of datasets, $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$, consisting of $T$ tasks, where $\mathcal{D}_t$ is the dataset of the $t$-th task. Each example in these datasets consists of a triplet defined by an input ($x \in \mathcal{X}$), an integer task descriptor ($t \in \mathbb{Z}^{+}$), and a target vector ($y \in \mathcal{Y}_t$), where $\mathcal{Y}_t$ is the set of labels specific to task $t$ and $\mathcal{Y}_t \subset \mathcal{Y}$. Following the prior works Chaudhry et al. (2019a, b, 2021), we use the first $T^{CV}$ tasks to cross-validate the hyperparameters of each of the continual learning algorithms considered in this paper. We refer to these $T^{CV}$ tasks as Cross-Validation tasks and the remaining $T - T^{CV}$ tasks, which we use for training and evaluation of the algorithms, as Training-Evaluation tasks. In this setup, the learner observes each example only once during training. While observing these examples, the goal is to learn a neural network, $f_\theta$, parameterized by $\theta$, that maps any input pair $(x, t)$ to its target output $y$ and maintains performance on all the prior tasks.
Saliency Map Generation. Saliency methods provide visual explanations for model predictions in terms of relevant features in the input. For example, for an input RGB image, $x \in \mathbb{R}^{3 \times H \times W}$, belonging to class $c$, these methods generate a saliency map, $s \in \mathbb{R}^{H \times W}$, by assigning high intensity values to the image regions that contribute to the model decision. The saliency map is generated by:

$$s = \Phi(x, c) \quad (1)$$

where $\Phi$ is a saliency method. Simonyan, Vedaldi, and Zisserman (2014) presented a saliency map generation method using a pre-trained neural network. Several follow-up works improved the quality of saliency maps Zhao et al. (2015); Wang et al. (2015) and reduced the cost of saliency computation Zhou et al. (2016); Selvaraju et al. (2017). In this work, we use Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al. (2017) as the saliency method. Grad-CAM generates class-specific saliency maps based on gradients back-propagated to the later convolutional layers, given the model prediction. We describe the steps in detail in Appendix A.
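To make the core Grad-CAM weighting step concrete, the following is a minimal numpy sketch (this is illustrative, not the paper's implementation; the activation and gradient tensors are assumed to come from a forward pass and a class-score backward pass through a trained network):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM saliency from a conv layer's forward activations and the
    gradients of the class score w.r.t. those activations, both (K, H, W)."""
    alpha = gradients.mean(axis=(1, 2))             # channel weights, shape (K,)
    cam = np.tensordot(alpha, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                      # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                       # normalize to [0, 1]
    return cam
```

In practice the map is computed at the resolution of the chosen convolutional layer and upsampled to the input size before patch selection.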
4 Experience Packing and Replay (EPR)
Experience Replay. Continual learning algorithms, especially those which learn from an online stream of tasks, achieve SOTA performance using experience replay Hayes, Cahill, and Kanan (2019); Buzzega et al. (2020); Chaudhry et al. (2021). These methods update the model while storing a few samples from the training data into a replay buffer called episodic memory, $\mathcal{M}$. When data from a new task becomes available, the model is jointly trained on both the current examples and the examples from the episodic memory (Figure 1(a)). Thus, experience replay from $\mathcal{M}$ mitigates catastrophic forgetting by reminding the network how to perform well on the prior tasks. However, the performance of these methods shows a strong dependence on the number of samples kept in the memory. Though replay with a larger $\mathcal{M}$ yields better performance, designing effective experience replay with a small episodic memory Chaudhry et al. (2019b, 2021) still remains an open research problem. This is because the model performance becomes highly sensitive to the examples stored in a smaller-sized $\mathcal{M}$. Moreover, the lack of sample diversity (per class) leads to overfitting to the memory examples, which causes loss of generalization for the past tasks, leading to catastrophic forgetting Verwimp, Lange, and Tuytelaars (2021). To overcome these issues, we propose a method to select and store only patches of images, instead of full images, from the past tasks in $\mathcal{M}$. This enables us to pack diverse experiences from an image class without any memory increase. Next, we introduce the concept of the Experience Packing Factor and describe how we select these patches. Then, we show how we use the small patches with zero-padding in experience replay.
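The replay scheme above can be sketched with a minimal per-class 'ring' buffer, a simplified stand-in for the buffers of Chaudhry et al. (2019b); the sampling policy and data representation here are illustrative:

```python
import random

class RingBuffer:
    """Minimal per-class 'ring' episodic memory: when a class bucket is full,
    the oldest example is evicted first (FIFO, i.e. ring behaviour)."""
    def __init__(self, slots_per_class):
        self.slots = slots_per_class
        self.mem = {}                        # class label -> list of examples

    def add(self, example, label):
        bucket = self.mem.setdefault(label, [])
        bucket.append(example)
        if len(bucket) > self.slots:         # overwrite oldest slot
            bucket.pop(0)

    def sample(self, batch_size):
        pool = [(x, y) for y, xs in self.mem.items() for x in xs]
        return random.sample(pool, min(batch_size, len(pool)))

def replay_batch(current_batch, memory, mem_batch_size):
    """Joint batch for experience replay: current-task examples plus a
    minibatch drawn from the episodic memory."""
    return list(current_batch) + memory.sample(mem_batch_size)
```

The model is then updated on `replay_batch(...)` at every step, so each gradient step sees both new-task and memory examples.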
Experience Packing Factor (EPF). Let us consider $n_s$ to be the number of (episodic) memory slots assigned to each class, where one memory slot can contain one full training image. For a given image, $x \in \mathbb{R}^{C \times H \times W}$, and a target patch size $h \times w$ (with $h \le H$ and $w \le W$), the Experience Packing Factor (EPF) is defined as the following ratio:

$$\mathrm{EPF} = \frac{n_s \times H \times W}{h \times w} \quad (2)$$

EPF is integer-valued, and it refers to the number of patches one can fit into the given memory slots, $n_s$, for any particular class. In our design, we consider square images ($H = W$) and patches ($h = w$) and set EPF as a hyperparameter. Thus, for a given EPF, we determine the image patch size as:

$$h = w = \left\lfloor H \sqrt{\frac{n_s}{\mathrm{EPF}}} \right\rfloor \quad (3)$$

where we take the floored integer value. Equation 3 tells us, for instance, that to pack $4$ patches (EPF $= 4$) into $1$ memory slot ($n_s = 1$), the patch width (height) should be half of the full image width (height).
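As a concrete sketch of the EPF bookkeeping in Equations 2 and 3 (function names are ours; square images and patches are assumed):

```python
import math

def patch_size(H, epf, n_s=1.0):
    """Side length of a square patch for a square H x H image (Equation 3):
    h = w = floor(H * sqrt(n_s / EPF))."""
    return math.floor(H * math.sqrt(n_s / epf))

def packing_factor(H, h, n_s=1.0):
    """Experience Packing Factor (Equation 2) for square images and patches,
    floored to an integer: EPF = (n_s * H * W) / (h * w)."""
    return math.floor(n_s * H * H / (h * h))
```

For example, on 32 x 32 CIFAR images with one memory slot per class, EPF = 4 gives 16 x 16 patches; with a fractional slot count of $n_s = 0.5$, EPF = 2 still keeps two 16 x 16 patches per class in half a slot.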
Memory Patch Selection and Storage. Explainability techniques Zhou et al. (2016); Selvaraju et al. (2017); Zhang et al. (2018) reveal that an ANN bases its decision on the class-discriminative localized features in the input data (image). Hence, we propose to store only the important part (patch) of each image and use it during replay to remind the network about its past decision. We identify these patches from the saliency maps (Section 3) of the full images. Therefore, while learning each task, we store a small set of training images in a small, fixed-sized 'ring' buffer, $\mathcal{M}_{tmp}$ Chaudhry et al. (2019b). At the end of each task, we extract the desired number of patches from these stored images using saliency maps and add them to the memory, $\mathcal{M}$. Note that images from $\mathcal{M}_{tmp}$ are only used for patch selection and not used in experience replay. Once the memory patches are selected and stored in $\mathcal{M}$, the temporary memory, $\mathcal{M}_{tmp}$, is freed up to be reused in the next task. If we assume that data from the $t$-th task is available to the model until it sees the next task (as in Ebrahimi et al. (2021)), $\mathcal{M}_{tmp}$ is not needed.
Let $f_{\theta_t}$ be the trained model after task $t$. For each example $(x, t, y)$ in $\mathcal{M}_{tmp}$, we generate the corresponding saliency map, $s$, using Equation 1. For the given $n_s$ and chosen EPF, we obtain the (square) patch size ($h \times h$) from Equation 3. Then, we average-pool the saliency map, $s$, with kernel size $h \times h$ and stride (a hyperparameter) $r$. We store the top-left coordinate $(a, b)$ of the kernel (patch) that corresponds to the maximum average-pooled value. In other words, we identify a square region (of size $h \times h$) in the saliency map that has the maximum average intensity (Figure 1(b)). We obtain the memory patch, $x_p$, from the image, $x$, by:

$$x_p = x[:, \; a:a+h, \; b:b+h] \quad (4)$$
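The pooling-and-argmax step for locating the patch can be written as a brute-force sliding window; an actual implementation may use a framework's pooling ops instead:

```python
import numpy as np

def best_patch_corner(saliency, h, stride=1):
    """Return the top-left coordinate (a, b) of the h x h window with the
    maximum average saliency (average pooling followed by argmax)."""
    H, W = saliency.shape
    best_avg, best_ab = -np.inf, (0, 0)
    for a in range(0, H - h + 1, stride):
        for b in range(0, W - h + 1, stride):
            avg = saliency[a:a + h, b:b + h].mean()
            if avg > best_avg:
                best_avg, best_ab = avg, (a, b)
    return best_ab

def extract_patch(image, a, b, h):
    """Crop the h x h memory patch from a (C, H, W) image at (a, b)."""
    return image[:, a:a + h, b:b + h]
```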
In our design, we keep a few more image samples in $\mathcal{M}_{tmp}$ per class than the number of image patches we store in $\mathcal{M}$. As we will be using these patches with zero-padding (discussed next) for replay, for storage in $\mathcal{M}$ we want to prioritize the patches that, after zero-padding, yield (or remain close to) the correct class prediction. Thus, we zero-pad each image patch and check the model prediction. At first, we populate the memory with the patches for which the model gives the correct prediction. Then we fill up the remaining slots in $\mathcal{M}$ with the patches for which the correct class is in the model's top-3 predictions. Any remaining memory slot is filled from the remaining patches irrespective of model predictions. Each selected image patch is then added to $\mathcal{M}$, with its task id, class label, and location coordinates $(a, b)$ in the original image.
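The three-tier slot-filling rule can be sketched as follows; `predict_topk` is an assumed helper (not part of the paper) that returns the model's k most likely labels for a zero-padded patch:

```python
def prioritize_patches(patches, predict_topk, true_label, n_slots):
    """Fill the memory slots in three tiers: first patches whose zero-padded
    version the model classifies correctly, then those with the true class
    in the top-3 predictions, then any remaining patches."""
    correct = [p for p in patches if predict_topk(p, 1)[0] == true_label]
    top3 = [p for p in patches
            if p not in correct and true_label in predict_topk(p, 3)]
    rest = [p for p in patches if p not in correct and p not in top3]
    return (correct + top3 + rest)[:n_slots]
```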
Replay with Memory Patches. Since the patches stored in $\mathcal{M}$ are smaller than the original images, we zero-pad these patches (Figure 1(c)) each time we use them for experience replay. While zero-padding, we place these patches in the 'exact' position of their original images using the stored coordinate values $(a, b)$. Each sample for replay will thus have the same dimensions as the samples of the current task. Throughout the paper, we use zero-padding with exact placement of the memory patches for replay unless otherwise stated. We discuss other choices for memory patch padding and placement in Section 6. The steps of our algorithm are given as pseudo-code in Appendix C.
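Zero-padding with exact placement amounts to writing the patch into an all-zero image at its stored coordinates; a numpy sketch:

```python
import numpy as np

def zero_pad_exact(patch, a, b, image_shape):
    """Place a stored (C, h, w) memory patch at its original coordinates
    (a, b) inside an all-zero image of the full (C, H, W) shape."""
    padded = np.zeros(image_shape, dtype=patch.dtype)
    _, h, w = patch.shape
    padded[:, a:a + h, b:b + w] = patch
    return padded
```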
| $n_s$ | Method | Split CIFAR ACC (%) | Split CIFAR BWT | Split miniImageNet ACC (%) | Split miniImageNet BWT | Split CUB ACC (%) | Split CUB BWT |
|---|---|---|---|---|---|---|---|
| - | Finetune* | 42.9 ± 2.07 | -0.25 ± 0.03 | 34.7 ± 2.69 | -0.26 ± 0.03 | 55.7 ± 2.22 | -0.13 ± 0.03 |
| - | EWC* | 42.4 ± 3.02 | -0.26 ± 0.02 | 37.7 ± 3.29 | -0.21 ± 0.03 | 55.0 ± 2.34 | -0.14 ± 0.02 |
| 1 | A-GEM* | 54.9 ± 2.92 | -0.14 ± 0.03 | 48.2 ± 2.49 | -0.13 ± 0.02 | 62.1 ± 1.28 | -0.09 ± 0.01 |
| 1 | MIR* | 57.1 ± 1.81 | -0.12 ± 0.01 | 49.3 ± 2.15 | -0.12 ± 0.01 | - | - |
| 1 | MER* | 49.7 ± 2.97 | -0.19 ± 0.03 | 45.5 ± 1.49 | -0.15 ± 0.01 | 55.4 ± 1.03 | -0.10 ± 0.01 |
| 1 | MEGA-I | 55.2 ± 1.21 | -0.14 ± 0.02 | 48.6 ± 1.11 | -0.10 ± 0.01 | 65.1 ± 1.30 | -0.05 ± 0.01 |
| 1 | DER++ | 54.0 ± 1.18 | -0.15 ± 0.02 | 48.3 ± 1.44 | -0.11 ± 0.01 | 66.8 ± 1.36 | -0.04 ± 0.01 |
| 1 | ER-Reservoir | 53.1 ± 2.66 | -0.19 ± 0.02 | 44.4 ± 3.22 | -0.17 ± 0.02 | 61.7 ± 0.62 | -0.09 ± 0.01 |
| 1 | ER-RING* | 56.2 ± 1.93 | -0.13 ± 0.01 | 49.0 ± 2.61 | -0.12 ± 0.02 | 65.0 ± 0.96 | -0.03 ± 0.01 |
| 1 | EPR (Ours) | 58.5 ± 1.23 | -0.10 ± 0.01 | 51.9 ± 1.57 | -0.06 ± 0.01 | 72.1 ± 0.93 | -0.02 ± 0.01 |
| 2 | HAL* | 60.4 ± 0.54 | -0.10 ± 0.01 | 51.6 ± 2.02 | -0.10 ± 0.01 | - | - |
| 2 | EPR (Ours) | 60.8 ± 0.35 | -0.09 ± 0.01 | 53.2 ± 1.45 | -0.05 ± 0.01 | 73.5 ± 1.30 | -0.01 ± 0.01 |

Table 1: (*) indicates results are reported from ER-RING. We (re)produced all the other results. Averages and standard deviations are computed over multiple runs with different random seeds. $n_s$ is the number of memory slots per class; $n_s = 1$ corresponds to memory size $M = 85$ for CIFAR and miniImageNet, and $M = 170$ for CUB.
| Method | Split CIFAR ACC (%) | Split CIFAR BWT | Split miniImageNet ACC (%) | Split miniImageNet BWT | Split CUB ACC (%) | Split CUB BWT |
|---|---|---|---|---|---|---|
| EPR (Zero-pad, exact) | 58.5 ± 1.23 | -0.10 ± 0.01 | 51.9 ± 1.57 | -0.06 ± 0.01 | 72.1 ± 0.93 | -0.02 ± 0.01 |
| EPR (Zero-pad, random) | 57.0 ± 1.21 | -0.11 ± 0.01 | 51.5 ± 1.33 | -0.06 ± 0.01 | 71.9 ± 0.88 | -0.02 ± 0.01 |
| EPR (Random-pad, exact) | 57.2 ± 1.22 | -0.11 ± 0.01 | 49.7 ± 0.73 | -0.07 ± 0.01 | 71.5 ± 0.53 | -0.02 ± 0.01 |
| Random Snip & Replay | 53.6 ± 2.76 | -0.14 ± 0.01 | 49.5 ± 1.07 | -0.08 ± 0.01 | 67.4 ± 1.04 | -0.05 ± 0.01 |

Table 2: Padding, placement, and patch-selection choices for EPR with $n_s = 1$.
5 Experimental Setup
Datasets. We evaluate our algorithm on three image classification benchmarks widely used in continual learning. Split CIFAR Lopez-Paz and Ranzato (2017) consists of splitting the original CIFAR-100 dataset Krizhevsky (2009) into 20 disjoint subsets, each of which is considered a separate task containing 5 classes. Split miniImageNet, used in (Chaudhry et al., 2019b, 2021; Ebrahimi et al., 2020), is constructed by splitting the 100 classes of miniImageNet Vinyals et al. (2016) into 20 tasks, where each task has 5 classes. Finally, Split CUB Chaudhry et al. (2019a, b); Guo et al. (2020) is constructed by splitting the 200 bird categories of the CUB dataset Welinder et al. (2010) into 20 tasks, where each task has 10 classes. The dataset statistics are given in Appendix B. We do not use any data augmentation in our experiments. All datasets have 20 tasks ($T = 20$), where the first 3 tasks ($T^{CV} = 3$) are used for hyperparameter selection and the remaining 17 tasks are used for training. We report performance on the held-out test sets of these 17 tasks.
Network Architectures. For CIFAR and miniImageNet, we use a reduced ResNet18 with three times fewer feature maps across all layers, similar to Chaudhry et al. (2021). For CUB, we use a standard ResNet18 He et al. (2016) with ImageNet pretraining Chaudhry et al. (2019a, b). Similar to Chaudhry et al. (2019a, b, 2021) and Guo et al. (2020), we train and evaluate our algorithm in the 'multi-head' setting Hsu et al. (2018), where each task has a separate classifier head selected using the task descriptor.
Performance Metrics. We evaluate classification performance using the ACC metric, the average test classification accuracy over all tasks. We report backward transfer (BWT) to measure the influence of new learning on past knowledge; negative BWT indicates forgetting. Let $R_{i,j}$ be the test accuracy on task $j$ after the model finishes learning task $i$. Formally, ACC and BWT are defined as:

$$\mathrm{ACC} = \frac{1}{T}\sum_{j=1}^{T} R_{T,j}, \qquad \mathrm{BWT} = \frac{1}{T-1}\sum_{j=1}^{T-1} \left( R_{T,j} - R_{j,j} \right)$$
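Given the full accuracy matrix, both metrics are a few lines of arithmetic; a small sketch following the standard definitions of Lopez-Paz and Ranzato (2017):

```python
import numpy as np

def acc_bwt(R):
    """ACC and BWT from an accuracy matrix R, where R[i][j] is the test
    accuracy on task j after training on task i:
      ACC = mean_j R[T-1][j]
      BWT = mean_{j < T-1} (R[T-1][j] - R[j][j])"""
    R = np.asarray(R, dtype=float)
    acc = R[-1].mean()                          # final average accuracy
    bwt = (R[-1, :-1] - np.diag(R)[:-1]).mean() # change vs. just-learned accuracy
    return acc, bwt
```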
Baselines. We compare EPR with the SOTA methods in the online CL setup. From memory-based methods, we compare with A-GEM Chaudhry et al. (2019a), MIR Aljundi et al. (2019a), MER Riemer et al. (2019), MEGA-I Guo et al. (2020), DER++ Buzzega et al. (2020), and HAL Chaudhry et al. (2021). We also compare with experience replay Chaudhry et al. (2019b) having ring (ER-RING) and reservoir (ER-Reservoir) buffers, and with EWC Kirkpatrick et al. (2017). We include two non-continual-learning baselines: Finetune and Multitask. Finetune, where a single model is trained continually without any memory or regularization, gives a performance lower bound. Multitask is an oracle baseline where a model is trained jointly on all tasks.
All models are trained using Stochastic Gradient Descent (SGD), where the batch size for both the current examples and the examples from the episodic memory is set to the same fixed value. All experiments are averaged over multiple runs with different random seeds, where each seed corresponds to a different model initialization and task ordering of the dataset. A list of hyperparameters, along with the EPFs used in these experiments, is given in Appendix D.
6 Results and Analyses
Performance Comparison. First, we compare the performance of EPR (in terms of ACC and BWT) with the baseline methods. Table 1 summarizes the results, where for a given $n_s$, the episodic memory (of either ring or reservoir type) can store up to $M$ examples. Here, $n_s$ is the number of memory slots per class, and the memory size is $M = C \times n_s$, where $C$ is the total number of classes seen during training.
Results in Table 1 show that the performance of EWC is almost identical to the 'Finetune' baseline, which indicates that such methods are ill-suited for the online CL setup. For the case when one memory slot is assigned per class ($n_s = 1$), our method (EPR) outperforms A-GEM and MEGA-I considerably on all the datasets. Moreover, compared to the other experience replay methods, such as MIR, MER, DER++, and ER, EPR achieves higher accuracy for CIFAR and miniImageNet with the least forgetting, while for CUB, EPR obtains an even larger accuracy improvement over these baselines with minimal forgetting.
Finally, we compare EPR with HAL Chaudhry et al. (2021), which holds the state-of-the-art performance in this setup. For the miniImageNet tasks, EPR (with $n_s = 1$) achieves slightly better accuracy than HAL, whereas HAL outperforms EPR on the CIFAR tasks. However, in addition to the ring buffer, HAL uses extra memory to store anchor points of the same size as the original images for each class. Thus, effectively, HAL uses two memory slots per class ($n_s = 2$). In Table 1, we therefore also compare EPR with HAL where EPR uses two memory slots per class ($n_s = 2$). Under this iso-memory condition, EPR has better accuracy and lower forgetting than HAL for both datasets.
Padding and Placement of Memory Patches. Next, we analyze the impact of different types of padding and placement of the memory patches on EPR performance. For padding, we have two choices: we can either zero-pad these patches or pad them with pixels sampled from a Gaussian distribution, which we refer to as Random-pad. Similarly, we can place these patches either in the exact position of their original image, using the stored coordinate values, or at random positions. Table 2 shows that across all datasets, exact placement works slightly better than random placement. These results indicate that the neural network remembers the past tasks better if it finds the class-discriminative features in their original positions during replay. For all the datasets, zero-padding performs better than random-padding (Table 2), which indicates that removing the background information completely serves as a better reminder of the past tasks for the network. Thus, in all our experiments we use zero-padding with exact placement. For this purpose, we store a 2D coordinate value per memory patch, which has an insignificant overhead compared to the total episodic memory.
Effectiveness of Saliency-Guided Memory Selection. A simple alternative to our saliency-guided memory patch selection is to randomly select a patch (of size $h \times w$) from the original image and use it for replay with zero-padding. We refer to this method as 'Random Snip & Replay' and compare its performance with EPR in Table 2. For CIFAR and CUB, EPR achieves 4.9% and 4.7% better accuracy, respectively, and for miniImageNet EPR achieves 2.4% better accuracy than this baseline. These results show that saliency-based memory patch selection plays a key role in enabling the high performance of EPR.
Experience Replay with Tiny Episodic Memories. Next, we study the impact of the buffer size, $M$, on the performance of experience replay methods. In Table 1, we reported the results for $n_s \in \{1, 2\}$ to provide a direct comparison to the SOTA works. Here, we analyze whether it is possible to reduce the memory size further and still have effective experience replay. This means we consider fractional values of $n_s$. In such cases, for instance, $n_s = 1/2$ means only half of the seen classes will have one example each stored in $\mathcal{M}$. Understandably, this is a challenging condition for standard experience replay, as many classes will have no representation in the memory, leading to a sharp drop in performance. However, in our method, we can set an appropriate EPF for any given $n_s$ and use Equation 3 to get the size of the memory patches. This allows us to pack representative memory patches from each class and preserve the performance of experience replay. In Figure 3(a)-(c), we show how the performance (ACC) of different memory replay methods varies with the memory size for different datasets, considering values of $n_s$ down to one quarter of those in Table 1. We also provide the results in tabular form (Table 7 in Appendix E). The 'Finetune' baselines in these figures correspond to the no-memory case, and hence serve as lower bounds on the performance. From these figures, we observe that the ACC of memory replay methods such as ER-RING and MEGA-I falls sharply and approaches the 'Finetune' baselines as we reduce the memory size. DER++, which uses both stored labels and logits during replay, performs slightly better than these methods. However, it still exhibits a large accuracy drop when the memory size is reduced. In contrast, EPR shows high resilience under extreme memory reduction: its accuracy drop remains small for all three datasets even when the memory size is reduced by a factor of 4.
Thus, among the memory replay methods for CL, EPR promises to be the best option, especially in the tiny episodic memory regimes.
EPF vs. Performance. In our design, EPF determines how many image patches we can store per class for a given $n_s$. A higher EPF selects smaller patches (Equation 3) and hence increases the sample quantity (or diversity) per class. However, a larger number of memory patches, unlike full images, does not imply better performance from experience replay. In this regard, the feature localization quality in the saliency map gives us a better picture of the quality of these patches for experience replay. Figure 2 shows the saliency maps of different classes for different datasets from our experiments. For the larger and higher-quality images of the CUB dataset, we observe that Grad-CAM localizes the object well within small regions of the images (Figure 2(c)). This suggests that the part of the image important for the network decision can be captured with a smaller patch. Thus, a higher EPF can be chosen to select a larger number of high-quality patches, which would improve the performance of experience replay. In contrast, for the smaller and lower-quality images of CIFAR, we observe that the network's decisions are distributed over a large portion of the images (Figure 2(a)). Thus, a smaller patch (from a high EPF) here may not capture enough information to be fully effective in experience replay. Figure 3(d) shows the impact of EPF on the performance of our method for different datasets. Since, for $n_s = 1$, our method with EPF $= 1$ is similar to ER-RING, here we consider the cases with EPF $> 1$. For CUB, the accuracy of EPR improves as we increase EPF up to an optimal value; beyond that point the accuracy drops, which indicates that the memory patches become too small to capture all the relevant information in the given images. For CIFAR, we obtain the best performance at a small EPF, and as we increase EPF further we observe a drop in accuracy. For miniImageNet, we obtain the optimal performance at an intermediate EPF.
These results support our observations that link the size and quality of the memory patches to the quality of object localization in saliency maps for a given dataset.
Informativeness of the Memory Buffer. The generalization capability of a model trained on the samples from the memory buffer can reveal the informativeness of the buffer. Thus, following Buzzega et al. (2020), we compare the informativeness of the EPR buffer with the buffers used in DER++ and ER-RING. For each dataset, we train the corresponding model jointly on all the buffer data from all the tasks. This training does not correspond to CL; rather, it mimics multitask learning. For the EPR buffer, we train the model with zero-padded memory patches. Figure 3(e) shows the average (multitask) accuracy on the test set. For all the datasets, models trained on the EPR buffer achieve the highest accuracy (better generalization). Thus, the proposed experience packing method enables us to capture a richer summary of the underlying data distribution (without any memory increase) compared to the other buffers. This reduces overfitting to the memory buffer, which in turn improves accuracy and reduces forgetting (Table 1, Table 7 in Appendix E).
Training Time Analysis. Finally, we compare the training time of different algorithms in Figure 3(f). We measured time on a single NVIDIA GeForce GTX 1060 GPU. Compared to standard replay (ER-RING), EPR takes only a small amount of extra training time, and compared to other recent works such as DER++ and MEGA-I, EPR trains faster. HAL did not report training time for the datasets under consideration, and hence we could not provide a comparison. Since HAL and MER both have meta-optimization steps, they are expected to require much larger training times Chaudhry et al. (2021) than ER-RING and A-GEM.
7 Conclusion
In this paper, we propose a new experience replay method with a small episodic memory for continual learning. Using saliency maps, our method identifies the parts of the input images that are important for the model's prediction. We store these patches, instead of full images, in the memory and use them with appropriate zero-padding for replay. Our method thus packs the memory with diverse experiences and hence captures the past data distribution better without a memory increase. Comparison with the SOTA methods on diverse image classification tasks shows that our method is simple, fast, and achieves better accuracy with the least amount of forgetting. We believe that this work opens up rich avenues for future research. First, a better understanding of the model's decision process and better feature localization with saliency methods would improve the quality of the memory patches and hence improve experience replay. Second, new replay techniques for the patches can be explored to further reduce memory overfitting. Finally, future studies can explore possible applications of our concept in other domains, such as reinforcement learning Rolnick et al. (2019).
This work was supported in part by the National Science Foundation, Vannevar Bush Faculty Fellowship, Army Research Office, MURI, and by Center for Brain-Inspired Computing (C-BRIC), one of six centers in JUMP, a Semiconductor Research Corporation program sponsored by DARPA.
- Adadi and Berrada (2018) Adadi, A.; and Berrada, M. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6: 52138–52160.
- Aljundi et al. (2018) Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; and Tuytelaars, T. 2018. Memory Aware Synapses: Learning what (not) to forget. In The European Conference on Computer Vision (ECCV), 139–154.
- Aljundi et al. (2019a) Aljundi, R.; Belilovsky, E.; Tuytelaars, T.; Charlin, L.; Caccia, M.; Lin, M.; and Page-Caccia, L. 2019a. Online Continual Learning with Maximal Interfered Retrieval. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Aljundi et al. (2019b) Aljundi, R.; Lin, M.; Goujaud, B.; and Bengio, Y. 2019b. Gradient based sample selection for online continual learning. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Buzzega et al. (2020) Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; and Calderara, S. 2020. Dark Experience for General Continual Learning: a Strong, Simple Baseline. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 15920–15930. Curran Associates, Inc.
- Chaudhry et al. (2021) Chaudhry, A.; Gordo, A.; Dokania, P. K.; Torr, P. H.; and Lopez-Paz, D. 2021. Using Hindsight to Anchor Past Knowledge in Continual Learning. In AAAI.
- Chaudhry et al. (2019a) Chaudhry, A.; Ranzato, M.; Rohrbach, M.; and Elhoseiny, M. 2019a. Efficient Lifelong Learning with A-GEM. In International Conference on Learning Representations.
- Chaudhry et al. (2019b) Chaudhry, A.; Rohrbach, M.; Elhoseiny, M.; Ajanthan, T.; Dokania, P. K.; Torr, P. H. S.; and Ranzato, M. 2019b. Continual Learning with Tiny Episodic Memories. ArXiv, abs/1902.10486.
- Delange et al. (2021) Delange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; and Tuytelaars, T. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
- Ebrahimi et al. (2020) Ebrahimi, S.; Meier, F.; Calandra, R.; Darrell, T.; and Rohrbach, M. 2020. Adversarial Continual Learning. In The European Conference on Computer Vision (ECCV).
- Ebrahimi et al. (2021) Ebrahimi, S.; Petryk, S.; Gokul, A.; Gan, W.; Gonzalez, J. E.; Rohrbach, M.; and trevor darrell. 2021. Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting. In International Conference on Learning Representations.
- Farajtabar et al. (2020) Farajtabar, M.; Azizan, N.; Mott, A.; and Li, A. 2020. Orthogonal Gradient Descent for Continual Learning. In Chiappa, S.; and Calandra, R., eds., Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, 3762–3773. PMLR.
- Goodfellow et al. (2016) Goodfellow, I.; Bengio, Y.; Courville, A.; and Bengio, Y. 2016. Deep learning, volume 1. MIT Press.
- Guo et al. (2020) Guo, Y.; Liu, M.; Yang, T.; and Rosing, T. 2020. Improved Schemes for Episodic Memory-based Lifelong Learning. Advances in Neural Information Processing Systems, 33.
- Hayes, Cahill, and Kanan (2019) Hayes, T. L.; Cahill, N. D.; and Kanan, C. 2019. Memory Efficient Experience Replay for Streaming Learning. In International Conference on Robotics and Automation (ICRA). IEEE.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
- Hsu et al. (2018) Hsu, Y.-C.; Liu, Y.-C.; Ramasamy, A.; and Kira, Z. 2018. Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines. In NeurIPS Continual learning Workshop.
- Kirkpatrick et al. (2017) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N. C.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114: 3521 – 3526.
- Knoblauch, Husain, and Diethe (2020) Knoblauch, J.; Husain, H.; and Diethe, T. 2020. Optimal Continual Learning has Perfect Memory and is NP-hard. In International Conference on Machine Learning (ICML).
- Krizhevsky (2009) Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report.
- Li and Hoiem (2018) Li, Z.; and Hoiem, D. 2018. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40: 2935–2947.
- Lopez-Paz and Ranzato (2017) Lopez-Paz, D.; and Ranzato, M. A. 2017. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, volume 30.
- Mahendran and Vedaldi (2016) Mahendran, A.; and Vedaldi, A. 2016. Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images. Int. J. Comput. Vision, 120(3): 233–255.
- Mallya and Lazebnik (2018) Mallya, A.; and Lazebnik, S. 2018. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7765–7773.
- Mccloskey and Cohen (1989) Mccloskey, M.; and Cohen, N. J. 1989. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. The Psychology of Learning and Motivation, 24: 104–169.
- Nguyen et al. (2018) Nguyen, C. V.; Li, Y.; Bui, T. D.; and Turner, R. E. 2018. Variational Continual Learning. In International Conference on Learning Representations.
- Prabhu, Torr, and Dokania (2020) Prabhu, A.; Torr, P.; and Dokania, P. 2020. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In The European Conference on Computer Vision (ECCV).
- Ratcliff (1990) Ratcliff, R. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97 2: 285–308.
- Rebuffi et al. (2017) Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; and Lampert, C. H. 2017. iCaRL: Incremental Classifier and Representation Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5533–5542.
- Riemer et al. (2019) Riemer, M.; Cases, I.; Ajemian, R.; Liu, M.; Rish, I.; Tu, Y.; and Tesauro, G. 2019. Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. In International Conference on Learning Representations (ICLR).
- Ring (1998) Ring, M. B. 1998. Child: A First Step Towards Continual Learning. In Learning to Learn.
- Robins (1995) Robins, A. V. 1995. Catastrophic Forgetting, Rehearsal and Pseudorehearsal. Connect. Sci., 7: 123–146.
- Rolnick et al. (2019) Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.; and Wayne, G. 2019. Experience Replay for Continual Learning. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Rusu et al. (2016) Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive Neural Networks. ArXiv, abs/1606.04671.
- Saha et al. (2020) Saha, G.; Garg, I.; Ankit, A.; and Roy, K. 2020. SPACE: Structured Compression and Sharing of Representational Space for Continual Learning. ArXiv, abs/2001.08650.
- Saha, Garg, and Roy (2021) Saha, G.; Garg, I.; and Roy, K. 2021. Gradient Projection Memory for Continual Learning. In International Conference on Learning Representations.
- Selvaraju et al. (2017) Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Serrà et al. (2018) Serrà, J.; Surís, D.; Miron, M.; and Karatzoglou, A. 2018. Overcoming Catastrophic Forgetting with Hard Attention to the Task. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 4548–4557. PMLR.
- Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual Learning with Deep Generative Replay. In Advances in Neural Information Processing Systems, volume 30.
- Simonyan, Vedaldi, and Zisserman (2014) Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations.
- Thrun and Mitchell (1995) Thrun, S.; and Mitchell, T. M. 1995. Lifelong robot learning. Robotics Auton. Syst., 15: 25–46.
- Verwimp, Lange, and Tuytelaars (2021) Verwimp, E.; Lange, M. D.; and Tuytelaars, T. 2021. Rehearsal revealed: The limits and merits of revisiting samples in continual learning. arXiv:2104.07446.
- Vinyals et al. (2016) Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems, volume 29.
- Wang et al. (2015) Wang, L.; Lu, H.; Ruan, X.; and Yang, M.-H. 2015. Deep networks for saliency detection via local estimation and global search. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3183–3192.
- Welinder et al. (2010) Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
- Yoon et al. (2018) Yoon, J.; Yang, E.; Lee, J.; and Hwang, S. J. 2018. Lifelong Learning with Dynamically Expandable Networks. In 6th International Conference on Learning Representations.
- Zenke, Poole, and Ganguli (2017) Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual Learning Through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3987–3995. PMLR.
- Zhang et al. (2018) Zhang, J.; Bargal, S. A.; Lin, Z.; Brandt, J.; Shen, X.; and Sclaroff, S. 2018. Top-Down Neural Attention by Excitation Backprop. Int. J. Comput. Vision, 126(10): 1084–1102.
- Zhao et al. (2015) Zhao, R.; Ouyang, W.; Li, H.; and Wang, X. 2015. Saliency detection by multi-context deep learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1265–1274.
- Zhou et al. (2016) Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning Deep Features for Discriminative Localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2921–2929.
Section A describes the steps of saliency map generation using Grad-CAM. Section B provides the dataset statistics used in different experiments. Pseudo-code of the EPR algorithm is given in Section C. Section D provides the list of hyperparameters used for the baseline algorithms and our method. Additional results are provided in Section E.
Appendix A Saliency Method: Grad-CAM
Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al. (2017) is a saliency method that uses gradients to determine the impact of specific feature map activations on a given prediction. Since later layers in a convolutional neural network capture high-level semantics Mahendran and Vedaldi (2016), taking gradients of a model output with respect to the feature map activations from one such layer identifies which high-level semantics are important for the model prediction. In our analysis, we select this layer and refer to it as the target layer Ebrahimi et al. (2021). The list of target layers for different experiments is given in Table 3.
Let us consider that the target layer has $K$ feature maps, where each feature map $A^k \in \mathbb{R}^{u \times v}$ is of width $u$ and height $v$. Also consider that, for a given image $I$ belonging to class $c$, the pre-softmax score of the image classifier is $y^c$. To obtain the class-discriminative saliency map, Grad-CAM first takes the derivative of $y^c$ with respect to each feature map $A^k$. These gradients are then global-average-pooled over the width and height dimensions to obtain an importance weight, $\alpha_k^c$, for each feature map $A^k$:
$$\alpha_k^c = \frac{1}{uv} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}}$$
where $(i, j)$ denotes a location in the feature map $A^k$. Next, these weights are used for computing a linear combination of the feature map activations, which is then followed by a ReLU to obtain the localization map $L^c$:
$$L^c = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k^c A^k\right)$$
This map is of the same size ($u \times v$) as $A^k$. Finally, the saliency map $S^c$ is generated by upsampling $L^c$ to the input image resolution using bilinear interpolation.
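The two equations above can be sketched in a few lines of NumPy, assuming the target-layer activations and the gradients of the class score with respect to them have already been extracted (in PyTorch this is typically done with forward and backward hooks); the function name is ours, and the final bilinear upsampling to input resolution is omitted:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM localization map L^c.

    activations: target-layer feature maps A^k, shape (K, u, v)
    gradients:   dy^c/dA^k for the class score y^c, same shape
    """
    # Importance weights alpha_k^c: global average pool of the gradients
    alphas = gradients.mean(axis=(1, 2))             # shape (K,)
    # Linear combination of feature maps weighted by alpha_k^c
    cam = np.tensordot(alphas, activations, axes=1)  # shape (u, v)
    # ReLU keeps only features with a positive influence on the class
    return np.maximum(cam, 0.0)
```

In the full pipeline, the returned map would then be upsampled to the input image resolution with bilinear interpolation to obtain the saliency map $S^c$.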
| Dataset | Network | Target Layer Name in PyTorch package |
|---|---|---|
| Split CIFAR | ResNet18 (reduced) | layer4.1.shortcut |
| Split miniImageNet | ResNet18 (reduced) | layer4.1.shortcut |
Appendix B Dataset Statistics
| | Split CIFAR | Split miniImageNet | Split CUB |
|---|---|---|---|
| num. of tasks | 20 | 20 | 20 |
| input size | | | |
| num. of classes/task | 5 | 5 | 10 |
| num. of training samples/task | 2,500 | 2,500 | 300 |
| num. of test samples/task | 500 | 500 | 290 |
Appendix C EPR Algorithm
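The two core EPR steps described in the paper, saliency-guided patch selection and zero-padded reconstruction for replay, can be illustrated with the following sketch. This is our illustrative reconstruction, not the paper's exact pseudo-code: the exhaustive stride-1 window search and the function names are assumptions.

```python
import numpy as np

def select_patch(image, saliency, p):
    """Pick the p x p window with the highest total saliency.

    image:    input, shape (C, H, W)
    saliency: Grad-CAM saliency map, shape (H, W)
    Returns the patch and its top-left corner, the items stored in memory.
    """
    H, W = saliency.shape
    best_score, best_rc = -np.inf, (0, 0)
    for r in range(H - p + 1):          # exhaustive stride-1 search (assumed)
        for c in range(W - p + 1):
            score = saliency[r:r + p, c:c + p].sum()
            if score > best_score:
                best_score, best_rc = score, (r, c)
    r, c = best_rc
    return image[:, r:r + p, c:c + p], best_rc

def reconstruct(patch, corner, full_shape):
    """Zero-pad a stored patch back to input resolution for replay."""
    canvas = np.zeros(full_shape, dtype=patch.dtype)
    r, c = corner
    p = patch.shape[-1]
    canvas[:, r:r + p, c:c + p] = patch  # patch at original location, rest zero
    return canvas
```

During replay, the zero-padded reconstructions are batched together with the current task's samples, so the memory holds many small, diverse patches in the budget that full images would consume.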
Appendix D List of Hyperparameters
- Finetune
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
- EWC
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
  - regularization: 0.1, 1, 10 (cifar, minImg, cub), 100, 1000
- A-GEM
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
- MER
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg), 0.1 (cub), 0.3, 1.0
  - within-batch meta-learning rate: 0.01, 0.03, 0.1 (cifar, minImg, cub), 0.3, 1.0
  - current batch learning rate multiplier: 1, 2, 5 (cifar, minImg, cub), 10
- MEGA-I
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
  - sensitivity parameter: , , 0.001, 0.01 (cifar, minImg, cub), 0.1
- DER++
  - learning rate: 0.003, 0.01, 0.03 (minImg, cub), 0.1 (cifar), 0.3, 1.0
  - regularization: 0.1 (minImg), 0.2 (cifar), 0.5 (cub), 1.0
  - regularization: 0.5 (cifar, minImg, cub), 1.0
- ER-Reservoir
  - learning rate: 0.003, 0.01, 0.03 (cub), 0.1 (cifar, minImg), 0.3, 1.0
- ER-RING
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
- HAL
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg), 0.1, 0.3, 1.0
  - regularization: 0.01, 0.03, 0.1, 0.3 (minImg), 1 (cifar), 3, 10
  - mean embedding strength: 0.01, 0.03, 0.1 (cifar, minImg), 0.3, 1, 3, 10
  - decay rate: 0.5 (cifar, minImg)
  - gradient steps on anchors: 100 (cifar, minImg)
- Multitask
  - learning rate: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
- EPR (ours)
  - learning rate: 0.01, 0.03 (cub), 0.05 (minImg), 0.1 (cifar), 0.3, 1.0
  - stride: 1 (cifar, minImg), 2, 3 (cub)
Appendix E Additional Results
Implementation: Our code is implemented in Python (version 3.7) with the PyTorch (version 1.5.1) package. We report results obtained by running the code on a single NVIDIA GeForce GTX 1060 GPU.
| Memory | Method | Split CIFAR ACC (%) | Split CIFAR BWT | Split miniImageNet ACC (%) | Split miniImageNet BWT | Split CUB ACC (%) | Split CUB BWT |
|---|---|---|---|---|---|---|---|
| - | Finetune | 42.9 ± 2.07 | -0.25 ± 0.03 | 34.7 ± 2.69 | -0.26 ± 0.03 | 55.7 ± 2.22 | -0.13 ± 0.03 |
| 2 | MEGA-I | 57.6 ± 0.87 | -0.12 ± 0.01 | 50.3 ± 1.14 | -0.08 ± 0.01 | 67.8 ± 1.30 | -0.04 ± 0.01 |
| 2 | DER++ | 56.3 ± 0.98 | -0.14 ± 0.01 | 50.1 ± 1.14 | -0.09 ± 0.01 | 70.7 ± 0.62 | -0.03 ± 0.01 |
| 2 | ER-RING | 58.6 ± 2.68 | -0.12 ± 0.01 | 51.2 ± 2.06 | -0.10 ± 0.01 | 68.3 ± 1.13 | -0.02 ± 0.01 |
| 2 | EPR (Ours) | 60.8 ± 0.35 | -0.09 ± 0.01 | 53.2 ± 1.45 | -0.05 ± 0.01 | 73.5 ± 1.30 | -0.01 ± 0.01 |
| 0.75 | MEGA-I | 48.9 ± 1.68 | -0.21 ± 0.01 | 43.8 ± 1.58 | -0.14 ± 0.01 | 61.5 ± 2.08 | -0.08 ± 0.01 |
| 0.75 | DER++ | 50.0 ± 1.81 | -0.19 ± 0.02 | 47.2 ± 1.54 | -0.12 ± 0.01 | 64.8 ± 1.61 | -0.06 ± 0.01 |
| 0.75 | ER-RING | 50.4 ± 0.85 | -0.21 ± 0.02 | 44.9 ± 1.49 | -0.14 ± 0.02 | 64.0 ± 1.29 | -0.05 ± 0.01 |
| 0.75 | EPR (Ours) | 56.8 ± 1.59 | -0.12 ± 0.02 | 51.1 ± 1.47 | -0.06 ± 0.01 | 70.7 ± 0.72 | -0.03 ± 0.01 |
| 0.5 | MEGA-I | 43.7 ± 1.26 | -0.26 ± 0.02 | 39.6 ± 2.35 | -0.18 ± 0.02 | 57.7 ± 0.62 | -0.11 ± 0.01 |
| 0.5 | DER++ | 47.5 ± 1.58 | -0.21 ± 0.01 | 45.6 ± 0.56 | -0.13 ± 0.01 | 62.5 ± 1.45 | -0.08 ± 0.01 |
| 0.5 | ER-RING | 44.6 ± 0.84 | -0.27 ± 0.01 | 39.1 ± 1.38 | -0.20 ± 0.02 | 59.2 ± 0.97 | -0.10 ± 0.01 |
| 0.5 | EPR (Ours) | 55.6 ± 0.54 | -0.13 ± 0.02 | 49.2 ± 1.20 | -0.07 ± 0.01 | 70.3 ± 0.91 | -0.03 ± 0.01 |