Saliency Guided Experience Packing for Replay in Continual Learning

09/10/2021
by Gobinda Saha, et al.
Purdue University

Artificial learning systems aspire to mimic human intelligence by continually learning from a stream of tasks without forgetting past knowledge. One way to enable such learning is to store past experiences in the form of input examples in an episodic memory and replay them when learning new tasks. However, the performance of such methods suffers as the size of the memory becomes smaller. In this paper, we propose a new approach for experience replay, where we select the past experiences by looking at saliency maps, which provide visual explanations for the model's decision. Guided by these saliency maps, we pack the memory with only the parts or patches of the input images that are important for the model's prediction. While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model about its past decisions. We evaluate our algorithm on diverse image classification datasets and report better performance than the state-of-the-art approaches. With qualitative and quantitative analyses we show that our method captures a richer summary of past experiences without any memory increase, and hence performs well with small episodic memory.


1 Introduction

Recent success in deep learning primarily relies on training powerful models with fixed datasets in stationary environments. However, in the non-stationary setting, where data distribution changes over time, artificial neural networks (ANNs) fail to match the efficiency of human learning. In this setup, humans can learn incrementally from sequential experiences while leveraging and maintaining past knowledge. In contrast, standard training algorithms (Goodfellow et al., 2016) for ANNs overwrite the representations from the past tasks upon exposure to a new task. This leads to rapid performance degradation on the past tasks, a phenomenon known as ‘Catastrophic Forgetting’ Mccloskey and Cohen (1989); Ratcliff (1990). Continual Learning (CL) (Ring, 1998; Thrun and Mitchell, 1995) aims to mitigate forgetting while sequentially updating the model on a stream of tasks.

To overcome catastrophic forgetting, an active line of research in continual learning stores a few training samples from the past tasks as experiences in an episodic memory. Some of these memory based methods Lopez-Paz and Ranzato (2017); Chaudhry et al. (2019a); Guo et al. (2020) use examples from the episodic memory to constrain the optimization process for the new task so that the loss of the past tasks does not increase. Several works have proposed variants of experience replay Chaudhry et al. (2019b, 2021); Riemer et al. (2019); Buzzega et al. (2020), where the model is jointly optimized on the samples from both the episodic memory and the new task. Such methods provide simple yet effective solutions to catastrophic forgetting, especially in the online CL setting where each example is seen only once during training. However, the performance of these methods strongly depends on the size of the episodic memory. The authors in Knoblauch, Husain, and Diethe (2020) argued that for optimal performance one needs to store all the past examples in the memory. While experience replay with a large memory would yield higher performance, it would essentially mimic the joint optimization process in the independent and identically distributed (IID) data setting, which puts the effectiveness of CL algorithms into question Prabhu, Torr, and Dokania (2020). Therefore, recent works Chaudhry et al. (2019b, 2021) have explored the idea of designing effective experience replay with tiny episodic memory. However, these methods suffer from high forgetting, mainly due to overfitting Verwimp, Lange, and Tuytelaars (2021) to the small set of memory samples, and thus show suboptimal performance.

In this paper, we propose a continual learning algorithm that trains a fixed-capacity model on an online stream of data using a small episodic memory. Our method, referred to as Experience Packing and Replay (EPR), packs the memory with a more informative summary of the past experiences, which improves the performance of memory replay by reducing overfitting. To this end, we leverage tools developed in the field of explainable artificial intelligence (XAI) Simonyan, Vedaldi, and Zisserman (2014); Zhou et al. (2016); Adadi and Berrada (2018) that shed light on the internal reasoning process of ANNs. Among various explainability techniques, saliency methods Selvaraju et al. (2017); Zhang et al. (2018) highlight the part of the input data (image) that the model deems important for its final decision. Such analyses reveal that ANNs tend to make predictions based on localized features or objects belonging to a part of the image, whereas the rest of the image appears as background information, irrelevant for predictions. Thus we hypothesize that storing and replaying only these important parts of the images would be effective in reminding the network about the past tasks and hence would reduce forgetting.

Therefore, in EPR, after learning each task, instead of storing full images, we identify important patches from different images belonging to each class with a saliency method Selvaraju et al. (2017) and store them in the episodic memory. We introduce the Experience Packing Factor (EPF) to set the number of patches kept per class and to determine the size of these patches. Thus, with these patches, we create composite images (for each class) that have higher diversity and capture richer summaries of past data distributions without increasing the memory size. While learning a new task, we retrieve these patches from the memory, zero-pad them to match the original image dimensions, and use them for experience replay. We evaluate our algorithm in standard and directly comparable settings Chaudhry et al. (2019b, 2021) on diverse image classification datasets including CIFAR-100, miniImageNet, and CUB. We compare EPR with the state-of-the-art methods for varying memory sizes and report better accuracy with the least amount of forgetting. With detailed analyses, we show that the quality of memory patches and their effectiveness in replay depend on the quality of feature localization obtained from the saliency method. Moreover, we show the EPR buffer summarizes the past distributions better, which helps in improving the performance of experience replay. Overall, our method provides a simple yet effective solution to catastrophic forgetting in continual learning, especially with tiny episodic memory.

2 Related Works

Methods for continual learning can be broadly divided into three categories Delange et al. (2021). Regularization based methods penalize changes in parameters important for the past tasks to prevent forgetting. Elastic Weight Consolidation (EWC) Kirkpatrick et al. (2017) computes such importance from the Fisher information matrix, Zenke, Poole, and Ganguli (2017) compute it from the sensitivity of the loss with respect to the parameters, whereas Aljundi et al. (2018) measure it from the sensitivity of the model outputs to the inputs. Other works use knowledge distillation Li and Hoiem (2018) and the variational inference framework Nguyen et al. (2018) for model regularization in continual learning. However, such methods suffer under longer task sequences and perform poorly Chaudhry et al. (2021) in the single epoch (online) CL setup considered in this paper.

Parameter isolation methods allocate different subsets of network parameters to each task to overcome forgetting. Some methods Rusu et al. (2016); Yoon et al. (2018) under this category expand the network to accommodate new tasks, whereas in other methods Mallya and Lazebnik (2018); Serrà et al. (2018); Saha et al. (2020) a task-specific sub-network is selected by masking out the parameters. Unlike these methods, we train our model on an online stream of tasks in a single epoch setting without increasing the network size.

Memory based methods mitigate forgetting by either storing a subset of old examples in the episodic memory for rehearsal Robins (1995); Rebuffi et al. (2017), or storing important gradient spaces from the past tasks for constrained optimization Farajtabar et al. (2020); Saha, Garg, and Roy (2021), or synthesizing old data from generative models for pseudo-rehearsal Shin et al. (2017). Experience Replay (ER) Robins (1995); Chaudhry et al. (2019b) jointly trains the model on the samples from the new tasks and the episodic memory. Several recent methods expand on this idea: Meta-Experience Replay (MER) Riemer et al. (2019) combines episodic memory with meta-learning to maximize knowledge transfer and minimize forgetting; Aljundi et al. (2019b) store examples in the memory for rehearsal based on the gradients; Maximal Interfered Retrieval (MIR) Aljundi et al. (2019a) selects a minibatch from the episodic memory for experience replay that incurs the maximum change in loss; Hindsight Anchor Learning (HAL) Chaudhry et al. (2021) improves replay by adding an objective term that minimizes forgetting on meta-learned anchor data-points; Dark Experience Replay (DER++) Buzzega et al. (2020) improves ER by replaying network logits along with the ground truth labels of the memory samples. Gradient Episodic Memory (GEM) Lopez-Paz and Ranzato (2017) and Averaged-GEM (A-GEM) Chaudhry et al. (2019a) use samples from the memory to compute gradient constraints so that the loss on the past tasks does not increase. Guo et al. (2020) improved such methods by proposing loss balancing update rules in MEGA. Ebrahimi et al. (2021) store a saliency map corresponding to each episodic memory sample and complement replay with a regularization objective so that model explanations for the past tasks have minimal drift. Our method, EPR, also uses episodic memory for experience replay. However, unlike these methods, we neither store full images nor store saliency maps. Rather, leveraging the network’s reasoning process, we store only a part (patch) of each image in the memory and use these patches in experience replay (with zero-padding) to remind the network about its past decisions. With qualitative and quantitative analyses we show that our method achieves better performance than the state-of-the-art (SOTA) with small episodic memory.

Figure 1: (a) Conventional experience replay method, where full images are stored in the episodic memory and the network is jointly trained on these memory data and the current task data. (b)-(c) Our experience packing and replay (EPR) method. (b) Experience packing, where only the part of the image important for the network prediction is selected using a saliency method and stored in the episodic memory with its corner coordinate, $(a, b)$. (c) Experience packing increases sample diversity per class without memory increase. Stored memory patches are zero-padded and replayed with the current examples.

3 Background and Notations

Continual Learning Protocol. We follow the online continual learning protocol introduced by Chaudhry et al. (2019). In this setup, a continual learner learns from an ordered sequence of datasets, $\{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$, consisting of $T$ tasks, where $\mathcal{D}_t$ is the dataset of the $t$-th task. Each example in these datasets consists of a triplet defined by an input ($x$), an integer task descriptor ($t$) and a target vector ($y \in \mathcal{Y}_t$), where $\mathcal{Y}_t$ is the set of labels specific to task $t$. Following the prior works Chaudhry et al. (2019a, b, 2021), we use the first $T^{CV}$ tasks to cross-validate hyperparameters of each of the continual learning algorithms considered in this paper. We refer to these $T^{CV}$ tasks as Cross-Validation tasks and to the remaining $T - T^{CV}$ tasks, which we use for training and evaluation of the algorithms, as Training-Evaluation tasks. In this setup, the learner observes each example only once during training. While observing these examples, the goal is to learn a neural network, $f_\theta$, parameterized by $\theta$, that maps any input pair $(x, t)$ to its target output $y$ and maintains performance on all the prior tasks.

Saliency Map Generation. Saliency methods provide visual explanations for the model predictions in terms of relevant features in the input. For example, for an input RGB image, $x^c$, belonging to class $c$, these methods generate a saliency map, $s^c$, by assigning high intensity values to the relevant image regions that contribute to the model decision. The saliency map is generated by:

    s^c = \Phi(x^c, f_\theta)    (1)

where $\Phi$ is a saliency method. Simonyan, Vedaldi, and Zisserman (2014) presented a saliency map generation method using a pre-trained neural network. Several works followed up that improved the quality of saliency maps Zhao et al. (2015); Wang et al. (2015) and reduced the cost of saliency computation Zhou et al. (2016); Selvaraju et al. (2017). In this work, we use Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al. (2017) as the saliency method. Grad-CAM generates class-specific saliency maps based on gradients back-propagated to the later convolutional layers, given the model prediction. We describe the steps in detail in Appendix A.

4 Experience Packing and Replay (EPR)

Experience Replay. Continual learning algorithms, especially those which learn from an online stream of tasks, achieve SOTA performance using experience replay Hayes, Cahill, and Kanan (2019); Buzzega et al. (2020); Chaudhry et al. (2021). These methods update the model while storing a few samples from the training data in a replay buffer called episodic memory, $\mathcal{M}$. When data from a new task becomes available, the model is jointly trained on both the current examples and the examples from the episodic memory (Figure 1(a)). Thus, experience replay from $\mathcal{M}$ mitigates catastrophic forgetting by reminding the network how to perform well on the prior tasks. However, the performance of these methods shows a strong dependence on the number of samples kept in the memory. Though replay with a larger $\mathcal{M}$ yields better performance, designing effective experience replay with a small episodic memory Chaudhry et al. (2019b, 2021) still remains an open research problem. This is because the model performance becomes highly sensitive to the examples stored in a smaller-sized $\mathcal{M}$. Moreover, lack of sample diversity (per class) leads to overfitting to the memory examples, which causes loss of generalization for the past tasks, leading to catastrophic forgetting Verwimp, Lange, and Tuytelaars (2021). To overcome these issues, we propose a method to select and store only patches of images, instead of full images, from the past tasks in $\mathcal{M}$. This enables us to pack diverse experiences from an image class without any memory increase. Next, we introduce the concept of the Experience Packing Factor and describe how we select these patches. Then, we show how we use the small patches with zero-padding in experience replay.
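To make the replay step concrete, below is a minimal sketch (in PyTorch, which our implementation uses) of one experience-replay update, where the model is jointly optimized on a current-task mini-batch and a mini-batch drawn from $\mathcal{M}$. The function and variable names are ours for illustration; EPR additionally zero-pads the stored patches before this step, as described later in this section.

    import torch

    def replay_update(model, optimizer, criterion, current_batch, memory_batch):
        # One experience-replay step: jointly train on current-task data
        # and examples drawn from the episodic memory.
        (x_cur, y_cur), (x_mem, y_mem) = current_batch, memory_batch
        optimizer.zero_grad()
        loss = criterion(model(x_cur), y_cur) + criterion(model(x_mem), y_mem)
        loss.backward()
        optimizer.step()
        return loss.item()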

Experience Packing Factor (EPF). Let’s consider $n_s$ to be the number of (episodic) memory slots assigned to each class. Here, one memory slot can contain one full training image. For a given image of width $w$ and height $h$, and a target patch size $w_p \times h_p$ from this image (with $w_p \leq w$ and $h_p \leq h$), the Experience Packing Factor (EPF) is defined as the following ratio:

    \mathrm{EPF} = \frac{n_s \times (w \times h)}{w_p \times h_p}    (2)

EPF is integer-valued and it refers to the number of patches one can fit into the given memory slots, $n_s$, for any particular class. In our design, we consider square images ($w = h$) and patches ($w_p = h_p = p$) and set EPF as a hyperparameter. Thus, for a given EPF we determine the image patch size as:

    p = w \sqrt{n_s / \mathrm{EPF}}    (3)

We take the floored integer value of $p$. Equation 3 tells us, for instance, that to pack 4 patches (EPF = 4) into 1 memory slot ($n_s = 1$), the patch width (height) should be half of the full image width (height).
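As a sanity check, Equations 2 and 3 reduce to a couple of lines of Python; the example values below match the EPF and patch sizes listed in Table 6 of Appendix D.

    import math

    def patch_size(w: int, n_s: float, epf: int) -> int:
        # Equation 3: p = floor(w * sqrt(n_s / EPF)) for square images/patches.
        return math.floor(w * math.sqrt(n_s / epf))

    assert patch_size(32, 1, 2) == 22    # Split CIFAR: n_s = 1, EPF = 2
    assert patch_size(84, 1, 3) == 48    # Split miniImageNet: n_s = 1, EPF = 3
    assert patch_size(224, 1, 4) == 112  # Split CUB: n_s = 1, EPF = 4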

Memory Patch Selection and Storage. Explainability techniques Zhou et al. (2016); Selvaraju et al. (2017); Zhang et al. (2018) reveal that an ANN bases its decision on class-discriminative localized features in the input data (image). Hence, we propose to store only the important part (patch) of the image and use it during replay to remind the network about its past decision. We identify these patches from the saliency maps (Section 3) of the full images. Therefore, while learning each task, we store a small set of training images in a small, fixed-sized ‘ring’ buffer, $\mathcal{M}_{tmp}$ Chaudhry et al. (2019b). At the end of each task, we extract the desired number of patches from these stored images using saliency maps and add them to the memory, $\mathcal{M}$. Note that images from $\mathcal{M}_{tmp}$ are only used for patch selection and are not used in experience replay. Once the memory patches are selected and stored in $\mathcal{M}$, the temporary memory, $\mathcal{M}_{tmp}$, is freed up to be reused in the next task. If we assume that data from the $t$-th task is available to the model until it sees the next task (as in Ebrahimi et al. (2021)), $\mathcal{M}_{tmp}$ is not needed.

Let $f_\theta$ be the trained model after task $t$. For each example in $\mathcal{M}_{tmp}$, we generate the corresponding saliency map, $s$, using Equation 1. For the given $n_s$ and chosen EPF, we obtain the (square) patch size ($p$) from Equation 3. Then, we average-pool the saliency map with kernel size $p \times p$ and stride (a hyperparameter), $z$. We store the top-left coordinate $(a, b)$ of the kernel (patch) that corresponds to the maximum average-pooled value. In other words, we identify a square region (of size $p \times p$) in the saliency map that has the maximum average intensity (Figure 1(b)). We obtain the memory patch, $x^p$, from the image, $x$, by:

    x^p = x[a : a + p,\ b : b + p]    (4)

In our design, we keep a few more image samples in $\mathcal{M}_{tmp}$ per class than the number of image patches we store in $\mathcal{M}$. As we will be using these patches with zero-padding (discussed next) for replay, for storage in $\mathcal{M}$ we want to prioritize the patches that, after zero-padding, give (or remain close to) the correct class prediction. Thus we zero-pad each image patch and check the model prediction. At first, we populate the memory with the patches for which the model gives the correct prediction. Then we fill up the remaining slots in $\mathcal{M}$ with the patches for which the correct class is in the model’s top-3 predictions. Any remaining memory slot is filled from the remaining patches irrespective of model predictions. Each selected image patch is then added to $\mathcal{M}$ with its task id, class label, and location coordinates $(a, b)$ in the original image.
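A minimal PyTorch sketch of this selection step is given below, assuming a channel-first image tensor and a non-negative 2D saliency map; the helper name is ours.

    import torch
    import torch.nn.functional as F

    def most_salient_patch(image, saliency, p, stride=1):
        # Average-pool the saliency map with a p x p kernel (Section 4) and
        # cut the patch with the maximum average intensity out of the image.
        # image: (C, H, W) tensor; saliency: (H, W) tensor with values >= 0.
        pooled = F.avg_pool2d(saliency[None, None], kernel_size=p, stride=stride)[0, 0]
        idx = torch.argmax(pooled)                   # flat index of the best window
        a, b = divmod(idx.item(), pooled.shape[1])   # window row/col in the pooled grid
        a, b = a * stride, b * stride                # top-left corner (a, b) in the image
        patch = image[:, a:a + p, b:b + p]           # Equation 4
        return patch, (a, b)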

Replay with Memory Patches. Since the patches stored in $\mathcal{M}$ are smaller in size than the original images, we zero-pad these patches (Figure 1(c)) each time we use them for experience replay. While zero-padding, we place these patches in the ‘exact’ position of their original images using the coordinate values $(a, b)$. Each replay sample,

    x' = \mathrm{ZeroPad}(x^p, (a, b)),    (5)

thus has the same dimensions as the samples of the current task. Throughout the paper, we use zero-padding with the exact placement of the memory patches for replay unless otherwise stated. We discuss other choices for memory patch padding and placement in Section 6. The steps of our algorithm are given as pseudo-code in Appendix C.
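A sketch of the padding step (Equation 5), under the same assumptions as above:

    import torch

    def zero_pad_patch(patch, corner, image_hw):
        # Place a stored p x p patch at its original location (a, b) inside
        # an all-zero canvas of the original image size (Equation 5).
        a, b = corner
        c, p = patch.shape[0], patch.shape[-1]
        canvas = torch.zeros(c, *image_hw, dtype=patch.dtype)
        canvas[:, a:a + p, b:b + p] = patch   # exact placement; rest stays zero
        return canvas

During replay, each patch retrieved from $\mathcal{M}$ is passed through this padding step before being batched together with the current-task examples.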

                          Split CIFAR                  Split miniImageNet           Split CUB
n_s   Methods             ACC (%)        BWT           ACC (%)        BWT           ACC (%)        BWT
-     Finetune*           42.9 ± 2.07    -0.25 ± 0.03  34.7 ± 2.69    -0.26 ± 0.03  55.7 ± 2.22    -0.13 ± 0.03
-     EWC*                42.4 ± 3.02    -0.26 ± 0.02  37.7 ± 3.29    -0.21 ± 0.03  55.0 ± 2.34    -0.14 ± 0.02
1     A-GEM*              54.9 ± 2.92    -0.14 ± 0.03  48.2 ± 2.49    -0.13 ± 0.02  62.1 ± 1.28    -0.09 ± 0.01
1     MIR*                57.1 ± 1.81    -0.12 ± 0.01  49.3 ± 2.15    -0.12 ± 0.01  -              -
1     MER*                49.7 ± 2.97    -0.19 ± 0.03  45.5 ± 1.49    -0.15 ± 0.01  55.4 ± 1.03    -0.10 ± 0.01
1     MEGA-I              55.2 ± 1.21    -0.14 ± 0.02  48.6 ± 1.11    -0.10 ± 0.01  65.1 ± 1.30    -0.05 ± 0.01
1     DER++               54.0 ± 1.18    -0.15 ± 0.02  48.3 ± 1.44    -0.11 ± 0.01  66.8 ± 1.36    -0.04 ± 0.01
1     ER-Reservoir        53.1 ± 2.66    -0.19 ± 0.02  44.4 ± 3.22    -0.17 ± 0.02  61.7 ± 0.62    -0.09 ± 0.01
1     ER-RING*            56.2 ± 1.93    -0.13 ± 0.01  49.0 ± 2.61    -0.12 ± 0.02  65.0 ± 0.96    -0.03 ± 0.01
1     EPR (Ours)          58.5 ± 1.23    -0.10 ± 0.01  51.9 ± 1.57    -0.06 ± 0.01  72.1 ± 0.93    -0.02 ± 0.01
2     HAL*                60.4 ± 0.54    -0.10 ± 0.01  51.6 ± 2.02    -0.10 ± 0.01  -              -
2     EPR (Ours)          60.8 ± 0.35    -0.09 ± 0.01  53.2 ± 1.45    -0.05 ± 0.01  73.5 ± 1.30    -0.01 ± 0.01
-     MultiTask*          68.3           -             63.5           -             65.6           -
Table 1: Performance comparison of different CL methods. (*) indicates results for CIFAR and miniImageNet are reported from HAL Chaudhry et al. (2021) and results for CUB are reported from ER-RING Chaudhry et al. (2019b). (†) indicates results are reported from ER-RING. We (re)produced all the other results. Averages and standard deviations are computed over multiple runs with different random seeds. The number of memory slots per class, n_s = 1, corresponds to memory size of 85 for CIFAR and miniImageNet, and 170 for CUB.
                           Split CIFAR                  Split miniImageNet           Split CUB
Methods                    ACC (%)        BWT           ACC (%)        BWT           ACC (%)        BWT
EPR (Zero-pad, exact)      58.5 ± 1.23    -0.10 ± 0.01  51.9 ± 1.57    -0.06 ± 0.01  72.1 ± 0.93    -0.02 ± 0.01
EPR (Zero-pad, random)     57.0 ± 1.21    -0.11 ± 0.01  51.5 ± 1.33    -0.06 ± 0.01  71.9 ± 0.88    -0.02 ± 0.01
EPR (Random-pad, exact)    57.2 ± 1.22    -0.11 ± 0.01  49.7 ± 0.73    -0.07 ± 0.01  71.5 ± 0.53    -0.02 ± 0.01
Random Snip & Replay       53.6 ± 2.76    -0.14 ± 0.01  49.5 ± 1.07    -0.08 ± 0.01  67.4 ± 1.04    -0.05 ± 0.01
Table 2: Impact of padding, placement, and selection method of memory patches on EPR performance (for n_s = 1).

5 Experimental Setup

Datasets. We evaluate our algorithm on three image classification benchmarks widely used in continual learning. Split CIFAR Lopez-Paz and Ranzato (2017) splits the original CIFAR-100 dataset Krizhevsky (2009) into 20 disjoint subsets, each of which is considered a separate task containing 5 classes. Split miniImageNet, used in (Chaudhry et al., 2019b, 2021; Ebrahimi et al., 2020), is constructed by splitting the 100 classes of miniImageNet Vinyals et al. (2016) into 20 tasks where each task has 5 classes. Finally, Split CUB Chaudhry et al. (2019a, b); Guo et al. (2020) is constructed by splitting the 200 bird categories of the CUB dataset Welinder et al. (2010) into 20 tasks where each task has 10 classes. The dataset statistics are given in Appendix B. We do not use any data augmentation in our experiments. All datasets have 20 tasks ($T = 20$), where the first 3 tasks ($T^{CV} = 3$) are used for hyperparameter selection while the remaining tasks are used for training. We report performance on the held-out test sets of these 17 tasks.

Network Architectures. For CIFAR and miniImageNet, we use a reduced ResNet18 with three times fewer feature maps across all layers, similar to Chaudhry et al. (2021). For CUB, we use a standard ResNet18 He et al. (2016) with ImageNet pretraining Chaudhry et al. (2019a, b). Similar to Chaudhry et al. (2019a, b, 2021) and Guo et al. (2020), we train and evaluate our algorithm in the ‘multi-head’ setting Hsu et al. (2018), where a task id is used to select a task-specific classifier head. All the networks use ReLU in the hidden units and softmax with cross-entropy loss in the final layer.

Performance Metrics. We evaluate the classification performance using the ACC metric, which is the average test classification accuracy over all tasks. We report backward transfer, BWT, to measure the influence of new learning on the past knowledge; negative BWT indicates forgetting. Formally, ACC and BWT are defined as:

    \mathrm{ACC} = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}, \qquad \mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} (a_{T,i} - a_{i,i})    (6)

Here, $T$ is the total number of sequential tasks and $a_{j,i}$ is the accuracy of the model on task $i$ after sequentially learning task $j$ Lopez-Paz and Ranzato (2017); Chaudhry et al. (2021).
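Given the usual matrix of task accuracies recorded after each training stage, both metrics take a few lines of NumPy (a sketch with our naming; R[j, i] holds the test accuracy on task i after sequentially learning up to task j, 0-indexed):

    import numpy as np

    def acc_bwt(R: np.ndarray):
        # Equation 6: ACC is the mean final accuracy over all T tasks; BWT is
        # the mean change between final and just-learned accuracy per task.
        acc = R[-1].mean()
        bwt = (R[-1, :-1] - np.diag(R)[:-1]).mean()
        return acc, bwt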

Baselines. We compare EPR with the SOTA methods in the online CL setup. From memory based methods, we compare with A-GEM Chaudhry et al. (2019a), MIR Aljundi et al. (2019a), MER Riemer et al. (2019), MEGA-I Guo et al. (2020), DER++ Buzzega et al. (2020), and HAL Chaudhry et al. (2021). We also compare with experience replay Chaudhry et al. (2019b) having ring (ER-RING) and reservoir (ER-Reservoir) buffers, and with EWC Kirkpatrick et al. (2017). We include two non-continual learning baselines: Finetune and Multitask. Finetune, where a single model is trained continually without any memory or regularization, gives a performance lower bound. Multitask is an oracle baseline where a model is trained jointly on all tasks.

Training Details. All the models are trained using Stochastic Gradient Descent (SGD), where the same batch size is used for the current examples and the examples from the episodic memory. All experiments are averaged over multiple runs using different random seeds, where each seed corresponds to a different model initialization and dataset ordering among tasks. A list of hyperparameters, along with the EPFs used in these experiments, is given in Appendix D.

6 Results and Analyses

Performance Comparison. First, we compare the performance of EPR (in terms of ACC and BWT) with the baseline methods. Table 1 summarizes the results, where for a given $n_s$, the episodic memory (of either ring or reservoir type) can store up to $|\mathcal{M}|$ examples. Here, $n_s$ is the number of memory slots per class and the memory size, $|\mathcal{M}|$, is:

    |\mathcal{M}| = n_s \times C_{total}    (7)

where $C_{total}$ is the total number of classes across the Training-Evaluation tasks (85 for CIFAR and miniImageNet, 170 for CUB).

Results in Table 1 show that the performance of EWC is almost identical to the ‘Finetune’ baseline. This indicates that such methods are ill-suited for the online CL setup. For the case when one memory slot is assigned per class ($n_s = 1$), our method (EPR) outperforms A-GEM and MEGA-I considerably for all the datasets. Moreover, compared to the other experience replay methods, such as MIR, MER, DER++, and ER, EPR achieves about 1.4% and 2.6% accuracy improvement (over the best of these baselines) for CIFAR and miniImageNet, respectively, with the least forgetting, whereas for CUB, EPR obtains about 5.3% accuracy improvement over these baselines with only 2% forgetting.

Finally, we compare EPR with HAL Chaudhry et al. (2021), which holds the state-of-the-art performance in this setup. For the miniImageNet tasks, EPR (with $n_s = 1$) achieves slightly better accuracy than HAL, whereas HAL outperforms EPR on the CIFAR tasks. However, in addition to the ring buffer, HAL uses extra memory to store anchor points of the same size as the original images for each class. Thus, effectively, HAL uses two memory slots per class ($n_s = 2$). In Table 1, we therefore also compare EPR with HAL where EPR uses two memory slots per class ($n_s = 2$). Under this iso-memory condition, EPR has better accuracy and lower forgetting than HAL for both datasets.

Figure 2: Saliency maps of images from (a) Split CIFAR, (b) Split miniImageNet, and (c) Split CUB datasets. CIFAR images have the lowest resolution whereas CUB images have the highest. With increasing image resolution we observe better object localization with Grad-CAM.

Padding and Placement of Memory Patches. Next, we analyze the impact of different types of padding and placement of the memory patches on the EPR performance. For padding we have two choices: we can either zero-pad these patches, or we can pad them with pixels sampled from a standard Gaussian distribution, which we refer to as Random-pad. Similarly, we can place these patches either in the exact position of their original image using the stored coordinate values or at random positions. Table 2 shows that across all datasets, exact placement works slightly better than random placement. These results indicate that the neural network remembers the past tasks better if it finds the class-discriminative features in their original positions during replay. For all the datasets, zero-padding performs better than random-padding (Table 2), which indicates that removing the background information completely serves as a better reminder of past tasks for the network. Thus, in all our experiments we use zero-padding with exact placement. For this purpose, we store a 2D coordinate value per memory patch, which is an insignificant overhead compared to the total episodic memory.

Effectiveness of Saliency Guided Memory Selection. A simple alternative to our saliency guided memory patch selection is to randomly select a patch (of size $p \times p$) from the original image and use it for replay with zero-padding. We refer to this method as ‘Random Snip & Replay’ and compare its performance with EPR in Table 2. For CIFAR and CUB, EPR achieves about 4.9% and 4.7% better accuracy than this baseline, respectively, and for miniImageNet EPR achieves about 2.4% better accuracy. These results show that saliency based memory patch selection plays a key role in enabling the high performance of EPR.

Experience Replay with Tiny Episodic Memories. Next, we study the impact of the buffer size, $|\mathcal{M}|$, on the performance of experience replay methods. In Table 1, we reported the results for $n_s = 1$ and $n_s = 2$ to provide a direct comparison with the SOTA works. Here, we analyze whether it is possible to reduce the memory size further and still have effective experience replay. This means we consider fractional values of $n_s$. In such cases, for instance, $n_s = 0.5$ means only half of the seen classes will have one example each stored in $\mathcal{M}$. Understandably, this is a challenging condition for standard experience replay, as many classes will not have any representation in the memory, leading to a sharp drop in performance. However, in our method, we can set an appropriate EPF for any given $n_s$ and use Equation 3 to get the size of the memory patches. This allows us to pack representative memory patches from each class and preserve the performance of experience replay. In Figure 3(a)-(c) we show how the performance (ACC) of different memory replay methods varies with the memory size for different datasets. Here we consider $n_s \in \{0.5, 0.75, 1, 2\}$, which corresponds to memory sizes ranging from half to twice those in Table 1 for each dataset. We also provide the results in tabular form (Table 7 in Appendix E). The ‘Finetune’ baselines in these figures correspond to the $|\mathcal{M}| = 0$ case, and hence serve as lower bounds on the performance. From these figures, we observe that the ACC of memory replay methods such as ER-RING and MEGA-I falls sharply and approaches the ‘Finetune’ baselines as we reduce the memory size. DER++, which uses both stored labels and logits during replay, performs slightly better than these methods. However, it still exhibits a high accuracy drop (up to about 9%) when the memory size is reduced. In contrast, EPR shows high resilience under extreme memory reduction. For example, the accuracy drop is only about 5.2% for CIFAR, 4.0% for miniImageNet, and 3.2% for CUB when the memory size is reduced by a factor of 4. Thus, among the memory replay methods for CL, EPR promises to be the best option, especially in the tiny episodic memory regime.

Figure 3: Comparison of ACC for varying memory sizes for (a) Split CIFAR, (b) Split miniImageNet, and (c) Split CUB datasets. (d) ACC for different Experience Packing Factors for different datasets in EPR (for $n_s = 1$). (e) Joint training accuracies on episodic memory data compare buffer informativeness (for $n_s = 1$). (f) Total wall-clock training time for learning all the tasks.

EPF vs. Performance. In our design, EPF determines how many image patches we can store per class for a given $n_s$. A higher EPF selects smaller patches (Equation 3), and hence increases the sample quantity (or diversity) per class. However, a large number of memory patches, unlike full images, does not imply better performance from experience replay. In this regard, the feature localization quality in the saliency map gives us a better picture of the quality of these patches for experience replay. Figure 2 shows the saliency maps of different classes for different datasets from our experiments. For the larger and better quality images of the CUB dataset, we observe that Grad-CAM localizes the object better within small regions of the images (Figure 2(c)). This gives us an impression that the part of the image important for the network decision can be captured with a smaller patch. Thus a higher EPF can be chosen to select a larger number of high quality patches, which improves the performance of experience replay. In contrast, for the smaller and lower quality images of CIFAR, we observe that the network’s decisions are distributed over a large portion of the images (Figure 2(a)). Thus, a smaller patch (for high EPF) here may not capture enough information to be fully effective in experience replay. Figure 3(d) shows the impact of EPF on the performance of our method for different datasets. Since, for $n_s = 1$, our method with EPF = 1 is similar to ER-RING, here we consider the cases with EPF > 1. For CUB, the accuracy of EPR improves as we increase EPF from 2 to 4. Beyond that point the accuracy drops, which indicates that the memory patches become too small to capture all the relevant information in the given images. For CIFAR, we obtain the best performance for EPF = 2, and as we increase EPF further we observe a drop in accuracy. For miniImageNet, we obtain the optimal performance for EPF = 3. These results support our observations that link the size and quality of the memory patches to the quality of object localization in the saliency maps for a given dataset.

Informativeness of Memory Buffer. The generalization capability of a model trained on the samples from the memory buffer can reveal the informativeness of the buffer. Thus, following Buzzega et al. (2020), we compare the informativeness of the EPR buffer with the buffers used in DER++ and ER-RING. For each dataset, we train the corresponding model jointly on all the buffer data from all the tasks. This training does not correspond to CL; rather, it mimics multitask learning. For the EPR buffer, we train the model with the zero-padded memory patches. Figure 3(e) shows the average (multitask) accuracy on the test set. For all the datasets, models trained on the EPR buffer achieve the highest accuracy (better generalization). Thus, the proposed experience packing method enables us to capture a richer summary of the underlying data distribution (without any memory increase) compared to the other buffers. This reduces overfitting to the memory buffer, which in turn improves accuracy and reduces forgetting (Table 1, Table 7 in Appendix E).

Training Time Analysis. Finally, we compare the training time of different algorithms in Figure 3(f). We measured time on a single NVIDIA GeForce GTX 1060 GPU. Compared to standard replay (ER-RING), EPR takes only a small amount of extra training time. Compared to other recent works such as DER++ and MEGA-I, EPR trains faster. HAL did not report training times for the datasets under consideration, and hence we could not provide a comparison. Since HAL and MER both have meta-optimization steps, they are expected to require much larger training times Chaudhry et al. (2021) than ER-RING and A-GEM.

7 Conclusions

In this paper, we propose a new experience replay method with small episodic memory for continual learning. Using saliency maps, our method identifies the parts of the input images that are important for the model’s prediction. We store these patches, instead of full images, in the memory and use them with appropriate zero-padding for replay. Our method thus packs the memory with diverse experiences, and hence captures the past data distribution better without memory increase. Comparison with the SOTA methods on diverse image classification tasks shows that our method is simple, fast, and achieves better accuracy with the least amount of forgetting. We believe that this work opens up rich avenues for future research. First, a better understanding of the model’s decision process and better feature localization with saliency methods would improve the quality of the memory patches and hence improve experience replay. Second, new replay techniques for the patches can be explored to further reduce memory overfitting. Finally, future studies can explore possible applications of our concept in other domains, such as reinforcement learning Rolnick et al. (2019).

Acknowledgements

This work was supported in part by the National Science Foundation, Vannevar Bush Faculty Fellowship, Army Research Office, MURI, and by Center for Brain-Inspired Computing (C-BRIC), one of six centers in JUMP, a Semiconductor Research Corporation program sponsored by DARPA.

References

  • Adadi and Berrada (2018) Adadi, A.; and Berrada, M. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6: 52138–52160.
  • Aljundi et al. (2018) Aljundi, R.; Babiloni, F.; Elhoseiny, M.; Rohrbach, M.; and Tuytelaars, T. 2018. Memory Aware Synapses: Learning what (not) to forget. In The European Conference on Computer Vision (ECCV), 139–154.
  • Aljundi et al. (2019a) Aljundi, R.; Belilovsky, E.; Tuytelaars, T.; Charlin, L.; Caccia, M.; Lin, M.; and Page-Caccia, L. 2019a. Online Continual Learning with Maximal Interfered Retrieval. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Aljundi et al. (2019b) Aljundi, R.; Lin, M.; Goujaud, B.; and Bengio, Y. 2019b. Gradient based sample selection for online continual learning. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Buzzega et al. (2020) Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; and Calderara, S. 2020. Dark Experience for General Continual Learning: a Strong, Simple Baseline. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; and Lin, H., eds., Advances in Neural Information Processing Systems, volume 33, 15920–15930. Curran Associates, Inc.
  • Chaudhry et al. (2021) Chaudhry, A.; Gordo, A.; Dokania, P. K.; Torr, P. H.; and Lopez-Paz, D. 2021. Using Hindsight to Anchor Past Knowledge in Continual Learning. In AAAI.
  • Chaudhry et al. (2019a) Chaudhry, A.; Ranzato, M.; Rohrbach, M.; and Elhoseiny, M. 2019a. Efficient Lifelong Learning with A-GEM. In International Conference on Learning Representations.
  • Chaudhry et al. (2019b) Chaudhry, A.; Rohrbach, M.; Elhoseiny, M.; Ajanthan, T.; Dokania, P. K.; Torr, P. H. S.; and Ranzato, M. 2019b. Continual Learning with Tiny Episodic Memories. ArXiv, abs/1902.10486.
  • Delange et al. (2021) Delange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; and Tuytelaars, T. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
  • Ebrahimi et al. (2020) Ebrahimi, S.; Meier, F.; Calandra, R.; Darrell, T.; and Rohrbach, M. 2020. Adversarial Continual Learning. In The European Conference on Computer Vision (ECCV).
  • Ebrahimi et al. (2021) Ebrahimi, S.; Petryk, S.; Gokul, A.; Gan, W.; Gonzalez, J. E.; Rohrbach, M.; and trevor darrell. 2021. Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting. In International Conference on Learning Representations.
  • Farajtabar et al. (2020) Farajtabar, M.; Azizan, N.; Mott, A.; and Li, A. 2020. Orthogonal Gradient Descent for Continual Learning. In Chiappa, S.; and Calandra, R., eds., Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, 3762–3773. PMLR.
  • Goodfellow et al. (2016) Goodfellow, I.; Bengio, Y.; Courville, A.; and Bengio, Y. 2016. Deep learning, volume 1. MIT Press.
  • Guo et al. (2020) Guo, Y.; Liu, M.; Yang, T.; and Rosing, T. 2020. Improved Schemes for Episodic Memory-based Lifelong Learning. Advances in Neural Information Processing Systems, 33.
  • Hayes, Cahill, and Kanan (2019) Hayes, T. L.; Cahill, N. D.; and Kanan, C. 2019. Memory Efficient Experience Replay for Streaming Learning. In International Conference on Robotics and Automation (ICRA). IEEE.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
  • Hsu et al. (2018) Hsu, Y.-C.; Liu, Y.-C.; Ramasamy, A.; and Kira, Z. 2018. Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines. In NeurIPS Continual learning Workshop.
  • Kirkpatrick et al. (2017) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N. C.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114: 3521 – 3526.
  • Knoblauch, Husain, and Diethe (2020) Knoblauch, J.; Husain, H.; and Diethe, T. 2020. Optimal Continual Learning has Perfect Memory and is NP-hard. In International Conference on Machine Learning (ICML).
  • Krizhevsky (2009) Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report.
  • Li and Hoiem (2018) Li, Z.; and Hoiem, D. 2018. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40: 2935–2947.
  • Lopez-Paz and Ranzato (2017) Lopez-Paz, D.; and Ranzato, M. A. 2017. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, volume 30.
  • Mahendran and Vedaldi (2016) Mahendran, A.; and Vedaldi, A. 2016. Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images. Int. J. Comput. Vision, 120(3): 233–255.
  • Mallya and Lazebnik (2018) Mallya, A.; and Lazebnik, S. 2018. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7765–7773.
  • Mccloskey and Cohen (1989) Mccloskey, M.; and Cohen, N. J. 1989. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. The Psychology of Learning and Motivation, 24: 104–169.
  • Nguyen et al. (2018) Nguyen, C. V.; Li, Y.; Bui, T. D.; and Turner, R. E. 2018. Variational Continual Learning. In International Conference on Learning Representations.
  • Prabhu, Torr, and Dokania (2020) Prabhu, A.; Torr, P.; and Dokania, P. 2020. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In The European Conference on Computer Vision (ECCV).
  • Ratcliff (1990) Ratcliff, R. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97 2: 285–308.
  • Rebuffi et al. (2017) Rebuffi, S.-A.; Kolesnikov, A.; Sperl, G.; and Lampert, C. H. 2017. iCaRL: Incremental Classifier and Representation Learning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5533–5542.
  • Riemer et al. (2019) Riemer, M.; Cases, I.; Ajemian, R.; Liu, M.; Rish, I.; Tu, Y.; and Tesauro, G. 2019. Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. In International Conference on Learning Representations (ICLR).
  • Ring (1998) Ring, M. B. 1998. Child: A First Step Towards Continual Learning. In Learning to Learn.
  • Robins (1995) Robins, A. V. 1995. Catastrophic Forgetting, Rehearsal and Pseudorehearsal. Connect. Sci., 7: 123–146.
  • Rolnick et al. (2019) Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.; and Wayne, G. 2019. Experience Replay for Continual Learning. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Rusu et al. (2016) Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive Neural Networks. ArXiv, abs/1606.04671.
  • Saha et al. (2020) Saha, G.; Garg, I.; Ankit, A.; and Roy, K. 2020. SPACE: Structured Compression and Sharing of Representational Space for Continual Learning. ArXiv, abs/2001.08650.
  • Saha, Garg, and Roy (2021) Saha, G.; Garg, I.; and Roy, K. 2021. Gradient Projection Memory for Continual Learning. In International Conference on Learning Representations.
  • Selvaraju et al. (2017) Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Serrà et al. (2018) Serrà, J.; Surís, D.; Miron, M.; and Karatzoglou, A. 2018. Overcoming Catastrophic Forgetting with Hard Attention to the Task. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 4548–4557. PMLR.
  • Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual Learning with Deep Generative Replay. In Advances in Neural Information Processing Systems, volume 30.
  • Simonyan, Vedaldi, and Zisserman (2014) Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations.
  • Thrun and Mitchell (1995) Thrun, S.; and Mitchell, T. M. 1995. Lifelong robot learning. Robotics Auton. Syst., 15: 25–46.
  • Verwimp, Lange, and Tuytelaars (2021) Verwimp, E.; Lange, M. D.; and Tuytelaars, T. 2021. Rehearsal revealed: The limits and merits of revisiting samples in continual learning. arXiv:2104.07446.
  • Vinyals et al. (2016) Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. In Advances in Neural Information Processing Systems, volume 29.
  • Wang et al. (2015) Wang, L.; Lu, H.; Ruan, X.; and Yang, M.-H. 2015. Deep networks for saliency detection via local estimation and global search. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3183–3192.
  • Welinder et al. (2010) Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
  • Yoon et al. (2018) Yoon, J.; Yang, E.; Lee, J.; and Hwang, S. J. 2018. Lifelong Learning with Dynamically Expandable Networks. In 6th International Conference on Learning Representations.
  • Zenke, Poole, and Ganguli (2017) Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual Learning Through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 3987–3995. PMLR.
  • Zhang et al. (2018) Zhang, J.; Bargal, S. A.; Lin, Z.; Brandt, J.; Shen, X.; and Sclaroff, S. 2018. Top-Down Neural Attention by Excitation Backprop. Int. J. Comput. Vision, 126(10): 1084–1102.
  • Zhao et al. (2015) Zhao, R.; Ouyang, W.; Li, H.; and Wang, X. 2015. Saliency detection by multi-context deep learning. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1265–1274.
  • Zhou et al. (2016) Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning Deep Features for Discriminative Localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2921–2929.

Appendix

Section A describes the steps of saliency map generation using Grad-CAM. Section B provides the dataset statistics used in different experiments. Pseudo-code of the EPR algorithm is given in Section C. Section D provides the list of hyperparameters used for the baseline algorithms and our method. Additional results are provided in Section E.

Appendix A Saliency Method: Grad-CAM

Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al. (2017) is a saliency method that uses gradients to determine the impact of specific feature map activations on a given prediction. Since later layers in a convolutional neural network capture high-level semantics Mahendran and Vedaldi (2016), taking gradients of a model output with respect to the feature map activations of one such layer identifies which high-level semantics are important for the model prediction. In our analysis, we select this layer and refer to it as the target layer Ebrahimi et al. (2021). The list of target layers for the different experiments is given in Table 3.

Let’s consider that the target layer has $K$ feature maps, where each feature map, $A^k$, is of width $u$ and height $v$. Also consider, for a given image ($x^c$) belonging to class $c$, that the pre-softmax score of the image classifier is $y^c$. To obtain the class-discriminative saliency map, Grad-CAM first takes the derivative of $y^c$ with respect to each feature map $A^k$. These gradients are then global-average-pooled over $u$ and $v$ to obtain an importance weight, $\alpha_k^c$, for each feature map $A^k$:

    \alpha_k^c = \frac{1}{u \times v} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}}    (8)

where $(i, j)$ denotes a location in the feature map $A^k$. Next, these weights are used to compute a linear combination of the feature map activations, which is then followed by a ReLU to obtain the localization map, $L^c$:

    L^c = \mathrm{ReLU}\left( \sum_{k} \alpha_k^c A^k \right)    (9)

This map is of the same size ($u \times v$) as $A^k$. Finally, the saliency map, $s^c$, is generated by upsampling $L^c$ to the input image resolution using bilinear interpolation:

    s^c = \mathrm{upsample}(L^c)    (10)
Dataset              Network               Target Layer Name in PyTorch package
Split CIFAR          ResNet18 (reduced)    layer4.1.shortcut
Split miniImageNet   ResNet18 (reduced)    layer4.1.shortcut
Split CUB            ResNet18              net.layer4.1.conv2
Table 3: Target layer names used in Grad-CAM for saliency generation with different network architectures and datasets.
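A compact PyTorch sketch of these three steps is shown below, using forward and backward hooks on one of the target layers from Table 3. This is a hand-rolled illustration rather than our released code; register_full_backward_hook requires a recent PyTorch version, and a multi-head model may additionally need the task id in its forward call.

    import torch
    import torch.nn.functional as F

    def grad_cam(model, x, class_idx, target_layer):
        # Grad-CAM (Equations 8-10) for a single image x of shape (1, 3, H, W).
        acts, grads = {}, {}
        h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
        h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
        score = model(x)[0, class_idx]            # pre-softmax class score y^c
        model.zero_grad()
        score.backward()                          # gradients of y^c w.r.t. A^k via the hooks
        h1.remove(); h2.remove()
        alpha = grads['g'].mean(dim=(2, 3), keepdim=True)           # Equation 8
        cam = F.relu((alpha * acts['a']).sum(dim=1, keepdim=True))  # Equation 9
        cam = F.interpolate(cam, size=x.shape[-2:],
                            mode='bilinear', align_corners=False)   # Equation 10
        return cam[0, 0]                          # (H, W) saliency map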

Appendix B Dataset Statistics

                                Split CIFAR    Split miniImageNet   Split CUB
num. of tasks                   20             20                   20
input size                      3 × 32 × 32    3 × 84 × 84          3 × 224 × 224
num. of classes/task            5              5                    10
num. of training samples/task   2,500          2,500                300
num. of test samples/task       500            500                  290
Table 4: Dataset statistics.

Appendix C EPR Algorithm

procedure TRAIN($\mathcal{D}$, $n_s$, EPF, $w$, $\alpha$, $b$)
    $T$: number of tasks; $n_s$: no. of memory slots per class; EPF: no. of patches per class; $w$: training image width (height)
    $\alpha$: learning rate; training dataset: $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_T\}$; network model: $f_\theta$ with parameters $\theta$; $b$: mini-batch size
    $\mathcal{M}_{tmp} \leftarrow \{\}$    ▷ temporary ‘ring’ buffer
    $\mathcal{M} \leftarrow \{\}$    ▷ episodic memory
    $p \leftarrow \lfloor w \sqrt{n_s / \mathrm{EPF}} \rfloor$    ▷ get patch size from Equation 3
    for $t = 1, \ldots, T$ do
        for mini-batch $B_t \sim \mathcal{D}_t$ do    ▷ sample examples from current dataset
            $B_{\mathcal{M}} \sim \mathcal{M}$    ▷ sample patches from episodic memory
            $B'_{\mathcal{M}} \leftarrow \mathrm{ZeroPad}(B_{\mathcal{M}})$    ▷ zero-pad using Equation 5
            $\theta \leftarrow \mathrm{SGD}(B_t \cup B'_{\mathcal{M}}, \theta, \alpha)$    ▷ update model with experience replay
            $\mathcal{M}_{tmp} \leftarrow \mathcal{M}_{tmp} \cup B_t$    ▷ add training samples from current task to the ring buffer
        end for
        $\mathcal{M} \leftarrow$ UPDATEMEMORY($\mathcal{M}$, $\mathcal{M}_{tmp}$, $f_\theta$, $t$, $p$)    ▷ update episodic memory
        $\mathcal{M}_{tmp} \leftarrow \{\}$    ▷ clear temporary ‘ring’ buffer
    end for
    return $f_\theta$
end procedure

procedure UPDATEMEMORY($\mathcal{M}$, $\mathcal{M}_{tmp}$, $f_\theta$, $t$, $p$)
    $\Phi$: Grad-CAM procedure for saliency map generation; $z$: stride; $t$: task id
    Initialize empty lists $P$, $\hat{Y}$, $Y$, $A$, $B$    ▷ buffers for memory patch selection
    for $(x, t, y) \in \mathcal{M}_{tmp}$ do    ▷ sample one example at a time without replacement from $\mathcal{M}_{tmp}$
        $s \leftarrow \Phi(x, f_\theta)$    ▷ generate saliency map using Equation 1
        $(a, b) \leftarrow$ corner coordinates of the most salient $p \times p$ region of $s$ (average-pooling with stride $z$)
        $x^p \leftarrow x[a : a+p,\ b : b+p]$    ▷ get patch from Equation 4
        $\hat{y} \leftarrow f_\theta(\mathrm{ZeroPad}(x^p, (a, b)), t)$    ▷ check model prediction after zero-padding
        add $x^p$ to $P$; $\hat{y}$ to $\hat{Y}$; $y$ to $Y$; $a$ to $A$; $b$ to $B$
    end for
    $P^* \leftarrow$ SELECT($P$, $\hat{Y}$, $Y$, $A$, $B$)    ▷ see Section 4: memory patch selection
    $\mathcal{M} \leftarrow \mathcal{M} \cup P^*$    ▷ update episodic memory
    return $\mathcal{M}$
end procedure
Algorithm 1: Algorithm for Continual Learning with Experience Packing and Replay (EPR)

Appendix D List of Hyperparameters

Methods        Hyperparameters
Finetune       lr: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
EWC            lr: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
               regularization: 0.1, 1, 10 (cifar, minImg, cub), 100, 1000
A-GEM          lr: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
MER            lr: 0.003, 0.01, 0.03 (cifar, minImg), 0.1 (cub), 0.3, 1.0
               within-batch meta-learning rate: 0.01, 0.03, 0.1 (cifar, minImg, cub), 0.3, 1.0
               current batch learning rate multiplier: 1, 2, 5 (cifar, minImg, cub), 10
MEGA-I         lr: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
               sensitivity parameter: 0.001, 0.01 (cifar, minImg, cub), 0.1
DER++          lr: 0.003, 0.01, 0.03 (minImg, cub), 0.1 (cifar), 0.3, 1.0
               regularization (first term): 0.1 (minImg), 0.2 (cifar), 0.5 (cub), 1.0
               regularization (second term): 0.5 (cifar, minImg, cub), 1.0
ER-Reservoir   lr: 0.003, 0.01, 0.03 (cub), 0.1 (cifar, minImg), 0.3, 1.0
ER-RING        lr: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
HAL            lr: 0.003, 0.01, 0.03 (cifar, minImg), 0.1, 0.3, 1.0
               regularization: 0.01, 0.03, 0.1, 0.3 (minImg), 1 (cifar), 3, 10
               mean embedding strength: 0.01, 0.03, 0.1 (cifar, minImg), 0.3, 1, 3, 10
               decay rate: 0.5 (cifar, minImg)
               gradient steps on anchors: 100 (cifar, minImg)
Multitask      lr: 0.003, 0.01, 0.03 (cifar, minImg, cub), 0.1, 0.3, 1.0
EPR (ours)     lr: 0.01, 0.03 (cub), 0.05 (minImg), 0.1 (cifar), 0.3, 1.0
               stride, z: 1 (cifar, minImg), 2, 3 (cub)
Table 5: Hyperparameter grid considered for the baselines and our approach. The best values are given in parentheses. Here, ‘lr’ denotes the learning rate. In the table, we represent Split CIFAR as ‘cifar’, Split miniImageNet as ‘minImg’ and Split CUB as ‘cub’.
        Split CIFAR    Split miniImageNet   Split CUB
n_s     EPF    p       EPF    p             EPF    p
2       3      26      5      53            7      119
1       2      22      3      48            4      112
0.75    1      27      2      51            3      112
0.5     1      22      2      42            2      112
Table 6: Experience Packing Factor (EPF) values for the different n_s used in our experiments. Input image widths, w, for the CIFAR, miniImageNet, and CUB datasets are 32, 84, and 224, respectively. For a given n_s and EPF, the corresponding memory patch sizes, p, are also given in the table.

Appendix E Additional Results

Implementation: Our code is implemented in Python (version 3.7) with the PyTorch (version 1.5.1) package. We produced the reported results by running the code on a single NVIDIA GeForce GTX 1060 GPU.

                          Split CIFAR                  Split miniImageNet           Split CUB
n_s    Methods            ACC (%)        BWT           ACC (%)        BWT           ACC (%)        BWT
-      Finetune           42.9 ± 2.07    -0.25 ± 0.03  34.7 ± 2.69    -0.26 ± 0.03  55.7 ± 2.22    -0.13 ± 0.03
2      MEGA-I             57.6 ± 0.87    -0.12 ± 0.01  50.3 ± 1.14    -0.08 ± 0.01  67.8 ± 1.30    -0.04 ± 0.01
2      DER++              56.3 ± 0.98    -0.14 ± 0.01  50.1 ± 1.14    -0.09 ± 0.01  70.7 ± 0.62    -0.03 ± 0.01
2      ER-RING            58.6 ± 2.68    -0.12 ± 0.01  51.2 ± 2.06    -0.10 ± 0.01  68.3 ± 1.13    -0.02 ± 0.01
2      EPR (Ours)         60.8 ± 0.35    -0.09 ± 0.01  53.2 ± 1.45    -0.05 ± 0.01  73.5 ± 1.30    -0.01 ± 0.01
0.75   MEGA-I             48.9 ± 1.68    -0.21 ± 0.01  43.8 ± 1.58    -0.14 ± 0.01  61.5 ± 2.08    -0.08 ± 0.01
0.75   DER++              50.0 ± 1.81    -0.19 ± 0.02  47.2 ± 1.54    -0.12 ± 0.01  64.8 ± 1.61    -0.06 ± 0.01
0.75   ER-RING            50.4 ± 0.85    -0.21 ± 0.02  44.9 ± 1.49    -0.14 ± 0.02  64.0 ± 1.29    -0.05 ± 0.01
0.75   EPR (Ours)         56.8 ± 1.59    -0.12 ± 0.02  51.1 ± 1.47    -0.06 ± 0.01  70.7 ± 0.72    -0.03 ± 0.01
0.5    MEGA-I             43.7 ± 1.26    -0.26 ± 0.02  39.6 ± 2.35    -0.18 ± 0.02  57.7 ± 0.62    -0.11 ± 0.01
0.5    DER++              47.5 ± 1.58    -0.21 ± 0.01  45.6 ± 0.56    -0.13 ± 0.01  62.5 ± 1.45    -0.08 ± 0.01
0.5    ER-RING            44.6 ± 0.84    -0.27 ± 0.01  39.1 ± 1.38    -0.20 ± 0.02  59.2 ± 0.97    -0.10 ± 0.01
0.5    EPR (Ours)         55.6 ± 0.54    -0.13 ± 0.02  49.2 ± 1.20    -0.07 ± 0.01  70.3 ± 0.91    -0.03 ± 0.01
Table 7: Performance comparison of different experience replay methods for different memory sizes. The number of memory slots per class, n_s, corresponds to memory size of 85 × n_s for CIFAR and miniImageNet, and 170 × n_s for CUB. Averages and standard deviations are computed over multiple runs with different random seeds.