GCR: Gradient Coreset Based Replay Buffer Selection For Continual Learning

Continual learning (CL) aims to develop techniques by which a single model adapts to an increasing number of tasks encountered sequentially, thereby potentially leveraging learnings across tasks in a resource-efficient manner. A major challenge for CL systems is catastrophic forgetting, where earlier tasks are forgotten while learning a new task. To address this, replay-based CL approaches maintain and repeatedly retrain on a small buffer of data selected across encountered tasks. We propose Gradient Coreset Replay (GCR), a novel strategy for replay buffer selection and update using a carefully designed optimization criterion. Specifically, we select and maintain a "coreset" that closely approximates the gradient of all the data seen so far with respect to current model parameters, and discuss key strategies needed for its effective application to the continual learning setting. We show significant gains (2%-4% absolute) over the state-of-the-art in the well-studied offline continual learning setting. Our findings also effectively transfer to online / streaming CL settings, showing up to 5% gains over existing replay-based approaches. We also demonstrate the value of supervised contrastive loss for continual learning, which yields a cumulative gain of up to 5% accuracy when combined with our subset selection strategy.

1 Introduction

The field of continual learning (CL) [thrun1995lifelong] studies the training of models in an incremental fashion, to generalize across a number of sequentially encountered scenarios or tasks, and to avoid the training and maintenance costs of one-off models. A key challenge in CL is the limited access to data from prior tasks; this results in catastrophic forgetting [mccloskey1989catastrophic], where training on subsequent tasks may potentially erase information in the model parameters pertaining to previous tasks. Approaches to address catastrophic forgetting include modifications to the loss function (e.g., [kirkpatrick2017overcoming, Rudner2021cfsvi]), to the network architecture (e.g., [hung2020compacting, SOKAR2021]), and to the training procedure and data augmentation (e.g., [derpp]). (The papers cited here and subsequently under each paradigm or technique are representative or recent papers, not necessarily the papers defining or proposing said paradigm.) In particular, replay-based continual learning maintains a small data sketch from previous tasks, to be included in the training mix throughout the lifetime learning of the model. Remarkably, as little as 1% of saved historical data, selected by random sampling, is enough to provide significant gains over other CL approaches [derpp]. This suggests that sophisticated methods for selecting compact data summaries, or coresets, may perform much better. However, previous approaches to coreset selection in continual learning have focused on qualitative/diversity-based criteria [aljundi2019gradient, ocs], or on bi-level optimizations with significant computational costs and scaling limitations [borsos2020coresets].

Figure 1: When training CL models on S-Cifar100 with different replay buffer sizes $k$, the use of replay buffers selected by GCR produces higher model accuracy and lower model forgetting than using reservoir-sampled replay buffers [derpp]. See text for more details.

We present Gradient-based Coresets for Replay-based CL (GCR), a principled, optimization-driven criterion for selecting and updating coresets for continual learning. Specifically, we select a coreset that approximates the gradient of model parameters over the entirety of the data seen so far. We provide empirical evidence that coresets selected using this approach mitigate catastrophic forgetting as well as, or better than, previous methods. Figure 1 illustrates how our coreset selection approach improves over random data selection by large margins, both in overall accuracy and in retention of previous tasks. We explore the effect of better representation learning in the CL setting by including a representation learning component [supcon] in our CL loss function. We conduct extensive experimentation against the state-of-the-art (SOTA) in the well-studied offline CL setting, where the current task's data is available in entirety for iterative training (2-4% absolute gains over SOTA), as well as in the online/streaming setting, where the data is only available in small batches and cannot be revisited (up to 5% gains over SOTA). In particular, we show that the benefit from our coreset selection mechanism increases with the number of tasks, demonstrating that GCR scales effectively with task count. Finally, we demonstrate the superiority of GCR over other coreset methods for CL in a head-to-head comparison.

2 Related Work

CL approaches divide broadly into three main themes:
Regularization-based approaches preserve learning from earlier tasks via regularization terms placed on model weights (EWC [Kirkpatrick3521], synaptic intelligence [synaptic]), structural regularization [pmlr-v80-serra18a], or functional regularization [Cha2021, Rudner2021cfsvi]. In other work, incremental moment matching [NIPS2017_f708f064] proposes a model-merging step that merges a model trained on the new task with the model trained over previous tasks, while other methods use knowledge distillation-based regularization [L2F] to retain learnings from previous tasks.

Architecture adjustments modify model architectures to address CL challenges; for instance, using a recurrent neural network [9207550], learning overcapacitated deep neural networks in an adaptive way that repeatedly compresses sparse connections [SOKAR2021], or using overlapping convolutional filters [hung2020compacting]. Other work attempts to identify shared information across tasks using variational inference (CLAW [Adel2020Continual]).

Replay-based approaches preserve knowledge of previous tasks by storing a small memory buffer of representative data samples from previous tasks for continued training alongside new tasks (ER [doi:10.1080/09540099550039318, Ratcliff90connectionistmodels]). Some approaches add a distillation loss for the learned representations (iCaRL [rebuffi2017icarl]) or for the classifier outputs (DER [derpp]) of points in the buffer. Gradient Episodic Memory (GEM [lopez2017gradient], A-GEM [chaudhry2019efficient]) focuses on minimizing catastrophic forgetting through the efficient use of episodic memory. Meta-Experience Replay (MER [riemer2018learning]) adds a penalty term for between-task interference in a meta-learning framework. Maximally Interfered Retrieval (MIR) [MIR] controls memory sampling for the replay buffer by retrieving the samples that the foreseen model parameter update would most negatively impact. Of these approaches, those applying a distillation loss on the replay buffer [rebuffi2017icarl, derpp] show the strongest performance on standard CL benchmarks.

Coreset selection: Coresets [feldman2020core] are small, informative, weighted data subsets that approximate specific attributes (e.g., loss, gradients, logits) of the original data. Early work [har2004coresets, Lemke2016DensitybasedCA] used coresets for unsupervised learning problems such as k-means and k-medians clustering. Coresets enable efficient and scalable Bayesian inference [pmlr-v130-zhang21g, NEURIPS2019_7bec7e63, 10.5555/3322706.3322721, DBLP:conf/icml/CampbellB18, NIPS2016_2b0f658c] by approximating the log-likelihood sum of the entire data. Several recent works [killamsetty21a, NEURIPS2020_8493eeac, Mirzasoleiman2020CoresetsFD, killamsetty2021glister] also use coresets to approximate the gradient sum of individual samples across the entire dataset, for efficient and robust supervised learning. Although coresets are increasingly applied to supervised and unsupervised learning scenarios, coreset selection for CL remains relatively under-studied. We discuss existing coreset-based approaches for CL below.


Coresets for Replay-based CL: Approaches include selecting replay buffers by maximizing criteria such as sample gradient diversity (GSS [aljundi2019gradient]), a mix of mini-batch gradient similarity and cross-batch diversity (OCS [ocs]), adversarial Shapley value (ASER [shim2021online]), or representation matching (iCARL [rebuffi2017icarl]). These approaches introduce a secondary selection criterion (to reduce interference with the current task and forgetting of previous tasks) that is highly specific to each particular approach; as a result, they may not generalize to different CL loss criteria. Additionally, previous approaches have not considered weighted replay buffer selection, which increases the representational capacity of replay buffers. Closest to our work is a recent paper [borsos2020coresets] that proposes a bi-level optimization for selecting the optimal replay buffer under certain assumptions, rather than for an arbitrary specified loss function. Their proposal is computationally expensive, intractable in large-task scenarios, and scales poorly with buffer size; moreover, they did not compare against recent strong proposals [derpp]. We compare directly against this approach in our experiments. In contrast to previous coreset approaches in CL, (a) we use an optimization criterion directly tied to the replay loss function used by the CL approach, and show that doing so achieves better model accuracies across different replay loss functions (ER [riemer2018learning] and DER [derpp]; see Results); and (b) we use a weighted coreset selection mechanism for continual learning, where the weights are selected by the coreset optimization criterion and allow effective use of the buffer data.
CL settings: Apart from different learning approaches, CL is also studied under a range of settings (see van de Ven & Tolias [Ven2019ThreeSF] for a taxonomy), depending on what information and data are available at which stage. In particular, the task-incremental and class-incremental settings assume that the task label is known, or not available, respectively, at inference time. In the offline setting, each new task's data is available in its entirety for repeated, iterative learning [DeLangeMatthias2021Acls], whereas online CL only considers data available in small buffers that cannot be revisited [10.5555/304710.304720].

3 Preliminaries

3.1 Notation

We assume that there are $T$ tasks in the continual learning classification problem considered. For each task $t$, we have an associated dataset $D_t = \{(x_i, y_i)\}_{i=1}^{n_t}$ composed of $n_t$ i.i.d. data points, where $x_i$ is the $i$-th sample and $y_i$ is its ground-truth label. We assume that each task $t$ is associated with a set of distinct classes $Y_t$ and that no two tasks have common classes, i.e., $Y_i \cap Y_j = \emptyset$ for $i \neq j$. Let the classifier model be characterized by parameters $\theta$. We split the model parameters into a feature-extraction layer and a linear classification layer. The model's feature-extractor output for an input sample $x$ is denoted by $e(x;\theta)$, the model's logit output by $h(x;\theta)$, and the predicted probability distribution over the classes by $f(x;\theta)$. We use $t$ to denote the last task observed so far.

3.2 Continual Learning

Following the above notation, the goal of continual learning at step $t$ is to minimize the following objective:

$$\min_{\theta} \; \sum_{k=1}^{t} \frac{1}{|D_k|} \sum_{(x,y) \in D_k} l\big(y, f(x;\theta)\big) \tag{1}$$

where $l$ is the cross-entropy loss function.

In a continual learning setup, at step $t$ we only have access to data points from task $t$, making it difficult to optimize the above objective directly. In particular, we have to make sure that while the model is learning a new task, it does not forget previous tasks.

3.3 Replay-based Continual Learning

Replay-based CL methods maintain a small buffer of data points from previous tasks, on which the model is trained along with the data samples from the new task so that it retains knowledge of previous tasks. Let the data buffer of previous tasks used for replay be denoted by $\mathcal{X}$. One possible formulation of replay-based continual learning is:

$$\min_{\theta} \; \frac{1}{|D_t|} \sum_{(x,y) \in D_t} l\big(y, f(x;\theta)\big) \;+\; \lambda \, L_{rep}(\mathcal{X}; \theta) \tag{2}$$

where $L_{rep}$ is the replay loss of the model on the samples from the replay buffer, and $\lambda$ is a hyperparameter denoting the replay loss coefficient. As seen in Eq. (2), existing methods give equal weight to all samples in the replay buffer and fail to consider each sample's representativeness or information content. Instead, GCR selects a weighted subset of the data as the summary that serves as the replay buffer, assigning weights to selected samples based on their contribution to the overall objective. We denote the weights assigned to samples in the replay buffer by $\mathcal{W}$.
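To make Eq. (2) concrete, the sketch below shows one training step of a generic replay-based method extended with the per-sample weights that GCR introduces. This is a minimal PyTorch-style illustration under our own naming (model, lambda_rep, the batch layout), not the paper's code:

```python
import torch
import torch.nn.functional as F

def replay_training_step(model, optimizer, task_batch, buffer_batch, lambda_rep):
    """One SGD step on Eq. (2): current-task loss + weighted replay loss.

    task_batch:   (x, y) from the current task D_t
    buffer_batch: (x_b, y_b, w_b) from the replay buffer; w_b are per-sample
                  weights (all ones recovers the standard unweighted replay loss).
    """
    x, y = task_batch
    x_b, y_b, w_b = buffer_batch

    # (a) cross-entropy on the current task
    task_loss = F.cross_entropy(model(x), y)

    # (b) weighted replay loss over buffer samples
    per_sample = F.cross_entropy(model(x_b), y_b, reduction="none")
    replay_loss = (w_b * per_sample).mean()

    loss = task_loss + lambda_rep * replay_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```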

4 Methodology

We discuss the main components of GCR separately and how they combine in our overall continual learning framework. The following sections cover the role of replay buffers for preserving data history, the GCR learning objective, and the end-to-end learning algorithm. In the experimental section, we systematically examine the contribution of each component of GCR to overall performance.

Figure 2: Block diagram showing the training, replay buffer selection, and update operations of GCR at time step $t$.

4.1 Replay Buffer Selection

In order to cover both offline CL (all task data is available) and online CL (data available in small buffers), we utilize a candidate pool $C_t$ from the current task for summary construction, instead of the full current-task data $D_t$. This candidate pool is selected using reservoir sampling (see Section 4.3 for more details). Following training on the current task, GCR updates the replay buffer with a weighted summary of size $k$ selected from the candidate pool $C_t$ and the previous replay buffer $\mathcal{X}_{t-1}$. Each point in the pool is assigned unit weight; that is, the weight vector associated with the candidate pool is a vector of all ones:

$$(\mathcal{X}_t, \mathcal{W}_t) \;=\; \mathrm{GradApprox}\big(\mathcal{X}_{t-1} \cup C_t, \;\mathcal{W}_{t-1} \cup \mathbf{1}, \;k\big) \tag{3}$$

In particular, GCR constructs the data summary by selecting a weighted subset such that the weighted sum of its individual samples' replay-loss gradients is closest to the replay-loss gradient of the entire dataset. We give a detailed formulation of the GradApprox subset-selection optimization problem used by GCR in the subsequent sections. Moreover, detailed pseudocode for the GradApprox algorithm is given in Algorithm 3.

GradApprox optimization problem: In the CL setting, the ideal replay buffer is the one that most mitigates catastrophic forgetting. Computing catastrophic forgetting exactly, however, requires access to the previous tasks' data. Hence, we approximate it through the model's replay loss on the combined replay buffer and candidate pool samples; note that the replay loss is calculated using the historical model outputs stored for samples within the replay buffer and the candidate pool.

Given a dataset $D$ and its associated sample weights $\mathcal{W}_D$, GradApprox selects a data subset $\mathcal{X} \subseteq D$ with $|\mathcal{X}| \leq k$ and associated weights $\mathcal{W}$:

$$(\mathcal{X}, \mathcal{W}) \;=\; \operatorname*{arg\,min}_{\mathcal{X} \subseteq D, \; \mathcal{W}: \; |\mathcal{X}| \leq k} \; \Big\lVert \sum_{d \in D} (\mathcal{W}_D)_d \, \nabla_{\theta} L_{rep}(d; \theta) \;-\; \sum_{d \in \mathcal{X}} \mathcal{W}_d \, \nabla_{\theta} L_{rep}(d; \theta) \Big\rVert^2 \;+\; \lambda \lVert \mathcal{W} \rVert_2^2 \tag{4}$$

In the above equation, the replay loss function $L_{rep}$ is a weighted replay loss in which individual data-sample losses are weighted by their corresponding weight values. Note that the above coreset-selection optimization problem is general in the choice of replay loss function. We give the specific weighted versions of the replay losses used in DER [derpp] and ER [doi:10.1080/09540099550039318] below.
Weighted DER replay loss: $L_{rep}(d, w; \theta) = w \big( \alpha \, \lVert z - h(x; \theta) \rVert_2^2 + \beta \, l(y, f(x; \theta)) \big)$, where $d = (x, y, z)$ and $z$ is the stored historical logit output for $x$. The first loss component is the distillation loss, and the second is the label loss; the hyperparameters $\alpha$ and $\beta$ are the coefficients of the distillation and label components, respectively. Note that with the weighted DER replay loss we need to store the historical model logits along with the label for each sample.
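As a concrete illustration, here is a minimal PyTorch-style sketch of this weighted replay loss; the function and variable names (model, alpha, beta) are ours, and the snippet is a hedged rendering rather than the authors' implementation:

```python
import torch.nn.functional as F

def weighted_der_replay_loss(model, x, y, z, w, alpha, beta):
    """Weighted DER replay loss: w * (alpha * ||z - h(x)||^2 + beta * CE(f(x), y)).

    x: buffer inputs; y: stored labels; z: stored historical logits;
    w: per-sample coreset weights selected by GradApprox.
    """
    logits = model(x)                                  # h(x; theta)
    distill = ((logits - z) ** 2).mean(dim=1)          # per-sample logit MSE
    label = F.cross_entropy(logits, y, reduction="none")
    return (w * (alpha * distill + beta * label)).mean()
```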

Input: Datasets: $\{D_1, \dots, D_T\}$; Parameters: $\theta$; Scalars: $\alpha$, $\beta$, $\gamma$, $\lambda$; Learning rate: $\eta$; Batch size: $b$; Buffer size: $k$; Tolerance: $\epsilon$
Output: Parameters: $\theta$
1 Initialize replay buffer $\mathcal{X} \leftarrow \emptyset$, replay buffer weights $\mathcal{W} \leftarrow \emptyset$, and sample count $n \leftarrow 0$
2 for $t = 1, \dots, T$ do
3   Initialize candidate pool $C_t \leftarrow \emptyset$ and task sample count $n_t \leftarrow 0$
4   for each batch $B \subset D_t$ do
5     Update task sample count $n_t \leftarrow n_t + |B|$ and sample count $n \leftarrow n + |B|$; calculate model logit outputs for $B$
6     Draw replay samples from $\mathcal{X} \cup C_t$ via adaptive sampling (Algorithm 2) and update $\theta$ with one SGD step on the GCR objective (Eq. 5)
7     Update $C_t$ with $B$ (and its logits) via reservoir sampling
8 $(\mathcal{X}, \mathcal{W}) \leftarrow \mathrm{GradApprox}(\mathcal{X} \cup C_t, \mathcal{W} \cup \mathbf{1}, k)$
Algorithm 1 GCR Algorithm

Input: Replay buffer: $\mathcal{X}$; Replay buffer weights: $\mathcal{W}$; Candidate pool: $C$; Task sample count: $n_t$; Entire sample count: $n$
Output: Data sample: $(x, y, z, w)$
1 Calculate probability $p \leftarrow n_t / n$
2 Sample a random number $u \sim \mathrm{Uniform}(0, 1)$
3 if $u \leq p$ then
4   Sample an index uniformly at random from the candidate pool $C$ and return that sample
5 else
6   Sample an index uniformly at random from the replay buffer $\mathcal{X}$ and return that sample
Algorithm 2 Adaptive Sampling Algorithm
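A minimal Python sketch of this adaptive sampling rule, under our reading that a draw comes from the candidate pool with probability equal to the current task's share of all samples seen; the names (pool, buffer) and the tuple layout are illustrative assumptions:

```python
import random

def adaptive_sample(buffer, pool, task_count, total_count):
    """Return one (x, y, z, w) tuple: from the current-task candidate pool
    with probability task_count / total_count, else from the replay buffer."""
    p = task_count / total_count
    if random.random() < p:
        return random.choice(pool)     # sample from the candidate pool
    return random.choice(buffer)       # sample from the past-task replay buffer
```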

Weighted ER replay loss: $L_{rep}(d, w; \theta) = w \, l(y, f(x; \theta))$, where $d = (x, y)$ and the loss component is the label loss calculated from the stored historical label for each sample. Further, in Eq. (4) we add an $\ell_2$ regularization component over the weights of the replay buffer to discourage large weight assignments to individual data samples, thereby preventing the model from overfitting to a few samples. The optimization problem in Eq. (4) is weakly submodular [killamsetty21a, Natarajan1995SparseAS]; hence, it can be solved effectively by a greedy algorithm with approximation guarantees. We use Orthogonal Matching Pursuit (OMP) as the greedy algorithm to find the subset and its associated weights. A similar approach to subset selection using the cross-entropy loss has been proposed for efficient and robust supervised learning in offline, single-task scenarios [killamsetty21a]. In all experiments, we adopt balanced replay buffer construction with an equal number of samples from each class, making the selected replay buffer robust to class imbalance in the dataset. In other words, if there are $m$ classes in the dataset, we solve $m$ per-class gradient-approximation problems, selecting a weighted subset of size $k/m$ from the data instances of each class.

Input: Dataset: $D$; Existing weights: $\mathcal{W}_D$; Parameters: $\theta$; Scalar: $\lambda$; Budget: $k$; Tolerance: $\epsilon$
Output: Subset: $\mathcal{X}$; Subset weights: $\mathcal{W}$
1 Partition the dataset $D$ and its weights $\mathcal{W}_D$ by class label into $D^1, \dots, D^m$
2 Initialize replay buffer $\mathcal{X} \leftarrow \emptyset$ and replay buffer weights $\mathcal{W} \leftarrow \emptyset$
3 for each class $c = 1, \dots, m$ do
4   Initialize per-class budget $k_c \leftarrow k/m$, per-class subset $\mathcal{X}_c \leftarrow \emptyset$, and per-class weights $\mathcal{W}_c \leftarrow \emptyset$
5   Calculate residuals $r$ (the weighted gradient sum of $D^c$ minus that of the current selection)
6   while $|\mathcal{X}_c| < k_c$ and $\lVert r \rVert > \epsilon$ do
7     Find the element with maximum residual alignment; update the per-class subset $\mathcal{X}_c$
8     Calculate updated weights $\mathcal{W}_c$ by least squares on the selected set; recalculate residuals $r$
9   Update subset $\mathcal{X} \leftarrow \mathcal{X} \cup \mathcal{X}_c$ and subset weights $\mathcal{W} \leftarrow \mathcal{W} \cup \mathcal{W}_c$
Algorithm 3 GradApprox Algorithm
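To make the OMP loop of Algorithm 3 concrete, here is a simplified NumPy sketch of the per-class selection step; it assumes a precomputed per-sample gradient matrix G (in practice, last-layer gradients for tractability) and is our own hedged rendering, not the authors' implementation:

```python
import numpy as np

def grad_approx_omp(G, w_full, budget, tol=1e-4):
    """Greedy OMP-style selection of a weighted coreset.

    G:      (n, p) array; row i is the replay-loss gradient of sample i.
    w_full: (n,) existing sample weights (all ones for the candidate pool).
    Returns (indices, weights) such that the weighted sum of the selected
    gradients approximates the full weighted gradient sum G.T @ w_full.
    The paper applies this per class with budget k/m (m = number of classes).
    """
    target = G.T @ w_full              # full weighted gradient sum, shape (p,)
    selected = []
    weights = np.zeros(0)
    residual = target.copy()
    while len(selected) < budget and np.linalg.norm(residual) > tol:
        scores = G @ residual          # alignment of each sample with residual
        scores[selected] = -np.inf     # never re-select chosen samples
        selected.append(int(np.argmax(scores)))
        A = G[selected].T              # (p, |selected|) gradient columns
        weights, *_ = np.linalg.lstsq(A, target, rcond=None)
        weights = np.clip(weights, 0.0, None)   # keep weights non-negative
        residual = target - A @ weights
    return np.array(selected), weights
```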

4.2 GCR Loss Objective

As mentioned earlier, GCR maintains two bounded weighted data buffers of size $k$. The first is the replay buffer $\mathcal{X}_{t-1}$ with associated weight vector $\mathcal{W}_{t-1}$, holding data samples from the past tasks seen so far; the second is the candidate pool $C_t$ with its (all-ones) weight vector, holding data samples from the current task $t$. Let the model parameters at step $t$ be given by $\theta$. Even though our proposed subset selection approach can be applied with different replay loss functions, for the following experiments and analysis we adopt the weighted version of the replay loss of the state-of-the-art approach DER [derpp]. Furthermore, we also include a representation learning objective in GCR to analyze the impact of representation learning on CL. Therefore, for all experiments reported in Tables 1-5, the loss objective considered by GCR is:

$$L_{GCR}(\theta) \;=\; \underbrace{\sum_{(x,y) \in D_t} l\big(y, f(x;\theta)\big)}_{(a)} \;+\; L_{replay}(\theta) \tag{5}$$

where,

$$L_{replay}(\theta) \;=\; \underbrace{\alpha \sum_{(x,y,z,w)} w \, \lVert z - h(x;\theta) \rVert_2^2}_{(b)} \;+\; \underbrace{\beta \sum_{(x,y,z,w)} w \, l\big(y, f(x;\theta)\big)}_{(c)} \;+\; \underbrace{\gamma \, L^{w}_{supcon}(\theta)}_{(d)} \tag{6}$$

with the sums in (b), (c), and (d) taken over the weighted samples of the combined replay buffer and candidate pool $\mathcal{X}_{t-1} \cup C_t$.

In the above equation, the optimization objective consists of four components. The first component (a) measures prediction loss against the ground-truth labels for current-task data. The second and third components (b, c) measure the current model's loss over data from the combined replay buffer and current-task candidate pool, in terms of distillation loss and label loss, respectively. Unlike previous work, GCR uses a weighted loss over the replay buffer, where the weights are an outcome of the replay buffer selection procedure (see previous sections). The fourth and final loss component (d) is a weighted version of the supervised contrastive loss [supcon], originally proposed for standard supervised learning settings. This loss term improves the model's learned representations by forcing data samples from the same class to lie closer in the embedding space than samples from other classes. In our work, we always use the data samples from the combined candidate pool and replay buffer as anchor points for the supervised contrastive loss. We also show results of GCR using the ER-based replay loss [doi:10.1080/09540099550039318] in Appendix B. Although DER [derpp], which uses the first three loss components above (unweighted) along with a randomly selected subset as a replay buffer, was shown to outperform previous replay-based CL approaches, a randomly chosen replay buffer has several disadvantages: (a) it can be a poor representation of previous tasks, (b) it does not take the current model parameters into account, and (c) it is susceptible to data imbalance across tasks. GCR addresses these issues and increases the robustness of the continual training process by selecting a representative weighted subset via gradient approximation.
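For completeness, below is a hedged sketch of the weighted supervised contrastive term (d), adapting the SupCon loss of [supcon] with per-sample anchor weights; the names (feats, temp) and the exact weighting scheme are our illustrative assumptions, not the authors' exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_supcon_loss(feats, labels, weights, temp=0.1):
    """Weighted supervised contrastive loss over a batch of anchors.

    feats:   (n, d) L2-normalized embeddings e(x; theta)
    labels:  (n,)   class labels
    weights: (n,)   per-sample coreset weights (all ones = unweighted SupCon)
    """
    sim = feats @ feats.T / temp                       # pairwise similarities
    n = feats.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)          # avoid -inf * 0 = nan

    pos = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()
    # mean log-probability of positives for each anchor
    anchor_loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    has_pos = (pos.sum(dim=1) > 0).float()             # anchors with positives
    denom = (weights * has_pos).sum().clamp(min=1e-8)
    return (weights * anchor_loss * has_pos).sum() / denom
```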

Further optimization: The candidate pool for the current task is selected by reservoir sampling, specifically in order to maintain candidate data points alongside intermediate logits for the distillation loss in the GCR objective. The candidate pool selection could be improved further through a gradient-approximation objective at the end of current-task training; however, this strategy would require maintaining intermediate logits for the entire dataset, imposing prohibitive storage costs.

4.3 The GCR Algorithm

We now put together the various components described above into a continual learning workflow. A pictorial representation of the GCR algorithm is given in Figure 2, and detailed pseudocode in Algorithm 1. As mentioned earlier, and as shown in the pseudocode, we employ reservoir sampling [reservoir] to select the candidate pool (a minimal sketch follows below). We also use the adaptive sampling procedure described in Algorithm 2 to sample data points from the combined candidate pool and replay buffer. Finally, the pseudocode in Algorithm 1 requires knowledge of task boundaries to update the replay buffer using GradApprox; in practice, however, we can instead use streaming boundaries, or regular intervals of data samples, as boundaries, thereby making the GCR algorithm task-agnostic.
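Reservoir sampling keeps a uniform random sample of a stream without knowing its length in advance; a minimal sketch of the update step (our own illustration) is:

```python
import random

def reservoir_update(pool, capacity, item, seen_count):
    """Maintain a uniform sample of `capacity` items from a stream.

    seen_count is the number of stream items observed so far (including
    `item`); each item ends up in the pool with probability capacity/seen_count.
    """
    if len(pool) < capacity:
        pool.append(item)
    else:
        j = random.randrange(seen_count)   # uniform index in [0, seen_count)
        if j < capacity:
            pool[j] = item
```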

5 Experiment setup

We test the efficacy of our approach, GCR, by comparing its performance with state-of-the-art continual learning baselines under several CL settings.

S-Cifar-10 S-Cifar-100 S-TinyImageNet
Setting Method K=200 K=500 K=2000 K=200 K=500 K=2000 K=200 K=500 K=2000
Class-IL ER 49.16±2.08 62.03±1.70 77.13±0.87 21.78±0.48 27.66±0.61 42.80±0.49 8.65±0.16 10.05±0.28 18.19±0.47
GEM 29.99±3.92 29.45±5.64 27.2±4.5 20.75±0.66 25.54±0.65 37.56±0.87 - - -
GSS 38.62±3.59 48.97±3.25 60.40±4.92 19.42±0.29 21.92±0.34 27.07±0.25 8.57±0.13 9.63±0.14 11.94±0.17
iCARL 32.44±0.93 34.95±1.23 33.57±1.65 28.0±0.91 33.25±1.25 42.19±2.42 5.5±0.52 11.0±0.55 18.1±1.13
DER 63.69±2.35 72.15±1.31 81.00±0.97 31.23±1.38 41.36±1.76 55.45±0.86 13.22±0.92 19.05±1.32 31.53±0.87
GCR 64.84±1.63 74.69±0.85 83.97±0.58 33.69±1.4 45.91±1.3 60.09±0.72 13.05±0.91 19.66±0.68 35.68±0.52
Task-IL ER 91.92±1.01 93.82±0.41 96.01±0.28 60.19±1.01 66.23±1.52 74.67±1.2 38.83±1.15 47.86±0.87 62.04±0.7
GEM 88.67±1.76 92.33±0.8 94.34±1.31 58.84±1.0 66.31±0.86 74.93±0.6 - - -
GSS 90.0±1.58 91.73±1.18 93.54±1.32 55.38±1.34 60.28±1.18 66.88±0.88 31.77±1.34 36.52±0.91 43.75±0.83
iCARL 74.59±1.24 75.63±1.42 76.97±1.04 51.43±1.47 58.16±1.76 67.95±2.69 22.89±1.83 35.86±1.07 49.07±2.0
DER 91.91±0.51 93.96±0.37 95.43±0.26 63.09±1.09 71.73±0.74 78.73±0.61 42.27±0.9 53.32±0.92 64.86±0.48
GCR 90.8±1.05 94.44±0.32 96.32±0.18 64.24±0.83 71.64±2.1 80.22±0.49 42.11±1.01 52.99±0.89 66.7±0.59
Table 1: Offline Class-IL and Task-IL Continual Learning. See text for details. Numbers represent mean ± SEM of model test accuracy over 15 runs. Best-performing models in each column are bolded (paired $t$-test). Subsequent tables follow the same style.
S-Cifar-10 S-Cifar-100 S-TinyImageNet
Setting Method K=200 K=500 K=2000 K=200 K=500 K=2000 K=200 K=500 K=2000
Class-IL DER 35.79±2.59 24.02±1.63 12.92±1.1 62.72±2.69 49.07±2.54 28.18±1.93 64.83±1.48 59.95±2.31 39.83±1.15
GCR 32.75±2.67 19.27±1.48 8.23±1.02 57.65±2.48 39.2±2.84 19.29±1.83 65.29±1.73 56.4±1.08 32.45±1.79
Task-IL DER 6.08±0.7 3.72±0.55 1.95±0.32 25.98±1.55 25.98±1.55 7.37±0.85 40.43±1.05 28.21±0.97 15.08±0.49
GCR 7.38±1.02 3.14±0.36 1.24±0.27 24.12±1.17 15.07±1.88 5.75±0.72 40.36±1.08 27.88±1.19 13.1±0.57
Table 2: Forgetting metric (lower is better) in Offline Class-IL and Task-IL Continual Learning. For simplicity, we only present numbers for the two best-performing algorithms–DER and GCR (ours). Full table in appendix.

5.1 Continual Learning settings

We compare the performance of GCR with the baselines considered in the following continual learning settings.
Offline Class-Incremental (Class-IL): all data for the current task is available for training with multiple learning iterations; previous tasks' data are only partially available through the replay buffer. Offline Task-Incremental (Task-IL): uses different output heads/classifier layers for different tasks, and needs a task identifier at inference time to select the appropriate classifier. Online Streaming: data arrives in the form of long streams; the learner can iterate only once on streaming data but multiple times on the stored replay buffer.

5.2 Baselines

We consider the following continual learning methods as baselines for a thorough evaluation of GCR's performance. ER (Experience Replay): a rehearsal method with random sampling for memory retrieval and reservoir sampling for memory update. GEM (Gradient Episodic Memory): focuses on minimizing negative backward transfer (catastrophic forgetting) through the efficient use of episodic memory. MIR (Maximally Interfered Retrieval): retrieves the memory samples whose loss would increase the most under the estimated parameter update for the current task. GSS (Gradient-Based Sample Selection): diversifies the gradients of the samples in the replay memory. iCaRL (Incremental Classifier and Representation Learning): simultaneously learns classifiers and a feature representation in the class-incremental setting. DER (Dark Experience Replay): a strong baseline built upon reservoir sampling that matches the network's logits sampled throughout the optimization trajectory, thus promoting consistency with its past. We compare against ER, GEM, GSS, iCaRL, and DER in the offline CL setting; for the online streaming setting, we additionally compare with MIR.

5.3 Datasets

We perform experiments on the following datasets: Sequential CIFAR-10, which splits the CIFAR-10 dataset [Krizhevsky09learningmultiple] into 5 tasks with non-overlapping classes and 2 classes per task; Sequential CIFAR-100, obtained similarly by splitting CIFAR-100 [Krizhevsky09learningmultiple] into 5 tasks of 10 classes each; and Sequential Tiny-ImageNet, which splits Tiny-ImageNet [Pouransari2014TinyIV] into 10 tasks of 20 classes each. For an extended learning experiment, we also split CIFAR-100 into 20 tasks of 5 classes each (see Results below).

5.4 Model architecture and other training details

We use ResNet18 as the base model architecture for all datasets throughout all experiments, with a minibatch size of 32 for both the current-task data and the replay buffer, and a standard stochastic gradient descent (SGD) optimizer without any learning rate scheduler. All remaining parameter values are hyperparameters, and we determine the best-performing values for each run via hyperparameter tuning. Details of the grid search and the selected values are given in the supplementary material.

5.5 Evaluation

Following previous work [derpp], we select hyperparameters by performing a grid search on validation data obtained by splitting off 10% of the training data. In all comparisons, the respective models are trained with the chosen hyperparameters on the training data, and model accuracy is evaluated on a separate test dataset. This procedure is repeated 15 times with different random seeds controlling weight initialization and subsequent data randomization. We present the mean accuracy along with the standard error of the mean. We also present the forgetting metric (F) [chaudhry2018riemannian], which measures how much the accuracy of learnt tasks decreases over time as the continual model learns subsequent tasks. Since all models within a comparison are trained on the same set of random seeds, we use paired $t$-tests, evaluated at a fixed confidence threshold, to determine the statistical significance of any differences in metrics.
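Concretely, writing $a_{j,i}$ for the test accuracy on task $i$ after training up to task $j$, the forgetting metric follows the standard definition of [chaudhry2018riemannian] (our notation):

```latex
F \;=\; \frac{1}{T-1} \sum_{i=1}^{T-1} \; \max_{j \in \{i, \dots, T-1\}} \big( a_{j,i} \;-\; a_{T,i} \big)
```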

6 Results

6.1 Offline Continual Learning

Class-IL Task-IL
K=200 K=500 K=200 K=500
Method Acc (↑) F (↓) Acc (↑) F (↓) Acc (↑) F (↓) Acc (↑) F (↓)
ER 9.57±3.06 76.08±4.38 15.79±4.63 69.46±5.26 68.66±8.37 19.84±7.72 75.75±4.94 14.18±4.62
DER 15.64±2.59 64.34±4.33 26.61±4.94 51.27±7.57 66.87±3.91 21.08±3.68 77.41±4.15 13.05±4.17
GCR 21.02±1.99 51.82±4.04 31.93±4.86 42.1±6.6 68.86±2.98 20.11±3.07 78.09±3.33 12.97±3.15
Table 3: Offline Class-IL and Task-IL results on S-Cifar-100 with 20 tasks of 5 classes each. GCR significantly outperforms ER and DER, by larger margins than in the 5-task setting, showing that GCR coreset selection scales effectively to more tasks.
S-Cifar-10 S-Cifar-100 S-TinyImageNet
Method K=200 K=500 K=2000 K=200 K=500 K=2000 K=200 K=500 K=2000
ER 32.63±5.11 43.32±4.75 49.31±7.45 11.72±1.25 14.94±1.69 20.78±1.48 4.36±1.02 6.69±0.53 11.3±1.45
GEM 16.81±1.17 17.73±0.98 20.35±2.44 10.42±1.66 9.18±1.8 12.38±3.81 5.15±0.71 6.14±1.01 9.56±2.63
GSS 33.83±3.95 40.59±4.2 41.7±4.56 12.31±0.57 13.84±0.57 16.24±0.8 5.82±0.38 6.78±0.41 8.44±0.48
iCARL 38.66±2.03 33.53±3.17 39.56±1.53 11.23±0.31 11.81±0.29 12.28±0.39 4.39±0.19 5.36±0.27 6.66±0.3
MIR 21.5±0.7 30.52±1.22 46.40±1.81 10.3±0.3 11.40±0.38 16.02±0.5 4.9±0.17 5.1±0.33 7.05±0.3
DER 38.26±5.08 45.56±6.02 49.9±7.55 10.68±2.03 14.43±2.44 24.83±1.58 4.81±0.57 8.49±0.48 15.97±0.95
GCR 42.26±7.6 51.17±3.5 52.3±7.01 14.88±0.94 18.89±1.5 22.47±2.01 6.19±0.36 8.77±0.5 11.83±1.26
Table 4: Online Continual Learning. See text for details.
Method Buffer Size
500 2000
GCR/ -Supcon -Coreset +Bilevel 38.0±0.99 47.0±0.97
GCR/ -Coreset +Bilevel 40.41±1.05 51.11±0.96
GCR/ -Supcon -Coreset +Reservoir 41.36±1.76 55.45±0.86
GCR/ -Coreset +Reservoir 43.73±1.7 57.41±1.01
GCR/ -Supcon 45.54±1.39 58.35±0.78
GCR 45.91±1.3 60.09±0.72
Table 5: Ablation study on Class-IL offline continual learning with S-Cifar-100 for two different buffer sizes. For buffer size 2000, all pairwise differences are statistically significant (paired $t$-test).

Table 1 shows the average accuracy of various continual learning methods at the end of all tasks on S-Cifar-10, S-Cifar-100, and S-TinyImageNet for different buffer sizes. Models are trained for 50 epochs on Cifar-10 and Cifar-100; epochs are increased to 100 for Tiny-ImageNet, as is commonly done for harder datasets. The candidate pool size in GCR is kept equal to the buffer size for all experiments.

In the Class-IL setting, GCR outperforms other methods with gains of roughly 1-5% that are statistically significant. For S-Cifar-100, for buffer sizes of 500 and 2000, the difference is more than 4.5%. Among the baselines considered, only DER appears close; this can be attributed to the usage of distillation loss proposed by DER, which other methods do not use. S-TinyImageNet is a challenging dataset for continual learning, and both DER and GCR show significant improvements over alternative approaches.

For the Task-IL setting, although GCR outperforms others in many instances (e.g., in S-Cifar-100 for buffer size 2000, the difference is close to 1.5%), DER and ER both have accuracy reasonably close to that of GCR. We believe this is due to the substantially lower complexity of the Task-IL setting; the CL model only needs to learn each task in isolation. GSS gives comparable results on S-Cifar-10, but its performance deteriorates for S-Cifar-100 and S-TinyImageNet, suggesting that it does not scale to more difficult datasets.

Forgetting metric: Table 2 shows the forgetting metric (F) for different settings and buffer sizes. The metric shows that GCR suffers less catastrophic forgetting than other replay-based methods; the forgetting metric and the accuracy numbers in Table 1 covary closely.

Increased number of tasks: Table 3 shows the final accuracy and forgetting metric on S-Cifar-100 split into 20 tasks of 5 classes each, to study the impact of an increasing number of rounds of gradient approximation / replay buffer updates. The numbers show robust gains, especially at smaller buffer sizes and in the class-incremental setting, which correspond to the significantly harder continual learning problems. The wider gap between DER and GCR compared to Table 1 shows that our coreset selection procedure accumulates value over time, scaling effectively to 4x the number of tasks.

6.2 Online Streaming Continual Learning

Table 4 compares GCR against baselines in the online setting. Here, data is made available in streams of size 6250, and the task boundary is blurred. Due to the streaming nature of the data, models are only allowed a single epoch of training on each stream. Since iCARL and GEM require knowledge of task boundaries, we explicitly provided the task identity while training those two baselines for a fair comparison. In this setting, the performance gap between GCR and baselines is much more apparent for low buffer sizes, where the quality of examples stored plays a significant role. As buffer size increases, coverage can compensate for suboptimal data selection. In fact, for S-Cifar-10 with a buffer size of 500, the gap between DER and GCR is more than 5%. Forgetting metrics are reported in Appendix A due to space considerations.

Recent papers in online continual learning, such as [shim2021online], do not compare against techniques from the offline setting that can be straightforwardly adapted to, and are very competitive in, the online setting. We note that their reported numbers are significantly lower even than DER's in Table 4; as a result, we did not re-evaluate those techniques in our study.

6.3 Other coreset methods

Table 5 shows the performance of GCR with various components removed or replaced. For simplicity, we only show metrics on one dataset and two buffer sizes; results are similar for other experimental combinations. We see from the results that bilevel coreset selection (first two rows, see [borsos2020coresets]) performs worse even than reservoir sampling (rows three and four). Similarly, comparing row 3 (approximately equivalent to DER) to row 5 shows that our gradient approximation technique is significantly better than reservoir sampling, without the use of the supervised contrastive loss.

6.4 Ablation studies

Continuing with Table 5, we see that using either the supervised contrastive loss ("Supcon" in the table) or gradient coresets instead of reservoir sampling (rows 4 and 5) significantly improves performance over DER (row 3), showing the significant, separate impact of GCR coreset selection and of the supervised contrastive loss. The complete GCR framework, including both coresets and the supervised contrastive loss, achieves the best performance of all combinations. Finally, we also tested the statistical significance of the accuracy gains achieved by GCR; for buffer size 2000, all pairwise comparisons were statistically significant.

6.5 Generality of gradient-based coresets

We also show that our gradient-based coresets can be combined effectively with other replay-based approaches to improve their performance; see Appendix B for details.

7 Conclusion

We presented GCR, a gradient-based coreset selection method for replay-based continual learning. We propose gradient approximation as an optimization criterion for selecting coresets, building on recent advances in supervised learning settings [killamsetty21a]. We integrate this objective into the continual learning workflow by selecting and updating replay buffers for future training. We also include a supervised representation learning loss [supcon] in our CL objective, enhancing the learned representations over the model's lifetime. Extensive experiments across datasets, replay buffer sizes, and CL settings (offline/online, class-IL/task-IL) show that GCR significantly outperforms previous approaches in comparable settings, by 2-4% accuracy in offline settings and up to 5% in online settings. Ablation studies show that the GCR coreset selection objective outperforms the previous best selection mechanisms [borsos2020coresets, derpp], and that the representation loss independently contributes to performance gains. Experiments with an increased number of tasks show that our coreset selection approach scales effectively, providing increasing gains as the number of tasks increases. By unifying performance analysis in the offline and online settings (typically studied separately), we show that the core ideas of GCR apply to both, and we hope that cross-fertilization of ideas across these settings will continue in future work.

References

Appendix A Additional Results

A.1 Forgetting metric

S-Cifar-10 S-Cifar-100 S-TinyImageNet
Setting Method K=200 K=500 K=2000 K=200 K=500 K=2000 K=200 K=500 K=2000
Class-IL ER 59.3±2.48 43.22±2.1 23.85±1.09 75.06±0.63 67.96±0.78 49.12±0.57 76.53±0.51 75.21±0.54 65.58±0.53
GEM 80.36±5.25 78.93±6.53 82.33±5.83 77.4±1.09 71.34±0.78 55.27±1.37 - - -
GSS 72.48±4.45 59.18±4.0 44.59±6.13 77.62±0.76 74.12±0.42 67.42±0.62 76.47±0.4 75.3±0.26 72.49±0.43
iCARL 23.52±1.27 28.2±2.41 21.91±1.15 47.2±1.23 40.99±1.02 30.64±1.85 31.06±1.91 37.3±1.42 39.88±1.51
DER 35.79±2.59 24.02±1.63 12.92±1.1 62.72±2.69 49.07±2.54 28.18±1.93 64.83±1.48 59.95±2.31 39.83±1.15
GCR 32.75±2.67 19.27±1.48 8.23±1.02 57.65±2.48 39.2±2.84 19.29±1.83 65.29±1.73 56.4±1.08 32.45±1.79
Task-IL ER 6.07±1.09 3.5±0.53 1.37±0.44 27.38±1.46 17.37±1.06 8.03±0.66 40.47±1.54 30.73±0.62 18.0±0.83
GEM 9.57±2.05 5.6±0.96 2.95±0.81 29.59±1.66 20.44±1.13 9.5±0.73 - - -
GSS 8.49±2.05 6.37±1.55 4.31±1.68 32.81±1.75 26.57±1.34 18.98±1.13 50.75±1.63 45.59±0.99 38.05±1.17
iCARL 25.34±1.64 22.61±3.97 24.47±1.36 36.2±1.85 27.9±1.37 16.99±1.76 42.47±2.47 39.44±0.84 30.45±2.18
DER 6.08±0.7 3.72±0.55 1.95±0.32 25.98±1.55 25.98±1.55 7.37±0.85 40.43±1.05 28.21±0.97 15.08±0.49
GCR 7.38±1.02 3.14±0.36 1.24±0.27 24.12±1.17 15.07±1.88 5.75±0.72 40.36±1.08 27.88±1.19 13.1±0.57
Table 6: Forgetting metric in Offline Class-IL and Task-IL Continual Learning
S-Cifar-10 S-Cifar-100 S-TinyImageNet
Method K=200 K=500 K=2000 K=200 K=500 K=2000 K=200 K=500 K=2000
ER 47.01±6.63 38.72±7.94 31.96±8.93 30.16±0.69 26.29±1.31 16.42±2.17 27.86±1.69 32.53±1.18 27.91±1.41
GEM 73.63±3.96 73.07±6.58 53.27±10.93 32.94±2.88 27.15±3.78 29.97±7.12 - - -
GSS 48.8±6.56 40.62±6.74 40.67±5.75 33.06±1.05 25.37±1.93 19.56±1.64 36.91±1.44 32.67±1.36 23.63±1.18
iCARL 23.78±3.64 26.2±4.31 22.11±4.61 9.53±0.57 9.15±0.49 8.9±0.49 6.95±0.5 7.22±0.38 6.89±0.37
DER 34.12±7.04 29.05±8.59 27.5±8.69 26.84±1.7 22.92±2.73 13.72±2.03 31.68±1.46 27.09±0.79 14.97±1.28
GCR 26.7±8.37 20.1±3.32 22.18±9.9 21.86±1.77 19.46±1.72 17.91±2.3 34.19±1.07 27.47±0.8 22.31±1.35
Table 7: Forgetting metric in Online Continual Learning.

In this section, we present additional results for the experiments shown in Sections 6.1 and 6.2. We report the forgetting metric (F), which shows how much the accuracy of learnt tasks decreases over time as the continual model learns subsequent tasks. The average forgetting across tasks is reported in Table 6 (offline setting) and Table 7 (online streaming setting). It is worth noting that forgetting should only be considered alongside final accuracy when comparing two continual learning models, since forgetting alone can be misleading: a model that never learns any subsequent task will show near-zero forgetting but near-random final accuracy, which is undesirable. A clear example is iCaRL [rebuffi2017icarl] in the offline setting, which has poor overall accuracy (Table 1) but highly favorable forgetting metrics (Table 6).

A.2 OCS vs GCR

Method S-Cifar-100 (20 Tasks)
OCS 60.5±0.55
GCR 60.86±3.53
Table 8: Comparing OCS with GCR for the Task-IL setting of S-Cifar-100 (20 tasks) and buffer size 100.

Table 8 compares published performance numbers for the OCS algorithm [ocs] in one setting against our model trained in the same setting. Our approach shows better performance, suggesting that our gradient approximation objective is superior to the gradient diversity-based selection objective of OCS. However, this comparison is incomplete: the authors of OCS have not made their code available for comparison, the paper's description of the algorithm was insufficient for reproduction, and they did not publish numbers on any of the other settings, datasets, or buffer sizes explored in our paper.

Appendix B Generality of gradient-based coresets

S-Cifar-10 S-Cifar-100
Class-IL Task-IL Class-IL Task-IL
Method K=500 K=2000 K=500 K=2000 K=500 K=2000 K=500 K=2000
ER 62.03±1.70 77.13±0.87 93.82±0.41 96.01±0.28 27.66±0.61 42.80±0.49 66.23±1.52 74.67±1.2
ER+GCR 66.66±2.1 80.15±1.17 94.17±0.46 96.47±0.22 30.68±0.47 47.09±1.08 70.25±0.81 78.59±0.5
Table 9: GCR coreset selection with the ER method. Numbers represent mean ± SEM of model test accuracy over 15 runs. Best-performing models in each column are bolded (paired $t$-test).

We also examined whether the gains from our gradient approximation procedure for coreset selection (Section 4.1) depend on the specific loss function we use for CL (Section 4.2). To evaluate this, we enhanced ER [riemer2018learning] (a simple replay-based continual learning procedure that does not use the distillation loss of Section 4.2) with our gradient approximation procedure. The results in Table 9 show that the gains from GCR are robust and apply to other replay-based methods as well. Other baseline methods such as iCARL and GSS have their own specific replay buffer selection methods, unlike ER, which uses random samples, and it was not clear how to add GCR on top of those methods. In any case, our results show that GCR beats those methods by significant margins.

Appendix C Implementation details

C.1 Hyperparameter Search

Offline Class-IL
Method | Buffer | S-Cifar10 | S-Cifar100 | S-Tinyimg
ER | 200 | lr: 0.1 | lr: 0.1 | lr: 0.03
ER | 500 | lr: 0.03 | lr: 0.1 | lr: 0.1
ER | 2000 | lr: 0.1 | lr: 0.1 | lr: 0.03
GEM | 200 | lr: 0.01 γ: 1.0 | lr: 0.03 γ: 0.5 | -
GEM | 500 | lr: 0.01 γ: 0.5 | lr: 0.1 γ: 0.5 | -
GEM | 2000 | lr: 0.1 γ: 0.5 | lr: 0.03 γ: 1.0 | -
GSS | 200 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1
GSS | 500 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1
GSS | 2000 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1
iCARL | 200 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 1e-5
iCARL | 500 | lr: 0.01 wd: 1e-5 | lr: 0.1 wd: 5e-5 | lr: 0.03 wd: 1e-5
iCARL | 2000 | lr: 0.1 wd: 1e-5 | lr: 0.1 wd: 1e-5 | lr: 0.03 wd: 1e-5
DER | 200 | lr: 0.03 α: 0.2 β: 1.0 | lr: 0.03 α: 0.5 β: 0.1 | lr: 0.03 α: 0.2 β: 0.1
DER | 500 | lr: 0.03 α: 0.1 β: 1.0 | lr: 0.03 α: 0.5 β: 0.1 | lr: 0.03 α: 0.2 β: 0.1
DER | 2000 | lr: 0.03 α: 0.2 β: 1.0 | lr: 0.03 α: 0.2 β: 0.1 | lr: 0.03 α: 0.1 β: 0.5
GCR | 200 | lr: 0.03 α: 0.5 β: 0.5 γ: 0.01 | lr: 0.03 α: 0.2 β: 0.1 γ: 0.01 | lr: 0.03 α: 0.5 β: 0.5 γ: 0.01
GCR | 500 | lr: 0.03 α: 0.1 β: 0.1 γ: 0.1 | lr: 0.03 α: 0.1 β: 0.1 γ: 0.01 | lr: 0.03 α: 0.5 β: 0.1 γ: 0.01
GCR | 2000 | lr: 0.03 α: 0.1 β: 1.0 γ: 0.1 | lr: 0.03 α: 0.2 β: 0.1 γ: 0.1 | lr: 0.03 α: 0.2 β: 1.0 γ: 0.01
Offline Task-IL
Method | Buffer | S-Cifar10 | S-Cifar100 | S-Tinyimg
ER | 200 | lr: 0.01 | lr: 0.03 | lr: 0.1
ER | 500 | lr: 0.1 | lr: 0.1 | lr: 0.1
ER | 2000 | lr: 0.03 | lr: 0.1 | lr: 0.03
GEM | 200 | lr: 0.01 γ: 1.0 | lr: 0.1 γ: 0.5 | -
GEM | 500 | lr: 0.03 γ: 0.5 | lr: 0.03 γ: 0.5 | -
GEM | 2000 | lr: 0.03 γ: 0.5 | lr: 0.1 γ: 0.5 | -
GSS | 200 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1
GSS | 500 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1
GSS | 2000 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1
iCARL | 200 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 1e-5
iCARL | 500 | lr: 0.01 wd: 5e-5 | lr: 0.1 wd: 5e-5 | lr: 0.03 wd: 1e-5
iCARL | 2000 | lr: 0.01 wd: 5e-5 | lr: 0.1 wd: 1e-5 | lr: 0.03 wd: 1e-5
DER | 200 | lr: 0.03 α: 0.2 β: 0.1 | lr: 0.03 α: 0.1 β: 0.1 | lr: 0.03 α: 0.1 β: 0.1
DER | 500 | lr: 0.03 α: 0.2 β: 0.5 | lr: 0.03 α: 0.1 β: 0.1 | lr: 0.03 α: 0.1 β: 0.5
DER | 2000 | lr: 0.03 α: 0.2 β: 1.0 | lr: 0.03 α: 0.1 β: 0.5 | lr: 0.03 α: 0.1 β: 0.1
GCR | 200 | lr: 0.03 α: 0.1 β: 0.1 γ: 0.1 | lr: 0.03 α: 0.1 β: 0.1 γ: 0.01 | lr: 0.03 α: 0.1 β: 0.5 γ: 0.01
GCR | 500 | lr: 0.03 α: 0.2 β: 0.5 γ: 0.01 | lr: 0.03 α: 0.1 β: 0.1 γ: 0.05 | lr: 0.03 α: 0.1 β: 1.0 γ: 0.01
GCR | 2000 | lr: 0.03 α: 0.2 β: 0.5 γ: 0.05 | lr: 0.03 α: 0.1 β: 0.1 γ: 0.1 | lr: 0.03 α: 0.1 β: 1.0 γ: 0.01
Online Streaming
Method | Buffer | S-Cifar10 | S-Cifar100 | S-Tinyimg
ER | 200 | lr: 0.03 | lr: 0.01 | lr: 0.1
ER | 500 | lr: 0.01 | lr: 0.01 | lr: 0.01
ER | 2000 | lr: 0.01 | lr: 0.03 | lr: 0.01
GEM | 200 | lr: 0.1 γ: 0.5 | lr: 0.03 γ: 0.5 | lr: 0.03 γ: 0.5
GEM | 500 | lr: 0.03 γ: 1.0 | lr: 0.1 γ: 0.5 | lr: 0.03 γ: 1.0
GEM | 2000 | lr: 0.1 γ: 1.0 | lr: 0.03 γ: 1.0 | lr: 0.03 γ: 1.0
GSS | 200 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.1 gmbs: 32 nb: 1
GSS | 500 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.1 gmbs: 32 nb: 1
GSS | 2000 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.03 gmbs: 32 nb: 1 | lr: 0.1 gmbs: 32 nb: 1
iCARL | 200 | lr: 0.1 wd: 1e-5 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 5e-5
iCARL | 500 | lr: 0.1 wd: 1e-5 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 5e-5
iCARL | 2000 | lr: 0.1 wd: 1e-5 | lr: 0.1 wd: 5e-5 | lr: 0.1 wd: 5e-5
MIR | 200 | lr: 0.03 | lr: 0.03 | lr: 0.03
MIR | 500 | lr: 0.03 | lr: 0.03 | lr: 0.03
MIR | 2000 | lr: 0.03 | lr: 0.03 | lr: 0.03
DER | 200 | lr: 0.03 α: 0.1 β: 1.0 | lr: 0.03 α: 0.5 β: 0.5 | lr: 0.03 α: 0.2 β: 2.0
DER | 500 | lr: 0.01 α: 0.2 β: 2.0 | lr: 0.03 α: 1.0 β: 0.5 | lr: 0.005 α: 1.0 β: 3.0
DER | 2000 | lr: 0.01 α: 0.2 β: 1.0 | lr: 0.01 α: 1.0 β: 2.0 | lr: 0.01 α: 1.0 β: 3.5
GCR | 200 | lr: 0.03 α: 0.2 β: 1.0 γ: 1.0 | lr: 0.005 α: 1.0 β: 3.5 γ: 1.0 | lr: 0.01 α: 1.0 β: 3.0 γ: 0.1
GCR | 500 | lr: 0.005 α: 0.5 β: 3.5 γ: 1.0 | lr: 0.01 α: 0.2 β: 3.0 γ: 1.5 | lr: 0.005 α: 1.0 β: 3.5 γ: 1.0
GCR | 2000 | lr: 0.03 α: 0.1 β: 0.5 γ: 1.5 | lr: 0.01 α: 1.0 β: 2.0 γ: 1.0 | lr: 0.01 α: 1.0 β: 3.0 γ: 1.5
Table 10: Hyperparameter values obtained from the grid search.

Table 10 shows the hyperparameter values selected from the grid search that were used in our experiments.

C.2 Hyperparameter Search Space

Method Parameters Offline Online
ER lr [0.01, 0.03, 0.1] [0.01, 0.03, 0.1]
GEM
lr
GSS lr
iCARL
lr
wd
MIR lr -
DER
lr
[0.03]
GCR
lr
Table 11: Hyperparameter Search Space.

Table 11 shows the hyperparameter search space for the offline and online settings over which the grid search was performed.