Dark Experience for General Continual Learning: a Strong, Simple Baseline

04/15/2020 ∙ by Pietro Buzzega, et al.

Neural networks struggle to learn continuously, as they forget the old knowledge catastrophically whenever the data distribution changes over time. Recently, Continual Learning has inspired a plethora of approaches and evaluation settings; however, the majority of them overlooks the properties of a practical scenario, where the data stream cannot be shaped as a sequence of tasks and offline training is not viable. We work towards General Continual Learning (GCL), where task boundaries blur and the domain and class distributions shift either gradually or suddenly. We address it through Dark Experience Replay, namely matching the network's logits sampled throughout the optimization trajectory, thus promoting consistency with its past. By conducting an extensive analysis on top of standard benchmarks, we show that such a seemingly simple baseline outperforms consolidated approaches and leverages limited resources. To provide a better understanding, we further introduce MNIST-360, a novel GCL evaluation setting.


1 Introduction

In the last decade, the research community has mostly focused on optimizing neural networks over a finite set of shuffled data, testing on unseen examples belonging to the same distribution. Nevertheless, practical applications may face continuously changing data when new classes or tasks emerge, thus violating the i.i.d. property. Ideally, similarly to human beings, such models should acquire new knowledge on-the-fly and prevent the forgetting of past experiences. Still, conventional gradient-based learning paradigms lead to the catastrophic forgetting of the acquired knowledge (McCloskey and Cohen, 1989). As a trivial workaround, one could store all the incoming examples and re-train from scratch when needed. However, this is often impracticable in terms of required resources.

Continual Learning (CL) methods aim at training a neural network from a stream of non i.i.d. samples, relieving catastrophic forgetting whilst limiting computational costs and memory footprint (Rebuffi et al., 2017). With the goal of assessing the merits of these methods, the standard experimental setting involves the creation of different tasks out of the most common image classification datasets (e.g. MNIST, CIFAR-10, etc.), and processing them sequentially. The authors of (Hsu et al., 2018; van de Ven and Tolias, 2018) identify three incremental learning (IL) settings:

Figure 1: Performance vs. memory footprint for Sequential Tiny ImageNet (Class-IL). Given the approach, successive points on the same line indicate an increase in its memory buffer.

Methods        Constant   No task       Online     No test
               memory     boundaries    learning   time oracle
PNN            ✗          ✗             ✓          ✗
PackNet        ✗          ✗             ✗          ✗
HAT            ✗          ✗             ✓          ✗
ER             ✓          ✓             ✓          ✓
MER            ✓          ✓             ✓          ✓
GEM            ✓          ✗             ✓          ✓
A-GEM          ✓          ✗             ✓          ✓
iCaRL          ✓          ✗             ✗          ✓
EWC            ✗          ✗             ✗          ✓
oEWC           ✓          ✗             ✗          ✓
SI             ✓          ✗             ✓          ✓
LwF            ✓          ✗             ✓          ✓
P&C            ✓          ✗             ✗          ✓
DER (ours)     ✓          ✓             ✓          ✓
DER++ (ours)   ✓          ✓             ✓          ✓
Table 1: Continual learning approaches and their compatibility with the General Continual Learning major requirements (De Lange et al., 2019). For an exhaustive discussion, please refer to supplementary materials.
  • Task-Incremental (Task-IL) splits the training samples into partitions of classes (tasks). At test time, task identities are provided to select the relevant classifier for each example, making Task-IL the most straightforward scenario.

  • Class-Incremental (Class-IL) divides the dataset into partitions, as Task-IL does, without providing the task identity at inference time. This is a more challenging setting, as the network must predict both the class and the task (a minimal sketch contrasting the two inference rules follows this list).

  • Domain-Incremental (Domain-IL) feeds all classes to the network during each task, applying a task-dependent transformation on the input. Task identities remain unknown at test time.
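To make the difference between Class-IL and Task-IL inference concrete, the following is a minimal sketch (ours, with illustrative class and task counts, not part of the original protocols) of the two prediction rules applied to the same logits.

```python
import torch

def predict_class_il(logits: torch.Tensor) -> torch.Tensor:
    # Class-IL: no task identity, so argmax over every class the model knows.
    return logits.argmax(dim=1)

def predict_task_il(logits: torch.Tensor, task_id: int,
                    classes_per_task: int = 2) -> torch.Tensor:
    # Task-IL: the given task identity restricts the prediction to that task's classes.
    lo = task_id * classes_per_task
    masked = torch.full_like(logits, float('-inf'))
    masked[:, lo:lo + classes_per_task] = logits[:, lo:lo + classes_per_task]
    return masked.argmax(dim=1)

# Example: 4 examples, 10 classes split into 5 tasks of 2 classes each.
logits = torch.randn(4, 10)
print(predict_class_il(logits))            # any of the 10 classes
print(predict_task_il(logits, task_id=3))  # only classes 6 or 7
```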

Despite being widely adopted in the literature, these protocols do not perfectly fit real-world applications: they assume that one knows when a distribution change occurs, and tasks are made unnaturally different from each other by sharp boundaries. On the contrary, in many practical applications, tasks may interlace in such a way that the search for a boundary becomes infeasible. Moreover, most methods allow multiple iterations over the entire training set, which is not possible in an online stream with bounded memory requirements. Recently, (De Lange et al., 2019) outlined several guidelines for General Continual Learning (GCL) in an attempt to address the issues mentioned above. GCL consists of a stream of examples whose domain and class distributions shift either gradually or suddenly, with no information provided about the task to be performed. Among all the CL methods, only a few completely adhere to the requirements of such a challenging scenario (Table 1).

In this work, we introduce a new baseline to overcome catastrophic forgetting – which we call Dark Experience Replay (DER) – suitable for the GCL setting. DER leverages Knowledge Distillation (Hinton et al., 2015) by matching the logits of a small subset of past samples. Our proposal satisfies the major GCL guidelines: i) online learning: it is able to learn with no additional passes over the training data; ii) no task boundaries: there is no need for a given or inferred boundary between tasks during training; iii) no test time oracle: it does not require task identifiers at inference time; iv) constant memory: even if the number of tasks becomes large, the memory footprint remains bounded while the model is being trained. We notice that DER complies with these requirements without sacrificing performance; in fact, it outperforms most of the current state-of-the-art approaches at the cost of a negligible memory overhead (as shown in Figure 1). Such a claim is justified by extensive experiments encompassing the three standard CL settings.

Finally, we propose MNIST-360, a novel GCL dataset that combines the Class-IL and Domain-IL settings in the absence of explicit task boundaries. It consists of classifying two classes of MNIST digits at a time, subject to a smoothly increasing rotation. By doing so, we are further able to assess the model's ability to classify when the input distribution undergoes both swift and gradual changes.

2 Related Work

Continual Learning strategies have been categorized into three families of methods (Farquhar and Gal, 2018; De Lange et al., 2019): architectural methods, rehearsal-based methods and regularization-based methods. In the following, we review the most relevant works in the field and discuss whether they can be extended to the General Continual Learning setting.

Architectural methods devote distinguished sets of model parameters to distinct tasks. Among these, Progressive Neural Networks (PNN) (Rusu et al., 2016) starts with a single sub-network, incrementally instantiating new ones as novel tasks occur. This methodology avoids forgetting by design, although its memory requirement grows linearly with the number of tasks. To mitigate such an issue, PackNet (Mallya and Lazebnik, 2018) and Hard Attention to the Task (HAT) (Serra et al., 2018) share the same network across subsequent tasks. The former prunes out unimportant weights and then re-trains the model on the current task, whereas the latter exploits task-specific attention masks to avoid interference with the past. Both of them employ a heuristic strategy to prevent intransigence, allocating additional units when needed.

Rehearsal-based methods tackle catastrophic forgetting by replaying a subset of the training data stored in a memory buffer. Early works (Ratcliff, 1990; Robins, 1995) proposed Experience Replay (ER), i.e. interleaving old samples with current data in training batches. In combination with the reservoir strategy (Vitter, 1985), this technique suits the GCL requirements, as it requires no knowledge of task boundaries. Meta-Experience Replay (MER) (Riemer et al., 2019) casts replay as a meta-learning problem to maximize transfer from past tasks while minimizing interference. However, its update rule involves a high number of meta-steps for prioritizing the current example, thus making its computational cost prohibitive in an online scenario. On the other hand, Gradient Episodic Memory (GEM) (Lopez-Paz and Ranzato, 2017) and its lightweight counterpart Averaged-GEM (A-GEM) (Chaudhry et al., 2019) profit from old training data by building optimization constraints to be satisfied by the current update step. However: i) they rely on task boundaries to store examples in the memory buffer; ii) as GEM makes use of Quadratic Programming, its update step suffers from computational limitations similar to MER's. The authors of (Rebuffi et al., 2017) proposed iCaRL, which employs a buffer as a training set for a prototype-based nearest-mean-of-exemplars classifier. To this end, iCaRL prevents the learnt representation from deteriorating in later tasks via a self-knowledge distillation loss term. Although the use of distillation is shared with our proposed baseline, we point out a clear difference: while iCaRL stores a model snapshot at each task boundary, DER plainly retains a subset of network responses for later replay.

Regularization-based methods complement the loss function with additional terms to prevent forgetting. A first family of methods collects gradient information to estimate the importance of each weight; a loss term is then devised to prevent the weights from drifting too far from their previous configuration.

Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) accomplishes this by estimating a Fisher information matrix for each task, resulting in growing memory consumption. Conversely, online EWC (oEWC) (Schwarz et al., 2018) and Synaptic Intelligence (SI) (Zenke et al., 2017) rely on a single estimate. A second line of works leverages knowledge distillation for consolidating previous tasks. Progress & Compress (P&C) (Schwarz et al., 2018) uses an Active Column model to learn the current task. At a later stage, it distills the acquired knowledge to a Knowledge Base model, in charge of preserving information on all previous tasks. Learning Without Forgetting (LwF) (Li and Hoiem, 2017) uses an old snapshot of the model to distill its responses to current examples. As with iCaRL, this approach requires storing network weights at task boundaries, which differentiates it from the baseline we propose.

GCL compliance. On top of requiring an increasing amount of memory as more tasks are considered, architectural methods need explicit knowledge of task boundaries and task identities, thus being unsuitable for the GCL setting. With the exception of EWC, which shares most of the architectural methods' shortcomings, regularization-based strategies require a fixed memory footprint and do not need task identities. Nevertheless, they still depend on the availability of task boundaries to update their weight-importance estimates (oEWC, SI), the Knowledge Base (P&C), or the distillation teacher (LwF). Finally, several rehearsal-based methods rely on task boundaries in order to adopt a task-stratified sampling strategy (iCaRL, GEM, A-GEM).

3 Baseline

In the light of the review proposed in Section 2, we draw the following remarks: on the one hand, Experience Replay provides a versatile and easy-to-implement procedure for addressing the GCL requirements at a bounded memory footprint; on the other hand, regularization-based methods deliver robust strategies to retain old knowledge. Among the latter, as observed in (Li and Hoiem, 2017), Knowledge Distillation proves to be especially efficient, as teacher responses encode richer information while requiring fewer resources. We combine these two methodologies in a straightforward GCL-compliant baseline, which inter alia exhibits strong results in most standard CL settings.

Current distillation-based approaches employ task boundaries to pin a model snapshot as the new teacher, which then provides the regularization objective during the following task. Such a practice is not viable in the absence of sharp task boundaries, which characterizes the setting we focus on. With this in mind, we do not model the teacher in terms of its parameters at the end of a task, but rather retain its outputs (soft labels). Namely: i) we store the network responses along with their corresponding examples for replaying purposes; ii) we do it throughout the task, adopting a reservoir strategy. Although the methods above appoint teachers at their local optimum, we find that our procedure does not hurt performance. Moreover – even if counter-intuitive – we experimentally observe that sampling the soft labels along the optimization trajectory delivers an interesting regularization effect (see Section 4.5). We call the proposed baseline Dark Experience Replay (DER) due to its twofold nature: it relies on dark knowledge (Hinton et al., 2014) for distilling past memories and it remains in line with ER, leveraging past experiences sampled over the entire training trajectory.

3.1 Dark Experience Replay

Formally, a CL classification problem is split into $T$ tasks; during each task $t \in \{1, \dots, T\}$, input samples $x$ and their corresponding ground-truth labels $y$ are drawn from an i.i.d. distribution $D_t$. A function $f$, with parameters $\theta$, is optimized on one task at a time in a sequential manner. We indicate the output logits with $h_\theta(x)$ and the corresponding probability distribution over the classes with $f_\theta(x) \triangleq \operatorname{softmax}(h_\theta(x))$. At any given point in training, the goal is to learn how to correctly classify examples for the current task $t_c$, while retaining good performance on the previous ones. This corresponds to optimizing the following objective:

(1)   $\min_{\theta} \sum_{t=1}^{t_c} \mathcal{L}_t, \quad \text{where} \quad \mathcal{L}_t \triangleq \mathbb{E}_{(x, y) \sim D_t} \big[ \ell\big(y, f_\theta(x)\big) \big]$

This is especially challenging as data from previous tasks are assumed to be unavailable, meaning that the best configuration of $\theta$ with respect to $\mathcal{L}_{t_c}$ must be sought without $D_t$ for $t \in \{1, \dots, t_c - 1\}$. Ideally, we look for parameters that fit the current task well whilst approximating the behavior observed in the old ones: effectively, we encourage the network to mimic its original responses for past samples. To preserve the knowledge about previous tasks, we seek to minimize the following objective:

(2)   $\mathcal{L}_{t_c} + \alpha \sum_{t=1}^{t_c - 1} \mathbb{E}_{x \sim D_t} \big[ D_{\mathrm{KL}}\big( f_{\theta_t^*}(x) \,\|\, f_\theta(x) \big) \big]$

where $\theta_t^*$ is the optimal set of parameters at the end of task $t$, and $\alpha$ is a hyper-parameter balancing the trade-off between the two terms. This objective, which resembles the teacher-student approach, would require the availability of $D_t$ for previous tasks. To overcome such a limitation, we introduce a replay buffer $\mathcal{M}_t$ holding past experiences for task $t$. Differently from other rehearsal-based methods (Riemer et al., 2019), we retain the network's logits $z \triangleq h_\theta(x)$, instead of the ground-truth labels $y$.

(3)   $\mathcal{L}_{t_c} + \alpha \sum_{t=1}^{t_c - 1} \mathbb{E}_{(x, z) \sim \mathcal{M}_t} \big[ D_{\mathrm{KL}}\big( \operatorname{softmax}(z) \,\|\, f_\theta(x) \big) \big]$

As we focus on General Continual Learning, we intentionally avoid relying on task boundaries to populate the buffer as the training progresses. Therefore, in place of the common task-stratified sampling strategy, we adopt reservoir sampling (Vitter, 1985): this way, we select samples at random from the input stream, guaranteeing that they have the same probability of being stored in the buffer, without knowing the length of the stream in advance. We can rewrite Eq. 3 as follows:

(4)   $\mathcal{L}_{t_c} + \alpha\, \mathbb{E}_{(x, z) \sim \mathcal{M}} \big[ D_{\mathrm{KL}}\big( \operatorname{softmax}(z) \,\|\, f_\theta(x) \big) \big]$

It is worth noting that this strategy implies picking logits during the optimization trajectory, and thus potentially far from the ones that would be observed at the task's local optimum. Although this may appear counter-intuitive, it does not hurt performance and, surprisingly, it makes the optimization less prone to overfitting (see Section 4.5). Finally, we characterize the KL divergence in Eq. 4 in terms of the Euclidean distance between the corresponding logits. This adheres to (Hinton et al., 2015); indeed, matching logits is a special case of distillation that avoids the information loss occurring in probability space due to the squashing function (e.g. softmax) (Liu et al., 2018). With these considerations in hand, Dark Experience Replay (DER) optimizes the following objective:

(5)   $\mathcal{L}_{t_c} + \alpha\, \mathbb{E}_{(x, z) \sim \mathcal{M}} \big[ \big\| z - h_\theta(x) \big\|_2^2 \big]$

In our experiments, we approximate the expectation in Eq. 5 by computing gradients on a mini-batch sampled from the replay buffer.
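To make the buffer update concrete, the following is a minimal sketch (ours, not the authors' reference implementation) of reservoir sampling (Vitter, 1985) applied to buffer items: for DER an item would be an (example, logits) pair, for DER++ an (example, logits, label) triplet. Names such as `ReservoirBuffer` are illustrative.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer whose contents are a uniform sample of the stream."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []        # stored items, e.g. (x, logits) pairs for DER
        self.num_seen = 0     # stream examples observed so far

    def add(self, item):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            # every stream example ends up stored with probability capacity / num_seen
            idx = random.randrange(self.num_seen)
            if idx < self.capacity:
                self.data[idx] = item

    def sample(self, k: int):
        # mini-batch drawn from the buffer to approximate the expectation in Eq. 5
        return random.sample(self.data, min(k, len(self.data)))
```

Because the insertion probability decays as capacity / num_seen, neither the stream length nor any task boundary needs to be known in advance.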

3.2 Dark Experience Replay++

It is worth noting that the reservoir strategy may weaken DER under some specific circumstances. Namely, when a distribution shift occurs in the input stream, logits that are highly biased by the training on previous tasks might be sampled for later replay. When the distribution shift is sudden, leveraging the ground truth labels could mitigate such a shortcoming. On these grounds, we additionally propose Dark Experience Replay++ (DER++), which equips the objective of Eq. 5 with an additional term on buffer datapoints, promoting higher conditional likelihood w.r.t. their ground truth labels:

(6)   $\mathcal{L}_{t_c} + \alpha\, \mathbb{E}_{(x', z') \sim \mathcal{M}} \big[ \big\| z' - h_\theta(x') \big\|_2^2 \big] + \beta\, \mathbb{E}_{(x'', y'', z'') \sim \mathcal{M}} \big[ \ell\big(y'', f_\theta(x'')\big) \big]$

where $\beta$ is an additional coefficient balancing the last term (DER++ collapses to DER when $\beta = 0$). We provide in Algorithm 1 a description of DER and DER++.

  Input: dataset D, parameters θ, scalars α and β, learning rate λ
  M ← {}
  for D_t in D do
     for (x, y) in D_t do
        (x′, z′, y′) ← sample(M)
        (x″, z″, y″) ← sample(M)
        z ← h_θ(x)
        reg ← α · ‖z′ − h_θ(x′)‖₂²                      ▷ DER update rule
        reg ← reg + β · ℓ(y″, f_θ(x″))                  ▷ DER++ update rule (β = 0 recovers DER)
        θ ← θ − λ · ∇_θ [ℓ(y, f_θ(x)) + reg]
        M ← reservoir(M, (x, y, z))
     end for
  end for

Algorithm 1 Dark Experience Replay (++)
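As a complement to Algorithm 1, the snippet below is a hedged PyTorch sketch of a single DER++ step; it reuses the ReservoirBuffer sketched in Section 3.1 (storing (example, logits, label) triplets) and is not the authors' reference code, which is available at https://github.com/hastings24/mammoth. Hyper-parameter values are placeholders.

```python
import torch
import torch.nn.functional as F

def derpp_step(model, optimizer, buffer, x, y, alpha=0.5, beta=0.5, minibatch_size=32):
    """One DER++ update (Eq. 6) on a stream batch (x, y); beta=0 yields plain DER."""
    optimizer.zero_grad()
    logits = model(x)                                  # h_theta(x) on the stream batch
    loss = F.cross_entropy(logits, y)                  # current-task term

    if len(buffer.data) >= minibatch_size:
        # alpha-term: L2 matching of the stored logits (dark knowledge).
        xb, zb, _ = (torch.stack(t) for t in zip(*buffer.sample(minibatch_size)))
        loss = loss + alpha * F.mse_loss(model(xb), zb)
        # beta-term: cross-entropy w.r.t. ground-truth labels of a second draw.
        xb2, _, yb2 = (torch.stack(t) for t in zip(*buffer.sample(minibatch_size)))
        loss = loss + beta * F.cross_entropy(model(xb2), yb2)

    loss.backward()
    optimizer.step()

    # reservoir insertion of examples together with their pre-update logits
    for xi, zi, yi in zip(x, logits.detach(), y):
        buffer.add((xi, zi, yi))
    return loss.item()
```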

4 Experiments

In this section, we draw a comprehensive comparison between our proposed baselines and several recent methods presented in Section 2. In more detail: i) we introduce a set of well-known Continual Learning benchmarks in Section 4.1; ii) we define the evaluation protocol in Section 4.2; iii) we present an extensive performance analysis on the aforementioned benchmarks in Section 4.3. Moreover, in Section 4.4, we propose a novel benchmark (MNIST-360) aimed at modeling the peculiarities of General Continual Learning, and use it to assess the suitable methods: to the best of our knowledge, this is the first setting catering to the GCL desiderata presented in Section 1. Finally, in Section 4.5 we focus on the features of our proposal in terms of qualitative properties and resource consumption.

4.1 Benchmarks

As highlighted in Section 1, typical Continual Learning experiments fall into Task, Class and Domain-IL. In the following, we introduce the standard benchmarks we employ to draw a clear comparison between the proposed baselines and other well-known continual learning methods.

We conduct experiments on the Domain-Incremental scenario, leveraging two common protocols built upon the MNIST dataset (LeCun et al., 1998), namely Permuted MNIST (Kirkpatrick et al., 2017) and Rotated MNIST (Lopez-Paz and Ranzato, 2017). They both require the learner to classify all MNIST digits for subsequent tasks. To differentiate the tasks, the former applies a task-specific random permutation to the pixels, whereas the latter rotates the images by a random angle in the interval $[0, \pi)$.

Secondly, for the Class-Incremental and Task-Incremental scenarios, we make use of three increasingly harder datasets, presenting samples belonging to different classes from MNIST, CIFAR-10 (Krizhevsky and others, 2009) and Tiny ImageNet (Stanford, 2015) in a progressive fashion. In more detail, Sequential MNIST and Sequential CIFAR-10 consist of 5 tasks, each of which introduces two new classes; Sequential Tiny ImageNet presents 20 distinct classes in each of its 10 tasks. In each setting, we show the same classes in the same fixed order across runs.
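For clarity, the sketch below (ours, illustrative only, not the benchmark code) shows how such a class-incremental split can be built from a labeled dataset; the per-task class counts follow the description above.

```python
# Illustrative sketch of a Class-/Task-IL split: classes are partitioned into
# fixed, ordered groups and each group forms one task of the stream.
def make_sequential_tasks(targets, num_tasks, classes_per_task):
    """targets: iterable of integer labels; returns one index list per task."""
    tasks = []
    for t in range(num_tasks):
        task_classes = set(range(t * classes_per_task, (t + 1) * classes_per_task))
        tasks.append([i for i, y in enumerate(targets) if y in task_classes])
    return tasks

# e.g. Sequential CIFAR-10: 5 tasks of 2 classes each;
# Sequential Tiny ImageNet: 10 tasks of 20 classes each (200 classes in total).
```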

Method S-MNIST S-CIFAR-10 S-Tiny-ImageNet P-MNIST R-MNIST
Class-IL Task-IL Class-IL Task-IL Class-IL Task-IL Domain-IL Domain-IL
SGD
oEWC
SI
LwF
PNN - - - - -
ER
MER - - - - - -
GEM - -
A-GEM
iCaRL - -
DER
DER++
Table 2: Classification results for standard CL benchmarks, averaged across ten runs. For rehearsal methods, we report performance with a fixed memory buffer size. '-' indicates experiments we were unable to run, because of compatibility issues (e.g. between PNN and the Domain-IL setting) or intractable training time (e.g. MER or GEM on Tiny ImageNet).

4.2 Evaluation Protocol

Architecture. Following (Lopez-Paz and Ranzato, 2017; Riemer et al., 2019) – for tests conducted on variants of the MNIST dataset – we employ a fully-connected network with two hidden layers of ReLU units. Differently, as done in (Rebuffi et al., 2017), we rely on ResNet18 (He et al., 2016) (not pre-trained) for CIFAR-10 and Tiny ImageNet.
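A minimal sketch of such an MLP backbone follows; the hidden width of 100 units is our assumption (borrowed from the common GEM/MER setup), since the exact figure is not repeated here.

```python
import torch.nn as nn

def mnist_mlp(hidden: int = 100, num_classes: int = 10) -> nn.Sequential:
    # Two hidden layers of ReLU units on flattened 28x28 MNIST images;
    # a single linear head exposes one logit per class.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes),
    )
```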

Hyperparameter selection.

We select hyperparameters by performing a grid-search on a validation set, the latter obtained by sampling a portion of the training set. For S-MNIST, S-CIFAR-10 and S-Tiny ImageNet, we choose as best configuration the one achieving the highest average classification accuracy on all tasks across both the Class-IL and Task-IL settings, as follows:

(7)   $\operatorname*{arg\,max}_{\text{config}} \; \tfrac{1}{2}\big(A_{\text{Class-IL}} + A_{\text{Task-IL}}\big), \quad \text{where } A_{s} \text{ denotes the accuracy averaged over all tasks in setting } s$

This does not apply to the Domain-IL setting, where we make use of the final average accuracy as the selection criterion. We always adopt the aforementioned scheme to ensure a fair comparison among all competitors and the proposed baselines. Please refer to the supplementary materials for a detailed characterization of the hyperparameter grids we explored along with the chosen configurations.

Training.

Each task lasts one epoch in all MNIST-based settings; conversely, due to their increased complexity w.r.t. MNIST, we train for multiple epochs per task on Sequential CIFAR-10 and Sequential Tiny ImageNet. As regards batch size, we deliberately hold it out of the hyperparameter space, thus avoiding the flaw of a variable number of update steps for different methods. For the same reason, we train all the networks using the same optimization algorithm, namely Stochastic Gradient Descent (SGD).

4.3 Experimental Results

In this section, we compare our baselines, DER and DER++ (we make the code available at https://github.com/hastings24/mammoth), against three regularization-based methods (oEWC, SI, LwF), one architectural method (PNN) and five rehearsal-based methods (ER, MER, GEM, A-GEM, iCaRL). We further provide a lower bound, consisting of SGD without any countermeasure to catastrophic forgetting. Performance has been measured in terms of average accuracy at the end of all tasks, which we in turn average across ten runs (each one involving a different random initialization).

Comparison with regularization and architectural methods. As shown in Table 2, DER and DER++ achieve state-of-the-art performance in almost all settings, proving reliable in different contexts. We achieve such robustness without compromising the GCL-related intents we declared in Section 1: in fact, we do not rely on the hypothesis that the data stream resembles a sequence of tasks, which is crucial for these two families of methods.

When compared to oEWC and SI, the gap appears unbridgeable, suggesting that regularization towards old sets of parameters does not suffice to prevent forgetting. We conjecture that this is due to a twofold reason: on the one hand, local information modeling weights importance – computed in earlier tasks – could become untrustworthy in later ones; on the other hand, matching logits does not preclude moving towards better configurations for past tasks.

Figure 2: Performance (accuracy) for rehearsal methods, showing the impact of different memory sizes (colors) on various benchmarks.

LwF and PNN prove to be effective in the Task-IL setting, for which they are specifically designed; however, they overtake the proposed baselines on Sequential Tiny ImageNet only. It is worth noting that LwF suffers severe failures in Class-IL, which appears harder than the former (Hsu et al., 2018). In this case, merely relying on a memory buffer appears a more robust strategy, as demonstrated by the results achieved by Experience Replay and its spinoffs (MER, DER, and DER++). Moreover, PNN is outright unsuited for Domain-IL and Class-IL: since it instantiates a new subnetwork for each incoming task, it is not able to handle distribution shifts within the classes, nor can it decide which subnetwork to use in prediction without task identifiers.

Comparison with rehearsal methods. Figure 2 focuses on rehearsal-based methods and the benefits of a larger memory buffer (we refer the reader to the supplementary materials for a tabular representation of these results). As discussed above, DER and DER++ show strong performance in the majority of benchmarks, especially when dealing with small memory buffers. Interestingly, this holds especially for Domain-IL, where they outperform other approaches by a large margin at any buffer size. For these problems, a shift occurs within the input domain, but not within the classes; hence, the relationships among classes are also likely to persist. As an example, if during the first task 2s visually look like 3s, this still holds when applying rotations or permutations, as is done in the following tasks. In this regard, we argue that leveraging soft targets in place of hard ones (as done by ER) carries more valuable information (Hinton et al., 2015), which DER and DER++ exploit to preserve the similarity structure throughout the data stream.

The gap in performance we observe in Domain-IL lessens when looking at Task-IL results: here, the proposed baselines perform slightly better than ER (Tiny ImageNet) or slightly worse (CIFAR-10). The underlying reason for such a difference w.r.t. the previous setting may be found in the way classes are observed. Indeed, as classes are presented a few at a time in exclusive subsets, the above-mentioned similarity structures arise gradually as the tasks progress, unlike Domain-IL, where an overall view is available from the first task.

Interestingly, iCaRL achieves superior performance on Sequential Tiny ImageNet (200 classes) in the presence of a minimal buffer (200 samples). We believe that the reason mainly lies in the way its buffer is built. In fact, iCaRL's herding strategy ensures that all classes are equally represented in the memory buffer: in this case, each class is accordingly represented by one sample in the replay memory. On the other hand, reservoir sampling might omit some of them (in Tiny ImageNet, we experimentally verified that a reservoir buffer of size 200 misses a considerable fraction of the classes on average), thus potentially leading to a degradation in performance. However: i) the benefits of herding fade when increasing the buffer size; ii) it needs an additional forward pass on the last observed task, thus making it unsuitable in an online scenario; iii) most importantly, it heavily depends on task boundaries.
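A back-of-the-envelope check (ours, not from the paper) of why a tiny reservoir buffer can miss classes: assuming the buffer contents behave like a uniform sample of a class-balanced stream, a given class ends up with no exemplar with probability roughly $(1 - 1/C)^m$ for $C$ classes and buffer size $m$.

```python
# Expected number of Tiny ImageNet classes with no exemplar in a reservoir
# buffer, under the uniform-sampling approximation described above.
C, m = 200, 200
p_missing = (1 - 1 / C) ** m           # probability that a given class is absent
print(f"expected missing classes: {C * p_missing:.1f} of {C}")   # ~73.4
```

Herding, by construction, keeps at least one exemplar per class, which is consistent with the gap observed at the smallest buffer size.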

4.4 MNIST-360

To address the General Continual Learning desiderata, we propose a novel protocol, namely MNIST-360. It models a stream of data presenting batches of two consecutive MNIST digits at a time (e.g. (0, 1), (1, 2), etc.), as depicted in Figure 3, switching the lesser of the two digits with the following one after a fixed number of steps. To model a continuous drift in the underlying distribution, we make sure that each example of the stream is rotated by an increasing angle w.r.t. the previous one. As it is impossible to distinguish sixes and nines upon rotation, we do not use nines in MNIST-360. The stream visits the nine possible couples of classes three times, allowing the model to leverage positive transfer when revisiting a previous task. In the implementation, we guarantee that: i) each example is shown once during the overall training; ii) two digits of the same class are never observed under the same rotation. We provide a detailed description of the design of the training and test sets in the supplementary materials.

Methods Buffer 200 Buffer 500 Buffer 1000
SGD
ER
MER
A-GEM-R
DER
DER++
Table 3: Performance of rehearsal methods on MNIST-360, in terms of classification accuracy on the test set.

It is worth noting that such a setting presents both sharp (change in class) and smooth (rotation) distribution shifts, making it hard for the algorithms to identify explicit shifts in the data distribution. According to the analysis provided in Table 1, the suitable methods for this benchmark are ER and MER, as well as A-GEM when equipped with a reservoir memory buffer (A-GEM-R). We compare these approaches with DER and DER++, reporting the results in Table 3 (we keep the same fully-connected network used for the other MNIST-based benchmarks). As can be seen, DER and DER++ outperform the other approaches by a large margin in such a challenging scenario. In summary, the analysis above supports the effectiveness of the proposed baselines against alternative replay methods. Due to space constraints, we refer the reader to the supplementary materials for an additional evaluation regarding the memory footprint.

Figure 3: Example batches of the MNIST-360 stream.

4.5 Model analysis

Figure 4: Trends of the weighted cumulative sum of the energy of the Fisher matrix, computed on S-MNIST for different buffer sizes. Lower is better. For the sake of clarity, we report it in logarithmic scale and split the analysis as follows: (a) and (b) highlight the comparison between ER and DER(++), whilst (c) and (d) focus on taking logits at the end of each task.

On the optimization trajectory. We investigate the reasons behind the regularizing capabilities of DER and DER++, conjecturing that these partially emerge from matching logits along the optimization trajectory. We argue that such an approach favorably affects both the informativeness of the soft targets we replay in later tasks and the model's generalization. Following this intuition, we analyze the training in terms of the weighted cumulative sum of the energy of the Fisher matrix after each update, as described in (Liao et al., 2018). Figure 4 depicts the trend of this approximation as the training on Sequential MNIST (Class-IL) progresses. As regards the comparison with ER, (a) and (b) show that its measure increases at a faster rate than that of DER(++). According to (Liao et al., 2018), this indicates that the training process leaves the optimal convergence/generalization region, thus empirically confirming the effectiveness of soft labels in continual learning scenarios. On the other hand, (c) and (d) compare sampling logits along the optimization trajectory (as done by DER(++)) against storing them at task boundaries only. Interestingly, we observe higher generalization capabilities when exploiting reservoir sampling.
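One plausible way to compute a quantity of this kind is sketched below (ours); the diagonal Fisher approximation and the decay weighting are our assumptions, so refer to Liao et al. (2018) for the exact definition used in Figure 4.

```python
def empirical_fisher_energy(model) -> float:
    # Trace of a diagonal empirical Fisher estimate: the sum of squared
    # parameter gradients currently stored in .grad (call after loss.backward()).
    return sum(p.grad.pow(2).sum().item()
               for p in model.parameters() if p.grad is not None)

def weighted_cumulative_sum(energies, decay=0.99):
    # Decayed running sum of the per-step energies: a lower curve suggests the
    # optimizer stays closer to a well-generalizing region.
    total, curve = 0.0, []
    for e in energies:
        total = decay * total + e
        curve.append(total)
    return curve
```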

Figure 5: Wall-clock time for various rehearsal methods, measured on P-MNIST and S-MNIST.

On training time. One of our goals regards the feasibility of our proposal in a practical scenario. When dealing with a data stream, one often cares about reducing the overall processing time: otherwise, training would not keep up with the rate at which data are made available by the stream. In this regard, we assess the performance of DER, DER++ and other rehearsal methods in terms of wall-clock time (seconds) at the end of the last task. To guarantee a fair comparison, we conduct all tests under the same conditions, running each benchmark on a GeForce RTX 2080. Looking at Figure 5, which reports the execution time we measured on P-MNIST and S-MNIST, we draw the following remarks: i) thanks to their lightweight update rule, DER and DER++ are faster than GEM and MER; ii) the overhead of the proposed baselines over methods like ER and iCaRL appears negligible, in the face of the improvements in mitigating catastrophic forgetting.

5 Conclusions

In this paper, we introduce Dark Experience Replay: a simple baseline for Continual Learning, which leverages Knowledge Distillation for retaining past experience and therefore avoiding catastrophic forgetting. We demonstrate the effectiveness of our proposal through an extensive experimental analysis, carried out on top of standard benchmarks. Given the results observed in Section 4.5, it emerges that Dark Experience provides a stronger regularization effect than plain rehearsal. At the same time, experiments highlight DER(++) as a strong baseline, which should not be ignored when assessing novel Continual Learning methods.

On the other hand, we designed it by looking ahead and adhering to the guidelines of General Continual Learning. Although only recently formalized, we argue that these guidelines lay the foundation for advances in diverse applications, namely those characterized by continuous streams of data, where relying on task switches and performing multiple passes over the training data are not viable options. We therefore contribute MNIST-360 – a novel benchmark that encompasses the peculiarities of General Continual Learning – thus enabling future experimental studies on it.

References

  • A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019) Efficient lifelong learning with a-gem. In International Conference on Learning Representations, Cited by: §2.
  • M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2019) Continual learning: a comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383. Cited by: Table 1, §1, §2.
  • S. Farquhar and Y. Gal (2018) Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
  • G. Hinton, O. Vinyals, and J. Dean (2014) Dark knowledge. Presented as the keynote in BayLearn 2. Cited by: §3.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NeurIPS Deep Learning and Representation Learning Workshop, External Links: Link Cited by: §1, §3.1, §4.3.
  • Y. Hsu, Y. Liu, A. Ramasamy, and Z. Kira (2018) Re-evaluating continual learning scenarios: a categorization and case for strong baselines. In NeurIPS Continual learning Workshop, External Links: Link Cited by: §1, §4.3.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §2, §4.1.
  • A. Krizhevsky et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. Cited by: §2, §3.
  • Z. Liao, T. Drummond, I. Reid, and G. Carneiro (2018) Approximate fisher information matrix to characterize the training of deep neural networks. IEEE transactions on pattern analysis and machine intelligence 42 (1), pp. 15–26. Cited by: §4.5.
  • X. Liu, X. Wang, and S. Matwin (2018) Improving the interpretability of deep neural networks with knowledge distillation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 905–912. Cited by: §3.1.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §2, §4.1, §4.2.
  • A. Mallya and S. Lazebnik (2018) Packnet: adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 7765–7773. Cited by: §2.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • R. Ratcliff (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.. Psychological review 97 (2), pp. 285. Cited by: §2.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Cited by: Appendix B, Appendix B, §1, §2, §4.2.
  • M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations, Cited by: §2, §3.1, §4.2.
  • A. Robins (1995) Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7 (2), pp. 123–146. Cited by: §2.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §2.
  • J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. In International Conference on Machine Learning, Cited by: §2.
  • J. Serra, D. Suris, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, Cited by: §2.
  • Stanford (2015) Tiny ImageNet Challenge, Course CS231n. Note: https://tiny-imagenet.herokuapp.com/ Cited by: §4.1.
  • G. M. van de Ven and A. S. Tolias (2018) Three continual learning scenarios. NeurIPS Continual Learning Workshop. Cited by: §1.
  • J. S. Vitter (1985) Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 11 (1), pp. 37–57. Cited by: §2, §3.1.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning, Cited by: §2.

Appendix A Justification of Table 1

Below we provide a justification for each mark of Table 1:

  • Constant Memory

    • Architectural methods increase the model size linearly with respect to the number of tasks. In more detail, PackNet and HAT need a boolean and float mask respectively, while PNN devotes a whole new network to each task.

    • Regularization methods usually require storing up to two sets of parameters, thus respecting the constant memory constraint; EWC represents the exception, as it uses both a set of parameters and a checkpoint for each task.

    • Rehearsal methods need to store a memory buffer of a fixed size.

  • No Task Boundaries

    • Architectural methods are designed under the notion of sharp task boundaries: they need to know exactly when a task finishes in order to update the model.

    • Regularization methods exploit the task change to take a snapshot of the network, using it either to constrain drastic changes for the most important weights (EWC, oEWC, SI) or to distill the old responses (P&C, LwF).

    • Rehearsal methods such as GEM and iCaRL require task boundaries for both their buffer update rule and their regularization strategy. The former needs to know from which task the example is drawn to solve the quadratic problem; the latter saves a copy of the model to perform distillation. Both update their buffer at the end of each task. A-GEM also depends on them for its buffer update strategy only. This means that simply adopting the reservoir update rule (as done in section 4.4 with A-GEM-R) solves this issue. ER, MER, DER and DER++ do not rely on task boundaries at all.

  • Online Learning

    • Among the architectural methods, PackNet re-trains the network (offline) at the end of each task.

    • Only some regularization methods require costly procedures at the end of a task: oEWC and EWC need to pass over the whole training set to compute the weights' importance; P&C needs to distill the knowledge from its Active Column into the Knowledge Base, using the training set as a transfer set.

    • Generally, rehearsal methods do not need such a restriction, but iCaRL's herding strategy – used to populate the memory buffer – requires computing the features of each example at the end of the task.

  • No test time oracle

    • Architectural methods need to know the task label to modify the model accordingly before they make any prediction.

    • Regularization methods and rehearsal methods can perform with no information about the task.

Appendix B Further Details on Training

Further details on iCaRL. Although iCaRL [Rebuffi et al., 2017] was initially proposed for the Class-IL setting, we make it possible to use it for Task-IL as well by introducing a modification of its classification rule. Let $\mu_y$ be the average feature vector of the exemplars for class $y$ and $\phi(x)$ be the feature vector computed on example $x$; iCaRL predicts a label $y^*$ as

(8)   $y^* = \operatorname*{arg\,min}_{y} \big\| \phi(x) - \mu_y \big\|$

Instead, given the tensor $\mu$ of average feature vectors for all classes, we formulate iCaRL's network response as

(9)   $h_y(x) = -\big\| \phi(x) - \mu_y \big\| \quad \text{for every class } y$

Considering the argmax of $h(x)$ without masking (Class-IL setting) results in the same prediction as Eq. 8.
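The following is a minimal PyTorch sketch (ours) of this modified response: scores are the negative distances to the class means, so an unmasked argmax reproduces the nearest-mean-of-exemplars rule, while Task-IL masking restricts the prediction to the classes of the given task.

```python
import torch

def icarl_scores(features: torch.Tensor, class_means: torch.Tensor) -> torch.Tensor:
    # features: (batch, d); class_means: (num_classes, d)
    # negative L2 distance to each class mean, one score per class (Eq. 9)
    return -torch.cdist(features, class_means)

def icarl_predict(features, class_means, task_classes=None):
    scores = icarl_scores(features, class_means)
    if task_classes is not None:               # Task-IL: mask out other tasks' classes
        masked = torch.full_like(scores, float('-inf'))
        masked[:, task_classes] = scores[:, task_classes]
        scores = masked
    return scores.argmax(dim=1)                # unmasked argmax matches Eq. 8 (Class-IL)
```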

It is also worth noting that iCaRL exploits a weight-decay regularization term (wd_reg), as suggested in [Rebuffi et al., 2017], in order to make its performance competitive with the other proposed approaches.

Data Preparation. In our experiments, we apply no pre-processing to the MNIST datasets. Conversely, we normalize CIFAR-10 and Tiny ImageNet w.r.t. the mean and standard deviation of their training sets and further augment their examples by means of random horizontal flips and random crops.
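An illustrative torchvision pipeline matching this description is given below; the crop padding and the normalization statistics shown are common CIFAR-10 defaults and should be read as assumptions rather than the exact values used in our experiments.

```python
from torchvision import transforms

cifar10_train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # random crops
    transforms.RandomHorizontalFlip(),         # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # train-set channel means
                         (0.2470, 0.2435, 0.2616)),  # train-set channel stds
])

mnist_tf = transforms.ToTensor()               # no further pre-processing for MNIST
```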

Hyperparameter search. In Table 4, we show the best hyperparameter combination chosen for each method for the experiments in the main paper, according to the criteria outlined in Section 4.2. A list of all the parameter combinations that were considered is featured in Table 5. We denote the learning rate with lr, the batch size with bs and the minibatch size (i.e. the size of the batches drawn from the buffer in rehearsal-based methods) with mbs, while other symbols refer to the respective methods. We deliberately hold the batch size out of the hyperparameter search space for all Continual Learning benchmarks so as to share the same number of updates across all methods. Its values are fixed as follows:


  • Sequential MNIST: batch size ;

  • Sequential CIFAR-10, Sequential Tiny ImageNet: batch size ;

  • Permuted MNIST, Rotated MNIST: batch size .

Conversely, the batch size belongs to the hyperparameter search space for experiments on the novel MNIST-360 dataset. It must be noted that MER does not depend on this parameter, as it internally always adopts a single-example forward pass.

Method Buffer Permuted MNIST Buffer Rotated MNIST
SGD lr: lr:
oEWC lr:    :    : lr:    :    :
SI lr:    c:    : lr:    c:    :
LwF lr:    :    T: lr:    :    T:
ER 200 lr:    mbs: 200 lr:    mbs:
500 lr:    mbs: 500 lr:    mbs:
5120 lr:    mbs: 5120 lr:    mbs:
GEM 200 lr:    : 200 lr:    :
500 lr:    : 500 lr:    :
5120 lr:    : 5120 lr:    :
A-GEM 200 lr:    mbs: 200 lr:    mbs:
500 lr:    mbs: 500 lr:    mbs:
5120 lr:    mbs: 5120 lr:    mbs:
DER 200 lr:    mbs:    : 200 lr:    mbs:    :
500 lr:    mbs:    : 500 lr:    mbs:    :
5120 lr:    mbs:    : 5120 lr:    mbs:    :
DER++ 200 lr:    mbs:    :    : 200 lr:    mbs:    :    :
500 lr:    mbs:    :    : 500 lr:    mbs:    :    :
5120 lr:    mbs:    :    : 5120 lr:    mbs:    :    :
Method Buffer Sequential MNIST Buffer Sequential CIFAR-10
SGD lr: lr:
oEWC lr:    :    : lr:    :    :
SI lr:    c:    : lr:    c:    :
LwF lr:    :    T: lr:    :    T:
PNN lr: lr:
ER 200 lr:    mbs: 200 lr:    mbs:
500 lr:    mbs: 500 lr:    mbs:
5120 lr:    mbs: 5120 lr:    mbs:
Method Buffer Sequential MNIST Buffer Sequential CIFAR-10
MER 200 lr:    mbs:    :    :    nb:
500 lr:    mbs:    :    :    nb:
5120 lr:    mbs:    :    :    nb:
GEM 200 lr:    : 200 lr:    :
500 lr:    : 500 lr:    :
5120 lr:    : 5120 lr:    :
A-GEM 200 lr:    mbs: 200 lr:    mbs:
500 lr:    mbs: 500 lr:    mbs:
5120 lr:    mbs: 5120 lr:    mbs:
iCaRL 200 lr:   mbs:   T:   wd_reg: 200 lr:   mbs:   T:   wd_reg:
500 lr:   mbs:   T:   wd_reg: 500 lr:   mbs:   T:   wd_reg:
5120 lr:   mbs:   T:   wd_reg: 5120 lr:   mbs:   T:   wd_reg:
DER 200 lr:    mbs:    : 200 lr:    mbs:    :
500 lr:    mbs:    : 500 lr:    mbs:    :
5120 lr:    mbs:    : 5120 lr:    mbs:    :
DER++ 200 lr:    mbs:    :    : 200 lr:    mbs:    :    :
500 lr:    mbs:    :    : 500 lr:    mbs:    :    :
5120 lr:    mbs:    :    : 5120 lr:    mbs:    :    :
Method Buffer Sequential Tiny ImageNet Buffer MNIST-360
SGD lr: lr:    bs:
oEWC lr:    :    :
SI lr:    c:    :
LwF lr:    :    T:
PNN lr:
ER 200 lr:    mbs: 200 lr:    bs:    mbs:
500 lr:    mbs: 500 lr:    bs:    mbs:
5120 lr:    mbs: 1000 lr:    bs:    mbs:
MER 200 lr:    mbs:    :    :    nb:
500 lr:    mbs:    :    :    nb:
1000 lr:    mbs:    :    :    nb:
A-GEM 200 lr:    mbs: 200 lr:    bs:    mbs:
500 lr:    mbs: 500 lr:    bs:    mbs:
5120 lr:    mbs: 1000 lr:    bs:    mbs:
iCaRL 200 lr:   mbs:   T:   wd_reg:
500 lr:   mbs:   T:   wd_reg:
5120 lr:   mbs:   T:   wd_reg:
DER 200 lr:   mbs:   T:   : 200 lr:    bs:    mbs:    :
500 lr:    mbs:    : 500 lr:    bs:    mbs:    :
5120 lr:    mbs:    : 1000 lr:    bs:    mbs:    :
DER++ 200 lr:    mbs:    :    : 200 lr:   bs:   mbs:   :   :
500 lr:    mbs:    :    : 500 lr:   bs:   mbs:   :   :
5120 lr:    mbs:    :    : 1000 lr:   bs:   mbs:   :   :
Table 4: Hyperparameters selected for our experiments.
Permuted MNIST
Method Buffer Par Values
SGD - lr [0.03, 0.1, 0.2]
oEWC - lr [0.1, 0.01]
[0.1, 1, 10, 30, 90,100]
[0.9, 1.0]
SI - lr [0.01, 0.1]
c [0.5, 1.0]
[0.9, 1.0]
LwF - lr [0.01, 0.1, 0.3]
[0.5, 1.0]
T [2.0, 4.0]
ER 200 mbs [128]
lr [0.03, 0.1, 0.2]
500 mbs [128]
lr [0.03, 0.1, 0.2]
5120 mbs [128]
lr [0.03, 0.1, 0.2]
GEM 200 lr [0.01, 0.1, 0.3]
[0.1, 0.5, 1]
500 lr [0.01, 0.1, 0.3]
[0.1, 0.5, 1]
5120 lr [0.01, 0.1, 0.3]
[0.1, 0.5, 1]
A-GEM 200 mbs [128, 256]
lr [0.01, 0.1, 0.3]
500 mbs [128, 256]
lr [0.01, 0.1, 0.3]
5120 mbs [128, 256]
lr [0.01, 0.1, 0.3]
DER 200 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
500 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
5120 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
DER++ 200 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
[0.5, 1.0]
500 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
[0.5, 1.0]
5120 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
[0.5, 1.0]
Rotated MNIST
Method Buffer Par Values
SGD - lr [0.03, 0.1, 0.2]
oEWC - lr [0.01, 0.1]
[0.1, 0.7, 1, 10, 30, 90,100]
[0.9, 1.0]
SI - lr [0.01, 0.1]
c [0.5, 1.0]
[0.9, 1.0]
LwF - lr [0.01, 0.1, 0.2, 0.3]
[0.5, 1.0]
T [2.0, 4.0]
ER 200 mbs [128]
lr [0.1, 0.2]
500 mbs [128]
lr [0.1, 0.2]
5120 mbs [128]
lr [0.1, 0.2]
GEM 200 lr [0.01, 0.3, 0.1]
[0.1, 0.5, 1]
500 lr [0.01, 0.3, 0.1]
[0.1, 0.5, 1]
5120 lr [0.01, 0.3, 0.1]
[0.1, 0.5, 1]
A-GEM 200 mbs [128, 256]
lr [0.01, 0.1, 0.3]
500 mbs [128, 256]
lr [0.01, 0.1, 0.3]
5120 mbs [128, 256]
lr [0.01, 0.1, 0.3]
DER 200 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
500 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
5120 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
DER++ 200 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
[0.5, 1.0]
500 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
[0.5, 1.0]
5120 mbs [128]
lr [0.1, 0.2]
[0.5, 1.0]
[0.5, 1.0]
Sequential MNIST
Method Buffer Par Values
SGD - lr [0.03, 0.01, 0.1]
oEWC - lr [0.03, 0.1]
[10, 25, 30, 90, 100]
[0.9, 1.0]
SI - lr [0.03, 0.01, 0.1]
c [0.5, 1.0]
[0.9, 1.0]
LwF - lr [0.03, 0.01, 0.1]
[0.5, 1.0]
T [2.0, 4.0]
PNN - lr [0.03, 0.1]
ER 200 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
500 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
5120 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
MER 200 mbs [128]
lr [0.03, 0.1]
[1.0]
[1.0]
nb [1, 3]
500 mbs [128]
lr [0.03, 0.1]
[1.0]
[1.0]
nb [1, 3]
5120 mbs [128]
lr [0.03, 0.1]
[1.0]
[1.0]
nb [1, 3]
GEM 200 lr [0.01, 0.03, 0.1]
[0.5, 1]
500 lr [0.01, 0.03, 0.1]
[0.5, 1]
5120 lr [0.01, 0.03, 0.1]
[0.5, 1]
A-GEM 200 mbs [10, 128, 64]
lr [0.03, 0.1]
500 mbs [10, 64, 128]
lr [0.03, 0.1]
5120 mbs [10, 64, 128]
lr [0.03, 0.1]
iCaRL 200 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
T [2.0, 4.0]
wd-reg [0.0001, 0.0005]
500 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
T [2.0, 4.0]
wd-reg [0.0001, 0.0005]
5120 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
T [2.0, 4.0]
wd-reg [0.0001, 0.0005]
DER 200 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
[0.2, 0.5, 1.0]
500 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
[0.2, 0.5, 1.0]
5120 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
[0.2, 0.5, 1.0]
DER++ 200 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
[0.2, 0.5, 1.0]
[0.5, 1.0]
500 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
[0.2, 0.5, 1.0]
[0.5, 1.0]
5120 mbs [10, 64, 128]
lr [0.03, 0.1, 0.2]
[0.2, 0.5, 1.0]
[0.5, 1.0]
Sequential CIFAR-10
Method Buffer Par Values
SGD - lr [0.01, 0.03, 0.1]
oEWC - lr [0.03, 0.1]
[10, 25, 30, 90, 100]
[0.9, 1.0]
SI - lr [0.03, 0.01, 0.1]
c [0.5, 0.1]
[1.0]
LwF - lr [0.03, 0.01, 0.1]
[0.5, 1.0]
T [2.0]
PNN - lr [0.03, 0.1]
ER 200 mbs [32, 128]
lr [0.03, 0.01, 0.1]
500 mbs [32, 128]
lr [0.03, 0.01, 0.1]
5120 mbs [32, 128]
lr [0.03, 0.01, 0.1]
GEM 200 lr [0.01, 0.03, 0.1]
[0.5, 1]
500 lr [0.01, 0.03, 0.1]
[0.5, 1]
5120 lr [0.01, 0.03, 0.1]
[0.5, 1]
A-GEM 200 mbs [32, 128]
lr [0.03, 0.1]
500 mbs [32, 128]
lr [0.03, 0.1]
5120 mbs [32, 128]
lr [0.03, 0.1]
iCaRL 200 mbs [32, 128]
lr [0.03, 0.01, 0.1]
T [2.0, 4.0]
wd-reg [0.0001, 0.0005]
500 mbs [32, 128]
lr [0.03, 0.01, 0.1]
T [2.0, 4.0]
wd-reg [0.0001, 0.0005]
5120 mbs [32, 128]
lr [0.03, 0.01, 0.1]
T [2.0, 4.0]
wd-reg [0.0001, 0.0005]
DER 200 mbs [32, 128]
lr [0.03, 0.01, 0.1]
[0.2, 0.5, 1.0]
500 mbs [32, 128]
lr [0.03, 0.01, 0.1]
[0.2, 0.5, 1.0]
5120 mbs [32, 128]
lr [0.03, 0.01, 0.1]
[0.2, 0.5, 1.0]
DER++ 200 mbs [32, 128]
lr [0.01, 0.03, 0.1]
[0.1, 0.2, 0.5]
[0.5, 1.0]
500 mbs [32, 128]
lr [0.01, 0.03, 0.1]
[0.1, 0.2, 0.5]
[0.5, 1.0]
5120 mbs [32, 128]
lr [0.01, 0.03, 0.1]
[0.1, 0.2, 0.5]
[0.5, 1.0]
Sequential Tiny ImageNet
Method Buffer Par Values
SGD - lr [0.01, 0.03, 0.1]
oEWC - lr [0.01, 0.03]
[10, 25, 30, 90, 100]
[0.9, 0.95, 1.0]
SI - lr [0.01, 0.03]
c [0.5]
[1.0]
LwF - lr [0.03, 0.01, 0.1]
[0.5, 1.0]
T [2.0]
PNN - lr [0.03, 0.1]
ER 200 mbs [32]
lr [0.01, 0.03, 0.1]
500 mbs [32]
lr [0.01, 0.03, 0.1]
5120 mbs [32]
lr [0.01, 0.03, 0.1]
A-GEM 200 mbs [32]
lr [0.003, 0.01]
500 mbs [32]
lr [0.003, 0.01]
5120 mbs [32]
lr [0.003, 0.01]
iCaRL 200 mbs [32]
lr [0.03, 0.1, 0.3]
T [2.0, 4.0]
wd-reg [0.0005]
500 mbs [32]
lr [0.03, 0.1, 0.3]
T [2.0, 4.0]
wd-reg [0.0005]
5120 mbs [32]
lr [0.03, 0.1, 0.3]
T [2.0, 4.0]
wd-reg [0.0005]
DER 200 mbs [32]
lr [0.01, 0.03, 0.1]
[0.2, 0.5, 1.0]
500 mbs [32]
lr [0.01, 0.03, 0.1]
[0.2, 0.5, 1.0]
5120 mbs [32]
lr [0.01, 0.03, 0.1]
[0.2, 0.5, 1.0]
DER++ 200 mbs [32]
lr [0.01, 0.03]
[0.1, 0.2, 0.5, 1.0]
[0.5, 1.0]
500 mbs [32]
lr [0.01, 0.03]
[0.1, 0.2, 0.5, 1.0]
[0.5, 1.0]
5120 mbs [32]
lr [0.01, 0.03]
[0.1, 0.2, 0.5, 1.0]
[0.5, 1.0]
MNIST-360
Method Buffer Par Values
SGD - lr [0.1, 0.2]
bs [1, 4, 8, 16]
ER 200 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
500 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
1000 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
MER 200 mbs [128]
lr [0.1, 0.2]
[1.0]
[1.0]
nb [1, 3]
500 mbs [128]
lr [0.1, 0.2]
[1.0]
[1.0]
nb [1, 3]
1000 mbs [128]
lr [0.1, 0.2]
[1.0]
[1.0]
nb [1, 3]
A-GEM-R 200 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 16]
500 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 16]
1000 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 16]
DER 200 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
[0.5, 1.0]
500 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
[0.5, 1.0]
1000 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
[0.5, 1.0]
DER++ 200 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
[0.2, 0.5]
[0.5, 1.0]
500 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
[0.2, 0.5]
[0.5, 1.0]
1000 mbs [16, 64, 128]
lr [0.1, 0.2]
bs [1, 4, 8, 16]
[0.2, 0.5]
[0.5, 1.0]
Table 5: Hyperparameter space for Grid-Search

Appendix C Details on MNIST-360

MNIST-360 presents the evaluated method with a sequence of MNIST digits from 0 to 8 shown at increasing angles.

C.1 Training

For training purposes, we build batches using exemplars that belong to two consecutive classes at a time, meaning that nine pairs of classes can be encountered: (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), and (8, 0). Each pair is shown in this order for $R$ rounds ($R = 3$ in our experiments) at changing rotations. This means that MNIST-360 consists of $9 \cdot R$ pseudo-tasks, whose boundaries are not signaled to the tested method. We indicate them with $\Psi_r^{(d_1, d_2)}$, where $r$ is the round number and $d_1, d_2$ are the digits forming one of the pairs listed above.

As every MNIST digit $d$ appears in $2R$ pseudo-tasks, we randomly split its example images evenly into $2R$ groups; the group of exemplars shown in a given pseudo-task is selected according to the round number and the digit's position within the pair (through an integer division), so that no exemplar is ever repeated.

At the beginning of $\Psi_r^{(d_1, d_2)}$, we initialize two counters to keep track of how many exemplars of $d_1$ and $d_2$ have been shown respectively. Given the batch size $B$, each batch is made up of $N_{d_1}$ samples from $d_1$ and $N_{d_2}$ samples from $d_2$, where:

(10)
(11)

This allows us to produce balanced batches, in which the proportion of exemplars of $d_1$ and $d_2$ is kept the same. A pseudo-task ends when the entirety of its exemplar set has been shown, which does not necessarily happen after a fixed number of batches.

Each digit $d$ is also associated with a counter $C_d$ that is never reset during training and is increased every time an exemplar of $d$ is shown to the evaluated method. Before being shown, every exemplar is rotated by

(12)   $\frac{2\pi}{N_d}\, C_d + \phi_d$

where $N_d$ is the total number of examples of digit $d$ in the training set and $\phi_d$ is a digit-specific angular offset that staggers the starting rotation of the different digits. By so doing, every digit's exemplars are shown with an increasing rotation spanning a full angle throughout the entire procedure. The rotation also changes within each pseudo-task, resulting in a gradually changing distribution. Figure 3 in the main paper shows the initial batches of the first pseudo-tasks.
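A small sketch (ours) of this per-digit rotation schedule follows; the per-digit offsets are placeholders, not the exact values used to build MNIST-360.

```python
import math

class RotationSchedule:
    """Eq. 12: each digit sweeps a full turn over the whole stream."""
    def __init__(self, examples_per_digit, offsets=None):
        self.n = dict(examples_per_digit)                     # N_d for each digit d
        self.offsets = offsets or {d: 0.0 for d in self.n}    # phi_d (placeholder values)
        self.counters = {d: 0 for d in self.n}                # C_d, never reset

    def next_angle(self, d: int) -> float:
        angle = 2 * math.pi * self.counters[d] / self.n[d] + self.offsets[d]
        self.counters[d] += 1                                 # advance the digit's counter
        return angle
```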

C.2 Test

As no task boundaries are provided, evaluation on MNIST-360 can only be carried out after the training is complete. For test purposes, digits are still shown with an increasing rotation as per Eq. 12, with $N_d$ referring to the test-set digit cardinality and no offset applied ($\phi_d = 0$).

The order in which digits are shown is irrelevant; therefore, no specific batching strategy is necessary and we simply show one digit at a time.

Appendix D Results table

Buffer Method S-MNIST S-CIFAR-10 S-Tiny-ImageNet P-MNIST R-MNIST
Class-IL Task-IL Class-IL Task-IL Class-IL Task-IL Domain-IL Domain-IL
SGD
oEWC
SI
LwF
PNN - - - - -
ER
MER - - - - - -
GEM - -
200 A-GEM
iCaRL - -
DER
DER++