
Continual Learning Based on OOD Detection and Task Masking

Existing continual learning techniques focus on either the task incremental learning (TIL) or the class incremental learning (CIL) problem, but not both. CIL and TIL differ mainly in that the task-id is provided for each test sample during testing for TIL, but not for CIL. Continual learning methods intended for one problem have limitations on the other problem. This paper proposes a novel unified approach based on out-of-distribution (OOD) detection and task masking, called CLOM, to solve both problems. The key novelty is that each task is trained as an OOD detection model rather than a traditional supervised learning model, and a task mask is trained to protect each task to prevent forgetting. Our evaluation shows that CLOM outperforms existing state-of-the-art baselines by large margins. The average TIL/CIL accuracy of CLOM over six experiments is 87.6/67.9%, while that of the best baseline is only 82.4/55.0%.



1 Introduction

Continual learning (CL) learns a sequence of tasks incrementally. Each task $k$ has its dataset $\mathcal{D}_k = \{(x_k^i, y_k^i)\}_{i=1}^{n_k}$, where $x_k^i$ is a data sample in task $k$, $y_k^i \in \mathcal{C}_k$ is the class label of $x_k^i$, and $\mathcal{C}_k$ is the set of classes of task $k$. The key challenge of CL is catastrophic forgetting (CF) McCloskey and Cohen (1989), which refers to the situation where the learning of a new task may significantly change the network weights learned for old tasks, degrading the model accuracy for old tasks. Researchers have mainly worked on two CL problems: class incremental/continual learning (CIL) and task incremental/continual learning (TIL) (Dhar et al., 2019; van de Ven and Tolias, 2019). The main difference between CIL and TIL is that in TIL, the task-id $k$ is provided for each test sample $x$ during testing so that only the model for task $k$ is used to classify $x$, while in CIL, the task-id for each test sample is not provided.

Existing CL techniques focus on either CIL or TIL (Parisi et al., 2019). In general, CIL methods are designed to function without a given task-id and to perform class prediction over all classes in the tasks learned so far; they thus tend to forget previous tasks due to the plasticity needed for the new task. TIL methods are designed to function with a task-id for class prediction within the task. They are more stable in retaining previous within-task knowledge, but incompetent when the task-id is unknown (see Sec. 4).

This paper proposes a novel and unified method called CLOM (Continual Learning based on OOD detection and Task Masking) to solve both problems by overcoming the limitations of CIL and TIL methods. CLOM has two key mechanisms: (1) a task mask mechanism for protecting each task model to overcome CF, and (2) a learning method for building a model for each task based on out-of-distribution (OOD) detection. The task mask mechanism is inspired by hard attention in the TIL system HAT Serrà et al. (2018). The OOD detection based learning method for building each task model is quite different from the classic supervised learning used in existing TIL systems. It is, in fact, the key novelty and enabler for CLOM to work effectively for CIL.

OOD detection is stated as follows Bulusu et al. (2020): Given a training set of classes, called the in-distribution (IND) training data, we want to build a model that can assign correct classes to IND test data and reject or detect OOD test data that do not belong to any of the IND training classes. The OOD rejection capability makes a TIL model effective for the CIL problem because during testing, if a test sample does not belong to any of the classes of a task, it will be rejected by the model of the task. Thus, only the task that the test sample belongs to will accept it and classify it to one of its classes. No task-id is needed. The OOD detection algorithm in CLOM is inspired by several recent advances in self-supervised learning, data augmentation He et al. (2020), contrastive learning Oord et al. (2018); Chen et al. (2020), and their applications to OOD detection Golan and El-Yaniv (2018); Hendrycks et al. (2019); Tack et al. (2020).

The main contributions of this work are as follows:

  • It proposes a novel CL method, CLOM, which is essentially a TIL system in training but solves both TIL and CIL problems in testing. Existing methods mainly solve either the TIL or the CIL problem, but are weak on the other.

  • CLOM uses task masks to protect the model of each task to prevent CF, and OOD detection to build each task model, which to our knowledge has not been done before. More importantly, since the task masks prevent CF, continual learning performance improves as the OOD models improve.

  • CLOM does not need to save any replay data. If some past data is saved for output calibration, it performs even better.

Experimental results show that CLOM improves state-of-the-art baselines by large margins. The average TIL/CIL accuracy of CLOM over six different experiments is 87.6/67.9% while that of the best baseline is only 82.4/55.0%.

2 Related Work

Many approaches have been proposed to deal with CF in CL. Using regularization Kirkpatrick et al. (2017) and knowledge distillation Li and Hoiem (2016) to minimize the change to previous models are two popular approaches Jung et al. (2016); Camoriano et al. (2017); Fernando et al. (2017); Rannen Ep Triki et al. (2017); Seff et al. (2017); Zenke et al. (2017); Kemker and Kanan (2018); Ritter et al. (2018); Schwarz et al. (2018); Xu and Zhu (2018); Castro et al. (2018); Dhar et al. (2019); Hu et al. (2019); Lee et al. (2019); Liu et al. (2020b). Memorizing some old examples and using them to adjust the old models in learning a new task is another popular approach (called replay) Rusu et al. (2016); Lopez-Paz and Ranzato (2017); Rebuffi et al. (2017); Chaudhry et al. (2019); de Masson d’Autume et al. (2019); Hou et al. (2019); Wu et al. (2019); Rolnick et al. (2019); Buzzega et al. (2020); Zhao et al. (2020); Rajasegaran et al. (2020a); Liu et al. (2021). Several systems learn to generate pseudo training data of old tasks and use them to jointly train the new task, called pseudo-replay Gepperth and Karaoguz (2016); Kamra et al. (2017); Shin et al. (2017); Wu et al. (2018); Seff et al. (2017); Kemker and Kanan (2018); Hu et al. (2019); Hayes et al. (2019); Rostami et al. (2019); Ostapenko et al. (2019). CLOM differs from these approaches as it does not replay any old task data to prevent forgetting and can function with or without saving some old data. Parameter isolation is yet another popular approach, which makes different subsets (which may overlap) of the model parameters dedicated to different tasks using masks Fernando et al. (2017); Serrà et al. (2018); Ke et al. (2020) or finding a sub-network for each task by pruning Mallya and Lazebnik (2017); Wortsman et al. (2020); Hung et al. (2019).
CLOM uses parameter isolation, but it differs from these approaches as it combines the idea of parameter isolation and OOD detection, which can solve both TIL and CIL problems effectively.

There are also many other approaches, e.g., a network of experts Aljundi et al. (2017) and generalized CL Mi et al. (2020), etc. PASS Zhu et al. (2021) uses data rotation and regularizes the rotated data. CoL Cha et al. (2021) is a replay method that uses a contrastive loss on old samples. CLOM also uses rotation and contrastive loss, but its CF handling is based on masks. None of the existing methods uses OOD detection.

CLOM is a TIL method that can also solve CIL problems. Many TIL systems exist Fernando et al. (2017), e.g., GEM Lopez-Paz and Ranzato (2017), A-GEM Chaudhry et al. (2019), UCL Ahn et al. (2019), ADP Yoon et al. (2020), CCLL Singh et al. (2020), Orthog-Subspace Chaudhry et al. (2020), HyperNet von Oswald et al. (2020), PackNet Mallya and Lazebnik (2017), CPG Hung et al. (2019), SupSup Wortsman et al. (2020), HAT Serrà et al. (2018), and CAT Ke et al. (2020). GEM is a replay based method and UCL is a regularization based method. A-GEM (Chaudhry et al., 2019) improves GEM’s efficiency. ADP decomposes parameters into shared and adaptive parts to construct an order robust TIL system. CCLL uses task-adaptive calibration on convolution layers. Orthog-Subspace learns each task in subspaces orthogonal to each other. HyperNet initializes task-specific parameters conditioned on task-id. PackNet, CPG and SupSup find an isolated sub-network for each task and use it at inference. HAT and CAT protect previous tasks by masking the important parameters. Our CLOM also uses this general approach, but its model building for each task is based on OOD detection, which has not been used by existing TIL methods. It also performs well in the CIL setting (see Sec. 4).

Related work on out-of-distribution (OOD) detection (also called open set detection) is also extensive. Excellent surveys include Bulusu et al. (2020); Geng et al. (2020). The model building part of CLOM is inspired by the latest method in Tack et al. (2020) based on data augmentation He et al. (2020) and contrastive learning Chen et al. (2020).

3 Proposed CLOM Technique

As mentioned earlier, CLOM is a TIL system that solves both TIL and CIL problems. It takes the parameter-isolation approach for TIL. For each task $k$ (task-id), a model is built for the task, where $h$ and $f_k$ are the feature extractor and the task-specific classifier, respectively, and $f_k(h(x; m_k))$ is the output of the neural network for task $k$. We omit the task-id $k$ in the notation where possible to simplify it. In learning each task $k$, a task mask $m_k$ is also trained at the same time to protect the learned model of the task.

In testing for TIL, given the test sample $x$ with task-id $k$ provided, CLOM uses the model for task $k$ to classify $x$:

$$\hat{y} = \arg\max f_k(h(x; m_k)). \quad (1)$$

For CIL (no task-id for each test sample $x$), CLOM uses the model of every task and predicts the class using

$$\hat{y} = \arg\max \bigoplus_{1 \le k \le t} f_k(h(x; m_k)), \quad (2)$$

where $\bigoplus$ is the concatenation over the output space and $t$ is the last task that has been learned. Eq. 2 essentially chooses the class with the highest classification score among the classes of all tasks learned so far. This works because the OOD detection model for a task will give a very low score to a test sample that does not belong to the task. Thus, only the task that the test sample belongs to will accept it and classify it to one of its classes. No task-id is needed. An overview of the prediction is illustrated in Fig. 1(a).
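The two prediction modes can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper names and the toy logits are hypothetical, and each per-task score vector stands in for a task model's output.

```python
import numpy as np

def til_predict(task_logits, task_id):
    """TIL: the task-id is given, so take the argmax within that task's head."""
    return int(np.argmax(task_logits[task_id]))

def cil_predict(task_logits):
    """CIL: concatenate every task head's output and take the global
    argmax; the winning index identifies both the task and the class."""
    scores = np.concatenate(task_logits)   # output-space concatenation
    flat = int(np.argmax(scores))
    # map the flat index back to (task, within-task class)
    for k, logits in enumerate(task_logits):
        if flat < len(logits):
            return k, flat
        flat -= len(logits)

# toy example: two tasks with two classes each; task 1's OOD-style model
# produces a high score only because the sample belongs to it
logits = [np.array([0.1, 0.2]), np.array([3.0, 0.5])]
print(til_predict(logits, 1))   # 0
print(cil_predict(logits))      # (1, 0)
```

Because a well-trained OOD model scores out-of-task samples low, the global argmax usually lands in the correct task's block even though no task-id is supplied.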

Note that in fact any CIL model can be used as a TIL model if the task-id is provided. The conversion is made by selecting the corresponding task classification heads in testing. However, a conversion from a TIL method to a CIL method is not obvious as the system requires the task-id in testing. Some attempts have been made to predict the task-id in order to make a TIL method applicable to CIL problems. iTAML Rajasegaran et al. (2020b) requires the test samples to come in batches and each batch must be from a single task. This may not be practical as test samples usually come one by one. CCG Abati et al. (2020) builds a separate network to predict the task-id, which is also prone to forgetting. Expert Gate Aljundi et al. (2017) constructs a separate autoencoder for each task. Our CLOM classifies one test sample at a time and does not need to construct another network for task-id prediction.

In the following subsections, we present (1) how to build an OOD model for a task and use it to make a prediction, and (2) how to learn task masks to protect each model. An overview of the process is illustrated in Fig. 1(b).

Learning Each Task as OOD Detection

Figure 1: (a) Overview of TIL and CIL prediction. For TIL prediction, we consider only the output heads of the given task. For CIL prediction, we obtain all the outputs from task 1 to the current task $t$, and choose the label over the concatenated output. (b) Training overview for a generic backbone. Each CNN block includes a convolution layer with batch normalization, activation, or pooling depending on the configuration. In our experiments, we use AlexNet Krizhevsky et al. (2012) and ResNet-18 He et al. (2016). We feed-forward an augmented batch consisting of rotated images of different views. We first train the feature extractor $h$ using the contrastive loss (step 1). At each layer, a binary mask is multiplied to the output of each convolution layer to learn the important parameters of the current task $t$. After training the feature extractor, we fine-tune the OOD classifier $f_t$ (step 2).

We borrow the latest OOD ideas based on contrastive learning Chen et al. (2020); He et al. (2020) and data augmentation due to their excellent performance Tack et al. (2020). Since this section focuses on how to learn a single task based on OOD detection, we omit the task-id unless necessary. The OOD training process is similar to that of contrastive learning. It consists of two steps: 1) learning the feature representation by the composite $g \circ h$, where $h$ is a feature extractor and $g$ is a projection to the contrastive representation, and 2) learning a linear classifier $f$ mapping the feature representation of $h$ to the label space. In the following, we describe the training process: contrastive learning for feature representation learning (step 1) and OOD classifier building (step 2). We then explain how to make a prediction based on an ensemble method for both TIL and CIL settings, and how to further improve prediction using some saved past data.

Contrastive Loss for Feature Learning.

This is step 1 in Fig. 1(b). Supervised contrastive learning is used to try to repel data of different classes and align data of the same class more closely to make it easier to classify them. A key operation is data augmentation via transformations.

Given a batch $\mathcal{B}$ of $N$ samples, each sample $x$ is first duplicated, and each version then goes through three initial augmentations (also see Data Augmentation in Sec. 4) to generate two different views $x^{(1)}$ and $x^{(2)}$ (they keep the same class label as $x$). Denote the augmented batch by $\mathcal{B}'$, which now has $2N$ samples. In Hendrycks et al. (2019); Tack et al. (2020), it was shown that using image rotations is effective in learning OOD detection models because such rotations can effectively serve as out-of-distribution (OOD) training data. For each augmented sample $x \in \mathcal{B}'$ with class $y$ of a task, we rotate $x$ by $90^\circ$, $180^\circ$, and $270^\circ$ to create three images, which are assigned three new classes $y_{90}$, $y_{180}$, and $y_{270}$, respectively. This results in a larger augmented batch $\mathcal{B}^R$. Since we generate three new images from each $x$, the size of $\mathcal{B}^R$ is $8N$. For each original class, we now have 4 classes. For a sample $x_i \in \mathcal{B}^R$, let $\mathcal{B}^R_{\setminus i} = \mathcal{B}^R \setminus \{x_i\}$ and let $P(i)$ be the set consisting of the data of the same class as $x_i$, distinct from $x_i$. The contrastive representation of a sample $x$ is $z = g(h(x; m_t)) / \|g(h(x; m_t))\|$, where $t$ is the current task. In learning, we minimize the supervised contrastive loss Khosla et al. (2020) of task $t$:

$$\mathcal{L}_{SC} = \frac{1}{8N} \sum_{i=1}^{8N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{x_j \in \mathcal{B}^R_{\setminus i}} \exp(z_i \cdot z_j / \tau)}, \quad (3)$$

where $\tau$ is a scalar temperature and $\cdot$ is the dot product. The loss is reduced by repelling $z$ of different classes and aligning $z$ of the same class more closely. $\mathcal{L}_{SC}$ basically trains a feature extractor with good representations for learning an OOD classifier.
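The rotation augmentation and the supervised contrastive loss can be sketched as below. This is a simplified NumPy sketch under assumed conventions (rotation label $y_r = y + r \cdot C$, batches as arrays of shape `[B, C, H, W]`); the function names are hypothetical and projections are taken as given vectors rather than network outputs.

```python
import numpy as np

def rotation_augment(x, y, num_classes):
    """Build the rotation-augmented batch: every image is additionally
    rotated by 90/180/270 degrees, and each rotation angle defines a new
    set of labels, so C original classes become 4C classes."""
    xs, ys = [x], [y]
    for r in (1, 2, 3):                          # r * 90 degrees
        xs.append(np.rot90(x, r, axes=(2, 3)))   # rotate H/W axes
        ys.append(y + r * num_classes)           # fresh labels per angle
    return np.concatenate(xs), np.concatenate(ys)

def sup_con_loss(z, labels, tau=0.07):
    """Supervised contrastive loss (Khosla et al., 2020): positives for a
    sample are all other samples sharing its (possibly rotation) label."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / tau
    n = len(z)
    eye = np.eye(n, dtype=bool)
    pos = (labels[None, :] == labels[:, None]) & ~eye
    sim[eye] = -np.inf                           # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    per_sample = -np.where(pos, log_prob, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return per_sample.mean()
```

Minimizing this loss pulls same-class projections together and pushes the rotation classes apart, which is what later lets the classifier treat rotations as OOD surrogates.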

Learning the Classifier.

This is step 2 in Fig. 1(b). Given the feature extractor $h$ trained with the loss in Eq. 3, we freeze $h$ and only fine-tune the linear classifier $f_t$, which is trained to predict the classes of task $t$ and the augmented rotation classes. $f_t$ maps the feature representation to the label space in $\mathbb{R}^{4|\mathcal{C}_t|}$, where 4 is the number of rotation classes including the original data with $0^\circ$ rotation and $|\mathcal{C}_t|$ is the number of original classes in task $t$. We minimize the cross-entropy loss

$$\mathcal{L}_{ft} = -\frac{1}{|\mathcal{B}^R|} \sum_{(x, y) \in \mathcal{B}^R} \log p(y \mid x), \quad (4)$$

where $ft$ indicates fine-tuning and

$$p(y \mid x) = \mathrm{softmax}\big(f_t(h(x; m_t))\big)_y. \quad (5)$$

The output $f_t(h(x; m_t))$ includes the rotation classes. The linear classifier is trained to predict the original and the rotation classes.

Ensemble Class Prediction.

We describe how to predict a label $\hat{y} \in \mathcal{C}_t$ (TIL) and $\hat{y} \in \mathcal{C}$ (CIL), where $\mathcal{C}$ is the set of original classes of all tasks. We assume all tasks have been learned and their models are protected by masks, which we discuss in the next subsection.

We discuss the prediction of the class label $\hat{y}$ for a test sample $x$ in the TIL setting first. Note that the network in Eq. 5 returns logits for the $4|\mathcal{C}_t|$ rotation classes (including the original task classes). Note also that for each original class label $c \in \mathcal{C}_t$ of a task $t$, we created three additional rotation classes. For class $c$, the classifier will produce four output values from its four rotation class logits, i.e., $f_t(h(x_0; m_t))_{c_0}$, $f_t(h(x_{90}; m_t))_{c_{90}}$, $f_t(h(x_{180}; m_t))_{c_{180}}$, and $f_t(h(x_{270}; m_t))_{c_{270}}$, where 0, 90, 180, and 270 represent the $0^\circ$, $90^\circ$, $180^\circ$, and $270^\circ$ rotations respectively and $x_0$ is the original $x$. We compute an ensemble output $f_t(x)_c$ for each class $c$ of task $t$:

$$f_t(x)_c = \frac{1}{4} \sum_{\theta \in \{0, 90, 180, 270\}} f_t(h(x_\theta; m_t))_{c_\theta}. \quad (6)$$
The final TIL class prediction is made as follows (note, in TIL, the task-id $t$ is provided in testing):

$$\hat{y} = \arg\max_{c \in \mathcal{C}_t} f_t(x)_c. \quad (7)$$
We use Eq. 2 to make the CIL class prediction, where the final output format for task $t$ is the following vector:

$$f_t(x) = \big(f_t(x)_1, \dots, f_t(x)_{|\mathcal{C}_t|}\big). \quad (8)$$
Our method so far memorizes no training samples and it already outperforms baselines (see Sec. 4).
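The rotation ensemble described above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes a hypothetical output layout in which the logit of class $c$ under rotation $r \cdot 90^\circ$ sits at index $c + r \cdot C$, and each entry of `logits_by_rotation` is the classifier output for the correspondingly rotated input.

```python
import numpy as np

def ensemble_scores(logits_by_rotation, num_classes):
    """For each original class c, average the four rotation-class logits:
    from the forward pass on the input rotated by r*90 degrees, take the
    logit of the rotation class (c, r*90)."""
    scores = np.zeros(num_classes)
    for r, logits in enumerate(logits_by_rotation):
        for c in range(num_classes):
            scores[c] += logits[c + r * num_classes]
    return scores / len(logits_by_rotation)

# toy example: 2 original classes, identical 8-dim (4C) logits per rotation
logits = [np.arange(8.0)] * 4
print(ensemble_scores(logits, 2))   # [3. 4.]
```

The per-task score vector returned here plays the role of $f_t(x)$: it is used directly for the within-task argmax (TIL) and concatenated across tasks for the global argmax (CIL).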

Output Calibration with Memory.

The proposed method can make incorrect CIL predictions even with a perfect OOD model (one that rejects every test sample that does not belong to any class of the task). This happens because the task models are trained independently, so the outputs of different tasks may have different magnitudes. We use output calibration to ensure that the outputs are of similar magnitudes and thus comparable, using some saved examples in a memory $\mathcal{M}$ with a limited budget. At each task $t$, we store a fraction of the validation data into $\mathcal{M}$ for output calibration and update the memory by maintaining an equal number of samples per class. We detail the memory budget in Sec. 4. Basically, we save the same number of samples as the existing replay-based TIL/CIL methods.

After training the network for the current task $t$, we freeze the model and use the saved data in $\mathcal{M}$ to find the scaling and shifting parameters $(\sigma_k, \mu_k)$ to calibrate the after-ensemble classification output (Eq. 8) (i.e., using $\sigma_k f_k(x) + \mu_k$ for each task $k$) by minimizing the cross-entropy loss

$$\mathcal{L}_{cal} = -\frac{1}{|\mathcal{M}|} \sum_{(x, y) \in \mathcal{M}} \log p(y \mid x), \quad (9)$$

where $y \in \mathcal{C}$, the set of all classes of all tasks seen so far, and $p(y \mid x)$ is computed using softmax:

$$p(y \mid x) = \mathrm{softmax}\Big(\bigoplus_{1 \le k \le t} \big(\sigma_k f_k(x) + \mu_k\big)\Big)_y. \quad (10)$$

Clearly, the parameters $\sigma_k$ and $\mu_k$ do not change the classification within task $k$ (TIL), but they calibrate the outputs such that the ensemble outputs from all tasks are of comparable magnitudes. For CIL inference at test time, we make the prediction by (following Eq. 2)

$$\hat{y} = \arg\max \bigoplus_{1 \le k \le t} \big(\sigma_k f_k(x) + \mu_k\big). \quad (11)$$
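The calibration step can be sketched as plain gradient descent on the cross-entropy of the concatenated, calibrated outputs. This is a sketch under stated assumptions: the function names are hypothetical, the optimizer is simple full-batch gradient descent (the paper's exact optimizer is not assumed), and each saved sample is represented by its precomputed per-task ensemble score vectors.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())    # shift for numerical stability
    return e / e.sum()

def fit_calibration(samples, labels, n_tasks, lr=0.1, steps=200):
    """Learn per-task scale sigma_k and shift mu_k on saved samples.
    samples[i]: list of per-task ensemble score vectors for sample i;
    labels[i]: global class index in the concatenated output space."""
    sigma, mu = np.ones(n_tasks), np.zeros(n_tasks)
    for _ in range(steps):
        gs, gm = np.zeros(n_tasks), np.zeros(n_tasks)
        for scores, y in zip(samples, labels):
            cal = np.concatenate([sigma[k] * s + mu[k]
                                  for k, s in enumerate(scores)])
            d = softmax(cal)
            d[y] -= 1.0                     # dCE/dlogits = p - onehot
            i = 0
            for k, s in enumerate(scores):  # chain rule per task block
                gs[k] += d[i:i + len(s)] @ s
                gm[k] += d[i:i + len(s)].sum()
                i += len(s)
        sigma -= lr * gs / len(labels)
        mu -= lr * gm / len(labels)
    return sigma, mu
```

Because each task's scores are only scaled and shifted by a shared $(\sigma_k, \mu_k)$, within-task rankings (TIL) are unchanged while cross-task magnitudes become comparable.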
Protecting OOD Model of Each Task Using Masks

We now discuss the task mask mechanism for protecting the OOD model of each task to deal with CF. In learning the OOD model for each task, CLOM at the same time also trains a mask (hard attention) for each layer. To protect the shared feature extractor from being changed by a new task, the masks of previous tasks are used to block their important neurons so that the new task learning will not interfere with the parameters learned for previous tasks.

The main idea is to use a sigmoid to approximate a 0-1 step function as hard attention to mask (block) or unmask (unblock) the information flow, protecting the parameters learned for each previous task.

The hard attention (mask) at layer $l$ and task $t$ is defined as

$$a_t^l = \sigma(s e_t^l), \quad (12)$$

where $\sigma$ is the sigmoid function, $s$ is a scalar, and $e_t^l$ is a learnable embedding of the task-id input $t$. The attention is element-wise multiplied to the output $h^l$ of layer $l$ as

$$h'^l = a_t^l \otimes h^l, \quad (13)$$

as depicted in Fig. 1(b). The sigmoid function converges to a 0-1 step function as $s$ goes to infinity. Since the true step function is not differentiable, a fairly large $s$ is chosen to achieve a differentiable pseudo step function based on the sigmoid (see Appendix D for the choice of $s$). The pseudo binary value of the attention determines how much information can flow forward and backward between adjacent layers.

Denote $h^l = \mathrm{ReLU}(W^l h'^{l-1} + b^l)$, where ReLU is the rectifier function. For units (neurons) of attention with zero values, we can freely change the corresponding parameters in $W^l$ and $b^l$ without affecting the output $h'^l$. For units of attention with non-zero values, changing the parameters will affect the output $h'^l$, so we need to protect them from gradient flow in backpropagation to prevent forgetting.

Specifically, during training of the new task $t$, we update the parameters according to the attention so that the important parameters for past tasks ($1, \dots, t-1$) remain unmodified. Denote the accumulated attentions (masks) of all past tasks at layer $l$ by

$$a_{<t}^l = \max\big(a_{<t-1}^l, a_{t-1}^l\big), \quad (14)$$

where $\max$ is an element-wise maximum (some parameters from different tasks can be shared, which means some hard attention masks can be shared) and $a_{<1}^l$ is a zero vector. Then the modified gradient of weight $w_{ij}^l$ is

$$g'^l_{ij} = \Big(1 - \min\big(a_{<t,i}^l, a_{<t,j}^{l-1}\big)\Big)\, g_{ij}^l, \quad (15)$$

where $a_{<t,i}^l$ indicates the $i$'th unit of $a_{<t}^l$. This reduces the gradient if the corresponding units' attentions at layers $l$ and $l-1$ are non-zero. (By construction, if $a_{<t}^l$ becomes an all-one vector for all layers, the gradients are zero and the network is at maximum capacity; the capacity can be increased by adding more parameters.) We do not apply hard attention on the last layer because it is a task-specific layer.
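The pseudo step function and the gradient modification can be sketched as follows. This is a minimal sketch with hypothetical helper names; masks are given as plain vectors rather than learned embeddings.

```python
import numpy as np

def hard_attention(embedding, s):
    """Pseudo step function: sigmoid(s * e) approaches a 0-1 mask as the
    scalar s grows."""
    return 1.0 / (1.0 + np.exp(-s * embedding))

def masked_gradient(grad, a_prev_out, a_prev_in):
    """Scale the gradient of W[i, j] (unit j of layer l-1 -> unit i of
    layer l) by 1 - min(a_prev_out[i], a_prev_in[j]), so parameters whose
    accumulated past-task attention is near 1 are left unmodified."""
    scale = 1.0 - np.minimum(a_prev_out[:, None], a_prev_in[None, :])
    return grad * scale

# a weight is protected only when BOTH of its endpoint units are
# important to previous tasks
g = np.ones((2, 3))
a_out = np.array([1.0, 0.0])       # unit 0 of layer l is protected
a_in = np.array([1.0, 1.0, 0.0])   # units 0, 1 of layer l-1 protected
print(masked_gradient(g, a_out, a_in))
# row 0: [0. 0. 1.], row 1: [1. 1. 1.]
```

Note the element-wise minimum: a gradient is zeroed only where the attentions at both adjacent layers are (near) one, which is exactly the condition under which changing the weight would change a protected output.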

To encourage sparsity in $a_t^l$ and parameter sharing with $a_{<t}^l$, a regularization term for the attention at task $t$ is defined as

$$\mathcal{L}_r = \lambda_t \frac{\sum_l \sum_i a_{t,i}^l \big(1 - a_{<t,i}^l\big)}{\sum_l \sum_i \big(1 - a_{<t,i}^l\big)}, \quad (16)$$

where $\lambda_t$ is a scalar hyperparameter. For flexibility, we denote $\lambda_t$ for each task $t$; in practice, we use the same $\lambda$ for all tasks. The final objective to be minimized for task $t$ with hard attention is (see Fig. 1(b))

$$\mathcal{L} = \mathcal{L}_{SC} + \mathcal{L}_r, \quad (17)$$

where $\mathcal{L}_{SC}$ is the contrastive loss function (Eq. 3). By protecting important parameters from changing during training, the neural network effectively alleviates CF.

4 Experiments

Evaluation Datasets: Four image classification CL benchmark datasets are used in our experiments. (1) MNIST: handwritten digits of 10 classes with 60,000 examples for training and 10,000 examples for testing. (2) CIFAR-10 (Krizhevsky and Hinton, 2009): 60,000 32x32 color images of 10 classes with 50,000 for training and 10,000 for testing. (3) CIFAR-100 (Krizhevsky and Hinton, 2009): 60,000 32x32 color images of 100 classes with 500 images per class for training and 100 per class for testing. (4) Tiny-ImageNet Le and Yang (2015): 120,000 64x64 color images of 200 classes with 500 images per class for training, 50 images per class for validation, and 50 images per class for testing. Since the test data has no labels in this dataset, we use the validation data as the test data as in Liu et al. (2020b).

Baseline Systems: We compare our CLOM with both the classic and the most recent state-of-the-art CIL and TIL methods. We also include CLOM(-c), which is CLOM without calibration (which already outperforms the baselines).

For CIL baselines, we consider seven replay methods: LwF.R (the replay version of LwF Li and Hoiem (2016) with better results Liu et al. (2020a)), iCaRL Rebuffi et al. (2017), BiC Wu et al. (2019), A-RPS Rajasegaran et al. (2020a), Mnemonics Liu et al. (2020a), DER++ Buzzega et al. (2020), and CoL Cha et al. (2021); one pseudo-replay method, CCG Abati et al. (2020); one orthogonal projection method, OWM Zeng et al. (2019); a multi-classifier method, MUC Liu et al. (2020b); and a prototype augmentation method, PASS Zhu et al. (2021). OWM, MUC, and PASS do not save any samples from previous tasks. (iTAML Rajasegaran et al. (2020b) is not included as it requires a batch of test data from the same task to predict the task-id. When each batch has only one test sample, which is our setting, it is very weak; e.g., its TIL/CIL accuracy is only 35.2%/33.5% on CIFAR100 10 tasks. Expert Gate (EG) Aljundi et al. (2017) is also weak; e.g., its TIL/CIL accuracy is 87.2%/43.2% on MNIST 5 tasks. Both iTAML and EG are much weaker than many baselines.)

TIL baselines include HAT Serrà et al. (2018), HyperNet von Oswald et al. (2020), and SupSup Wortsman et al. (2020). As noted earlier, CIL methods can also be used for TIL. In fact, TIL methods may be used for CIL too, but the results are very poor. We include them in our comparison.

Training Details

For all experiments, we use 10% of the training data as the validation set to grid-search for good hyperparameters. For minimizing the contrastive loss, we use LARS You et al. (2017) for 700 epochs with initial learning rate 0.1. We linearly increase the learning rate by 0.1 per epoch for the first 10 epochs until it reaches 1.0, and then decay it with a cosine scheduler Loshchilov and Hutter (2016) without restart, as in Chen et al. (2020); Tack et al. (2020). For fine-tuning the classifier $f_t$, we use SGD for 100 epochs with learning rate 0.1 and reduce the learning rate by a factor of 0.1 at 60, 75, and 90 epochs. The full set of hyperparameters is given in Appendix D.
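The contrastive-stage learning-rate schedule described above can be sketched as a small function. This is an illustrative sketch of the stated schedule (linear warmup by 0.1 per epoch to 1.0, then cosine decay without restart); the function name and the exact decay endpoint are assumptions.

```python
import math

def lr_at_epoch(epoch, total=700, warmup=10):
    """Warm up by +0.1 per epoch for the first 10 epochs (0.1 ... 1.0),
    then cosine-decay from 1.0 without restart for the remaining epochs."""
    if epoch < warmup:
        return 0.1 * (epoch + 1)
    t = (epoch - warmup) / (total - warmup)   # progress in [0, 1)
    return 0.5 * (1.0 + math.cos(math.pi * t))
```

For example, the rate is 0.1 at epoch 0, peaks at 1.0 at epoch 9, and decays smoothly toward 0 by epoch 699.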

We follow the recent baselines (A-RPS, DER++, PASS and CoL) and use the same class split and backbone architecture for both CLOM and baselines.

For MNIST and CIFAR-10, we split 10 classes into 5 tasks where each task has 2 classes in consecutive order. We save 20 random samples per class from the validation set for output calibration. This number is commonly used in replay methods Rebuffi et al. (2017); Wu et al. (2019); Rajasegaran et al. (2020a); Liu et al. (2020a). MNIST consists of single channel images of size 1x28x28. Since the contrastive learning Chen et al. (2020) relies on color changes, we copy the channel to make 3-channels. For MNIST and CIFAR-10, we use AlexNet-like architecture Krizhevsky et al. (2012) and ResNet-18 He et al. (2016) respectively for both CLOM and baselines.

For CIFAR-100, we conduct two experiments. We split 100 classes into 10 tasks and 20 tasks where each task has 10 and 5 classes, respectively, in consecutive order. We use 2000 memory budget as in Rebuffi et al. (2017), saving 20 random samples per class from the validation set for output calibration. We use the same ResNet-18 structure for CLOM and baselines, but we increase the number of channels twice to learn more tasks.

For Tiny-ImageNet, we follow

Liu et al. (2020b) and resize the original images of size 3x64x64 to 3x32x32 so that the same ResNet-18 of CIFAR-100 experiment setting can be used. We split 200 classes into 5 tasks (40 classes per task) and 10 tasks (20 classes per task) in consecutive order, respectively. To have the same memory budget of 2000 as for CIFAR-100, we save 10 random samples per class from the validation set for output calibration.

Data Augmentation. For baselines, we use data augmentations used in their original papers. For CLOM, following (Chen et al., 2020; Tack et al., 2020), we use three initial augmentations (see Sec. 3) (i.e., horizontal flip, color change (color jitter and grayscale), and Inception crop Szegedy et al. (2015)) and four rotations (see Sec. 3). Specific details about these transformations are given in Appendix C.

Each cell shows TIL / CIL accuracy (%), with standard deviations where available; – marks an entry that is not available.

Method                MNIST-5T           CIFAR10-5T              CIFAR100-10T            CIFAR100-20T            T-ImageNet-5T           T-ImageNet-10T          Average
CIL Systems
OWM                   99.7 / –           – / –                   – / –                   – / –                   – / –                   – / –                   60.1 / 36.5
MUC                   99.9 / –           – / –                   – / –                   – / –                   – / –                   – / –                   74.8 / 37.3
PASS                  99.5 / –           – / –                   – / –                   – / –                   – / –                   – / –                   71.8 / 38.9
LwF.R                 99.9±0.09 / 85.0   – / –                   – / –                   – / –                   – / –                   – / –                   80.3 / 47.6
iCaRL                 99.9±0.09 / 96.0   – / –                   – / –                   – / –                   – / –                   – / –                   78.6 / 54.0
Mnemonics             99.9±0.03 / 96.3   – / –                   – / –                   – / –                   – / –                   – / –                   78.5 / 54.1
BiC                   99.9±0.04 / 85.1   – / –                   – / –                   – / –                   – / –                   – / –                   77.3 / 45.8
DER++                 99.7 / –           – / –                   – / –                   – / –                   – / –                   – / –                   80.1 / 55.0
A-RPS                 – / –              – / –                   – / –                   – / –                   – / –                   – / –                   60.8 / 53.5
CCG                   – / 97.3           – / 70.1                – / –                   – / –                   – / –                   – / –                   – / –
CoL                   – / 93.4           – / 65.6                – / –                   – / –                   – / –                   – / –                   – / –
TIL Systems
HAT                   99.9±0.02 / 81.9   – / –                   – / –                   – / –                   – / –                   – / –                   81.8 / 46.6
HyperNet              99.7 / –           – / –                   – / –                   – / –                   – / –                   – / –                   67.8 / 26.7
SupSup                99.6 / –           – / –                   – / –                   – / –                   – / –                   – / –                   82.4 / 25.8
CLOM(-c)              99.9±0.00 / 94.4   98.7±0.06 / 87.8        92.0±0.37 / 63.3        94.3±0.06 / 54.6        68.4±0.16 / 45.7        72.4±0.21 / 47.1        87.6 / 65.5
CLOM                  99.9±0.00 / 96.9   98.7±0.06 / 88.0±0.48   92.0±0.37 / 65.2±0.71   94.3±0.06 / 58.0±0.45   68.4±0.16 / 51.7±0.37   72.4±0.21 / 47.6±0.32   87.6 / 67.9
Table 1: Average accuracy over all classes after the last task is learned. -xT: x number of tasks. In their original papers, PASS and Mnemonics use the first half of the classes to pre-train before CL; their results are 50.1% and 53.5% on CIFAR100-10T respectively, still lower than CLOM without pre-training. In our experiments, no pre-training is used for fairness. iCaRL and Mnemonics give both the final average accuracy (as here) and the average incremental accuracy in their original papers. We report the average incremental accuracy and network size in Appendix A and B, respectively. The last two columns show the average TIL and CIL accuracy of each method over all datasets.

Results and Comparative Analysis

As in existing works, we evaluate each method by two metrics: the average classification accuracy on all classes after training the last task, and the average forgetting rate Liu et al. (2020a), $F^t = \frac{1}{t-1} \sum_{i=1}^{t-1} \big(A_i^i - A_i^t\big)$, where $A_i^i$ is the $i$'th task's accuracy of the network right after the $i$'th task is learned and $A_i^t$ is the accuracy of the network on the $i$'th task's data after learning the last task $t$. We report the forgetting rate after the final task $t$. Our results are averages of 5 random runs.
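The forgetting metric can be computed from the usual accuracy matrix as below; this is a minimal sketch with a hypothetical function name.

```python
def average_forgetting(acc):
    """acc[j][i] = accuracy on task i measured right after task j is
    learned. Average forgetting after the final task compares each
    earlier task's initial accuracy A_i^i with its final accuracy A_i^t."""
    t = len(acc)
    return sum(acc[i][i] - acc[t - 1][i] for i in range(t - 1)) / (t - 1)

# toy accuracy matrix for 3 tasks (row j holds accuracies after task j)
acc = [[90.0],
       [85.0, 80.0],
       [70.0, 75.0, 95.0]]
print(average_forgetting(acc))   # (20 + 5) / 2 = 12.5
```

A negative value indicates that a task's accuracy increased after later tasks were learned, which matches the convention used in Fig. 2.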

We present the main experiment results in Tab. 1. The last two columns give the average TIL/CIL results of each system/row. For A-RPS, CCG, and CoL, we copy the results from their original papers as their codes are not released to the public or the public code cannot run on our system. The rows are grouped by CIL and TIL methods.

CIL Results Comparison. Tab. 1 shows that CLOM and CLOM(-c) achieve much higher CIL accuracy except for MNIST for which CLOM is slightly weaker than CCG by 0.4%, but CLOM’s result on CIFAR10-5T is about 18% greater than CCG. For other datasets, CLOM improves by similar margins. This is in contrast to the baseline TIL systems that are incompetent at the CIL setting when classes are predicted using Eq. 2. Even without calibration, CLOM(-c) already outperforms all the baselines by large margins.

TIL Results Comparison. The gains by CLOM and CLOM(-c) over the baselines are also great in the TIL setting. CLOM and CLOM(-c) are the same as the output calibration does not affect TIL performance. For the two large datasets CIFAR100 and T-ImageNet, CLOM gains by large margins. This is due to contrastive learning and the OOD model. The replay based CIL methods (LwF.R, iCaRL, Mnemonics, BiC, and DER++) perform reasonably in the TIL setting, but our CLOM and CLOM(-c) are much better due to task masks which can protect previous models better with little CF.

Figure 2: Average forgetting rate (%) in the TIL setting as CLOM is a TIL system. The lower the value, the better the method is. CIL/TIL systems are shaded in blue/red, respectively (best viewed in color). A negative value indicates the task accuracy has increased from the initial accuracy.

Comparison of Forgetting Rate. Fig. 2 shows the average forgetting rate of each method in the TIL setting. The CIL systems suffer from more forgetting as they are not designed for the TIL setting, which results in lower TIL accuracy (Tab. 1). The TIL systems are highly effective at preserving previous within-task knowledge. This results in higher TIL accuracy on large dataset such as T-ImageNet, but they collapse when task-id is not provided (the CIL setting) as shown in Tab. 1. CLOM is robust to forgetting as a TIL system and it also functions well without task-id.

We report only the forgetting rate in the TIL setting because our CLOM is essentially a TIL method and not a CIL system by design. The degrading CIL accuracy of CLOM is mainly because the OOD model for each task is not perfect.

Ablation Studies

Method                   CIFAR10-5T                      CIFAR100-10T
                         AUC    TaskDR   TIL    CIL      AUC    TaskDR   TIL    CIL
SupSup                   78.9   26.2     95.3   26.2     76.7   34.3     85.2   33.1
SupSup (OOD in CLOM)     88.9   82.3     97.2   81.5     84.9   63.7     90.0   62.1
CLOM (ODIN)              82.9   63.3     96.7   62.9     77.9   43.0     84.0   41.3
CLOM                     92.2   88.5     98.7   88.0     85.0   66.8     92.0   65.2
CLOM (w/o OOD)           90.3   83.9     98.1   83.3     82.6   59.5     89.8   57.5

Table 2: TIL and CIL results improve with better OOD detection. Column AUC gives the average AUC score for the OOD detection method as used within each system on the left. Column TaskDR gives the task detection rate. TIL and CIL results are average accuracy values. SupSup and CLOM variants are calibrated with 20 samples per class.

Better OOD for better continual learning. We show that (1) an existing CL model can be improved by a good OOD model, and (2) CLOM's results deteriorate if a weaker OOD model is applied. To isolate the effect of OOD detection on CIL performance, we further define task detection and the task detection rate. For a test sample from a class of task i, if it is predicted to a class of task j and j = i, the task detection is correct for this test instance. The task detection rate is the number of correctly task-detected test instances divided by the total number of test instances. We measure the performance of OOD detection using AUC (Area Under the ROC Curve) averaged over all tasks; AUC is the main measure used in OOD detection papers. We conduct experiments on CIFAR10-5T and CIFAR100-10T.
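The task detection rate can be sketched as below. For illustration only, this assumes each task holds an equal number of contiguous class labels, which is a common benchmark layout but not necessarily how CLOM indexes classes:

```python
def task_of(class_id, classes_per_task):
    # map a global class id to its task id (contiguous, equal-size tasks assumed)
    return class_id // classes_per_task

def task_detection_rate(true_classes, pred_classes, classes_per_task):
    """Fraction of test instances predicted into a class of the correct
    task; the predicted class itself need not be correct."""
    hits = sum(
        task_of(t, classes_per_task) == task_of(p, classes_per_task)
        for t, p in zip(true_classes, pred_classes)
    )
    return hits / len(true_classes)
```

For example, with 2 classes per task, predicting class 1 for a class-0 sample is a wrong classification but still a correct task detection.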

For (1), we use the TIL baseline SupSup as it shows strong TIL performance and, like CLOM, is robust to forgetting. We replace SupSup's task learner with the OOD model in CLOM. Tab. 2 shows that the OOD method in CLOM improves SupSup greatly (SupSup (OOD in CLOM)). This shows that our approach is applicable to other TIL systems.

For (2), we replace CLOM’s OOD method with a weaker OOD method ODIN Liang et al. (2018). We see in Tab. 2 that task detection rate, TIL, and CIL results all drop markedly with ODIN (CLOM (ODIN)).

CLOM without OOD detection. In this case, CLOM uses contrastive learning and data augmentation, but does not use the rotation classes in classification. Note that the rotation classes are basically regarded as OOD data in training and for OOD detection in testing. CLOM (w/o OOD) in Tab. 2 represents this CLOM variant. We see that CLOM (w/o OOD) is much weaker than the full CLOM. This indicates that the improved results of CLOM over baselines are not only due to contrastive learning and data augmentation but also significantly due to OOD detection.

(a) CIFAR100-10T
M | CIL accuracy
0 | 63.3
5 | 64.9
10 | 65.0
15 | 65.1
20 | 65.2

s | F | AUC | CIL
1 | 48.6 | 58.8 | 10.0
100 | 13.3 | 82.7 | 67.7
300 | 8.2 | 83.3 | 72.0
500 | 0.2 | 91.8 | 87.2
700 | 0.1 | 92.2 | 88.0
Table 3: (Left) shows changes of accuracy with the number M of samples saved per class for output calibration on (a) CIFAR100-10T. (Right) shows that a weaker forgetting mechanism (smaller s) results in larger forgetting and lower AUC, and thus lower CIL. For s = 1, the pseudo-step function becomes the standard sigmoid, so parameters are hardly protected. F is the forgetting rate (%) over the 5 tasks.

Effect of the number of saved samples for calibration. Tab. 3 (left) shows that the output calibration remains effective even with a small number of saved samples per class. For both CIFAR100-10T and CIFAR100-20T, CLOM achieves competitive performance using only 5 samples per class. The accuracy improves and the variance decreases as the number of saved samples grows.

Effect of s in Eq. 12 on forgetting of CLOM. CLOM needs a strong forgetting-prevention mechanism to be functional. Using CIFAR10-5T, we show how CLOM performs with different values of s in hard attention (masking). The larger the value, the stronger the protection. Tab. 3 (right) shows that average AUC and CIL accuracy decrease as the forgetting rate increases. This also supports the result in Tab. 2 that SupSup improves greatly with the OOD method in CLOM, as SupSup is also robust to forgetting. PASS and Co2L underperform even though they also use rotation or contrastive loss, because their forgetting-prevention mechanisms are weak.
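To illustrate why s = 1 offers almost no protection while s = 700 nearly freezes parameters, here is a sketch of a HAT-style pseudo-step gate, a sigmoid scaled by s (the exact form of Eq. 12 is in the main paper and may differ in detail):

```python
import math

def pseudo_step(e, s):
    """Pseudo-step gate sigmoid(s * e) applied to an attention embedding
    value e. With s = 1 this is the plain sigmoid, so masks stay soft and
    old-task parameters are barely protected; with large s (e.g., 700) the
    gate is nearly binary and protected parameters are effectively frozen."""
    return 1.0 / (1.0 + math.exp(-s * e))
```

With s = 700, an embedding value of just 0.01 already yields a gate above 0.99, while with s = 1 the same value stays near 0.5, i.e., almost no masking.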

Augmentation | CIFAR10-5T TIL | CIFAR10-5T CIL | CIFAR100-10T TIL | CIFAR100-10T CIL
Hflip | (93.1, 95.3) | (49.1, 72.7) | (77.6, 84.0) | (31.1, 47.0)
Color | (91.7, 94.6) | (50.9, 70.2) | (67.2, 77.4) | (28.7, 41.8)
Crop | (96.1, 97.3) | (58.4, 79.4) | (84.1, 89.3) | (41.6, 60.3)
All | (97.6, 98.7) | (74.0, 88.0) | (88.1, 92.0) | (50.2, 65.2)
Table 4: Accuracy of CLOM variants on CIFAR10-5T and CIFAR100-10T when a single augmentation or all augmentations are applied. Hflip: horizontal flip; Color: color jitter and grayscale; Crop: Inception crop Szegedy et al. (2015). (num1, num2): accuracy without and with rotation.

Effect of data augmentations. For data augmentation, we use three initial augmentations (horizontal flip, color change (color jitter and grayscale), and Inception crop Szegedy et al. (2015)), which are commonly used in contrastive learning. We additionally use rotation to create OOD data in training. To evaluate the contribution of each augmentation when task models are trained sequentially, we train CLOM with one augmentation at a time. We do not report their effects on forgetting as we observe hardly any forgetting (Fig. 2 and Tab. 3). Tab. 4 shows that performance is lower when only a single augmentation is applied; when all augmentations are applied, the TIL/CIL accuracies are higher. Rotation always improves the result when combined with other augmentations. More importantly, with crop and rotation we achieve higher CIL accuracy (79.4/60.3% for CIFAR10-5T/CIFAR100-10T) than with all augmentations but no rotation (74.0/50.2%). This shows the efficacy of rotation in our system.

5 Conclusions

This paper proposed a novel continual learning (CL) method, called CLOM, based on OOD detection and task masking that can perform both task incremental learning (TIL) and class incremental learning (CIL). Whether it is used for TIL or CIL at test time, the training process is the same, which is an advantage over existing CL systems that focus on either CIL or TIL and have limitations on the other problem. Experimental results showed that CLOM outperforms both state-of-the-art TIL and CIL methods by very large margins. In future work, we will study ways to improve both efficiency and accuracy.


Acknowledgments

Gyuhak Kim, Sepideh Esmaeilpour and Bing Liu were supported in part by two National Science Foundation (NSF) grants (IIS-1910424 and IIS-1838770), a DARPA contract (HR001120C0023), a KDDI research contract, and a Northrop Grumman research gift.


References

  • D. Abati, J. Tomczak, T. Blankevoort, S. Calderara, R. Cucchiara, and E. Bejnordi (2020) Conditional channel gated networks for task-aware continual learning. In CVPR, pp. 3931–3940. Cited by: §3, §4.
  • H. Ahn, S. Cha, D. Lee, and T. Moon (2019) Uncertainty-based continual learning with adaptive regularization. In NeurIPS, Cited by: §2.
  • R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In CVPR, Cited by: §2, §3, footnote 7.
  • S. Bulusu, B. Kailkhura, B. Li, P. K. Varshney, and D. Song (2020) Anomalous instance detection in deep learning: a survey. arXiv preprint arXiv:2003.06979. Cited by: §1, §2.
  • P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara (2020) Dark experience for general continual learning: a strong, simple baseline. In NeurIPS, Cited by: §2, §4.
  • R. Camoriano, G. Pasquale, C. Ciliberto, L. Natale, L. Rosasco, and G. Metta (2017) Incremental robot learning of new objects with fixed update time. In ICRA, Cited by: §2.
  • F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari (2018) End-to-end incremental learning. In ECCV, pp. 233–248. Cited by: §2.
  • H. Cha, J. Lee, and J. Shin (2021) Co2L: contrastive continual learning. In ICCV, Cited by: §2, §4.
  • A. Chaudhry, N. Khan, P. K. Dokania, and P. H. S. Torr (2020) Continual learning in low-rank orthogonal subspaces. External Links: 2010.11635 Cited by: §2.
  • A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019) Efficient lifelong learning with a-gem. In ICLR, Cited by: §2, §2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, Cited by: Appendix C, Appendix D, §1, §2, §3, §4, §4, §4.
  • C. de Masson d’Autume, S. Ruder, L. Kong, and D. Yogatama (2019) Episodic memory in lifelong language learning. In NeurIPS, pp. . Cited by: §2.
  • P. Dhar, R. V. Singh, K. Peng, Z. Wu, and R. Chellappa (2019) Learning without memorizing. In CVPR, Cited by: §1, §2.
  • C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra (2017) Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734. Cited by: §2, §2.
  • C. Geng, S. Huang, and S. Chen (2020) Recent advances in open set recognition: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • A. Gepperth and C. Karaoguz (2016) A bio-inspired incremental learning architecture for applied perceptual problems. Cognitive Computation 8 (5), pp. 924–934. Cited by: §2.
  • I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. In NeurIPS, Cited by: §1.
  • T. L. Hayes, K. Kafle, R. Shrestha, M. Acharya, and C. Kanan (2019) REMIND your neural network to prevent catastrophic forgetting. arXiv preprint arXiv:1910.02509. Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §1, §2, §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: Appendix B, Figure 1, §4.
  • D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, pp. 15663–15674. Cited by: §1, §3.
  • S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In CVPR, pp. 831–839. Cited by: §2.
  • W. Hu, Z. Lin, B. Liu, C. Tao, Z. Tao, J. Ma, D. Zhao, and R. Yan (2019) Overcoming catastrophic forgetting for continual learning via model adaptation. In ICLR, Cited by: §2.
  • C. Hung, C. Tu, C. Wu, C. Chen, Y. Chan, and C. Chen (2019) Compacting, picking and growing for unforgetting continual learning. In NeurIPS, Vol. 32. Cited by: §2, §2.
  • H. Jung, J. Ju, M. Jung, and J. Kim (2016) Less-forgetting learning in deep neural networks. arXiv preprint arXiv:1607.00122. Cited by: §2.
  • N. Kamra, U. Gupta, and Y. Liu (2017) Deep Generative Dual Memory Network for Continual Learning. arXiv preprint arXiv:1710.10368. Cited by: §2.
  • Z. Ke, B. Liu, and X. Huang (2020) Continual learning of a mixed sequence of similar and dissimilar tasks. In NeurIPS, Cited by: §2, §2.
  • R. Kemker and C. Kanan (2018) FearNet: Brain-Inspired Model for Incremental Learning. In ICLR, Cited by: §2.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. arXiv preprint arXiv:2004.11362. Cited by: §3.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical Report TR-2009, University of Toronto, Toronto.. Cited by: §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: Appendix B, Figure 1, §4.
  • Y. Le and X. Yang (2015) Tiny imagenet visual recognition challenge. Cited by: Appendix C, §4.
  • K. Lee, K. Lee, J. Shin, and H. Lee (2019) Overcoming catastrophic forgetting with unlabeled data in the wild. In CVPR, Cited by: §2.
  • Z. Li and D. Hoiem (2016) Learning Without Forgetting. In ECCV, pp. 614–629. Cited by: §2, §4.
  • S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, External Links: Link Cited by: §4.
  • Y. Liu, B. Schiele, and Q. Sun (2021) Adaptive aggregation networks for class-incremental learning. In CVPR, Cited by: §2.
  • Y. Liu, Y. Su, A. Liu, B. Schiele, and Q. Sun (2020a) Mnemonics training: multi-class incremental learning without forgetting. In CVPR, Cited by: §4, §4, §4.
  • Y. Liu, S. Parisot, G. Slabaugh, X. Jia, A. Leonardis, and T. Tuytelaars (2020b) More classifiers, less forgetting: a generic multi-classifier paradigm for incremental learning. In ECCV, pp. 699–716. External Links: Document, Link Cited by: §2, §4, §4, §4.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient Episodic Memory for Continual Learning. In NeurIPS, pp. 6470–6479. Cited by: §2, §2.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §4.
  • A. Mallya and S. Lazebnik (2017) PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. arXiv preprint arXiv:1711.05769. Cited by: §2, §2.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • F. Mi, L. Kong, T. Lin, K. Yu, and B. Faltings (2020) Generalized class incremental learning. In CVPR, Cited by: §2.
  • A. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748. Cited by: §1.
  • O. Ostapenko, M. Puscas, T. Klein, P. Jahnichen, and M. Nabi (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In CVPR, pp. 11321–11329. Cited by: §2.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: A review. Neural Networks. Cited by: §1.
  • J. Rajasegaran, M. Hayat, S. Khan, F. S. Khan, L. Shao, and M. Yang (2020a) An adaptive random path selection approach for incremental learning. External Links: 1906.01120 Cited by: §2, §4, §4.
  • J. Rajasegaran, S. Khan, M. Hayat, F. S. Khan, and M. Shah (2020b) ITAML: an incremental task-agnostic meta-learning approach. In CVPR, Cited by: §3, footnote 7.
  • A. Rannen Ep Triki, R. Aljundi, M. Blaschko, and T. Tuytelaars (2017) Encoder based lifelong learning. In ICCV, Cited by: §2.
  • S. Rebuffi, A. Kolesnikov, and C. H. Lampert (2017) iCaRL: Incremental classifier and representation learning. In CVPR, pp. 5533–5542. Cited by: §2, §4, §4, §4.
  • H. Ritter, A. Botev, and D. Barber (2018) Online structured laplace approximations for overcoming catastrophic forgetting. In NeurIPS, Cited by: §2.
  • D. Rolnick, A. Ahuja, J. Schwarz, T. P. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. In NeurIPS, Cited by: §2.
  • M. Rostami, S. Kolouri, and P. K. Pilly (2019) Complementary learning for overcoming catastrophic forgetting using experience replay. In IJCAI, Cited by: §2.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §2.
  • J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. arXiv preprint arXiv:1805.06370. Cited by: §2.
  • A. Seff, A. Beatson, D. Suo, and H. Liu (2017) Continual learning in generative adversarial nets. arXiv preprint arXiv:1705.08395. Cited by: §2.
  • J. Serrà, D. Surís, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. In ICML, Cited by: §1, §2, §2, §4.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In NIPS, pp. 2994–3003. Cited by: §2.
  • P. Singh, V. K. Verma, P. Mazumder, L. Carin, and P. Rai (2020) Calibrating cnns for lifelong learning. NeurIPS. Cited by: §2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, External Links: Link Cited by: Appendix C, §4, §4, Table 4.
  • J. Tack, S. Mo, J. Jeong, and J. Shin (2020) CSI: novelty detection via contrastive learning on distributionally shifted instances. In NeurIPS, Cited by: Appendix C, Appendix D, §1, §2, §3, §3, §4, §4.
  • G. M. van de Ven and A. S. Tolias (2019) Three scenarios for continual learning. arXiv preprint arXiv:1904.07734. Cited by: §1.
  • J. von Oswald, C. Henning, J. Sacramento, and B. F. Grewe (2020) Continual learning with hypernetworks. ICLR. Cited by: §2, §4.
  • M. Wortsman, V. Ramanujan, R. Liu, A. Kembhavi, M. Rastegari, J. Yosinski, and A. Farhadi (2020) Supermasks in superposition. In NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 15173–15184. External Links: Link Cited by: §2, §2, §4.
  • C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. (2018) Memory replay gans: learning to generate new categories without forgetting. In NeurIPS, Cited by: §2.
  • Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu (2019) Large scale incremental learning. In CVPR, Cited by: §2, §4, §4.
  • J. Xu and Z. Zhu (2018) Reinforced continual learning. In NeurIPS, Cited by: §2.
  • J. Yoon, S. Kim, E. Yang, and S. J. Hwang (2020) Scalable and order-robust continual learning with additive parameter decomposition. In ICLR, External Links: Link Cited by: §2.
  • Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: §4.
  • G. Zeng, Y. Chen, B. Cui, and S. Yu (2019) Continuous learning of context-dependent processing in neural networks. Nature Machine Intelligence. Cited by: §4.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In ICML, pp. 3987–3995. Cited by: §2.
  • B. Zhao, X. Xiao, G. Gan, B. Zhang, and S. Xia (2020) Maintaining discrimination and fairness in class incremental learning. In CVPR, pp. 13208–13217. Cited by: §2.
  • F. Zhu, X. Zhang, C. Wang, F. Yin, and C. Liu (2021) Prototype augmentation and self-supervision for incremental learning. In CVPR, Cited by: §2, §4.

Appendix A Average Incremental Accuracy

In the main paper, we reported the accuracy after all tasks have been learned. Here we give the average incremental accuracy. Let A_k be the average accuracy over all tasks seen so far, measured right after task k is learned. The average incremental accuracy is the mean of A_k over all tasks k = 1, ..., T, where T is the last task. It measures the performance of a method throughout the learning process. Tab. 5 shows the average incremental accuracy for the TIL and CIL settings. Figures 3 and 4 plot the TIL and CIL accuracy at each task for every dataset, respectively. We can clearly see that our proposed CLOM and CLOM(-c) outperform all others except on MNIST-5T, where a few systems achieve the same results.
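The definition above can be sketched as follows, given a hypothetical lower-triangular accuracy matrix whose entry [k][j] is the accuracy on task j measured right after task k is learned:

```python
def average_incremental_accuracy(acc_matrix):
    """Mean over k of A_k, where A_k is the average accuracy over tasks
    0..k measured right after task k is learned.

    acc_matrix[k][j]: accuracy on task j (j <= k) after learning task k.
    """
    A = [sum(row[:k + 1]) / (k + 1) for k, row in enumerate(acc_matrix)]
    return sum(A) / len(A)
```

For example, with per-step accuracies [[90], [80, 70], [60, 60, 60]], the per-step averages A_k are 90, 75, and 60, giving an average incremental accuracy of 75.0.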

Method MNIST-5T CIFAR10-5T CIFAR100-10T CIFAR100-20T T-ImageNet-5T T-ImageNet-10T
CIL Systems
OWM 99.9 98.5 87.5 67.9 62.9 41.9 66.8 37.3 26.4 18.7 31.3 17.6
MUC 99.9 87.2 95.2 67.7 80.3 50.5 77.8 32.7 61.1 48.1 56.5 34.8
PASS 99.9 92.0 88.0 63.6 77.3 52.9 78.4 38.0 55.1 39.9 52.2 30.1
LwF.R 99.9 92.8 96.6 70.7 87.6 65.4 90.9 62.8 61.7 48.0 61.3 40.9
iCaRL 99.9 98.0 96.4 74.7 86.9 68.4 88.9 64.5 60.9 50.7 60.0 44.1
Mnemonics 99.9 98.3 96.4 75.2 86.4 67.7 88.9 64.5 61.0 50.7 60.4 44.5
BiC 99.9 95.3 93.9 74.9 88.9 68.7 91.5 61.9 52.7 36.7 57.4 35.5
DER++ 99.9 98.3 94.4 79.3 86.0 67.6 85.7 56.6 62.6 49.7 66.2 46.8
TIL Systems
HAT 99.9 90.8 96.7 73.0 84.3 55.6 85.5 41.8 61.4 48.0 63.1 40.2
HyperNet 99.8 71.5 95.0 63.5 77.0 44.4 82.1 33.8 23.5 13.8 28.8 12.2
SupSup 99.7 45.3 95.5 50.2 86.1 49.5 89.1 28.8 60.6 45.3 63.4 37.3
CLOM(-c) 99.9 97.0 98.7 91.9 92.3 75.4 94.3 70.1 68.5 57.0 72.0 56.1
CLOM 99.9 98.3 98.7 91.9 92.3 75.9 94.3 71.0 68.5 58.6 72.0 56.5
Table 5: Average incremental accuracy. For each dataset, the first number is the TIL accuracy and the second is the CIL accuracy. Numbers in bold are the best results in each column.
Figure 3: TIL performance over number of classes. The dashed lines indicate the methods that do not save any samples from previous tasks. The calibrated version, CLOM, is omitted as its TIL accuracy is the same as CLOM(-c). Best viewed in color.
Figure 4: CIL performance over number of classes. The dashed lines indicate the methods that do not save any samples from previous tasks. Best viewed in color.

Appendix B Network Parameter Sizes

We use an AlexNet-like architecture Krizhevsky et al. (2012) for MNIST and ResNet-18 He et al. (2016) for CIFAR10. For CIFAR100 and Tiny-ImageNet, we use the same ResNet-18 structure as for CIFAR10, but double the number of channels of each convolution in order to learn more tasks.

We use the same backbone architecture for CLOM and the baselines, except for OWM and HyperNet, for which we use the architectures from their original papers. OWM uses an AlexNet-like structure for all datasets; OWM has difficulty working with ResNet-18 because it is not obvious how to handle batch normalization in OWM. HyperNet uses a fully-connected network for MNIST and ResNet-32 for the other datasets. We found it very hard to change HyperNet's network because its initialization requires some arguments that were not explained in the paper. In Tab. 6, we report the network parameter sizes after the final task in each experiment has been trained.

Due to the hard attention embeddings and task-specific heads, CLOM requires task-specific parameters for each task. For MNIST, CIFAR10, CIFAR100-10T, CIFAR100-20T, Tiny-ImageNet-5T, and Tiny-ImageNet-10T, we add task-specific parameters of size 7.7K, 17.6K, 68.0K, 47.5K, 191.0K, and 109.0K, respectively, after each task. Contrastive learning also introduces task-specific parameters through the projection function, but these can be discarded at deployment as they are not needed for inference or testing.

Method MNIST-5T CIFAR10-5T CIFAR100-10T CIFAR100-20T T-ImageNet-5T T-ImageNet-10T
OWM 5.27 5.27 5.36 5.36 5.46 5.46
MUC-LwF 1.06 11.19 45.06 45.06 45.47 45.47
PASS 1.03 11.17 44.76 44.76 44.86 44.86
LwF.R 1.03 11.17 44.76 44.76 44.86 44.86
iCaRL 1.03 11.17 44.76 44.76 44.86 44.86
Mnemonics 1.03 11.17 44.76 44.76 44.86 44.86
BiC 1.03 11.17 44.76 44.76 44.86 44.86
DER++ 1.03 11.17 44.76 44.76 44.86 44.86
HAT 1.04 11.23 45.01 45.28 44.97 45.11
HyperNet 0.48 0.47 0.47 0.47 0.48 0.48
SupSup 0.58 11.16 44.64 44.64 44.67 44.65
CLOM 1.07 11.25 45.31 45.58 45.59 45.72
Table 6: Number of network parameters (million) after the final task has been learned.

Appendix C Details about Augmentations

We follow Chen et al. (2020); Tack et al. (2020) for the choice of data augmentations. We first apply horizontal flip, color change (color jitter and grayscale), and Inception crop Szegedy et al. (2015), and then four rotations (0°, 90°, 180°, and 270°). The details of each augmentation are the following.

Horizontal flip: we flip an image horizontally with 50% probability;

color jitter: we add noise to an image to change its brightness, contrast, and saturation with 80% probability;

grayscale: we convert an image to grayscale with 20% probability;

Inception crop: we uniformly choose a resize factor from 0.08 to 1.0 for each image, crop an area of that relative size, and resize it to the original image size;

rotation: we rotate an image by 0°, 90°, 180°, and 270°.

In Fig. 5, we give an example of each augmentation using an image from Tiny-ImageNet Le and Yang (2015).

(a) Original
(b) Hflip
(c) Color jitter
(d) Grayscale
(e) Crop
(f) Rotation
Figure 5: An original image and its view after each augmentation. Hflip and Crop refer to horizontal flip and Inception crop, respectively.
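The rotation augmentation above can be generated as in this minimal sketch; it uses pure-Python 2-D lists for illustration, whereas the actual implementation operates on image tensors:

```python
def four_rotations(img):
    """Return the 0-, 90-, 180-, and 270-degree rotations of a 2-D image
    given as a list of rows. In training, the rotated copies of an
    in-distribution image serve as extra rotation classes, i.e., as OOD
    data with respect to the original classes."""
    views = [img]
    cur = img
    for _ in range(3):
        # 90-degree clockwise rotation: reverse the rows, then transpose
        cur = [list(row) for row in zip(*cur[::-1])]
        views.append(cur)
    return views
```

For a 2x2 image [[1, 2], [3, 4]], the 180-degree view is [[4, 3], [2, 1]], as expected from rotating twice by 90 degrees.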

Appendix D Hyper-parameters

Here we report the hyper-parameters that we could not include in the main paper due to space limitations. We use the values chosen by Chen et al. (2020); Tack et al. (2020) to save time on hyper-parameter search. We first train the feature extractor and projection function, and then fine-tune the classifier, using the epoch numbers from those papers. The parameter s for the pseudo-step function in Eq. 7 of the main paper is set to 700. The temperature in the contrastive loss also follows those papers, and the resize factor for Inception crop ranges from 0.08 to 1.0.

For the other hyper-parameters in CLOM, we use 10% of the training data as validation data and select the set of hyper-parameters that gives the highest CIL accuracy on the validation set. We train the output calibration parameters with learning rate 0.01 and batch size 32. The following are experiment-specific hyper-parameters found with this search.

  • For MNIST-5T, batch size = 256, the hard attention regularization hyper-parameters are , and .

  • For CIFAR10-5T, batch size = 128, the hard attention regularization hyper-parameters are , and .

  • For CIFAR100-10T, batch size = 128, the hard attention regularization hyper-parameters are , and .

  • For CIFAR100-20T, batch size = 128, the hard attention regularization hyper-parameters are , and .

  • For Tiny-ImageNet-5T, batch size = 128, the hard attention regularization hyper-parameters are .

  • For Tiny-ImageNet-10T, batch size = 128, the hard attention regularization hyper-parameters are , and .

We do not search hyper-parameters for each task separately. However, we found that using a larger regularization value for the first task than for later tasks results in better accuracy. This is because, by definition, the hard attention regularizer gives a lower penalty in earlier tasks than in later tasks. Using a larger value for task 1 encourages greater sparsity there and yields similar penalty values across tasks.

For the baselines, we use the best hyper-parameters reported in their original papers or in their code. If some hyper-parameters are unknown, e.g., the baseline did not use a particular dataset, we search for the hyper-parameters as we do for CLOM.

We obtain the results by running the following code: