Online Continual Learning on Class Incremental Blurry Task Configuration with Anytime Inference

Despite rapid advances in continual learning, a large body of research is devoted to improving performance in the existing setups. While a handful of work do propose new continual learning setups, they still lack practicality in certain aspects. For better practicality, we first propose a novel continual learning setup that is online, task-free, class-incremental, of blurry task boundaries and subject to inference queries at any moment. We additionally propose a new metric to better measure the performance of the continual learning methods subject to inference queries at any moment. To address the challenging setup and evaluation protocol, we propose an effective method that employs a new memory management scheme and novel learning techniques. Our empirical validation demonstrates that the proposed method outperforms prior arts by large margins.


1 Introduction

Continual learning (CL) is a learning scenario where a model learns from a continuous and online stream of data; it is regarded as a more realistic and practical learning setup than offline learning on a fixed dataset (he20incremental). However, many CL methods still focus on the offline setup (ewc; icarl; saha2021gradient) instead of the more realistic online setup. These methods assume access to a large storage, storing the entire data of the current task and iterating over it multiple times. In contrast, we are interested in the more realistic online setup where only a small memory is allowed as storage. Meanwhile, even the online CL methods have room for more practical and realistic improvements in several crucial aspects. These aspects include the assumed class distributions, such as the disjoint (icarl) or the blurry (aljundi2019gradient) splits, and the evaluation metric that focuses only on accuracy at task transitions, such as the average task accuracy (A_avg).

The two main assumptions on the class distributions in existing CL setups, i.e., the disjoint and blurry splits, are less realistic for the following reasons. The disjoint split assumes no classes overlap over different tasks; already observed classes will never appear again. This assumption is not plausible because already observed classes can still appear later on in real-world scenarios (see Fig. 2 of (bang2021rainbow)). On the other hand, in the blurry split (aljundi2019gradient) no new classes appear after the first task even though the split assumes overlapping classes over tasks. This is also not plausible as observing new classes is common in real-world scenarios.

The typical evaluation metric, such as A_avg, in which the accuracy is measured only at task transitions, is also less realistic. It implicitly assumes that no inference queries occur in the middle of a task. However, in real-world scenarios, inference queries can occur at any time. Moreover, there is no explicit task transition boundary in most real-world scenarios. Thus, it is desirable for CL models to provide good inference results at any time. To accurately evaluate whether a CL model is effective at such 'any-time' inference, we need a new metric for CL models.

In order to address the issues of the current CL setups, we propose a new CL setup that is more realistic and practical by considering the following criteria. First, the class distribution combines the advantages of both the blurry and disjoint splits. That is, we assume that the model continuously encounters new classes as tasks continue, i.e., class-incremental, and that classes overlap across tasks, i.e., blurry task boundaries, while not suffering from the restrictions of either split. Second, the model is evaluated throughout training and inference such that it can be evaluated for any-time inference. We call this new continual learning setup 'i-Blurry'.

For the i-Blurry setup, we first propose a plausible baseline that employs experience replay (ER) with reservoir sampling and learning rate scheduling tuned for the online and task-free CL setting. While existing online CL methods are applicable to the i-Blurry setup, they perform only marginally better than our baseline, and often worse.

To better handle the i-Blurry setup, we propose a novel continual learning method that improves the baseline in three aspects. We design a new memory management scheme that discards samples using a per-sample importance score reflecting how useful a sample is for training. We then propose to draw training samples only from the memory instead of drawing them from both the memory and the online stream as is done in ER. Finally, we propose a new learning rate scheduling scheme that adaptively decides whether to increase or decrease the learning rate based on the loss trajectory, i.e., in a data-driven manner. To evaluate algorithms in the new setup, we use conventional metrics and further define a new metric called 'area under the curve of accuracy' (A_AUC), which measures the model's accuracy throughout training.

We summarize our contributions as follows:


  • Proposing a new CL setup called i-Blurry, which addresses a more realistic setting that is online, task-free, class-incremental, of blurry task boundaries, and subject to any-time inference.

  • Proposing a novel online and task-free CL method by a new memory management, memory usage, and learning rate scheduling strategy.

  • Outperforming existing CL models by large margins on multiple datasets and settings.

  • Proposing a new metric to better measure a CL model’s capability for the desirable any-time inference.

2 Related Work

Continual learning setups. There are many CL setups that have been proposed to reflect the real-world scenario of training a learning model from a stream of data (Prabhu2020GDumbAS). We categorize them in the following aspects for brevity.

First, we categorize them into (1) task-incremental learning (task-IL) and (2) class-incremental learning (class-IL), depending on whether the task-ID is given at test time. Task-IL, also called the multi-head setup, assumes that the task-ID is given at test time (LopezPaz2017GradientEM; Aljundi2018MemoryAS; AGEM). In contrast, in class-IL, or the single-head setup, the task-ID is not given at test time and has to be inferred (icarl; bic; aljundi2019online). Class-IL is more challenging than task-IL, but is also more realistic since the task-ID will likely not be given in real-world scenarios (Prabhu2020GDumbAS). Most CL works assume that the task-ID is provided at training time, allowing CL methods to use it to save model parameters at task boundaries (ewc; rwalk) for later use. However, this assumption is impractical (lee2019neural) since real-world data usually do not have clear task boundaries. To address this issue, a task-free setup (aljundi2019task), where the task-ID is not available at training time, has been proposed. We focus on the task-free setup as it is challenging and has been actively investigated recently (kim20PRS; lee2019neural; aljundi2019gradient).

We then categorize CL setups into the disjoint and blurry setups by how the data split is configured. In the disjoint task setup, each task consists of a set of classes disjoint from all other tasks. But the disjoint setup is less realistic, as classes in the real world can appear at any time, not only in a disjoint manner. Recently, to make the setup more realistic, a blurry task setup has been proposed and investigated (aljundi2019gradient; Prabhu2020GDumbAS; bang2021rainbow), where (100-M)% of the samples are from the dominant classes of the task and M% of the samples are from all classes, where M is the blurry level (aljundi2019gradient). However, the blurry setup assumes no class is added in new tasks, i.e., it is not class-incremental, which makes the setup still not quite realistic.

Finally, depending on how many samples are streamed at a time, we categorize CL setups into online (ER; aljundi2019online; AGEM) and offline (bic; icarl; rwalk; castro2018eccv). In the offline setup, all data from the current task can be used an unlimited number of times. This is impractical since it requires additional memory of size equal to the current task's data. For the online setup, there are multiple notions of 'online' that differ across the literature. Prabhu2020GDumbAS; bang2021rainbow use 'online' to mean that each streamed sample is used only once to train a model, while aljundi2019gradient; aljundi2019online use it to mean that only one or a few samples are streamed at a time. We follow the latter, as the former allows storing the whole task's data, which is similar to the offline setup and less realistic than using each sample more than a few times.

Incorporating the pros and cons of existing setups, we propose a novel CL setup that is online, task-free, class-incremental, of blurry task boundaries, and subject to any-time inference as the most realistic setup for continual learning.

Continual learning methods. Given that neural networks suffer from catastrophic forgetting (mccloskeyC89; ratcliff90), the online nature of streaming data in continual learning generally aggravates the issue. To alleviate forgetting, there are various proposals for storing previous task information: (1) regularization, (2) replay, and (3) parameter isolation.

(1) Regularization methods (ewc; Zenke2017ContinualLT; Lee2017OvercomingCF; ebrahimi20uncertainty) store previous task information in the form of model priors and use it for regularizing the neural network currently being trained. (2) Replay methods store a subset of the samples from previous tasks in an episodic memory (icarl; castro2018eccv; AGEM; bic) or keep a generative model that is trained to generate previous task samples (Shin2017ContinualLW; Wu2018MemoryRG; hu2018overcoming; Cong2020GANMW). The stored or generated exemplars are replayed on future tasks and used for distillation, constrained training, or joint training. (3) Parameter isolation methods augment the networks (Rusu2016ProgressiveNN; Lee2017LifelongLW; aljundi17expert) or decompose the network into subnetworks for each task (mallya17packnet; cheung19superposition; yoon2020apd).

Since (1), (2), and (3) utilize different ways of storing information, incurring parameter storage costs, episodic memory requirements, and an increase in network size, respectively, a fair comparison among the methods is not straightforward. We mostly compare our method with episodic memory-based methods (bic; aljundi2019online; bang2021rainbow), as they perform the best in various CL setups, but also with methods that use both regularization and episodic memory (rwalk; bic).

Online continual learning. Despite being more realistic (losing18incremental; he20incremental), online CL setups have not been popular (Prabhu2020GDumbAS) due to the difficulty and the subtle differences among the setups in the published literature. ER (ER) is a simple yet strong episodic memory-based online CL method. It employs reservoir sampling for memory management and jointly trains a model with half of the batch sampled from memory. Many online CL methods are based on ER (aljundi2019gradient; aljundi2019online). GSS (aljundi2019gradient) selects samples using a score based on the cosine similarity of gradients. MIR (aljundi2019online) retrieves maximally interfering samples from memory to use for training. Different from ER, A-GEM (AGEM) uses the memory to enforce constraints on the loss trajectory of the stored samples. GDumb (Prabhu2020GDumbAS) only updates the memory during the training phase and trains from scratch at test time using only the memory.

Unlike these methods, the recently proposed RM (bang2021rainbow) uses uncertainty-based memory sampling and a two-stage training scheme, where the model is trained for one epoch on the streamed samples and then trains extensively using only the memory at the end of each task, effectively delaying most of the learning to the end of the tasks. Note that the uncertainty-based memory sampling cannot be implemented in the online CL setup, and the two-stage training performs particularly poorly in our i-Blurry setup. Our method outperforms all other online CL methods introduced in this section while strictly adhering to the online and task-free restrictions.

3 Proposed Continual Learning Setup: i-Blurry

For a more realistic and practical CL setup that reflects real-world scenarios, we strictly adhere to the online and task-free CL setting (lee2019neural; losing18incremental). Specifically, we propose a novel CL setup (named i-Blurry) with two characteristics: 1) the class distribution is class-incremental with blurry task boundaries, and 2) inference queries are allowed at any time.

Figure 1: i-Blurry-N-M split. N% of the classes are partitioned into the disjoint set and the rest into the blurry set, where M denotes the blurry level (aljundi2019gradient). To form the i-Blurry-N-M task splits, we draw training samples uniformly from the 'disjoint' set or the 'blurry' set (aljundi2019gradient). The blurry classes always appear over the tasks while the disjoint classes appear gradually.

i-Blurry-N-M Split. We partition the classes into two groups, where N% of the classes are disjoint and the remaining (100-N)% of the classes are used for blurry sampling (aljundi2019gradient), where M is the blurry level. Once we determine the partition, we draw samples from the partitioned groups. We call the resulting sequence of tasks the i-Blurry-N-M split. The i-Blurry-N-M splits feature both class-incremental learning and blurry task boundaries. Note that the i-Blurry-N-M splits generalize previous CL setups. For instance, N=100 is the disjoint split as there are no blurry classes, N=0 is the blurry split (aljundi2019gradient) as there are no disjoint classes, and M=0 is also the disjoint split as the blurry level is 0 (bang2021rainbow). We use multiple i-Blurry-N-M splits for reliable empirical validation and share the splits for reproducible research. Fig. 1 illustrates the i-Blurry-N-M split.
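To make the split construction concrete, below is a minimal sketch of assigning labeled samples to tasks following the i-Blurry-N-M idea. It is only an illustration of the description above, not the released split-generation code; the function name, the round-robin assignment of disjoint classes, and the way minor blurry samples are scattered are assumptions.

```python
import random
from collections import defaultdict

def make_iblurry_split(samples, num_tasks, N, M, seed=0):
    """Sketch: assign (x, y) pairs to tasks so that N% of classes are disjoint
    (class-incremental) and the rest are blurry with blurry level M."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))

    classes = sorted(by_class)
    rng.shuffle(classes)
    n_disjoint = round(len(classes) * N / 100)
    disjoint_classes, blurry_classes = classes[:n_disjoint], classes[n_disjoint:]

    tasks = [[] for _ in range(num_tasks)]
    # Disjoint classes: each class belongs to exactly one task, so new classes
    # keep appearing as tasks progress (round-robin assignment is an assumption).
    for i, c in enumerate(disjoint_classes):
        tasks[i % num_tasks].extend(by_class[c])
    # Blurry classes: each class has a "major" task holding most of its samples,
    # while roughly M% of its samples are scattered over the tasks as minor samples.
    for i, c in enumerate(blurry_classes):
        major = i % num_tasks
        for s in by_class[c]:
            if rng.random() < M / 100:
                tasks[rng.randrange(num_tasks)].append(s)  # minor (blurry) sample
            else:
                tasks[major].append(s)                     # major sample
    for t in tasks:
        rng.shuffle(t)  # the stream within a task is unordered
    return tasks
```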


(a) Rainbow Memory (RM) (bang2021rainbow)                                       (b) CLIB

Figure 2: Comparison of A_avg with A_AUC. (a) Online version of RM (bang2021rainbow); (b) proposed CLIB. The two-stage method delays most of the training to the end of the task. The accuracy-to-{# of samples} plot shows that our method is more effective at any-time inference than the two-stage method. The difference in A_avg for the two methods is much smaller than the difference in A_AUC, implying that A_AUC captures the effectiveness at any-time inference better.

A New Metric: Area Under the Curve of Accuracy (A_AUC). The average accuracy A_avg (i.e., A_avg = (1/T) \sum_{i=1}^{T} A_i, where A_i is the accuracy at the end of the i-th task and T is the number of tasks) is one of the most widely used measures in continual learning. But A_avg only tells us how good a CL model is at the few discrete moments of task transitions (T times for most CL setups), even though the model could be queried at any time. Thus, a CL method could be poor at any-time inference, yet A_avg may be insufficient to reveal that due to its temporal sparsity of measurement. For example, Fig. 2 compares the online version of RM (bang2021rainbow), which conducts most of the training by iterating over the memory at the end of a task, with our method. RM is shown as it is a very recent method that performs well on A_avg but is particularly poor at any-time inference. Evaluating only with A_avg might give the false sense that the difference between the two methods is not severe. However, the accuracy-to-{# of samples} curve reveals that our method maintains much more consistently high accuracy during training, implying that our method is more suitable for any-time inference than RM. To alleviate the limitations of A_avg, we shorten the accuracy measuring interval to every Δn samples observed instead of measuring only at discrete task transitions. The new metric corresponds to the area under the curve (AUC) of the accuracy-to-{# of samples} plot. We call it the area under the curve of accuracy (A_AUC):

    A_{\text{AUC}} = \sum_{k} f(k \cdot \Delta n) \cdot \Delta n        (1)

where the step size Δn is the number of samples observed between inference queries and f is the curve in the accuracy-to-{# of samples} plot. A high A_AUC corresponds to a CL method that consistently maintains high accuracy throughout training.
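As a concrete reference, the following sketch computes A_avg and A_AUC from logged accuracies. Normalizing the area by the total number of streamed samples, so that A_AUC stays on the same 0-100 scale as accuracy, is an assumption made here for readability.

```python
def a_avg(task_end_accuracies):
    # A_avg: mean accuracy measured at the end of each of the T tasks.
    return sum(task_end_accuracies) / len(task_end_accuracies)

def a_auc(accuracies, delta_n):
    # A_AUC: Riemann sum of the accuracy-to-{# of samples} curve f, where
    # accuracies[k] is the accuracy evaluated after (k + 1) * delta_n samples.
    area = sum(acc * delta_n for acc in accuracies)
    # Normalization by the total number of streamed samples (an assumption).
    return area / (len(accuracies) * delta_n)
```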

The large difference in A_AUC between the two methods in Fig. 2 implies that delaying strategies such as the two-stage training scheme are not effective for any-time inference, a conclusion that is harder to deduce from A_avg alone.

4 Method

4.1 A Baseline for i-Blurry Setup

To address the realistic i-Blurry setup, we first establish a baseline by considering a memory management policy and a learning rate scheduling scheme for the challenging online and task-free i-Blurry setup. For the memory management policy, we use reservoir sampling (reservoir) as it is widely used in the online and task-free setups with good performance. Note that in online CL, memory management policies that use the entire task’s samples at once, such as herding selection (icarl), mnemonics (mnemonics), and rainbow memory (bang2021rainbow) are inapplicable. For the memory usage, we use the experience replay (ER) method of drawing half of the training batch from the stream and the other half from the memory, following a large number of online CL methods based on ER with good performance (mai2021online).
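For reference, a minimal sketch of the reservoir sampling update used by the baseline; with it, the memory stays a uniform random subsample of all samples streamed so far.

```python
import random

def reservoir_update(memory, new_sample, n_seen, mem_size, rng=random):
    """Reservoir sampling: keep the memory a uniform subsample of the stream.
    `n_seen` is the number of samples streamed before `new_sample`."""
    if len(memory) < mem_size:
        memory.append(new_sample)
    else:
        j = rng.randrange(n_seen + 1)  # uniform in {0, ..., n_seen}
        if j < mem_size:
            memory[j] = new_sample     # replace a random slot
        # otherwise the new sample is discarded
```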

For LR scheduling, other CL methods either (1) use exponential decay (icarl; ewc; mirzadeh2020understanding) or (2) use a constant LR. We do not use (1) as it is hyper-parameter sensitive; a decay rate that works for CIFAR10 decays the LR too quickly for larger datasets such as CIFAR100. If the LR decays too fast, it becomes too small to learn about new classes introduced later. Thus, we use an exponential LR schedule but reset the LR when a new class is encountered. Compared with a constant LR, this gives slightly better performance for EWC++ and our baseline on CIFAR10, as shown in Table 8. We denote this LR schedule as exponential with reset and use it in our baseline.
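A small sketch of the exponential with reset schedule described above, assuming a PyTorch-style optimizer with param_groups; the class and attribute names are illustrative.

```python
class ExponentialWithReset:
    """Exponentially decay the LR each step, but reset it to the initial LR
    whenever a previously unseen class appears in the stream."""
    def __init__(self, optimizer, init_lr, gamma):
        self.optimizer, self.init_lr, self.gamma = optimizer, init_lr, gamma
        self.lr = init_lr
        self.seen_classes = set()

    def step(self, labels):
        new_class = any(int(y) not in self.seen_classes for y in labels)
        self.seen_classes.update(int(y) for y in labels)
        self.lr = self.init_lr if new_class else self.lr * self.gamma
        for group in self.optimizer.param_groups:
            group["lr"] = self.lr
```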

Note that the above baseline still has room to improve. The reservoir sampling does not consider whether one sample could be more useful for training than another. ER uses samples from the stream directly, which can skew the training of CL models toward recently observed samples. While the exponential with reset schedule does increase the LR periodically, the sudden changes may disrupt the CL model. Thus, we discuss how we can improve the baseline in the following sections. The final method with all the improvements is illustrated in Fig. 3.

Figure 3: Overview of the proposed CLIB. We compute sample-wise importance during training to manage our memory. Note that we only draw training samples from the memory whereas ER based methods draw them from both the memory and the online stream.

4.2 Sample-wise Importance Based Memory Management

In reservoir sampling, samples are removed from the memory at random. Arguing that treating all samples as equal in the removal process decreases training efficacy, we propose a new sampling strategy that removes samples from the memory based on a sample-wise importance, following Theorem 1.

Theorem 1.

Let C = \mathcal{M} \cup \{(x_{\text{new}}, y_{\text{new}})\}, where \mathcal{M} is the memory from the previous time step and (x_{\text{new}}, y_{\text{new}}) is the newly encountered sample, let \ell be the loss and \theta be the model parameters. Assuming that the model trained with the optimal memory \mathcal{M}^* induces the maximal loss decrease on C, the optimal memory is given by \mathcal{M}^* = C \setminus \{(x^*, y^*)\} with

    (x^*, y^*) = \operatorname*{argmin}_{(x, y) \in C} \mathbb{E}\left[\ell\left(C; \theta_{C \setminus \{(x, y)\}}\right)\right],        (2)

where \theta_S denotes the model parameters after training on a set S.

Proof.

Please see Sec. A.1 for the proof. ∎

We solve Eq. 2 by keeping track of the sample-wise importance H_i of each memory sample i,

    H_i = \mathbb{E}\left[\ell(\mathcal{M}; \theta) - \ell(\mathcal{M}; \theta_i)\right],        (3)

where \theta_i denotes the model parameters after a training step that uses sample i. Intuitively, H_i is the expected loss decrease when the associated sample is used for training. We update the H_i associated with the samples used for training after every training iteration.

For computational efficiency, we use empirical estimates instead of precisely computing the expectation. The empirical estimates are calculated by a discounted sum of the differences between the actual loss decrease and the predicted loss decrease (see Alg. 1). The memory is updated whenever a new sample is encountered, and the full process is given in Alg. 2. This memory management strategy significantly outperforms reservoir sampling, especially with memory only training.

4.3 Memory Only Training

ER uses joint training where half of the training batch is obtained from the online stream and the other half from memory. However, we argue that using the streamed samples directly makes the model easily affected by them, skewing the training toward the most recent samples. Thus, we propose to train using samples only from the memory, without using the streamed samples directly. The memory acts as a distribution stabilizer for the streamed samples through the memory update process (see Sec. 4.2), and training batches are drawn from the memory alone. We observe that memory only training improves performance despite its simplicity. Note that this is different from GDumb (Prabhu2020GDumbAS), as we train with the memory during the online stream whereas GDumb does not.
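The difference between the ER batch composition and the memory only training used here can be summarized in a few lines; the helper names are illustrative, and the streamed samples are assumed to have already gone through the memory update of Sec. 4.2.

```python
import random

def er_train_batch(stream_samples, memory, batch_size, rng=random):
    # ER: half of the training batch comes directly from the online stream,
    # the other half is drawn from the episodic memory.
    half = batch_size // 2
    return list(stream_samples[:half]) + rng.sample(memory, min(half, len(memory)))

def memory_only_train_batch(memory, batch_size, rng=random):
    # CLIB-style memory only training: the whole batch is drawn from memory,
    # so recent streamed samples influence training only via the memory.
    return rng.sample(memory, min(batch_size, len(memory)))
```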

4.4 Adaptive Learning Rate Scheduling

The exponential with reset scheduler resets the LR to the initial value when a new class is encountered. As the reset occurs regardless of the current LR value, it could result in a large change in LR value. We argue that such abrupt changes may harm the knowledge learned from previous samples.

Instead, we propose a new data-driven LR scheduling scheme that adaptively increases or decreases the LR based on how well the current LR optimizes over the memory. Specifically, it decides whether increasing or decreasing the LR is better in a data-driven manner. The loss decrease is measured for a high and a low value of the LR, and the direction in which the loss decreased more is chosen using a Student's t-test. The LR is then altered in the chosen direction by a small amount. We depict this scheme in Alg. 3 in Sec. A.3.

With all the proposed components, we call our method Continual Learning for i-Blurry (CLIB).

5 Experiments

Experimental Setup. We use the CIFAR10, CIFAR100, and TinyImageNet datasets for empirical validation. We use the i-Blurry setup with N=50 and M=10 (i-Blurry-50-10) for our experiments unless otherwise stated. All results are averaged over 3 independent runs. For metrics, we use the average accuracy (A_avg) and the proposed A_AUC (see Sec. 3). Additional discussion with other metrics, such as the forgetting measure, can be found in Sec. A.8.

Implementation Details. For all methods, we fix the batch size and the number of updates per streamed sample whenever possible. For CIFAR10, we use a batch size of 16 and 1 update per streamed sample. When using ER, this translates to 8 updates using the same streamed batch, since each batch contains 8 streamed samples. For CIFAR100, we use a batch size of 16 and 3 updates per streamed sample. For TinyImageNet, we use a batch size of 32 and 3 updates per streamed sample.

We use ResNet-18 as the model for CIFAR10 and ResNet-34 for CIFAR100 and TinyImageNet. For all methods, we apply AutoAugment (AutoAugment) and CutMix (yun2019cutmix) as data augmentation, following RM (bang2021rainbow). For the memory size, we use 500, 2000, and 4000 for CIFAR10, CIFAR100, and TinyImageNet, respectively. Additional analysis on the sample memory size can be found in Sec. A.5. The Adam optimizer with an initial LR of 0.0003 is used. The exponential with reset LR schedule is applied for all methods except ours and GDumb, with the decay rate chosen separately for the CIFAR datasets and for TinyImageNet. Ours uses the adaptive LR schedule for all datasets, while GDumb (Prabhu2020GDumbAS) and RM (bang2021rainbow) follow the original settings in their respective papers. All code and i-Blurry-N-M splits (see Supp.) will be publicly released.

Baselines. We compare our method with both online CL methods and methods that can be extended to the online setting: EWC++ (rwalk), BiC (bic), GDumb (Prabhu2020GDumbAS), A-GEM (AGEM), MIR (aljundi2019online), and RM (bang2021rainbow). RM uses the two-stage training scheme (bang2021rainbow). For details of the online versions of these methods, see Sec. A.4. Note that A-GEM performs particularly poorly (also observed in (Prabhu2020GDumbAS; mai2021online)) as A-GEM was designed for the task-incremental setup while our setting is task-free. We discuss the comparisons to A-GEM in Sec. A.6.

Methods | CIFAR10 (A_AUC / A_avg) | CIFAR100 (A_AUC / A_avg) | TinyImageNet (A_AUC / A_avg)
EWC++ (ewc) | 57.06±1.63 / 59.96±3.18 | 35.26±1.82 / 39.35±1.50 | 20.63±1.02 / 22.89±1.11
BiC (bic) | 52.35±1.85 / 54.94±1.81 | 26.26±0.92 / 29.90±1.03 | 16.68±1.95 / 20.19±1.35
ER-MIR (aljundi2019online) | 57.65±2.49 / 61.15±3.74 | 34.49±2.29 / 37.95±2.05 | 21.07±0.89 / 22.89±0.71
GDumb (Prabhu2020GDumbAS) | 53.20±1.93 / 55.27±2.69 | 32.84±0.45 / 34.03±0.89 | 18.17±0.19 / 18.69±0.45
RM (bang2021rainbow) | 22.77±1.55 / 60.73±1.03 | 8.62±0.64 / 32.09±0.99 | 3.67±0.03 / 11.35±0.12
Baseline-ER (Sec. 4.1) | 56.96±1.97 / 60.22±3.04 | 34.26±1.66 / 38.99±1.31 | 21.37±1.01 / 23.65±1.22
CLIB (Ours) | 70.32±1.22 / 74.00±0.72 | 46.73±1.13 / 49.53±1.18 | 23.78±0.66 / 24.79±0.21
Table 1: Comparison of various online CL methods on the i-Blurry setup for CIFAR10, CIFAR100, and TinyImageNet. CLIB outperforms all other CL methods by large margins on both the A_AUC and the A_avg.

(a) CIFAR10                               (b) CIFAR100                               (c) TinyImageNet

Figure 4: Accuracy-to-{number of samples} for various CL methods on CIFAR10, CIFAR100, and TinyImageNet. Our CLIB is consistent at maintaining high accuracy throughout inference while other CL methods are not as consistent.

5.1 Results on the i-Blurry Setup

In all our experiments, we denote the best result for each of the metrics in bold.

We first compare various online CL methods in the i-Blurry setup on CIFAR10, CIFAR100, and TinyImageNet in Table 1. On CIFAR10, the proposed CLIB outperforms all other CL methods by large margins in both A_AUC and A_avg. On CIFAR100, CLIB also outperforms all other methods by large margins in both metrics. On TinyImageNet, all methods score much lower; the i-Blurry setup for TinyImageNet is challenging as it has 200 classes instead of the 100 of CIFAR100 but the same number of samples per class, and the total training duration and task length are doubled compared to CIFAR10 and CIFAR100, which makes continual learning more difficult. Nonetheless, CLIB outperforms the other CL methods in both A_AUC and A_avg. Since CLIB uses the memory only training scheme, the distribution of the training samples is stabilized through the memory (see Sec. 4.3), which is helpful in the i-Blurry setup where the model encounters samples from more varied classes.

Note that Rainbow Memory (RM) (bang2021rainbow) exhibits a very different trend from the other compared methods. On TinyImageNet, it performs poorly. We conjecture that the delayed learning from the two-stage training is particularly detrimental on larger datasets with longer training durations and tasks, such as TinyImageNet. On CIFAR10 and CIFAR100, RM performs reasonably well in A_avg but poorly in A_AUC. This verifies that its two-stage training method delays most of the learning to the end of each task, resulting in poor any-time inference performance as measured by A_AUC. Note that A_avg fails to capture this; on CIFAR10, the difference in A_avg between CLIB and RM is around 13%, similar to the other methods, but the difference in A_AUC is around 48%, noticeably larger than for the other methods. Similar trends can be found on CIFAR100 as well.

We also show the accuracy-to-{# of samples} curves for the CL methods on CIFAR10, CIFAR100, and TinyImageNet for a comprehensive analysis throughout training in Fig. 4. Interestingly, RM shows surges in accuracy only at the task transitions due to its two-stage training method, and its accuracy is overall low. Additionally, GDumb shows a decreasing trend in accuracy as tasks progress, unlike the other methods. This is because the 'Dumb Learner' trains from scratch at every inference query, leading to accuracy degradation. In contrast, CLIB not only outperforms the other methods but also shows the most consistent accuracy at all times.

Varying N (M = 10)
Methods | N=0, Blurry (A_AUC / A_avg) | N=50, i-Blurry (A_AUC / A_avg) | N=100, Disjoint (A_AUC / A_avg)
EWC++ | 52.32±0.76 / 56.41±1.18 | 57.06±1.63 / 59.96±3.18 | 77.57±2.18 / 78.34±2.33
BiC | 45.73±0.86 / 49.08±1.40 | 52.35±1.85 / 54.94±1.81 | 76.47±0.87 / 78.84±2.06
ER-MIR | 51.79±1.25 / 56.84±0.34 | 57.65±2.49 / 61.15±3.74 | 76.56±1.78 / 78.36±2.25
GDumb | 45.86±0.80 / 46.37±2.09 | 53.20±1.93 / 55.27±2.69 | 65.27±1.54 / 66.74±2.39
RM | 20.81±2.85 / 50.61±2.21 | 22.77±1.55 / 60.73±1.03 | 34.06±4.80 / 64.01±5.33
Baseline-ER | 52.55±0.71 / 56.84±1.38 | 56.96±1.97 / 60.22±3.04 | 77.65±2.17 / 78.43±2.29
CLIB (Ours) | 68.64±0.66 / 72.99±0.68 | 70.32±1.22 / 74.00±0.72 | 78.37±2.09 / 78.00±1.48
Varying M (N = 50; the blurry level M increases from left to right, starting at M = 10)
Methods | A_AUC / A_avg | A_AUC / A_avg | A_AUC / A_avg
EWC++ | 57.06±1.63 / 59.96±3.18 | 65.90±2.57 / 69.83±3.07 | 68.44±0.65 / 73.71±1.31
BiC | 52.35±1.85 / 54.94±1.81 | 59.31±2.91 / 63.89±4.06 | 61.55±1.02 / 67.81±1.07
ER-MIR | 57.65±2.49 / 61.15±3.74 | 66.58±1.78 / 71.22±2.93 | 68.18±0.44 / 73.35±0.04
GDumb | 53.20±1.93 / 55.27±2.69 | 54.73±1.54 / 54.63±2.39 | 53.86±0.59 / 52.82±1.24
RM | 22.77±1.55 / 60.73±1.03 | 26.50±2.43 / 61.36±3.12 | 27.83±2.65 / 58.43±3.13
Baseline-ER | 56.96±1.97 / 60.22±3.04 | 65.94±2.53 / 70.74±2.70 | 68.78±0.64 / 73.92±1.36
CLIB (Ours) | 70.32±1.22 / 74.00±0.72 | 74.84±1.93 / 77.84±1.40 | 74.94±1.22 / 78.19±1.87
Table 2: Analysis on various values of N (top) and M (bottom) in the i-Blurry-N-M setup using the CIFAR10 dataset. For varying N, we use M=10; for varying M, we use N=50. Note that N=0 corresponds to the blurry split and N=100 corresponds to the disjoint split. For N=0 and N=50, CLIB outperforms all comparisons by large margins, and for N=100 it performs on par with the other CL methods. For varying M, CLIB outperforms all comparisons across all blurry levels. The only case where CLIB is not the best is A_avg at N=100.

5.2 Analysis on Disjoint Class Percentage (N) and Blurry Level (M)

We further investigate the effect of different N and M values in the i-Blurry-N-M splits with various CL methods and summarize the results for varying values of the disjoint class percentage N in Table 2 (top). For N=0, CLIB outperforms the other methods by large margins in both A_AUC and A_avg. For N=100, the performance is similar for the majority of the methods, with CLIB being the best in A_AUC. For N=50, CLIB outperforms all comparisons by large margins in both metrics. Even though CLIB was designed with the i-Blurry setup in mind, it also outperforms other CL methods in conventional setups such as the N=100 (disjoint) or the N=0 (blurry) setups. This implies that CLIB is generally applicable to online CL setups and not restricted to the i-Blurry setup. Meanwhile, except for GDumb, all methods show similar performance on the N=100 (disjoint) setup. The results imply that the i-Blurry setup differentiates CL methods better than the more traditionally used disjoint setup.

Following bang2021rainbow, we additionally summarize the results for varying values of the blurry level M in Table 2 (bottom). We observe that CLIB again outperforms or performs on par with the other CL methods at various blurry levels.

5.3 Ablation Studies

We investigate the benefit of each proposed component by ablation studies in Table 3. We use both CIFAR10 and CIFAR100.

Methods | CIFAR10 (A_AUC / A_avg) | CIFAR100 (A_AUC / A_avg)
CLIB | 70.32±1.22 / 74.00±0.72 | 46.73±1.13 / 49.53±1.18
  w/o Sample Importance Mem. (Sec. 4.2) | 53.75±2.11 / 56.31±2.56 | 36.59±1.22 / 38.59±1.21
  w/o Memory-only training (Sec. 4.3) | 67.06±1.51 / 71.65±1.87 | 44.63±0.22 / 48.66±0.39
  w/o Adaptive LR scheduling (Sec. 4.4) | 69.70±1.34 / 73.06±1.35 | 45.01±0.22 / 48.66±0.39
Table 3: Ablations of the proposed components of our method using the CIFAR10 and CIFAR100 datasets. All proposed components improve performance, with sample-wise importance based memory management providing the biggest gains. While adaptive LR scheduling provides small gains on CIFAR10, the gains increase on the more challenging CIFAR100.

Sample-wise Importance Based Memory Management. We replace the 'sample-wise importance memory management' module with reservoir sampling. As shown in the table, removing our memory management strategy degrades the performance in both A_AUC and A_avg on both CIFAR10 and CIFAR100. As explained in Sec. 4.2, reservoir sampling removes samples at random, so samples are discarded without considering whether some samples are more important than others. Thus, using the sample-wise importance to select which sample to discard contributes greatly to performance.

Memory Only Training. We replace the memory usage strategy from our memory only training with ER. Training with ER means that samples from both the online stream and the memory are used. Without the proposed memory only training scheme, the performance degrades across the board by fair margins. As the streamed samples are being used directly without the sample memory acting as a distribution regularizer (see Sec. 4.3), the CL model is more influenced by the recently observed samples, skewing the training and resulting in worse performance.

Adaptive Learning Rate Scheduling. We change the LR scheduling from the adaptive LR to exponential with reset. The performance drop is small on CIFAR10 but becomes larger on the more challenging CIFAR100. CIFAR100 involves more training iterations, which makes the ablated model suffer more from the lack of a good adaptive LR schedule.

6 Conclusion

We question the practicality of existing continual learning setups for real-world applications and propose a novel CL setup named i-Blurry. It is online, task-free, class-incremental, has blurry task boundaries, and is subject to any-time inference. To address this realistic CL setup, we propose a method that uses per-sample memory management, memory only training, and adaptive LR scheduling, named Continual Learning for i-Blurry (CLIB). Additionally, we propose a new metric, A_AUC, to better evaluate the effectiveness at any-time inference. CLIB consistently outperforms existing CL methods on multiple datasets and setting combinations by large margins.

References

Appendix A Appendix

A.1 Proof of Theorem 1

We give the proof of Theorem 1 below. Our assumption is that when selecting the memory \mathcal{M} from a set of candidates C, we should select \mathcal{M} so that optimizing on \mathcal{M} maximizes the loss decrease on C. In equations, the optimal memory is

    \mathcal{M}^* = \operatorname*{argmax}_{\mathcal{M} \subset C, |\mathcal{M}| = n} \mathbb{E}\left[\ell(C; \theta) - \ell(C; \theta_{\mathcal{M}})\right]        (4)
                  = \operatorname*{argmax}_{\mathcal{M} \subset C, |\mathcal{M}| = n} \left\{\mathbb{E}\left[\ell(C; \theta)\right] - \mathbb{E}\left[\ell(C; \theta_{\mathcal{M}})\right]\right\}        (5)
                  = \operatorname*{argmin}_{\mathcal{M} \subset C, |\mathcal{M}| = n} \mathbb{E}\left[\ell(C; \theta_{\mathcal{M}})\right]        (6)

where \theta is the model parameter, \theta_{\mathcal{M}} is the model parameter after training on \mathcal{M}, \ell is the loss function, and n is the memory size. Since we perform a memory update after every streamed sample, the problem reduces to selecting one sample to remove when the memory is full. Thus, |C| = n + 1. The optimal removed sample would then be

    (x^*, y^*) = \operatorname*{argmin}_{(x, y) \in C} \mathbb{E}\left[\ell\left(C; \theta_{C \setminus \{(x, y)\}}\right)\right],        (7)

which gives the optimal memory \mathcal{M}^* = C \setminus \{(x^*, y^*)\} of Theorem 1. ∎

A.2 Details on Sample-wise Importance Based Memory Management

We describe the details of our sample-wise importance based memory management here. We update H, the estimate of the sample-wise importance for the episodic memory, after every model update. The details are in Alg. 1. With the sample-wise importance scores, we update the memory every time a new sample is encountered. The details are in Alg. 2.

1:Input model , memory , sample-wise importance , previous loss , indices used for training , update coefficient
2: Obtain memory loss
3: Obtain memory loss decrease
4: Memory loss decrease prediction using current
5:for  do
6:     Update Update for samples used for training
7:end for
8:Update
9:Output ,
Algorithm 1 Update Sample-wise Importance
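A hedged Python sketch of the update in Alg. 1: the importance scores predict how much the memory loss should drop, and the prediction error is fed back into the scores of the samples that were just used for training. The exact form of the prediction and the discounting coefficient `alpha` are assumptions made for illustration.

```python
def update_sample_importance(memory_loss_fn, model, memory, importance,
                             prev_loss, trained_idx, alpha=0.1):
    """Refine per-sample importance scores with a discounted correction based on
    the difference between the actual and the predicted memory loss decrease."""
    cur_loss = memory_loss_fn(model, memory)          # loss over the episodic memory
    actual_decrease = prev_loss - cur_loss            # realized loss decrease
    predicted_decrease = sum(importance[i] for i in trained_idx)
    error = actual_decrease - predicted_decrease
    for i in trained_idx:
        # Discounted update: move each used sample's score toward the observed decrease.
        importance[i] += alpha * error / len(trained_idx)
    return importance, cur_loss                       # cur_loss becomes prev_loss next time
```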
1:Input model , memory , memory size , sample , per-sample criterion , previous loss
2:if  then If the memory is not full
3:     Update Append the sample to the memory
4:     
5:else If the memory is already full
6:      Find the most frequent label
7:     
8:      Find the sample with the lowest importance
9:     Update
10:     Update Replace that sample with the new sample
11:end if
12:Update
13:
14:Update Initialize the importance for the new sample
15:Output , ,
Algorithm 2 Sample-wise Importance Based Memory Update
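A sketch of the memory update in Alg. 2. When the memory is full, the removal candidate is the lowest-importance sample among the most frequent class, which keeps the memory roughly class-balanced; initializing the new sample's score with the mean importance is an assumption, as the exact initialization is not spelled out here.

```python
from collections import Counter

def memory_update(memory, importance, new_sample, mem_size):
    """Importance-aware, class-balanced replacement for the episodic memory.
    `memory` is a list of (x, y) pairs, `importance` the matching scores."""
    init_score = sum(importance) / len(importance) if importance else 0.0
    if len(memory) < mem_size:
        memory.append(new_sample)
        importance.append(init_score)
        return memory, importance
    # Consider the new sample as a candidate alongside the stored ones.
    cand = memory + [new_sample]
    cand_scores = importance + [init_score]
    most_freq = Counter(y for _, y in cand).most_common(1)[0][0]
    # Drop the lowest-importance candidate within the most frequent class.
    drop = min((i for i, (_, y) in enumerate(cand) if y == most_freq),
               key=lambda i: cand_scores[i])
    if drop < len(memory):            # a stored sample is replaced by the new one
        memory[drop] = new_sample
        importance[drop] = init_score
    # else: the new sample itself is the weakest candidate and is not inserted
    return memory, importance
```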

A.3 Adaptive Learning Rate Scheduler

We describe the adaptive LR schedule from Sec. 4.4 in Alg. 3. We fix the significance level to the commonly used 0.05. Our adaptive LR scheduling decreases or increases the LR based on its current value. Thus, the rate at which the LR can change is bounded and sudden changes in the LR do not happen.

1:Input current LR , current base LR , loss before applying current LR , current loss , LR performance history and , LR step , history length , significance level
2: Obtain loss decrease
3:Update
4:if  then If LR is higher than base LR
5:     Update Append loss decrease in high LR history
6:     if  then
7:         Update
8:     end if
9:else If LR is lower than base LR
10:     Update Append loss decrease in low LR history
11:     if  then
12:         Update
13:     end if
14:end if
15:if  then If both histories are full
16:     
17: Perform one-sided Student’s -test with alternative hypothesis
18:     if  then If pvalue is significantly low
19:         Update Decrease base LR
20:         Update Reset histories
21:     else if  then If pvalue is significantly high
22:         Update Increase base LR
23:         Update
24:     end if
25:end if
26:if  then Alternately apply high and low LR (note that )
27:     Update
28:else
29:     Update
30:end if
31:Output
Algorithm 3 Adaptive Learning Rate Scheduler
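A hedged sketch of Alg. 3 using scipy's independent two-sample t-test. It alternates a high and a low trial LR around the base LR, records the resulting loss decreases, and moves the base LR toward whichever side a one-sided Student's t-test favors; the step size and history length values are placeholders, not the paper's settings.

```python
from scipy import stats

class AdaptiveLRScheduler:
    def __init__(self, base_lr, step=1.05, history_len=10, significance=0.05):
        self.base_lr, self.step = base_lr, step
        self.history_len, self.alpha = history_len, significance
        self.high_hist, self.low_hist = [], []   # loss decreases under high / low LR
        self.apply_high = True                   # alternate high and low trial LRs

    def next_lr(self):
        self.current_is_high = self.apply_high
        self.apply_high = not self.apply_high
        return self.base_lr * self.step if self.current_is_high else self.base_lr / self.step

    def report(self, loss_before, loss_after):
        # Record the loss decrease obtained with the LR returned by the last next_lr() call.
        (self.high_hist if self.current_is_high else self.low_hist).append(loss_before - loss_after)
        if len(self.high_hist) < self.history_len or len(self.low_hist) < self.history_len:
            return
        # One-sided test, H1: the low LR yields larger loss decreases than the high LR.
        p = stats.ttest_ind(self.low_hist, self.high_hist, alternative="greater").pvalue
        if p <= self.alpha:              # low LR is significantly better -> decrease base LR
            self.base_lr /= self.step
            self.high_hist, self.low_hist = [], []
        elif p >= 1 - self.alpha:        # high LR is significantly better -> increase base LR
            self.base_lr *= self.step
            self.high_hist, self.low_hist = [], []
```

In a training loop, one would call next_lr() to set the optimizer's LR before each update and then call report() with the memory loss measured before and after that update.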

A.4 Details on the Online Versions of Compared CL Methods

We implemented online versions of RM (bang2021rainbow), EWC++ (rwalk), BiC (bic), GDumb (Prabhu2020GDumbAS), A-GEM (AGEM), GSS (aljundi2019gradient), and MIR (aljundi2019online) by incorporating ER and the exponential with reset schedule into the methods whenever possible. There is no specified memory management strategy for EWC++, and BiC uses the herding selection from iCaRL (icarl). However, herding selection is not possible in the online setup since it requires the whole task's data for calculating the class means, so we attach a reservoir memory to both methods instead. EWC++ does not require any other modification.

An additional modification is applied to the bias correction stage of BiC. BiC originally performs bias correction at the end of each task, but since evaluation is also performed in the middle of a task in our setup, we modified the method to perform bias correction whenever the model receives an inference query.

In RM, the memory management strategy based on uncertainty is not applicable in an online setup, since it requires uncertainty rankings of the whole task's samples. Thus, we replace their sampling strategy with balanced random sampling, while keeping their two-stage training scheme. Methods that were converted from offline to online, namely EWC++, BiC, and RM, may have suffered some performance drop due to the deviation from their original methods.

A.5 Analysis on Sample Memory Size

We conduct an analysis over various sample memory sizes (K) and summarize the results in Table 4. We observe that CLIB outperforms the other CL methods in both A_AUC and A_avg regardless of the memory size. It is interesting to note that CLIB with a smaller memory outperforms or performs on par with the other CL methods even when they use a larger memory. Thus, although CLIB is the only method that uses the memory only training scheme, it is the least sensitive to the memory size. This implies that our memory management policy is the most effective, showing the superiority of our per-sample memory management method.

Methods | K=200 (A_AUC / A_avg) | K=500 (A_AUC / A_avg) | K=1000 (A_AUC / A_avg)
EWC++ | 51.29±1.05 / 53.84±2.34 | 57.06±1.63 / 59.96±3.18 | 60.26±1.17 / 65.25±1.77
BiC | 49.65±1.78 / 51.58±1.11 | 52.35±1.85 / 54.94±1.81 | 54.78±1.99 / 58.04±1.94
ER-MIR | 50.79±1.55 / 54.08±2.11 | 57.65±2.49 / 61.15±3.74 | 60.86±2.37 / 66.10±0.93
GDumb | 42.54±2.01 / 43.99±2.28 | 53.20±1.93 / 55.27±2.69 | 66.55±1.10 / 69.21±1.29
RM | 21.59±0.85 / 47.53±0.93 | 22.77±1.55 / 60.73±1.03 | 25.82±2.19 / 71.15±2.92
Baseline-ER | 51.41±1.07 / 53.70±2.40 | 56.96±1.97 / 60.22±3.04 | 60.55±1.21 / 65.47±1.76
CLIB (Ours) | 64.27±1.67 / 66.17±1.58 | 70.32±1.22 / 74.00±0.72 | 73.20±1.16 / 77.88±1.87
Table 4: Analysis on various sample memory sizes (K) using CIFAR10. The i-Blurry-50-10 splits are used. The results are averaged over 3 runs. CLIB outperforms all other CL methods by large margins for all the memory sizes. CLIB uses the given memory budget most effectively, showing the superiority of our per-sample memory management method.

A.6 Additional Comparisons with A-GEM

We present additional comparisons to A-GEM. Note that, as A-GEM was designed for the task-incremental setting, it performs very poorly in our i-Blurry setup, which is task-free. Notably, it achieves only 4.62 in A_AUC and 6.94 in A_avg on CIFAR100; other works (Prabhu2020GDumbAS; mai2021online) have also reported very poor performance for A-GEM in their studies.

Methods | CIFAR10 (A_AUC / A_avg) | CIFAR100 (A_AUC / A_avg)
EWC++ | 57.06±1.63 / 59.96±3.18 | 35.26±1.82 / 39.35±1.50
BiC | 52.35±1.85 / 54.94±1.81 | 26.26±0.92 / 29.90±1.03
ER-MIR | 57.65±2.49 / 61.15±3.74 | 34.49±2.29 / 37.95±2.05
A-GEM | 39.29±2.88 / 44.85±4.70 | 4.62±0.23 / 6.94±0.51
GDumb | 53.20±1.93 / 55.27±2.69 | 32.84±0.45 / 34.03±0.89
RM | 22.77±1.55 / 60.73±1.03 | 8.62±0.64 / 32.09±0.99
Baseline-ER | 56.96±1.97 / 60.22±3.04 | 34.26±1.66 / 38.99±1.31
CLIB (Ours) | 70.32±1.22 / 74.00±0.72 | 46.73±1.13 / 49.53±1.18
Table 5: Additional comparisons to A-GEM with various online CL methods on the i-Blurry setup for CIFAR10 and CIFAR100. The i-Blurry-50-10 splits are used for all datasets and the results are averaged over 3 runs. A-GEM performs very poorly, especially on CIFAR100, as it was designed for the task-incremental setting whereas the i-Blurry setup is task-free. CLIB outperforms all other CL methods by large margins on both the A_AUC and the A_avg.

A.7 Comparisons to Other Memory Management CL Methods

We present an additional comparison to CL methods that use a different memory management strategy in Table 6. GSS (aljundi2019gradient) is added as an additional comparison, while Baseline-ER represents reservoir sampling. CLIB outperforms both methods by large margins in both A_AUC and A_avg, implying that the sample-wise importance memory management method is better than reservoir sampling or GSS-Greedy.

Methods | Mem. Management | CIFAR10 (A_AUC / A_avg) | CIFAR100 (A_AUC / A_avg)
GSS | GSS-Greedy | 55.51±3.33 / 59.27±4.36 | 30.09±1.38 / 35.06±1.43
Baseline-ER | Reservoir | 56.96±1.97 / 60.22±3.04 | 34.26±1.66 / 38.99±1.31
CLIB (Ours) | Sample-wise Importance | 70.32±1.22 / 74.00±0.72 | 46.73±1.13 / 49.53±1.18
Table 6: Comparisons to other CL methods with different memory management strategies in the i-Blurry setup for CIFAR10 and CIFAR100. The i-Blurry-50-10 splits are used for all datasets and the results are averaged over 3 runs. CLIB outperforms all other CL methods by large margins on both the A_AUC and the A_avg, implying that the sample-wise importance memory management method is effective.

A.8 Additional Results with the Forgetting Measure

We report the results of the forgetting measure (rwalk) here. Note that while forgetting is a useful metric for analyzing the stability-plasticity of a method, lower forgetting does not necessarily mean that a CL method is better. For example, if a method does not train on the new task at all, its forgetting will be 0.

Also, we do not propose a new forgetting measure for any-time inference. This is because forgetting is measured with the best accuracy of each class, and the best accuracy usually occurs at the end of each task. Thus, measuring the best accuracy among all inferences would not be much different from measuring it among the inferences at the end of each task.

Note that the forgetting values are roughly similar across all methods except BiC. BiC shows particularly lower forgetting, since it uses distillation to prevent forgetting in a hard way. However, in an online setting where multi-epoch training over the whole task is not possible, this hinders the model from learning new knowledge, which is why BiC's A_AUC and A_avg are generally low.

Methods | CIFAR10 (A_AUC / A_avg / Forgetting) | CIFAR100 (A_AUC / A_avg / Forgetting) | TinyImageNet (A_AUC / A_avg / Forgetting)
EWC++ | 57.06±1.63 / 59.96±3.18 / 18.51±3.69 | 35.26±1.82 / 39.35±1.50 / 10.79±2.75 | 20.63±1.02 / 22.89±1.11 / 14.33±1.31
BiC | 52.35±1.85 / 54.94±1.81 / 8.92±5.40 | 26.26±0.92 / 29.90±1.03 / -0.97±0.51 | 16.68±1.95 / 20.19±1.35 / 5.99±1.60
ER-MIR | 57.65±2.49 / 61.15±3.74 / 20.38±2.81 | 34.49±2.29 / 37.95±2.05 / 10.88±2.53 | 21.07±0.89 / 22.89±0.71 / 15.30±2.15
GDumb | 53.20±1.93 / 55.27±2.69 / 18.50±1.96 | 32.84±0.45 / 34.03±0.89 / 8.78±0.51 | 18.17±0.19 / 18.69±0.45 / 8.61±1.81
RM | 22.77±1.55 / 60.73±1.03 / 5.60±0.85 | 8.62±0.64 / 32.09±0.99 / 7.22±0.64 | 3.67±0.03 / 11.35±0.12 / -0.77±0.26
Baseline-ER | 56.96±1.97 / 60.22±3.04 / 17.90±4.85 | 34.26±1.66 / 38.99±1.31 / 9.22±2.61 | 21.37±1.01 / 23.65±1.22 / 14.13±1.64
CLIB (Ours) | 70.32±1.22 / 74.00±0.72 / 17.20±1.72 | 46.73±1.13 / 49.53±1.18 / 12.93±1.93 | 23.78±0.66 / 24.79±0.21 / 13.17±1.47
Table 7: Additional comparisons including the forgetting measure for various online CL methods on the i-Blurry setup for CIFAR10, CIFAR100, and TinyImageNet. The i-Blurry-50-10 splits are used for all datasets and the results are averaged over 3 runs. Ours outperforms all other CL methods by large margins on both the A_AUC and the A_avg. The forgetting value is roughly the same for all methods except BiC. The best result for each metric is shown in bold.

A.9 Performance of Exponential with Reset LR Schedule

We show brief results for the LR schedule used in our baseline in Table 8. We compare a constant LR with the exponential with reset schedule used in our baseline. The exponential with reset is better than the constant LR, which is why we use it in our baseline.

Methods | LR Schedule | A_AUC | A_avg
Baseline-ER | Constant | 56.73±1.73 | 58.18±3.41
Baseline-ER | Exp w/ Reset | 56.96±1.97 | 60.22±3.04
EWC++ | Constant | 56.54±1.57 | 57.92±3.26
EWC++ | Exp w/ Reset | 57.06±1.63 | 59.96±3.18
Table 8: Comparison between the exponential with reset schedule and a constant LR on CIFAR10. It shows that our baseline LR schedule, exponential with reset, is reasonable; it performs better than the constant LR, especially in the A_avg metric.