Incremental Learning with Unlabeled Data in the Wild

03/29/2019 ∙ by Kibok Lee, et al.

Deep neural networks are known to suffer from catastrophic forgetting in class-incremental learning, where the performance on previous tasks drastically degrades when learning a new task. To alleviate this effect, we propose to leverage a continuous and large stream of unlabeled data in the wild. In particular, to leverage such transient external data effectively, we design a novel class-incremental learning scheme with (a) a new distillation loss, termed global distillation, (b) a learning strategy to avoid overfitting to the most recent task, and (c) a sampling strategy for the desired external data. Our experimental results on various datasets, including CIFAR and ImageNet, demonstrate the superiority of the proposed methods over prior methods, particularly when a stream of unlabeled data is accessible: we achieve up to 9.3% of relative performance improvement over the state-of-the-art method.

1 Introduction

Deep neural networks (DNNs) have achieved remarkable success in many machine learning applications, e.g., classification [8], generation [25], object detection [7], and reinforcement learning [33]. However, in the real world, where the number of tasks continues to grow, the entire set of tasks cannot be given at once; rather, it may be given as a sequence of tasks. The goal of class-incremental learning [28] is to enrich the ability of a model to deal with such a case, by aiming to perform well on both previous and new tasks.¹

¹ In class-incremental learning, a set of classes is given in each task. In evaluation, the model aims to classify data from any class learned so far without task boundaries.

In particular, it has gained much attention recently as DNNs tend to forget previous tasks easily when learning new tasks, which is a phenomenon called catastrophic forgetting [5, 24].

The main reason for catastrophic forgetting is the limited resources for scalability: all training data of previous tasks cannot be stored in a limited amount of memory as the number of tasks increases. Prior works on class-incremental learning focused on learning in a closed environment, i.e., a model can only see the given labeled training dataset during training [2, 10, 20, 21, 28]. However, in the real world, as we live with a continuous and large stream of data, a large amount of unlabeled data is easily obtainable on the fly or transiently, for example, by data mining on social media [23] and web data [14]. Motivated by this, we propose to leverage such a large stream of unlabeled external data. We remark that our setup on unlabeled data is similar to self-taught learning [27] rather than semi-supervised learning, because we do not assume any correlation between the unlabeled data and the classification tasks of interest.

Figure 1: We propose to leverage a large stream of unlabeled data in the wild for class-incremental learning. At each stage, a confidence-based sampling strategy is applied to build an external dataset. Specifically, some of the unlabeled data are sampled based on the prediction of the model learned in the previous stage to alleviate catastrophic forgetting, and some of them are sampled at random for confidence calibration. Using the combination of the labeled training dataset and the unlabeled external dataset, a teacher model first learns the current task, and then the new model learns both the previous and the current tasks by distilling the knowledge of the previous model and the teacher.

Contribution. Under the new class-incremental setup, our contribution is three-fold (see Figure 1 for an overview):

  • We propose a new training loss, termed global distillation, which utilizes data to distill the knowledge of previous tasks effectively.

  • We design a 3-step learning scheme to improve the effectiveness of global distillation: (i) training a teacher specialized for the current task, (ii) training a model by distilling the knowledge of the previous model and the teacher learned in (i), and (iii) fine-tuning to avoid overfitting to the current task.

  • We propose a sampling scheme with a confidence-calibrated model to effectively leverage a large stream of unlabeled data.

For the first contribution, global distillation encourages the model to learn knowledge over all previous tasks, while prior works only applied a task-wise local distillation [2, 10, 21, 28]: e.g., local distillation does not distill the knowledge of how to distinguish two classes learned in different previous tasks. We note that global distillation is effective even without external data, but its performance gain is more significant when unlabeled data are available.

For the second contribution, the first two steps (i), (ii) of the proposed learning scheme are designed to keep the knowledge of the previous tasks as well as to learn the current task. On the other hand, the purpose of the last step (iii) is to avoid overfitting to the current task: due to the scalability issue, only a small portion of the data of the previous tasks is kept and replayed during training [2, 26, 28]. This inevitably incurs bias in the prediction of the learned model, in favor of the current task. To mitigate this issue of imbalanced training, we fine-tune the model based on the statistics of the data of the previous and the current tasks.

Finally, the third contribution is motivated by the intuition that the more similar the distribution of the unlabeled data is to that of the previous tasks, the more it helps the model avoid catastrophic forgetting. Since the unlabeled external data might not necessarily be related to the tasks of interest, it is far from clear whether they contain useful information for alleviating catastrophic forgetting. Therefore, we propose to sample an external dataset with a principled sampling strategy. To sample an effective external dataset from a large stream of unlabeled data, we propose to train a confidence-calibrated model [16, 17] by utilizing irrelevant data as out-of-distribution (OOD) samples, where OOD refers to data whose distribution is far from those of the tasks learned so far. We show that unlabeled data from OOD should also be sampled to keep the model confidence-calibrated.

Our experimental results demonstrate the superiority of the proposed methods over prior methods. In particular, when unlabeled external data are available, the performance gain of the proposed methods is more significant. For example, under our experimental setups on ImageNet [4], our method achieves a 6.3% relative improvement over the state-of-the-art method (E2EiL) [2] with an external dataset (and a 4.1% relative improvement without an external dataset). In total, we improve the accuracy over E2EiL by up to 9.3% by utilizing external unlabeled data together with the proposed learning scheme.

2 Approach

In this section, we propose a new learning method for class-incremental learning. In Section 2.1, we further describe the scenario and learning objectives of interest. In Section 2.2, we propose a novel learning objective, termed global distillation. In Section 2.3, we propose a confidence-based sampling strategy to build an external dataset from a large stream of unlabeled data.

2.1 Preliminaries: Class-incremental Learning

Formally, let $x$ be a data point and $y$ be its label in a dataset $\mathcal{D}$, and let $\mathcal{T}$ be a supervised task mapping $x$ to $y$. We write $y \in \mathcal{T}$ if $y$ is in the range of $\mathcal{T}$, such that $|\mathcal{T}|$ is the number of class labels in $\mathcal{T}$. For the $t$-th task $\mathcal{T}_t$, let $\mathcal{D}_t$ be the corresponding training dataset, and let $\mathcal{C}_{t-1}$ be a coreset³ containing representative data of the previous tasks $\mathcal{T}_{1:t-1} = \mathcal{T}_1 \cup \dots \cup \mathcal{T}_{t-1}$, such that $\mathcal{D}_t^{\text{trn}} = \mathcal{D}_t \cup \mathcal{C}_{t-1}$ is the labeled training dataset available at the $t$-th stage. Let $\theta_t = \{\phi, \varphi_{1:t}\}$ be the set of learnable parameters of the model $\mathcal{M}_t$, where $\phi$ and $\varphi_s$ indicate shared and task-specific parameters, respectively (the subscript indicates the task index).⁴

³ A coreset is a small dataset kept in a limited amount of memory and used to replay previous tasks. Initially, $\mathcal{C}_0 = \emptyset$.

⁴ If task-specific parameters of multiple tasks are given, then the logits for all classes in those tasks are concatenated for prediction without task boundaries, i.e., all classes in the tasks are candidates at test time, and our formulation does not prevent classes from appearing in multiple tasks.

In class-incremental learning, the goal at the $t$-th stage is to train a model to perform the current task $\mathcal{T}_t$ as well as the previous tasks $\mathcal{T}_{1:t-1}$ without task boundaries, i.e., all class labels in $\mathcal{T}_{1:t}$ are candidates at test time. To this end, a small coreset $\mathcal{C}_{t-1}$ and the previous model $\mathcal{M}_{t-1}$ are transferred from the previous stage. We also assume that a large stream of unlabeled data is accessible, from which we sample an external dataset denoted by $\mathcal{D}_t^{\text{ext}}$. We do not assume any correlation between the stream of unlabeled data and the tasks. The outcome of the $t$-th stage is the model $\mathcal{M}_t$ that can perform all observed tasks $\mathcal{T}_{1:t}$, and the coreset $\mathcal{C}_t$ for learning in subsequent stages.

Learning objectives. We minimize cross-entropy loss to train our model. Specifically, to train model parameters $\theta$, we minimize the following classification loss under a dataset $\mathcal{D}$:

$\mathcal{L}_{\text{cls}}(\theta; \mathcal{D}) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ -\log p(y \mid x; \theta) \right].$

At the $t$-th stage, a standard approach to train a model is to optimize the classification loss as follows:

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}). \qquad (1)$

We call this the global classification (GC) method. This method is sufficient for the model to perform all tasks well as long as the training dataset contains enough information for both the previous and the current tasks. However, in our setting, the limited size of the coreset makes the learned model suffer from catastrophic forgetting. To alleviate this effect, the previous model $\mathcal{M}_{t-1}$ is available for learning the knowledge of the previous tasks better. To leverage the previous model, we consider optimizing the following distillation loss with a reference (or teacher) model $\mathcal{R}$ over a task set $\mathcal{T}$:

$\mathcal{L}_{\text{dst}}(\theta; \mathcal{R}, \mathcal{T}, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}} \left[ -\sum_{y \in \mathcal{T}} p(y \mid x; \mathcal{R}) \log p(y \mid x; \theta) \right],$

where the probabilities can be smoothed for better distillation (see [9] or Appendix).
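To make this loss concrete, the following is a minimal PyTorch-style sketch of a soft-target distillation term with temperature smoothing. It is our illustration under the definitions above, not the authors' released code, and the function and argument names are ours.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target cross-entropy between a reference (teacher) model and the
    # current (student) model, with temperature-smoothed probabilities.
    # Both logit tensors are restricted to the same subset of classes, e.g.,
    # the classes of the previous tasks when distilling from the previous model.
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    # Cross-entropy with soft targets, averaged over the mini-batch.
    return -(soft_targets * log_probs).sum(dim=1).mean()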

2.2 Global Distillation

Figure 2: An illustration of how a model learns with GD in Eq. (4) at the $t$-th stage. With labeled data, the model learns by minimizing the classification loss (blue). With any data, labeled or not, the model learns by minimizing the distillation losses (green and red), where the outputs of the reference models $\mathcal{M}_{t-1}$ and $\mathcal{S}_t$ are used as soft targets. Colored shapes show parameters updated by gradients backpropagated from each loss.

In addition to the global classification loss in Eq. (1), we propose to train a model by optimizing a distillation loss. Specifically, it distills the predictive distribution of the previous model $\mathcal{M}_{t-1}$ over the previous tasks $\mathcal{T}_{1:t-1}$:

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{M}_{t-1}, \mathcal{T}_{1:t-1}, \mathcal{D}_t^{\text{trn}}). \qquad (2)$

However, learning with Eq. (2) would cause another bias, because the range of the knowledge learned by the classification and the distillation loss is different: while the classification loss is designed to learn to distinguish classes in all tasks, the distillation loss only distills the knowledge about the previous tasks. Hence, the knowledge within the current task would not be properly learned with Eq. (2), i.e., the performance on the current task would be unnecessarily sacrificed. To account for this, we propose to train another teacher model $\mathcal{S}_t$ specialized for the current task $\mathcal{T}_t$, which can be trained with the standard cross-entropy loss:⁵

$\mathcal{L}_{\text{cls}}(\theta_t^{\mathcal{S}}; \mathcal{D}_t), \qquad (3)$

where $\theta_t^{\mathcal{S}}$ denotes the parameters of $\mathcal{S}_t$.

⁵ Initially, $\mathcal{M}_1 = \mathcal{S}_1$, as there is no previous model.

Note that only the dataset of the current task, $\mathcal{D}_t$, is used, because $\mathcal{S}_t$ is specialized for the current task only.

With the previous model $\mathcal{M}_{t-1}$ and the model $\mathcal{S}_t$ specialized for the current task learned with Eq. (3), we define the learning objective of the global distillation (GD) method:

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{M}_{t-1}, \mathcal{T}_{1:t-1}, \mathcal{D}_t^{\text{trn}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{S}_t, \mathcal{T}_t, \mathcal{D}_t^{\text{trn}}). \qquad (4)$

Figure 2 illustrates how a model learns with GD; the effect of the current-task teacher $\mathcal{S}_t$ can be found in Table 2. Note that GD is fundamentally different from prior distillation losses [2, 10, 21, 28]: while global distillation transfers the knowledge over previous tasks, the local distillation used in prior works transfers only the knowledge within each of the tasks. We discuss the related works in more detail in Section 3.
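As a concrete (and simplified) reading of Eq. (4), the sketch below composes the GD objective from a classification term and the two distillation terms, reusing the distillation_loss helper sketched earlier. The logit layout, the argument names, and the handling of a mixed labeled/unlabeled batch (which corresponds to Eq. (7) when external data are present) are our assumptions, not the authors' implementation; loss weights (Eq. (10)) are omitted.

import torch.nn.functional as F

def gd_loss(model, prev_model, teacher, x, y, labeled_mask,
            n_prev_classes, temperature=2.0):
    # Global distillation objective of Eq. (4) for one mini-batch.
    # The logit layout is assumed to be [previous classes | current classes],
    # and the teacher is assumed to output logits over the current classes only.
    logits = model(x)
    # Classification loss over all classes seen so far (labeled data only).
    cls = F.cross_entropy(logits[labeled_mask], y)
    # Distill the previous model's distribution over all previous classes.
    dst_prev = distillation_loss(logits[:, :n_prev_classes],
                                 prev_model(x)[:, :n_prev_classes], temperature)
    # Distill the current-task teacher's distribution over the current classes.
    dst_cur = distillation_loss(logits[:, n_prev_classes:], teacher(x), temperature)
    return cls + dst_prev + dst_cur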

Balanced fine-tuning. The statistics of the class labels in the training dataset are also information learned during training. Since the number of data points from the previous tasks is much smaller than that of the current task, the prediction of the model is biased toward the current task. To remove this bias, we further fine-tune the model after training. When fine-tuning, for each loss term defined with a dataset $\mathcal{D}$ and a task set $\mathcal{T}$, we scale the gradient computed from a data point with label $y$ by the following:

$w_y = \frac{1}{|\{(x', y') \in \mathcal{D} : y' = y\}|}. \qquad (5)$

Since scaling a gradient is equivalent to feeding the same data multiple times, we call this method data weighting. We also normalize the weights by multiplying them with $|\mathcal{D}| / |\mathcal{T}|$, such that they are all one if $\mathcal{D}$ is balanced.

We only fine-tune the task-specific parameters with data weighting, because all training data would be equally useful for representation learning, i.e., for the shared parameters $\phi$, while the bias in the data distribution of the training dataset should be removed when training the classifier, i.e., the task-specific parameters $\varphi_{1:t}$. The effect of balanced fine-tuning can be found in Table 3.
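A minimal sketch of the data weighting above, assuming the labels of the training dataset are given as an integer tensor; the names are ours, and this is an illustration rather than the authors' code.

import torch

def data_weights(labels, num_classes):
    # Per-example weights for data weighting (Eq. (5)): each example is
    # weighted by the inverse frequency of its class, normalized so that all
    # weights equal one when the dataset is balanced.
    counts = torch.bincount(labels, minlength=num_classes).float()
    per_class = (labels.numel() / num_classes) / counts.clamp(min=1)
    return per_class[labels]

# During balanced fine-tuning, the per-example classification loss on the
# task-specific parameters is multiplied by these weights before averaging.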

1:  $\mathcal{C}_0 \leftarrow \emptyset$, $t \leftarrow 1$
2:  while true do
3:     Input: previous model $\mathcal{M}_{t-1}$, coreset $\mathcal{C}_{t-1}$, training dataset $\mathcal{D}_t$, unlabeled data stream $\mathcal{U}$
4:     Output: new coreset $\mathcal{C}_t$, model $\mathcal{M}_t$
5:     $\mathcal{T}_{1:t} \leftarrow \mathcal{T}_{1:t-1} \cup \mathcal{T}_t$
6:     $\mathcal{D}_t^{\text{trn}} \leftarrow \mathcal{D}_t \cup \mathcal{C}_{t-1}$
7:     Sample $\mathcal{D}_t^{\text{ext}}$ from $\mathcal{U}$ using Algorithm 2
8:     Train a teacher $\mathcal{S}_t$ for the current task with Eq. (6)
9:     if $t > 1$ then
10:        Train $\mathcal{M}_t$ with GD in Eq. (7)
11:        Fine-tune $\mathcal{M}_t$ with GD in Eq. (7), with data weighting in Eq. (5)
12:     else
13:        $\mathcal{M}_t \leftarrow \mathcal{S}_t$
14:     end if
15:     Randomly sample $\mathcal{C}_t$ from $\mathcal{D}_t^{\text{trn}}$ such that each class in $\mathcal{T}_{1:t}$ has $|\mathcal{C}_t| / |\mathcal{T}_{1:t}|$ data points
16:     $t \leftarrow t + 1$
17:  end while
Algorithm 1 3-step learning with GD.

3-step learning algorithm. In summary, our training strategy has three steps: training the teacher $\mathcal{S}_t$ specialized for the current task $\mathcal{T}_t$, training the model $\mathcal{M}_t$ with the two reference models $\mathcal{M}_{t-1}$ and $\mathcal{S}_t$, and fine-tuning the task-specific parameters with data weighting. Algorithm 1 describes the 3-step learning scheme. Here, for coreset management, we build a balanced coreset by randomly selecting data for each class. We note that other, more sophisticated selection algorithms like herding [28] do not perform significantly better than random selection, as reported in prior works [2, 36].

1:  Input: previous model $\mathcal{M}_{t-1}$, unlabeled data stream $\mathcal{U}$, sample size $N^{\text{ext}}$, number of unlabeled data to be retrieved $N^{\text{rtv}}$
2:  Output: sampled external dataset $\mathcal{D}_t^{\text{ext}}$
3:  $\mathcal{D}^{\text{rnd}} \leftarrow \emptyset$, $\mathcal{D}_y^{\text{prev}} \leftarrow \emptyset$ for all $y \in \mathcal{T}_{1:t-1}$
4:  $k \leftarrow N^{\text{ext}} / 2$
5:  $m \leftarrow N^{\text{ext}} / (2\,|\mathcal{T}_{1:t-1}|)$
6:  while $|\mathcal{D}^{\text{rnd}}| < k$ do
7:     Get $x$ from $\mathcal{U}$ and update $\mathcal{D}^{\text{rnd}} \leftarrow \mathcal{D}^{\text{rnd}} \cup \{x\}$
8:  end while
9:  $n \leftarrow 0$
10:  while $n < N^{\text{rtv}}$ do
11:     Get $x$ from $\mathcal{U}$ and compute the prediction of $\mathcal{M}_{t-1}$:   $\hat{y} \leftarrow \arg\max_{y \in \mathcal{T}_{1:t-1}} p(y \mid x; \mathcal{M}_{t-1})$,  $\hat{p} \leftarrow p(\hat{y} \mid x; \mathcal{M}_{t-1})$
12:     if $|\mathcal{D}_{\hat{y}}^{\text{prev}}| < m$ then
13:        $\mathcal{D}_{\hat{y}}^{\text{prev}} \leftarrow \mathcal{D}_{\hat{y}}^{\text{prev}} \cup \{(x, \hat{p})\}$
14:     else
15:        Replace the least probable data in class $\hat{y}$:  $(x', p') \leftarrow \arg\min_{(x'', p'') \in \mathcal{D}_{\hat{y}}^{\text{prev}}} p''$
16:        if $\hat{p} > p'$ then
17:           $\mathcal{D}_{\hat{y}}^{\text{prev}} \leftarrow (\mathcal{D}_{\hat{y}}^{\text{prev}} \setminus \{(x', p')\}) \cup \{(x, \hat{p})\}$
18:        end if
19:     end if
20:     $n \leftarrow n + 1$
21:  end while
22:  Return $\mathcal{D}_t^{\text{ext}} \leftarrow \mathcal{D}^{\text{rnd}} \cup \bigcup_{y} \mathcal{D}_y^{\text{prev}}$
Algorithm 2 Sampling external dataset.

2.3 Sampling External Dataset

Although a large amount of unlabeled data would be easily obtainable, there are two issues when using them for knowledge distillation: (a) training on a large-scale external dataset is expensive, and (b) most of the data would not be helpful, because they would be irrelevant to the tasks the model learns. To overcome these issues, we propose to sample an external dataset useful for knowledge distillation from a large stream of unlabeled data. Note that the sampled external dataset does not require an additional permanent memory; it is discarded after learning.

Confidence calibration for sampling. In order to alleviate the catastrophic forgetting caused by the imbalanced training dataset, it is desirable to sample external data that are expected to belong to the previous tasks. Since the previous model $\mathcal{M}_{t-1}$ is expected to produce an output with high confidence if a data point is likely to belong to the previous tasks, the output of $\mathcal{M}_{t-1}$ can be used for sampling. However, modern DNNs are highly overconfident [6, 16]; thus, a model learned with a discriminative loss would produce a prediction with high confidence even if the data point is not from any of the previous tasks. Since most of the unlabeled data would not be relevant to any of the previous tasks, i.e., they are considered to be from out-of-distribution (OOD), it is important to avoid overconfident predictions on such irrelevant data. To achieve this, the model should be confidence-calibrated, by learning with a certain amount of OOD data as well as data of the previous tasks. When sampling OOD data, we propose to sample at random rather than relying on the confidence of the previous model, as OOD is widely distributed over the data space. The effect of this sampling strategy can be found in Table 4. Algorithm 2 describes our sampling strategy. The sampling algorithm can take a long time, but we limit the number of retrieved unlabeled data in our experiments to 1M, i.e., $N^{\text{rtv}} = 10^6$ in Algorithm 2.
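The sketch below is our own simplified rendering of this sampling strategy (and of Algorithm 2): it assumes the previous model outputs logits over the previous classes and that the stream yields image tensors in random order. The function and variable names are ours, and details such as the exact replacement rule may differ from the authors' implementation.

import heapq
import torch
import torch.nn.functional as F

def sample_external(prev_model, unlabeled_stream, sample_size, max_retrieved,
                    num_prev_classes):
    # Build an external dataset from a stream of unlabeled data: half of the
    # budget is filled with (effectively random) samples for confidence
    # calibration, and the other half keeps, per previous class, the samples
    # on which the previous model is most confident.
    per_class = max(1, sample_size // (2 * num_prev_classes))
    random_part = []
    confident = [[] for _ in range(num_prev_classes)]  # one min-heap per class
    for n, x in enumerate(unlabeled_stream):
        if n >= max_retrieved:              # cap on the number of retrieved data
            break
        if len(random_part) < sample_size // 2:
            random_part.append(x)           # the stream itself arrives in random order
            continue
        with torch.no_grad():
            probs = F.softmax(prev_model(x.unsqueeze(0)), dim=1).squeeze(0)
        conf, label = probs.max(dim=0)
        heap = confident[label.item()]
        item = (conf.item(), n, x)          # n breaks ties without comparing tensors
        if len(heap) < per_class:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)   # keep only the most confident samples
    return random_part + [x for heap in confident for (_, _, x) in heap]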

Global distillation with external data. For confidence calibration, we consider the following confidence loss, which makes the model produce confidence-calibrated outputs for data that are not relevant to the task of interest:

$\mathcal{L}_{\text{cnf}}(\theta; \mathcal{T}, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathrm{KL}\!\left( \mathcal{U}_{\mathcal{T}} \,\|\, p(\cdot \mid x; \theta) \right) \right],$

where $\mathcal{U}_{\mathcal{T}}$ is the uniform distribution over the labels in $\mathcal{T}$, and the confidence is defined in terms of the Kullback-Leibler (KL) divergence from the uniform distribution [16, 17].
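A minimal sketch of this confidence loss under the reconstruction above, assuming the model outputs logits over the labels of the task of interest; this follows the KL-from-uniform formulation of [16, 17] and is our illustration, not the authors' code.

import torch
import torch.nn.functional as F

def confidence_loss(logits):
    # KL divergence from the uniform distribution to the predicted
    # distribution, averaged over a batch of (out-of-distribution) inputs.
    # Minimizing it pushes predictions on such inputs toward the uniform
    # distribution, i.e., low confidence.
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    # KL(U || p) = sum_y (1/K) * (log(1/K) - log p(y|x)); the constant term
    # does not affect the gradient but is kept for completeness.
    return (uniform * (uniform.log() - log_probs)).sum(dim=1).mean()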

During the 3-step learning, only the first step, training $\mathcal{S}_t$, has no reference model, so it should learn with the confidence loss. For $\mathcal{S}_t$, a data point is from OOD if its label is not in $\mathcal{T}_t$. Namely, by optimizing the confidence loss under the coreset of the previous tasks $\mathcal{C}_{t-1}$ and the external dataset $\mathcal{D}_t^{\text{ext}}$, the model learns to produce predictions with low confidence for OOD data, i.e., uniformly distributed probabilities over class labels. Thus, $\mathcal{S}_t$ learns by optimizing the following:

$\mathcal{L}_{\text{cls}}(\theta_t^{\mathcal{S}}; \mathcal{D}_t) + \mathcal{L}_{\text{cnf}}(\theta_t^{\mathcal{S}}; \mathcal{T}_t, \mathcal{C}_{t-1} \cup \mathcal{D}_t^{\text{ext}}). \qquad (6)$

Note that the model $\mathcal{M}_t$ does not require additional confidence calibration, because the outputs distilled from the reference models are already confidence-calibrated: the previous model $\mathcal{M}_{t-1}$ is expected to be confidence-calibrated from the previous stage, and $\mathcal{S}_t$ learns with Eq. (6). Therefore, the confidence-calibrated outputs of the reference models are distilled to the model $\mathcal{M}_t$. The effect of the confidence loss can be found in Table 2.
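Putting the two pieces together, a sketch of how the teacher objective in Eq. (6) can be assembled, reusing the confidence_loss sketch above; the function name and the batch layout are our assumptions.

import torch.nn.functional as F

def teacher_objective(teacher, x_cur, y_cur, x_ood):
    # Cross-entropy on the current task's labeled data, plus the confidence
    # loss on data that is out-of-distribution for the teacher
    # (coreset of previous tasks and sampled external data).
    cls = F.cross_entropy(teacher(x_cur), y_cur)
    cnf = confidence_loss(teacher(x_ood))   # sketched above
    return cls + cnf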

Finally, with the external dataset, the learning objective of GD in Eq. (4) can be rewritten as:

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{M}_{t-1}, \mathcal{T}_{1:t-1}, \mathcal{D}_t^{\text{trn}} \cup \mathcal{D}_t^{\text{ext}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{S}_t, \mathcal{T}_t, \mathcal{D}_t^{\text{trn}} \cup \mathcal{D}_t^{\text{ext}}). \qquad (7)$

Note that $\mathcal{D}_t^{\text{ext}}$ is an unlabeled dataset, so it is not usable when optimizing the classification loss.

3 Related Work

Continual learning. Many recent works have addressed catastrophic forgetting under different assumptions. Broadly speaking, there are three different types of works [35]: one is class-incremental learning [2, 28, 36], where the number of class labels keeps growing. Another is task-incremental learning [21, 10], where the boundaries among tasks are assumed to be clear and the information about the task under test is given. The last can be seen as data-incremental learning, which is the case when the set of class labels or actions is the same for all tasks [13, 29, 30].

These works can be summarized as continual learning, and recent works on continual learning have studied two types of approaches to overcome catastrophic forgetting: model-based and data-based. Model-based approaches [1, 11, 13, 18, 22, 26, 29, 30, 31, 38] keep the knowledge of previous tasks by penalizing the change of parameters crucial for previous tasks, i.e., the updated parameters are constrained to be around the original values, and the update is scaled down by the importance of parameters on previous tasks. However, since DNNs have many local optima, there would be better local optima for both the previous and the new tasks, which cannot be found by model-based approaches.

On the other hand, data-based approaches [2, 10, 21, 28] keep the knowledge of the previous tasks by knowledge distillation [9], which minimizes the distance between the manifolds of the latent spaces of the previous and the new models. In contrast to model-based approaches, they require feeding data to obtain features in the latent space. Therefore, the amount of knowledge kept by knowledge distillation depends on the degree of similarity between the data distribution used to learn the previous tasks in the previous stages and the one used to distill the knowledge in later stages. To guarantee a certain amount of similar data, some prior works [2, 26, 28] reserved a small amount of memory to keep a coreset, and others [19, 32, 35, 36] trained a generative model and replayed the generated data when training a new model. Note that the model-based and the data-based approaches are orthogonal in most cases; thus, they can be combined for better performance [12].

Knowledge distillation in prior works. Our proposed method is a data-based approach, but it is different from prior works [2, 10, 21, 28], because their models commonly learn with the following task-wise local distillation loss:

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}) + \sum_{s=1}^{t-1} \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{M}_{t-1}, \mathcal{T}_s, \mathcal{D}_t^{\text{trn}}). \qquad (8)$

We emphasize that local distillation only preserves the knowledge within each of the previous tasks, while global distillation also preserves the knowledge across the tasks.

Similar to our 3-step learning, [30] and [10] utilized the idea of learning with two teachers. However, their strategy to keep the knowledge of the previous tasks is different: [30] applied a model-based approach, and [10] distilled the task-wise knowledge for task-incremental learning.

On the other hand, [2] used a similar fine-tuning step, but they built a balanced dataset by discarding most of the data of the current task and updated the whole network. However, such undersampling sacrifices the diversity of the frequent classes, which decreases the performance. Oversampling may solve this issue, but it makes the training not scalable: the size of the oversampled dataset increases in proportion to the number of tasks learned so far. Instead, we propose to apply data weighting.

Scalability. Early works on continual learning were not scalable, since they kept all previous models [13, 21, 29]. However, recent works have considered scalability by minimizing the amount of task-specific parameters [28, 30]. In addition, data-based methods require keeping either a coreset or a generative model to replay previous tasks. Our method is a data-based approach, but it does not suffer from this scalability issue, since we utilize an external dataset sampled from a large stream of unlabeled data. We note that, unlike the coreset, our external dataset does not require permanent memory: it is discarded after learning.

4 Experiments

In this section, we describe our experimental setup to show the effectiveness of our proposed methods and report the performance. The code and the datasets will be released.

4.1 Experimental Setup

Compared algorithms. To provide an upper bound on the performance, we compare with an oracle method, which stores all training data of previous tasks and replays them during training. In this method, learning with GC in Eq. (1) is sufficient. As a baseline, we provide the performance of GC with a coreset and without any external data. Among prior works, three state-of-the-art methods are compared: learning without forgetting (LwF) [21], distillation and retrospection (DR) [10], and end-to-end incremental learning (E2EiL) [2]. Among them, since LwF and DR are task-incremental learning methods, their classification loss is limited to the current task. Thus, for fair comparison, we extend the range of their classification loss, i.e., we use the same learning objective in Eq. (8) for them, and use the same coreset size as for the others. Also, when an external dataset is available, we optimize the distillation loss of the prior works under the external dataset sampled with our proposed strategy, in addition to the labeled training dataset, for fair comparison. To isolate the effect of the global distillation loss, we also compare the local distillation version of Eq. (7):

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}) + \sum_{s=1}^{t-1} \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{M}_{t-1}, \mathcal{T}_s, \mathcal{D}_t^{\text{trn}} \cup \mathcal{D}_t^{\text{ext}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{S}_t, \mathcal{T}_t, \mathcal{D}_t^{\text{trn}} \cup \mathcal{D}_t^{\text{ext}}), \qquad (9)$

which we call the local distillation (LD) method.

We do not compare model-based methods, because data-based methods are known to outperform them in class-incremental learning [19, 35], and they are orthogonal to data-based methods such that they can potentially be combined with our approaches for better performance [12].

Datasets. We evaluate the compared methods on CIFAR-100 [15] and ImageNet ILSVRC 2012 [4], where all images are downsampled to 32×32 [3]. For CIFAR-100, similar to prior works [2, 28], we shuffle the classes uniformly at random and split them to build a sequence of tasks. For ImageNet, we first sample 500 images per class for 100 randomly chosen classes for each trial, and then split the classes. To evaluate the compared methods in an environment with a large stream of unlabeled data, we take two large datasets: the TinyImages dataset [34] with 80M images and the entire ImageNet 2011 dataset with 14M images. The classes appearing in CIFAR-100 and ILSVRC 2012 are excluded to avoid any potential advantage from them. At each stage, our sampling algorithm retrieves unlabeled data from them uniformly at random to form an external dataset, until the number of retrieved samples reaches 1M.

Following the prior works, we divide the classes into splits of 5, 10, and 20 classes, such that there are 20, 10, and 5 tasks, respectively. For each task size, we evaluate the compared methods five times with different class orders (different set of classes in the case of ImageNet) and report the mean and standard deviation of the performance.
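For illustration, a small sketch of how the class order can be shuffled and split into tasks of a fixed size; the seed handling and the ImageNet subsampling are omitted, and the function is our own construction, not part of the authors' released code.

import random

def build_task_splits(num_classes=100, task_size=10, seed=0):
    # Shuffle class indices uniformly at random and split them into tasks of
    # `task_size` classes each (e.g., 10 tasks of 10 classes for CIFAR-100).
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)
    return [classes[i:i + task_size] for i in range(0, num_classes, task_size)]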

Dataset          CIFAR-100                                                   ImageNet
Task size        5                   10                  20                  5                   10                  20
Metric           ACC(↑)   FGT(↓)     ACC(↑)   FGT(↓)     ACC(↑)   FGT(↓)     ACC(↑)   FGT(↓)     ACC(↑)   FGT(↓)     ACC(↑)   FGT(↓)
Oracle           78.3±0.4  3.3±0.3   77.3±0.4  3.3±0.3   75.5±0.4  3.0±0.3   67.7±0.6  3.6±0.6   66.5±0.8  3.1±0.7   64.7±0.9  2.9±0.7
GC               57.0±1.2 20.9±0.7   56.4±0.4 19.7±0.6   55.5±0.5 18.2±0.3   43.8±0.6 23.6±0.5   43.8±0.8 21.7±1.0   44.2±0.5 18.9±1.2
Without an external dataset
LwF [21]         58.0±1.0 19.1±0.8   59.0±0.2 17.1±0.4   59.8±0.4 14.8±0.4   45.6±0.6 21.8±0.6   46.8±0.6 18.5±0.7   48.3±0.3 15.6±0.9
DR [10]          58.8±0.9 19.6±0.6   60.3±0.5 17.1±0.6   61.5±0.4 14.6±0.2   46.2±0.4 22.4±0.8   48.2±0.6 18.9±1.0   50.1±0.4 15.7±1.0
E2EiL [2]        59.7±0.8 16.6±0.8   62.1±0.2 12.9±0.4   65.1±0.4  8.9±0.3   47.7±0.9 17.7±0.3   50.4±0.6 13.5±0.5   53.7±0.5  9.0±0.5
LD (Ours)        60.9±0.8 16.8±0.6   63.5±0.5 13.2±0.4   66.3±0.4  9.2±0.2   48.4±0.5 18.5±0.7   51.9±0.5 14.1±0.6   55.3±0.8  9.4±0.8
GD (Ours)        61.7±0.8 15.3±0.6   64.6±0.5 12.2±0.4   67.0±0.5  8.7±0.3   49.7±0.5 16.8±0.6   53.2±0.5 12.9±0.7   56.0±0.7  8.7±0.8
With an external dataset
LwF [21]         59.5±0.5 19.4±0.5   60.6±0.3 17.2±0.5   60.2±0.5 15.0±0.4   47.2±0.6 21.5±0.6   49.1±0.7 18.5±0.7   49.2±0.7 15.9±0.9
DR [10]          59.7±0.7 19.5±0.5   61.9±0.4 16.5±0.4   63.1±0.4 13.7±0.1   47.1±0.6 22.0±0.8   50.2±1.0 18.4±0.8   51.8±0.7 14.9±0.8
E2EiL [2]        61.3±0.6 16.3±0.8   63.8±0.6 12.7±0.6   66.0±0.5  9.1±0.2   49.0±0.9 17.3±0.6   52.5±0.6 13.1±0.3   55.0±0.5  9.2±0.5
LD (Ours)        61.7±0.7 16.2±0.5   64.6±0.2 12.3±0.4   66.9±0.3  8.7±0.2   49.7±0.7 17.5±0.9   53.2±0.8 13.2±0.5   56.2±0.7  8.8±0.5
GD (Ours)        64.0±0.7 13.6±0.6   66.6±0.5 10.3±0.3   68.2±0.6  7.4±0.2   52.1±0.9 14.7±0.7   55.6±0.5 10.8±0.6   57.4±0.8  7.7±0.4
Table 1: Performance of compared methods on CIFAR-100 and ImageNet. We report the mean and the standard deviation of five trials with different random seeds, in %. (↑) indicates that higher is better; (↓) indicates that lower is better.
Figure 3: Experimental results on CIFAR-100 with external data when the task size is 10. We compare (a) the average incremental accuracy (ACC), (b) the average forgetting (FGT), and (c) the task-wise accuracy at the end of class-incremental learning, on the test dataset. We report the mean of the accuracy over five trials, with the standard deviation for (a) and (b). Results on ImageNet and other task sizes can be found in the Appendix.

Evaluation metric. We report the performance of the compared methods in two metrics: the average incremental accuracy (ACC) and the average forgetting (FGT). For simplicity, we assume that the number of test data is the same over all classes. For a test data point $(x, y)$ from the $i$-th task $\mathcal{T}_i$, let $\hat{y}_t(x)$ be the label predicted by the $t$-th model, such that

$A_{t,i} = \mathbb{E}_{(x,y)} \left[ \mathbb{1}\!\left( \hat{y}_t(x) = y \right) \right]$

measures the accuracy of the $t$-th model at the $i$-th task, where $i \le t$. Note that prediction is done without task boundaries: for example, at the $t$-th stage, the expected accuracy of a random guess is $1/|\mathcal{T}_{1:t}|$, not $1/|\mathcal{T}_i|$. At the $t$-th stage, ACC is defined as:

$\text{ACC} = \frac{1}{t-1} \sum_{s=2}^{t} \sum_{i=1}^{s} \frac{|\mathcal{T}_i|}{|\mathcal{T}_{1:s}|} A_{s,i}.$

Note that the performance of the first stage is not considered, as it is not class-incremental learning. While ACC measures the overall performance directly, FGT measures the amount of catastrophic forgetting by averaging the performance decay:

$\text{FGT} = \frac{1}{t-1} \sum_{s=2}^{t} \sum_{i=1}^{s-1} \frac{|\mathcal{T}_i|}{|\mathcal{T}_{1:s-1}|} \left( A_{i,i} - A_{s,i} \right),$

which is essentially the negative of the backward transfer [22]. Note that a smaller FGT is better, as it implies that the model forgets less about the previous tasks.
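To make the metrics concrete, the snippet below computes ACC and FGT from a matrix of task-wise accuracies under the equal-task-size simplification; it follows our reconstruction of the formulas above, not the authors' evaluation script.

import numpy as np

def acc_fgt(A):
    # A[s, i] is the accuracy of the model after stage s+1 on task i+1
    # (i <= s), given as a lower-triangular matrix of shape (T, T), with
    # equal task sizes assumed. Returns ACC and FGT averaged over stages 2..T.
    T = A.shape[0]
    acc = [A[s, :s + 1].mean() for s in range(1, T)]                 # overall accuracy at each stage
    fgt = [(np.diag(A)[:s] - A[s, :s]).mean() for s in range(1, T)]  # decay on previously learned tasks
    return float(np.mean(acc)), float(np.mean(fgt))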

Hyperparameters. Our model is based on wide residual networks [37] with 16 layers, a widen factor of 2, and a dropout rate of 0.3. The last fully connected layer is considered a task-specific layer, and whenever a task with new classes comes in, the layer is extended by adding parameters to produce predictions for the new classes. The number of parameters in the task-specific layer is small compared to the rest of the layers (at most about 2% in our model), which are considered shared parameters. The size of the coreset is set to 2000. Due to the scalability issue, the size of the sampled external dataset is set to the size of the labeled training dataset, i.e., $N^{\text{ext}} = |\mathcal{D}_t^{\text{trn}}|$ in Algorithm 2. For more details, see the Appendix.
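As an illustration of this architecture, a sketch of a classifier whose task-specific last layer is extended whenever a new task arrives; the class and method names are ours, and the wide residual network backbone is abstracted away as a generic feature extractor.

import torch
import torch.nn as nn

class IncrementalClassifier(nn.Module):
    # Shared feature extractor (shared parameters) with one task-specific
    # linear head per task; the logits of all heads are concatenated for
    # prediction without task boundaries.

    def __init__(self, feature_extractor, feature_dim):
        super().__init__()
        self.features = feature_extractor
        self.feature_dim = feature_dim
        self.heads = nn.ModuleList()     # task-specific parameters

    def add_task(self, num_new_classes):
        # Extend the task-specific layer when a task with new classes arrives.
        self.heads.append(nn.Linear(self.feature_dim, num_new_classes))

    def forward(self, x):
        h = self.features(x)
        return torch.cat([head(h) for head in self.heads], dim=1)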

Loss weight. We balance the contribution of each loss term by the relative size of the task set covered by the term: for each loss term defined over a task set $\mathcal{T}$, the loss weight at the $t$-th stage is

$w_{\mathcal{T}} = \frac{|\mathcal{T}|}{|\mathcal{T}_{1:t}|}. \qquad (10)$

We note that the loss weight could be tuned as a hyperparameter, but we find that this choice performs better than other values in general, as it follows the statistics of the test dataset: all classes are equally likely to appear.
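Under our reading of this weighting, a tiny sketch of how each term's weight can be computed; the helper name is ours.

def loss_weight(num_classes_in_term, num_classes_so_far):
    # Loss weight of Eq. (10): the relative size of the task set covered by a
    # loss term, out of all classes seen so far.
    return num_classes_in_term / num_classes_so_far

# Example with two previous tasks of 10 classes each and a current task of 10:
#   classification term (all 30 classes)        -> weight 30/30 = 1.0
#   distillation from the previous model (20)   -> weight 20/30
#   distillation from the current-task teacher  -> weight 10/30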

4.2 Evaluation

Comparison of methods. Table 1 and Figure 3 compare our proposed methods with the state-of-the-art methods. First, LD outperforms the prior methods, LwF, DR, and E2EiL, which shows the effectiveness of the proposed 3-step learning scheme. Specifically, DR does not have balanced fine-tuning, E2EiL lacks the teacher for the current task and fine-tunes the whole network with a small dataset, and LwF has neither the current-task teacher nor fine-tuning. On the other hand, GD outperforms the others including LD, which shows the effectiveness of global distillation.

Learning with an external dataset improves the performance consistently, but the improvement is more significant for GD. For example, in the case of ImageNet with a task size of 5, the relative performance gain from learning with an external dataset is 2.8% for E2EiL (from 47.7% to 49.0%), while it is 4.9% for GD (from 49.7% to 52.1%). Overall, with our proposed learning scheme and the usage of external data, GD shows a 9.3% relative performance improvement (from 47.7% to 52.1%) over E2EiL, the best-performing state-of-the-art method.

Figure 3(c) compares the task-wise performance at the end of class-incremental learning. While LwF and DR overfit to the current task, i.e., only the performance on the last task is high, E2EiL, LD, and GD show more balanced performance over tasks because of the fine-tuning. The effect of different balanced training strategies is compared under fair conditions in Table 3. Further, since GD keeps the knowledge across previous tasks as well as the knowledge within each task, its performance on previous tasks is significantly better than that of the other, local distillation-based methods.

Teacher   Confidence   ACC(↑)     FGT(↓)
–         –            62.4±0.7   15.0±0.8
cls       –            62.0±0.5   15.0±0.5
cls       cnf          65.0±0.4   11.7±0.4
dst       –            65.6±0.6   11.3±0.4
dst       cnf          66.6±0.5   10.3±0.3
Table 2: Comparison of models learned with different teachers of the current task when the task size is 10. “cls” and “dst” stand for learning with the objective of the teacher for the current task directly (Eq. (11)) and learning by distilling the knowledge of the teacher for the current task (Eq. (4)), respectively. The teacher of a model with “cnf” learns with the confidence loss.
Balancing   ACC(↑)     FGT(↓)
–           63.0±0.4   15.6±0.5
DW          63.6±0.4   11.0±0.5
FT-DSet     65.8±0.5   11.0±0.6
FT-DW       66.6±0.5   10.3±0.3
Table 3: Comparison of different balanced training strategies when the task size is 10. “DW,” “FT-DSet,” and “FT-DW” stand for training with data weighting in Eq. (5) for the entire training, fine-tuning with a training dataset balanced by removing data of the current task, and fine-tuning with data weighting, respectively.

Effect of the teacher for the current task. As an ablation study, Table 2 compares models with and without the teacher for the current task, $\mathcal{S}_t$. In addition to the baseline without $\mathcal{S}_t$, we also replace the distillation from the output of $\mathcal{S}_t$ with its learning objective in Eq. (6):

$\mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t^{\text{trn}}) + \mathcal{L}_{\text{dst}}(\theta_t; \mathcal{M}_{t-1}, \mathcal{T}_{1:t-1}, \mathcal{D}_t^{\text{trn}} \cup \mathcal{D}_t^{\text{ext}}) + \mathcal{L}_{\text{cls}}(\theta_t; \mathcal{D}_t) + \mathcal{L}_{\text{cnf}}(\theta_t; \mathcal{T}_t, \mathcal{C}_{t-1} \cup \mathcal{D}_t^{\text{ext}}). \qquad (11)$

This is equivalent to the case where $\mathcal{S}_t$ produces a fixed predictive distribution based on the ground-truth label. Here, if the confidence loss is discarded from the learning objective (“cls” without “cnf” in Table 2), then the model learns the current task with one-hot labels, which can be considered a highly overconfident predictive distribution. In the other cases, learning with a teacher for the current task is beneficial, because it helps to learn the knowledge within the current task, as discussed in Section 2.2. We also note that learning by optimizing the confidence loss improves the performance, which follows our intuition that confidence-calibrated outputs lead to better performance when an unlabeled external dataset is available.

Effect of balanced fine-tuning. Table 3 shows the effect of balanced training. First, balanced learning strategies improve FGT in general. If we train the model with data weighting in Eq. (5) from scratch (DW), then the performance is not better than having balanced fine-tuning on task-specific parameters only (FT-DW), as discussed in Section 2.2. Note that data weighting (FT-DW) is better than removing the data of the current task to construct a small balanced dataset (FT-DSet) proposed in [2], because all training data are useful for finding the correct decision boundary.

Prev    OOD      ACC(↑)     FGT(↓)
–       –        64.6±0.5   12.2±0.4
–       Random   66.1±0.3   11.5±0.0
Pred    –        66.0±0.4   10.3±0.5
Pred    Pred     65.1±0.3   11.2±0.4
Pred    Random   66.6±0.5   10.3±0.3
Table 4: Comparison of different external data sampling strategies when the task size is 10. The “Prev” and “OOD” columns describe the sampling method for data of previous tasks and for out-of-distribution data, where “Pred” and “Random” stand for sampling based on the prediction of the previous model and random sampling, respectively. In particular, when sampling OOD by “Pred,” we sample data minimizing the confidence loss. When only Prev or OOD is sampled, the number of sampled data for each purpose is doubled for fair comparison.

Effect of external data sampling. Table 4 compares different external data sampling strategies. Unlabeled data are beneficial in all cases, but the performance gain differs across sampling strategies. First, observe that randomly sampled data are useful, because their predictive distribution is diverse, which helps the model learn the diverse knowledge of the reference models and makes it confidence-calibrated. However, while the performance of the random sampling strategy is comparable to that of sampling based on the prediction of the previous model, it shows high FGT. This implies that unlabeled data sampled based on the prediction of the previous model better prevent the model from catastrophic forgetting. As discussed in Section 2.3, our proposed sampling strategy, the combination of the above two strategies, shows the best performance, under the condition that the same amount of unlabeled data is sampled in total. Finally, note that sampling OOD data based on the prediction of the previous model is not beneficial, because the “data most likely to be from OOD” would not be useful: OOD data sampled based on the prediction of the previous model have an almost uniform predictive distribution and would be locally distributed, while OOD is essentially the complement of the data distribution of interest. Thus, to learn to discriminate OOD well in our case, the model should learn with data widely distributed outside of the data distribution of the previous tasks.

5 Conclusion

We propose to leverage a large stream of unlabeled data in the wild for class-incremental learning. A novel global distillation loss is proposed, which is particularly effective when unlabeled data are available. The key property of the proposed loss is that it distills the knowledge of the reference models without task boundaries, leading to better knowledge distillation. Our 3-step learning scheme effectively leverages the external dataset sampled from the stream of unlabeled data with the proposed sampling strategy.

References

Appendix A Global Distillation and Local Distillation

Figure A.1: An illustration of knowledge distillation from the previous model in class-incremental learning, where Task I and II are seen when training the previous model and Task III is the new task. The proposed global distillation utilizes the knowledge over all previous tasks without task boundaries (green and blue arrows), while local distillation in prior works takes account of the knowledge within each of the tasks (blue arrows). Note that no knowledge about Task III is distilled from the previous model, because Task III did not exist when training the previous model.

As illustrated in Figure A.1, with our proposed global distillation, the model learns the knowledge over all previous tasks (green and blue arrows), while it learns only the knowledge within each of the tasks (blue arrows) with local distillation commonly used in prior works [2, 10, 21, 28]. For example, if a dog-like image is given for knowledge distillation, the model would learn how to discriminate Dog vs. {Toucan, Cat, Goose} and calibrate its confidence on Toucan vs. Cat vs. Goose under the proposed global distillation. On the other hand, under local distillation, the model would learn how to discriminate Dog vs. Toucan and calibrate its confidence on Cat vs. Goose.
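To spell out the difference in code, the sketch below contrasts the soft targets produced by the previous model under global distillation (one softmax over all previous classes) and under local distillation (an independent softmax per previous task); task_slices and the function names are our own constructions, not the authors' code.

import torch.nn.functional as F

def soft_targets_global(prev_logits, temperature=2.0):
    # Global distillation: a single softmax over all previous classes, so
    # cross-task relations (e.g., Dog vs. Cat) are preserved in the targets.
    return F.softmax(prev_logits / temperature, dim=1)

def soft_targets_local(prev_logits, task_slices, temperature=2.0):
    # Local distillation: an independent softmax within each previous task,
    # so only within-task relations (e.g., Dog vs. Toucan) are preserved.
    return [F.softmax(prev_logits[:, sl] / temperature, dim=1) for sl in task_slices]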

Appendix B More on Experiments

b.1 Details on Experimental Setup

Hyperparameters.

We use mini-batch training with a batch size of 128 and 200 epochs for each training run to ensure convergence. The initial learning rate is 0.1, and it decays by a factor of 0.1 after 120, 160, and 180 epochs when there is no fine-tuning. When fine-tuning is applied, the main training is performed for 180 epochs, where the learning rate decays after 120, 160, and 170 epochs, and fine-tuning is performed for 20 epochs, where the learning rate starts at 0.01 and decays by a factor of 0.1 after 10 and 15 epochs. We note that 20 epochs are enough for convergence even when fine-tuning the whole network for some methods. We update the model parameters by stochastic gradient descent with a momentum of 0.9 and an L2 weight decay of 0.0005. The size of the coreset is set to 2000. Due to the scalability issue, the size of the sampled external dataset is set to the size of the labeled dataset. For all experiments, the temperature for smoothing the probabilities for distillation is set to 2. In more detail, let $z_1, \dots, z_n$ be the set of outputs (or logits). Then, with a temperature $\gamma$, the probabilities are computed as follows:

$p_i = \frac{\exp(z_i / \gamma)}{\sum_{j=1}^{n} \exp(z_j / \gamma)}.$
Scalability of methods. We note that all compared methods are scalable and they are compared in a fair condition. We do not compare generative replay methods with ours, because the coreset approach is known to outperform them in class-incremental learning in a scalable setting: in particular, it has been reported that continual learning for a generative model is a challenging problem on datasets of natural images like CIFAR-100 [19, 36].

b.2 Plots

Figure B.1: Experimental results on CIFAR-100 with external data when the task size is 5. We compare (a) the average incremental accuracy (ACC), (b) the average forgetting (FGT), and (c) the task-wise accuracy at the end of class-incremental learning, on the test dataset. We report the mean of the accuracy over five trials, with the standard deviation for (a) and (b).
Figure B.2: Experimental results on CIFAR-100 with external data when the task size is 20. We compare (a) the average incremental accuracy (ACC), (b) the average forgetting (FGT), and (c) the task-wise accuracy at the end of class-incremental learning, on the test dataset. We report the mean of the accuracy over five trials, with the standard deviation for (a) and (b).
Figure B.3: Experimental results on ImageNet with external data when the task size is 5. We compare (a) the average incremental accuracy (ACC), (b) the average forgetting (FGT), and (c) the task-wise accuracy at the end of class-incremental learning, on the test dataset. We report the mean of the accuracy over five trials, with the standard deviation for (a) and (b).
Figure B.4: Experimental results on ImageNet with external data when the task size is 10. We compare (a) the average incremental accuracy (ACC), (b) the average forgetting (FGT), and (c) the task-wise accuracy at the end of class-incremental learning, on the test dataset. We report the mean of the accuracy over five trials, with the standard deviation for (a) and (b).
Figure B.5: Experimental results on ImageNet with external data when the task size is 20. We compare (a) the average incremental accuracy (ACC), (b) the average forgetting (FGT), and (c) the task-wise accuracy at the end of class-incremental learning, on the test dataset. We report the mean of the accuracy over five trials, with the standard deviation for (a) and (b).