Online Continual Learning with Contrastive Vision Transformer

07/24/2022
by   Zhen Wang, et al.
The University of Sydney

Online continual learning (online CL) studies the problem of learning sequential tasks from an online data stream without task boundaries, aiming to adapt to new data while alleviating catastrophic forgetting on past tasks. This paper proposes the Contrastive Vision Transformer (CVT) framework, which designs a focal contrastive learning strategy based on a transformer architecture to achieve a better stability-plasticity trade-off for online CL. Specifically, we design a new external attention mechanism for online CL that implicitly captures previous tasks' information. Besides, CVT contains learnable focuses for each class, which accumulate the knowledge of previous classes to alleviate forgetting. Based on the learnable focuses, we design a focal contrastive loss to rebalance contrastive learning between new and past classes and consolidate previously learned representations. Moreover, CVT contains a dual-classifier structure for decoupling the learning of current classes from the balancing of all observed classes. Extensive experimental results show that our approach achieves state-of-the-art performance with even fewer parameters on online CL benchmarks and effectively alleviates catastrophic forgetting.


1 Introduction

One of the major challenges in research on artificial neural networks is developing the ability to accumulate knowledge over time from a non-stationary data stream [7, 29, 38]. Although most successful deep learning techniques achieve excellent results on pre-collected and fixed datasets, they are incapable of adapting their behavior to a non-stationary environment over time [18, 73]. When streaming data arrives continuously, training on the new data can severely interfere with the model’s previously learned knowledge of past data, resulting in a drastic drop in performance on previous tasks. This phenomenon is known as catastrophic forgetting or catastrophic interference [54, 61]. Continual learning (also known as lifelong learning or incremental learning) [62, 63, 79, 73, 80, 95, 18] aims to solve this problem by maintaining and accumulating the acquired knowledge over time from a stream of non-stationary data.

Continual learning requires neural networks to be stable enough to prevent forgetting and plastic enough to learn new concepts, which is referred to as the stability-plasticity dilemma [55, 27]. Early works focused on the Task-aware protocol, which selects the corresponding classifier using oracle knowledge of the task identity at inference time [2, 46, 65, 69, 18]. For example, regularization-based methods penalize the changes of important parameters when learning new tasks and typically assign a separate classifier for each task [12, 42, 64, 85, 93]. Recent studies have focused on a more practical Task-free protocol, which evaluates the network on all classes observed during training without requiring the task identity [3, 81, 8, 95, 37, 90, 13]. Among them, rehearsal-based methods that store a small set of seen examples in a limited memory for replaying have demonstrated promising results [4, 62, 9, 76]. This paper focuses on a more realistic and challenging setting: online continual learning (online CL) [34, 50, 67], where the model learns a sequence of classification tasks with a single pass over the data and without task boundaries. Both the Task-free and Task-aware protocols also exist in online CL.

Inspired by recent breakthroughs in self-supervised learning [16, 11, 30, 72], we find that the knowledge learned by supervised contrastive learning [31, 41, 28] exhibits greater robustness and transferability. Such general and transferable knowledge is exactly what online CL seeks, as it can effectively help mitigate forgetting. Unfortunately, it is challenging to employ contrastive learning in the continual setting, for two main reasons: 1) contrastive learning requires informative negative samples to learn and distinguish different clusters, but in online CL, previous task data is unavailable or very limited, which causes a severely imbalanced contrast between past and new classes; 2) the contrastively learned knowledge may suffer from forgetting, since the distribution of the data stream is continually changing.

Figure 1: The overall framework of Contrastive Vision Transformer (CVT).

Considering the superior modeling capability that Vision Transformers [21] have recently demonstrated on computer vision tasks [57, 84, 26, 92, 17], we exploit the potential of the attention mechanism [75] for online CL. Overall, we strategically integrate contrastive learning and the transformer to model the online data stream. We propose a novel framework, Contrastive Vision Transformer (CVT), to alleviate the forgetting problem and tackle the above imbalance issue of contrastive learning in online CL. An overview of the framework is illustrated in Fig. 1. Specifically, we design an effective and efficient transformer architecture with external attention that implicitly captures previous tasks’ information and reduces the number of parameters. CVT contains learnable focuses for each class, which accumulate the knowledge of previous classes to alleviate forgetting. Based on the learnable focuses, we design a focal contrastive loss at the attention level to rebalance contrastive learning between new and past classes and improve inter-class distinction and intra-class aggregation. Moreover, CVT contains a dual-classifier structure: an injection classifier injects representations of stream data into the model, mitigating interference with previous knowledge, and an accumulation classifier integrates previous and new knowledge in a balanced manner.

We systematically compare state-of-the-art and well-established methods for the online CL problem in both the Task-free and Task-aware protocols. Experimental results show that the CVT framework significantly outperforms other approaches in terms of accuracy and forgetting, even with fewer parameters. An ablation study validates each component of the proposed framework.

The main contributions of this paper are three-fold:

  • We propose a novel framework Contrastive Vision Transformer (CVT) to achieve a better stability-plasticity trade-off for online CL. CVT contains class-wise learnable focuses, which can accumulate the knowledge of previous classes to alleviate forgetting.

  • We design a focal contrastive loss to rebalance contrastive learning between new and past classes and learn more robust representations.

  • The extensive experimental results show that CVT achieves state-of-the-art performance with even fewer parameters on online continual learning benchmarks.

2 Related Works

2.1 Continual Learning Methods

Continual learning (CL) methods have been developed to alleviate catastrophic forgetting in neural networks. These methods can be divided into three main categories: expansion-based, regularization-based, and rehearsal-based methods.

As new tasks arrive, expansion-based methods dynamically expand networks and keep the sub-networks related to previous tasks fixed [69, 89, 1, 59, 86, 45, 82, 53]. However, most expansion-based methods require task identity during inference in order to allocate distinct sets of parameters to distinct tasks. Regularization-based methods limit the changes of important parameters during the learning of new tasks by estimating the importance of each network parameter for prior tasks [12, 42, 2, 64, 43, 66, 93]. These works differ in how they compute the importance of network parameters.

Rehearsal-based methods [8, 37, 81, 60, 15, 22, 49, 79, 68, 52] alleviate catastrophic forgetting by replaying a subset of past-task data stored in a limited buffer. iCaRL [62] trains a nearest-class-mean classifier while limiting the change of the representation in later tasks through a self-distillation loss. In addition to replaying past experiences, HAL [13] keeps predictions on some anchor points through an additional objective. IExpressNet [95] introduces a representative expression memory together with a center-expression-distilled loss, showing satisfactory performance on facial expression recognition. DER++ [9] combines rehearsal with a distillation loss [36, 83, 47] to retain past experience and obtains state-of-the-art performance. RM [6] proposes an uncertainty-based sampling approach that uses uncertainty and data augmentation to improve rehearsal. The method proposed in this paper belongs to the rehearsal-based category.

2.2 Online Continual Learning

Online continual learning (online CL) is a more realistic [50, 78, 87, 96, 67, 49] and difficult setup [34], where the model learns from a non-i.i.d. data stream online, without the help of task identifiers or task boundaries at both the training and inference stages. Online CL methods are mainly based on rehearsal.

Experience Replay (ER) [63] employs reservoir sampling for memory management and jointly optimizes a network by mixing memory data with online stream data. ERT [10] improves ER by balanced sampling and bias control. AGEM [14] and GEM [49] use episodic memory to compute past task gradients to constrain the online update step. GSS [4] presents a gradient-based sampling to store diversified data for learning more information. ASER [67] is based on the Shapley Value theory to improve the memory buffer update and sampling. CLS [5] uses two extra models to maintain long-term and short-term semantic memories for knowledge consolidation. SCR [52] uses supervised contrastive loss for representation learning and employs the nearest class mean to classify.

2.3 Vision Transformers

The Transformer model was first applied to machine translation [75], and Transformers have since become the state-of-the-art models for most natural language processing tasks [20, 58, 24, 51, 25]. Attention modules are the core components of Transformers, aggregating information from the entire input sequence. Recently, the Vision Transformer (ViT) [21] was proposed to make the Transformer architecture scalable for image classification when the data is large enough. Since then, a lot of effort has been dedicated to improving Vision Transformers’ data efficiency and model efficiency [92, 32, 91, 40], where an effective direction is to strategically integrate properties of convolution into the Transformer architecture [74, 91, 48, 84, 26, 94, 23, 17]. CoAT [88] proposes a conv-attention module that realizes relative position embeddings with convolutions. LeViT [26] builds a pyramid attention structure with pooling to learn convolution-like features. By leveraging sequence pooling and convolutions, CCT [33] eliminates the need for class tokens and positional embeddings.

Nevertheless, current vision transformers may not be applicable to modeling the online data stream, and existing continual learning algorithms developed for CNNs may not be ideal for vision transformers either. To this end, we propose a lightweight Contrastive Vision Transformer (CVT) with a focal contrastive loss for online continual learning, which achieves better performance than other transformer and CNN baselines.

3 Preliminary

3.0.1 Problem Setup.

Formally, an online continual learning problem is split into a sequence of supervised learning tasks $\{\mathcal{T}_1, \dots, \mathcal{T}_T\}$, where $T$ is the total number of tasks. Let $\{\mathcal{D}_1, \dots, \mathcal{D}_T\}$ be the corresponding online data stream, where $\mathcal{D}_t$ is the dataset of task $\mathcal{T}_t$. For task $\mathcal{T}_t$, input samples $x$ and the corresponding ground-truth labels $y$ are drawn from the i.i.d. distribution $\mathcal{D}_t$. A mini-batch of training data $\mathcal{B}_t$ from $\mathcal{D}_t$ arrives gradually in an online stream (each sample is seen only once). Besides, a limited memory buffer $\mathcal{M}$ saves a small set of training data of seen tasks. The model is trained on $\mathcal{B}_t \cup \mathcal{B}_{\mathcal{M}}$ at each iteration, where $\mathcal{B}_{\mathcal{M}}$ is a mini-batch sampled from $\mathcal{M}$. At task $\mathcal{T}_t$, the label space of the model is the set of all observed classes, and the model is expected to predict well on all these classes at the inference stage.
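To make this setup concrete, the following is a minimal PyTorch-style sketch of one pass over the stream (an illustration only, not the authors' code; `model`, `optimizer`, `stream_loader`, and the plain cross-entropy objective are placeholder assumptions):

```python
import random
import torch
import torch.nn.functional as F

def train_online(model, optimizer, stream_loader, buffer, buffer_size=500, mem_bs=10):
    """Single-pass online training: each stream mini-batch B_t is seen exactly once and
    jointly optimized with a small batch B_M replayed from the limited memory buffer M."""
    for x_stream, y_stream in stream_loader:                  # B_t arrives once, no extra epochs
        x, y = x_stream, y_stream
        if buffer:                                            # replay batch B_M sampled from M
            xs, ys = zip(*random.sample(buffer, min(mem_bs, len(buffer))))
            x = torch.cat([x_stream, torch.stack(xs)])
            y = torch.cat([y_stream, torch.stack(ys)])
        loss = F.cross_entropy(model(x), y)                   # placeholder objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        for xi, yi in zip(x_stream, y_stream):                # naive fill-until-full buffer update;
            if len(buffer) < buffer_size:                     # a reservoir variant is sketched in Sec. 4.2
                buffer.append((xi, yi))
```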

3.0.2 Supervised Contrastive Learning (SCL).

SCL [41, 28, 16] aims to push the representations of samples from different classes farther apart while tightly clustering the representations of samples from the same class. Suppose that the classification model can be decomposed into two components: an encoder $f(\cdot)$ and a classifier $g(\cdot)$. The encoder $f$ maps an image sample $x$ to a vectorial embedding (representation) $z = f(x)$. The classifier $g$ maps the representation $z$ to a classification vector $g(z)$. Without training $g$, SCL focuses on training $f$ as follows: given a batch of samples $\mathcal{B}$, SCL first generates an augmented batch $\mathcal{B}'$ by making two random augmentations of each sample in $\mathcal{B}$, with $|\mathcal{B}'| = 2|\mathcal{B}|$. The SCL loss takes the following form:

$\mathcal{L}_{SCL} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{z_p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{z_a \in A(i)} \exp(z_i \cdot z_a / \tau)}$,   (1)

where $I$ represents the set of indices of $\mathcal{B}'$; $A(i)$ is the set of representations of samples in $\mathcal{B}'$ except for that of $x_i$; $P(i)$ is the set of representations of the positive samples (i.e., samples with the same label) with respect to the anchor $z_i$; $\tau$ is a temperature hyperparameter; and $|P(i)|$ is its cardinality.
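For reference, Eq. (1) can be implemented compactly; the sketch below is an illustrative PyTorch version (assuming L2-normalized representations and averaging over anchors), not the authors' code:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.1):
    """Eq. (1): pull together representations with the same label, push the rest apart.
    z: (N, d) representations of the augmented batch B'; labels: (N,) class ids."""
    z = F.normalize(z, dim=1)                                    # cosine-similarity space
    sim = z @ z.t() / tau                                        # (N, N) pairwise logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))              # an anchor never contrasts itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log-softmax over A(i)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask # positive set P(i)
    pos_cnt = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_cnt
    return loss.mean()                                           # averaged over anchors
```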

Although SCL could learn transferable representations that help prevent forgetting in online CL, it faces new challenges: 1) previous task data is unavailable or very limited due to the streaming fashion, which causes a severely imbalanced contrast between past and new classes; 2) the contrastively learned knowledge may suffer from forgetting since the distribution of the data stream is continually changing. As illustrated in Fig. 2(a1) and Fig. 2(a2) (a 2D feature space is used for visualization clarity), the imbalanced data stream in online CL makes the representations of previous tasks drift and difficult to cluster accurately with SCL.

4 Methodology

Figure 2: Continually arriving samples are clustered on the hypersphere by contrastive learning (for visualization clarity, we use a 2D representation space). Different colors represent different classes. Supervised contrastive learning loss on online stream data fails to obtain good inter-class distinction and intra-class aggregation caused by the class imbalance. Focal contrastive loss effectively mitigates class imbalance in online CL and accumulates class-wise knowledge by the learnable focuses.

To alleviate the forgetting problem in online continual learning, we propose the Contrastive Vision Transformer (CVT) framework, which designs a new focal contrastive learning strategy on top of a transformer architecture. An overview of the framework is depicted in Fig. 1. CVT leverages the strengths of the attention mechanism in online CL by designing an effective transformer architecture with external attention. We tackle the imbalance issue of SCL in online CL by proposing a focal contrastive loss at the attention level. The learnable focuses in CVT accumulate class-specific knowledge to alleviate the forgetting of previous tasks. Besides, a dual-classifier structure is used to decouple learning the current classes from balancing all seen classes, improving the stability-plasticity trade-off.

In the following, we describe CVT in terms of its model architecture (Sec. 4.1) and the focal contrastive continual learning strategy (Sec. 4.2).

4.1 Model Architecture

Fig. 3 illustrates the CVT architecture. The major contributing components in the architecture include 1) external attention, which implicitly captures previous tasks’ information and reduces the number of parameters, and 2) learnable focuses, which could maintain and accumulate the knowledge of previous classes.

4.1.1 External Attention.

CVT leverages the strengths of the attention mechanism in online CL. Unlike the vanilla self-attention in vision transformers [21, 26], which derives the attention map by computing similarities between self-queries and self-keys [75], we introduce an external attention mechanism [79] that obtains the attention map by computing the affinities between the self-queries and a learnable external key $K$, together with a learnable attention bias $B$, which implicitly injects previous task information into the attention mechanism. Moreover, the proposed architecture reduces the number of parameters compared to self-attention.

Let the input tensor be $X$. We apply linear transformations with weights $W^Q$ and $W^V$ to obtain the vanilla self-query $Q = XW^Q$ and self-value $V = XW^V$, respectively. We employ a learnable linear layer to replace the input-dependent self-key, giving the external key $K$, and explicitly add a learnable attention bias $B$ to the attention maps. Consider $H$ attention heads, into which $Q$, $K$, and $V$ are uniformly split as segments $Q_h$, $K_h$, and $V_h$. The external attention mechanism computes the head-specific attention map and concatenates the multi-head attention as follows:

$\mathrm{A}_h = \mathrm{Softmax}\!\left(\frac{\mathrm{Norm}(Q_h K_h^{\top})}{\sqrt{d}} + B_h\right) V_h, \qquad \mathrm{ExternalAttention}(X) = \mathrm{Concat}(\mathrm{A}_1, \dots, \mathrm{A}_H),$   (2)

where Norm(·) denotes batch normalization and $d$ is the dimension of the key.
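The sketch below shows one plausible reading of this external attention (an assumption-laden simplification, not the paper's module: it assumes a fixed token count per stage so the learnable key can match the sequence length, and it omits the surrounding block structure):

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Multi-head attention whose key is a learnable parameter (external to the input),
    with an additive learnable attention bias; queries and values come from the input."""
    def __init__(self, dim, num_tokens, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dk = num_heads, dim // num_heads
        self.q_proj = nn.Linear(dim, dim)                        # self-query Q = X W^Q
        self.v_proj = nn.Linear(dim, dim)                        # self-value V = X W^V
        self.key = nn.Parameter(torch.randn(num_heads, num_tokens, self.dk) * 0.02)
        self.bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))
        self.norm = nn.BatchNorm2d(num_heads)                    # Norm(.) over attention logits
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                        # x: (B, N, dim), N == num_tokens
        b, n, _ = x.shape
        q = self.q_proj(x).view(b, n, self.h, self.dk).transpose(1, 2)   # (B, H, N, dk)
        v = self.v_proj(x).view(b, n, self.h, self.dk).transpose(1, 2)   # (B, H, N, dk)
        logits = q @ self.key.transpose(-1, -2) / self.dk ** 0.5         # (B, H, N, N)
        attn = (self.norm(logits) + self.bias).softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, -1)               # back to (B, N, dim)
        return self.out_proj(ctx)

# usage sketch: ExternalAttention(dim=128, num_tokens=196)(torch.randn(2, 196, 128))
```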

Figure 3: The architecture of Contrastive Vision Transformer (CVT). CVT architecture is composed of stacked transformer blocks after a simple convolutional block. Shrink module performs downsampling to reduce the resolution of the activation maps and increase their number of channels between CVT stages. Focuses are a set of learnable attention vectors. After a projection layer, two classifiers serve for knowledge injection and accumulation, respectively.

4.1.2 Learnable Focuses.

CVT contains learnable focuses for each class, which maintain and accumulate the knowledge of previous classes to alleviate forgetting in online CL. The class-wise learnable focuses $F = \{f_1, \dots, f_C\}$ are a set of learnable attention vectors, as shown in Fig. 3, where the focus $f_c$ corresponds to class $c$ and $C$ is the number of seen classes. The size of the learnable focuses is negligible relative to the overall model.

When a new class $c$ appears in the data stream, the corresponding focus $f_c$ starts to participate in the training (refer to Eq. 3). Even if class $c$ no longer appears in the data stream afterward, the focus $f_c$ still participates in the online CL training and serves as a negative sample for the other classes, as illustrated in Fig. 2(b1) and Fig. 2(b2). Thus, the focuses preserve and accumulate the previously learned class-specific knowledge and act as a forgetting-mitigation mechanism in online continual learning.
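A minimal sketch of how such class-wise focuses could be maintained is given below (hypothetical helper, not the authors' code; the focus dimension, initialization, and `ParameterDict` bookkeeping are assumptions):

```python
import torch
import torch.nn as nn

class ClassFocuses(nn.Module):
    """Keeps one learnable focus vector per observed class; a new focus is appended
    the first time its class appears in the stream."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.focuses = nn.ParameterDict()                # class id -> focus vector f_c

    def observe(self, labels):
        for c in labels.unique().tolist():
            key = str(int(c))
            if key not in self.focuses:                  # new class enters the stream
                self.focuses[key] = nn.Parameter(torch.randn(self.dim) * 0.02)

    def all_focuses(self):
        """Stacked focuses (C, dim) with their class ids, for use in the contrastive loss."""
        keys = sorted(self.focuses.keys(), key=int)
        classes = torch.tensor([int(k) for k in keys])
        return torch.stack([self.focuses[k] for k in keys]), classes
```

In practice, focuses created after the optimizer is built would also need to be registered with it, e.g., through an added parameter group.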

4.2 Focal Contrastive Continual Learning

We propose a rehearsal-based focal contrastive learning scheme to 1) tackle the imbalance issue of SCL in online CL and 2) accumulate class-specific knowledge, thereby alleviating interference with previous tasks. The learning scheme includes two losses, a focal contrastive loss and a dual-classifier loss, described as follows.

4.2.1 Focal Contrastive Loss.

For learning representations continually, we propose a focal contrastive loss for online CL. As mentioned in Sec. 3, during the training phase, the model observes a mini-batch $\mathcal{B}_t$ at a time, sampled from task $\mathcal{T}_t$ in the data stream. An input batch for the model is composed of $\mathcal{B}_t$ and a batch $\mathcal{B}_{\mathcal{M}}$ sampled from the memory buffer $\mathcal{M}$. The input batch and its augmented view are encoded by the CVT blocks to generate attention representations, as shown in Fig. 3. As mentioned previously, a set of class-wise learnable focuses $F = \{f_1, \dots, f_C\}$ is utilized by the focal contrastive loss $\mathcal{L}_{FC}$, where the focus $f_c$ is a learnable attention vector for class $c$ and $C$ is the number of seen classes. The FC loss function is defined as:

$\mathcal{L}_{FC} = \sum_{i \in I} \frac{-1}{|P(i)| + \lambda_f} \left( \sum_{z_p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{D(i)} + \lambda_f \log \frac{\exp(z_i \cdot f_{y_i} / \tau)}{D(i)} \right),$   (3)

where

$D(i) = \sum_{z_a \in A(i)} \exp(z_i \cdot z_a / \tau) + \lambda_f \sum_{c=1}^{C} \exp(z_i \cdot f_c / \tau),$   (4)

and $\lambda_f$ is the weight of the focuses. We set $\lambda_f$ larger than 1 so that the focuses play a more important role in contrastive learning; $P(i)$ and $A(i)$ are the same as in the supervised contrastive loss of Eq. (1); $\tau$ is a temperature hyperparameter.

The benefits of using the focal contrastive loss are two-fold. First, it alleviates the imbalance issue in online CL by employing the class-wise focuses. Second, it accumulates previous knowledge through the learnable focuses, which continually serve as prototypes of their classes to maintain class-specific information. Trained with the proposed focal contrastive loss, CVT rebalances contrastive learning between new and past classes and improves inter-class distinction and intra-class aggregation, as illustrated in Fig. 2. In Sec. 5.3, we empirically observe that $\mathcal{L}_{FC}$ outperforms the original $\mathcal{L}_{SCL}$ and boosts online CL.
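The sketch below instantiates the reconstruction of Eqs. (3)-(4) above in PyTorch (a speculative illustration rather than the published loss: it assumes every class in the batch already has a focus, and treats the anchor's own focus as an up-weighted positive while all focuses enter the denominator):

```python
import math
import torch
import torch.nn.functional as F

def focal_contrastive_loss(z, labels, focuses, focus_classes, lam_f=2.0, tau=0.1):
    """Focus-augmented contrastive loss: the anchor's class focus is an up-weighted
    positive, and all class focuses act as (weighted) negatives for the other classes.
    z: (N, d) batch representations; focuses: (C, d); focus_classes: (C,) class ids."""
    z = F.normalize(z, dim=1)
    f = F.normalize(focuses, dim=1)
    sim_zz = z @ z.t() / tau                                   # sample-sample similarities
    sim_zf = z @ f.t() / tau                                   # sample-focus similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim_zz = sim_zz.masked_fill(self_mask, float('-inf'))      # exclude the anchor itself
    # log D(i): all other samples plus lam_f-weighted focuses of every seen class
    log_denom = torch.logsumexp(torch.cat([sim_zz, sim_zf + math.log(lam_f)], dim=1), dim=1)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_cnt = pos_mask.sum(dim=1)
    sample_pos = sim_zz.masked_fill(~pos_mask, 0.0).sum(dim=1) - log_denom * pos_cnt
    own_focus = (labels[:, None] == focus_classes[None, :]).float()   # assumes focus exists
    focus_pos = lam_f * ((sim_zf * own_focus).sum(dim=1) - log_denom)
    loss = -(sample_pos + focus_pos) / (pos_cnt + lam_f)
    return loss.mean()
```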

4.2.2 Dual-classifier Loss.

We propose a dual-classifier structure to decouple learning the current classes from balancing all seen classes. It contains an injection classifier, which injects the new task representation into the model while alleviating interference with previously learned knowledge, and an accumulation classifier, which integrates past and new knowledge in a balanced manner.

Let $z$ be the representation of a sample output by the Projection of CVT before the classifiers. When a new data stream batch $\mathcal{B}_t$ arrives, we utilize the output of an independent injection classifier to compute a classification loss:

$\mathcal{L}_{inj} = \mathbb{E}_{(x, y) \sim \mathcal{B}_t} \left[ \ell_{ce}(g_{inj}(z), y) \right],$   (5)

where $g_{inj}$ denotes the injection classifier and $\ell_{ce}$ denotes the cross-entropy loss. The injection classifier $g_{inj}$ is only trained on stream data and does not participate in the inference stage.

Besides, we employ an accumulation classifier to improve the stability-plasticity trade-off by integrating previous and new knowledge in a balanced manner. The accumulation classifier is used at the inference stage to output the prediction. Rehearsing the limited memory data while learning new tasks is a crucial way to maintain previous knowledge: we replay the exemplars stored in the memory buffer with their ground-truth labels. In addition, the accumulation classifier also needs a supervised signal from the current task data. Therefore, the accumulation classifier loss is:

$\mathcal{L}_{acc} = \alpha\, \mathbb{E}_{(x, y) \sim \mathcal{M}} \left[ \ell_{ce}(g_{acc}(z), y) \right] + \beta\, \mathbb{E}_{(x, y) \sim \mathcal{B}_t} \left[ \ell_{ce}(g_{acc}(z), y) \right],$   (6)

where $g_{acc}$ denotes the accumulation classifier, and $\alpha$ and $\beta$ are the coefficients balancing knowledge consolidation. We approximate the expectation over $\mathcal{M}$ by computing gradients on batches sampled from the memory buffer.
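A compact sketch of how the two heads could be trained in one iteration (illustrative only; the head definitions, `alpha`, and `beta` are assumptions rather than the paper's specification):

```python
import torch
import torch.nn.functional as F

def dual_classifier_loss(z_stream, y_stream, z_mem, y_mem, g_inj, g_acc, alpha=1.0, beta=1.0):
    """The injection head sees only the current stream batch; the accumulation head is
    trained on replayed memory data plus the current batch in a balanced manner."""
    # Eq. (5): injection classifier, stream data only, not used at inference
    loss_inj = F.cross_entropy(g_inj(z_stream), y_stream)
    # Eq. (6): accumulation classifier, balancing memory replay and current data
    loss_acc = alpha * F.cross_entropy(g_acc(z_mem), y_mem) \
             + beta * F.cross_entropy(g_acc(z_stream), y_stream)
    return loss_inj, loss_acc

# usage sketch: g_inj = torch.nn.Linear(dim, num_classes); g_acc = torch.nn.Linear(dim, num_classes)
```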

Overall, the total loss used in CVT is the sum of Eq. (3), Eq. (5), and Eq. (6):

$\mathcal{L} = \mathcal{L}_{inj} + \mathcal{L}_{acc} + \lambda_{fc}\, \mathcal{L}_{FC},$   (7)

where $\lambda_{fc}$ is the coefficient balancing $\mathcal{L}_{FC}$. After updating the whole CVT model on a mini-batch $\mathcal{B}_t \cup \mathcal{B}_{\mathcal{M}}$, the memory buffer $\mathcal{M}$ is updated with $\mathcal{B}_t$ by reservoir sampling [77, 9].
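Reservoir sampling [77] keeps every stream example in the bounded buffer with equal probability; a minimal sketch of such a buffer (an illustration, not the authors' implementation):

```python
import random

class ReservoirBuffer:
    """Fixed-capacity buffer in which every observed stream example has an equal
    chance of being retained."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []            # list of (x, y) pairs
        self.num_seen = 0         # total number of stream examples observed so far

    def update(self, batch):
        for example in batch:
            self.num_seen += 1
            if len(self.data) < self.capacity:
                self.data.append(example)
            else:
                j = random.randint(0, self.num_seen - 1)
                if j < self.capacity:          # replace with probability capacity / num_seen
                    self.data[j] = example

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```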

5 Experiment

Memory Buffer  Method  #Paras  10 splits (Task-free / Task-aware)  20 splits (Task-free / Task-aware)
SGD 11.2 5.77 37.44 3.53 36.81
500 ER [63] 11.2 15.59 59.72 12.51 62.72
GEM [49] 11.2 14.34 50.39 5.98 57.15
AGEM [14] 11.2 6.35 39.18 3.62 39.55
iCaRL [62] 11.2 15.18 48.95 12.79 60.53
FDR [8] 11.2 5.97 32.18 3.60 39.98
GSS [4] 11.2 10.91 59.10 6.33 62.80
DER++ [9] 11.2 15.72 54.45 11.29 63.62
HAL [13] 22.4 10.51 33.70 7.09 52.89
ERT [10] 11.2 16.28 60.11 17.92 68.08
ASER [67] 11.2 12.42 53.77 9.63 58.91
RM [6] 11.2 14.32 58.76 13.73 64.73
CLS [5] 33.8 15.06 59.82 14.84 65.74
CVT (ours) 8.9 24.45 62.52 21.81 71.23
1000 ER [63] 11.2 20.41 63.39 17.02 68.52
GEM [49] 11.2 16.49 52.30 8.40 62.59
AGEM [14] 11.2 6.57 40.38 3.74 42.39
iCaRL [62] 11.2 16.31 50.49 13.03 61.13
FDR [8] 11.2 6.58 36.99 3.72 42.45
GSS [4] 11.2 12.38 60.75 7.40 66.06
DER++ [9] 11.2 21.27 61.80 13.42 71.26
HAL [13] 22.4 11.81 39.67 13.14 60.03
ERT [10] 11.2 23.43 62.25 24.58 72.61
ASER [67] 11.2 14.38 58.91 12.79 62.47
RM [6] 11.2 22.41 61.82 18.91 67.30
CLS [5] 33.8 19.73 62.54 17.06 70.08
CVT (ours) 8.9 28.83 65.86 28.15 75.76
Table 1: Results (overall accuracy %) on the CIFAR100 benchmark, averaged over five runs. #Paras denotes the number of parameters in the model, in millions.

5.1 Experimental Setup and Implementation

We consider a strict evaluation setting [38, 73] for online continual learning, covering both the Task-aware protocol [56] and the Task-free protocol [39]. Under the Task-aware protocol, the task identities are available at evaluation time; under the Task-free protocol, task identities are unavailable at inference time.
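The two protocols differ only in how predictions are read out at inference time; a small hypothetical helper makes the distinction explicit (assuming, for the Task-aware case, that the list of class indices belonging to the oracle task is available):

```python
import torch

def predict(logits, protocol="task-free", task_classes=None):
    """logits: (N, num_seen_classes). Task-aware restricts the prediction to the
    classes of the known task; Task-free lets all observed classes compete."""
    if protocol == "task-aware":
        assert task_classes is not None                  # oracle task identity at test time
        masked = torch.full_like(logits, float('-inf'))
        masked[:, task_classes] = logits[:, task_classes]
        return masked.argmax(dim=1)
    return logits.argmax(dim=1)                          # task-free: all seen classes
```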

Datasets.

Online continual learning benchmarks evaluate the capacity of an algorithm to learn on non-independent and identically distributed (non-i.i.d.) data. CIFAR-100 [44] contains 100 classes, each with 500 training and 100 testing color images. TinyImageNet [70] consists of 200 classes, with 100,000 images for training and 10,000 images for testing. ImageNet100 [62] contains 100 classes randomly chosen from ILSVRC [19], including about 120,000 images for training and 5,000 images for validation.

Baselines. We compare CVT with state-of-the-art and well-established online CL baselines, including 11 rehearsal-based methods (ER [63], GEM [49], AGEM [14], GSS [4], FDR [8], HAL [13], ASER [67], ERT [10], RM [6], SCR [52], and CLS [5]) and two methods leveraging knowledge distillation (iCaRL [62] and DER++ [9]). Besides, we also compare vision transformers (ViT [21], LeViT [26], CoAT [88], and CCT [33]) equipped with a rehearsal strategy for continual learning. We also provide the results of simply performing SGD without any countermeasure to alleviate forgetting.

Memory Buffer  Method  #Paras  TinyImageNet (Task-free / Task-aware)  #Paras  ImageNet100 (Task-free / Task-aware)
SGD 11.2 4.54 26.25 11.2 3.98 19.05
500 ER [63] 11.2 9.71 42.76 11.2 9.88 32.38
AGEM [14] 11.2 4.63 27.86 11.2 3.38 21.80
iCaRL [62] 11.2 6.17 27.22 11.2 7.70 20.45
FDR [8] 11.2 5.19 28.23 11.2 3.34 19.24
DER++ [9] 11.2 9.56 40.52 11.2 10.30 29.20
ERT [10] 11.2 9.95 40.42 11.2 10.28 28.53
ASER [67] 11.2 9.22 41.09 11.2 9.75 31.71
RM [6] 11.2 8.39 41.63 11.2 8.53 28.30
SCR [52] 11.2 9.08 39.85 11.2 8.81 29.62
CVT (ours) 9.0 14.71 43.93 9.4 14.82 36.74
1000 ER [63] 11.2 12.46 45.50 11.2 10.42 34.26
AGEM [14] 11.2 4.92 28.38 11.2 3.66 23.56
iCaRL [62] 11.2 6.91 28.56 11.2 8.93 22.37
FDR [8] 11.2 5.27 28.94 11.2 3.58 21.28
DER++ [9] 11.2 12.97 47.21 11.2 13.94 40.02
ERT [10] 11.2 13.84 44.65 11.2 12.26 33.88
ASER [67] 11.2 12.26 46.02 11.2 11.38 35.76
RM [6] 11.2 11.73 45.89 11.2 11.85 32.72
SCR [52] 11.2 10.19 43.58 11.2 10.74 31.84
CVT (ours) 9.0 16.54 48.50 9.4 18.02 42.61
Table 2: Results (overall accuracy %) on TinyImageNet and ImageNet100, averaged over three runs. #Paras denotes the number of parameters in the model, in millions.

Metrics. We evaluate online CL methods in terms of accuracy and forgetting, following [12, 14, 9]. After learning the final task $T$, the accuracy is defined by $A_T = \frac{1}{T}\sum_{j=1}^{T} a_{T,j}$, and the forgetting is defined by $F_T = \frac{1}{T-1}\sum_{j=1}^{T-1} \left( \max_{i \in \{1, \dots, T-1\}} a_{i,j} - a_{T,j} \right)$, where $a_{i,j}$ is the inference accuracy on task $j$ when the model finished learning task $i$.
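Under the standard definitions assumed above, both metrics can be computed from the matrix of per-task accuracies; a short NumPy sketch:

```python
import numpy as np

def accuracy_and_forgetting(acc):
    """acc[i, j]: test accuracy on task j after finishing training on task i (T x T matrix)."""
    T = acc.shape[0]
    final_acc = acc[T - 1].mean()                                   # A_T: average over all tasks
    # forgetting: best earlier accuracy on each past task minus its final accuracy
    forgetting = (acc[:T - 1, :T - 1].max(axis=0) - acc[T - 1, :T - 1]).mean()
    return final_acc, forgetting
```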

Implementation Details.

In order to compare each method fairly, we train all networks using the stochastic gradient descent (SGD) optimizer. The training images are randomly cropped and flipped for each method, following [9, 10, 68]. We adopt 1 epoch with a mini-batch size of 10 for all datasets, following [62, 9, 10, 93]. The online continual learning baselines use ResNet18 [35] as the backbone and cross-entropy as the classification loss, following [14, 68, 6, 9, 13, 71]. The implementation of the transformer block is based on LeViT [26] and ViT [21]. The CVT framework employs GELU activation and dropout in the transformer blocks and applies global average pooling to the last activation map.

5.2 Comparison to State-of-the-Art Methods

Evaluation on CIFAR100. Following the setting proposed in [62, 89], we train all 100 classes in several splits of 10 and 20 incremental tasks. Table 1 summarizes the overall accuracy on CIFAR100 with 500 and 1000 memory sizes. CVT outperforms the other baselines by a considerable margin in the different incremental splits; e.g., CVT improves the accuracy of continual learning by more than 8% in the 10-split setting with a memory capacity of 500. The advantage of CVT is especially pronounced with a small memory, which indicates that CVT can effectively alleviate the imbalance issue in online CL. It is worth noting that although CVT uses fewer parameters (8.9M) than the other methods (11.2M-33.8M), it still achieves superior performance. One reason is that CVT inherits the merits of transformers for modeling the stream of tasks without stacking a large number of parameters; besides, the number of parameters for the proposed learnable focuses is extremely small.

Figure 4: Forgetting results (%) on CIFAR100 (lower is better).

Evaluation on ImageNet datasets. Table 2 summarizes the evaluation results on the TinyImageNet and ImageNet100 datasets with 10 splits. CVT consistently surpasses the other methods by a considerable margin in both the Task-free and Task-aware protocols on TinyImageNet and ImageNet100. Specifically, our method outperforms the state of the art by about 4.3% in Task-aware accuracy on the ImageNet100 benchmark. On the TinyImageNet benchmark, the Task-free accuracy is improved from 9.95% to 14.71% (+4.76%). Moreover, CVT uses fewer parameters than the other CNN-based methods.

Figure 5: Incremental performance evaluated on tasks observed so far. [↑] higher is better, [↓] lower is better (best seen in color).

Forgetting. To compare the capability of alleviating forgetting, we assess the average forgetting [9, 12], which measures the performance degradation on subsequent tasks. As shown in Fig. 4, CVT suffers from less forgetting than all the other baselines in both the Task-free and Task-aware settings with a memory buffer of 1000 on CIFAR100. This is because CVT utilizes the focal contrastive loss and dual-classifier loss, which improve the stability of the vision transformer network.

Incremental Performance. We also evaluate the average incremental performance [62, 9] under the Task-free protocol with a memory buffer of size 500, i.e., the result of evaluating on all the tasks observed so far after completing each task. As illustrated in Fig. 5, the results are curves of accuracy and forgetting after each task. During the learning process, most methods degrade rapidly as new tasks arrive, while our method consistently outperforms the state-of-the-art methods in both accuracy and forgetting.

5.3 Ablation Study and Analysis

Comparison to Transformer and CNN Backbone. We compare CVT to Vision Transformer networks (ViT [21], LeViT [26], CoAT [88], and CCT [33]) and the CNN benchmark ResNet18 [35] under the proposed rehearsal strategy in online continual learning. Table 3 reports the accuracy and forgetting on CIFAR100 and TinyImageNet with a memory of 500. We observe that ViT is not up to the task of online continual learning, since it is “data-hungry” and only fits large i.i.d. datasets. Besides, LeViT, CoAT, and CCT contain CNN structures to obtain inductive biases but still suffer from catastrophic forgetting in online CL. Overall, using Vision Transformers directly for online CL cannot consistently outperform CNN-based networks. Our proposed CVT architecture inherits the merits of both CNNs and transformers and thus works well on online streaming data while modeling long-range dependencies in the input. Moreover, CVT uses fewer parameters yet achieves better performance for online CL, which also benefits from the focal contrastive loss and the dual-classifier structure.

Effect of Each Component. To assess the effects of the components in CVT, we perform an ablation study in terms of accuracy and forgetting. From Table 3 we can observe that the proposed focal contrastive loss $\mathcal{L}_{FC}$ plays an important role in alleviating catastrophic forgetting and accumulating knowledge. However, if we simply use the supervised contrastive loss $\mathcal{L}_{SCL}$ in place of $\mathcal{L}_{FC}$, the forgetting problem is not mitigated compared to not using any contrastive loss. This is because using $\mathcal{L}_{SCL}$ directly causes a severe imbalance between new and past classes in online CL, which limits the learning of transferable representations, whereas $\mathcal{L}_{FC}$ overcomes this issue by utilizing the learnable focuses to boost the performance of online CL. This supports that $\mathcal{L}_{FC}$ rebalances contrastive learning between new and past classes and improves inter-class distinction and intra-class aggregation. Besides, the dual-classifier loss obtains 2.37% and 7.97% gains in terms of accuracy and forgetting on CIFAR100, respectively. The results in Table 3 demonstrate the effectiveness of each component of CVT.

Method  #Paras  CIFAR100 (Accuracy[↑] / Forgetting[↓])  #Paras  TinyImageNet (Accuracy[↑] / Forgetting[↓])
ViT [21] 16.2 8.48 35.56 16.3 7.91 42.26
LeViT [26] 10.9 14.55 44.53 12.1 9.02 41.41
CoAT [88] 10.3 13.17 47.64 11.3 8.78 39.80
CCT [33] 4.3 13.86 51.06 4.4 9.97 40.21
ResNet18 [35] 11.2 15.72 43.82 11.2 9.56 42.13
ResNet18 + $\mathcal{L}_{FC}$ 11.3 17.49 38.73 11.3 10.17 36.49
ResNet18 + dual-classifier 11.2 18.84 35.38 11.2 10.91 33.55
CVT - $\mathcal{L}_{FC}$ 8.8 19.92 26.16 8.9 12.47 19.43
CVT - $\mathcal{L}_{FC}$ + $\mathcal{L}_{SCL}$ 8.9 20.73 25.52 9.0 12.59 20.47
CVT - dual-classifier 8.9 22.08 29.81 9.0 13.63 18.95
CVT (ours) 8.9 24.45 21.86 9.0 14.71 16.32
Table 3: Ablation study on the backbone and each component of CVT. “-” indicates removing a component; “+” indicates adding a component.

6 Conclusion

In this paper, we propose a novel attention-based framework, Contrastive Vision Transformer (CVT), to effectively mitigate catastrophic forgetting in online CL. To the best of our knowledge, this is the first work in the literature to design a Transformer for online CL. CVT contains external attention and learnable focuses to accumulate previous knowledge and maintain class-specific information. With the proposed focal contrastive loss in training, CVT rebalances contrastive continual learning between new and past classes and improves inter-class distinction and intra-class aggregation. Moreover, CVT adopts a dual-classifier structure to decouple learning the current classes from balancing all seen classes. Extensive experimental results show that our approach significantly outperforms current state-of-the-art methods with fewer parameters. The ablation study validates the effectiveness of each proposed component.

6.0.1 Acknowledgments

This work is supported by ARC FL-170100117, DP-180103424, IC-190100031, and LE-200100049.

References

  • [1] D. Abati, J. Tomczak, T. Blankevoort, S. Calderara, R. Cucchiara, and B. E. Bejnordi (2020) Conditional channel gated networks for task-aware continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3931–3940. Cited by: §2.1.
  • [2] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars (2018) Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154. Cited by: §1, §2.1.
  • [3] R. Aljundi, P. Chakravarty, and T. Tuytelaars (2017) Expert gate: lifelong learning with a network of experts. In CVPR, pp. 3366–3375. Cited by: §1.
  • [4] R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019) Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, Cited by: §1, §2.2, §5.1, Table 1.
  • [5] E. Arani, F. Sarfraz, and B. Zonooz (2022) Learning fast, learning slow: a general continual learning method based on complementary learning system. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §5.1, Table 1.
  • [6] J. Bang, H. Kim, Y. Yoo, J. Ha, and J. Choi (2021) Rainbow memory: continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8218–8227. Cited by: §2.1, §5.1, §5.1, Table 1, Table 2.
  • [7] Y. Bengio, I. Goodfellow, and A. Courville (2017) Deep learning. Vol. 1, MIT press Massachusetts, USA:. Cited by: §1.
  • [8] A. S. Benjamin, D. Rolnick, and K. P. Kording (2019) Measuring and regularizing networks in function space. International Conference on Learning Representations. Cited by: §1, §2.1, §5.1, Table 1, Table 2.
  • [9] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara (2020) Dark Experience for General Continual Learning: a Strong, Simple Baseline. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.1, §4.2.2, §5.1, §5.1, §5.1, §5.2, §5.2, Table 1, Table 2.
  • [10] P. Buzzega, M. Boschini, A. Porrello, and S. Calderara (2021) Rethinking experience replay: a bag of tricks for continual learning. In International Conference on Pattern Recognition (ICPR), pp. 2180–2187. Cited by: §2.2, §5.1, §5.1, Table 1, Table 2.
  • [11] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, pp. 9912–9924. Cited by: §1.
  • [12] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: §1, §2.1, §5.1, §5.2.
  • [13] A. Chaudhry, A. Gordo, P. K. Dokania, P. H. Torr, and D. Lopez-Paz (2021) Using hindsight to anchor past knowledge in continual learning. In AAAI, Cited by: §1, §2.1, §5.1, §5.1, Table 1.
  • [14] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019) Efficient Lifelong Learning with A-GEM. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §5.1, §5.1, §5.1, Table 1, Table 2.
  • [15] S. Chen, L. Wang, Z. Wang, Y. Yan, D. Wang, and S. Zhu (2022) Learning meta-adversarial features via multi-stage adaptation network for robust visual object tracking. Neurocomputing 491, pp. 365–381. External Links: ISSN 0925-2312 Cited by: §2.1.
  • [16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. Cited by: §1, §3.0.2.
  • [17] Z. Dai, H. Liu, Q. V. Le, and M. Tan (2021) CoAtNet: marrying convolution and attention for all data sizes. In Advances in Neural Information Processing Systems, Cited by: §1, §2.3.
  • [18] M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §1, §1.
  • [19] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §5.1.
  • [20] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.3.
  • [21] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §1, §2.3, §4.1.1, §5.1, §5.1, §5.3, Table 3.
  • [22] A. Douillard, M. Cord, C. Ollion, T. Robert, and E. Valle (2020) Podnet: pooled outputs distillation for small-tasks incremental learning. In European Conference on Computer Vision (ECCV), pp. 86–102. Cited by: §2.1.
  • [23] A. Douillard, A. Ramé, G. Couairon, and M. Cord (2022) DyTox: transformers for continual learning with dynamic token expansion. Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.3.
  • [24] Y. Duan, Z. Wang, J. Wang, Y. Wang, and C. Lin (2022) Position-aware image captioning with spatial relation. Neurocomputing 497, pp. 28–38. External Links: ISSN 0925-2312 Cited by: §2.3.
  • [25] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 244–253. Cited by: §2.3.
  • [26] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and M. Douze (2021-10) LeViT: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12259–12269. Cited by: §1, §2.3, §4.1.1, §5.1, §5.1, §5.3, Table 3.
  • [27] S. Grossberg (2013) Adaptive resonance theory: how a brain learns to consciously attend, learn, and recognize a changing world. Neural networks 37, pp. 1–47. Cited by: §1.
  • [28] B. Gunel, J. Du, A. Conneau, and V. Stoyanov (2020) Supervised contrastive learning for pre-trained language model fine-tuning. arXiv preprint arXiv:2011.01403. Cited by: §1, §3.0.2.
  • [29] J. Guo, M. Gong, T. Liu, K. Zhang, and D. Tao (2020) LTF: a label transformation framework for correcting label shift. In ICML, Vol. 119, pp. 3843–3853. Cited by: §1.
  • [30] J. Guo, J. Li, H. Fu, M. Gong, K. Zhang, and D. Tao (2022) Alleviating semantics distortion in unsupervised low-level image-to-image translation via structure consistency constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18249–18259. Cited by: §1.
  • [31] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1.
  • [32] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao (2020) A survey on visual transformer. ArXiv abs/2012.12556. Cited by: §2.3.
  • [33] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi (2021) Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704. Cited by: §2.3, §5.1, §5.3, Table 3.
  • [34] J. He, R. Mao, Z. Shao, and F. Zhu (2020) Incremental learning in online scenario. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13926–13935. Cited by: §1, §2.2.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5.1, §5.3, Table 3.
  • [36] G. Hinton, O. Vinyals, and J. Dean (2014) Distilling the Knowledge in a Neural Network. In NeurIPS workshop, Cited by: §2.1.
  • [37] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin (2019) Learning a unified classifier incrementally via rebalancing. In CVPR, pp. 831–839. Cited by: §1, §2.1.
  • [38] Y. Hsu, Y. Liu, A. Ramasamy, and Z. Kira (2018) Re-evaluating Continual Learning Scenarios: A Categorization and Case for Strong Baselines. In NeurIPS Continual learning Workshop, Cited by: §1, §5.1.
  • [39] X. Jin, A. Sadhu, J. Du, and X. Ren (2021) Gradient-based editing of memory examples for online task-free continual learning. Advances in Neural Information Processing Systems 34. Cited by: §5.1.
  • [40] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169. Cited by: §2.3.
  • [41] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. Advances in Neural Information Processing Systems 33. Cited by: §1, §3.0.2.
  • [42] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §1, §2.1.
  • [43] Y. Kong, L. Liu, J. Wang, and D. Tao (2021) Adaptive curriculum learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5067–5076. Cited by: §2.1.
  • [44] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.1.
  • [45] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong (2019) Learn to grow: a continual structure learning framework for overcoming catastrophic forgetting. In International Conference on Machine Learning (ICML), Cited by: §2.1.
  • [46] Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (12). Cited by: §1.
  • [47] X. Liu, X. Wang, and S. Matwin (2018) Improving the interpretability of deep neural networks with knowledge distillation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Cited by: §2.1.
  • [48] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022. Cited by: §2.3.
  • [49] D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1, §2.2, §2.2, §5.1, Table 1.
  • [50] V. Losing, B. Hammer, and H. Wersing (2018) Incremental on-line learning: a review and comparison of state of the art algorithms. Neurocomputing 275, pp. 1261–1274. Cited by: §1, §2.2.
  • [51] T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §2.3.
  • [52] Z. Mai, R. Li, H. Kim, and S. Sanner (2021) Supervised contrastive replay: revisiting the nearest class mean classifier in online class-incremental continual learning. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3584–3594. Cited by: §2.1, §2.2, §5.1, Table 2.
  • [53] A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [54] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. G. H. Bower (Ed.), Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • [55] M. Mermillod, A. Bugaiska, and P. Bonin (2013) The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology 4, pp. 504. Cited by: §1.
  • [56] Q. Pham, C. Liu, and S. Hoi (2021) DualNet: continual learning, fast and slow. Advances in Neural Information Processing Systems 34. Cited by: §5.1.
  • [57] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §1.
  • [58] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §2.3.
  • [59] J. Rajasegaran, M. Hayat, S. H. Khan, F. S. Khan, and L. Shao (2019) Random path selection for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1.
  • [60] A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars (2017) Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1320–1328. Cited by: §2.1.
  • [61] R. Ratcliff (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.. Psychological review 97 (2), pp. 285. Cited by: §1.
  • [62] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.1, §5.1, §5.1, §5.1, §5.2, §5.2, Table 1, Table 2.
  • [63] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro (2019) Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.2, §5.1, Table 1, Table 2.
  • [64] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & Compress: A scalable framework for continual learning. In International Conference on Machine Learning, Cited by: §1, §2.1.
  • [65] J. Serra, D. Suris, M. Miron, and A. Karatzoglou (2018) Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, Vol. 80, pp. 4548–4557. Cited by: §1.
  • [66] Y. Shi, L. Yuan, Y. Chen, and J. Feng (2021) Continual learning via bit-level information preserving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16674–16683. Cited by: §2.1.
  • [67] D. Shim, Z. Mai, J. Jeong, S. Sanner, H. Kim, and J. Jang (2021) Online class-incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 9630–9638. Cited by: §1, §2.2, §2.2, §5.1, Table 1, Table 2.
  • [68] C. Simon, P. Koniusz, and M. Harandi (2021) On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1591–1600. Cited by: §2.1, §5.1.
  • [69] P. Singh, P. Mazumder, P. Rai, and V. P. Namboodiri (2021) Rectification-based knowledge retention for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15282–15291. Cited by: §1, §2.1.
  • [70] Stanford (2015) Tiny ImageNet Challenge (CS231n). Note: http://tiny-imagenet.herokuapp.com/ Cited by: §5.1.
  • [71] S. Tang, D. Chen, J. Zhu, S. Yu, and W. Ouyang (2021) Layerwise optimization by gradient decomposition for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9634–9643. Cited by: §5.1.
  • [72] Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.
  • [73] G. M. van de Ven and A. S. Tolias (2018) Three continual learning scenarios. NeurIPS Continual Learning Workshop. Cited by: §1, §5.1.
  • [74] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens (2021) Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12894–12904. Cited by: §2.3.
  • [75] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.3, §4.1.1.
  • [76] E. Verwimp, M. De Lange, and T. Tuytelaars (2021) Rehearsal revealed: The limits and merits of revisiting samples in continual learning. In ICCV, Cited by: §1.
  • [77] J. S. Vitter (1985) Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 11 (1), pp. 37–57. Cited by: §4.2.2.
  • [78] Z. Wang, Y. Duan, L. Liu, and D. Tao (2021) Multi-label few-shot learning with semantic inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 15917–15918. Cited by: §2.2.
  • [79] Z. Wang, L. Liu, Y. Duan, Y. Kong, and D. Tao (2022-06) Continual learning with lifelong vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 171–181. Cited by: §1, §2.1, §4.1.1.
  • [80] Z. Wang, L. Liu, Y. Duan, and D. Tao (2021) Continual learning with embeddings: algorithm and analysis. In ICML 2021 Workshop on Theory and Foundation of Continual Learning, Cited by: §1.
  • [81] Z. Wang, L. Liu, Y. Duan, and D. Tao (2022) Continual learning through retrieval and imagination. Proceedings of the AAAI Conference on Artificial Intelligence 36 (8), pp. 8594–8602. Cited by: §1, §2.1.
  • [82] Z. Wang, L. Liu, Y. Duan, and D. Tao (2022) SIN: semantic inference network for few-shot streaming label learning. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14. Cited by: §2.1.
  • [83] Z. Wang, L. Liu, and D. Tao (2020) Deep streaming label learning. In International Conference on Machine Learning (ICML), Vol. 119, pp. 9963–9972. Cited by: §2.1.
  • [84] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang (2021-10) CvT: introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22–31. Cited by: §1, §2.3.
  • [85] H. Xi, D. Aussel, W. Liu, S. T. Waller, and D. Rey (2022) Single-leader multi-follower games for the regulation of two-sided mobility-as-a-service markets. European Journal of Operational Research. Cited by: §1.
  • [86] H. Xi, L. He, Y. Zhang, and Z. Wang (2020) Bounding the efficiency gain of differentiable road pricing for evs and gvs to manage congestion and emissions. PloS one 15 (7), pp. e0234204. Cited by: §2.1.
  • [87] H. Xi, W. Liu, D. Rey, S. T. Waller, and P. Kilby (2020) Incentive-compatible mechanisms for online resource allocation in mobility-as-a-service systems. arXiv preprint arXiv:2009.06806. Cited by: §2.2.
  • [88] W. Xu, Y. Xu, T. Chang, and Z. Tu (2021-10) Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9981–9990. Cited by: §2.3, §5.1, §5.3, Table 3.
  • [89] S. Yan, J. Xie, and X. He (2021) DER: dynamically expandable representation for class incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §5.2.
  • [90] L. Yu, B. Twardowski, X. Liu, L. Herranz, K. Wang, Y. Cheng, S. Jui, and J. v. d. Weijer (2020) Semantic drift compensation for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6982–6991. Cited by: §1.
  • [91] P. Yu, Y. Chen, Y. Jin, and Z. Liu (2021) Improving vision transformers for incremental learning. arXiv preprint arXiv:2112.06103. Cited by: §2.3.
  • [92] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E.H. Tay, J. Feng, and S. Yan (2021-10) Tokens-to-token vit: training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 558–567. Cited by: §1, §2.3.
  • [93] F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), Cited by: §1, §2.1, §5.1.
  • [94] X. Zhao, J. Zhu, B. Luo, and Y. Gao (2021) Survey on facial expression recognition: history, applications, and challenges. IEEE MultiMedia 28 (4), pp. 38–44. Cited by: §2.3.
  • [95] J. Zhu, B. Luo, S. Zhao, S. Ying, X. Zhao, and Y. Gao (2020) IExpressNet: facial expression recognition with incremental classes. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2899–2908. Cited by: §1, §1, §2.1.
  • [96] J. Zhu, Y. Wei, Y. Feng, X. Zhao, and Y. Gao (2019) Physiological signals-based emotion recognition via high-order correlation learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15 (3s), pp. 1–18. Cited by: §2.2.