One of the major challenges in research on artificial neural networks is developing the ability to accumulate knowledge over time from a non-stationary data stream [7, 29, 38]. Although most successful deep learning techniques achieve excellent results on pre-collected, fixed datasets, they are incapable of adapting their behavior to a non-stationary environment over time [18, 73]. When streaming data arrives continuously, training on the new data can severely interfere with the model's previously learned knowledge of past data, resulting in a drastic drop in performance on previous tasks. This phenomenon is known as catastrophic forgetting or catastrophic interference [54, 61]. Continual learning (also known as lifelong learning or incremental learning) [62, 63, 79, 73, 80, 95, 18] aims to solve this problem by maintaining and accumulating the acquired knowledge over time from a stream of non-stationary data.
Continual learning requires neural networks to be both stable, to prevent forgetting, and plastic, to learn new concepts; this is referred to as the stability-plasticity dilemma [55, 27]. Early works focused on the Task-aware protocol, which selects the corresponding classifier using oracle knowledge of the task identity at inference time [2, 46, 65, 69, 18]. For example, regularization-based methods penalize changes to important parameters when learning new tasks and typically assign a separate classifier to each task [12, 42, 64, 85, 93]. Recent studies have focused on the more practical Task-free protocol, which evaluates the network on all classes observed during training without requiring the task identity [3, 81, 8, 95, 37, 90, 13]. Among them, rehearsal-based methods that store a small set of seen examples in a limited memory for replay have demonstrated promising results [4, 62, 9, 76]. This paper focuses on a more realistic and challenging setting: online continual learning (online CL) [34, 50, 67], where the model learns a sequence of classification tasks with a single pass over the data and without task boundaries. Both the Task-free and Task-aware protocols also exist in online CL.
Inspired by recent breakthroughs in self-supervised learning [16, 11, 30, 72], we find that the knowledge learned by supervised contrastive learning [31, 41, 28] exhibits greater robustness and transferability. Such general and transferable knowledge is exactly what online CL seeks, since it can effectively help mitigate forgetting. Unfortunately, employing contrastive learning in the continual setting is challenging, for two main reasons: 1) contrastive learning requires informative negative samples to learn to distinguish different clusters, but in online CL, previous-task data is unavailable or very limited, which causes severely imbalanced contrast between past and new classes; 2) the contrastively learned knowledge may itself suffer from forgetting, since the distribution of the data stream is continually changing.
Considering the superior modeling capability that Vision Transformers have recently demonstrated on computer vision tasks [57, 84, 26, 92, 17], we leverage the potential of the attention mechanism for online CL. Overall, we strategically integrate contrastive learning and the transformer to model the online data stream. We propose a novel framework, Contrastive Vision Transformer (CVT), to alleviate the forgetting problem and tackle the above imbalance issue of contrastive learning in online CL. An overview of the framework is illustrated in Fig. 1. Specifically, we design an effective and efficient transformer architecture with external attention that implicitly captures previous tasks' information and reduces the number of parameters. CVT contains learnable focuses for each class, which accumulate the knowledge of previous classes to alleviate forgetting. Based on the learnable focuses, we design a focal contrastive loss at the attention level to rebalance contrastive learning between new and past classes and to improve inter-class distinction and intra-class aggregation. Moreover, CVT contains a dual-classifier structure: an injection classifier injects representations of stream data into the model while mitigating interference with previous knowledge, and an accumulation classifier integrates previous and new knowledge in a balanced manner.
We systematically compare state-of-the-art and well-established methods for the online CL problem in both the Task-free and Task-aware protocols. Experimental results show that the CVT framework significantly outperforms other approaches in terms of accuracy and forgetting, even with fewer parameters. Ablation study validates each component of the proposed framework.
The main contributions of this paper are three-fold:
We propose a novel framework Contrastive Vision Transformer (CVT) to achieve a better stability-plasticity trade-off for online CL. CVT contains class-wise learnable focuses, which can accumulate the knowledge of previous classes to alleviate forgetting.
We design a focal contrastive loss to rebalance contrastive learning between new and past classes and learn more robust representations.
The extensive experimental results show that CVT achieves state-of-the-art performance with even fewer parameters on online continual learning benchmarks.
2 Related Works
2.1 Continual Learning Methods
Continual learning (CL) methods have been developed to alleviate catastrophic forgetting in neural networks. These methods can be divided into three main categories: expansion-based, regularization-based, and rehearsal-based methods. Expansion-based methods allocate distinct sets of parameters to distinct tasks, expanding the architecture as new tasks arrive. However, most expansion-based methods require the task identity during inference in order to select the corresponding set of parameters. Regularization-based methods limit the changes in important parameters during the learning of new tasks by estimating the importance of each network parameter for prior tasks [12, 42, 2, 64, 43, 66, 93]. These works differ in how they compute the importance of network parameters.
Rehearsal-based methods [8, 37, 81, 60, 15, 22, 49, 79, 68, 52] alleviate catastrophic forgetting by replaying a subset of past-task data stored in a limited buffer. iCaRL trains a nearest-class-mean classifier while limiting changes to the representation in later tasks through a self-distillation loss. In addition to replaying past experiences, HAL keeps predictions stable on some anchor points via an additional objective. IExpressNet introduces a representative expression memory together with a novel center-expression-distilled loss and shows satisfactory performance on facial expression recognition. DER++ combines rehearsal with a distillation loss [36, 83, 47] to retain past experience and obtains state-of-the-art performance. RM proposes an uncertainty-based sampling approach, using uncertainty and data augmentation to improve rehearsal. The method proposed in this paper belongs to the rehearsal-based category.
2.2 Online Continual Learning
Online continual learning (online CL) is a more realistic [50, 78, 87, 96, 67, 49] and difficult setup, where the model learns from a non-i.i.d. data stream online, without the help of task identifiers or task boundaries at either the training or inference stage. Online CL methods are mainly based on rehearsal.
Experience Replay (ER) employs reservoir sampling for memory management and jointly optimizes a network by mixing memory data with online stream data. ERT improves ER with balanced sampling and bias control. AGEM and GEM use episodic memory to compute past-task gradients that constrain the online update step. GSS presents a gradient-based sampling strategy that stores diversified data to learn more information. ASER builds on Shapley Value theory to improve memory-buffer update and sampling. CLS uses two extra models to maintain long-term and short-term semantic memories for knowledge consolidation. SCR uses a supervised contrastive loss for representation learning and employs the nearest class mean to classify.
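The reservoir sampling used by ER for memory management can be sketched in a few lines. This is an illustrative implementation, not ER's actual code; the class and method names are our own:

```python
import random

class ReservoirBuffer:
    """Fixed-size memory updated with reservoir sampling: after n examples
    have streamed past, each one is retained with probability capacity / n."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.num_seen = 0  # examples observed so far in the stream

    def add(self, example):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Replace a stored example with probability capacity / num_seen.
            j = random.randrange(self.num_seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, batch_size):
        # Memory batch to mix with the online stream batch during training.
        k = min(batch_size, len(self.data))
        return random.sample(self.data, k)
```

The appeal of reservoir sampling in the online setting is that it needs neither task boundaries nor the stream length in advance.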
2.3 Vision Transformers
The Transformer model was first applied to machine translation, and Transformers have since become the state-of-the-art models for most natural language processing tasks [20, 58, 24, 51, 25]. Attention modules are the core components of Transformers, aggregating information from the entire input sequence. Recently, Vision Transformer (ViT) was proposed to make the Transformer architecture scalable for image classification when enough data is available. Since then, much effort has been dedicated to improving Vision Transformers' data and model efficiency [92, 32, 91, 40]; an effective direction is to strategically integrate properties of convolution into the Transformer architecture [74, 91, 48, 84, 26, 94, 23, 17]. CoAT proposes a conv-attention module that realizes relative position embeddings with convolutions. LeViT builds a pyramid attention structure with pooling to learn convolution-like features. By leveraging sequence pooling and convolutions, CCT eliminates the need for class tokens and positional embeddings.
Nevertheless, current vision transformers may not be applicable to modeling the online data stream, and existing continual learning algorithms developed for CNNs may not be ideal for vision transformers as well. To this end, we propose a lightweight Contrastive Vision Transformer (CVT) with a focal contrastive loss for online continual learning and achieve better performance than other transformers and CNN baselines.
3.0.1 Problem Setup.
Formally, an online continual learning problem is split into a sequence of supervised learning tasks $\{1, \dots, T\}$, where $T$ is the total number of tasks. Let $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_T\}$ be the corresponding online data stream, where $\mathcal{D}_t$ is the dataset of task $t$. For task $t$, input samples $x$ and the corresponding ground-truth labels $y$ are drawn from the i.i.d. distribution $\mathcal{D}_t$. A mini-batch of training data $\mathcal{B}_t$ from $\mathcal{D}_t$ comes gradually in an online stream (each sample is seen only once). Besides, a limited memory buffer $\mathcal{M}$ saves a small set of training data of seen tasks. The model is trained on $\mathcal{B}_t \cup \mathcal{B}_{\mathcal{M}}$ at each iteration, where $\mathcal{B}_{\mathcal{M}}$ is a batch sampled from $\mathcal{M}$. At task $t$, the label space of the model is all observed classes $\bigcup_{i=1}^{t} \mathcal{Y}_i$, and the model is expected to predict well on all these classes at the inference stage.
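The protocol above can be sketched as a training-loop skeleton: a single pass over each task's data, with every optimization step taken on the union of a stream batch and a memory batch. All names below are illustrative, and the buffer-update policy (e.g., reservoir sampling) is left abstract behind a callback:

```python
import random

def online_training_stream(tasks, buffer, buffer_update, batch_size=10):
    """Yield joint batches B_t ∪ B_M for a single pass over a task sequence.

    tasks: list of lists of (x, y) pairs, one list per task dataset D_t.
    buffer: mutable list acting as the bounded memory M.
    buffer_update: callback deciding which stream examples to store in M.
    """
    for task_data in tasks:                  # tasks arrive sequentially, non-i.i.d.
        for i in range(0, len(task_data), batch_size):
            stream_batch = task_data[i:i + batch_size]   # each sample seen once
            mem_batch = random.sample(buffer, min(batch_size, len(buffer)))
            yield stream_batch + mem_batch               # train on B_t ∪ B_M here
            buffer_update(buffer, stream_batch)

# Note: the label space grows with the stream; at task t the model must
# predict over all classes observed so far.
```

A consumer would run one optimizer step per yielded batch; no task boundary information is needed anywhere in the loop.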
3.0.2 Supervised Contrastive Learning (SCL).
SCL [41, 28, 16] aims to push the representations of samples from different classes farther apart while tightly clustering representations of samples from the same class. Suppose that the classification model can be decomposed into two components: an encoder $f(\cdot)$ and a classifier $g(\cdot)$. The encoder maps an image sample $x$ to a vectorial embedding (representation) $z = f(x)$. The classifier maps the representation $z$ to a classification vector. Without training $g$, SCL focuses on training $f$ as follows: given a batch of samples $\{(x_k, y_k)\}_{k=1}^{b}$, SCL first generates an augmented batch $\{(\tilde{x}_k, \tilde{y}_k)\}_{k=1}^{2b}$ by making two random augmentations of each $x_k$, with $\tilde{y}_{2k-1} = \tilde{y}_{2k} = y_k$. The SCL loss takes the following form:
$$\mathcal{L}_{SCL} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{z_p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{z_a \in A(i)} \exp(z_i \cdot z_a / \tau)}, \qquad (1)$$
where $I$ represents the set of indices of the augmented batch; $A(i)$ is the set of representations of samples in the batch except for that of $z_i$; $P(i)$ is the set of representations of positive samples (i.e., samples with the same label) with respect to the anchor $z_i$; $\tau$ is a temperature hyperparameter; and $|P(i)|$ is the cardinality of $P(i)$.
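The SCL loss of Eq. 1 can be sketched in NumPy. The function name and the choice to skip anchors without any positive are assumptions of this sketch:

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log-sum-exp; -inf entries contribute zero mass.
    m = np.max(a, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))

def supervised_contrastive_loss(z, labels, tau=0.1):
    """SCL over a batch of embeddings z (n, d) with integer labels.

    For each anchor i, A(i) is every other sample in the batch and P(i) is
    the subset of A(i) sharing the anchor's label.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # compare on unit sphere
    sim = z @ z.T / tau
    n = len(labels)
    mask_self = ~np.eye(n, dtype=bool)
    logits = np.where(mask_self, sim, -np.inf)         # exclude the anchor itself
    log_prob = logits - _logsumexp(logits, axis=1)     # log-softmax over A(i)
    total, anchors = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if pos:                                        # anchors with >= 1 positive
            total += -np.mean(log_prob[i, pos])
            anchors += 1
    return total / max(anchors, 1)
```

The loss shrinks as same-class embeddings align and different-class embeddings separate, which is exactly the clustering behavior the text describes.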
Although SCL can learn transferable representations that help prevent forgetting in online CL, it faces new challenges: 1) previous-task data is unavailable or very limited due to the streaming fashion, which causes severely imbalanced contrast between past and new classes; 2) the contrastively learned knowledge may suffer from forgetting, since the distribution of the data stream is continually changing. For visualization clarity we use a 2D feature space. As illustrated in Fig. 2(a1) and Fig. 2(a2), the imbalanced data stream in online CL makes the representations of previous tasks drift and difficult for SCL to cluster accurately.
To alleviate the forgetting problem in online continual learning, we propose the Contrastive Vision Transformer (CVT) framework, which implements a new focal contrastive learning strategy on top of the transformer architecture. An overview of the framework is depicted in Fig. 1. CVT plays to the strengths of the attention mechanism in online CL through an effective transformer architecture with external attention. We tackle the imbalance issue of SCL in online CL by proposing a focal contrastive loss at the attention level. The learnable focuses in CVT can accumulate class-specific knowledge to alleviate the forgetting of previous tasks. Besides, a dual-classifier structure is used to decouple learning the current classes from balancing all seen classes, improving the stability-plasticity trade-off.
4.1 Model Architecture
Fig. 3 illustrates the CVT architecture. The major contributing components in the architecture include 1) external attention, which implicitly captures previous tasks’ information and reduces the number of parameters, and 2) learnable focuses, which could maintain and accumulate the knowledge of previous classes.
4.1.1 External Attention.
CVT plays to the strengths of the attention mechanism in online CL. Unlike the vanilla self-attention in vision transformers [21, 26], which derives the attention map by computing the similarity between self-queries $Q$ and self-keys $K$, we introduce an external attention mechanism that obtains the attention map by computing the affinities between the self-queries $Q$ and a learnable external key $K_e$ with an attention bias $B$, which implicitly injects previous-task information into the attention mechanism. Moreover, the proposed architecture reduces the number of parameters compared to self-attention.
Let the input tensor be $X \in \mathbb{R}^{N \times C}$. We apply linear transformations with weights $W_Q$ and $W_V$ to get the vanilla self-query $Q = X W_Q$ and self-value $V = X W_V$, respectively. We employ a linear layer to replace the input-dependent self-key, yielding a learnable external key $K_e$, and explicitly add a learnable attention bias $B$ to the attention maps. Consider $h$ attention heads, which uniformly split the inputs into segments $Q_i$, $V_i$, and $K_{e,i}$ for $i = 1, \dots, h$. The external attention mechanism computes the head-specific attention map and concatenates the multi-head attention as follows:
$$\mathrm{Attn}(X) = \mathrm{Concat}\left(A_1 V_1, \dots, A_h V_h\right), \quad A_i = \mathrm{Norm}\left(\frac{Q_i K_{e,i}^{\top}}{\sqrt{d}} + B_i\right), \qquad (2)$$
where $\mathrm{Norm}(\cdot)$ denotes batch normalization and $d$ is the dimension of the key.
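A minimal NumPy sketch of external attention, under two assumptions made for this sketch: the learnable external key holds one key vector per token position (so the affinity map stays square and can be applied to the self-values), and Norm() standardizes the affinities over the token axis:

```python
import numpy as np

def external_attention(x, w_q, w_v, k_ext, bias, num_heads):
    """Multi-head external attention (illustrative shapes, not the paper's code).

    x:     (n, c) input tokens
    w_q:   (c, c) self-query projection; w_v: (c, c) self-value projection
    k_ext: (num_heads, n, d) learnable external keys, d = c // num_heads
    bias:  (num_heads, n, n) learnable attention bias
    """
    n, c = x.shape
    d = c // num_heads
    q = (x @ w_q).reshape(n, num_heads, d).transpose(1, 0, 2)  # (h, n, d)
    v = (x @ w_v).reshape(n, num_heads, d).transpose(1, 0, 2)  # (h, n, d)
    heads = []
    for h in range(num_heads):
        # Affinities between self-queries and the learnable external key,
        # shifted by the learnable attention bias.
        affinity = q[h] @ k_ext[h].T / np.sqrt(d) + bias[h]    # (n, n)
        # Norm(): standardize affinities (batch-norm stand-in for the sketch)
        a = (affinity - affinity.mean(axis=0)) / (affinity.std(axis=0) + 1e-5)
        heads.append(a @ v[h])                                 # (n, d)
    return np.concatenate(heads, axis=1)                       # (n, c)
```

Because `k_ext` and `bias` are parameters rather than functions of the input, they persist across tasks, which is how previous-task information can be retained implicitly.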
4.1.2 Learnable Focuses.
CVT contains learnable focuses for each class, which maintain and accumulate the knowledge of previous classes to alleviate forgetting in online CL. The class-wise learnable focuses $F = \{f_1, \dots, f_C\}$ are a set of learnable attention vectors, as shown in Fig. 3, where focus $f_c$ corresponds to class $c$ and $C$ is the number of seen classes. The size of the learnable focuses is negligible in relation to the overall model.
When a new class $c$ appears in the data stream, the corresponding focus $f_c$ starts to participate in the training (refer to Eq. 3). Even if class $c$ no longer appears in the data stream afterward, the focus $f_c$ keeps participating in the online CL training and serves as a negative sample for the other classes, as illustrated in Fig. 2(b1) and Fig. 2(b2). Thus, the focuses preserve and accumulate the previously learned class-specific knowledge and act as a forgetting-mitigation mechanism in online continual learning.
4.2 Focal Contrastive Continual Learning
We propose a rehearsal-based focal contrastive learning scheme to 1) tackle the imbalance issue of SCL in online CL and 2) accumulate class-specific knowledge, thereby alleviating interference with previous tasks. The learning scheme includes two losses, a focal contrastive loss and a dual-classifier loss, as follows.
4.2.1 Focal Contrastive Loss.
For learning representations continually, we propose a focal contrastive (FC) loss in online CL. As mentioned in Sec. 3, during the training phase, the model observes one mini-batch $\mathcal{B}_t$ at a time, sampled from task $t$ in the data stream. An input batch for the model is composed of $\mathcal{B}_t$ and $\mathcal{B}_{\mathcal{M}}$ sampled from the memory buffer $\mathcal{M}$. The input batch and its augmented view are encoded by CVT blocks to generate attention representations $z$, as shown in Fig. 3. As mentioned previously, a set of class-wise learnable focuses $F = \{f_1, \dots, f_C\}$ is utilized by the focal contrastive loss, where focus $f_c$ is a learnable attention vector for class $c$ and $C$ is the number of seen classes. The FC loss function is defined as:
$$\mathcal{L}_{FC} = \sum_{i \in I} \frac{-1}{|P(i)| + 1} \sum_{z_p \in P(i) \cup \{f_{y_i}\}} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{z_a \in A(i)} \exp(z_i \cdot z_a / \tau) + \lambda \sum_{c=1}^{C} \exp(z_i \cdot f_c / \tau)}, \qquad (3)$$
where $f_{y_i}$ is the focus of the anchor's class and $\lambda$ is the weight of the focuses. We set $\lambda > 1$ to make the focuses play a more important role in contrastive learning; $P(i)$ and $A(i)$ are the same as in supervised contrastive learning in Eq. 1; $\tau$ is a temperature hyperparameter.
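One way to realize such a loss is sketched below in NumPy. The exact role of the focuses here, own-class focus as an extra positive and all focuses entering the denominator with weight λ, is an assumption of this sketch rather than a definitive reproduction of the paper's equation:

```python
import numpy as np

def focal_contrastive_loss(z, labels, focuses, lam=2.0, tau=0.1):
    """Illustrative focal contrastive loss: the focus of the anchor's own
    class joins its positives, and every focus enters the denominator with
    weight lam (> 1), so focuses of absent past classes keep acting as
    contrastive terms even when no past-class samples are in the batch.

    z: (n, d) embeddings; labels: (n,) int class ids; focuses: (C, d).
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    f = focuses / np.linalg.norm(focuses, axis=1, keepdims=True)
    sim_zz = np.exp(z @ z.T / tau)      # sample-sample similarity terms
    sim_zf = np.exp(z @ f.T / tau)      # sample-focus similarity terms
    n = len(labels)
    loss = 0.0
    for i in range(n):
        # A(i) plus the lam-weighted focuses form the denominator.
        denom = sim_zz[i].sum() - sim_zz[i, i] + lam * sim_zf[i].sum()
        pos = [sim_zz[i, j] for j in range(n)
               if j != i and labels[j] == labels[i]]
        pos.append(sim_zf[i, labels[i]])  # own-class focus as extra positive
        loss += -np.mean(np.log(np.array(pos) / denom + 1e-12))
    return loss / n
```

Under this form the loss is lower when each embedding sits near its own class focus, which matches the prototype-like role the text assigns to the focuses.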
The benefits of using the focal contrastive loss are two-fold. First, it alleviates the imbalance issue in online CL by employing the class-wise focuses. Second, it accumulates previous knowledge via the learnable focuses, which continually serve as class prototypes to maintain class-specific information. Trained with the proposed focal contrastive loss, CVT rebalances contrastive learning between new and past classes and improves inter-class distinction and intra-class aggregation, as illustrated in Fig. 2. In Sec. 5.3, we empirically observe that $\mathcal{L}_{FC}$ outperforms the original $\mathcal{L}_{SCL}$ and boosts online CL.
4.2.2 Dual-classifier Loss.
We propose a dual-classifier structure to decouple learning the current classes from balancing all seen classes. It contains an injection classifier, which injects new-task representations into the model while alleviating interference with previously learned knowledge, and an accumulation classifier, which integrates past and new knowledge in a balanced manner.
Let $z$ be the representation of a sample output by the Projection of CVT before the classifier. When a new data-stream batch $\mathcal{B}_t$ arrives, we utilize the output of an independent injection classifier to compute a classification loss:
$$\mathcal{L}_{inj} = \mathbb{E}_{(x, y) \sim \mathcal{B}_t} \left[ \ell_{ce}(g_{inj}(z), y) \right], \qquad (4)$$
where $g_{inj}$ denotes the injection classifier and $\ell_{ce}$ is the cross-entropy loss. $g_{inj}$ is only trained on stream data and does not participate in the inference stage.
Besides, we employ an accumulation classifier to improve the stability-plasticity trade-off by integrating previous and new knowledge in a balanced manner. The accumulation classifier is used at the inference stage to output predictions. Rehearsing the limited memory data while learning new tasks is a crucial way to maintain previous knowledge: we replay the exemplars stored in the memory buffer with their ground-truth labels. In addition, the accumulation classifier also needs a supervised signal from the current task data. Therefore, we define the accumulation classifier loss as:
$$\mathcal{L}_{acc} = \alpha \, \mathbb{E}_{(x, y) \sim \mathcal{B}_t} \left[ \ell_{ce}(g_{acc}(z), y) \right] + \beta \, \mathbb{E}_{(x, y) \sim \mathcal{M}} \left[ \ell_{ce}(g_{acc}(z), y) \right], \qquad (5)$$
where $g_{acc}$ denotes the accumulation classifier, and $\alpha$ and $\beta$ are the coefficients balancing knowledge consolidation. We approximate the second expectation by computing gradients on batches sampled from the memory buffer.
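A minimal NumPy sketch of the two classifier losses, assuming linear classifier heads on top of the representation z; the head matrices, function names, and default coefficients are illustrative:

```python
import numpy as np

def softmax_ce(logits, y):
    """Mean cross-entropy of integer labels y under a row-wise softmax."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

def dual_classifier_losses(z_stream, y_stream, z_mem, y_mem,
                           w_inj, w_acc, alpha=1.0, beta=1.0):
    """Sketch of the dual-classifier objective: the injection head (w_inj)
    sees only the stream batch and is dropped at inference; the accumulation
    head (w_acc) balances stream and memory data with alpha and beta.
    """
    l_inj = softmax_ce(z_stream @ w_inj, y_stream)            # stream only
    l_acc = (alpha * softmax_ce(z_stream @ w_acc, y_stream)   # new knowledge
             + beta * softmax_ce(z_mem @ w_acc, y_mem))       # replayed knowledge
    return l_inj, l_acc
```

Keeping the two heads separate is the point of the design: gradients from the raw stream hit only the injection head, so the accumulation head used at inference is updated only through the balanced objective.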
[Table 1 (header): Method | Params | 10 splits | 20 splits]
5.1 Experimental Setup and Implementation
We consider a strict evaluation setting [38, 73] for online continual learning, including the Task-aware and Task-free protocols. For the Task-aware protocol, the task identities are provided at each evaluation. For the Task-free protocol, the task identities are unavailable at inference time.
Online continual learning benchmarks evaluate the capacity of an algorithm to learn from non-independent and identically distributed (non-i.i.d.) data. CIFAR-100 contains 100 classes, each with 500 training and 100 testing color images. TinyImageNet consists of 200 classes, with 100,000 images for training and 10,000 images for testing. ImageNet100 contains 100 classes randomly chosen from ILSVRC, including about 120,000 images for training and 5,000 images for validation.
Baselines. We compare CVT with state-of-the-art and well-established online CL baselines, including 11 rehearsal-based methods (ER, GEM, AGEM, GSS, FDR, HAL, ASER, ERT, RM, SCR, and CLS) and 2 methods leveraging knowledge distillation (iCaRL and DER++). Besides, we also compare vision transformers (ViT, LeViT, CoAT, and CCT) equipped with a rehearsal strategy for continual learning. We additionally report plain SGD without any countermeasure against forgetting.
Metrics. We evaluate online CL methods in terms of accuracy and forgetting, following [12, 14, 9]. The accuracy is defined by $A_T = \frac{1}{T} \sum_{i=1}^{T} a_{T,i}$, and the forgetting is defined by $F_T = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( \max_{t \in \{1, \dots, T-1\}} a_{t,i} - a_{T,i} \right)$, where $a_{t,i}$ is the inference accuracy on task $i$ when the model finished learning task $t$.
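Both metrics can be computed from a matrix of per-task evaluations; a minimal sketch, assuming acc[t][i] stores the accuracy on task i measured after training on task t (zero where undefined):

```python
import numpy as np

def average_accuracy(acc):
    """Mean final accuracy over all tasks: A_T = (1/T) * sum_i a_{T,i}."""
    T = len(acc)
    return float(np.mean([acc[T - 1][i] for i in range(T)]))

def average_forgetting(acc):
    """Mean drop from each task's best past accuracy to its final accuracy.

    The max runs over checkpoints t >= i, since a_{t,i} is only measured
    once task i has been trained on.
    """
    T = len(acc)
    drops = []
    for i in range(T - 1):
        best = max(acc[t][i] for t in range(i, T - 1))
        drops.append(best - acc[T - 1][i])
    return float(np.mean(drops))
```

For example, if a task was once classified at 0.9 accuracy but ends at 0.5 after later tasks, it contributes a drop of 0.4 to the forgetting average.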
To compare each method fairly, we train all networks using the stochastic gradient descent (SGD) optimizer. The images used for training are randomly cropped and flipped for each method, following [9, 10, 68]. We adopt 1 epoch with a mini-batch size of 10 for all datasets, following [62, 9, 10, 93]. Online continual learning baselines use ResNet18 as the backbone and cross-entropy as the classification loss, following [14, 68, 6, 9, 13, 71]. The implementation of the transformer block is based on LeViT and ViT. The CVT framework employs GELU activation and dropout in the transformer blocks and applies global average pooling to the last activation map.
5.2 Comparison to State-of-the-Art Methods
Evaluation on CIFAR100. Following the setting proposed in [62, 89], we train on all 100 classes in several splits, including 10 and 20 incremental tasks. Table 1 summarizes the overall accuracy on CIFAR100 with memory sizes of 500 and 1000. CVT outperforms other baselines by a considerable margin across incremental splits; e.g., CVT improves the accuracy of continual learning by more than 8% in the 10-split setting with 500 memory capacity. The advantage of CVT is especially obvious with small memory, which indicates that CVT can effectively alleviate the imbalance issue in online CL. It is worth noting that although CVT uses fewer parameters (8.9M) than other methods (11.2M to 33.8M), it still achieves superior performance. One reason is that CVT inherits the merits of transformers for modeling the stream of tasks without stacking a lot of parameters; besides, the number of parameters for the proposed learnable focuses is extremely small.
Evaluation on ImageNet datasets. Table 2 summarizes the evaluation results for the TinyImageNet and ImageNet100 datasets with 10 splits. CVT consistently surpasses other methods by a considerable margin in both the Task-free and Task-aware settings on TinyImageNet and ImageNet100. Specifically, our method outperforms the state of the art by about 4.3% in Task-aware accuracy on the ImageNet100 benchmark. On the TinyImageNet benchmark, the Task-free accuracy is improved from 9.95% to 14.71% (+4.76%). Moreover, CVT takes fewer parameters compared to other CNN-based methods.
In terms of forgetting, CVT suffers less forgetting than all the other baselines in both the Task-free and Task-aware settings with a memory buffer of 1000 on CIFAR100. This is because CVT utilizes the focal contrastive loss and the dual-classifier loss, which improve the stability of the vision transformer network.
Incremental Performance. We also evaluate the average incremental performance [62, 9] under the Task-free protocol with a 500 memory buffer, i.e., the result of evaluating on all tasks observed so far after completing each task. Fig. 5 plots the accuracy and forgetting curves after each task. During the learning process, most methods degrade rapidly as new tasks arrive, while our method consistently outperforms the state-of-the-art methods in both accuracy and forgetting.
5.3 Ablation Study and Analysis
Comparison to Transformer and CNN Backbones. We compare CVT to Vision Transformer networks (ViT, LeViT, CoAT, and CCT) and the CNN baseline ResNet18 under the proposed rehearsal strategy in online continual learning. Table 3 reports accuracy and forgetting on CIFAR100 and TinyImageNet with 500 memory. We observe that ViT is not up to the task of online continual learning, since it is "data-hungry" and only fits large i.i.d. datasets. LeViT, CoAT, and CCT contain CNN structures to obtain inductive biases, but still suffer from catastrophic forgetting in online CL. Directly using a Vision Transformer for online CL thus cannot consistently outperform CNN-based networks. Our proposed CVT architecture inherits the merits of both CNNs and transformers, and therefore works well on online streaming data while modeling long-range dependencies in the input. Moreover, CVT takes even fewer parameters to achieve better performance for online CL, which also benefits from the focal contrastive loss and the dual-classifier structure.
Effect of Each Component. To assess the effects of the components in CVT, we perform an ablation study in terms of accuracy and forgetting. From Table 3 we can observe that the proposed focal contrastive loss $\mathcal{L}_{FC}$ plays an important role in alleviating catastrophic forgetting and accumulating knowledge. However, if we simply replace $\mathcal{L}_{FC}$ with the supervised contrastive loss $\mathcal{L}_{SCL}$, we find that the forgetting problem is not mitigated compared to not using any contrastive loss. This is because using $\mathcal{L}_{SCL}$ directly causes a severe imbalance between new and past classes in online CL, which limits the learning of transferable representations, whereas $\mathcal{L}_{FC}$ overcomes the issue by utilizing the learnable focuses to boost the performance of online CL. This supports that $\mathcal{L}_{FC}$ rebalances contrastive learning between new and past classes and improves inter-class distinction and intra-class aggregation. Besides, the dual-classifier loss obtains 2.37% and 7.97% gains in accuracy and forgetting on CIFAR100, respectively. The results in Table 3 demonstrate the effectiveness of each component of CVT.
[Table 3 (excerpt), 500 memory: CIFAR100 (left) and TinyImageNet (right)]

| Method | Params (M) | Acc. | Forg. | Params (M) | Acc. | Forg. |
|---|---|---|---|---|---|---|
| ResNet18 + dual-classifier | 11.2 | 18.84 | 35.38 | 11.2 | 10.91 | 33.55 |
| CVT − $\mathcal{L}_{FC}$ + $\mathcal{L}_{SCL}$ | 8.9 | 20.73 | 25.52 | 9.0 | 12.59 | 20.47 |
| CVT − dual-classifier | 8.9 | 22.08 | 29.81 | 9.0 | 13.63 | 18.95 |
In this paper, we propose a novel attention-based framework, Contrastive Vision Transformer (CVT), to effectively mitigate catastrophic forgetting in online CL. To the best of our knowledge, this paper is the first in the literature to design a transformer for online CL. CVT contains external attention and learnable focuses to accumulate previous knowledge and maintain class-specific information. Trained with the proposed focal contrastive loss, CVT rebalances contrastive continual learning between new and past classes and improves inter-class distinction and intra-class aggregation. Moreover, CVT adopts a dual-classifier structure to decouple learning the current classes from balancing all seen classes. Extensive experimental results show that our approach significantly outperforms current state-of-the-art methods with fewer parameters. An ablation study validates the effectiveness of each proposed component.
This work is supported by ARC FL-170100117, DP-180103424, IC-190100031, and LE-200100049.
Conditional channel gated networks for task-aware continual learning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3931–3940. Cited by: §2.1.
Memory aware synapses: learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 139–154. Cited by: §1, §2.1.
-  (2017) Expert gate: lifelong learning with a network of experts. In CVPR, pp. 3366–3375. Cited by: §1.
-  (2019) Gradient based sample selection for online continual learning. In Advances in Neural Information Processing Systems, Cited by: §1, §2.2, §5.1, Table 1.
-  (2022) Learning fast, learning slow: a general continual learning method based on complementary learning system. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §5.1, Table 1.
-  (2021) Rainbow memory: continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8218–8227. Cited by: §2.1, §5.1, §5.1, Table 1, Table 2.
-  (2017) Deep learning. Vol. 1, MIT press Massachusetts, USA:. Cited by: §1.
-  (2019) Measuring and regularizing networks in function space. International Conference on Learning Representations. Cited by: §1, §2.1, §5.1, Table 1, Table 2.
-  (2020) Dark Experience for General Continual Learning: a Strong, Simple Baseline. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2.1, §4.2.2, §5.1, §5.1, §5.1, §5.2, §5.2, Table 1, Table 2.
-  (2021) Rethinking experience replay: a bag of tricks for continual learning. In International Conference on Pattern Recognition (ICPR), pp. 2180–2187. Cited by: §2.2, §5.1, §5.1, Table 1, Table 2.
-  (2020) Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems 33, pp. 9912–9924. Cited by: §1.
-  (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: §1, §2.1, §5.1, §5.2.
-  (2021) Using hindsight to anchor past knowledge in continual learning. In AAAI, Cited by: §1, §2.1, §5.1, §5.1, Table 1.
-  (2019) Efficient Lifelong Learning with A-GEM. In International Conference on Learning Representations (ICLR), Cited by: §2.2, §5.1, §5.1, §5.1, Table 1, Table 2.
-  (2022) Learning meta-adversarial features via multi-stage adaptation network for robust visual object tracking. Neurocomputing 491, pp. 365–381. External Links: Cited by: §2.1.
A simple framework for contrastive learning of visual representations.
International conference on machine learning, pp. 1597–1607. Cited by: §1, §3.0.2.
-  (2021) CoAtNet: marrying convolution and attention for all data sizes. In Advances in Neural Information Processing Systems, Cited by: §1, §2.3.
-  (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §1, §1.
-  (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §5.1.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.3.
-  (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §1, §2.3, §4.1.1, §5.1, §5.1, §5.3, Table 3.
-  (2020) Podnet: pooled outputs distillation for small-tasks incremental learning. In European Conference on Computer Vision (ECCV), pp. 86–102. Cited by: §2.1.
-  (2022) DyTox: transformers for continual learning with dynamic token expansion. Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.3.
Position-aware image captioning with spatial relation. Neurocomputing 497, pp. 28–38. External Links: Cited by: §2.3.
-  (2019) Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 244–253. Cited by: §2.3.
-  (2021-10) LeViT: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12259–12269. Cited by: §1, §2.3, §4.1.1, §5.1, §5.1, §5.3, Table 3.
-  (2013) Adaptive resonance theory: how a brain learns to consciously attend, learn, and recognize a changing world. Neural networks 37, pp. 1–47. Cited by: §1.
-  (2020) Supervised contrastive learning for pre-trained language model fine-tuning. arXiv preprint arXiv:2011.01403. Cited by: §1, §3.0.2.
-  (2020) LTF: a label transformation framework for correcting label shift. In ICML, Vol. 119, pp. 3843–3853. Cited by: §1.
Alleviating semantics distortion in unsupervised low-level image-to-image translation via structure consistency constraint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18249–18259. Cited by: §1.
-  (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1.
-  (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556. Cited by: §2.3.
-  (2021) Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704. Cited by: §2.3, §5.1, §5.3, Table 3.
-  (2020) Incremental learning in online scenario. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13926–13935. Cited by: §1, §2.2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5.1, §5.3, Table 3.
-  (2014) Distilling the Knowledge in a Neural Network. In NeurIPS workshop, Cited by: §2.1.
-  (2019) Learning a unified classifier incrementally via rebalancing. In CVPR, pp. 831–839. Cited by: §1, §2.1.
-  (2018) Re-evaluating continual learning scenarios: a categorization and case for strong baselines. In NeurIPS Continual Learning Workshop, Cited by: §1, §5.1.
-  (2021) Gradient-based editing of memory examples for online task-free continual learning. Advances in Neural Information Processing Systems 34. Cited by: §5.1.
-  (2021) Transformers in vision: a survey. arXiv preprint arXiv:2101.01169. Cited by: §2.3.
-  (2020) Supervised contrastive learning. Advances in Neural Information Processing Systems 33. Cited by: §1, §3.0.2.
-  (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526. Cited by: §1, §2.1.
-  (2021) Adaptive curriculum learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5067–5076. Cited by: §2.1.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §5.1.
-  (2019) Learn to grow: a continual structure learning framework for overcoming catastrophic forgetting. In International Conference on Machine Learning (ICML), Cited by: §2.1.
-  (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40 (12). Cited by: §1.
-  (2018) Improving the interpretability of deep neural networks with knowledge distillation. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Cited by: §2.1.
-  (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022. Cited by: §2.3.
-  (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1, §2.2, §2.2, §5.1, Table 1.
-  (2018) Incremental on-line learning: a review and comparison of state of the art algorithms. Neurocomputing 275, pp. 1261–1274. Cited by: §1, §2.2.
-  (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §2.3.
-  (2021) Supervised contrastive replay: revisiting the nearest class mean classifier in online class-incremental continual learning. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3584–3594. Cited by: §2.1, §2.2, §5.1, Table 2.
-  (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.1.
-  (1989) Catastrophic interference in connectionist networks: the sequential learning problem. G. H. Bower (Ed.), Psychology of Learning and Motivation, Vol. 24, pp. 109–165. Cited by: §1.
-  (2013) The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in Psychology 4, pp. 504. Cited by: §1.
-  (2021) DualNet: continual learning, fast and slow. Advances in Neural Information Processing Systems 34. Cited by: §5.1.
-  (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §1.
-  (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §2.3.
-  (2019) Random path selection for continual learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1.
-  (2017) Encoder based lifelong learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1320–1328. Cited by: §2.1.
-  (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review 97 (2), pp. 285. Cited by: §1.
-  (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.1, §5.1, §5.1, §5.1, §5.2, §5.2, Table 1, Table 2.
-  (2019) Learning to learn without forgetting by maximizing transfer and minimizing interference. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.2, §5.1, Table 1, Table 2.
-  (2018) Progress & Compress: A scalable framework for continual learning. In International Conference on Machine Learning, Cited by: §1, §2.1.
-  (2018) Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, Vol. 80, pp. 4548–4557. Cited by: §1.
-  (2021) Continual learning via bit-level information preserving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16674–16683. Cited by: §2.1.
-  (2021) Online class-incremental continual learning with adversarial shapley value. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 9630–9638. Cited by: §1, §2.2, §2.2, §5.1, Table 1, Table 2.
-  (2021) On learning the geodesic path for incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1591–1600. Cited by: §2.1, §5.1.
-  (2021) Rectification-based knowledge retention for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15282–15291. Cited by: §1, §2.1.
-  (2015) Tiny ImageNet Challenge (CS231n). Note: http://tiny-imagenet.herokuapp.com/ Cited by: §5.1.
-  (2021) Layerwise optimization by gradient decomposition for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9634–9643. Cited by: §5.1.
-  (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §1.
-  (2018) Three continual learning scenarios. NeurIPS Continual Learning Workshop. Cited by: §1, §5.1.
-  (2021) Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12894–12904. Cited by: §2.3.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.3, §4.1.1.
-  (2021) Rehearsal revealed: The limits and merits of revisiting samples in continual learning. In ICCV, Cited by: §1.
-  (1985) Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS) 11 (1), pp. 37–57. Cited by: §4.2.2.
-  (2021) Multi-label few-shot learning with semantic inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 15917–15918. Cited by: §2.2.
-  (2022) Continual learning with lifelong vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 171–181. Cited by: §1, §2.1, §4.1.1.
-  (2021) Continual learning with embeddings: algorithm and analysis. In ICML 2021 Workshop on Theory and Foundation of Continual Learning, Cited by: §1.
-  (2022) Continual learning through retrieval and imagination. Proceedings of the AAAI Conference on Artificial Intelligence 36 (8), pp. 8594–8602. Cited by: §1, §2.1.
-  (2022) SIN: semantic inference network for few-shot streaming label learning. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14. Cited by: §2.1.
-  (2020) Deep streaming label learning. In International Conference on Machine Learning (ICML), Vol. 119, pp. 9963–9972. Cited by: §2.1.
-  (2021) CvT: introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22–31. Cited by: §1, §2.3.
-  (2022) Single-leader multi-follower games for the regulation of two-sided mobility-as-a-service markets. European Journal of Operational Research. Cited by: §1.
-  (2020) Bounding the efficiency gain of differentiable road pricing for evs and gvs to manage congestion and emissions. PloS one 15 (7), pp. e0234204. Cited by: §2.1.
-  (2020) Incentive-compatible mechanisms for online resource allocation in mobility-as-a-service systems. arXiv preprint arXiv:2009.06806. Cited by: §2.2.
-  (2021) Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9981–9990. Cited by: §2.3, §5.1, §5.3, Table 3.
-  (2021) DER: dynamically expandable representation for class incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §5.2.
-  (2020) Semantic drift compensation for class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6982–6991. Cited by: §1.
-  (2021) Improving vision transformers for incremental learning. arXiv preprint arXiv:2112.06103. Cited by: §2.3.
-  (2021) Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 558–567. Cited by: §1, §2.3.
-  (2017) Continual learning through synaptic intelligence. In International Conference on Machine Learning (ICML), Cited by: §1, §2.1, §5.1.
-  (2021) Survey on facial expression recognition: history, applications, and challenges. IEEE MultiMedia 28 (4), pp. 38–44. Cited by: §2.3.
-  (2020) IExpressNet: facial expression recognition with incremental classes. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2899–2908. Cited by: §1, §1, §2.1.
-  (2019) Physiological signals-based emotion recognition via high-order correlation learning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15 (3s), pp. 1–18. Cited by: §2.2.