Recent Advances of Continual Learning in Computer Vision: An Overview

09/23/2021 ∙ by Haoxuan Qu, et al. ∙ Singapore University of Technology and Design ∙ Lancaster University

In contrast to batch learning where all training data is available at once, continual learning represents a family of methods that accumulate knowledge and learn continuously with data available in sequential order. Similar to the human learning process with the ability of learning, fusing, and accumulating new knowledge coming at different time steps, continual learning is considered to have high practical significance. Hence, continual learning has been studied in various artificial intelligence tasks. In this paper, we present a comprehensive review of the recent progress of continual learning in computer vision. In particular, the works are grouped by their representative techniques, including regularization, knowledge distillation, memory, generative replay, parameter isolation, and a combination of the above techniques. For each category of these techniques, both its characteristics and applications in computer vision are presented. At the end of this overview, several subareas, where continuous knowledge accumulation is potentially helpful while continual learning has not been well studied, are discussed.


1 Introduction

Human learning is a gradual process. Throughout the course of human life, humans continually receive and learn new knowledge. Newly acquired knowledge not only accumulates, but also supplements and revises previously acquired knowledge. In contrast, traditional machine learning and deep learning paradigms generally separate the processes of training and inference: the model is required to complete its training on a pre-prepared dataset within a limited time, and only afterwards is it used for inference.

With the widespread popularity of cameras and mobile phones, a large number of new images and videos are captured and shared every day. This has given rise to new requirements, especially in the computer vision area, for models to learn and update themselves sequentially and continuously during inference, since retraining a model from scratch to adapt to the daily newly generated data is time-consuming and extremely inefficient.

Considering the different structures of neural networks and human brains, neural network training is not easily transformed from its original batch learning mode to the new continual learning mode. In particular, there exist two main problems. Firstly, learning from data with multiple categories in sequential order can easily lead to the problem of catastrophic forgetting [MCCLOSKEY1989109, FRENCH1999128]. This means the performance of the model on previously learned categories often decreases sharply after the model parameters are updated from the data of the new category. Secondly, when learning from new data of the same category in sequential order, this can also lead to the problem of concept drift [schlimmer1986incremental, widmer1993effective, gama2014survey], as the new data may change the data distribution of this category in unforeseen ways [royer2015classifier]. Hence, the overall task of continual learning is to solve the stability-plasticity dilemma [grossberg2007consciousness, mermillod2013stability], which requires the neural networks to prevent forgetting previously learned knowledge, while maintaining the capability of learning new knowledge.

Figure 1: General trend of the number of papers on continual learning in computer vision published in top-ranked conferences during the past six years. The plot shows consistent growth in recent literature.

In recent years, an increasing number of continual learning methods have been proposed in various subareas of computer vision, as shown in Figure 1. Additionally, several competitions [lomonaco2020cvpr, 2ndclvisioncvprworkshop] related to continual learning in computer vision have been held in both 2020 and 2021. Hence, in this paper, we present an overview of the recent advances of continual learning in computer vision. We summarize the main contributions of this overview as follows. (1) A systematic review of the recent progress of continual learning in computer vision is provided. (2) Various continual learning techniques that are used in different computer vision tasks are introduced, including regularization, knowledge distillation, memory-based, generative replay, and parameter isolation. (3) The subareas in computer vision, where continual learning is potentially helpful yet still not well investigated, are discussed.

The remainder of this paper is organized as follows. Section 2 gives the definition of continual learning. Section 3 presents the commonly used evaluation metrics in this area. Section 4 discusses various categories of continual learning methods and their applications in computer vision. The subareas of computer vision where continual learning has not been well exploited are discussed in Section 5. Finally, Section 6 concludes the paper.

2 Continual Learning: Problem Definition

In this section, we introduce the formalization of continual learning, following the recent works in this area [lesort2020continual, mai2021online, aljundi2019online]. We denote the initial model before continual learning as $M_0$, the number of classes covered by $M_0$ as $c_0$, and a potentially infinite data stream of tasks as $\{T_1, T_2, \dots, T_t, \dots\}$, where $T_t = \{(X_t^{c_{t-1}+1}, Y_t^{c_{t-1}+1}), \dots, (X_t^{c_t}, Y_t^{c_t})\}$. Here, $X_t^{i}$ and $Y_t^{i}$ are the sets of data instances and their corresponding labels for the $i$-th class in task $T_t$, and $c_t$ is the total number of classes covered after learning $T_t$. We further denote the model after learning $T_t$ as $M_t$. Then a typical continual learning process can be defined as $P = \{P_1, P_2, \dots, P_t, \dots\}$, where each step $P_t$ updates $M_{t-1}$ into $M_t$ using the data of task $T_t$. Note that, apart from the subset of previous data instances and labels stored by memory based methods, the data instances and labels of the tasks previous to $T_t$ are generally inaccessible during the process $P_t$. The general objective of a continual learning method is then to ensure that every model $M_t$ learned in the process $P$ achieves good performance on the new task without affecting the performance on the previous tasks.

As mentioned by Van de Ven and Tolias [van2019three], continual learning can be further divided into three different scenarios: (i) task-, (ii) domain-, and (iii) class-incremental continual learning. Task-incremental continual learning generally requires the task ID to be given during inference. Domain-incremental continual learning aims to distinguish classes inside each task instead of distinguishing different tasks and does not require the task ID during inference. Class-incremental continual learning aims to distinguish classes both inside and among tasks, without requiring the task ID during inference. For example, suppose we want to distinguish hand-written numbers from 1 to 4, and we divide this into two tasks $T_1$ and $T_2$, where $T_1$ contains 1 and 2, and $T_2$ contains 3 and 4. For inference, with a given hand-written number from 1 to 4, task-incremental continual learning requires the corresponding task ID (either $T_1$ or $T_2$) and can distinguish between 1 and 2 in $T_1$, and between 3 and 4 in $T_2$. Meanwhile, domain-incremental continual learning does not require a task ID, but can only distinguish between the first and the second class of each task (here, between odd and even numbers). In contrast, class-incremental continual learning can distinguish among the numbers 1 to 4 without requiring the task ID. Hence, in general, task-incremental and domain-incremental continual learning can be regarded as simplified versions of class-incremental continual learning. Thus, in this overview, we mainly focus on continual learning methods from the perspective of various categories of techniques, rather than their usage in specific scenarios.
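To make the above protocol concrete, the following is a minimal sketch (in PyTorch-style Python, with a hypothetical `task_stream` helper) of the sequential training loop implied by this definition: the model is updated one task at a time, and data of earlier tasks is no longer available once a task has been consumed.

```python
import torch
from torch import nn, optim

def continual_learning(model: nn.Module, task_stream, epochs_per_task: int = 1):
    """Sequentially update `model` on a stream of tasks T_1, T_2, ...

    `task_stream` is assumed to yield one DataLoader per task; data from
    earlier tasks is not revisited, which is what makes the setting prone
    to catastrophic forgetting.
    """
    criterion = nn.CrossEntropyLoss()
    for task_id, loader in enumerate(task_stream):
        optimizer = optim.SGD(model.parameters(), lr=0.01)
        for _ in range(epochs_per_task):
            for x, y in loader:          # only the current task's data is accessible
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
        # The state of `model` at this point corresponds to M_t after task T_t.
        yield task_id, model
```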

Figure 2: A taxonomy of continual learning methods.

3 Evaluation Metrics

In the evaluation of different continual learning methods, it is important to measure both their performance on continuously learning new tasks and how much knowledge from the previous tasks has been forgotten. There are several measurement metrics [lopez2017gradient, chaudhry2018riemannian] that have been popularly used in continual learning methods. In the following, we define $a_{i,j}$ as the accuracy on the test set of the $j$-th task after the continual learning of the first $i$ tasks, where $j \le i$.

Average accuracy ($A_k$) measures the performance of the continual learning method after a total of $k$ tasks have been learned, which can be formulated as:

$$A_k = \frac{1}{k} \sum_{j=1}^{k} a_{k,j} \quad (1)$$

Average forgetting ($F_k$) measures how much knowledge has been forgotten across the first $k-1$ tasks:

$$F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} f_j^{k} \quad (2)$$

where the knowledge forgetting $f_j^{k}$ of the $j$-th task is defined as the difference between the maximal knowledge obtained for that task during the continual learning process and the knowledge remaining after $k$ tasks have been learned. It can be calculated as:

$$f_j^{k} = \max_{i \in \{1, \dots, k-1\}} a_{i,j} - a_{k,j} \quad (3)$$

Intransigence ($I_k$) measures how much continual learning prevents a model from learning a new task compared to typical batch learning:

$$I_k = a_k^{*} - a_{k,k} \quad (4)$$

where $a_k^{*}$ denotes the accuracy on the test set of the $k$-th task when batch (joint) learning is used for the first $k$ tasks.

Backward transfer ($BWT$) measures how much the continual learning of the $k$-th task influences the performance of the previously learned tasks. It can be defined as:

$$BWT = \frac{1}{k-1} \sum_{j=1}^{k-1} \left( a_{k,j} - a_{j,j} \right) \quad (5)$$

Forward transfer ($FWT$) measures how much the continual learning of the $k$-th task potentially influences the performance of the future tasks. Following the notation of [lopez2017gradient], we define $\bar{b}_j$ as the accuracy on the test set of the $j$-th task at random initialization. Then, forward transfer ($FWT$) can be defined as:

$$FWT = \frac{1}{k-1} \sum_{j=2}^{k} \left( a_{j-1,j} - \bar{b}_j \right) \quad (6)$$
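As an illustration, the following is a minimal sketch of how these metrics can be computed from an accuracy matrix `acc`, where `acc[i][j]` stores $a_{i+1,j+1}$ (accuracy on task $j+1$ after learning the first $i+1$ tasks); the function names are ours and not part of any specific benchmark library.

```python
import numpy as np

def average_accuracy(acc: np.ndarray) -> float:
    """A_k: mean accuracy over all k tasks after the last task is learned."""
    k = acc.shape[0]
    return float(acc[k - 1, :k].mean())

def average_forgetting(acc: np.ndarray) -> float:
    """F_k: mean drop from each task's best accuracy to its final accuracy."""
    k = acc.shape[0]
    drops = [acc[:k - 1, j].max() - acc[k - 1, j] for j in range(k - 1)]
    return float(np.mean(drops))

def backward_transfer(acc: np.ndarray) -> float:
    """BWT: mean change in previous-task accuracy after learning all k tasks."""
    k = acc.shape[0]
    return float(np.mean([acc[k - 1, j] - acc[j, j] for j in range(k - 1)]))

def forward_transfer(acc: np.ndarray, rand_acc: np.ndarray) -> float:
    """FWT: mean gain on each not-yet-learned task over its accuracy at
    random initialization; rand_acc[j] stores that baseline for task j+1."""
    k = acc.shape[0]
    return float(np.mean([acc[j - 1, j] - rand_acc[j] for j in range(1, k)]))
```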

4 Continual Learning Methods

Recently, a large number of continual learning methods have been proposed for various computer vision tasks, such as image classification, object detection, semantic segmentation, and image generation. Generally, these methods can be divided into different categories, including regularization based, knowledge distillation based, memory based, generative replay based, and parameter isolation based methods. Below we review these methods in detail. We also discuss the works that take advantage of the complementary strengths of multiple categories of methods to improve performance.

4.1 Regularization based methods

Regularization based methods [kirkpatrick2017overcoming, zenke2017continual, he2018overcoming, ebrahimi2019uncertainty] generally impose restrictions on the update of model parameters and hyperparameters in order to consolidate previously learned knowledge while learning new tasks, so as to mitigate catastrophic forgetting in continual learning. This can be achieved through a variety of schemes, which are introduced below.

4.1.1 Regularizing loss function

As the name suggests, the most typical scheme used by regularization based methods is to consolidate previously learned knowledge by regularizing the loss function.

In image classification, a typical method called Elastic Weight Consolidation (EWC) was first proposed by Kirkpatrick et al. [kirkpatrick2017overcoming]. EWC injects a new quadratic penalty term into the loss function to restrict the model from modifying the weights that are important for the previously learned tasks. The importance of weights is calculated via the diagonal of the Fisher information matrix. As shown in Figure 3, by adding the new penalty term to the loss function, EWC constrains the model parameters to update towards the common low loss area among tasks, instead of the low loss area of the new task only, thus helping to alleviate the catastrophic forgetting problem. However, the assumption of EWC [kirkpatrick2017overcoming] that the Fisher information matrix is diagonal is almost never true. To address this issue, Liu et al. [liu2018rotate] proposed to approximately diagonalize the Fisher information matrix by rotating the parameter space of the model, leaving the forward output unchanged. EWC [kirkpatrick2017overcoming] requires a quadratic penalty term to be added for each learned task, and hence has a linearly increasing computational cost. To handle this issue, Schwarz et al. [schwarz2018progress] proposed online EWC, a memory-efficient extension that keeps only a single penalty term anchored at the most recent task. Chaudhry et al. [chaudhry2018riemannian] also proposed an efficient alternative to EWC called EWC++, which maintains a single Fisher information matrix for all the previously learned tasks and updates the matrix using a moving average [martens2015optimizing]. Loo et al. [loo2020generalized] proposed Generalized Variational Continual Learning (GVCL), which unifies the Laplace approximation in online EWC [schwarz2018progress] and the variational approximation in VCL [nguyen2017variational]. The authors also proposed a task-specific layer called FiLM to further mitigate the over-pruning problem in VCL. Lee et al. [lee2017overcoming] merged the Gaussian posteriors of models trained on old and new tasks, respectively, by either mean-IMM, which simply averages the parameters of the two models, or mode-IMM, which utilizes the Laplace approximation to calculate a mode of the mixture of the two Gaussian posteriors. Ritter et al. [ritter2018online] updated a quadratic penalty term in the loss function for every task using a block-diagonal Kronecker-factored approximation of the Hessian matrix, which additionally accounts for intra-layer parameter interactions. Lee et al. [lee2020continual] pointed out that the current usage of the Hessian matrix approximation as the curvature of the quadratic penalty function is not effective for networks containing batch normalization layers. To resolve this issue, a Hessian approximation method that considers the effects of batch normalization layers was proposed.
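To illustrate the general idea behind this family of quadratic penalties, the following is a minimal EWC-style sketch in PyTorch; the squared-gradient Fisher estimate and the penalty weight `lam` are simplifications for illustration rather than the exact recipe of any one of the above papers.

```python
import torch
from torch import nn

def diagonal_fisher(model: nn.Module, loader, criterion) -> dict:
    """Approximate the diagonal of the Fisher information with squared gradients."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        criterion(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def ewc_penalty(model: nn.Module, fisher: dict, old_params: dict, lam: float = 100.0):
    """Quadratic penalty discouraging changes to weights important for old tasks."""
    penalty = torch.zeros(1)
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# Training on a new task would then minimize:
#   loss = criterion(model(x), y) + ewc_penalty(model, fisher, old_params)
```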

Aside from EWC [kirkpatrick2017overcoming] and its extensions [liu2018rotate, schwarz2018progress] that generally calculate the importance of weights by approximating curvature, Zenke et al. [zenke2017continual] proposed Synaptic Intelligence (SI) to measure the importance of weights with the help of synapses. They pointed out that one-dimensional weights are too simple to preserve knowledge. Hence, they proposed three-dimensional synapses which can preserve much more knowledge, and prevented important synapses from changing in order to preserve important previously learned knowledge. A modified version of SI, which measures the importance of weights using the distance in the Riemannian manifold instead of the Euclidean distance, was also proposed in [chaudhry2018riemannian] to effectively encode the information about all the previous tasks. Park et al. [park2019continual] pointed out that SI [zenke2017continual] may underestimate the loss since it assumes the loss functions are symmetric, which is often incorrect. Hence, they proposed Asymmetric Loss Approximation with Single-Side Overestimation (ALASSO), which considers the loss functions of the previous tasks as the observed loss functions. Using the quadratic approximation of these observed loss functions, ALASSO derives the loss function required for the new task by overestimating the unobserved part of the previous loss functions.
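As a complementary illustration, the sketch below accumulates a path-integral-style, per-parameter importance in the spirit of SI (our simplification, not the authors' exact implementation): contributions of each parameter to the loss decrease are summed online during training and normalized by the total parameter change at the end of a task.

```python
import torch
from torch import nn

class SynapticImportance:
    """Online accumulation of per-parameter importance, SI-style (simplified)."""

    def __init__(self, model: nn.Module, xi: float = 0.1):
        self.xi = xi
        self.omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        self.prev = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.start = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

    def accumulate(self, model: nn.Module):
        """Call right after each optimizer step, while gradients are still populated."""
        for n, p in model.named_parameters():
            if p.grad is not None:
                delta = p.detach() - self.prev[n]
                self.omega[n] += -p.grad.detach() * delta  # contribution to loss decrease
                self.prev[n] = p.detach().clone()

    def consolidate(self, model: nn.Module):
        """Call at the end of a task to turn the accumulated omega into importances."""
        for n, p in model.named_parameters():
            change = p.detach() - self.start[n]
            self.importance[n] += self.omega[n] / (change ** 2 + self.xi)
            self.omega[n].zero_()
            self.start[n] = p.detach().clone()
```

The resulting importances can be plugged into a quadratic penalty of the same form as the EWC sketch above.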

Figure 3: Illustration of EWC [kirkpatrick2017overcoming], which constrains the model parameters to update towards the common low loss area among tasks, instead of the low loss area of only the new task.

In addition to EWC [kirkpatrick2017overcoming] and SI [zenke2017continual], various other methods [aljundi2018memory, ren2018incremental] have been proposed to preserve important parameters from other perspectives. Aljundi et al. [aljundi2018memory] proposed a method called Memory Aware Synapses (MAS). MAS calculates the importance of weights with a model of Hebbian learning in biological systems, which relies on the sensitivity of the output function and can hence be utilized in an unsupervised manner. Ren et al. [ren2018incremental] focused on incremental few-shot learning. They proposed a method called Attention Attractor Network (ANN) which adds an additional penalty term, adapted from attractor networks [zemel2001localist], to the loss function. Cha et al. [cha2020cpr] pointed out that as the number of learned tasks increases, the common low loss area for all the learned tasks quickly becomes quite small. Hence, they proposed Classifier-Projection Regularization (CPR) which adds an additional penalty term to the loss function to widen the common low loss area. Hu et al. [hu2021continual] proposed Per-class Continual Learning (PCL) which treats every task holistically on its own, so that new tasks are less likely to modify the features of previously learned tasks. This is achieved by adapting the one-class loss with holistic regularization [hu2020hrn] to continual learning scenarios.

Further to methods that add a single penalty term to the loss function, some methods [ahn2019uncertainty, jung2020continual] handle the stability-plasticity dilemma by directly adding one penalty term for stability and one for plasticity. Ahn et al. [ahn2019uncertainty] integrated the idea of uncertainty into regularization by making the variance of the incoming weights of each node trainable and further added two additional penalty terms, for stability and plasticity respectively. Jung et al. [jung2020continual] selectively added two additional penalty terms to the loss function, comprising a Lasso term that controls the ability of the model to learn new knowledge, and a drifting term to prevent the model from forgetting. Volpi et al. [volpi2021continual] further proposed a meta-learning strategy to consecutively learn from different visual domains. Two penalty terms are added to the loss function, comprising a recall term to mitigate catastrophic forgetting and an adapt term to ease adaptation to each new visual domain. To mimic new visual domains, the authors applied heavy image manipulations to the data instances from the current domain to generate multiple auxiliary meta-domains. Aside from image classification, they also adapted their model to perform semantic segmentation.

Beyond image classification, some works have focused on other computer vision problems, such as domain adaptation [kundu2020class], image generation [seff2017continual] and image de-raining [zhou2021image]. Kundu et al. [kundu2020class] combined the problem of image classification with domain adaptation and proposed a modified version of the prototypical network [snell2017prototypical] to solve this proposed new problem. Seff et al. [seff2017continual] adapted EWC to a class-conditional image generation task, which requires the model to generate new images conditioned on their class. Zhou et al. [zhou2021image] proposed to apply a regularization based method on image de-raining. They regarded each dataset as a separate task and proposed a Parameter-Importance Guided Weights Modification (PIGWM) approach to calculate the task parameter importance using the Hessian matrix approximation calculated by the Jacobian matrix for storage efficiency.

4.1.2 Regularizing gradient

Besides regularizing loss functions, a few other methods [he2018overcoming, zeng2019continual] proposed to regularize the gradient given by backpropagation to prevent the update of parameters from interfering with previously learned knowledge.

In image classification, He and Jaeger [he2018overcoming] replaced typical backpropagation with conceptor-aided backpropagation. For each layer of the network, a conceptor characterizing the subspace of the layer spanned by the neural activations appearing in the previous tasks is calculated, and this conceptor is preserved during the backpropagation process. Zeng et al. [zeng2019continual] proposed an Orthogonal Weights Modification (OWM) algorithm. During the training process of each new task, the modification of weights calculated during typical backpropagation is further projected onto the orthogonal complement of the subspace spanned by the inputs of all the previous tasks in order to maintain their performance. A Context-Dependent Processing (CDP) module is also included to facilitate the learning of contextual features. Similarly, Wang et al. [wang2021training] pointed out that mapping the update of the model parameters for learning the new task into the null space of the previous tasks can help to mitigate catastrophic forgetting. They proposed Adam-NSCL, which uses Singular Value Decomposition (SVD) to approximate the null space of the previous tasks. Saha et al. [saha2021gradient] partitioned the gradient space into a Core Gradient Space (CGS) and a Residual Gradient Space (RGS), which are orthogonal to each other. The gradient step of the new task is enforced to be orthogonal to the CGS of the previous tasks. As a result, learning the new task minimally affects the performance of the model on the previous tasks.
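As a rough illustration of this family of gradient-projection ideas, the sketch below (our simplification, not the exact procedure of OWM, Adam-NSCL, or [saha2021gradient]) removes from a new gradient its component lying in a subspace spanned by feature directions recorded from previous tasks, so that updates are approximately orthogonal to what earlier tasks rely on.

```python
import torch

def build_feature_basis(feature_matrix: torch.Tensor, energy: float = 0.95) -> torch.Tensor:
    """Return an orthonormal basis (columns) capturing `energy` of the spectrum of
    previous-task features; `feature_matrix` has one feature vector per column."""
    U, S, _ = torch.linalg.svd(feature_matrix, full_matrices=False)
    cumulative = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    rank = int(torch.searchsorted(cumulative, torch.tensor(energy)).item()) + 1
    return U[:, :rank]

def project_out(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of `grad` lying in span(basis): g <- g - B(B^T g)."""
    return grad - basis @ (basis.t() @ grad)

# Example for a linear layer whose old-task inputs spanned `basis`
# (basis has in_features rows). After loss.backward(), one could do:
#   layer.weight.grad = layer.weight.grad - layer.weight.grad @ basis @ basis.t()
```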

4.1.3 Regularizing learning rate

Aside from the regularization of model parameters, a few other methods [ebrahimi2019uncertainty, mirzadeh2020understanding] proposed to regularize the learning rate, which has also been shown to be effective in mitigating catastrophic forgetting.

In image classification, Ebrahimi et al. [ebrahimi2019uncertainty] utilized uncertainty to modify the learning rate of the model parameters based on their importance for the previous tasks. Mirzadeh et al. [mirzadeh2020understanding] pointed out that the properties of the local minima of each task have an important role in preventing forgetting. Hence, they proposed to tune the learning rate and batch size to indirectly control the geometry of the local minima of different tasks.

In image semantic segmentation, Ozgun et al. [ozgun2020importance] proposed to prevent the model from losing knowledge by restricting the adaptation of important model parameters with learning rate regularization. More precisely, the learning rate of important model parameters is reduced, while the learning rate of non-important parameters is kept the same.
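A minimal sketch of this learning-rate regularization idea follows (ours, assuming a per-parameter `importance` dictionary such as one produced by the estimators sketched earlier): tensors deemed important for previous tasks receive a reduced learning rate, while unimportant ones keep the base rate.

```python
import torch
from torch import nn, optim

def importance_scaled_optimizer(model: nn.Module, importance: dict,
                                base_lr: float = 0.01, min_scale: float = 0.1):
    """Build an SGD optimizer with one parameter group per tensor, where the
    learning rate shrinks as the (mean) importance of that tensor grows."""
    groups = []
    for name, param in model.named_parameters():
        score = importance.get(name, torch.zeros_like(param)).abs().mean().item()
        scale = max(min_scale, 1.0 / (1.0 + score))  # more important -> smaller lr
        groups.append({"params": [param], "lr": base_lr * scale})
    return optim.SGD(groups, lr=base_lr)
```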

Apart from the above-mentioned schemes, i.e., regularizing the loss function, gradient and learning rate, Kapoor et al. [kapoor2021variational] performed regularization from another perspective. They proposed Variational Auto-Regressive Gaussian Processes (VAR-GPs), which uses sparse inducing point approximation to better approximate the Gaussian posterior, resulting in a lower bound objective for regularization.

4.2 Knowledge distillation based methods

Knowledge distillation based methods [li2017learning, dhar2019learning] incorporate the idea of knowledge distillation into continual learning by distilling knowledge from the model trained on the previous tasks to the model trained on the new task in order to consolidate previously learned knowledge. This can be achieved through a variety of schemes, which are introduced below.

4.2.1 Distillation in the presence of sufficient data instances

The majority of knowledge distillation based methods [li2017learning, dhar2019learning] were designed for continual learning from a stream of sufficient data instances for each new task.

Figure 4: Illustration of LwF [li2017learning], which stores a copy of the previous model’s parameters before learning the new task, and uses the response of that copied model on the data instances from the new task as the target for the previous tasks’ classifiers during the learning of the new task, while using the accessible ground truth as the target for the new task classifier.

In image classification, Li and Hoiem [li2017learning] proposed a typical method called Learning without Forgetting (LwF). As shown in Figure 4, LwF stores a copy of the previous model parameters before learning the new task. It then uses the response of that copied model on the data instances from the new task as the target for the classifiers of the previous tasks, while the accessible ground truth is used as the target for the new task classifier. Rannen et al. [rannen2017encoder] pointed out that LwF does not result in good performance if the data distributions of different tasks are quite diverse. Hence, they further trained an autoencoder for each task to learn the most important features corresponding to the task, and used it to preserve knowledge. Dhar et al. [dhar2019learning] proposed Learning without Memorizing (LwM), which adds an extra term, i.e., an attention distillation loss, to the knowledge distillation loss. The attention distillation loss penalizes changes in the attention maps of the classifiers and helps to retain previously learned knowledge. Fini et al. [fini2020online] proposed a two-stage method called Batch-Level Distillation (BLD). In the first stage of learning, only the data instances of the new task are used to minimize the classification loss over the new task classifier. In the second stage, both the knowledge distillation and the learning of the new task are carried out simultaneously. Douillard et al. [douillard2020podnet] regarded continual learning as representation learning and proposed a distillation loss called Pooled Outputs Distillation (POD) which constrains the update of the learned representation at both the final output and the intermediate layers. Kurmi et al. [kurmi2021not] utilized the prediction uncertainty of the model on the previous tasks to mitigate catastrophic forgetting; aleatoric uncertainty and self-attention are incorporated into the proposed distillation loss. Simon et al. [simon2021learning] proposed to conduct knowledge distillation on low-dimensional manifolds between the model outputs of previous and new tasks, which was shown to better mitigate catastrophic forgetting. Hu et al. [hu2021distilling] found that knowledge distillation based approaches do not have a consistent causal effect compared to end-to-end feature learning. Hence, they proposed to also distill the colliding effect as a complementary way to preserve the causal effect.
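For concreteness, the following is a minimal LwF-style sketch in PyTorch (our simplification; the temperature `T` and loss weighting `alpha` are assumptions): the frozen copy of the old model provides soft targets for the old-task outputs on new-task images, while the new-task head is trained on the ground-truth labels.

```python
import copy
import torch
import torch.nn.functional as F
from torch import nn

def lwf_loss(model: nn.Module, old_model: nn.Module, x, y,
             n_old_classes: int, T: float = 2.0, alpha: float = 1.0):
    """Cross-entropy on the new task plus distillation towards the old model's
    responses on the old classes, both computed on the same new-task batch."""
    logits = model(x)
    with torch.no_grad():
        old_logits = old_model(x)[:, :n_old_classes]

    ce = F.cross_entropy(logits, y)  # ground truth supervises the new-task head
    distill = F.kl_div(
        F.log_softmax(logits[:, :n_old_classes] / T, dim=1),
        F.softmax(old_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return ce + alpha * distill

# Before training the new task, one would keep a frozen copy of the model:
#   old_model = copy.deepcopy(model).eval()
#   for p in old_model.parameters():
#       p.requires_grad_(False)
```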

In addition to mitigating catastrophic forgetting, a few other methods [hou2019learning, zhao2020maintaining] were also proposed to solve the data imbalance problem or the concept drift problem at the same time. Hou et al. [hou2019learning] systematically investigated the problem of data imbalance between the previous and new data instances and proposed a new knowledge distillation based framework which treats both previous and new tasks uniformly to mitigate the effects of data imbalance. A cosine normalization, a less-forget constraint which acts as a feature level distillation loss, and an inter-class separation component are incorporated to effectively address data imbalance from different aspects. Zhao et al. [zhao2020maintaining] pointed out that the parameters of the last fully connected layer are highly biased in continual learning. They thus proposed a Weight Aligning (WA) method to correct the bias towards new tasks. He et al. [he2020incremental] pointed out that concept drift is much less explored in previous continual learning studies compared to catastrophic forgetting. Hence, the authors proposed both a new cross distillation loss to handle catastrophic forgetting and an exemplar updating rule to handle concept drift.

A few approaches [ke2020continual, lee2021sharing] have been proposed to only transfer relevant knowledge. Ke et al. [ke2020continual] proposed Continual learning with forgetting Avoidance and knowledge Transfer (CAT). When a new task comes, CAT automatically divides the previous tasks into similar and dissimilar tasks. After that, knowledge is transferred from the similar tasks to facilitate the learning of the new task, and task masks are applied to the dissimilar tasks to prevent them from being forgotten. Lee et al. [lee2021sharing] proposed to only transfer knowledge at a selected subset of layers, where the selection is made based on an Expectation-Maximization (EM) method.

Besides image classification, several recent works [michieli2019incremental, cermelli2020modeling, douillard2021plop, michieli2021continual] focused on image semantic segmentation. Michieli et al. [michieli2019incremental] proposed several approaches to distill the knowledge of the model learned on the previous tasks, whilst updating the current model to learn the new ones. Cermelli et al. [cermelli2020modeling] further pointed out that at each training step, as the label is only given to areas of the image corresponding to the learned classes, other background areas suffer from a semantic distribution shift, which is not considered in previous works. They then proposed both a new distillation based framework and a classifier parameter initialization strategy to handle the distribution shift. Douillard et al. [douillard2021plop] handled the background semantic distribution shift problem by generating pseudo-labels of the background from the previously learned model. They also proposed a multi-scale spatial distillation, preserving both long-range and short-range spatial relationships at the feature level to mitigate catastrophic forgetting. Michieli and Zanuttigh [michieli2021continual] proposed several strategies in the latent space complementary to knowledge distillation including prototype matching, contrastive learning, and feature sparsity.

Aside from image classification and image semantic segmentation, some works have focused on other computer vision problems, such as object detection [shmelkov2017incremental], conditional image generation [wu2018memory, zhai2019lifelong], image and video captioning [nguyen2019contcap], and person re-identification [pu2021lifelong]. Shmelkov et al. [shmelkov2017incremental] adapted the knowledge distillation method to object detection from images. This method distills knowledge from both the unnormalized logits and the bounding box regression output of the previous model. Wu et al. [wu2018memory] adapted the knowledge distillation method to class-conditioned image generation. They proposed an aligned distillation method to align the images produced by an auxiliary generator with those produced by the current generator, given the same label and latent feature as inputs. Zhai et al. [zhai2019lifelong] also applied knowledge distillation to conditional image generation. They proposed Lifelong GAN, which adapts knowledge distillation to the class-conditioned image generation task without requiring any image storage. As a result, this method can also be used for other conditional image generation tasks such as image-conditioned generation, where a reference image instead of a class label is given and there may exist no related image in the previous tasks. Nguyen et al. [nguyen2019contcap] applied knowledge distillation to both image and video captioning. They proposed ContCap, which adapts pseudo-label methods to image captioning. They also incorporated knowledge distillation of intermediate features and proposed to partly freeze the model to transfer knowledge smoothly while maintaining the capability to learn new knowledge. Pu et al. [pu2021lifelong] proposed to apply knowledge distillation to person re-identification. They pointed out that in human cognitive science, the brain has been found to focus more on stability during knowledge representation and more on plasticity during knowledge operation. Hence, an Adaptive Knowledge Accumulation (AKA) method was proposed as an alternative to the typical knowledge distillation method. This method uses a knowledge graph as the knowledge representation and graph convolution as the knowledge operation, and further involves a plasticity-stability loss to mitigate catastrophic forgetting.

In contrast to the above-mentioned methods, Lee et al. [lee2019overcoming] pointed out that increasing the number of data instances for training each task can help mitigate catastrophic forgetting. They thus defined a new problem setup where unlabelled data instances are used together with labeled data instances during continual learning. A confidence-based sampling method was proposed to select unlabelled data instances similar to those of the previous tasks. They further proposed a global distillation method to distill the knowledge from all the previous tasks together, instead of focusing on each previous task separately.

4.2.2 Distillation in the presence of limited data instances

Aside from knowledge distillation based continual learning with sufficient data instances for each new task, several recent works [cheraghian2021semantic, yoon2020xtarnet, perez2020incremental] have focused on few-shot continual learning, where only a few data instances are given for each new task.

In image classification, Liu et al. [liu2020incremental] proposed a method named Indirect Discriminant Alignment (IDA). IDA does not align the classifier of the new task towards all the previous tasks during distillation but carries out alignment to a subset of anchor tasks. Hence, the model is much more flexible towards learning new tasks. Cheraghian et al. [cheraghian2021semantic] made use of word embeddings when only a few data instances are available. Semantic information from word embeddings is used to identify the shared semantics between the learned tasks and the new task. These shared semantics are then used to facilitate the learning of the new task and preserve previously learned knowledge. Yoon et al. [yoon2020xtarnet] proposed a few-shot continual learning framework called XtarNet containing a base model and different meta-learnable modules. When learning a new task, novel features, which are extracted by the meta-learnable feature extractor module, are combined with the base features to produce a task-adaptive representation. The combination process is controlled by another meta-learnable module. The task-adaptive representation helps the base model to quickly adapt to the new task. In object detection, Perez et al. [perez2020incremental] proposed OpeN-ended Centre nEt (ONCE), which adapts the CentreNet [zhou2019objects] detector to the continual learning scenario, where new tasks are registered with the help of meta-learning.

Besides the above-mentioned schemes, i.e., distillation in the presence of sufficient data instances and distillation in the presence of limited data instances, Yoon et al. [yoon2021federated] proposed a new problem setup called federated continual learning, which allows several clients to carry out continual learning each by itself through an independent stream of data instances that is inaccessible to other clients. To solve this problem, they proposed to separate parameters into global and task-specific parameters, whereby each client can obtain a weighted combination of the task-specific parameters of the other clients for knowledge transfer.

4.3 Memory based methods

Memory based methods [rebuffi2017icarl, chaudhry2019tiny, lopez2017gradient] generally have a memory buffer to store data instances and/or various other information related to the previous tasks, which are replayed during the learning of new tasks, in order to consolidate previously learned knowledge to mitigate catastrophic forgetting. This can be achieved through a variety of schemes, which we introduce below.

4.3.1 Retraining on memory buffer

Most memory based methods [rebuffi2017icarl, chaudhry2019tiny] preserve previously learned knowledge by retraining on the data instances stored in the memory buffer.

In image classification, Rebuffi et al. [rebuffi2017icarl] first proposed a typical method called incremental Classifier and Representation Learning (iCaRL). During representation learning, iCaRL utilizes both the stored data instances and the instances from the new task for training. During classification, iCaRL adopts a nearest-mean-of-exemplars classification strategy to assign the label of the given image to the class with the most similar prototype. The distance between data instances in the latent feature space is used to update the memory buffer. The original iCaRL method requires all data from the new task to be trained together. To address this limitation and enable the new instances from a single task to arrive at different time steps, Chaudhry et al. [chaudhry2019tiny] proposed Experience Replay (ER), which uses reservoir sampling [vitter1985random] to randomly sample a certain number of data instances from a data stream of unknown length and store them in the memory buffer. However, reservoir sampling [vitter1985random] works well only if each of the tasks has a similar number of instances, and it could lose information from the tasks that have significantly fewer instances than the others. Thus, several other sampling algorithms [aljundi2019gradient, liu2020mnemonics] were proposed to address this issue. Aljundi et al. [aljundi2019gradient] regarded the selection of stored data instances as a constraint selection problem and sampled the data instances that minimize the solid angle formed by their corresponding constraints. To reduce the computational cost, they further proposed a greedy version of the original selection algorithm. Liu et al. [liu2020mnemonics] optimized the stored exemplars as image-sized learnable parameters so that they capture the most representative data instances of the previous tasks. Kim et al. [kim2020imbalanced] proposed partitioning reservoir sampling, a modified version of reservoir sampling, to address the data imbalance problem. Chrysakis and Moens [chrysakis2020online] proposed Class-Balancing Reservoir Sampling (CBRS) as an alternative to reservoir sampling. CBRS is a two-phase sampling technique. During the first phase, all new data instances are stored in the memory buffer as long as the memory is not filled. After the memory buffer is filled, the second phase is activated to select which stored data instance should be replaced by the new data instance. Specifically, if the new instance belongs to a class that dominates the memory buffer currently or at some previous time step, it replaces a stored instance from that same class; otherwise, it replaces a stored instance from the class that currently dominates the memory buffer. Borsos et al. [borsos2020coresets] proposed a coreset-based alternative to reservoir sampling, which stores a weighted subset of data instances from each previous task in the memory buffer through a bilevel optimization with cardinality constraints. Buzzega et al. [buzzega2021rethinking] applied five different tricks, including Independent Buffer Augmentation, Bias Control, Exponential LR Decay, Balanced Reservoir Sampling and Loss-Aware Reservoir Sampling, to ER [chaudhry2019tiny] and other methods to show their applicability. Bang et al. [bang2021rainbow] proposed Rainbow Memory (RM) to select previous data instances, particularly in the scenario where different tasks can share the same classes. It uses classification uncertainty and data augmentation to improve the diversity of the stored data instances.
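The reservoir sampling strategy used by ER is simple enough to sketch directly; the buffer below (a minimal illustration, not the exact code of [chaudhry2019tiny]) keeps each seen example with equal probability without knowing the stream length in advance.

```python
import random

class ReservoirBuffer:
    """Fixed-size replay buffer filled by reservoir sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.data = []          # list of (x, y) pairs
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, x, y):
        """Each example ends up in the buffer with probability capacity / num_seen."""
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            idx = self.rng.randint(0, self.num_seen - 1)
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size: int):
        """Draw a random replay mini-batch to interleave with new-task batches."""
        k = min(batch_size, len(self.data))
        return self.rng.sample(self.data, k)
```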

There are also a few other methods [aljundi2019online, shim2021online] that focused on selecting data instances from the memory buffer for retraining. Aljundi et al. [aljundi2019online] proposed a method called Maximally Interfered Retrieval (MIR) to select a subset of stored data instances, which suffer from an increase in loss if the model parameters are updated based on new data instances. Shim et al. [shim2021online] proposed an Adversarial Shapley (AS) value scoring method, which selects the previous data instances that can mostly preserve their decision boundaries during the training of the new task.

Besides methods that store previous data instances in a single buffer, there are also a few other methods [wu2019large, pham2020contextual] that divide the memory buffer into several parts. Wu et al. [wu2019large] pointed out that current methods which perform well on small datasets with only a few tasks cannot maintain their performance on large datasets with thousands of tasks. Hence, they proposed the Bias Correction (BiC) method focusing on continual learning from large datasets. Noticing that a strong bias towards every new task actually exists in the classification layer, BiC specifically splits a validation set from the combination of the previous and new data instances and adds a linear bias correction layer after the classification layer to measure and then correct the bias using the validation set. Pham et al. [pham2020contextual] separated the memory into episodic memory and semantic memory. The episodic memory is used for retraining and the semantic memory is for training a controller that modifies the parameter of the base model for each task.

Some other methods [belouadah2019il2m, chaudhry2020using] have also been proposed to store other information together with previous data instances. Belouadah and Popescu [belouadah2019il2m] pointed out that initial statistics of the previous tasks could help to rectify the prediction scores of the previous tasks. They thus proposed Incremental Learning with Dual Memory (IL2M), which has a second memory to store statistics of the previous tasks. These statistics are then used as complementary information to handle the data imbalance problem. In their subsequent work [belouadah2020scail], the initial classifier of each task is stored in a separate memory. After the training of each new task, the classifier of each previous task is replaced by a scaled version of its initial stored classifier with the help of aggregate statistics. Another similar method [belouadah2020initial] was also proposed to standardize the stored initial classifier of each task, resulting in a fair and balanced classification among different tasks. Chaudhry et al. [chaudhry2020using] proposed Hindsight Anchor Learning (HAL) to store an anchor per task in addition to data instances. The anchor is selected by maximizing a forgetting loss term, and can thus be regarded as the most easily forgotten point of each task. By keeping the predictions on these anchors correct, the model is expected to mitigate catastrophic forgetting. Ebrahimi et al. [ebrahimi2020remembering] stored both data instances and their model visual explanations (i.e., saliency maps) in the memory buffer, and encouraged the model to remember the visual explanations for its predictions.

A few other methods were also proposed to store either compressed data instances [hayes2020remind] or activation volumes [pellegrini2020latent]. Hayes et al. [hayes2020remind] proposed REMIND which stores a compressed representation of previous data instances obtained by performing product quantization [jegou2010product]. Aside from image classification, REMIND has also been applied to visual question answering to show its generalizability. Pellegrini et al. [pellegrini2020latent] stored activation volumes of previous data instances obtained from the intermediate layers, leading to much faster computation speed.

Aside from image classification, a few works have focused on other computer vision problems, such as image semantic segmentation [tasar2019incremental], object detection [joseph2021towards], and analogical reasoning [hayes2021selective]. Tasar et al. [tasar2019incremental] applied the memory based method on image semantic segmentation, particularly on satellite imagery, by considering segmentation as a multi-task learning problem where each task represents a binary classification. They stored patches from previously learned images and trained them together with data instances from new tasks. Joseph et al. [joseph2021towards] applied the memory based method on open-world object detection and stored a balanced subset of previous data instances to fine-tune the model after the learning of every new task. Hayes and Kanan [hayes2021selective] proposed an analogical reasoning method, where Raven’s Progressive Matrices [zhang2019raven] are commonly used as the measurement method. They used selective memory retrieval, which is shown to be more effective compared to random retrieval in continual analogical reasoning.

4.3.2 Optimizing on memory buffer

As an alternative to the retraining scheme, which may overfit to the stored data instances, the optimization scheme [lopez2017gradient, chaudhry2018efficient] uses the stored data instances only to prevent the training process of the new task from interfering with previously learned knowledge, without retraining on them.

In image classification, Lopez-Paz and Ranzato [lopez2017gradient] proposed a typical method called Gradient Episodic Memory (GEM), which builds an inequality constraint to prevent the parameter update from increasing the loss of each individual previous task, approximated by the instances in the episodic memory, during the learning process of the new task. Chaudhry et al. [chaudhry2018efficient] further proposed an efficient alternative to GEM named Averaged GEM (A-GEM), which instead prevents the parameter update from increasing the average episodic memory loss. Sodhani et al. [sodhani2020toward] proposed to unify the GEM and Net2Net [chen2016accelerating] frameworks to enable the model to both mitigate catastrophic forgetting and increase its capacity.
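The A-GEM constraint in particular admits a compact closed form: if the proposed gradient g conflicts with the reference gradient g_ref computed on a batch from the episodic memory (negative inner product), g is projected so the conflict disappears. Below is a minimal sketch of that projection (our illustration, assuming gradients have already been flattened into single vectors).

```python
import torch

def agem_project(g: torch.Tensor, g_ref: torch.Tensor) -> torch.Tensor:
    """Project gradient `g` so it no longer conflicts with the memory gradient.

    If g . g_ref >= 0 the update is kept as-is; otherwise the component of g
    along g_ref is removed: g_tilde = g - (g . g_ref / ||g_ref||^2) * g_ref.
    """
    dot = torch.dot(g, g_ref)
    if dot >= 0:
        return g
    return g - (dot / (torch.dot(g_ref, g_ref) + 1e-12)) * g_ref

# Usage sketch: compute g on the current batch, g_ref on a replayed batch,
# then write the projected vector back into the model's parameter gradients.
```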

Aside from GEM [lopez2017gradient] and its extensions [chaudhry2018efficient], other methods [derakhshani2021kernel, tang2021layerwise] were also proposed to optimize the model on the stored data instances from various other perspectives. Derakhshani et al. [derakhshani2021kernel] proposed Kernel Continual Learning, which uses the stored data instances to train a non-parametric classifier with kernel ridge regression. Tang et al. [tang2021layerwise] separated the gradient of the stored data instances of the previous tasks into shared and task-specific parts. The gradient for the update is then enforced to be consistent with the shared part but orthogonal to the task-specific part. With this requirement, the common knowledge among the previous tasks can help the learning of the new task while the task-specific knowledge remains invariant to mitigate catastrophic forgetting.

Preserving the topology of previously learned knowledge has also recently become another interesting direction. Tao et al. [tao2020few] proposed to store the topology of the feature space of the previous tasks through a neural gas network [martinetz1991neural] and proposed the TOpology-Preserving knowledge InCrementer (TOPIC) framework to preserve the neural gas topology of the previous tasks and adapt to new tasks given a few data instances. Tao et al. [tao2020topology] proposed Topology-Preserving Class Incremental Learning (TPCIL) to store an elastic Hebbian graph (EHG) instead of previous data instances in the buffer. EHG is constructed based on competitive Hebbian learning [martinetz1993competitive] and represents the topology of the feature space. Inspired by the idea from human cognitive science that forgetting is caused by breaking topology in human memory, TPCIL mitigates catastrophic forgetting by injecting a topology-preserving term into the loss function.

Besides topology information, a few other methods [von2019continual, iscen2020memory] proposed to preserve various other kinds of information. Von Oswald et al. [von2019continual] proposed to store a task embedding for each task in the buffer. Then, given the task embedding, a task-conditioned hypernetwork is trained to output the corresponding model parameters. Iscen et al. [iscen2020memory] proposed to store only the feature descriptors of the previous tasks. When a new task comes, instead of co-training the new data instances with previous ones, the method conducts a feature adaptation between the stored feature descriptors and the feature descriptors of the new task. Ren et al. [ren2020wandering] introduced a new problem setup called Online Contextualized Few-Shot Learning (OC-FSL). They then proposed to store both a prototype per class and a contextual prototypical memory focusing particularly on the contextual information. Zhu et al. [zhu2021prototype] proposed to store a class-representative prototype for each previous task and to augment these prototypes to preserve the decision boundaries of the previous tasks. Self-supervised learning is also employed to learn more generalizable features from the previous tasks, facilitating the learning of the new task.

Joseph and Balasubramanian [joseph2020meta] recently proposed to incorporate the idea of meta learning in optimization. They pointed out that the distribution of the model parameters conditioned on a task can be seen as a meta distribution and proposed Meta-Consolidation for Continual Learning (MERLIN) to learn this distribution with the help of the learned task-specific priors stored in the memory.

Aside from the above methods in image classification, Ganea et al. [ganea2021incremental] applied the memory based method on few-shot image segmentation. They proposed to replace the fixed feature extractor with an instance feature extractor, leading to more discriminative embeddings for each data instance. The embeddings of all instances from each previous class are averaged and then stored in the memory. Hence, given a new task, extensive retraining from scratch is not necessary.

Besides the above two types of schemes to either retrain or optimize on the memory buffer, Prabhu et al. [prabhu2020gdumb] proposed GDumb, which consists of a greedy sampler and a dumb learner. GDumb greedily constructs the memory buffer from the data sequence with the objective of having a memory buffer with a balanced task distribution. At inference, GDumb simply trains the model from scratch on only the data in the memory buffer. The authors pointed out that even though GDumb is not particularly designed for continual learning, it outperforms several continual learning frameworks.

4.4 Generative replay based methods

Generative replay based methods [shin2017continual, kemker2017fearnet] have been proposed as an alternative to the memory based methods by replacing the memory buffer with a generative module which reproduces information related to the previous tasks. This can be achieved through a variety of schemes, which are introduced below.

4.4.1 Generating previous data instances

Typically, most generative replay based methods [shin2017continual, van2018generative] generate only previous data instances along with their corresponding labels.

In image classification, Shin et al. [shin2017continual] proposed a Deep Generative Replay (DGR) method, in which a GAN model is trained to generate previous data instances and pair them with their corresponding labels. Kemker and Kanan [kemker2017fearnet] also proposed a similar method called FearNet, which uses pseudo-rehearsal [robins1995catastrophic] to enable the model to revisit recent memories. Van de Ven and Tolias [van2018generative] pointed out that generative replay based methods can perform well, especially when combined with knowledge distillation, but their computation cost is usually very heavy. Hence, they proposed an efficient generative replay based method by integrating the helper GAN model into the main model used for classification. Cong et al. [cong2020gan] proposed a GAN memory to mitigate catastrophic forgetting by learning an adaptive GAN model. They adapted modified variants of style-transfer techniques to transfer the base GAN model towards each task domain, leading to increased quality of the generated data instances. Besides typical GAN models, various other generative models were also explored. Lesort et al. [lesort2019generative] investigated the performance of different generative models, including Variational AutoEncoders (VAEs), the conditional VAE (CVAE), the conditional GAN (CGAN), Wasserstein GANs (WGANs), and the Wasserstein GAN with Gradient Penalty (WGAN-GP). Among these generative models, the GAN was still found to have the best performance, but it was also pointed out that all generative models, including the GAN, struggle with more complex datasets. Rostami et al. [rostami2019complementary] trained an encoder to map different tasks into a task-invariant Gaussian Mixture Model (GMM). After that, for each new task, pseudo data instances of the previous tasks are generated from this GMM through a decoder and trained collaboratively with the current data instances.
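The following is a minimal sketch of a DGR-style training step (our simplification in PyTorch; `generator`, `old_solver`, and the loss weighting are assumed components): replayed samples drawn from a generator trained on previous tasks are labeled by the previous solver and mixed with the new task's real data.

```python
import torch
import torch.nn.functional as F

def generative_replay_step(solver, old_solver, generator, optimizer,
                           x_new, y_new, replay_ratio: float = 1.0):
    """One solver update on a mix of real new-task data and generated replay data."""
    optimizer.zero_grad()
    loss = F.cross_entropy(solver(x_new), y_new)

    if old_solver is not None and generator is not None:
        n_replay = int(replay_ratio * x_new.size(0))
        with torch.no_grad():
            x_replay = generator.sample(n_replay)          # assumed generator API
            y_replay = old_solver(x_replay).argmax(dim=1)  # old model provides labels
        loss = loss + F.cross_entropy(solver(x_replay), y_replay)

    loss.backward()
    optimizer.step()
    return loss.item()
```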

Wang et al. [wang2021ordisco] focused on utilizing unlabelled data instances and proposed Continual Replay with Discriminator Consistency. A minimax adversarial game strategy [goodfellow2014generative] was adapted where the output distribution of the generator was used to update the classifier and the output pseudo labels of the classifier were used to update the generator.

Aside from image classification, Wu et al. [wu2018memory] applied continual learning to class-conditional image generation. They proposed to generate previous data instances and train the model on both the generated and the new data instances.

4.4.2 Generating previous data instances along with latent representations

A few recent works [van2020brain, ye2020learning] have also proposed to generate both previous data instances and their latent representations to further consolidate previously learned knowledge.

In image classification, Van de Ven et al. [van2020brain] pointed out that generating previous data instances for problems with complex inputs (e.g., natural images) is challenging. Hence, they proposed to also generate latent representations of data instances via context-modulated feedback connection of the network. Ye and Bors [ye2020learning] proposed a lifelong VAEGAN which also learns latent representations of previous data instances besides the typical generative replay. It learns both shared and task-specific latent representations to facilitate representation learning.

Apart from the above-mentioned schemes, i.e., generating previous data instances and generating previous data instances along with latent representations, Campo et al. [campo2020continual] applied the generative replay based method to general frame prediction tasks without utilizing the generated data instances. More precisely, they proposed to use the low-dimensional latent features between the encoder and the decoder of a Variational Autoencoder (VAE), which can further be integrated with other sensory information using a Markov Jump Particle Filter [baydoun2018learning].

4.5 Parameter isolation based methods

Parameter isolation based methods [rusu2016progressive, yoon2017lifelong, fernando2017pathnet, mallya2018packnet] generally assign different model parameters to different tasks to prevent later tasks from interfering with previously learned knowledge. This can be achieved through a variety of schemes, which are described below.

4.5.1 Dynamic network architectures

The majority of parameter isolation based methods [rusu2016progressive, aljundi2017expert] assign different model parameters to different tasks by dynamically changing the network architecture.

In image classification, Rusu et al. [rusu2016progressive] proposed a typical method called Progressive Network, which trains a new neural subnetwork for each new task. Through the training of each new subnetwork, feature transfer from the subnetworks learned on the previous tasks is enabled by lateral connections. An analytical method based on the Fisher information matrix shows the effectiveness of this architecture in mitigating catastrophic forgetting. Aljundi et al. [aljundi2017expert] introduced a network of experts in which an expert gate was designed to only select the most relevant previous task to facilitate learning the new task. At test time, the expert gate structure is utilized to select the most appropriate model for a given data instance from a certain task. Yoon et al. [yoon2017lifelong] proposed a Dynamically Expandable Network (DEN) which utilizes the knowledge learned from the previous tasks and expands the network structure when the previous knowledge is not enough for handling the new task. The addition, replication, and separation of neurons were designed to expand the network structure. Xu and Zhu [xu2018reinforced] further proposed a Reinforced Continual Learning (RCL) method, which uses a recurrent neural network trained following an actor-critic strategy to determine the optimal number of nodes and filters that should be added to each layer for each new task. Li et al.
[li2019learn] then proposed a Learn-to-Grow framework that searches for the optimal model architecture and trains the model parameters separately per task. Hung et al. [hung2019compacting] proposed a method called Compacting Picking Growing (CPG). For each new task, CPG freezes the parameters trained for all the previous tasks to prevent the model from forgetting any relevant information. Meanwhile, the new task is trained by generating a mask to select a subset of frozen parameters that can be reused for the new task and decide if additional parameters are necessary for the new task. After the above two steps, a compression process is conducted to remove the redundant weights generated by the training process of the new task. Lee et al. [lee2020neural] pointed out that during human learning, task information is not necessary and hence proposed a Continual Neural Dirichlet Process Mixture (CN-DPM) model which can be trained both with a hard task boundary and in a task-free manner. It uses different expert subnetworks to handle different data instances. Creating a new expert subnetwork is decided by a Bayesian non-parametric framework. Veniat et al. [veniat2020efficient] focused on building an efficient continual learning model, whose architecture consists of a group of modules representing atomic skills that can be combined in different ways to perform different tasks. When a new task comes, the model potentially reuses some of the existing modules and creates a few new modules. The decision of which modules to reuse and which new modules to create is made by employing a data-driven prior. Hocquet et al. [hocquet2020ova] proposed One-versus-All Invertible Neural Networks (OvA-INN) in which a specialized invertible subnetwork [dinh2014nice] is trained for each new task. At the test time, the subnetwork with the highest confidence score on a test sample is used to identify the class of the sample. Kumar et al. [kumar2021bayesian] proposed to build each hidden layer with the Indian Buffet Process [griffiths2011indian] prior. They pointed out that the relation between tasks should be reflected in the connections they used in the network, and thus proposed to update the connections during the training process of every new task. Yan et al. [yan2021dynamically] proposed a new method with a new data representation called super-feature. For each new task, a new task-specific feature extractor is trained while keeping the parameters of the feature extractors of the previous tasks frozen. The features from all feature extractors are then concatenated to form a super-feature which is passed to a classifier to assign a label. Zhang et al. [zhang2021few] proposed a Continually Evolved Classifier (CEC) focusing particularly on the few-shot scenario of continual learning. They first proposed to separate representation and classification learning, where representation learning is frozen to avoid forgetting, and the classification learning part is replaced by a CEC which uses a graph model to adapt the classifier towards different tasks based on their diverse context information.
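Many of the approaches above ultimately rely on some form of task-specific parameter allocation. As a rough illustration of the masking flavour of this idea (our sketch, not the exact mechanism of any single method above), the layer below keeps one binary mask per task over a shared weight tensor, freezes the weights claimed by earlier tasks, and lets a new task train only the still-free weights.

```python
import torch
from torch import nn

class MaskedLinear(nn.Module):
    """A linear layer whose weights are partitioned across tasks via binary masks."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # 1 where a weight is already owned (frozen) by some previous task.
        self.register_buffer("frozen", torch.zeros_like(self.weight))
        self.task_masks = {}  # task_id -> binary mask of weights used by that task

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # At inference for a given task, only the weights selected by that task's
        # mask (including reused frozen weights) contribute to the output.
        mask = self.task_masks.get(task_id, torch.ones_like(self.weight))
        return torch.nn.functional.linear(x, self.weight * mask, self.bias)

    def freeze_for_task(self, task_id: int, used: torch.Tensor):
        """After training a task, record its mask and freeze the weights it used."""
        self.task_masks[task_id] = used.clone()
        self.frozen = torch.clamp(self.frozen + used, max=1.0)

    def zero_frozen_grads(self):
        """Call after backward() so updates never touch weights of earlier tasks."""
        if self.weight.grad is not None:
            self.weight.grad.mul_(1.0 - self.frozen)
```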

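To make the dynamic-expansion idea more concrete, the following PyTorch-style sketch illustrates the super-feature construction of [yan2021dynamically] in a minimal form: extractors trained on previous tasks are frozen, a new extractor is added per task, and the concatenated features feed a unified classifier. The backbone, feature dimension, and class counts are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ExpandableNet(nn.Module):
    """Minimal sketch of dynamic expansion with a concatenated 'super-feature'."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.feat_dim = feat_dim
        self.extractors = nn.ModuleList()      # one feature extractor per task
        self.classifier = None                 # re-created as classes accumulate
                                               # (old classifier weights would be
                                               # carried over in practice)

    def add_task(self, num_total_classes):
        # Freeze all previously trained extractors.
        for extractor in self.extractors:
            for p in extractor.parameters():
                p.requires_grad = False
        # Add a new, trainable extractor for the new task (placeholder backbone
        # assuming 3x32x32 inputs).
        self.extractors.append(nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 32 * 3, self.feat_dim), nn.ReLU()))
        # The classifier takes the concatenated super-feature as input.
        self.classifier = nn.Linear(self.feat_dim * len(self.extractors),
                                    num_total_classes)

    def forward(self, x):
        super_feature = torch.cat([f(x) for f in self.extractors], dim=1)
        return self.classifier(super_feature)

net = ExpandableNet()
net.add_task(num_total_classes=10)   # task 1
net.add_task(num_total_classes=20)   # task 2: old extractor frozen, new one added
logits = net(torch.randn(4, 3, 32, 32))
```
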
Several other approaches [yoon2019scalable, kanakis2020reparameterizing, ebrahimi2020adversarial] have focused on separating shared and task-specific features to effectively prevent new tasks from interfering with previously learned knowledge. Yoon et al. [yoon2019scalable] proposed a method called Additive Parameter Decomposition (APD). APD first uses a hierarchical knowledge consolidation method to construct a tree-like structure for the model parameters of the previous tasks, where the root represents the parameters shared among all the tasks, the leaves represent the parameters specific to a single task, and the middle nodes represent the parameters shared by a subset of the tasks. Given a new task, APD first tries to encode the task with shared parameters and then expands the tree with task-specific parameters. Kanakis et al. [kanakis2020reparameterizing] proposed Reparameterized Convolutions for Multi-task learning (RCM), which separates the model parameters into a filter-bank part and a modulator part. The filter-bank part of each layer is pre-trained and shared among tasks to encode common knowledge, while the modulator part is task-specific and is separately fine-tuned for each new task. Ebrahimi et al. [ebrahimi2020adversarial] trained a common shared module to generate task-invariant features, and a task-specific module is added for each new task to generate task-specific features orthogonal to the task-invariant ones. Singh et al. [singh2020calibrating] separated the shared and the task-specific components by calibrating the activation maps of each layer with spatial and channel-wise calibration modules that adapt the model to different tasks. Singh et al. [singh2021rectification] further extended [singh2020calibrating] to both zero-shot and non-zero-shot continual learning. Verma et al. [verma2021efficient] proposed an Efficient Feature Transformation (EFT) that separates the shared and the task-specific features using efficient convolution operations [howard2017mobilenets].

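As a rough illustration of the shared/task-specific separation (in the spirit of RCM [kanakis2020reparameterizing]), the sketch below keeps a frozen, shared filter bank and derives the convolution kernels actually used for a task from a lightweight per-task modulator; the bank size, recombination rule, and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamConv(nn.Module):
    """Sketch of a shared filter bank recombined by per-task modulators."""
    def __init__(self, in_ch, out_ch, bank_size=64, k=3):
        super().__init__()
        # Shared filter bank, assumed pre-trained and kept frozen.
        self.bank = nn.Parameter(torch.randn(bank_size, in_ch, k, k),
                                 requires_grad=False)
        # One lightweight modulator (linear recombination) per task.
        self.modulators = nn.ParameterList()
        self.out_ch = out_ch

    def add_task(self):
        self.modulators.append(nn.Parameter(torch.randn(self.out_ch,
                                                        self.bank.shape[0])))

    def forward(self, x, task_id):
        # Task-specific kernels = modulator-weighted combination of bank filters.
        w = torch.einsum('ob,bikj->oikj', self.modulators[task_id], self.bank)
        return F.conv2d(x, w, padding=1)

layer = ReparamConv(in_ch=3, out_ch=16)
layer.add_task()                                     # modulator for task 0
y = layer(torch.randn(2, 3, 32, 32), task_id=0)      # -> (2, 16, 32, 32)
```
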
Aside from image classification, a few recent works [zhai2020piggyback, zhai2021hyper] have focused on the conditional image generation problem. Zhai et al. [zhai2020piggyback] pointed out that not all parameters of a single Lifelong GAN [zhai2019lifelong] can be adapted to different tasks. They thus proposed Piggyback GAN, which maintains a filter bank containing the filters from the layers of the models trained on the previous tasks. Given a new task, Piggyback GAN learns to reuse a subset of filters in the filter bank and updates the bank by adding task-specific filters to ensure high-quality generation for the new task. Zhai et al. [zhai2021hyper] later pointed out that this method is not memory efficient and proposed Hyper-LifelongGAN, which applies a hypernetwork [ha2016hypernetworks] to the lifelong GAN framework. Instead of learning a deterministic combination of filters for each task, the hypernetwork learns a dynamic filter as task-specific coefficients that are multiplied with a task-independent base weight matrix.

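A minimal sketch of the hypernetwork idea described above: a small network maps a learned task embedding to task-specific coefficients, which are multiplied with a task-independent base weight matrix to produce the filters used for that task. The embedding size, coefficient granularity, and module names are illustrative assumptions, not the Hyper-LifelongGAN architecture.

```python
import torch
import torch.nn as nn

class HyperFilter(nn.Module):
    """Task coefficients x shared base weights -> per-task convolution filters."""
    def __init__(self, emb_dim=8, out_ch=16, in_ch=3, k=3, max_tasks=10):
        super().__init__()
        # Task-independent base weight matrix, shared across all tasks.
        self.base = nn.Parameter(torch.randn(out_ch, in_ch * k * k))
        # Hypernetwork mapping a learned task embedding to per-filter coefficients.
        self.task_emb = nn.Embedding(max_tasks, emb_dim)
        self.hyper = nn.Linear(emb_dim, out_ch)
        self.shape = (out_ch, in_ch, k, k)

    def forward(self, task_id):
        coeff = self.hyper(self.task_emb(task_id))   # (out_ch,) coefficients
        w = coeff.unsqueeze(1) * self.base           # scale each base filter
        return w.view(*self.shape)                   # filters for this task

gen_filters = HyperFilter()(torch.tensor(0))         # filters for task 0
```
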
Further to this, some works have focused on other computer vision problems, such as action recognition [li2021elsenet] and multi-task learning [hung2019increasingly]. Li et al. [li2021elsenet] proposed an Elastic Semantic Network (Else-Net) for skeleton-based human action recognition, which comprises a base model followed by a stack of multiple elastic units. Each elastic unit consists of multiple learning blocks and a switch block. For every new action treated as a task, only the most relevant block in each layer is selected by the switch block to facilitate training, while all the other blocks are frozen to preserve previously learned knowledge. Hung et al. [hung2019increasingly] proposed a Packing-And-Expanding (PAE) method which sequentially learns face recognition, facial expression understanding, gender identification, and other relevant tasks in a single model. PAE improves PackNet [mallya2018packnet] in two aspects. Firstly, PAE adopts an iterative pruning procedure [zhu2017prune] to construct a more compact model than PackNet. Secondly, PAE allows the model architecture to be expanded when the remaining trainable parameters are not sufficient to effectively learn the new task.

4.5.2 Fixed network architectures

Figure 5: Illustration of PathNet [fernando2017pathnet], which freezes the model parameters along all paths selected by the previous tasks, and re-initializes and retrains the remaining model parameters following the same process to select the best path.

Other than parameter isolation based methods that dynamically change the network architecture, several other works [fernando2017pathnet, serra2018overcoming] have designed fixed network architectures that can still assign different parameters to handle different tasks.

In image classification, Fernando et al. [fernando2017pathnet] proposed PathNet. During the training of the first task, several random paths through the network are selected, and a tournament selection genetic algorithm is then utilized to select the best path to be trained for the task. For each following task, as shown in Figure 5, the model parameters along all paths selected by the previous tasks are frozen, and the remaining parameters are re-initialized and trained again following the above process. Serra et al. [serra2018overcoming] proposed HAT, which calculates an almost binary attention vector for each task during training. For each following task, all the attention vectors calculated before are used as masks to freeze the network parameters that are crucial for the previous tasks. A similar idea was also proposed by Mallya et al. [mallya2018piggyback]. Rajasegaran et al. [rajasegaran2019random] pointed out that PathNet [fernando2017pathnet] incurs heavy computational costs. Hence, they proposed RPS-Net, which reuses paths of the previous tasks to achieve a much lower computational overhead. Abati et al. [abati2020conditional] proposed a gating module for each convolutional layer to select a limited set of filters and protect the filters of the previous tasks from being unnecessarily updated. A sparsity objective is also used to make the model more compact. Unlike [fernando2017pathnet, serra2018overcoming, abati2020conditional], Shi et al. [shi2021continual] proposed a Bit-Level Information Preserving (BLIP) method, which updates the model parameters at the bit level by estimating the information gain on each parameter. A certain number of bits are frozen to preserve the information gain provided by the previous tasks.

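The masking mechanism shared by HAT-style methods can be sketched as follows: each task learns a near-binary gate over a layer's units, the gates of finished tasks are accumulated, and gradients flowing into units claimed by previous tasks are zeroed. The gating function, scaling factor, and consolidation rule below are simplified assumptions rather than the exact HAT procedure.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Sketch of per-task hard-attention masks over a layer's output units."""
    def __init__(self, in_dim, out_dim, max_tasks=10, s=50.0):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.task_emb = nn.Embedding(max_tasks, out_dim)   # mask logits per task
        self.s = s                                         # gate "hardness"
        self.register_buffer('cum_mask', torch.zeros(out_dim))  # union of old masks

    def forward(self, x, task_id):
        mask = torch.sigmoid(self.s * self.task_emb(task_id))   # near-binary gate
        return self.fc(x) * mask

    def consolidate(self, task_id):
        # After finishing a task, remember which units it uses.
        with torch.no_grad():
            mask = (torch.sigmoid(self.s * self.task_emb(task_id)) > 0.5).float()
            self.cum_mask = torch.maximum(self.cum_mask, mask)

    def constrain_grads(self):
        # Zero the gradients of weights feeding units claimed by previous tasks.
        if self.fc.weight.grad is not None:
            self.fc.weight.grad *= (1.0 - self.cum_mask).unsqueeze(1)
            self.fc.bias.grad *= (1.0 - self.cum_mask)

layer = MaskedLinear(32, 64)
out = layer(torch.randn(4, 32), torch.tensor(0))   # forward pass for task 0
```
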
Besides the above-mentioned methods, freezing a subset of the model parameters with the help of meta learning has also received attention recently. Beaulieu et al. [beaulieu2020learning] proposed a Neuromodulated Meta-Learning Algorithm (ANML), which contains a neuromodulatory neural network and a prediction learning network. The neuromodulatory network is meta-trained to activate only a subset of the parameters of the prediction learning network for each task in order to mitigate catastrophic forgetting. Hurtado et al. [hurtado2021optimizing] proposed a method called Meta Reusable Knowledge (MARK), which maintains a single common knowledge base for all the learned tasks. For each new task, MARK uses meta learning to update the common knowledge base and uses a trainable mask to extract the relevant parameters from the knowledge base for the task.

Unlike the above-mentioned methods that freeze a subset of model parameters per task, Mallya and Lazebnik [mallya2018packnet] proposed PackNet, which first trains the whole neural network on each new task and then uses a weight-based pruning technique to free up unnecessary parameters. Adel et al. [adel2019continual] proposed Continual Learning with Adaptive Weights (CLAW), which adapts the model parameters at different scales in a data-driven way. More precisely, CLAW is a variational inference framework that trains three new parameters for each neuron: one binary parameter representing whether or not to adapt the neuron, and two other parameters representing the magnitude of adaptation.

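A rough sketch of the PackNet-style workflow: after training on a task, only the largest-magnitude weights among the still-free parameters are kept and marked as owned by that task, while the rest are zeroed and left free for future tasks. The pruning ratio, ownership bookkeeping, and the single-layer scope are simplifications.

```python
import torch
import torch.nn as nn

def prune_and_assign(layer: nn.Linear, free_mask: torch.Tensor,
                     task_id: int, owner: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the top-magnitude free weights for this task; zero and free the rest."""
    with torch.no_grad():
        w = layer.weight
        free_w = w[free_mask.bool()].abs()
        if free_w.numel() == 0:
            return owner
        threshold = torch.quantile(free_w, 1.0 - keep_ratio)
        keep = free_mask.bool() & (w.abs() >= threshold)
        w[free_mask.bool() & ~keep] = 0.0    # pruned weights stay free
        owner[keep] = task_id                # kept weights now belong to this task
    return owner

layer = nn.Linear(64, 32)
owner = torch.full_like(layer.weight, -1)    # -1 = unassigned / free
free = owner.eq(-1).float()
owner = prune_and_assign(layer, free, task_id=0, owner=owner)
# While training later tasks, gradients of owned weights would be zeroed, e.g.:
# layer.weight.grad *= owner.eq(-1).float()
```
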
Hu et al. [hu2018overcoming] proposed a Parameter Generation and Model Adaptation (PGMA) method, in which the model parameters are dynamically changed at inference time instead of training time. The model includes both a set of shared parameters and a parameter generator. At inference time, the parameter generator takes the test data instance as input and outputs another set of parameters. The shared parameters and the generated parameters are then used together to classify the test data instance.

Aside from image classification, some works have focused on other computer vision problems, such as interactive image segmentation [zheng2021continual] and image captioning [del2020ratt]. Zheng et al. [zheng2021continual] proposed to activate only a subset of convolutional kernels by employing the Bayesian non-parametric Indian Buffet Process [griffiths2011indian], which results in extracting the most discriminative features for each task. The kernels that are frequently activated in the previous tasks are encouraged to be re-activated for new tasks to facilitate knowledge transfer. Chiaro et al. [del2020ratt] applied parameter isolation to image captioning by extending the idea of HAT [serra2018overcoming] to recurrent neural networks. They pointed out that image captioning allows the same words to describe images from different tasks, and hence, compared to HAT, they allowed the vocabulary masks of different tasks to share elements.

4.6 Combination of multiple categories of methods

To enhance the performance of continual learning, several other works proposed to combine two or more categories of the aforementioned techniques.

4.6.1 Combination of regularization and memory based methods

Several approaches [nguyen2017variational, aljundi2019task] take advantage of both regularization and memory based methods to enhance continual learning performance in image classification. Nguyen et al. [nguyen2017variational] proposed Variational Continual Learning (VCL), which combines variational inference with coreset data summarization [bachem2015coresets]. VCL recursively combines the posterior information of the existing model with the likelihood information calculated from the new task, while maintaining a coreset of important previous data instances. Similarly, Kurle et al. [kurle2019continual] used both the posterior information and a subset of previous data instances; however, they introduced a new update method and further adapted the Bayesian neural network to non-stationary data. Titsias et al. [titsias2019functional] pointed out that previous methods such as EWC [kirkpatrick2017overcoming] and VCL [nguyen2017variational] suffer from brittleness, as a subset of model parameters is preserved for the previous tasks while the other parameters are further trained on the new task. They thus proposed a functional regularization approach that marginalizes out the task-specific weights to resolve this problem, and additionally stores a subset of inducing points for each task. When a new task comes, these stored inducing points are used to perform functional regularization and mitigate catastrophic forgetting. Chen et al. [chen2021overcoming] pointed out that the assumption made by VCL [nguyen2017variational], i.e., that the shared and the task-specific parameters are independent of each other, does not hold in general. Hence, they proposed to use an energy-based model [lecun2006tutorial] together with Langevin dynamics sampling [bussi2007accurate] as a regularizer to choose shared parameters that satisfy the independence assumption of VCL.

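To make the recursion underlying VCL-style methods concrete, the approximate posterior after task t is obtained by projecting the product of the previous approximate posterior and the new task's likelihood back onto the chosen variational family; the coreset variant additionally holds out a small set of stored instances for a final refinement step before prediction. In the expression below, D_t denotes the data of task t and Z_t a normalizing constant:

```latex
q_t(\theta) \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}}\;
\mathrm{KL}\!\left( q(\theta) \,\Big\|\, \tfrac{1}{Z_t}\, q_{t-1}(\theta)\, p(D_t \mid \theta) \right),
\qquad q_0(\theta) = p(\theta).
```
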
Besides VCL [nguyen2017variational] and its extensions, various other image classification methods [riemer2018learning, aljundi2019task] have also been proposed. Riemer et al. [riemer2018learning] integrated a modified version of the Reptile algorithm [nichol2018reptile] into an experience replay method to effectively mitigate catastrophic forgetting while preserving the ability to learn in the future. Aljundi et al. [aljundi2019task] proposed a method that allows classes to be repeated across different tasks. They utilized the structure of the memory-aware synapse [aljundi2018memory] as a weight regularizer and stored hard data instances with the highest loss to help identify the important parameters of the previous tasks. Similar to [titsias2019functional], Pan et al. [pan2020continual] also employed functional regularization to regularize only the model outputs. However, unlike [titsias2019functional], which requires solving a discrete optimization problem to select inducing points, they proposed to store data instances that are close to the decision boundary. Tang and Matteson [tang2020graph] generated random graphs from the stored data instances and introduced a new regularization term penalizing the forgetting of the edges of the generated graphs. Mirzadeh et al. [mirzadeh2020linear] pointed out that the solutions obtained by continual learning and multi-task learning are linearly connected in the loss landscape. Based on this, they proposed Mode Connectivity SGD, which exploits this connection to treat continual learning as multi-task learning, with a memory based method employed to approximate the loss of the previous tasks.

4.6.2 Combination of knowledge distillation and memory based methods

A few other approaches [castro2018end, hou2018lifelong] have combined knowledge distillation and memory based methods. In image classification, Castro et al. [castro2018end] proposed a cross-distilled loss consisting of a knowledge distillation loss to preserve the knowledge acquired from the previous tasks and a cross-entropy loss to learn the new task. They also introduced a representative memory unit which performs herding-based selection and removal operations to update the memory buffer. Hou et al. [hou2018lifelong] pointed out that, with the help of knowledge distillation, storing only a small number of previous data instances can lead to a large improvement in mitigating catastrophic forgetting. They thus proposed a method named Distillation and Retrospection to obtain a better balance between the preservation of previous tasks and adaptation to new tasks. In video classification, Zhao et al. [zhao2021video] adapted the knowledge distillation based approach by separately distilling spatial and temporal knowledge. They also introduced a dual-granularity exemplar selection method to store only key frames of representative video instances from the previous tasks. In fake media detection, Kim et al. [kim2021cored] proposed to store the feature representations of both real and fake data instances to facilitate the knowledge distillation process.

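The distillation-plus-memory recipe underlying these methods can be sketched as a single loss that combines cross-entropy on the current batch (new data mixed with stored exemplars) with a distillation term keeping the old model's outputs on the old classes; the temperature, weighting, and class split below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_distilled_loss(new_logits, old_logits, targets, num_old_classes,
                         T=2.0, alpha=0.5):
    """CE on all classes + KD towards the old model's outputs on old classes."""
    ce = F.cross_entropy(new_logits, targets)
    # Distill only over the classes the old model knows about.
    p_old = F.log_softmax(new_logits[:, :num_old_classes] / T, dim=1)
    q_old = F.softmax(old_logits[:, :num_old_classes] / T, dim=1)
    kd = F.kl_div(p_old, q_old, reduction='batchmean') * (T * T)
    return alpha * kd + (1.0 - alpha) * ce

# Usage sketch: old_logits come from a frozen snapshot of the model taken before
# training on the new task; the batch mixes new data with stored exemplars.
new_logits = torch.randn(8, 15)          # 10 old + 5 new classes (placeholder)
old_logits = torch.randn(8, 10)
targets = torch.randint(0, 15, (8,))
loss = cross_distilled_loss(new_logits, old_logits, targets, num_old_classes=10)
```
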
4.6.3 Combination of knowledge distillation and generative replay based methods

The combination of knowledge distillation and generative replay based methods has also been explored [wu2018incremental, huang2021half]. In image classification, Wu et al. [wu2018incremental] proposed a new loss function that consists of a cross-entropy loss and a knowledge distillation loss. A GAN is employed to generate data instances from the previous tasks, which are then combined with real data instances from the new task to train the model. A scalar representing the bias towards data instances from the new task is also used to remove the bias caused by data imbalance. In image semantic segmentation, Huang et al. [huang2021half] proposed to generate fake images of the previous classes using a Scale-Aware Aggregation module and combine them with the images from the new task to facilitate the knowledge distillation process.

4.6.4 Combination of memory and generative replay based methods

A few other works [he2018exemplar, xiang2019incremental] have proposed to combine the memory and generative replay based techniques. In image classification, He et al. [he2018exemplar] proposed an Exemplar-Supported Generative Reproduction (ESGR) method, which leverages both generated data instances and stored real data instances to better mitigate catastrophic forgetting. Specifically, given a new task, ESGR first generates data instances of previous tasks by employing task-specific GANs. Then, the combination of the stored previous data instances, the generated instances, and the new data instances is used to train the classification model. Unlike ESGR [he2018exemplar], which requires real previous data instances to be stored, Xiang et al. [xiang2019incremental] proposed storing statistical information of previous data instances, such as means and covariances, which is fed to a conditional GAN to generate pseudo data instances. Ayub and Wagner [ayub2021eec] proposed a memory efficient method named Encoding Episodes as Concepts (EEC). While enough memory space remains, EEC stores compressed embeddings of data instances in the memory buffer and retrieves them while learning new tasks. When the memory space is full, EEC combines similar stored embeddings into centroids and covariance matrices to reduce the amount of memory required, and then generates pseudo data instances from these statistics while learning new tasks. In image semantic segmentation, Wu et al. [wu2019ace] proposed to store feature statistics of previous images to help train an image generator. The trained generator is then used to align the style of previous and newly coming images by matching their first and second-order feature statistics.

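The statistics-based replay used by some of these methods can be sketched as storing a mean and covariance per old class and sampling pseudo-features from the corresponding Gaussian while learning new tasks; the feature dimension, the Gaussian sampler, and the class-wise bookkeeping are assumptions for illustration.

```python
import torch

class ClassStatsMemory:
    """Store per-class feature statistics and sample pseudo-features for replay."""
    def __init__(self):
        self.stats = {}   # class_id -> (mean, covariance)

    def update(self, class_id, features):
        # features: (N, D) real features of one old class, collected during its task.
        mean = features.mean(dim=0)
        cov = torch.cov(features.T) + 1e-4 * torch.eye(features.shape[1])
        self.stats[class_id] = (mean, cov)

    def sample(self, class_id, n):
        mean, cov = self.stats[class_id]
        dist = torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)
        return dist.sample((n,))   # pseudo-features to mix into new-task batches

memory = ClassStatsMemory()
memory.update(class_id=0, features=torch.randn(100, 64))
replay_feats = memory.sample(class_id=0, n=32)       # (32, 64)
```
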
4.6.5 Combination of generative replay and parameter isolation based methods

A few other methods [ostapenko2019learning, rao2019continual] explore the combination of generative replay and parameter isolation based methods for image classification. Ostapenko et al. [ostapenko2019learning] combined the HAT method [serra2018overcoming] with generative replay using a modified AC-GAN [odena2017conditional] architecture. Rao et al. [rao2019continual] proposed a new problem setup called unsupervised continual learning, in which information including task labels, task boundaries, and class labels is unknown. To solve this problem, they proposed a Continual Unsupervised Representation Learning (CURL) framework, which can expand itself dynamically to learn new concepts. In addition, CURL employs mixture generative replay (MGR), an extension of DGR [shin2017continual], to conduct unsupervised learning without forgetting.

4.6.6 Other Combinations

Apart from the above-mentioned combination approaches, several other methods [yang2019adaptive, buzzega2020dark] have proposed other combinations for image classification. Yang et al. [yang2019adaptive] combined regularization and parameter isolation based methods. They proposed an Incremental Adaptive Deep Model (IADM), which attaches an attention module to the hidden layers of the model, enabling training with adaptive depth. IADM also incorporates adaptive Fisher regularization, which accounts for the distributions of both the previous and new data instances, to mitigate catastrophic forgetting of the previous tasks. Buzzega et al. [buzzega2020dark] combined regularization, knowledge distillation, and memory based methods and proposed Dark Experience Replay (DER). Compared to ER [chaudhry2019tiny], DER further distills previously learned knowledge with the help of Dark Knowledge [hinton2014dark] and regularizes the model based on the network's logits, so that consistent performance can be expected over a sequential stream of data without clear task boundaries.

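The core of DER-style training can be sketched as a cross-entropy loss on the incoming stream plus a regularizer that pulls the current logits on buffered samples towards the logits stored when those samples entered the buffer; the buffer handling and the weighting coefficient are simplified here.

```python
import torch
import torch.nn.functional as F

def der_loss(model, x_new, y_new, buffer, alpha=0.5):
    """Cross-entropy on the stream + MSE to the logits stored with buffered samples."""
    loss = F.cross_entropy(model(x_new), y_new)
    if buffer is not None:
        x_buf, stored_logits = buffer   # sampled from a small reservoir buffer
        loss = loss + alpha * F.mse_loss(model(x_buf), stored_logits)
    return loss

# Usage sketch with a toy model and a fake buffer entry.
model = torch.nn.Linear(64, 10)
x_new, y_new = torch.randn(8, 64), torch.randint(0, 10, (8,))
buffer = (torch.randn(4, 64), torch.randn(4, 10))   # (inputs, logits at storage time)
loss = der_loss(model, x_new, y_new, buffer)
loss.backward()
```
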
4.7 Other methods

Aside from the above-mentioned categories of methods, several other techniques [caccia2020online, yu2020self] have also been proposed for continual learning.

In image classification, several meta learning based methods have been proposed. Caccia et al. [caccia2020online] introduced a new scenario of continual learning, where the model is required to quickly solve new tasks while remembering the previous ones. Continual-MAML, an extension of MAML [finn2017model], was thus proposed to first initialize the model parameters with meta learning and then add new knowledge into the learned initialization only when the new task exhibits a significant distribution shift from the previously learned tasks. Javed and White [javed2019meta] proposed a meta-objective to learn representations that are naturally highly sparse and thus effectively mitigate catastrophic forgetting in continual learning. Jerfel et al. [jerfel2018reconciling] employed a meta-learner to control the amount of knowledge transfer between tasks and automatically adapt to a new task when a task distribution shift is detected. Rajasegaran et al. [rajasegaran2020itaml] proposed an Incremental Task-Agnostic Meta-learning (iTAML) method, which uses meta learning to separate the task-agnostic feature extractor from the task-specific classifier. In this way, iTAML first learns a task-agnostic model to predict the task, and then adapts to the predicted task in a second step. This two-phase learning enables iTAML to mitigate the data imbalance problem by updating the task-specific parameters separately for each task.

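As a minimal illustration of the meta-learning machinery these methods build on (not the full Continual-MAML or iTAML procedures), the sketch below performs one MAML-style inner adaptation step on a sampled task and an outer update of the initialization; the task sampler and model are toy placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
outer_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
inner_lr = 0.1
loss_fn = nn.CrossEntropyLoss()

def sample_task():
    # Placeholder task sampler: (support_x, support_y, query_x, query_y).
    return (torch.randn(8, 10), torch.randint(0, 2, (8,)),
            torch.randn(8, 10), torch.randint(0, 2, (8,)))

for _ in range(3):                                    # a few meta-iterations
    sx, sy, qx, qy = sample_task()
    # Inner step: adapt a functional copy of the weights on the support set.
    w, b = model.weight, model.bias
    support_loss = loss_fn(sx @ w.T + b, sy)
    gw, gb = torch.autograd.grad(support_loss, (w, b), create_graph=True)
    w_fast, b_fast = w - inner_lr * gw, b - inner_lr * gb
    # Outer step: evaluate the adapted weights on the query set, update the init.
    query_loss = loss_fn(qx @ w_fast.T + b_fast, qy)
    outer_opt.zero_grad()
    query_loss.backward()
    outer_opt.step()
```
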
There are also several generic frameworks that can be integrated with methods from the above-mentioned continual learning categories to further improve their performance. Yu et al. [yu2020semantic] pointed out that embedding networks suffer much less from catastrophic forgetting than typical classification networks. Hence, they adapted LwF [li2017learning], EWC [kirkpatrick2017overcoming], and MAS [aljundi2018memory] to embedding networks. Liu et al. [liu2020more] pointed out that multiple classifiers with different decision boundaries can help to mitigate catastrophic forgetting. Hence, they proposed a generic framework called MUlti-Classifier (MUC) to train additional classifiers for each task, particularly on out-of-distribution data instances. They showed that MUC can further mitigate catastrophic forgetting when combined with other methods such as MAS [aljundi2018memory] and LwF [li2017learning]. Mendez and Eaton [mendez2020lifelong] incorporated compositional learning into continual learning and proposed a generic framework that is agnostic to the underlying continual learning algorithm. The framework consists of several components, including linear models, soft layer ordering [meyerson2017beyond], and soft gating as a modified version of soft layer ordering, which can be combined in different orders to construct different models for different tasks. Given a new task, the model first uses the previously learned components to solve the task, and then updates the components and creates new components when necessary.

Several other continual learning methods have also been proposed to perform image classification from different perspectives. Stojanov et al. [stojanov2019incremental] proposed a synthetic incremental object learning environment called Continual Recognition Inspired by Babies (CRIB), which enables generating various amounts of repetition. They then showed that repetition is important in mitigating catastrophic forgetting during continual learning. Wu et al. [wu2021incremental] proposed ReduNet, which adapts the newly proposed "white-box" DNN derived from the rate reduction principle [chan2020deep] to mitigate catastrophic forgetting in continual image classification. Knowledge from the previous tasks is explicitly preserved in the second-order statistics of ReduNet. Chen et al. [chen2020long] adapted the lottery ticket hypothesis [frankle2018lottery] to continual learning. They proposed a bottom-up pruning method to build a sparse subnetwork for each task, which is lightweight but can still achieve comparable or even better performance than the original dense model. Zhu et al. [zhu2021self] mapped the representations of the previous tasks and the new task into the same embedding space and used their distance to guide the model to preserve previous knowledge. Random episodic training and self-motivated prototype refinement were also proposed to extend the representation ability of the feature space given only a few data instances per task. Liu et al. [liu2021adaptive] proposed Adaptive Aggregation Networks (AANets), which add a stable block and a plastic block to each residual level and aggregate their outputs to balance learning new tasks against remembering previous ones. Abdelsalam et al. [abdelsalam2021iirc] proposed a new problem setup called Incremental Implicitly-Refined Classification (IIRC), in which each data instance has a coarse label and a fine label. This setup is closer to real-world scenarios, as humans interact with the same family of entities multiple times and discover more granular information about them over time, while trying not to forget previous knowledge. Several continual learning approaches have been evaluated on this setup.

Aside from image classification, Yu et al. [yu2020self] introduced a continual learning method for image semantic segmentation. Before the training process of each new task begins, the model containing the previous knowledge is stored. After a new model is trained on data instances from the new task, a set of unlabelled data instances is fed into both the previous and the new models to generate pseudo labels, which are then fused by a Conflict Reduction Module to obtain the most accurate pseudo labels. Finally, a joint model is retrained with the unlabelled data instances and their corresponding pseudo labels to preserve previously learned knowledge.

5 Discussion

In the previous sections, we have reviewed recent continual learning methods across different computer vision tasks. In this section, we briefly discuss some potential directions that could be further investigated.

Firstly, most of the existing continual learning methods focus on the image classification problem. Although continual learning in image classification is a valuable topic, successful applications of continual learning to other computer vision problems are valuable as well. Given the different characteristics of different computer vision problems, simply adapting methods proposed for image classification may not lead to satisfactory performance elsewhere. For example, in video grounding, Jin et al. [jin2020visually] pointed out that a simple adaptation of ideas from image classification fails in the compositional phrase learning scenario of the language input. In visual question answering (VQA), Perez et al. [perez2018film] pointed out that their model cannot preserve previously learned knowledge well after being trained continuously on objects with different colors. Greco et al. [greco2019psycholinguistics] further pointed out that, besides the color of the objects, question difficulty and order also affect the degree of catastrophic forgetting in VQA, and that directly adapting methods from image classification can only mitigate catastrophic forgetting to a small extent. Besides video grounding and VQA, Parcalabescu et al. [parcalabescu2020seeing] also showed the failure of pre-trained vision and language models in the continual learning scenario of image-sentence alignment and counting, and pointed out the potential catastrophic forgetting problem. Beyond vision and language tasks, in image semantic segmentation, different classes can co-exist in the same image, so the model outputs corresponding to different classes can affect each other. Hence, Cermelli et al. [cermelli2020modeling] pointed out that a simple adaptation of ideas from image classification, which regards the previously learned tasks as background during the learning of the new task, can even encourage rather than mitigate catastrophic forgetting. Conversely, ideas that are not effective in image classification may become effective in other computer vision tasks. For example, in analogical reasoning, Hayes and Kanan [hayes2021selective] showed that selective memory retrieval can effectively mitigate catastrophic forgetting compared to random memory retrieval, whereas selective memory retrieval has been shown to be less effective in image classification [chaudhry2018riemannian, hayes2020remind]. Hence, we believe that besides image classification, continual learning in other computer vision problems is also worth further investigation.

Secondly, although Hung et al. [hung2019increasingly] have applied continual learning to multi-problem learning, almost all continual learning studies focus on developing continual learning algorithms for a single problem, such as image classification or image semantic segmentation. As knowledge can be shared among different computer vision problems, e.g., face recognition, facial identification, and facial anti-spoofing, how to continually learn such shared knowledge in a multi-problem learning setting is worth exploring.

Thirdly, most of the existing continual learning approaches focus on the fully supervised problem setup with no class overlap between different tasks. However, in real-world scenarios, it is more likely that data instances arriving at different time steps are unlabelled and share common classes. Hence, there still exists a gap between most existing continual learning problem setups and real-world scenarios. We believe that continual learning methods with alternative problem setups, including unsupervised learning, few-shot learning, or learning with no hard boundaries between tasks, also warrant further investigation.

Finally, many continual learning works focus on the evaluation of accuracy. Although some of them restrict their methods to limited memory space, the computational complexity and resource consumption, which are important factors for practical applications, are often not evaluated. Considering the deployment of continual learning on lightweight devices, the exploration of lightweight continual learning methods is also important.

6 Conclusion

Continual learning is an important complement to typical batch learning in neural network training. Recently, it has attracted broad attention, especially in the computer vision area. In this paper, we have given a comprehensive overview of the recent advances of continual learning in computer vision, covering the various techniques used and the subareas to which these methods have been applied. We have also briefly discussed some potential future research directions.

Acknowledgement

This work is supported by AISG-100E-2020-065, the SUTD Project PIE-SGP-Al2020-02, and the TAILOR project funded by EU Horizon 2020 research and innovation programme under GA No 952215.

References