FoCL: Feature-Oriented Continual Learning for Generative Models

03/09/2020 · by Qicheng Lao, et al. · Imagia Cybernetics Inc.

In this paper, we propose a general framework in continual learning for generative models: Feature-oriented Continual Learning (FoCL). Unlike previous works that aim to solve the catastrophic forgetting problem by introducing regularization in the parameter space or image space, FoCL imposes regularization in the feature space. We show in our experiments that FoCL has faster adaptation to distributional changes in sequentially arriving tasks, and achieves the state-of-the-art performance for generative models in task incremental learning. We discuss choices of combined regularization spaces towards different use case scenarios for boosted performance, e.g., tasks that have high variability in the background. Finally, we introduce a forgetfulness measure that fairly evaluates the degree to which a model suffers from forgetting. Interestingly, the analysis of our proposed forgetfulness score also implies that FoCL tends to have a mitigated forgetting for future tasks.




1 Introduction

Generative models have shown great potential for generating natural images in stationary environments under the assumption that training examples are available throughout training and are independent and identically distributed (i.i.d.). Continual Learning (CL), in contrast, is the ability to learn from a continuous stream of data in a non-stationary environment, which entails that not only the distribution of the data but also the nature of the task itself is subject to change. To learn under such circumstances, the model needs the plasticity to acquire new knowledge and the elasticity to deal with catastrophic forgetting Farquhar and Gal (2018). However, neural networks have been shown to deteriorate severely on previous tasks when trained in a sequential manner McCloskey and Cohen (1989).

Current scalable approaches to mitigate catastrophic forgetting (other categories, such as memory rehearsal/replay and dynamic architectures, are therefore not included here) can be grouped into two main categories: prior-based approaches, where the model trained on previous tasks acts as a prior for the model training on the current task via a regularization term applied to the parameters Kirkpatrick et al. (2017); Zenke et al. (2017); Nguyen et al. (2018); and replay-based approaches, where synthetic images are generated by a snapshot model to mimic the real data seen in previous tasks. These images are then used as additional training data to constrain the model from forgetting Shin et al. (2017); Wu et al. (2018). Both categories deal with forgetting through different means of regularization: prior-based approaches regularize in the parameter space, i.e., the parameters of the current model are regularized to stay close to those of the previous model (illustrated in Figure 1 (a)), whereas replay-based approaches remember previous tasks by regularizing in the image space (Figure 1 (b)). While both means of regularization have shown great success in the literature, each has its own limitations. For example, strong regularization in the parameter space can limit the expressive power of the model in acquiring new knowledge, while regularization in the image space may include the mapping of undesired noisy information, since the image space is not fully representative of the underlying nuances of a task.

Figure 1: Approaches in continual learning from the perspective of regularization space. x^(i) denotes the images generated by the model at task t for a previous task i (c^(i) is the conditioning factor), and f^(i) denotes the features representing task i.

In this work, we present a new framework in continual learning for generative models: Feature-oriented Continual Learning (FoCL), where we propose to address the catastrophic forgetting problem by imposing the regularization in a shared feature space (Figure 1 (c)). Instead of the strong regularization in the parameter space or image space used in previous works, we regularize the model in a more meaningful, and therefore more efficient, feature representation space. Regularizing in the feature space is related to the regularization in function space introduced in the parallel work of Titsias et al. (2020). In their approach, regularization is applied to a distribution over task-specific functions parameterized by neural networks, whereas in this work we regularize the high-level features directly. Moreover, their model is applied to supervised learning tasks rather than generative tasks.

We show in our experiments that our framework adapts faster to changes in the task distribution, and improves the performance of generative models on several benchmark datasets in task incremental learning settings. To construct the feature space, we propose several approaches: (1) an adversarially trained encoder, (2) distilled knowledge from the current model, and (3) a pretrained model representing prior knowledge about task descriptive information, which could later be applied to zero-shot learning scenarios. Moreover, we show that in some use case scenarios, FoCL performance can be further boosted by leveraging additional regularization in the image space, especially when high variability in the background would heavily disturb the feature representation learning. For those tasks, we therefore propose to extend FoCL with combined regularization strategies. Finally, we introduce a new metric, the forgetfulness score, to fairly evaluate the degree to which a model suffers from catastrophic forgetting. The analysis of forgetfulness scores offers complementary information from a new perspective compared to existing metrics, and it can also predict the potential performance on future tasks.

2 Related work

2.1 Continual learning for generative models

The two major approaches to generative models, namely generative adversarial networks (GANs) Goodfellow et al. (2014) and variational auto-encoders (VAEs) Kingma and Welling (2013), have been used in continual learning setups, where the objective is to learn a model capable of generating examples not only of the current task but also of previous tasks. To facilitate elasticity and preserve plasticity, one or a mixture of the aforementioned solutions to catastrophic forgetting has been applied. The performance of different generative models in the context of continual learning has also been compared in Lesort et al. (2019).

Nguyen et al. Nguyen et al. (2018) proposed variational continual learning (VCL), a VAE based model with separate heads (as conditioning factor) for different tasks. The task specific parameters help the plasticity of the model when dealing with different tasks. To address catastrophic forgetting, VCL uses variational inference within a Bayesian neural network to regularize the weights. In their framework, the posterior at the end of each task is set to be the prior for the next one. They also showed that rehearsal on a corpus of real examples from previous tasks can improve performance.

Other works have taken GAN-based approaches Seff et al. (2017); Shin et al. (2017); Wu et al. (2018). Seff et al. (2017) used elastic weight consolidation (EWC, first proposed by Kirkpatrick et al. (2017) in the supervised and reinforcement learning context) to prevent critical parameters for all previous tasks from changing. Shin et al. (2017) introduced deep generative replay (DGR), where a snapshot of the model trained on previous tasks is used to generate synthetic training data from previous tasks. The synthetic data is used to augment the training examples of the current task, thus providing a more i.i.d. setup. Wu et al. (2018) took a similar approach to Shin et al. (2017) but used a pixel-wise norm to regularize the model through memory replay on previous tasks.

FoCL takes inspiration from Shin et al. (2017) and Wu et al. (2018); however, in both of these approaches the memory replay regularizes the model in the image space, whereas in FoCL the memory replay is applied in the more task-representative feature space.

2.2 Feature matching for generative models

Feature matching has been explored as a means of measuring perceptual similarity in the context of generative adversarial networks Li et al. (2015); Dosovitskiy and Brox (2016); Salimans et al. (2016); Warde-Farley and Bengio (2017); Nguyen et al. (2017). Dosovitskiy and Brox (2016) used high-level features from a deep neural network to match visual similarity between real and fake images. They argued that while feature matching alone is not enough to provide a good loss function, it can lead to improved performance if used together with an objective on the image. Salimans et al. (2016) showed that feature matching as an auxiliary loss function can lead to more stable training and improved performance in semi-supervised scenarios.

Contrary to previous works that use feature matching to align the model distribution to the empirical data distribution, we use feature matching as a regularizer to remember knowledge from previous tasks. Intuitively, matching in the feature space allows us to go beyond pixel-level information to focus more on remembering factors of variation contributing to a particular task, and those that are shareable among different tasks.

3 Framework

3.1 Problem formalisation

Given a stream of generative tasks arriving sequentially, each of which has its own designated dataset D^(i) (for consistency, we use the superscript (i) throughout this paper to denote the task that the data or variable belongs to), the goal of continual learning in generative models is to learn a unified model parameterized by θ, such that p_θ(x^(i)) ≈ p_data(x^(i)) for all tasks i seen so far. In a conditional generator setting, x^(i) ∼ p_θ(x | c^(i)), where c^(i) is the category label for task i. The challenge comes from the non-i.i.d. nature of continual learning: at the current task t, the model has no access to the real data distributions of previous tasks. Therefore, it is critical that the model retains in memory all the knowledge learned in previous tasks (i.e., i < t) while solving the current new task. This can be formulated as:

    min_{θ^(t)}  D( p_data(x^(t)) ‖ p_{θ^(t)}(x | c^(t)) ) + Σ_{i=1}^{t−1} D( p_{θ^(t−1)}(x | c^(i)) ‖ p_{θ^(t)}(x | c^(i)) )    (1)

given the assumption that the snapshot model p_{θ^(t−1)} estimates the data distributions of previous tasks incrementally well for all i < t. As described earlier, current continual learning approaches for solving eq. (1) mostly fall into two main categories depending on the regularization space: parameter space Kirkpatrick et al. (2017); Zenke et al. (2017); Nguyen et al. (2018) and image space Shin et al. (2017); Wu et al. (2018).

3.2 Feature oriented continual learning

Here, we propose a new framework in continual learning for generative models, where we focus the regularization on the feature space, by introducing an encoder function Enc that maps images x into low-dimensional feature representations f = Enc(x). To mitigate catastrophic forgetting, we regularize the model to remember previous tasks by explicitly matching the high-level features learned through previous tasks. This updates the problem defined in eq. (1) to:

    min_{θ^(t)}  D( p_data(x^(t)) ‖ p_{θ^(t)}(x | c^(t)) ) + Σ_{i=1}^{t−1} D( p_{θ^(t−1)}(f | c^(i)) ‖ p_{θ^(t)}(f | c^(i)) )    (2)

Intuitively, moving the regularization to the feature space has two potential benefits. First, it allows the model to focus on matching more representative information instead of parameters or pixel-level information, i.e., not all pixels in the image space contribute equally to each task. Second, in cases where task representative information (i.e., task descriptors) is available, we can leverage that information directly in the alignment process without relying solely on the snapshot model from the previous task. In fact, matching in the feature space can also be viewed as a generalized version of matching in the image space, which allows more flexibility in the regularization during optimization.

To maintain generality, we denote by D(p ‖ q) any divergence function that measures the disparity between distributions p and q. We can therefore write the objective function for solving eq. (2) as:

    L(θ^(t)) = D₁( p_data(x^(t)) ‖ p_{θ^(t)}(x | c^(t)) ) + Σ_{i=1}^{t−1} D₂( p_{θ^(t−1)}(f | c^(i)) ‖ p_{θ^(t)}(f | c^(i)) )    (3)

where D₁ and D₂ are appropriate instances of divergence functions. Common choices for the divergence function include the Kullback-Leibler divergence, the Jensen-Shannon divergence and the Bregman divergence D_F(p ‖ q) = F(p) − F(q) − ⟨∇F(q), p − q⟩, where F is a continuously-differentiable and strictly convex function. For instance, in a GAN setup, minimizing the first divergence term in eq. (3) (current task) can be expressed as a minmax game:

    min_{θ^(t)} max_φ  E_{x ∼ p_data(x^(t))}[ T₁(D_φ(x)) ] − E_{x ∼ p_{θ^(t)}(x | c^(t))}[ T₂(D_φ(x)) ]    (4)

where T₁ and T₂ are appropriate functions applied to a discriminator D_φ parameterized by φ, depending on the chosen divergence function Grover et al. (2018). In a VAE setup, minimizing D₁ is equivalent to maximizing the marginal log-likelihood over the data:

    max_{θ^(t)}  E_{x ∼ p_data(x^(t))}[ log p_{θ^(t)}(x | c^(t)) ]    (5)

which is then often approximated by the evidence lower bound (ELBO) Kingma and Welling (2013). For the second divergence term in eq. (3) (previous tasks), the matching of empirical feature distributions can be done through the alignment of feature examples generated by the two generative processes (θ^(t−1) and θ^(t)).

As a proof of concept, in this work we choose the Wasserstein distance for D₁ (i.e., a Wasserstein GAN setup Arjovsky et al. (2017)), and for the feature matching divergence D₂ we experiment with both the Bregman divergence with F(x) = ‖x‖² (i.e., a feature-wise ℓ2 loss) (Section 4.3) and the Wasserstein distance (Section 4.2).
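As a concrete illustration of the Bregman choice above: with F(x) = ‖x‖², the Bregman divergence between two feature vectors reduces exactly to the squared Euclidean distance, i.e., the feature-wise ℓ2 loss. A minimal NumPy sketch (function and variable names are our own, not from the paper's code):

```python
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    """D_F(p, q) = F(p) - F(q) - <grad_F(q), p - q>."""
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

# With F(x) = ||x||^2 (squared norm), the Bregman divergence reduces to
# the squared Euclidean distance, i.e. a feature-wise l2 loss.
F = lambda x: float(np.dot(x, x))
grad_F = lambda x: 2.0 * x

p = np.array([1.0, 2.0, 3.0])   # e.g. features from the snapshot model
q = np.array([0.5, 1.5, 2.0])   # e.g. features from the current model
d = bregman_divergence(F, grad_F, p, q)
assert np.isclose(d, np.sum((p - q) ** 2))  # 0.25 + 0.25 + 1.0 = 1.5
```

This identity is why the feature-wise ℓ2 loss used in Section 4.3 can be read as a particular Bregman instance of the general divergence D₂.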

Figure 2: We propose three alternatives for constructing the feature space: (1) Adversarially learned encoder, (2) Distilled knowledge from the image discriminator, and (3) Prior knowledge from a pretrained model. Top: learning current task, bottom: replay from memory for previous tasks.

3.3 Constructing the feature space

As illustrated in Figure 2, we present three alternatives for constructing the feature space. For simplicity of notation, we denote by f_t^(i) the features for a previous task i encoded from images generated by the current model θ^(t), and by f_{t−1}^(i) the features for the same task obtained through the snapshot model θ^(t−1). The first way to construct the feature space is through an adversarially learned encoder Enc (Figure 2, 1⃝ Learned encoder), and the features can be obtained by:

    f_t^(i) = Enc( x_t^(i) ),   f_{t−1}^(i) = Enc( x_{t−1}^(i) )    (6)

where x_t^(i) and x_{t−1}^(i) are images for task i generated by the current and snapshot models, respectively. The Enc is trained to compete with a discriminator that distinguishes pairs of f_t^(i) and f_{t−1}^(i).

Another way to construct the feature space is by means of knowledge distillation on the intermediate features from the image discriminator (Figure 2, 2⃝ Distilled knowledge; the idea of knowledge distillation for neural networks was proposed by Hinton et al. (2015) and was extended to continual learning for supervised models by Li and Hoiem (2018)). This is similar to the knowledge distillation adopted in Lifelong GAN Zhai et al. (2019), where multiple levels of knowledge distillation are used. Alternatively, depending on the availability of prior knowledge about the tasks, we can directly use such knowledge as features in the matching process. In this work, due to the unavailability of task descriptive information in current benchmark datasets, we consider features extracted from a pretrained model (pretrained on task-irrelevant data) as representatives of our prior knowledge (Figure 2, 3⃝ Prior knowledge). Note that the use of prior knowledge or task descriptive information does not violate the assumptions of continual learning in most use cases, since the true data distributions of previous/future tasks remain inaccessible. The intuition is that if a model has already been trained to draw certain concepts separately in previous tasks, it should know how to draw a new combination of them zero-shot, provided we learn a good feature space that renders the corresponding task descriptors. This feature space could also be used for our proposed feature matching process; however, the features need to cover the variability of information with respect to the task space.
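To make the replay-time feature matching concrete, the sketch below contrasts a frozen snapshot model with the current model through a shared encoder and computes the feature-wise ℓ2 replay loss. All models here are stand-in linear maps of our own invention, not the paper's architectures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear "models" (toy constructions, not the paper's networks):
# a frozen snapshot generator, a drifted current generator, and a shared
# feature encoder.
W_snapshot = rng.normal(size=(8, 4))                    # G_{t-1}, frozen
W_current = W_snapshot + 0.1 * rng.normal(size=(8, 4))  # G_t, in training
W_enc = rng.normal(size=(4, 8))                         # Enc

def generate(W, z):
    """Toy 'image' generation: a linear map from noise z."""
    return z @ W.T

def encode(x):
    """Toy feature extraction: project images to a 4-d feature space."""
    return x @ W_enc.T

# Replay step: pass the same conditioning noise through both models and
# match the resulting feature representations instead of raw pixels.
z = rng.normal(size=(16, 4))
f_prev = encode(generate(W_snapshot, z))  # features remembered for task i
f_curr = encode(generate(W_current, z))   # features of the current model

# Feature-wise l2 replay loss (Bregman divergence with F(x) = ||x||^2).
replay_loss = float(np.mean(np.sum((f_curr - f_prev) ** 2, axis=1)))
assert replay_loss > 0.0
# A model identical to the snapshot incurs zero replay loss.
assert np.allclose(encode(generate(W_snapshot, z)), f_prev)
```

In the actual framework this scalar would be the second term of eq. (3), added to the current-task GAN or VAE objective at every replay iteration.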

3.4 Combined regularization spaces

In addition to standalone usage, FoCL can be extended with other regularization strategies depending on the use case scenario. For example, if the images have high variability in the background, the performance can be further boosted by augmenting FoCL with additional regularization in the image space. In such cases, the objective function can be given as:

    L(θ^(t)) = D₁( p_data(x^(t)) ‖ p_{θ^(t)}(x | c^(t)) ) + Σ_{i=1}^{t−1} [ (1 − λ) D_img^(i) + λ D_feat^(i) ]    (7)

where D_img^(i) denotes the divergence between the image distributions of the snapshot and current models for task i, D_feat^(i) denotes the corresponding divergence between feature distributions as in eq. (3), and the choice of the hyperparameter λ (0 ≤ λ ≤ 1) depends on the given data. To distinguish it from standalone FoCL, we name this extended version λ-FoCL.
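One way to instantiate the combined replay regularizer is a convex combination of the image- and feature-space terms, which matches the endpoints reported in Table 4 (λ = 0 recovers MeRGAN-style image replay; λ = 1 recovers standalone FoCL). The exact weighting scheme in this sketch is an assumption:

```python
def combined_replay_loss(loss_image, loss_feature, lam):
    """Hypothetical lambda-FoCL replay regularizer as a convex combination.

    lam = 0.0 -> pure image-space replay (MeRGAN-style),
    lam = 1.0 -> pure feature-space replay (standalone FoCL).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * loss_image + lam * loss_feature

# Endpoint checks: the combined loss interpolates between the two spaces.
assert combined_replay_loss(2.0, 4.0, 0.0) == 2.0
assert combined_replay_loss(2.0, 4.0, 1.0) == 4.0
assert combined_replay_loss(2.0, 4.0, 0.5) == 3.0
```

Intermediate λ values trade background fidelity (image space) against task-representative structure (feature space), which is why mid-range λ performs best on CIFAR10 in Table 4.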

3.5 Forgetfulness measurement

The evaluation metrics currently used in continual learning research for generative models mostly focus on the overall quality of the generated images across all tasks, for example, the average classification accuracy based on a classifier pretrained on the real data (generated images of better quality give better accuracy) Wu et al. (2018); Ostapenko et al. (2019); van de Ven and Tolias (2018), the Fréchet inception distance (FID, originally proposed in Heusel et al. (2017)) Wu et al. (2018); Lesort et al. (2019), and the test log-likelihood of generated samples Nguyen et al. (2018). However, all these metrics fail to disentangle a pure forgetfulness measurement from the overall generative model performance, i.e., there are two varying factors in current evaluation metrics: the approach used to mitigate catastrophic forgetting, and the choice of generative model or architecture (e.g., GAN or VAE). We therefore propose a metric, which we call the forgetfulness score, for a fair comparison across different continual learning methods.

Figure 3: Proposed forgetfulness measurement.

As shown in Figure 3, for each previous task i, we compute the distance d_i^(i) between the generated data distribution and the true data distribution at the time task i was learned, and at the current task t, we recompute the same distance d_t^(i) (the subscript denotes the current task index) using the current model parameterized by θ^(t). The difference (d_t^(i) − d_i^(i)) measures the amount of forgetting from task i to the current task t. We define it as the task forgetfulness score F_t^(i) for task i:

    F_t^(i) = d_t^(i) − d_i^(i)    (8)

For the overall forgetfulness measurement, we average the weighted task forgetfulness scores: F_t = Σ_{i=1}^{t−1} w_i F_t^(i). Note that our proposed forgetfulness score requires the assumption that the model is capable of learning the current task well enough so that d_i^(i) is meaningful and comparable (e.g., a random model can result in an infinitely large d_i^(i), making the subtraction meaningless). To compensate for this when comparing different methods, we can adjust the original forgetfulness score (eq. (8)) by adding a penalty term on the current task t, resulting in the compensated forgetfulness score:

    F̃_t^(i) = F_t^(i) + d_t^(t)    (9)
We will show later in the experiments (Section 4.5) that our forgetfulness measurement offers complementary information that is not available with current commonly used metrics when comparing various methods. In addition, the slope of the curve for the task forgetfulness score can to a certain degree reveal the potential performance on future tasks.
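The bookkeeping behind these scores can be sketched with a small distance matrix; the FID-like numbers below are toy values for illustration only, and the compensated form shown is one reading of the current-task penalty (an assumption on our part):

```python
import numpy as np

# d[t, i]: distance (e.g. FID) between generated and real data of task i,
# measured by the model after training on task t (entries with i > t are
# unused). Toy numbers, not results from the paper.
d = np.array([
    [10.0,  0.0,  0.0],
    [14.0, 11.0,  0.0],
    [18.0, 15.0, 12.0],
])
t = 2  # current task index (0-based)

# Task forgetfulness score for each previous task i: d_t^(i) - d_i^(i).
F = np.array([d[t, i] - d[i, i] for i in range(t)])

# Overall score: weighted average of the task scores (uniform weights here).
overall = float(np.mean(F))

# Compensated score: add a penalty for how well the current task t itself
# is learned (our reading of the compensation; an assumption).
F_comp = F + d[t, t]

assert np.allclose(F, [8.0, 4.0])
assert overall == 6.0
assert np.allclose(F_comp, [20.0, 16.0])
```

A method that learns each task poorly (large diagonal entries) can show a deceptively small raw score F; the compensation penalizes exactly that case.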

4 Experiments

4.1 Setup

4.1.1 Implementation details

We follow similar architecture designs and hyperparameter choices as Wu et al. (2018) (whose code is publicly available) for a fair comparison, with hyperparameters set separately for the MNIST/Fashion MNIST datasets and for the SVHN dataset. For the feature encoder, 3 convolutional layers are used to extract a 128-dimensional feature vector. The feature discriminator is composed of 1 convolutional layer and 1 linear layer. For the pretrained VGG, we use the VGG-19 provided by TensorLayer. To reproduce the results of previous methods, we use the implementation from Nguyen et al. (2018) for the VCL, EWC and SI methods, and the implementation from Lesort et al. (2019) for the DGR method; in both cases we follow their choices of architecture designs and hyperparameters. For the Fashion MNIST dataset, we train the model for 400 epochs per task. For computing the average classification accuracies, we evaluate 6,400 generated samples using pretrained classifiers: for MNIST and SVHN we use the classifiers provided by Wu et al. (2018), and for Fashion MNIST we train our own classifier with ResNet-18. We likewise generate 6,400 images per FID score when calculating our forgetfulness scores.

4.1.2 Datasets

We perform our experiments on four benchmark datasets: MNIST LeCun (1989), SVHN Netzer et al. (2011), Fashion MNIST Xiao et al. (2017) and CIFAR10 Krizhevsky et al. (2009). SVHN contains 73,257 training and 26,032 test 32×32 pixel color images of house number digits. MNIST and Fashion MNIST each contain ten categories of black-and-white 28×28 pixel images, with 60,000 training and 10,000 test images. Images from these two datasets are zero-padded at the borders to form 32×32 images. CIFAR10 has 50,000 training and 10,000 test 32×32 color images in 10 classes. Following previous works Wu et al. (2018); Ostapenko et al. (2019), we consider each class as a different task.

4.2 Adversarially learned feature encoder

In this subsection, we construct our feature space by adversarially learning on matched and mismatched pairs of features. The features are extracted by an encoder that plays a minmax game with a discriminator. We use the WGAN-GP technique Gulrajani et al. (2017) to train our encoder, which makes our divergence function for matching the empirical feature distributions the Wasserstein distance, applied for all previous tasks i < t at task t. Similar to previous work Wu et al. (2018), we use conditional batch normalization De Vries et al. (2017) for the task conditioning, and integrate an auxiliary classifier (AC-GAN) Odena et al. (2017) to predict the category labels; the weight of the auxiliary classifier loss is treated as a hyperparameter.
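Since the adversarial encoder here estimates a Wasserstein distance between feature distributions via a critic, it may help to recall the one-dimensional empirical case, which has a closed form. This is a sanity-check sketch only; in the experiments the high-dimensional distance is estimated by the learned WGAN-GP critic:

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    the mean absolute difference of the order statistics (sorted samples)."""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    assert a.shape == b.shape
    return float(np.mean(np.abs(a - b)))

# Shifting a distribution by a constant shifts W1 by exactly that constant,
# unlike KL or JS divergences, which saturate for disjoint supports.
a = np.array([0.0, 1.0, 2.0])
assert np.isclose(wasserstein_1d(a, a + 1.0), 1.0)
assert np.isclose(wasserstein_1d(a, a), 0.0)
```

This smooth behavior under distribution shift is the usual motivation for preferring the Wasserstein distance when matching generated and remembered feature distributions.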

To compare our proposed FoCL (standalone) with previous methods, we first use the same quantitative metric (average classification accuracy) as previous works Wu et al. (2018); Ostapenko et al. (2019), for both 5 sequential tasks and 10 sequential tasks. As shown in Table 1, FoCL achieves better performance in the 10-task setting on all three datasets. On the Fashion MNIST dataset, it remarkably improves the accuracy from 80.46% to 90.26%. In the 5-task setting, FoCL also outperforms the previous state-of-the-art method MeRGAN Wu et al. (2018) on the MNIST and SVHN datasets; however, it has a weaker result on the Fashion MNIST dataset. Generally, we observe that FoCL tends to work better as the number of tasks grows, which is within our expectation, since constructing a representative feature space typically requires training on a certain number of samples and tasks. It is possible that our feature encoder has not been trained well in early tasks; this agrees with our results in Section 4.3, where the 5-task performance improves when prior knowledge is used as the feature encoder (i.e., the feature space is pre-constructed).

Method                     MNIST (%)              SVHN (%)               Fashion MNIST (%)
                           5 tasks    10 tasks    5 tasks    10 tasks    5 tasks    10 tasks
JT                         97.66      96.92       85.30      84.82       87.12      89.08
EWC Seff et al. (2017)     70.62      77.03       39.84      33.02       -          -
DGR Shin et al. (2017)     90.39      85.40       61.29      47.28       -          -
MeRGAN Wu et al. (2018)    98.19      97.01       80.90      66.78       92.17*     80.46*
FoCL†                      99.07      98.09       84.80      77.31       86.18      90.26
FoCL (mean ± std)          98.96±0.13 97.68±0.25  84.14±0.58 76.49±0.65  85.02±0.82 89.58±0.40
* results based on our experiments; † standalone FoCL, best result over 3 independent experiments.

Table 1: Performance comparison based on average classification accuracy.
Figure 4: Fast convergence in feature space for 10 tasks.
Figure 5: t-SNE visualizations of encoded features (extracted from adversarially learned encoders).

We also find in our experiments that performance can be improved simply by increasing the number of replay iteration steps when using the generative replay method Shin et al. (2017). We therefore further investigate how the number of replay iterations affects model performance when the regularization is switched from the image space to the feature space. Figure 4 shows that FoCL not only gives significantly better performance (e.g., on Fashion MNIST) but also adapts faster, suggesting that matching in the feature space is more efficient than in the image space, where many pixels can act as perturbing factors. This observation agrees with the findings in Bengio et al. (2020) that a high-level representation space can lead to fast adaptation to distributional changes due to non-stationarities.

AC weight   4K                       8K                       16K                      32K
            5 tasks     10 tasks     5 tasks     10 tasks     5 tasks     10 tasks     5 tasks     10 tasks
=0 *        62.26±0.99  53.06±1.69   71.90±0.53  63.07±0.77   80.18±0.73  71.47±0.31   84.14±0.58  76.49±0.65
=1          61.33±1.28  52.50±0.54   72.55±0.73  63.17±0.15   79.58±1.09  71.56±0.70   84.23±0.58  76.72±0.75
=1e-3       47.13±1.26  36.50±1.64   65.35±2.33  53.88±0.47   75.85±0.58  64.33±2.38   84.44±0.15  75.93±0.32
=1e-5       45.20±2.89  35.57±1.92   66.61±0.33  56.03±0.53   75.71±1.07  66.22±0.88   82.49±0.30  72.21±0.36
* without AC in the feature encoder

Table 2: Auxiliary classifier (AC) in the feature encoder on SVHN dataset.

To visualize the learned features, we encode the test images of each dataset for t-SNE visualization (Figure 5). On both MNIST and Fashion MNIST, our learned features are task-discriminative, whereas on SVHN the features contain task-agnostic information such as background color. The correlation between task-discriminative features and better classification accuracy raises the question of whether the former causally explains the latter. To answer this question, we integrate an auxiliary classifier into our encoder to further regularize the SVHN features to be task-discriminative. However, we do not observe significant performance changes (Table 2). Interestingly, we find that on both MNIST and Fashion MNIST, the embeddings for the last task (i.e., digit 9 and ankle boot) are meaningfully allocated in the feature space (digit 9 close to digits 4 and 7, and ankle boot close to sandal and sneaker), even though images of the last task have never been exposed to the encoder (note that the encoder is trained only during the replay process on previous tasks). This suggests an important future direction in continual learning: incrementally learning task descriptive features that can be applied to zero-shot learning scenarios.

Feature source           MNIST (%)              SVHN (%)               Fashion MNIST (%)
                         5 tasks    10 tasks    5 tasks    10 tasks    5 tasks    10 tasks
Learned encoder          98.96±0.13 97.68±0.25  84.14±0.58 76.49±0.65  85.02±0.82 89.58±0.40
Distilled knowledge      98.67±0.10 97.57±0.15  83.67±2.54 77.63±0.10  90.18±0.19 89.40±0.35
Prior knowledge          21.53±2.28 13.15±1.72  85.02±0.08 54.46±0.06  90.39±4.01 57.21±6.79

Table 3: Different ways to construct the feature space.

4.3 Distilled or prior knowledge as feature encoder

Next, we consider features either obtained from an intermediate layer of the image discriminator (distilled knowledge), or extracted from a pretrained model (e.g., VGG pretrained on ImageNet) as prior knowledge for feature matching during the replay process. Since the features are not adversarially learned in either case, we simply use an ℓ2 loss to align the features; the regularization is therefore equivalent to using the Bregman divergence with F(x) = ‖x‖² as the divergence function in eq. (3). Table 3 shows the performance of FoCL with different feature sources. Surprisingly, using prior knowledge as the feature source gives improved 5-task performance on both the SVHN (85.09%, best result) and Fashion MNIST (93.61%, best result) datasets, which also outperforms the results obtained with regularization in the image space (Table 1, MeRGAN Wu et al. (2018)). However, this finding is neither generalizable (MNIST) nor scalable (10 tasks). In most cases, we obtain comparable performance between the adversarially learned encoder and distilled knowledge as feature sources, suggesting that the improved performance comes from regularizing in the feature space rather than from the choice of divergence function.

4.4 Boosted performance with λ-FoCL

Despite the improved performance of standalone FoCL over previous methods, we notice in our experiments that regularization in the feature space alone is insufficient to address catastrophic forgetting on image data with high background variability, such as the CIFAR10 dataset (Table 4), possibly because our approach pays less attention to fine-grained background details. However, this can be effectively addressed by λ-FoCL (eq. (7)). Moreover, with λ-FoCL we also observe additional performance boosts on the MNIST, SVHN and Fashion MNIST datasets compared to standalone FoCL, achieving new state-of-the-art results (Table 4). Figure 6 compares samples generated with regularization in the image space and the feature space; the latter shows notable improvements in sharpness (SVHN) and diversity (Fashion MNIST).

Method                                   MNIST (%)  SVHN (%)  F-MNIST (%)  CIFAR10 (%)
MeRGAN Wu et al. (2018)*                 97.81      76.68     80.46        68.92
λ=0.0 (equivalent to MeRGAN)             97.81      76.68     80.46        68.92
λ=0.2                                    97.72      78.13     87.11        70.65
λ=0.4                                    98.01      78.95     90.63        71.29
λ=0.6                                    97.94      77.70     90.32        71.35
λ=0.8                                    97.89      77.59     91.73        71.58
λ=1.0 (equivalent to standalone FoCL)    98.09      77.31     90.26        23.85
* improved results based on our experiments

Table 4: Boosted performance with λ-FoCL.
Figure 6: Comparison of generated samples between regularization in image space and feature space.

4.5 Forgetfulness evaluation

Here, we further evaluate FoCL (standalone) using our proposed forgetfulness scores. We compute the task forgetfulness scores using FID as the distance measure between the generated data and the true data (another choice of distance measure could be based on classification accuracy). As discussed before, in practice we also notice that the original forgetfulness score (F_t^(i), eq. (8)) is only meaningful when d_i^(i) is comparable, whereas the compensated forgetfulness score (F̃_t^(i), eq. (9)) is more suitable when comparing methods with large variations in d_i^(i), as the latter also takes into account the difficulty of not forgetting.

Figure 7: Forgetfulness score by tasks (F_t^(i)).

Matching space   MNIST          SVHN           Fashion MNIST
                 score  slope   score  slope   score  slope
Image space      24.60  3.62    13.12  9.04    64.48  13.46
Feature space    15.60  0.82    66.19  3.81    47.13  3.59

Table 5: Overall forgetfulness score (F_t) based on FID, with the corresponding linearly-fitted curve slope.

Therefore, in our experiments, we use F_t^(i) only for the comparison between regularization in the image space and feature space, where the experiments are strictly controlled so that the assumption of comparable d_i^(i) is satisfied. The results are shown in Figure 7, and Figure 8 compares the curve of the compensated score from FoCL with those of previous methods that regularize in the image space, including DGR Shin et al. (2017) and MeRGAN Wu et al. (2018). As seen in both figures (most evident on MNIST and Fashion MNIST in the 10-task setting), despite lagging behind MeRGAN on early tasks, FoCL quickly catches up and maintains a more stable state of forgetfulness, suggesting that it suffers less from forgetting as tasks accumulate. The previous methods, by contrast, have much steeper curves. On the SVHN dataset (Figure 7), we do not observe lower forgetfulness scores from FoCL, possibly because the limited number of tasks in SVHN, given the complexity of the data, has not allowed FoCL to outperform.

Figure 8: The curves of compensated forgetfulness score (F̃_t^(i)).

Method                      MNIST           Fashion MNIST
                            score   slope   score    slope
EWC Seff et al. (2017)      67.82   1.61    147.63   10.02
SI Zenke et al. (2017)      66.13   1.99    152.70   15.46
DGR Shin et al. (2017)      69.86   1.31    150.39   7.24
VCL Nguyen et al. (2018)    54.79   1.59    106.94   3.80
MeRGAN Wu et al. (2018)     41.43   3.29    94.30    13.86
FoCL                        32.87   0.68    76.41    4.12

Table 6: Comparison of different methods by the overall compensated forgetfulness score based on FID, with the corresponding linearly-fitted curve slope.

Another possible explanation is that our method focuses more on task representative information than on fine-grained details, resulting in worse FID scores; however, it manages to remember more task representative information and as a result gives better classification accuracy (Table 1). Nevertheless, in general we still observe that the growth of the forgetfulness score is slowed down when the matching is done in the feature space instead of the image space. In fact, the trend of the curve also offers a prediction of the forgetfulness score on future tasks, and we can use the slope of the linearly-fitted curve as the indicator. Table 5 and Table 6 summarize both the overall forgetfulness scores and the corresponding linearly-fitted curve slopes for the different methods.
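The slope-based prediction mentioned above can be sketched directly with a least-squares line fit; the per-task scores below are toy values of our own, not measurements from the paper:

```python
import numpy as np

# Toy per-task forgetfulness scores over tasks 1..5 (illustrative only).
tasks = np.arange(1, 6)
scores = np.array([2.0, 3.1, 3.9, 5.2, 6.0])

# Slope of the linearly-fitted curve: an indicator of how quickly
# forgetting grows, usable to extrapolate to future tasks.
slope, intercept = np.polyfit(tasks, scores, deg=1)
assert slope > 0.0  # forgetting grows with task index in this toy series

# Extrapolated forgetfulness score for the next (6th) task.
pred_next = slope * 6 + intercept
assert pred_next > scores[-1]
```

A flatter fitted slope, as reported for the feature-space rows in Table 5, corresponds to milder predicted forgetting on tasks yet to arrive.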

Our forgetfulness measure gives complementary information beyond previous metrics such as the average classification accuracy reported in Table 1. For example, on the MNIST dataset, FoCL performs significantly better in forgetfulness scores (Table 5 and Table 6), while it is indistinguishable in average classification accuracy (Table 1). Indeed, the forgetfulness score captures the degree to which the model suffers from catastrophic forgetting during the learning process, whereas the previous metrics merely assess whether the final model can successfully solve all the tasks. We believe future studies should take both into consideration for a fair comparison among different continual learning methods.

5 Conclusion

In this paper, we present FoCL, a general framework in continual learning for generative models. Instead of regularizing in the parameter space or image space as done in previous works, FoCL regularizes the model in the feature space, and we propose several ways to construct this feature space. Our experiments on several benchmark datasets show that FoCL adapts faster to distributional changes and achieves the state-of-the-art performance for generative models in task incremental learning. Finally, we show that our proposed forgetfulness measure offers a new perspective when evaluating different methods in continual learning.


  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §3.2.
  • Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal (2020) A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations (ICLR), Cited by: §4.2.
  • H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. C. Courville (2017) Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pp. 6594–6604. Cited by: §4.2.
  • A. Dosovitskiy and T. Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in neural information processing systems, pp. 658–666. Cited by: §2.2.
  • S. Farquhar and Y. Gal (2018) Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733. Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.1.
  • A. Grover, M. Dhar, and S. Ermon (2018) Flow-gan: combining maximum likelihood and adversarial learning in generative models. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3.2.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §4.2.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: footnote 5.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: footnote 4.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.1, §3.2.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §3.1, footnote 2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.1.2.
  • Y. LeCun (1989) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §4.1.2.
  • T. Lesort, H. Caselles-Dupré, M. Garcia-Ortiz, A. Stoian, and D. Filliat (2019) Generative models from the perspective of continual learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.1, §3.5, §4.1.1.
  • Y. Li, K. Swersky, and R. Zemel (2015) Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727. Cited by: §2.2.
  • Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: footnote 4.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §4.1.2.
  • A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski (2017) Plug & play generative networks: conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477. Cited by: §2.2.
  • C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational continual learning. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1, §3.1, §3.5, §4.1.1, Table 6.
  • A. Odena, C. Olah, and J. Shlens (2017) Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2642–2651. Cited by: §4.2.
  • O. Ostapenko, M. Puscas, T. Klein, P. Jähnichen, and M. Nabi (2019) Learning to remember: a synaptic plasticity driven framework for continual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.5, §4.1.2, §4.2.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §2.2.
  • A. Seff, A. Beatson, D. Suo, and H. Liu (2017) Continual learning in generative adversarial nets. arXiv preprint arXiv:1705.08395. Cited by: §2.1, Table 1, Table 6.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999. Cited by: §1, §2.1, §2.1, §3.1, §4.2, §4.5, Table 1, Table 6.
  • M. K. Titsias, J. Schwarz, A. G. d. G. Matthews, R. Pascanu, and Y. W. Teh (2020) Functional regularisation for continual learning using gaussian processes. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • G. M. van de Ven and A. S. Tolias (2018) Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635. Cited by: §3.5.
  • D. Warde-Farley and Y. Bengio (2017) Improving generative adversarial networks with denoising feature matching. In International Conference on Learning Representations (ICLR), Cited by: §2.2.
  • C. Wu, L. Herranz, X. Liu, J. van de Weijer, B. Raducanu, et al. (2018) Memory replay gans: learning to generate new categories without forgetting. In Advances in Neural Information Processing Systems, pp. 5962–5972. Cited by: §1, §2.1, §2.1, §3.1, §3.5, §4.1.1, §4.1.2, §4.2, §4.2, §4.3, §4.5, Table 1, Table 4, Table 6.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1.2.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §1, §3.1, Table 6.
  • M. Zhai, L. Chen, F. Tung, J. He, M. Nawhal, and G. Mori (2019) Lifelong gan: continual learning for conditional image generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2759–2768. Cited by: §3.3.