
Continual Classification Learning Using Generative Models

Continual learning is the ability to sequentially learn over time by accommodating new knowledge while retaining previously learned experiences. Neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on previously learned tasks when tasks are presented one at a time. This problem is called catastrophic forgetting. In this work, we propose a classification model that learns continuously from sequentially observed tasks, while preventing catastrophic forgetting. We build on the lifelong generative capabilities of [10] and extend it to the classification setting by deriving a new variational bound on the joint log-likelihood, $\log p(x, y)$.



1 Introduction

Continual learning tries to mimic the ability of humans to retain or accumulate previous knowledge and use it to solve future problems with possible adaptations. In this paper we propose a new method for continual learning in the classification setting. Our model combines the encoder and decoder of a variational autoencoder (VAE) with a classifier. To do this we derive a new variational bound on the joint log-likelihood, $\log p(x, y)$.


To enable the continual discriminative learning we build on the work of Ramapuram et al. [10] on lifelong generative modelling. The model has a student-teacher architecture (similar to that in distillation methods, [4], [2]), where the teacher contains a summary of all past distributions and is able to generate data from the previous tasks once we no longer have access to the original data. Every time a new task arrives, a student is trained on the new data together with the data generated by the teacher from the old tasks. The proposed method thus does not need to store the previous models (it only stores their summary within the teacher model) nor data from the previous tasks (it can generate them using the teacher model).

1.1 Related work

Several approaches have been proposed to address catastrophic forgetting over the last few years. We can roughly distinguish two streams of work: a) methods that rely on a dynamic architecture that evolves as new tasks are seen, and b) regularization methods that constrain the models learned on new tasks so that the network avoids modifying the parameters important for the previous tasks. In dynamic architectures, parameters of the models learned on the old tasks are passed over to the new tasks while the past models for each task are preserved ([11], [1]). In contrast, our method does not need to keep the past models. Regularization approaches ([7], [13]) impose constraints on the objective function to minimize changes in parameters important for previous tasks. However, these methods need to store the parameters of the previous tasks, something that is not required in our proposed method.

Variational Continual Learning (VCL) [9] is applicable to discriminative and to generative models, but not to both at the same time, whereas our method is. While VCL shows rather impressive results, it achieves them by reusing some of the previous data through core-sets and by maintaining task-specific parameters, called head networks. It therefore relaxes the continual learning paradigms of no access to past data and no storage of past task-specific models, paradigms that our method fully takes on board.

2 Model

In the continual classification setting, we deal with data that come sequentially in pairs $(x, y)$. For each task the network receives a new data set and does not have access to any of the data sets of the previously seen tasks.

To perform the classification, we use a latent variable model as shown in Fig.1. In this model, each observation $x$ has a corresponding latent variable $z$ that is used to generate the correct label class $y$. The joint distribution of the latent variable model that we consider factorizes as

$$p(x, y, z) = p(x \mid z)\, p(y \mid z)\, p(z),$$

where $(x, y)$ are labeled data pairs and $z$ are the latent variables. The data variables are assumed to be conditionally independent given the latent variables, $x \perp y \mid z$, such that $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$.

Figure 1: Graphical model

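Following the classical VAE approach we use variational inference to approximate the intractable posterior $p(z \mid x, y)$. Instead of the natural $q(z \mid x, y)$ we use $q(z \mid x)$ to approximate the true posterior, since $y$ is not available in the test phase of the classification. To measure the similarity between the true posterior $p(z \mid x, y)$ and its approximation $q(z \mid x)$ we minimize the Kullback-Leibler divergence between the approximate posterior and the true posterior:

$$\mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x, y)\big) = \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(x, y, z)\big] + \log p(x, y). \qquad (1)$$

The term $\log p(x, y)$ in Eq.1 is a constant with respect to $q$. This means that in order to minimize the KL-divergence we minimize $\mathbb{E}_{q(z \mid x)}[\log q(z \mid x) - \log p(x, y, z)]$, which is equivalent to maximizing

$$\mathcal{L}(x, y) = \mathbb{E}_{q(z \mid x)}\big[\log p(x, y, z) - \log q(z \mid x)\big]. \qquad (2)$$

Rearranging Eq.1 as:

$$\log p(x, y) = \mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x, y)\big) + \mathcal{L}(x, y), \qquad (3)$$

we can see that $\mathcal{L}(x, y)$ is a lower bound of the joint log-likelihood, $\log p(x, y) \ge \mathcal{L}(x, y)$: a new variational bound for the joint generative and discriminative VAE learning.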
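The decomposition of the joint log-likelihood into a KL term plus a lower bound can be checked numerically on a tiny discrete model. The sketch below is only an illustration: the three-state latent variable and all probability values are made-up assumptions, not quantities from the paper.

```python
import math

# Toy model with a discrete latent z in {0, 1, 2} and one fixed observed pair (x, y).
# Every number below is an illustrative assumption.
p_z = [0.5, 0.3, 0.2]              # prior p(z)
p_x_given_z = [0.10, 0.40, 0.70]   # p(x | z) evaluated at the observed x
p_y_given_z = [0.80, 0.50, 0.10]   # p(y | z) evaluated at the observed y
q_z_given_x = [0.2, 0.5, 0.3]      # approximate posterior q(z | x)

# Joint p(x, y, z) = p(x | z) p(y | z) p(z) under conditional independence.
joint = [px * py * pz for px, py, pz in zip(p_x_given_z, p_y_given_z, p_z)]
log_p_xy = math.log(sum(joint))                 # log p(x, y)
posterior = [j / sum(joint) for j in joint]     # true posterior p(z | x, y)

# Lower bound: E_q[log p(x, y, z) - log q(z | x)]
bound = sum(q * (math.log(j) - math.log(q)) for q, j in zip(q_z_given_x, joint))
# KL(q(z | x) || p(z | x, y))
kl = sum(q * (math.log(q) - math.log(p)) for q, p in zip(q_z_given_x, posterior))

print(f"log p(x,y) = {log_p_xy:.6f}, KL + bound = {kl + bound:.6f}")
```

Because the KL term is non-negative, `bound` can never exceed `log_p_xy`, and the gap closes exactly when `q_z_given_x` equals the true posterior.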

To gain better intuition into our newly derived variational bound, we show its relation to the classical ELBO (the variational bound on the marginal log-likelihood $\log p(x)$) used in VAEs. Rearranging the terms in Eq.3, under the conditional independence assumption $p(x, y \mid z) = p(x \mid z)\, p(y \mid z)$ and using the fact that the KL-divergence is always non-negative, we arrive at:

$$\log p(x, y) \ge \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big) + \mathbb{E}_{q(z \mid x)}\big[\log p(y \mid z)\big]. \qquad (4)$$

The first term, $\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - \mathrm{KL}(q(z \mid x)\,\|\,p(z))$, is, as in the standard VAE, the variational bound on the marginal log-likelihood (ELBO). The second term, $\mathbb{E}_{q(z \mid x)}[\log p(y \mid z)]$, is the expectation of the conditional log-likelihood of the labels $y$ given the latent variable $z$, which plays the role of the classification loss. This term allows our variational bound to be used in classification settings. We thus solve the two problems of producing the labels $y$ and generating the input data $x$ jointly, resulting in a common latent variable $z$ that is good for classification and reconstruction at the same time.
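As a rough sketch of how the bound on the joint log-likelihood could be turned into a per-example training loss, the code below sums a reconstruction term, a KL term and a classification term. The distributional choices are assumptions made for illustration only (diagonal Gaussian posterior, standard normal prior, Bernoulli decoder, softmax classifier, a single sample of the latent variable); the function names are hypothetical.

```python
import math

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def bernoulli_log_lik(x, x_prob, eps=1e-7):
    """log p(x | z) for binary pixels x given decoder output probabilities."""
    total = 0.0
    for xi, pi in zip(x, x_prob):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(pi) + (1 - xi) * math.log(1.0 - pi)
    return total

def categorical_log_lik(label, logits):
    """log p(y | z) computed with a numerically stable log-softmax."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[label] - log_norm

def negative_joint_bound(x, label, mu, log_var, x_prob, logits):
    """Negative of: E_q[log p(x|z)] - KL(q(z|x) || p(z)) + E_q[log p(y|z)],
    with both expectations approximated by one sample of z (assumption)."""
    elbo_x = bernoulli_log_lik(x, x_prob) - gaussian_kl(mu, log_var)
    class_term = categorical_log_lik(label, logits)
    return -(elbo_x + class_term)

# Usage with placeholder encoder/decoder/classifier outputs.
loss = negative_joint_bound(
    x=[1, 0], label=0, mu=[0.0, 0.0], log_var=[0.0, 0.0],
    x_prob=[0.5, 0.5], logits=[0.0, 0.0],
)
print(loss)
```

Minimizing this quantity with respect to the encoder, decoder and classifier parameters corresponds to maximizing the bound; in practice the three outputs would come from neural networks rather than fixed lists.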

Furthermore, it is easy to show that under our conditional independence assumption $\frac{p(z \mid x, y)}{p(z \mid x)} = \frac{p(y \mid z)}{p(y \mid x)}$. Assuming that $z$ summarizes $x$ well for the classification of $y$ ($p(y \mid z) \approx p(y \mid x)$), both of the ratios are close to 1. Replacing the intractable posterior $p(z \mid x)$ by the approximation $q(z \mid x)$ results in $p(z \mid x, y) \approx q(z \mid x)$, which is what the minimization of the KL-divergence in Eq.(1) tries to achieve. This therefore provides an alternative argument for the validity of our approach described above.
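For completeness, the relation between the posterior with and without the label follows in a few steps from Bayes' rule and the conditional independence of $x$ and $y$ given $z$. The derivation below is a reconstruction under exactly those assumptions:

```latex
\begin{align*}
p(z \mid x, y)
  &= \frac{p(x, y \mid z)\, p(z)}{p(x, y)}
   = \frac{p(x \mid z)\, p(y \mid z)\, p(z)}{p(y \mid x)\, p(x)} \\
  &= \frac{p(x \mid z)\, p(z)}{p(x)} \cdot \frac{p(y \mid z)}{p(y \mid x)}
   = p(z \mid x)\, \frac{p(y \mid z)}{p(y \mid x)} .
\end{align*}
```

Dividing both sides by $p(z \mid x)$ gives the ratio identity, so the two posteriors coincide whenever $p(y \mid z) = p(y \mid x)$.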

Our goal in this paper is to correctly classify data from different tasks that arrive continuously, requiring us to handle the catastrophic forgetting problem. For this we use the lifelong generative ability of [10] and extend their VAE based generative model to include a classifier that remembers all the classification tasks it has seen before. The method uses a dual architecture based on a student-teacher model. The main goal of the student model is to classify the input data. The teacher model’s role is to preserve the memory of the previously learned tasks and to pass this knowledge onto the student.

Both the teacher and the student consist of an encoder $q(z \mid x)$, a decoder $p(x \mid z)$ and, newly, a classifier $p(y \mid z)$, following the graphical model in Fig.1. The teacher and the student each hold their own copy of these three components. The teacher model remembers the old tasks and generates data from them for the student to use in learning once the old data are no longer available. The student model learns to generate and classify over the new labeled data pairs $(x, y)$ and the old-task data generated by the teacher. Every time a new task is initiated, the student passes its latest parameters to the teacher and starts learning over data from the new task augmented with data generated by the teacher from all the previous tasks. In this way the acquired information of the previous tasks is preserved and the proposed model learns to classify correctly even over data distributions seen in previous tasks. The proposed architecture does not need to store the task-specific models for the previous data distributions nor the previous data themselves.
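The student-teacher procedure described above can be sketched as a short training loop. This is a toy illustration, not the authors' implementation: `ToyGenerativeClassifier` is a hypothetical stand-in for the VAE with classifier (one Gaussian per label over a scalar input), and `continual_training`, `n_replay` and the example data are assumptions introduced here.

```python
import copy
import random
import statistics

class ToyGenerativeClassifier:
    """Hypothetical stand-in for the VAE + classifier: one Gaussian per label."""

    def __init__(self):
        self.stats = {}  # label -> (mean, std, count)

    def fit(self, pairs):
        by_label = {}
        for x, y in pairs:
            by_label.setdefault(y, []).append(x)
        self.stats = {
            y: (statistics.fmean(xs), statistics.pstdev(xs), len(xs))
            for y, xs in by_label.items()
        }

    def sample(self, n, rng):
        """Generate n synthetic (x, y) pairs from the learned summary."""
        labels = list(self.stats)
        weights = [self.stats[y][2] for y in labels]
        pairs = []
        for y in rng.choices(labels, weights=weights, k=n):
            mean, std, _ = self.stats[y]
            pairs.append((rng.gauss(mean, std), y))
        return pairs

    def classify(self, x):
        return min(self.stats, key=lambda y: abs(x - self.stats[y][0]))


def continual_training(tasks, n_replay, seed=0):
    rng = random.Random(seed)
    student = ToyGenerativeClassifier()
    for i, new_pairs in enumerate(tasks):
        replay = []
        if i > 0:
            # New task: the student hands its parameters to the teacher,
            # which then generates data standing in for all earlier tasks.
            teacher = copy.deepcopy(student)
            replay = teacher.sample(n_replay, rng)
        # The student trains on new data plus teacher-generated data only;
        # the raw data of earlier tasks is never revisited.
        student.fit(list(new_pairs) + replay)
    return student


tasks = [
    [(0.0, "a"), (0.2, "a"), (-0.2, "a")],
    [(10.0, "b"), (10.2, "b"), (9.8, "b")],
]
student = continual_training(tasks, n_replay=50)
print(student.classify(0.1), student.classify(9.9))
```

Only one teacher and one student exist at any time, so memory use does not grow with the number of tasks, which is the property the text emphasizes.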

Figure 2: The architecture of the learning procedure. (a) The teacher model generates input-output pairs from the previously seen tasks and passes them onto the student. Moreover, the teacher evaluates the posterior. (b) The student model learns to classify and generate new data augmented by data from the teacher.

The student optimizes the variational bound on the joint log-likelihood, Eq.(4), instead of the bound on the marginal log-likelihood over which the classical VAE operates. As a result our model is able to both generate the input data and learn the labels at the same time. We should note that previous approaches to classification with VAEs ([5]) add classification-related terms to the VAE objective in an ad-hoc manner. Here we naturally extend the VAE setting to classification.

Following [10], we add two terms to our objective: a term that preserves the posterior representation of all previous tasks, which speeds up the training, and a negative information gain regularizer between the latent representation and the data generated by the teacher. The final loss that we optimize is given by Eq.5.


3 Experiments

In this section we present preliminary results achieved with the proposed model. We investigate whether our model is able to learn a set of different tasks arriving in sequence without forgetting the previously learned tasks.

We evaluated our approach for continual learning on permuted MNIST [8], [3]. Each task is a 10-way classification (0-9 digits) over images with the pixels shuffled by a random fixed permutation. We train on a sequence of 5 tasks (original MNIST and 4 random permutations). After the training of each task we allow no further training or access to that task’s data set. For training we process the data in mini batches of 256 (random data shuffling) and use early stopping on the classification accuracy.
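The construction of the permuted tasks can be sketched as follows: each task fixes one random pixel permutation and applies it to every image, with the first task left unpermuted. The helper names and the seed are hypothetical, and images are assumed to be flattened into lists of 784 pixel values.

```python
import random

def make_permutations(n_pixels, n_tasks, seed=0):
    """One fixed permutation per task; task 0 is the identity (original images)."""
    rng = random.Random(seed)
    perms = [list(range(n_pixels))]
    for _ in range(n_tasks - 1):
        perm = list(range(n_pixels))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def permute_image(pixels, perm):
    """Apply a task's permutation to one flattened image."""
    return [pixels[j] for j in perm]

# Five tasks over 28x28 images: the original data set plus 4 random permutations.
perms = make_permutations(n_pixels=28 * 28, n_tasks=5)
```

Since the permutation is fixed within a task, the labels are unchanged and every task remains a 10-way classification problem.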

We use two baseline models for comparisons. The first is a standard VAE augmented by our classifier (vae-cl) using our variational bound but without the teacher-student architecture. In the second, we adapt the elastic weight consolidation (EWC) regularisation approach of [7] to our setting. Here we use the teacher to keep the summary of all the previous distributions and employ an EWC-like regularisation over the parameters of the teacher and student models. (Footnote 1: In our EWC baseline the teacher is not used to generate data for the student.)
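For orientation, the quadratic penalty commonly used in EWC can be sketched as below. This is the standard textbook form, with the teacher parameters acting as the anchor; the exact regulariser used for the baseline is not recoverable from the text, so the function name, the diagonal Fisher weights and the `strength` parameter are all assumptions.

```python
def ewc_penalty(student_params, teacher_params, fisher_diag, strength):
    """Standard EWC-style penalty: (strength / 2) * sum_i F_i * (s_i - t_i)^2.

    fisher_diag holds per-parameter importance weights (assumed to be a
    diagonal Fisher information estimate computed on earlier tasks).
    """
    return 0.5 * strength * sum(
        f * (s - t) ** 2
        for s, t, f in zip(student_params, teacher_params, fisher_diag)
    )

# Parameters with a large importance weight are pulled strongly toward the teacher's values.
penalty = ewc_penalty([1.0, 2.0], [1.0, 0.0], [3.0, 0.5], strength=2.0)
print(penalty)
```

The penalty would be added to the student's training loss on the new task, in place of training on teacher-generated data.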

We measure the performance by the ability of the network to solve all tasks seen up to the current point. For all tested methods we performed a random hyper-parameter sweep over convolutional and dense network architectures. We present the results of the best obtained models (convolutional for ours and vae-cl, dense for EWC) in Fig. 3. For the naive vae-cl method the performance drops dramatically as soon as the training regime switches from MNIST to the first permuted task. For the EWC method the performance after the first task degrades less severely, but it still forgets the previous tasks. Our model, continual classification learning using generative models (CCL-GM), retains high average classification accuracy, Fig. 3(a), and low average negative reconstruction ELBO, Fig. 3(b). This shows that our model is able to learn continuously and concurrently for both classification and generation.

(a) Average test classification accuracy
(b) Average test negative reconstruction ELBO
Figure 3: Average performance over all learned tasks from the permuted MNIST data set as a function of the number of tasks. Our approach, CCL-GM, maintains high accuracy and low negative ELBO as the number of tasks increases. The vanilla VAE with our classifier (vae-cl) performs far worse. EWC degrades less severely, but still forgets the previous tasks.
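The quantity plotted in the figure, average performance over all tasks seen so far, can be computed from a table of per-task test scores. The sketch below assumes a hypothetical lower-triangular layout where row i holds the scores measured after training on task i; the numbers in the usage example are placeholders, not results from the paper.

```python
def average_over_seen_tasks(score_matrix):
    """score_matrix[i][j] = test score on task j after training on task i (j <= i).

    Returns one average per training stage, taken over tasks 0..i.
    """
    return [sum(row[: i + 1]) / (i + 1) for i, row in enumerate(score_matrix)]

# Placeholder scores for two stages: after task 0, and after task 1.
averages = average_over_seen_tasks([[1.0], [0.5, 1.0]])
print(averages)  # [1.0, 0.75]
```

A method that forgets shows a falling curve, because scores on early tasks drop while they still count toward each later average.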

To support our initial results from the above experiments, we conducted a second set of experiments on a sequence of three different tasks: MNIST, FashionMNIST [12] and one MNIST permutation. The results presented in Fig. 4 show that our method outperforms the baselines and confirm our preliminary conclusions that our new model CCL-GM has the ability to mitigate catastrophic forgetting in joint generative and discriminative problems.

(a) Average test classification accuracy
(b) Average test negative reconstruction ELBO
Figure 4: Average performance over all learned tasks.

4 Conclusion

In this work we propose a method to address continual learning in the classification setting. We use a generative model to generate input-output pairs from the previously learned tasks and use these to augment the data of the current task for further training. In this way our classification model overcomes catastrophic forgetting. Our model reuses neither past data nor previous task-specific models, and it continuously learns to concurrently classify and reconstruct data over a number of different tasks.