1 Introduction
Continual learning tries to mimic the human ability to retain and accumulate previous knowledge and to use it, with possible adaptations, to solve future problems. In this paper we propose a new method for continual learning in the classification setting. Our model combines the encoder and decoder of a variational autoencoder (VAE) [6] with a classifier. To do this we derive a new variational bound on the joint log-likelihood $\log p(x, y)$.
To enable continual discriminative learning we build on the work of Ramapuram et al. [10] on lifelong generative modelling. The model has a student-teacher architecture (similar to that in distillation methods [4], [2]), where the teacher contains a summary of all past distributions and is able to generate data from the previous tasks once we no longer have access to the original data. Every time a new task arrives, a student is trained on the new data together with the data generated by the teacher from the old tasks. The proposed method thus needs to store neither the previous models (it only stores their summary within the teacher model) nor data from the previous tasks (it can generate them using the teacher model).
1.1 Related work
Several approaches have been proposed over the last few years to address catastrophic forgetting. We can roughly distinguish two streams of work: (a) methods that rely on a dynamic architecture that evolves as new tasks are seen, and (b) regularization approaches that constrain the models learned on new tasks so that the network avoids modifying the parameters that are important for the previous tasks. In dynamic architectures, parameters of the models learned on the old tasks are passed over to the new tasks while the past models for each task are preserved ([11], [1]). In contrast, our method does not need to keep the past models. Regularization approaches ([7], [13]) impose constraints on the objective function to minimize changes in parameters important for previous tasks. However, these methods need to store the parameters of the previous tasks, something that is not required in our proposed method.
In Variational Continual Learning (VCL) [9], the authors propose a method that is applicable to either discriminative or generative models, but not to both at the same time, whereas our method handles both jointly. While VCL shows rather impressive results, it achieves them by reusing some of the previous data through coresets and by maintaining task-specific parameters, called head networks. It therefore relaxes the continual learning paradigms of no access to past data and no storage of past task-specific models; paradigms that our method fully takes on board.
2 Model
In the continual classification setting, we deal with data that come sequentially in pairs $(x, y)$. For each task the network receives a new data set and does not have access to any of the data sets of the previously seen tasks.
To perform the classification, we use a latent variable model as shown in Fig. 1. In this model, each observation $x$ has a corresponding latent variable $z$ that is used to generate the correct class label $y$. The joint distribution of the latent variable model that we consider factorizes as
$$p_\theta(x, y, z) = p_\theta(x \mid z)\, p_\theta(y \mid z)\, p(z),$$
where $(x, y)$ are labeled data pairs and $z$ are the latent variables. The data variables are assumed to be conditionally independent given the latent variables $z$, $x \perp y \mid z$, such that $p_\theta(x, y \mid z) = p_\theta(x \mid z)\, p_\theta(y \mid z)$.
Following the classical VAE approach we use variational inference to approximate the intractable posterior $p_\theta(z \mid x, y)$. Instead of the natural $q_\phi(z \mid x, y)$ we use $q_\phi(z \mid x)$ to approximate the true posterior, since in the test phase of the classification $y$ is not available. To bring the approximation $q_\phi(z \mid x)$ close to the true posterior $p_\theta(z \mid x, y)$, we minimize the Kullback-Leibler divergence between the approximate posterior and the true posterior:
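$$\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x, y)\big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, y, z)\big] + \log p_\theta(x, y) \qquad (1)$$
The term $\log p_\theta(x, y)$ in Eq. 1 is a constant with respect to the variational parameters $\phi$. This means that in order to minimize the KL-divergence we minimize $\mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x, y, z)\big]$, which is equivalent to maximizing $\mathcal{L}(\theta, \phi; x, y)$:
$$\mathcal{L}(\theta, \phi; x, y) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, y, z) - \log q_\phi(z \mid x)\big] \qquad (2)$$
Rearranging Eq. 1 as:
$$\log p_\theta(x, y) = \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x, y)\big) + \mathcal{L}(\theta, \phi; x, y) \qquad (3)$$
we can see that $\mathcal{L}(\theta, \phi; x, y)$ is a lower bound of the joint log-likelihood, $\log p_\theta(x, y) \geq \mathcal{L}(\theta, \phi; x, y)$: a new variational bound for the joint generative and discriminative VAE learning.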
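To gain better intuition into our newly derived variational bound, we show the relation to the classical ELBO (the variational bound on the marginal log-likelihood $\log p_\theta(x)$) used in VAEs. Rearranging the terms in Eq. 3, under the conditional independence assumption and using the fact that the KL-divergence is always non-negative, we arrive at:
$$\log p_\theta(x, y) \geq \mathcal{L}(\theta, \phi; x, y) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big) + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(y \mid z)\big] \qquad (4)$$
The first term, $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathrm{KL}(q_\phi(z \mid x)\,\|\,p(z))$, is, as in the standard VAE, the variational bound on the marginal likelihood (ELBO). The second term, $\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(y \mid z)]$, is the expectation of the conditional log-likelihood of the labels $y$ given the latent variable $z$, i.e. the classification loss. This term allows our variational bound to be used in classification settings. We thus solve the two problems of producing the labels $y$ and generating the input data $x$ jointly, resulting in a common latent variable $z$ that is good for classification and reconstruction at the same time.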
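To make the structure of the bound concrete, the following minimal NumPy sketch evaluates its three parts (reconstruction, KL and classification terms) for a mini-batch. It is an illustration under stated assumptions, not the implementation used in the paper: it assumes a diagonal Gaussian approximate posterior, a standard normal prior, a Bernoulli decoder over binary pixels, a categorical classifier, and a single-sample Monte Carlo estimate of the expectations. All function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def bernoulli_log_lik(x, x_logits):
    """log p(x|z) for pixels in [0, 1], summed over pixels (stable in logit space)."""
    # x*log(sigmoid(l)) + (1-x)*log(1-sigmoid(l)) simplifies to x*l - log(1 + exp(l))
    return np.sum(x * x_logits - np.logaddexp(0.0, x_logits), axis=-1)

def categorical_log_lik(y, y_logits):
    """log p(y|z) for integer class labels y, via a log-softmax over the logits."""
    log_norm = np.logaddexp.reduce(y_logits, axis=-1, keepdims=True)
    log_probs = y_logits - log_norm
    return log_probs[np.arange(len(y)), y]

def joint_bound(x, y, mu, logvar, x_logits, y_logits):
    """Per-example estimate of the bound: reconstruction - KL + classification.

    mu, logvar: encoder outputs; x_logits, y_logits: decoder and classifier
    outputs computed from one sample z ~ q(z|x) (assumed, single-sample estimate).
    """
    reconstruction = bernoulli_log_lik(x, x_logits)
    kl = gaussian_kl(mu, logvar)
    classification = categorical_log_lik(y, y_logits)
    return reconstruction - kl + classification
```

In training one would maximize the batch mean of `joint_bound` (or minimize its negative); dropping the classification term recovers the usual VAE objective.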
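Furthermore, it is easy to show that under our conditional independence assumption $\frac{p_\theta(z \mid x, y)}{p_\theta(z \mid x)} = \frac{p_\theta(y \mid z)}{p_\theta(y \mid x)}$. Assuming that $z$ summarizes $x$ well for the classification of $y$ ($p_\theta(y \mid z) \approx p_\theta(y \mid x)$), both of the ratios are close to 1. Replacing the intractable posterior $p_\theta(z \mid x)$ by the approximation $q_\phi(z \mid x)$ results in $q_\phi(z \mid x) \approx p_\theta(z \mid x, y)$, which is what the minimization of the KL-divergence in Eq. (1) tries to achieve. This therefore provides an alternative argument for the validity of our approach described above.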
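For completeness, the relation between the two posteriors can be checked in a few lines using only Bayes' rule and the conditional independence of $x$ and $y$ given $z$. The parameter subscripts are dropped for brevity, and the symbols follow the usual VAE conventions, which is an assumption about the notation:

```latex
\begin{align*}
p(z \mid x, y) &= \frac{p(x \mid z)\, p(y \mid z)\, p(z)}{p(x, y)}, &
p(z \mid x) &= \frac{p(x \mid z)\, p(z)}{p(x)}, \\[4pt]
\frac{p(z \mid x, y)}{p(z \mid x)} &= \frac{p(y \mid z)\, p(x)}{p(x, y)}
  = \frac{p(y \mid z)}{p(y \mid x)}.
\end{align*}
```

The last equality uses $p(x, y) = p(y \mid x)\, p(x)$, so the posterior that conditions on the label differs from the label-free posterior only by how much better $z$ predicts $y$ than $x$ does.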
Our goal in this paper is to correctly classify data from different tasks that arrive continuously, which requires us to handle the catastrophic forgetting problem. For this we use the lifelong generative ability of [10] and extend their VAE-based generative model to include a classifier that remembers all the classification tasks it has seen before. The method uses a dual architecture based on a student-teacher model. The main goal of the student model is to classify the input data. The teacher model's role is to preserve the memory of the previously learned tasks and to pass this knowledge on to the student.
Both the teacher and the student consist of an encoder $q_{\phi^i}(z \mid x)$, a decoder $p_{\theta^i}(x \mid z)$ and, newly, a classifier $p_{\theta^i}(y \mid z)$, following the graphical model in Fig. 1. In the above notation $i \in \{t, s\}$ represents the teacher and student model respectively. The teacher model remembers the old tasks and generates data from them for the student to use in learning once the old data are no longer available. The student model learns to generate and classify over the new labeled data pairs $(x, y)$ and the old-task data $(\hat{x}, \hat{y})$ generated by the teacher. Every time a new task is initiated, the student passes its latest parameters to the teacher and starts learning over data from the new task augmented with data generated by the teacher from all the previous tasks. In this way the acquired information of the previous tasks is preserved and the proposed model learns to classify correctly even over data distributions seen in previous tasks. The proposed architecture needs to store neither the task-specific models for the previous data distributions nor the previous data themselves.
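The training protocol described above can be sketched as a short loop. This is a schematic outline under assumptions, not the authors' code: the model interface (`generate`), the helper names, and in particular the mixing rule (each seen task gets an equal share of every mini-batch) are hypothetical choices made for illustration.

```python
import copy
import numpy as np

def mix_with_teacher(new_x, new_y, teacher_generate, n_old_tasks, rng):
    """Replace part of a real mini-batch with teacher-generated old-task data.

    Assumption: with k old tasks, a fraction k / (k + 1) of the batch is generated.
    teacher_generate(n) is a hypothetical callable returning (x_hat, y_hat).
    """
    batch = len(new_x)
    n_gen = batch * n_old_tasks // (n_old_tasks + 1)
    if n_gen == 0:
        return new_x, new_y
    keep = rng.choice(batch, size=batch - n_gen, replace=False)
    gen_x, gen_y = teacher_generate(n_gen)
    x = np.concatenate([new_x[keep], gen_x])
    y = np.concatenate([new_y[keep], gen_y])
    return x, y

def continual_training(tasks, make_student, train_step, rng):
    """tasks: list of iterables of (x, y) mini-batches, one iterable per task."""
    teacher = None
    student = make_student()
    for k, batches in enumerate(tasks):
        for x, y in batches:
            if teacher is not None:
                # Augment current-task data with samples of all previous tasks.
                x, y = mix_with_teacher(x, y, teacher.generate, k, rng)
            train_step(student, x, y)
        # The student hands its latest parameters to the teacher before the
        # next task starts; the original data of task k are discarded.
        teacher = copy.deepcopy(student)
    return student
```

Only one teacher is alive at any time, so memory does not grow with the number of tasks; the price is that errors in the teacher's samples can accumulate across tasks.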
The student optimizes the variational bound on the joint log-likelihood, Eq. (4), instead of the bound on the marginal log-likelihood over which the classical VAE operates. As a result our model is able to both generate the input data and learn the labels at the same time. We should note that previous approaches to classification with VAEs ([5]) do so by adding, in an ad hoc manner, terms that relate to classification performance to the VAE objective. Here we naturally extend the VAE setting to classification.
Following [10] we add two further terms to our objective: a term that preserves the posterior representation of all previous tasks, which speeds up the training, and a negative information gain regularizer between the latent representation and the data generated by the teacher. The final loss that we optimize is given by Eq. 5.
(5) 
3 Experiments
In this section we present preliminary results achieved with the proposed model. We investigate whether our model is able to learn a set of different tasks that arrive in sequence without forgetting the previously learned tasks.
We evaluated our approach for continual learning on permuted MNIST [8], [3]. Each task is a 10-way classification (digits 0-9) over images with the pixels shuffled by a fixed random permutation. We train on a sequence of 5 tasks (original MNIST and 4 random permutations). After the training of each task we allow no further training on, or access to, that task's data set. For training we process the data in mini-batches of 256 (with random data shuffling) and use early stopping on the classification accuracy.
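As an illustration of how such a task sequence can be built, the sketch below draws one fixed pixel permutation per task and applies it to flattened images. The function names and the seed handling are assumptions made for this example; the first task keeps the original pixel order, matching the setup described above.

```python
import numpy as np

def make_permutations(n_tasks, n_pixels, seed=0):
    """One fixed pixel permutation per task; task 0 is the identity (original MNIST)."""
    rng = np.random.default_rng(seed)
    perms = [np.arange(n_pixels)]
    perms += [rng.permutation(n_pixels) for _ in range(n_tasks - 1)]
    return perms

def apply_permutation(images, perm):
    """Flatten each image and reorder its pixels with the task's permutation."""
    flat = images.reshape(len(images), -1)
    return flat[:, perm]

# Example: 5 tasks over 28x28 images (784 pixels), as in the experiment above.
perms = make_permutations(n_tasks=5, n_pixels=28 * 28)
```

Because the same permutation is applied to every image of a task, the labels are unchanged and each task is equally hard for a dense network, while the input distributions differ strongly between tasks.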
We use two baseline models for comparison. The first is a standard VAE augmented by our classifier (vaecl) using our variational bound but without the teacher-student architecture. In the second, we adapt the elastic weight consolidation (EWC) regularisation approach of [7] to our setting. We use the teacher here to keep the summary of all the previous distributions (in our EWC baseline the teacher is not used to generate data for the student) and employ the EWC-like regularisation over the parameters of the teacher and student models.
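For reference, the standard EWC penalty of [7] is a quadratic term that anchors each parameter to its previous value, weighted by a diagonal Fisher information estimate. The sketch below shows that standard form with the teacher's parameters as the anchor; how exactly the baseline adapts it is not reproduced here, so the function name, arguments and the choice of anchor are assumptions.

```python
import numpy as np

def ewc_penalty(student_params, teacher_params, fisher_diag, lam):
    """(lam / 2) * sum_j F_j * (theta_student_j - theta_teacher_j)^2.

    Each argument except lam is a list of arrays with matching shapes,
    one entry per parameter tensor.
    """
    total = 0.0
    for s, t, f in zip(student_params, teacher_params, fisher_diag):
        total += np.sum(f * (s - t) ** 2)
    return 0.5 * lam * total
```

The penalty is added to the task loss of the student, so parameters with a large Fisher value are kept close to their old values while the rest remain free to adapt.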
We measure the performance by the ability of the network to solve all tasks seen up to the current point. For all tested methods we performed a random hyperparameter sweep over convolutional and dense network architectures. We present the results of the best obtained models (convolutional for ours and vaecl, dense for EWC) in Fig. 3. For the naive vaecl method the performance drops dramatically as soon as the training regime switches from MNIST to the first permuted task. For the EWC method the performance after the first task degrades less severely, but it still forgets the previous tasks. Our model, continual classification learning using generative models (CCLGM), retains high average classification accuracy (Fig. 2(a)) and low average reconstruction ELBO (Fig. 2(b)). This shows that our model is able to learn continuously and concurrently for both classification and generation.
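One way to compute such a "tasks seen so far" score is sketched below, assuming the metric is the plain mean of the test accuracies over all tasks encountered up to the current one. The accuracy-matrix layout and the function name are assumptions made for illustration.

```python
import numpy as np

def average_accuracy_seen(acc):
    """acc[k, j] = test accuracy on task j measured after training on task k.

    Returns, for each k, the mean accuracy over tasks 0..k (the tasks seen so far).
    """
    acc = np.asarray(acc, dtype=float)
    return np.array([acc[k, : k + 1].mean() for k in range(len(acc))])
```

A model that forgets shows a falling curve, because the early columns of the matrix decay as training moves on, whereas a model that retains old tasks keeps the curve close to its single-task accuracy.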
To support our initial results from the above experiments, we conducted a second set of experiments on a sequence of three different tasks: MNIST, Fashion-MNIST [12] and one MNIST permutation. The results presented in Fig. 4 show that our method outperforms the baselines and confirm our preliminary conclusion that our new model CCLGM has the ability to mitigate catastrophic forgetting in joint generative and discriminative problems.
4 Conclusion
In this work we propose a method to address continual learning in the classification setting. We use a generative model to generate input-output pairs from the previously learned tasks and use these to augment the data of the current task for further training. In this way our classification model overcomes catastrophic forgetting. Our model reuses neither past data nor previous task-specific models, and it continuously learns to concurrently classify and reconstruct data over a number of different tasks.
References
 [1] Chrisantha Fernando, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A Rusu, Alexander Pritzel, and Daan Wierstra. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 [2] Tommaso Furlanello, Jiaping Zhao, Andrew M Saxe, Laurent Itti, and Bosco S Tjan. Active long term memory networks. arXiv preprint arXiv:1606.02355, 2016.
 [3] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
 [4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 [5] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 [6] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [7] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796, 2016.
 [8] Yann LeCun, Corinna Cortes, and CJC Burges. The MNIST dataset of handwritten digits. 1998.
 [9] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. In International Conference on Learning Representations, ICLR, 2018.
 [10] Jason Ramapuram, Magda Gregorova, and Alexandros Kalousis. Lifelong generative modeling. arXiv preprint arXiv:1705.09847, 2017.
 [11] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 [12] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 [13] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3987–3995, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.