Deep unsupervised generative learning allows us to take advantage of the massive amount of unlabeled data available in order to build models that efficiently compress and learn an approximation of the true data distribution. It has numerous applications such as image denoising, inpainting, super-resolution, structured prediction, clustering, pre-training and many more. However, something that is lacking in the modern ML toolbox is an efficient way to learn these deep generative models in a lifelong setting. In many real-world scenarios we observe distributions sequentially; children in elementary school, for example, learn the alphabet letter-by-letter in a sequential manner. Other real-world examples include video data from sensors such as cameras and microphones, or other similar sequential data. A system can also be resource-limited, such that all of the past data or learnt models cannot be stored. The navigation of a resource-limited robot in an unknown environment, for instance, might require the robot to inpaint images using a generative model learnt in a previous environment.
In the lifelong learning setting we sequentially observe a single distribution at a time from a possibly infinite set of distributions. Our objective is to learn a single model that is able to generate from each of the individual distributions without the preservation of the observed data (this setting is drastically different from the online learning setting; we touch upon this in Appendix 8.2). We provide an example of such a setting in figure 1(a) using MNIST lecun-mnisthandwrittendigit-2010 , where we sequentially observe three distributions. Since we only observe one distribution at a time we need to develop a strategy for retaining the previously learnt distributions and integrating them into future learning. To accumulate additional distributions in the current generative model we utilize a student-teacher architecture similar to those in distillation methods hinton2015distilling ; furlanello2016active . The teacher contains a summary of all past distributions and is used to augment the data used to train the student model. The student model thus receives data samples from the currently observable distribution as well as synthetic data samples from previous distributions. Once a distribution shift occurs, the existing teacher model is discarded, the student becomes the teacher and a new student is instantiated.
We introduce a novel regularizer in the form of a Bayesian update rule that allows us to bring the posterior of the student close to that of the teacher for the synthetic data generated by the teacher. This allows us to build upon and extend the teacher’s inference model into the student each time the latter is re-instantiated (rather than re-learning it from scratch). By coupling this regularizer with a weight transfer from the teacher to the student we also allow for faster convergence of the student model. We empirically show that this regularizer mitigates the effect of catastrophic interference mccloskey1989catastrophic . It also ensures that even though our model evolves over time, it preserves the ability to generate samples from any of the previously observed distributions, a property we call consistent sampling.
We choose to build our lifelong generative models using Variational Autoencoders (VAEs) kingma2014 as they provide an efficient and stable way to do posterior inference, a requirement for our regularizer and data generation method (Section 4.1). Using a standard VAE decoder to generate synthetic data for the student is problematic due to two limitations of the VAE generative process, as shown in Figure 1(b): 1) sampling the prior can select a point in the latent space that lies between two separate distributions, causing generation of unrealistic synthetic data and eventually leading to loss of previously learnt distributions; 2) data points mapped to posterior regions further away from the prior mean are sampled less frequently, resulting in an undersampling of some of the constituent distributions (VAEs generate data by sampling their prior, generally an isotropic standard Gaussian, and decoding the sample through the decoder neural network; a posterior instance further from the prior mean is thus sampled less frequently). To address these sampling limitations we decompose the latent variable vector into a continuous and a discrete component (Section 4.3). The discrete component summarizes the discriminative information of the individual generative distributions while the continuous component caters for the remaining sample variability (a nuisance variable louizos2015variational ). By independently sampling the discrete and continuous components we preserve the distributional boundaries and circumvent the two VAE limitations described above.
2 Related Work
One of the key obstacles for a neural lifelong learner is the effect of catastrophic interference mccloskey1989catastrophic . Model parameters of a neural network trained in a sequential manner tend to be biased towards the distribution of the latest observations, forgetting what was previously learnt from data that is no longer accessible for training. Lifelong / continual learning aims to mitigate the effects of catastrophic interference using four major strategies: transfer learning, replay mechanisms, parameter regularization and distribution regularization.
Transfer learning: These approaches attempt to solve the problem of catastrophic interference by relaying previously learnt information to the current model. Methods such as Progressive Neural Networks rusu2016progressive and Deep Block-Modular Neural Networks terekhov2015knowledge , for example, transfer a hidden layer representation from previously learnt models into the new model. The problem with transfer learning approaches is that they generally require the preservation of all previously learnt model parameters and thus do not scale to a large number of tasks.
Replay mechanisms: Recently there have been a few efforts to use generative replay in order to avoid catastrophic interference in a classification setting shin2017continual ; kamra2017deep . These methods work by regenerating previous samples and using them (in conjunction with newly observed samples) in future learning. Neither of these, however, leverages information from the previously trained models. Instead they simply re-learn each new joint task from scratch.
Parameter regularization: Methods such as Elastic Weight Consolidation (EWC) kirkpatrick2017overcoming , Synaptic Intelligence zenke2017continual and Variational Continual Learning (VCL) nguyen2018variational constrain the parameters of the new model to be close to those of the previous model through a predefined metric. EWC kirkpatrick2017overcoming , for example, utilizes the Fisher Information Matrix (FIM) to control the change of model parameters between two tasks. Intuitively, important parameters should not have their values changed, while non-important parameters are left unconstrained. The FIM is used as a weighting in a quadratic parameter-difference regularizer under a Gaussianity assumption on the parameter posterior. This assumption has been hypothesized Neal1995-dx and later demonstrated empirically blundell2015weight to be sub-optimal for learnt neural network weights.
VCL also violates two of the requirements for a practical lifelong learning algorithm: a separate head network is added per task (reducing the solution to Progressive Neural Networks rusu2016progressive ) and a core-set of true data-samples is stored per observed distribution. Both of these requirements prevent the scalability of VCL to a truly lifelong setting.
Distribution regularization: In contrast, methods such as distillation hinton2015distilling , ALTM furlanello2016active and Learning Without Forgetting (LwF) li2016learning constrain the outputs of models from different tasks to be similar. This can be interpreted as distributional regularization by generalizing the constraining metric (or semi-metric) to a divergence on the models' output conditional distributions. One of the pitfalls of distribution regularization is that it necessitates the preservation of the previously observed data, which violates the lifelong learning setting where data from old distributions is no longer accessible.
Our work builds on the distribution regularization and replay strategies. In contrast to standard distribution regularization, where the constraint is applied on the output distribution, we apply our regularizer on the amortized, approximate posterior of the VAE (Section 4.1). In addition, we do not assume a parametric form for the distribution of the model's posterior as in EWC or VCL and allow the model to constrain the parameters between two tasks in a highly non-linear way (Section 4.2). By combining our replay mechanism with information transfer from the previous model, we increase the training efficiency in terms of the number of required training epochs (Experiment 5.1), while at the same time not preserving any previous data and only requiring constant extra model storage (we only require one teacher and one student model, as opposed to rusu2016progressive ; terekhov2015knowledge , which require keeping all previous models).
We consider an unsupervised setting where we observe a dataset $X = \{x_i\}_{i=1}^{N}$ of realizations from an unknown true distribution $P^*(x)$. We assume that the data is generated by a random process involving a non-observed random variable $z$. In order to incorporate our prior knowledge we posit a prior $P(z)$ over $z$. Our objective is to approximate the true underlying data distribution by a model $P_\theta(x)$ such that $P_\theta(x) \approx P^*(x)$.
Given a latent variable model $P_\theta(x|z)$ we obtain the marginal likelihood $P_\theta(x)$ by integrating out the latent variable $z$ from the joint distribution. The joint distribution can in turn be factorized using the conditional distribution $P_\theta(x|z)$ or the posterior $P_\theta(z|x)$:
$$P_\theta(x) = \int P_\theta(x, z)\, dz = \int P_\theta(x|z)\, P(z)\, dz \quad (1)$$
We model the conditional distribution $P_\theta(x|z)$ by a decoder, typically a neural network. Very often the marginal likelihood $P_\theta(x)$ will be intractable because the integral in equation (1) does not have an analytical form nor an efficient estimator Kingma_undated-gm . As a result, the respective posterior distribution $P_\theta(z|x)$ is also intractable.
Variational inference side-steps the intractability of the posterior by approximating it with a tractable distribution $Q_\phi(z|x)$. VAEs use an encoder (generally a neural network) to model the approximate posterior $Q_\phi(z|x)$ and optimize the parameters $\phi$ to minimize the reverse KL divergence between the approximate posterior distribution and the true posterior $P_\theta(z|x)$. Given that $Q_\phi(z|x)$ is a powerful model (such that the KL divergence against the true posterior will be close to zero) we maximize the tractable Evidence Lower BOund (ELBO) to the intractable marginal likelihood (full derivation available in Appendix Section 8.8):
$$\mathcal{L}_\theta(x) = \mathbb{E}_{Q_\phi(z|x)}\big[\log P_\theta(x|z)\big] - KL\big[Q_\phi(z|x)\,\|\,P(z)\big] \le \log P_\theta(x) \quad (2)$$
By sharing the variational parameters of the encoder across the data points (amortized inference gershman2014amortized ), variational autoencoders avoid per-data optimization loops typically needed by mean-field approaches.
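To make the background concrete, the single-sample ELBO for a Bernoulli decoder and diagonal-Gaussian posterior can be computed as below. This is a minimal NumPy sketch under those standard distributional assumptions, not the paper's implementation; the function names are our own.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    # Analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    # summed over the latent dimensions.
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(x, x_recon_prob, mu, logvar):
    # Single-sample ELBO: Bernoulli reconstruction log-likelihood
    # minus the KL of the encoder posterior q(z|x) to the prior.
    eps = 1e-7  # numerical stability for log
    log_px_given_z = np.sum(
        x * np.log(x_recon_prob + eps)
        + (1.0 - x) * np.log(1.0 - x_recon_prob + eps))
    return log_px_given_z - gaussian_kl_to_standard_normal(mu, logvar)
```

When the posterior equals the prior (zero mean, unit variance) the KL term vanishes and the ELBO reduces to the reconstruction term alone.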
3.1 Lifelong Generative Modeling
The standard setting in maximum-likelihood generative modeling is to estimate the set of parameters $\theta$ that maximize the marginal likelihood $P_\theta(x)$ for data samples $x$ generated IID from a single true data distribution $P^*(x)$. Latent variable models (such as VAEs) capture the complex structures in $P^*(x)$ by conditioning the observed variables $x$ on the latent variables $z$ and combining these in (possibly infinite) mixtures $P_\theta(x) = \int P_\theta(x|z)\, P(z)\, dz$.
Our sequential setting is vastly different from the standard approach described above. We receive a sequence of (possibly infinite) datasets $\{X_1, X_2, \ldots\}$ where each dataset $X_t$ originates from a disparate distribution $P^*_t(x)$. At any given time $t$ we observe the latest dataset $X_t$, generated from a single distribution $P^*_t(x)$, without access to any of the previously observed datasets $X_1, \ldots, X_{t-1}$. Our goal is to learn a single model that is able to generate samples from each of the observed distributions $P^*_1(x), \ldots, P^*_t(x)$, without the addition of an approximation model per observed distribution.
To enable lifelong generative learning we propose a dual model architecture based on a student-teacher model. The teacher and the student have rather different roles throughout the learning process: the teacher’s role is to preserve the memory of the previously learnt distributions and to pass this knowledge onto the student; the student’s role is to learn the distributions over the new incoming data while accommodating for the knowledge obtained from the teacher. The dual model architecture is summarized in figure 2.
The top row represents the teacher model. At any given time the teacher contains a summary of all previous distributions within the learnt parameters of its encoder and decoder. The teacher is used to generate synthetic samples from these past distributions by decoding samples from the prior through its decoder. The generated synthetic samples are passed onto the student model as a form of knowledge transfer about the past distributions.
The bottom row of figure 2 represents the student, which is responsible for updating the parameters of its encoder and decoder over the newly observed data. The student is exposed to a set of learning instances: it sees synthetic instances generated by the teacher, and real ones sampled from the currently active training distribution. The mean $\pi$ of the Bernoulli distribution controlling the sampling proportion of the previously learnt distributions to the current one is set based on the number of observed datasets: if we have seen $k$ datasets (and thus $k$ distributions) prior to the current one, then $\pi = k/(k+1)$. This ensures that all the current and past distributions are equally represented in the training set used by the student model. Once a new distribution is signalled, the old teacher is dropped, the student model is frozen and becomes the new teacher, and a new student is instantiated with the latest weights from the previous student (the new teacher).
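The batch-mixing procedure above can be sketched as follows. This is a minimal illustrative sketch, not the paper's code; `teacher_sample_fn` and the other names are our own.

```python
import numpy as np

def sample_training_batch(rng, student_data, teacher_sample_fn, k, batch_size):
    """Mix real samples from the current distribution with synthetic
    teacher samples. With k previously observed distributions the
    Bernoulli mean is pi = k / (k + 1), so all k + 1 distributions
    are equally represented in expectation."""
    pi = k / (k + 1)
    batch = []
    for _ in range(batch_size):
        if rng.random() < pi:
            # replay a previously learnt distribution via the teacher
            batch.append(teacher_sample_fn(rng))
        else:
            # draw a real sample from the currently active dataset
            batch.append(student_data[rng.integers(len(student_data))])
    return np.stack(batch)
```

With `k = 0` (no past distributions) the batch is entirely real data; as `k` grows, replayed samples dominate so that each distribution keeps an equal expected share.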
4.1 Teacher-student consistency
Our central objective is to learn a single set of parameters such that we are able to generate samples from all observed distributions. Given that we can generate samples for all previously observed distributions via the teacher model, our objective can be formalized as the maximization of the augmented ELBO over both the real data from the current distribution and the synthetic data generated by the teacher, under the assumption that the teacher's generations faithfully approximate the previously observed distributions:
The ELBO in Equation 3 is approximate due to the fact that we use the teacher's generations instead of the true data (the ELBO described in Equation 3 is still a single-sample ELBO; we overload the notation to imply that we can regenerate many samples similar to the true dataset using the decoder network). Rather than naively shrinking the full posterior to the prior via the KL divergence in Equation 3, we introduce a posterior regularizer that distills the teacher's learnt representation into the student over the generated data (while it is also possible to apply a similar cross-model regularizer to the reconstruction term, we observe that doing so hurts performance; see Appendix 8.3). We will now show how this regularizer can be perceived as a natural extension of the VAE learning objective across the combined dataset through the lens of a Bayesian update of the student posterior.
Lemma 1. For random variables $x$ and $z$ with conditionals $Q_\phi(z|x)$ and $Q_\Theta(z|x)$, both distributed as a categorical or Gaussian and parameterized by $\phi$ and $\Theta$ respectively, the KL divergence between the distributions is:
$$KL\big[\,Q_\phi(z|x)\,\|\,Q_\Theta(z|x)\,\big] = KL\big[\,Q_{f(\phi,\Theta)}(z|x)\,\|\,P(z)\,\big] + C,$$
where $f(\phi, \Theta)$ depends on the parametric form of $Q$, and $C$ is only a function of $\Theta$.
We prove Lemma 1 for the relevant distributions (under some mild assumptions) in Appendix 8.1. Using Lemma 1 and the assumption stated above, Equation 3 can be interpreted as a standard VAE ELBO under a reparameterization of the student posterior:
where the last term is constant with respect to the student parameters and thus not included in Equation 5. Recasting the problem in such a manner allows us to see that transitioning the ELBO to a sequential setting involves: 1) a term bringing the student posterior (as a function of both itself and the teacher parameters) close to the prior for previously observed data, and 2) the standard ELBO term that attempts to bring the student posterior close to the prior for the current data.
Naively evaluating the student ELBO using the synthetic teacher data and the real current data results in Equation 3. While the change seems minor, it omits the introduction of the reparameterized posterior, which allows for a transfer of information between models. In practice, we analytically evaluate the KL divergence between the teacher and the student posteriors instead of deriving the functional form of the reparameterization for each different distribution pair.
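For the common case of diagonal-Gaussian posteriors, the teacher-student KL term has a closed form. Below is a minimal NumPy sketch under that Gaussian-posterior assumption (illustrative, not the paper's implementation):

```python
import numpy as np

def kl_diag_gaussians(mu_s, logvar_s, mu_t, logvar_t):
    # Analytic KL( q_student(z|x) || q_teacher(z|x) ) for two
    # diagonal-Gaussian posteriors, summed over latent dimensions.
    var_s, var_t = np.exp(logvar_s), np.exp(logvar_t)
    return 0.5 * np.sum(
        logvar_t - logvar_s + (var_s + (mu_s - mu_t) ** 2) / var_t - 1.0)
```

The divergence is zero exactly when student and teacher posteriors agree, and grows as the student drifts away from the teacher on the replayed data.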
4.2 Contrast To EWC
Our distribution regularizer affects the same parameters as parameter regularizer methods such as EWC. However, it does so in a non-linear manner dependent on the underlying network structure as opposed to the fixed functional form of the distance metric in EWC. We will demonstrate in our experiments that our proposed method does no worse than EWC in the worst case (i.e. when the EWC constraint is a valid distance metric assumption as in Experiment 5.1), but drastically outperforms EWC in the case when this is not true (Experiment 5.3).
|EWC|Lifelong (Isotropic Gaussian Posterior)|
|$\frac{\gamma}{2}\sum_i F_i\,(\phi_i - \Theta_i)^2$|$\frac{1}{2}\sum_j\big[\log\frac{\sigma_{\Theta,j}^2}{\sigma_{\phi,j}^2} + \frac{\sigma_{\phi,j}^2 + (\mu_{\phi,j} - \mu_{\Theta,j})^2}{\sigma_{\Theta,j}^2} - 1\big]$|
In the above table we examine the distance metric used to minimize the effects of catastrophic interference in both EWC and our proposed Lifelong method. While our method can operate over any distribution that has a tractable KL-divergence, for the purposes of demonstration we examine the simple case of an isotropic Gaussian latent-variable posterior. EWC directly enforces a quadratic constraint on the model parameters, while our method indirectly affects the same parameters through a regularization of the posterior distribution. For a given input, the only freedom the Lifelong model has is to change the posterior parameters; it does so in a non-linear way (the parameters of the distribution are modeled by a deep neural network) such that the analytical KL shown above is minimized.
4.3 Latent variable
A critical component of our model is the synthetic data generation by the teacher's model. The synthetic samples need to be representative of all the previously observed distributions in order to provide the student with ample information about the learning history. Considering only the case of teacher-generated samples: the minibatch of $N$ samples received by the student after $k$ distributions should contain approximately $N/k$ samples from each of the $k$ observed distributions in order to prevent catastrophic forgetting.
A simple unimodal prior distribution, such as the isotropic Gaussian typically used in classical VAEs (see Figure 1(b)), results in an undersampling of distributions whose posteriors are further away from the prior mean. This in turn leads to catastrophic forgetting of the undersampled distributions in the student model. We circumvent this problem by decomposing the latent variable into a conditionally independent discrete component $z_d$ and a continuous component $z_c$. We assume a uniform multivariate discrete prior for the discrete component and a multivariate standard normal prior for the continuous component. The discrete component is used to summarize the most discriminative information about each of the true generating distributions, while the continuous component attends to the sample variability (a nuisance variable louizos2015variational ). This split allows us to directly generate synthetic samples per observed distribution by fixing the discrete component $z_d$, varying the continuous sample $z_c$ and decoding through the decoder network.
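Independent sampling of the two latent components can be sketched as below (an illustrative sketch with our own names; fixing the discrete index targets one previously observed distribution):

```python
import numpy as np

def sample_latent(rng, num_distributions, continuous_dim, fixed_discrete=None):
    """Independently sample the discrete component z_d (uniform
    categorical, one-hot) and the continuous component z_c (standard
    normal), then concatenate them into one latent vector."""
    j = fixed_discrete if fixed_discrete is not None \
        else rng.integers(num_distributions)
    z_d = np.zeros(num_distributions)
    z_d[j] = 1.0                       # one-hot discrete component
    z_c = rng.standard_normal(continuous_dim)  # nuisance variability
    return np.concatenate([z_d, z_c])
```

Because the two components are sampled independently of each other, every discrete value (and hence every summarized distribution) is selected with equal probability, avoiding the undersampling issue of a single unimodal prior.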
4.4 Information restricting regularizer
In order to enforce that the discrete component $z_d$ carries the most relevant information for the generative process (which in turn allows us to easily generate samples from an individual previously observed distribution), we introduce a negative information gain regularizer between the continuous representation $z_c$ and the generated data $\hat{x}$: $-I(z_c; \hat{x}) = -\big(H(z_c) - H(z_c|\hat{x})\big)$, where $H(z_c)$ denotes the marginal entropy of $z_c$ and $H(z_c|\hat{x})$ denotes the conditional entropy of $z_c$ given $\hat{x}$. This prevents the model from primarily using the continuous representation while disregarding the discrete one, i.e. the pathological solution in which $z_d$ is ignored. We utilize a lower bound for this term in a similar manner as done in InfoGAN NIPS2016_6399 ; goodfellow2014generative :
Rather than maximizing the mutual information between $z_d$ and $\hat{x}$ (as in InfoGAN), which would introduce a min-max optimization problem, we instead minimize the information of the continuous component as an equivalent problem formulation. Since our model doesn't utilize skip connections, information from the input data has to flow through the latent variables to reach the decoder. Minimizing the information gain between $z_c$ and the generated decoded sample thus forces the model to dominantly use $z_d$.
In contrast to InfoGAN, VAEs already estimate the posterior and thus do not need the introduction of any extra parameters for the approximation. Finally, as opposed to InfoGAN, which uses the variational bound (twice) on the mutual information huszar_2016 , our regularizer has a clear interpretation: it restricts information through a specific latent variable within the computational graph. We observe that this constraint is essential for empirical performance of our model and empirically validate this in our ablation study in Experiment 5.2.
4.5 Learning Objective
The hyper-parameter $\lambda$ controls the importance of the information gain regularizer. Too large a value for $\lambda$ causes a lack of sample diversity, while too small a value causes the model to not use the discrete latent distribution. We did a random hyperparameter search and determined a single value of $\lambda$ to be a reasonable choice for all of our experiments. This is in line with the value used in InfoGAN NIPS2016_6399 for continuous latent variables. We empirically validate the necessity of both terms proposed in Equation 7 in our ablation study in Experiment 5.2.
In all of our experiments we focus on the performance benefits our architecture and augmented learning objective bring to the lifelong learning setting, which is the main motivation of our work. To do this we divide our experiments into three distinct problems, namely sequential learning of similar distributions (Experiments 5.1, 5.2), sequential learning of disparate distributions (Experiment 5.3) and finally a complex transfer learning problem (Experiment 5.4). Lifelong learning over similar distributions allows us to examine the re-usability of the learnt feature representation; on the other hand, lifelong learning over disparate distributions and the complex transfer learning setting allow us to explore the extent to which our model can accommodate new information without forgetting previously learnt representations. We evaluate our model and the baselines over standard datasets used in other state-of-the-art lifelong / continual learning literature nguyen2018variational ; zenke2017continual ; shin2017continual ; kamra2017deep ; kirkpatrick2017overcoming ; rusu2016progressive . While these datasets are simple in a classification setting, transitioning to a lifelong-generative setting scales the problem complexity substantially. We give details specific to each experiment in their individual sections. Some of the commonalities between the experiments are described below.
In Experiment 5.1 and Experiment 5.3 we compare our model to a set of EWC baselines. For comparability, we use the same student-teacher architecture as in our model, but instead of our consistency regularizer we augment the VAE ELBO with the EWC distance metric between the student and teacher models. Since we do not have access to the true log-likelihood we estimate the diagonal Fisher information matrix from the ELBO:
|EWC Learning Objective|Fisher Approximation|
|$\mathcal{L}_\phi(x) - \frac{\gamma}{2}\sum_i F_i\,(\phi_i - \Theta_i)^2$|$F_i = \mathbb{E}\big[\big(\nabla_{\Theta_i}\mathcal{L}_\Theta(x)\big)^2\big]$|
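A toy sketch of the diagonal Fisher estimate and the resulting EWC penalty is shown below. This is illustrative only: an analytic toy gradient stands in for the ELBO gradient, and all names are our own.

```python
import numpy as np

def diagonal_fisher(grad_log_lik_fn, params, data):
    """Estimate the diagonal Fisher information as the empirical mean
    of squared per-sample gradients of the log-objective (the ELBO in
    the paper's setting; a toy analytic gradient here)."""
    sq = np.zeros_like(params)
    for x in data:
        g = grad_log_lik_fn(params, x)
        sq += g ** 2
    return sq / len(data)

def ewc_penalty(fisher, params, old_params, gamma):
    # Quadratic EWC regularizer: gamma/2 * sum_i F_i (theta_i - theta*_i)^2
    return 0.5 * gamma * np.sum(fisher * (params - old_params) ** 2)
```

For a Gaussian-mean toy model with unit variance, the per-sample score is `x - theta` and the true Fisher information equals 1, which the empirical estimate recovers.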
We utilize two major performance metrics for our experiments: the negative test ELBO and the Frechet distance (as proposed in heusel2017gans ; more details about this metric are provided in Appendix Section 8.12). The negative test ELBO provides a lower bound to the test log-likelihood of the true data distribution, while the Frechet distance gives us a quantification of the quality and diversity of the generated samples. Note that lower values are better for both metrics. In both Experiment 5.1 and Experiment 5.3, we run each model five times and report the means and standard deviations. We utilize both fully convolutional (-C-) and fully dense (-D-) architectures and list the top performing models and baselines. We provide the entire set of analyzed baselines in Appendix Section 8.6. In addition, all network architectures and other optimization details are provided in Appendix Section 8.4 as well as in our git repository.
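For reference, the Frechet distance between two Gaussians has a closed form. The sketch below assumes diagonal covariances, which avoids the matrix square root of the full formula (the actual metric uses full covariances over embedded features):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with diagonal
    covariances: ||mu1 - mu2||^2 + sum_i (sqrt(var1_i) - sqrt(var2_i))^2.
    For diagonal covariances the matrix square root in
    Tr(S1 + S2 - 2 (S1 S2)^{1/2}) reduces to elementwise operations."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return mean_term + cov_term
```

The distance is zero only when the two Gaussians match in both mean and covariance, so it penalizes generators that are accurate on average but lack diversity.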
5.1 Fashion MNIST : Sequential Generation
In this experiment, we demonstrate the performance benefit our architecture and augmented learning objective (Equation 7) bring to the continual learning of a set of related distributions. The hypothesis for working over similar distributions is that models should leverage previously learnt features and use them for future learning. We compare our method (lifelong-$\lambda$, where $\lambda$ is the mutual information hyperparameter) against a standard VAE (vanilla-vae), a VAE that observes all the data (full-vae), a VAE that observes all the data up to (inclusively) the current distribution (upto-vae) and finally a set of EWC baselines (ewc-$\gamma$, where $\gamma$ is the EWC hyperparameter value). The lifelong, ewc and vanilla models only observe one dataset at a time and do not have access to any of the previous true datasets. In order to fairly evaluate the test ELBO, we utilize the same graphical model (i.e. the discrete and continuous latent variables) for all models.
We use Fashion MNIST xiao2017/online to simulate our continual learning setting (we also report MNIST results for the same experiment in Appendix 8.7). We treat each object as a different distribution and present the model with samples drawn from a single distribution at a time. We sequentially progress over the ten available distributions and report performance metrics on the respective test set at the end of training (quantified by an early-stopping criterion). Note that the test set is incrementally increased, e.g. at the second distribution the test set contains samples from the first and second test datasets. Since the cardinality of the test set increases, we will observe an increase in the negative test ELBO and Frechet distance. This is due to the fact that the model needs to be able to not only reconstruct (or generate, in the case of the Frechet metric) the dataset that it just observed, but also all previous test sets.
The full-vae and upto-vae models present the best attainable performance as they have access to all the previous data at all times, and thus do not suffer from catastrophic interference. We observe that our model does at least as well as EWC with respect to the test ELBO (Figure 3a), while it does considerably better with respect to the log-Frechet distance (Figure 3b). We surmise the improvement with regards to sample generation is due to the fact that our model can generate distinct, high quality samples (Figure 5) while avoiding mixing or under-sampling, thanks to the joint interaction of the consistency regularizer (Section 4.1) and the information gain regularizer (Section 4.4).
Finally, we evaluate the number of epochs needed to train both an ewc and a lifelong model in Figure 3(c). We quantify the convergence time as the number of epochs it takes a model to trigger an early-stopping criterion on the validation set. If a model converges faster, we surmise that it is efficiently using information from previous learning. We observe that the ewc model initially converges faster than our lifelong model, but does so at the expense of significantly worse sample generation (Figure 3b, Figure 5a-left), minimizing its usefulness in a lifelong generative setting. The lifelong model successively requires fewer and fewer epochs for convergence and finally reaches the same number of epochs as ewc. We also extended this experiment to a much larger set of similar distributions in Appendix 8.5 and observed that our model does outperform ewc in such scenarios. This confirms the benefits our new objective formulation brings to the lifelong setting, where previous knowledge is retained and used in future learning.
5.2 MNIST: Ablation Study
In order to independently evaluate the benefit of our proposed Bayesian update regularizer (Section 4.1) and the negative information gain term (Section 4.4) we perform a simple ablation study examining the Frechet distance over a set of sequential distributions. We also visualize sequential generations from the final student model as in the previous experiment. We utilize the MNIST dataset instead of Fashion MNIST in order to provide experiment diversity. The dataset is divided and iterated over as in the previous experiment.
In contrast to the previous experiment we evaluate three scenarios: 1) with both the consistency and mutual information regularizers, 2) with only the consistency regularizer and 3) without both regularizers. For this experiment we also fix the seed used by pytorch paszke2017automatic and numpy ascher.dubois.hinsen.hugunin.oliphant-1999-np such that the effects of initialization and dataset shuffling are non-existent (we only run a single experiment here since multiple trials produce the same solution).
We observe that both components are necessary in order to generate high quality samples, as evidenced by the log-Frechet distance (Figure 6a); this effect gets more pronounced when the dimensionality of the continuous latent variable is increased. The generations produced without the information gain regularizer (Figure 6c) are blurry for all but the last two observed distributions (eight and nine in this case). This can be attributed to two possibilities: 1) uniformly sampling the discrete component is not guaranteed to generate samples from each of the unique, previously approximated distributions (see the mixing issue in Figure 1b) and 2) the discrete posterior distribution leverages more information from the continuous component, causing catastrophic forgetting.
5.3 Permuted MNIST: Sequential Generation
In this experiment we examine the capability of our model to generate and recall completely different distributions. This setting differs from Experiment 5.1 in that the models cannot leverage previously learnt feature representations for future learning. We apply a set of unique, fixed image permutations to the entire MNIST dataset. We create 5 such datasets and sequentially progress over them in a similar manner to Experiment 5.1. We use an unpermuted version of the MNIST dataset to simulate the first distribution as it allows us to visually assess the degradation of reconstructions. This is a common setup utilized in continual learning kirkpatrick2017overcoming ; zenke2017continual and we extend it here to the generative setting.
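Constructing such permuted datasets can be sketched as below (an illustrative helper, not the paper's code; each seed yields one fixed pixel permutation):

```python
import numpy as np

def make_permuted_dataset(images, seed):
    """Apply one fixed pixel permutation to every image, yielding a
    'new' distribution with identical pixel statistics but no shared
    spatial structure. images: array of shape (N, H*W)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])
    return images[:, perm], perm

def unpermute(images, perm):
    # Invert the fixed permutation to recover the original layout.
    inv = np.argsort(perm)
    return images[:, inv]
```

Because the permutation is fixed across the whole dataset, it is invertible; distinct seeds produce distributions that share no spatial structure, which is what makes feature transfer between them ineffective.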
EWC works well when the learnt parameters of the old task are relevant for the learning of the new task. In this experiment however, EWC is forced to accommodate the new task, while still preserving the parameters learnt over a drastically different old task. This is ill posed for the rigid EWC distance metric (Section 4.2).
The lifelong implementation on the other hand allows the model to flexibly adapt its distance metric (Section 4.2) in order to learn an appropriate constraint for preserving both the current distribution and the previous distributions. This is due to the fact that we constrain the latent posterior distribution (Section 4.1) and keep the conditional distribution similar to that of the previous task through the data augmentation step, rather than simply constraining by a quadratic parameter difference. In these experiments we see the lifelong model outperform all other models (barring the upto-vae and full-vae, which present the best attainable performance) in terms of both reconstructions (Figure 7a) and generations (Figure 7b).
5.4 SVHN to MNIST
In this experiment we explore the ability of our model to retain and transfer knowledge across completely different datasets. We use MNIST and SVHN netzer2011reading to demonstrate this. We treat all samples from SVHN as being generated by one distribution and all the MNIST samples as generated by another distribution, irrespective of the specific digit. (MNIST was resized to 32x32 and converted to RGB to make it consistent with the dimensions of SVHN.)
We visualise examples of the true inputs and the respective reconstructions in figure 9(a). Even though the only true data the final model received for training were from MNIST, it can still reconstruct SVHN data observed previously. This confirms the ability of our architecture to transition between complex distributions while preserving the knowledge learned from previously observed distributions. Finally, in figures 9(b) and 9(c) we illustrate the data generated from an interpolation of a 2-dimensional continuous latent space; for this we specifically trained the models with a 2-dimensional continuous latent variable. To generate the data, we fix the discrete categorical variable to one of its possible values, linearly interpolate the continuous variable over its range, and decode each latent configuration into a sample. The model learns a common continuous structure for the two distributions, which can be seen by following the development of the generated samples from top left to bottom right in both figure 9(b) and 9(c).
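The interpolation procedure above can be sketched as follows (the decoder stand-in, grid size, and interpolation range are illustrative assumptions, not the paper's exact values):

```python
import numpy as np

def interpolate_latent(decoder, category, num_categories,
                       grid=5, lo=-3.0, hi=3.0):
    """Fix the discrete latent to `category` (one-hot) and linearly
    interpolate a 2-d continuous latent over [lo, hi]^2, decoding
    each latent configuration into a sample."""
    one_hot = np.zeros(num_categories)
    one_hot[category] = 1.0
    samples = []
    for z1 in np.linspace(lo, hi, grid):
        for z2 in np.linspace(lo, hi, grid):
            z = np.concatenate([one_hot, [z1, z2]])
            samples.append(decoder(z))
    return np.stack(samples)  # [grid*grid, ...] decoded samples

# Toy decoder stand-in: maps the latent to a flat "image".
toy_decoder = lambda z: np.tile(z.sum(), 784)
imgs = interpolate_latent(toy_decoder, category=0, num_categories=2)
```

Arranging the resulting grid of decoded samples row by row produces the top-left-to-bottom-right progression shown in figures 9(b) and 9(c).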
6 Conclusion
In this work we propose a novel method for learning generative models in a lifelong setting. The principal assumption is that the data are generated by multiple distributions and presented to the learner sequentially. A key constraint on the learning process is that the method has no access to any of the old data and must distill all the necessary information into a single final model. The proposed method is based on a dual student-teacher architecture, where the teacher's role is to preserve past knowledge and aid the student in future learning. We argue for and augment the standard VAE ELBO objective with terms that aid the teacher-student knowledge transfer, and demonstrate the benefits this augmented objective brings through a series of lifelong learning experiments. The architecture, combined with the proposed regularizers, mitigates catastrophic interference by supporting the retention of previously learned knowledge (models).
7 Future Work
The standard assumption in current lifelong / continual learning approaches nguyen2018variational ; zenke2017continual ; shin2017continual ; kamra2017deep ; kirkpatrick2017overcoming ; rusu2016progressive is to rely on known, fixed distribution transition boundaries instead of learning them. In future work we will attempt to learn these transition boundaries jointly with the distribution approximations. We are also looking into scaling our model by utilizing GANs for better image generations, specifically through the usage of ALI dumoulin2016adversarially and BiGAN donahue2016adversarial, methods that learn a posterior distribution within the GAN framework.
- (1) D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. Oliphant. Numerical Python. Lawrence Livermore National Laboratory, Livermore, CA, ucrl-ma-128569 edition, 1999.
- (2) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural network. International Conference on Machine Learning, pages 1613–1622, 2015.
- (3) L. Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.
- (4) T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational bayes. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 1727–1735, 2013.
- (5) X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016.
- (6) J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
- (7) V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
- (8) T. Furlanello, J. Zhao, A. M. Saxe, L. Itti, and B. S. Tjan. Active long term memory networks. arXiv preprint arXiv:1606.02355, 2016.
- (9) S. Gershman and N. Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Cognitive Science Society, volume 36, 2014.
- (10) X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pages 249–256, 2010.
- (11) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- (12) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6629–6640, 2017.
- (13) G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
- (14) M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- (15) F. Huszar. Infogan: using the variational bound on mutual information (twice), Aug 2016.
- (16) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- (17) V. Jain and E. Learned-Miller. Online domain adaptation of a pre-trained cascade of classifiers. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 577–584. IEEE, 2011.
- (18) E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. International Conference on Learning Representations, 2017.
- (19) N. Kamra, U. Gupta, and Y. Liu. Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368, 2017.
- (20) M. Karpinski and A. Macintyre. Polynomial bounds for vc dimension of sigmoidal and general pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169–176, 1997.
- (21) D. P. Kingma. Variational Inference & Deep Learning: A New Synthesis. PhD thesis, 2017.
- (22) D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. 2015.
- (23) D. P. Kingma and M. Welling. Auto-encoding variational bayes. ICLR, 2014.
- (24) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835, 2017.
- (25) Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
- (26) Z. Li and D. Hoiem. Learning without forgetting. In European Conference on Computer Vision, pages 614–629. Springer, 2016.
- (27) C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. ICLR, 2016.
- (28) C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- (29) M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation, 24:109–165, 1989.
- (30) K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
- (31) R. M. Neal. Bayesian Learning For Neural Networks. PhD thesis, University of Toronto, 1995.
- (32) Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, page 5, 2011.
- (33) C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner. Variational continual learning. ICLR, 2018.
- (34) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- (35) N. C. Rabinowitz, G. Desjardins, A.-A. Rusu, K. Kavukcuoglu, R. T. Hadsell, R. Pascanu, J. Kirkpatrick, and H. J. Soyer. Progressive neural networks, Nov. 23 2017. US Patent App. 15/396,319.
- (36) H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2994–3003, 2017.
- (37) E. D. Sontag. Vc dimension of neural networks. NATO ASI Series F Computer and Systems Sciences, 168:69–96, 1998.
- (38) A. V. Terekhov, G. Montone, and J. K. O’Regan. Knowledge transfer in deep block-modular neural networks. In Proceedings of the 4th International Conference on Biomimetic and Biohybrid Systems-Volume 9222, pages 268–279. Springer-Verlag New York, Inc., 2015.
- (39) S. Thrun. Lifelong learning: A case study. Technical report, CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE, 1995.
- (40) S. Thrun and T. M. Mitchell. Lifelong robot learning. In The biology and technology of intelligent autonomous agents, pages 165–196. Springer, 1995.
- (41) H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- (42) F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995, 2017.
8.1 Understanding the Consistency Regularizer
The analytical derivations of the consistency regularizer show that it can be interpreted as a transformation of the standard VAE regularizer. In the case of an isotropic gaussian posterior, the proposed regularizer scales the mean and variance of the student posterior by the variance of the teacher and adds an extra 'volume' term. This interpretation shows that the proposed regularizer preserves the same learning objective as that of the standard VAE. Below we present the analytical form of the consistency regularizer with categorical and isotropic gaussian posteriors:
We assume the learnt posterior of the teacher is parameterized by a centered, isotropic gaussian and the posterior of our student by a non-centered, isotropic gaussian; the consistency regularizer then takes a closed form in the parameters of the two gaussians.
Via a reparameterization of the student’s parameters:
It is also interesting to note that our posterior regularizer reduces to the standard prior regularizer when the teacher's posterior matches the unit gaussian prior.
We parameterize the learnt posteriors of the teacher and the student as categorical distributions, and redefine the corresponding normalizing constants for the teacher and student models respectively. The reverse KL divergence in equation 15 can now be re-written as:
where $H(\cdot)$ is the entropy operator and $H(\cdot,\cdot)$ is the cross-entropy operator. ∎
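The gaussian consistency term discussed above reduces to a closed-form KL between two diagonal gaussians; a minimal numerical sketch (the function name and example values are illustrative):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ),
    usable as a sketch of the student-to-teacher consistency term."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Teacher: centered isotropic gaussian; student: non-centered.
mu_t, var_t = np.zeros(4), np.ones(4)
mu_s, var_s = 0.5 * np.ones(4), 2.0 * np.ones(4)
reg = kl_diag_gaussians(mu_s, var_s, mu_t, var_t)
```

When the teacher's posterior is the unit gaussian (zero mean, unit variance), this expression coincides with the standard VAE prior term evaluated at the student's parameters.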
8.2 Contrast to streaming / online methods
Our method has similarities to streaming methods such as SVB broderick2013streaming in that we estimate and refine posteriors through time. In general this can be done through the following Bayesian update rule:
SVB computes the intractable posterior utilizing an approximation that accepts as input the current dataset, along with the previous posterior.
The first posterior input to the approximating function is the prior. The objective of SVB and other streaming methods is to model the posterior of the currently observed data as well as possible. Our setting differs in that we want to retain information from all previously observed distributions (sometimes called a knowledge store). This can be useful in scenarios where a distribution is seen once but only used much later down the road. Rather than creating a posterior update rule, we recompute the posterior via equation (11), leveraging the fact that we can re-generate samples from previous distributions through the generative process.
We demonstrate in Section 4.1 that doing this allows us to not only maintain a posterior for the latest distribution, but keep a common global posterior across all previous distributions.
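A minimal sketch of the teacher-driven data augmentation described above (the function name, mixing ratio, and stand-in decoder are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def augment_with_teacher(current_batch, teacher_decode,
                         teacher_prior_sample, num_seen, num_total):
    """Mix real samples from the current distribution with synthetic
    samples re-generated by the teacher, in proportion to how many of
    the `num_total` distributions were previously seen (`num_seen`)."""
    n = len(current_batch)
    n_synthetic = int(n * num_seen / num_total)
    z = [teacher_prior_sample() for _ in range(n_synthetic)]
    synthetic = (np.stack([teacher_decode(zi) for zi in z]) if z
                 else np.empty((0,) + current_batch.shape[1:]))
    return np.concatenate([current_batch[: n - n_synthetic], synthetic])

# Stand-ins: ones for the "real" batch, zeros decoded by the teacher.
batch = np.ones((10, 4))
mixed = augment_with_teacher(batch, lambda z: np.full(4, z),
                             lambda: 0.0, num_seen=1, num_total=2)
```

The student therefore never touches stored past data; the teacher's generative process stands in for all previously observed distributions.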
Finally, another key difference between lifelong learning and online methods is that lifelong learning aims to learn from a sequence of different tasks while retaining and accumulating knowledge; online learning generally assumes that the data come from a single underlying distribution. There are some exceptions, where online learning is applied to the problem of domain adaptation, e.g. the online adaptation of a pre-trained cascade of classifiers.
8.3 Reconstruction Regularizer
It is also possible to constrain the reconstruction term of the VAE in a similar manner to the posterior consistency regularizer; however, this results in diminished model performance. We hypothesize that this is because such a regularizer contradicts the objective of the reconstruction term in the ELBO, which already aims to minimize a metric between the input samples and the reconstructed samples; e.g. if the likelihood is an isotropic gaussian, then the loss is proportional to the standard L2 loss. Without this cross-model reconstruction regularizer, the model also has more flexibility in how it reconstructs the output samples.
In order to quantify this, we run Experiment 5.1 utilizing two dense models (-D): one with only the consistency regularizer (without-LL-D) and one with both the consistency and likelihood regularizers (with-LL-D). We observe a drop in model performance (with respect to the Frechet distance as well as the test ELBO) in the case of with-LL-D, as demonstrated in Figure 11.
8.4 Model Architecture
We utilized two different architectures for our experiments. For the dense networks (-D- in experiments) the encoder used two 512-unit layers to map to the latent representation and the decoder used two 512-unit layers to map back to the reconstruction. We used batch norm and ELU (and sometimes SELU) activations for all layers barring the layer projecting into the latent representation and the output layer. Note that while we used the same architecture for EWC, we observed a drastic negative effect when using batch norm and thus dropped its usage there. The convolutional architectures used the encoder and decoder described below (where the decoder used conv-transpose layers for upsampling). The notation is [OutputChannels, (filterX, filterY), stride]:
| Method | Initial dimension | Final dimension | Latent dimension | # initial parameters | # final parameters |
The table above lists the number of parameters for each model and architecture for Experiment 5.1. The lifelong models initially start with a discrete latent representation of dimension 1, and at each step we grow the representation by one dimension to accommodate the new distribution (more detail in Section 8.10). In contrast, the baselines are provided with the full representation throughout the learning process. EWC has double the number of parameters because the computed diagonal Fisher information matrix has the same dimensionality as the number of parameters; EWC also needs to preserve the teacher model for use in its quadratic regularizer. Both the vanilla and full models have the fewest parameters as they do not use a student-teacher framework and only use one model; however, the vanilla model has no protection against catastrophic interference and the full model is only used as an upper bound on performance.
We utilized Adam to optimize all of our problems with a learning rate of 1e-4 or 1e-3. When we utilized weight transfer, we re-initialized the accumulated momentum vectors of Adam as well as the aggregated mean and covariance of the batch norm layers. The full architecture can be examined in our github repository (https://github.com/jramapuram/LifelongVAE_pytorch) and is provided under an MIT license.
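The dense encoder described above can be sketched in a few lines (a NumPy forward pass only; batch norm is omitted for brevity, and the helper names and latent size are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation used on all hidden layers."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def dense_encoder(x, params):
    """Two 512-unit ELU layers, then a linear projection to the latent
    representation (no activation on the final projection, as above)."""
    h = elu(x @ params["W1"] + params["b1"])
    h = elu(h @ params["W2"] + params["b2"])
    return h @ params["Wz"] + params["bz"]

def init_params(in_dim, hidden=512, latent_dim=10, seed=0):
    """Glorot-style initialization for the three dense layers."""
    rng = np.random.RandomState(seed)
    glorot = lambda i, o: rng.randn(i, o) * np.sqrt(2.0 / (i + o))
    return {"W1": glorot(in_dim, hidden), "b1": np.zeros(hidden),
            "W2": glorot(hidden, hidden), "b2": np.zeros(hidden),
            "Wz": glorot(hidden, latent_dim), "bz": np.zeros(latent_dim)}

params = init_params(784)
z = dense_encoder(np.random.rand(8, 784), params)
```

The decoder mirrors this structure, mapping the latent representation back through two 512-unit layers to the reconstruction.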
8.5 Extending Number Of Sequential Distributions
8.6 EWC Baselines: Comparing Conv & Dense Networks
We compared a whole range of EWC baselines and utilized the best performing few in our experiments. Listed in Figure 8.6 are the full range of EWC baselines run on the Permuted (Section 5.3) and Fashion (Section 5.1) experiments. Recall that C / D denotes whether a model is convolutional or dense, and the number following it is the hyperparameter for the EWC or Lifelong VAE.
8.7 MNIST: Sequential Generation
8.8 ELBO Derivation
Variational inference side-steps the intractability of the posterior distribution by approximating it with a tractable distribution; we then optimize the variational parameters in order to bring this distribution close to the true posterior. The form of this approximate distribution is fixed and is generally conjugate to the prior. Variational inference thus converts the problem of posterior inference into an optimization problem over the variational parameters, which allows us to utilize stochastic gradient descent. To be more concrete, variational inference minimizes the reverse Kullback-Leibler (KL) divergence between the variational posterior distribution and the true posterior.
Rearranging the terms in equation 15 and utilizing the fact that the KL divergence is non-negative, we can derive the evidence lower bound (ELBO), which is the objective function we directly optimize:
In order to backpropagate it is necessary to remove the dependence on the stochastic latent variable. To achieve this, we push the sampling operation outside of the computational graph via the reparameterization trick for the normal distribution and the gumbel-softmax reparameterization [28, 18] for the discrete distribution. In essence, the reparameterization trick introduces a noise distribution that is not a function of the data or computational graph in order to move the gradient operator into the expectation.
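For the gaussian case, the reparameterization trick amounts to a few lines (a NumPy sketch; in the actual model the gradient flows through mu and log_var in the autodiff framework):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the sampling noise is
    drawn outside the computational graph, so gradients can flow
    through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros((4, 2)), np.zeros((4, 2))
z = reparameterize(mu, log_var, rng)
```

Because eps is independent of the model parameters, the expectation over z can be rewritten as an expectation over eps, letting the gradient operator move inside.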
8.9 Gumbel Reparameterization
The Gumbel-Softmax reparameterization over logits (the linear output of the last layer in the encoder) and an annealed temperature parameter is defined via a temperature-scaled softmax of the logits perturbed by Gumbel noise.
As the temperature parameter approaches 0, the Gumbel-Softmax distribution converges to a one-hot categorical distribution.
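A minimal sketch of the Gumbel-Softmax sample (the epsilon constants are for numerical stability and are an implementation detail, not from the paper):

```python
import numpy as np

def gumbel_softmax(logits, temperature, rng):
    """Draw a relaxed one-hot sample: add Gumbel(0, 1) noise to the
    logits and apply a temperature-annealed softmax."""
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) noise
    y = (logits + g) / temperature
    y = y - y.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
soft = gumbel_softmax(np.array([2.0, 0.5, 0.1]), temperature=1.0, rng=rng)
hard = gumbel_softmax(np.array([2.0, 0.5, 0.1]), temperature=1e-3, rng=rng)
```

At a high temperature the sample is a smooth simplex point; at a very low temperature it is nearly one-hot, matching the limiting behavior described above.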
8.10 Expandable Model Capacity and Representations
Multilayer neural networks with sigmoidal activations have a VC dimension bounded between $O(\rho^2)$ and $O(\rho^4)$, where $\rho$ is the number of parameters. A model that is able to consistently incorporate new information should also be able to expand its VC dimension by adding new parameters over time. Our formulation imposes no restrictions on the model architecture: new layers can be added freely to the new student model.
In addition, we also allow the dimensionality of our discrete latent representation to grow in order to accommodate new distributions. This is possible because the KL divergence between two categorical distributions of different sizes can be evaluated by simply zero-padding the teacher's smaller discrete distribution. Since we also transfer weights between the teacher and the student model, we need to handle the case of expanding latent representations appropriately: when we add a new distribution, we copy all the weights besides the ones immediately surrounding the projection into and out of the latent distribution; these surrounding weights are reinitialized to their standard Glorot initializations.
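The zero-padded categorical KL described above can be sketched as follows (the epsilon smoothing is an implementation convenience to avoid log(0), not part of the paper's formulation):

```python
import numpy as np

def kl_categorical_padded(q_student, p_teacher):
    """KL(q || p) where the teacher's smaller categorical is
    zero-padded (with a tiny epsilon for numerical safety) up to the
    student's size."""
    pad = len(q_student) - len(p_teacher)
    p = np.concatenate([p_teacher, np.zeros(pad)]) + 1e-9
    p = p / p.sum()
    q = np.asarray(q_student, dtype=float) + 1e-9
    q = q / q.sum()
    return np.sum(q * np.log(q / p))

teacher = np.array([0.5, 0.5])        # 2 previously seen distributions
student = np.array([0.4, 0.4, 0.2])   # grown by one new distribution
kl = kl_categorical_padded(student, teacher)
```

Any student mass placed on the newly added category is heavily penalized against the padded teacher, which is exactly what allows the new dimension to specialize on the new distribution during training.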
8.11 Forward vs. Reverse KL
In our setting we have the ability to utilize the zero-forcing (reverse, or mode-seeking) KL divergence or the zero-avoiding (forward) KL divergence. In general, if the true underlying posterior is multi-modal, it is preferable to operate with the reverse KL divergence ([30], Section 21.2.2). In addition, utilizing the mode-seeking KL divergence generates more realistic results when operating over image data.
To validate this, we repeat the experiment in Section 5.1 and train two models: one with the forward KL posterior regularizer and one with the reverse. We evaluate the negative ELBO mean and variance over ten trials. Empirically, we observed no difference between the two measures, as demonstrated in figure 12.
8.12 Frechet Performance Metric
The idea proposed in [12] is to utilize a trained classifier model to compare the feature statistics (generally under a Gaussianity assumption) between synthetic samples of the generative model and samples drawn from the test set. If the Frechet distance between these two distributions is small, then the generative model is said to be generating realistic and diverse images. The Frechet distance between two gaussians with means $\mu_1, \mu_2$ and corresponding covariances $\Sigma_1, \Sigma_2$ is: $d^2 = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right)$.
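A minimal NumPy sketch of this distance (the eigendecomposition-based matrix square root is a simplification that is exact when the covariances commute, e.g. in the diagonal case; production FID implementations typically use a dedicated matrix square root):

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Frechet distance between N(mu1, cov1) and N(mu2, cov2):
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    diff = mu1 - mu2
    prod = cov1 @ cov2
    # Matrix square root via eigendecomposition (exact when the
    # covariances commute, e.g. both diagonal).
    eigvals, eigvecs = np.linalg.eig(prod)
    sqrt_prod = (eigvecs * np.sqrt(np.abs(eigvals))) @ np.linalg.inv(eigvecs)
    return diff @ diff + np.trace(cov1 + cov2 - 2.0 * sqrt_prod.real)

# Two unit-covariance gaussians whose means differ by the ones vector.
d = frechet_distance(np.zeros(2), np.eye(2), np.ones(2), np.eye(2))
```

For identical covariances the trace term vanishes and the distance reduces to the squared mean difference, which is a useful sanity check when implementing the metric.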