Deep learning models suffer from catastrophic forgetting when trained on multiple databases in a sequential manner. A deep learning model quickly forgets the characteristics of previously learned experiences while adjusting to new information. The ability of artificial learning systems to continuously acquire, preserve and transfer skills and knowledge throughout their lifespan is called lifelong learning. Existing approaches either use dynamic architectures, adopt regularization during training, or employ generative replay mechanisms. Dynamic architecture approaches [9, 46, 58, 64, 39] increase the network capacity by adding new layers and processing units in order to adapt the network’s architecture to new information. However, such approaches require a specific architecture design, and their number of parameters increases progressively with the number of tasks. Regularization approaches [19, 22, 26, 31, 44] impose a penalty when updating the network’s parameters in order to preserve the knowledge associated with previously learned tasks. In practice, these approaches suffer from performance degradation when learning a series of tasks whose datasets are entirely different from the previously learned ones. Memory-based methods use a buffer to store previously learned data samples [6, 3], or utilize powerful generative networks such as Variational Autoencoders (VAEs) [50, 42, 1, 43] or Generative Adversarial Networks (GANs) [63, 55] as memory-based replay networks that reproduce and generate data consistent with what has been seen and learned before. These approaches need additional memory storage space for recording the parameters of the generated data, and their performance on the previously learned tasks depends heavily on the generator’s ability to realistically replicate data.
Promising results have been achieved on prediction tasks [39, 19, 17, 51, 52, 62]. However, these methods do not capture the underlying structure behind the data, which prevents them from being applied in a wide range of applications. There are very few attempts at addressing representation learning under the lifelong setting [42, 1]. The performance of these methods degrades significantly when engaging in lifelong training with datasets containing complex images or with a long sequence of tasks. The reason is that these approaches need to retrain their generators on artificially generated data, so the performance loss on each dataset accumulates during the lifelong learning of a sequence of several tasks. To address this problem, we propose a probabilistic mixture of experts model, where each expert infers a probabilistic representation of a given task. A Dirichlet sampling process defines the likelihood of a certain expert being activated when presented with a new task.
This paper has the following contributions:
A novel mixture learning model, called Lifelong Mixture of VAEs (L-MVAE). Instead of capturing different characteristics of a single database as in other mixture models [49, 28, 11, 56], the proposed mixture model automatically embeds the knowledge associated with each database into a distinct latent space modelled by one of the mixture’s experts during the lifelong learning.
A training algorithm based on the maximization of the mixture of individual component evidence lower bounds (MELBO).
A mixing-coefficient sampling process, introduced in order to activate or drop out experts in L-MVAE. Besides defining an adaptive architecture, this procedure accelerates the learning of new tasks while overcoming the forgetting of the previously learned tasks.
The remainder of the paper contains a detailed overview of the existing state of the art in Section II, while the proposed L-MVAE model is discussed in Section III. In Section IV we discuss the theory behind the proposed L-MVAE model and in Section V we explain how the proposed methodology can be used in unsupervised, supervised and semi-supervised learning applications. The expansion mechanism for the model’s architecture is presented in Section VI. The experimental results are analyzed in Section VII while the conclusions are drawn in Section VIII.
II Related research studies
A variational autoencoder (VAE) is made up of two networks, an encoder and a decoder. Given a data set, the encoder extracts a latent vector, and the decoder aims to reconstruct the given data from the latent vectors. A number of research works have aimed at capturing meaningful and disentangled data representations by using the VAE framework [18, 23, 14, 5, 33]. These approaches show promising results in achieving disentanglement between latent variables as well as interpretable visual results, where specific properties of the scene can be manipulated by changing the relevant latent variables. However, these models work well only on data samples drawn from a single domain, corresponding to a specific database used for training. When they are re-trained on a different database, their parameters are updated and they then fail to perform on the tasks learned previously. This happens because they do not have appropriate objective functions to deal with catastrophic forgetting [26, 13, 45].
Recently, there have been some attempts to learn cross-domain representations under the lifelong learning setting by introducing an environment-dependent mask that specifies a subset of generative factors, by proposing a Teacher-Student lifelong learning framework [42, 60], or by using a hybrid model of Generative Adversarial Nets (GANs) and VAEs. The models proposed in [42, 1, 61] are based on Generative Replay Mechanisms (GRM) aiming to overcome forgetting. However, these methods perform poorly on complex data.
Aljundi et al. proposed a lifelong learning system named the Expert Gate model, where new experts are added to a network of experts. The most relevant expert from the given set is chosen during the testing stage according to the reconstruction error of the data. However, this choice may not necessarily correspond to the best log-likelihood estimate for the data. Moreover, the Expert Gate model was used only for supervised classification tasks.
Regularization-based approaches alleviate catastrophic forgetting by adding an auxiliary term that penalizes changes in the weights when the model is trained on a new task [39, 19, 22, 26, 31, 44, 45, 10, 37, 29], or by storing past samples to regulate the optimization [17, 7]. However, regularization-based approaches have huge computation requirements when the number of tasks increases.
In another direction of research, mixtures of VAEs have been employed for continuous learning [49, 28, 11, 56]. These models are able to capture the complex structures underlying the data and therefore perform well on many downstream tasks, including clustering and semi-supervised classification. However, these mixture models only capture the characteristics of a single database which has been split into batches of data, and tend to forget previously learned data characteristics when attempting to learn a sequence of distinct tasks. In contrast to the above-mentioned methods, our model is able to capture underlying generative latent variable representations across multiple data domains during lifelong learning.
III The Lifelong Mixture of VAEs
III-A Problem formulation
In this paper we consider a model made up of a mixture of networks which is able to deal with three different learning scenarios: supervised, semi-supervised and unsupervised, under the lifelong learning setting. Let us consider a sequence of tasks and denote as a dataset characterizing the task, where is the source domain and is the target domain which is usually defined by class labels, while each domain is associated with a given task. We aim to learn a model which not only generates or reconstructs data but which can also generate meaningful representations useful for various tasks during a lifelong learning process.
III-B Mixture objective function
Traditional mixture models [35, 54] normally capture different characteristics of a dataset by learning several latent variable vectors, with distinct sets of variables associated with each mixture component. In this paper, we implement each expert by using a generative latent variable model , where is the latent variable and represents the decoder’s parameters, as in VAEs. The learning goal of the generative model is to maximize the log-likelihood of the data distribution, which is a difficult problem due to the intractability of the marginal distribution , requiring access to all latent variables. Instead, we optimize the evidence lower bound (ELBO) on the data log-likelihood:
where is called the variational distribution, and represents the parameters of the encoder. We use the Gaussian distribution for both the prior as well as for the variational distribution . The latent variable is sampled using the reparametrization trick , , where and are inferred by the encoder, and is sampled from . is implemented by a decoder with trainable parameters , receiving the latent variables and producing data reconstructions .
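The reparametrization step and the two ELBO terms can be illustrated with a minimal sketch in plain Python. It assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form. The function names and the log-variance parametrization are illustrative choices rather than the paper's notation, and the reconstruction log-likelihood is assumed to be supplied by a decoder.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    mu and log_var are assumed to be the encoder outputs (hypothetical names).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def elbo(recon_log_lik, mu, log_var):
    """ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)).

    recon_log_lik is a (Monte Carlo) estimate of the first term.
    """
    return recon_log_lik - gaussian_kl(mu, log_var)


rng = random.Random(0)
z = reparameterize([0.2, -0.4], [0.0, -1.0], rng)
print(len(z), elbo(-85.0, [0.2, -0.4], [0.0, -1.0]))
```

Because the noise is drawn separately from the encoder outputs, the sample stays a deterministic function of the mean and variance, which is what allows gradients to reach the encoder in a framework with automatic differentiation.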
When considering that we have experts in the mixture model, we introduce the loss function as the Mixture of individual ELBOs (MELBO), defined through (1):
where is the mixing coefficient, which controls the significance of the -th expert. We model all mixing coefficients by using a Dirichlet distribution , of parameters . In the following we describe the mechanism for selecting appropriate L-MVAE components during the training.
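As a rough illustration of the mixture objective, the sketch below combines per-expert ELBO values into a single weighted sum. It assumes that MELBO is the sum of the individual ELBOs, each multiplied by its mixing coefficient, and that the coefficients lie on the probability simplex; the function name and the example numbers are hypothetical.

```python
def melbo(weights, elbos):
    """Weighted sum of per-expert ELBOs (assumed form of the mixture objective).

    weights: mixing coefficients, non-negative and summing to one.
    elbos:   one ELBO value per expert, evaluated on the same batch.
    """
    if len(weights) != len(elbos):
        raise ValueError("one mixing coefficient is needed per expert")
    if any(w < 0.0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("mixing coefficients must lie on the simplex")
    return sum(w * e for w, e in zip(weights, elbos))


# Hypothetical per-expert ELBO values for one batch.
example_elbos = [-90.0, -120.0, -150.0]
print(melbo([0.5, 0.25, 0.25], example_elbos))
```

With coefficients close to zero for some experts, their ELBO terms contribute almost nothing to the objective, which is the effect the selection mechanism described next relies on.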
III-C The selection of L-MVAE mixture’s components during training
Certain research studies [49, 28] have considered equal contributions for the components of deep learning mixture systems. However, in this paper we consider that each mixture component is specialized for a specific task. The selection of a specific mixture component is performed through the mixing weights ,
In the following, we introduce an assignment vector , with each of its entries , , representing the probability of including or not the -th expert in the mixture. is sampled from a Bernoulli distribution. Before starting the training, we set all entries as , . The assignment probability for each mixing component is calculated considering the sample log-likelihood of each expert after learning each task, as:
where is sampled from the given data batch, drawn from the database corresponding to the task currently being learnt. denotes the assignment variable for the -th expert and represents the value resulting from learning the previous task before evaluating Eq. (3). is used to ensure that is outside the range of possible values for , when evaluating Eq. (3), and therefore we consider as a large value. Then we find the maximum probability for a mixing component:
where represents the index of the selected VAE component according to the parameters learnt during the previous tasks. We then normalize the other assignment variables, except for :
Since is an assignment corresponding to the learning process of the previous task, before evaluating Eq. (3), we use Eq. (5) to recover the dropout status of all experts during the current task learning, except for the -th expert. That expert is dropped out from future training because it is going to be used for recording and reproducing the information associated with the current task being learnt. When learning the first task, all of the mixture’s components are trained. When learning the second task, only components are trained, while one component is no longer trained because it serves as a repository of the information associated with the first task. This component will consequently be used to generate information consistent with the probabilistic representation associated with the first task. This process continues, and for the last task at least one VAE is available for training. The number of mixing components considered initially should therefore be greater than or equal to the number of tasks assumed to be learned during the lifelong learning process. In Section VI we describe a mechanism for expanding the mixture.
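The selection and freezing logic described above can be sketched as follows. This is only an illustration under stated assumptions: the exact form of Eqs. (3) to (5) is not reproduced here, the score of an already frozen expert is simply pushed out of range with a large constant, and all names (`LARGE`, `select_and_freeze`, the boolean `available` flags) are hypothetical.

```python
LARGE = 1e9  # stand-in for the large value that pushes frozen experts out of range


def select_and_freeze(avg_log_lik, available):
    """Pick the expert that best explains the task just learnt and freeze it.

    avg_log_lik[i]: average sample log-likelihood of expert i on the task.
    available[i]:   True while expert i may still be trained.
    Returns the index of the selected expert and the updated availability flags.
    """
    if not any(available):
        raise ValueError("no trainable expert is left; the mixture must be expanded")
    # Frozen experts receive a score that can never win the arg-max.
    scores = [ll if free else -LARGE for ll, free in zip(avg_log_lik, available)]
    chosen = max(range(len(scores)), key=scores.__getitem__)
    updated = list(available)
    updated[chosen] = False  # this expert now stores the task and is no longer trained
    return chosen, updated


status = [True, True, True]
first, status = select_and_freeze([-120.0, -95.0, -110.0], status)
second, status = select_and_freeze([-50.0, -40.0, -90.0], status)
print(first, second, status)
```

In the second call the best raw score belongs to an expert that is already frozen, so the choice falls on the best expert that is still trainable.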
The sampling of mixing weights.
Suppose that L-MVAE finished learning the -th task. We collect several batches of samples from the -th task, where each represents the -th batch of samples, which are used to evaluate the assignment vector by using Eq. (3). We calculate the average probability , where each represents the probability for the assignment of . Then we find by using Eq. (4) and we recover the previous assignments except for by using Eq. (5). The Dirichlet parameters are calculated in order to fix the mixture components containing the information corresponding to the previously learnt tasks while making the other mixture components available for training with the future tasks. For the mixing components that have been used for learning the previous tasks, we consider
where is a very small positive value. For , where represents the number of tasks learnt so far out of a total of given tasks, during the lifelong learning. A small value for the Dirichlet parameters implies that the corresponding mixture components are no longer trained. The mixing weights are sampled from a Dirichlet distribution with parameters . We then train the mixture model with by using Eq. (2) at the -th task learning.
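A minimal sketch of the weight sampling is given below. It assumes that components already used for previous tasks receive a very small Dirichlet parameter while the remaining ones receive a parameter of one; the concrete values, the function names and the uniform fallback for a numerically degenerate draw are illustrative assumptions. The Dirichlet sample is obtained by normalising independent Gamma draws, which is a standard construction.

```python
import random


def dirichlet_parameters(num_experts, frozen, small=1e-2, free=1.0):
    """Small parameter for experts that store a previous task, `free` otherwise."""
    return [small if i in frozen else free for i in range(num_experts)]


def sample_mixing_weights(alphas, rng):
    """Sample from Dirichlet(alphas) by normalising Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:
        # Degenerate draw caused by underflow: fall back to uniform weights.
        return [1.0 / len(alphas)] * len(alphas)
    return [d / total for d in draws]


rng = random.Random(0)
alphas = dirichlet_parameters(4, frozen={0, 2})
weights = sample_mixing_weights(alphas, rng)
print(alphas, sum(weights))
```

A small parameter concentrates the corresponding weight near zero in most draws, so the frozen experts contribute very little to the weighted objective during the training of later tasks.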
Suppose that after the lifelong learning process, we have trained components. In the testing phase, we select a single component to be used for the given data samples. We first calculate the selection probability by evaluating the log-likelihood of the data sample for each component:
Then we select a component by sampling the mixing weight vector from a Categorical distribution.
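The test-time selection can be illustrated with the sketch below. It assumes that the selection probabilities are obtained by normalising the exponentiated per-expert log-likelihoods (a softmax), which is one plausible reading of the description and not necessarily the exact equation used; the function names are hypothetical.

```python
import math
import random


def selection_probabilities(log_liks):
    """Numerically stable softmax over per-expert log-likelihood estimates."""
    top = max(log_liks)
    exps = [math.exp(ll - top) for ll in log_liks]
    total = sum(exps)
    return [e / total for e in exps]


def select_expert(log_liks, rng):
    """Sample an expert index from the Categorical distribution over experts."""
    probs = selection_probabilities(log_liks)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
print(select_expert([-310.0, -95.0, -280.0], rng))
```

When one expert explains the sample far better than the others, its probability is close to one and the sampling step behaves like an arg-max.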
The structure of the proposed L-MVAE model is shown in Fig. 1. In the next section we evaluate the convergence properties of the L-MVAE model during the lifelong learning.
IV Theoretical analysis of L-MVAE
In this section, we evaluate the convergence properties of the proposed L-MVAE model during the lifelong learning. We evaluate the evolution of the objective function during the training and define a lower bound on the data’s log-likelihood. We also show how the L-MVAE model performs inference across several tasks during the lifelong learning.
Definition. Let us define the following function :
where is defined for the -th mixture component by considering the objective function (1) and where we consider . We also define the likelihood function for the mixture model, denoted as .
Lemma. By considering , defining the likelihood function
and the previous Definition, we have :
Proof. After considering the latent variables for each VAE component, the marginal log-likelihood of the mixture is given by:
We know that is bounded by the local ELBO objective function , according to (1), and we have
where represents the parameters for the -th mixture component.
Since the function is a monotone increasing function, then we have:
which proves the Lemma.
Optimizing the mixture’s objective function, , corresponds to finding a lower bound on the data log-likelihood, .
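One standard way to see why a weighted sum of ELBOs lower-bounds the mixture log-likelihood is sketched below. It assumes that the mixture likelihood is a convex combination of the component likelihoods, with weights $w_i \ge 0$ and $\sum_i w_i = 1$; the symbols are illustrative and the argument relies on Jensen's inequality, so it may differ in detail from the proof given in this section.

```latex
\log p(\mathbf{x})
  = \log \sum_{i=1}^{K} w_i \, p_{\theta_i}(\mathbf{x})
  \;\ge\; \sum_{i=1}^{K} w_i \log p_{\theta_i}(\mathbf{x})
  \;\ge\; \sum_{i=1}^{K} w_i \, \mathcal{L}^{(i)}_{\mathrm{ELBO}}(\mathbf{x})
```

The first inequality follows from the concavity of the logarithm, and the second from the fact that each individual ELBO lower-bounds the log-likelihood of its own component while every weight is non-negative.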
From the Lemma, we have :
Let us define as the log-likelihood of the objective function . Then we have during the inference, where represents the log-likelihood of a single VAE, characterized by parameters .
Estimating the log-likelihood during the inference is intractable because the generation process of the mixture model involves an implicit component selection procedure. By considering (2), the log-likelihood is given by :
where , where the mixing parameters are sampled from , where . The marginal log-likelihood for each VAE component is given by its approximation . The proposed model selects only the most suitable expert VAE, indexed as , which has the highest likelihood for the given data samples used during the training :
This shows that during the testing stage, we can evaluate the data’s log-likelihood by using the proposed L-MVAE model. Unlike the approach from , the proposed mixture system can not only perform generation tasks but also learns meaningful data representations across the domains assimilated during the lifelong learning process.
V Defining the Lifelong MVAE for supervised, semi-supervised and unsupervised learning
In this section, we extend the mixture model so that it can be used under various learning paradigms: unsupervised, supervised and semi-supervised.
Unsupervised disentangled representation learning. We build on the approach of [5], which was built on a concept similar to the β-VAE, for modelling disentangled representations in single VAE models. We extend this approach to the mixture objective function by replacing Eq. (2) with the following loss function:
where . The former term represents the Kullback-Leibler (KL) divergence associated with the output of each VAE encoder, by considering the disentanglement among the latent space variables, weighted by , while the latter term is associated with the log-likelihood of data reconstruction by each of the mixture’s decoders. The parameters associated with the disentanglement are set similarly to those from : is linearly increased during the training, starting from a low value, while defines the contribution of this modified KL term to the objective function. and represent the parameters for all encoders and decoders of the mixture and of the individual components, respectively.
Lifelong supervised learning. We consider that the given data is labelled , within a supervised learning framework. When considering a single VAE component we define a latent generative variable model, where is the continuous latent variable and is the latent variable associated with the discrete information, labels for example. Then we derive its corresponding ELBO, considering two distinct encoders, characterized by the parameters and for the discrete and continuous latent variables, respectively, as follows:
We assume that is independent from , which is guaranteed by using two separate inference models and for modelling and . Eq. (18) corresponds to the ELBO for one component of the mixture model. We then define the mixture’s objective function by evaluating a sum over the ELBOs of all individual components, each multiplied by its associated mixing coefficient:
where and , represent the parameters for the encoders modelling continuous , and discrete , latent variables, for each mixture component. We call each the class-specific encoder. The last two terms from (19) represent the KL divergences between the posterior and prior distributions for the variables and , associated with the continuous and discrete latent spaces, respectively.
Each class-specific encoder is implemented by using a neural network of parameters in which the last layer implements the softmax function producing the probability vector , while the sampling process is defined by:
where is sampled from the distribution. The sample vector is treated as a continuous approximation of the categorical representation (one-hot vector). The sampling process is incorporated into both generation and inference stages. To encourage the discrete latent variables to capture discriminative information such as the data type, we introduce a mixture of cross-entropy losses:
where we incorporate the individual VAE components’ cross-entropy losses, weighted by the associated mixing coefficients and characterizing the encoders specific to learning the discrete variables, into a single objective function for the mixture system and . The pseudocode for the supervised learning is provided in Algorithm 1, where we optimize the parameters of the model by using Eq. (19) and Eq. (21) at each iteration.
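The relaxed categorical sampling and the weighted cross-entropy described above can be sketched in plain Python. The sketch assumes the usual Gumbel-softmax construction (Gumbel noise added to the log-probabilities, followed by a temperature-scaled softmax) and a mixture cross-entropy formed as a weighted sum of per-expert cross-entropies; the temperature value, the clamping constants and all function names are illustrative assumptions.

```python
import math
import random

EPS = 1e-12  # numerical floor used to keep the logarithms finite


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = min(max(rng.random(), EPS), 1.0 - EPS)
    return -math.log(-math.log(u))


def gumbel_softmax(probs, temperature, rng):
    """Continuous relaxation of a one-hot sample from Categorical(probs)."""
    scaled = [(math.log(max(p, EPS)) + sample_gumbel(rng)) / temperature
              for p in probs]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[label], EPS))


def mixture_cross_entropy(weights, expert_probs, label):
    """Per-expert cross-entropies weighted by the mixing coefficients."""
    return sum(w * cross_entropy(p, label)
               for w, p in zip(weights, expert_probs))


rng = random.Random(0)
relaxed = gumbel_softmax([0.7, 0.2, 0.1], temperature=0.5, rng=rng)
print(sum(relaxed), mixture_cross_entropy([0.9, 0.1], [[0.8, 0.2], [0.5, 0.5]], 0))
```

Lower temperatures push the relaxed sample towards a one-hot vector, at the cost of noisier gradients in a framework with automatic differentiation.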
Lifelong semi-supervised learning. We also consider the semi-supervised learning context for the proposed L-MVAE model. Under the semi-supervised setting, we only have a small subset of labeled observations , with labels , with the number of samples and a much larger number of unlabeled data samples for each learning task, where we assume data in total, where . In semi-supervised learning we aim to assign labels to the unlabelled data samples based on their statistical consistency with the labelled data, following model training. The assigned labels then replace the discrete variables , used for supervised learning, during the decoding process. The objective function for semi-supervised training is:
where is the loss function for the semi-supervised learning of the L-MVAE model, , while , and represent the mixture’s model parameters characterizing the decoders and the encoders specific to the continuous and to the labels , respectively.
In addition to from (22), we also optimize the parameters using the mixture cross-entropy , similar to (21), used for supervised learning. For the unlabeled samples, missing labels are inferred by using Gumbel-softmax based sampling in which the probability vector is sampled from the encoder, defined by . These discrete variables are then used during the decoding. The final objective function for semi-supervised learning tasks is defined as:
where the first term is given in (22), and controls the importance of the loss associated to the supervised learning , which is defined in (19). We separately optimize the parameters of the model by using (23) and (21) during each iteration, similar to the supervised learning setting.
VI Mixture Expansion Mechanism
A given mixture architecture has limits in its modelling capabilities. Such limits are especially exposed during lifelong learning, when the model has to learn new tasks. In this section, we introduce a procedure for expanding the L-MVAE architecture in order to enhance the architecture’s ability to deal successfully with a growing number of tasks. Meanwhile we aim to use a minimal number of model parameters and to reduce the training time needed for learning all tasks. We introduce a joint network by adding to the existing VAE component structure, consisting of an encoder and a decoder defined by the parameters and , respectively, a sub-encoder and a sub-decoder, with parameters and , respectively. During the first task learning, we build the first mixture component based on this joint network. We use and to represent the decoder and encoder, respectively, where and . When learning the first task, we update both the shared parameter set and the specific parameter set . When learning the next task, parameters are fixed, while a new VAE component is added and only its corresponding specific parameter set is updated using Eq. (1) with data from the given database. We introduce a new mechanism for acquiring the knowledge corresponding to a new task during the lifelong learning, by either updating an existing mixture component, or adding a new component and training its parameters. The process of the proposed expansion mechanism is shown in Fig. 2.
In order to allow a single component to learn several similar tasks, we introduce a similarity measure between the probabilistic representation associated with a new task and the information recorded by each trained mixture component. If the new task is novel enough relative to the already learnt knowledge, the mixture model adds a new component in order to learn the new task. Otherwise, the training algorithm selects and updates the most appropriate component. Let us consider that the mixture model has components after learning the -th task. We evaluate the novelty of the -th task by comparing the knowledge acquired by each of the components with the probabilistic representation of the -th task. We consider a probabilistic representation of the -th task by randomly selecting a set , where in the experiments we consider samples. The probabilistic representation of the knowledge acquired by each expert is represented by its ability to generate specific data. Thus, for each expert , we generate a dataset , where in the experiments we consider , for and represents the database used for sampling the original data . We consider the L2 distance between all data of the two databases as a statistical similarity measure:
for . A new expert is added to the mixture model when none of the experts is able to generate data similar to those from the new dataset, according to :
where is a threshold defining the level of novelty in the knowledge acquired by each expert. The parameter set of the new expert is , where only the parameters are trained according to the objective defined in (1). If (25) is not fulfilled then the most suitable component is chosen :
and its encoder and decoder parameters are updated. We call the mixture model with the proposed expansion mechanism L-MVAE dynamic (L-MVAE-Dyn). By considering a fixed component of the model, made up of the sub-decoder and sub-encoder of parameters , we ensure a common heritage of knowledge for all tasks, corresponding to a set of features shared by the data from several databases. When learning each task, we add an additional set of parameters corresponding to the characteristic information of each database. This procedure ensures fast and efficient learning, while keeping the required set of parameters to a minimum when learning a sequence of tasks.
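The expand-or-update decision can be sketched as follows. The sketch assumes that the similarity measure is the mean L2 distance between paired real and generated samples and that a new expert is added when even the closest expert exceeds the novelty threshold; the pairing of samples, the function names and the toy data are illustrative assumptions rather than the exact form of Eqs. (24) to (26).

```python
import math


def mean_l2_distance(real, generated):
    """Average Euclidean distance between paired real and generated samples."""
    pairs = list(zip(real, generated))
    if not pairs:
        raise ValueError("both sample sets must be non-empty")
    total = sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, g)))
                for x, g in pairs)
    return total / len(pairs)


def expansion_decision(real, generated_per_expert, threshold):
    """Return ("add", new_index) for a novel task, else ("update", best_index)."""
    distances = [mean_l2_distance(real, gen) for gen in generated_per_expert]
    best = min(range(len(distances)), key=distances.__getitem__)
    if distances[best] > threshold:
        return ("add", len(distances))
    return ("update", best)


new_task = [[0.0, 0.0], [1.0, 1.0]]
expert_a = [[0.0, 0.0], [1.0, 1.0]]  # generates data close to the new task
expert_b = [[3.0, 4.0], [4.0, 5.0]]  # generates data far from the new task
print(expansion_decision(new_task, [expert_a, expert_b], threshold=1.0))
print(expansion_decision(new_task, [expert_b], threshold=1.0))
```

The threshold trades model size against specialisation: a low value adds experts readily, while a high value makes existing experts absorb more tasks.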
VII Experimental results
We evaluate the performance of the proposed L-MVAE system when learning several tasks. We also assess how L-MVAE is used for semi-supervised and unsupervised learning tasks in the context of lifelong learning. All implementations are done using the TensorFlow framework.
VII-A Supervised learning
We select four datasets for the lifelong supervised training of L-MVAE: MNIST, Fashion, SVHN and CIFAR10, called the MFSC sequence. We estimate the average classification accuracy on all testing data across different domains during the lifelong training, and the results are provided in Fig. 3. Each task was trained for 10 epochs using Stochastic Gradient Descent (SGD). From these results we observe that each time L-MVAE is trained with a new dataset, it maintains almost its full performance on the previously learned tasks. For comparison, the same plot in Fig. 3 shows the results obtained by Deep Generative Replay (DGR), which has a significant performance drop on the previously learnt tasks when training with a new dataset.
In Table I we provide the classification accuracy for the lifelong learning of the MFSC sequence of databases. When all these databases are used jointly for training, within an approach named “JVAE”, we achieve good results on simple datasets such as MNIST and Fashion, but the performance drops on the datasets containing more complex images. “Transfer” represents training a single classifier on a sequence of tasks without using the generative replay mechanism. We observe that the “Transfer” approach only achieves good results on the latest task and completely forgets any previously learnt knowledge. L-MVAE-S is the mixture model sharing the parameters of the decoder among all experts. Although L-MVAE-S uses fewer parameters than L-MVAE, it still provides very good results. The generative replay based methods used for comparison, Lifelong Generative Modeling (LGM), DGR and Continual Unsupervised Representation Learning (CURL), display a performance drop on all tasks, which is mainly a result of the quality of the generative replay samples. The generative replay methods tend to forget the previously learnt tasks when learning a sequence of different domains.
| Dataset | L-MVAE* | CURL* | CAE | M1 | M1+M2 | Semi-VAE |
VII-B Semi-supervised learning
We investigate the performance of L-MVAE in semi-supervised tasks. For the labelled set we randomly select 1,000 training images from MNIST and 10,000 from each of the datasets Fashion, SVHN and CIFAR10. The remaining data samples are considered as unlabelled. We train the L-MVAE system on both the labelled and unlabelled samples under the MNIST, Fashion, SVHN and CIFAR10 lifelong learning setting, according to Eq. (23) where we set . The results are provided in Table II, where we use ‘*’ to denote a model learned under the lifelong setting. The proposed model achieves better results than CURL on almost every task and even achieves competitive results when compared with current state-of-the-art semi-supervised methods trained only on a single dataset, such as CAE, M1, M1+M2 and Semi-VAE.
VII-C Unsupervised lifelong reconstruction and interpolation
In the following, the L-MVAE model is used in unsupervised applications, where there are no data labels. We train the proposed mixture system with four components under the MNIST, Fashion, SVHN and CIFAR10 (MFSC) as well as under the CelebA, CACD, 3D-chairs and Omniglot (CCDO) lifelong learning settings. The original images for the MFSC and CCDO databases are provided in Figures 4 a-d and 5 a-d, respectively. The image reconstruction results corresponding to these images, following the lifelong learning, are shown in Figures 4 e-h and 5 e-h, respectively. These results show that the proposed L-MVAE mixture system is able to make accurate inference across several different domains. We also perform interpolations in the latent space of multiple domains. When interpolating between two latent vectors, we first select the most relevant expert, according to the selection strategy from Section III-C, and then infer the latent variables using the selected inference model. The selected decoder then recovers images from the interpolated latent variable space. We present the interpolation results in Figures 6 a-d, for images from the CelebA, CACD, 3D-chairs and Omniglot databases. According to these results, the proposed model achieves continuity in the latent space, as reflected in the images generated by each expert.
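The interpolation step can be illustrated with a short sketch. It assumes simple linear interpolation between two latent vectors inferred by the selected expert, with each interpolated vector then passed to that expert's decoder; the function name and the number of steps are hypothetical, and the decoder itself is not modelled here.

```python
def interpolate_latents(z_start, z_end, steps):
    """Linearly interpolate between two latent vectors, endpoints included."""
    if steps < 2:
        raise ValueError("at least two interpolation steps are required")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for k in range(steps):
        t = k / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Each vector on the path would be decoded by the selected expert's decoder.
for z in interpolate_latents([0.0, 0.0], [1.0, 2.0], steps=3):
    print(z)
```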
VII-D Disentangled representation learning
We train the L-MVAE system under the CelebA, CACD, 3D-chairs and Omniglot lifelong learning setting by using the disentangled loss function from Eq. (17), where is increased from a very small value to 25.0 during the training and we set . After the training, the L-MVAE system first chooses the most relevant expert and then a single latent variable, inferred by the selected expert, is changed from -3 to 3 while fixing the other latent variables. The results are shown in Figures 7 and 8. From Figures 7 a-d we observe that the proposed L-MVAE approach can discover four disentangled representations for CelebA, corresponding to changes in age, hair style, illumination and face orientation. From Figures 8 a-c we observe that we can change chair size, style and orientation.
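The latent traversal used for these visualisations can be sketched as follows. The sketch assumes evenly spaced values between -3 and 3 for one latent dimension while all other dimensions stay fixed; the function name and the number of steps are illustrative, and decoding each traversed vector is left to the selected expert.

```python
def traverse_latent(z, index, low=-3.0, high=3.0, steps=7):
    """Vary one latent dimension over [low, high] while fixing the others."""
    if steps < 2:
        raise ValueError("at least two traversal steps are required")
    rows = []
    for k in range(steps):
        varied = list(z)  # copy so the inferred latent vector is left untouched
        varied[index] = low + (high - low) * k / (steps - 1)
        rows.append(varied)
    return rows


# Each row would be decoded by the selected expert to visualise one factor.
for row in traverse_latent([0.5, 0.1], index=0):
    print(row)
```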
VII-E Visual quality evaluation for the generated images
For assessing the representation learning ability under the lifelong setting, we evaluate the negative log-likelihood (NLL), representing the reconstruction error plus the KL divergence term, as well as the inception score (IS) for the reconstructed images from the testing set. First, we train various models under the MNIST, Fashion, SVHN and CIFAR10 (MFSC) lifelong learning setting, using 100 epochs for learning each task. The results for MFSC, and for learning the databases in the reversed order, CSFM, are provided in Tables III and IV for the average NLL and the average reconstruction error, respectively. These results show that the proposed approach achieves the best results when compared with CURL, LGM and JVAE (training with all databases at once).
We also consider the lifelong training for ImageNet, CIFAR100, CIFAR10 and MNIST. After the training, we choose 5,000 images for testing from each of CIFAR10, CIFAR100 and ImageNet, and the IS score of the reconstructed images is provided in Table V, in comparison with CURL and LGM. Then we train various models under the CIFAR100, CIFAR10 and ImageNet lifelong learning setting and we provide the results in Table VI. These results show that the proposed model still provides the best performance even when learning a sequence of several databases containing complex and diverse images.
|Dataset||L-MVAE||CURL ||LGM |
VII-F Ablation study
We perform an ablation study to investigate the performance when changing the configuration of the mixture model. We train L-MVAE with components under the MNIST, Fashion, SVHN and Fashion lifelong learning setting. We plot the average reconstruction errors on all MNIST testing samples in Fig. 9a. The results show that the number of components has little effect on the performance, and this is why we use components in the experiments.
We also investigate the performance of the proposed model when the Dirichlet parameters are not properly estimated, where the weights , are sampled from the same distribution. We call the model without component selection “L-MVAE without dropout”. We train this model under the same lifelong learning setting as above, and the NLL results on the first task (MNIST) are shown in Fig. 9b. This model loses performance on MNIST during the following tasks when it does not follow the dropout approach described in Section III-C. This happens because all experts are activated during the learning of the following tasks if the Dirichlet parameters are not changed accordingly.
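The effect discussed above can be illustrated with a generic Dirichlet sampler. This is only a sketch of the general idea, not the parameter update of Section III-C: the concentration values, the number of experts and the helper name `sample_dirichlet` are assumptions. With identical concentration parameters every expert tends to receive a non-negligible weight, whereas a strongly peaked parameter vector effectively selects a single expert.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw mixing weights from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)

# Same parameters for all experts: no expert is singled out.
shared = sample_dirichlet([1.0, 1.0, 1.0, 1.0], rng)

# Hypothetical adapted parameters: the first expert dominates the mixture.
selective = sample_dirichlet([100.0, 0.01, 0.01, 0.01], rng)
```

In the second case the sampled weights place almost all mass on one component, which is the behaviour a component selection step relies on to keep the remaining experts untouched.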
In the following experiments we train the L-MVAE model under the lifelong learning of MNIST, Fashion, SVHN and CIFAR10, and evaluate MELBO, from Eq. (2), at each training step of the first task. The results are shown in Fig. 9c, where we also consider a single VAE model with optimal ELBO training on MNIST (MELBO and ELBO are estimated by using the negative reconstruction errors and KL divergence). In these results, MELBO is always bounded by this optimal ELBO and therefore still represents a lower bound on the sample log-likelihood, since the ELBO is itself such a lower bound, in agreement with Theorem 2 from Section IV. We also train a single expert with GRM and a mixture model with 4 experts under MNIST, Fashion, SVHN and CIFAR10 lifelong learning. We take the classification error rate on the testing set as the risk of a model, and the accumulated error is calculated by summing the risks on the testing sets of all learnt tasks. We use 10 epochs for training each task and plot the results in Fig. 10. We observe that a single model tends to have a large risk when learning additional tasks, whereas the proposed L-MVAE mixture model always has a lower risk than a single VAE.
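The accumulated error described above is simple to compute once predictions are available for the testing set of every task learnt so far. The sketch below is a minimal version with hypothetical function names and toy data.

```python
def error_rate(predictions, labels):
    """Classification error rate (risk) on one testing set."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)


def accumulated_risk(per_task_predictions, per_task_labels):
    """Sum of the risks over the testing sets of all tasks learnt so far."""
    return sum(
        error_rate(p, y) for p, y in zip(per_task_predictions, per_task_labels)
    )


# Toy example with two learnt tasks: risks 0.25 and 0.5.
predictions = [[1, 2, 3, 4], [0, 0]]
labels = [[1, 2, 0, 4], [0, 1]]
total = accumulated_risk(predictions, labels)
```

Evaluating `accumulated_risk` after each task is learnt gives one point of an accumulated-error curve.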
|MIX+Wasserstein GAN in ||No||4.04|
|DCGAN  in ||No||4.89|
|ALI  in ||No||4.97|
|PixelCNN++  in ||No||5.51|
|WGAN in ||No||3.82|
|Datasets||L-MVAE-Dynamic||BatchEnsemble ||L-MVAE-Dynamic||BatchEnsemble ||L-MVAE-Dynamic||BatchEnsemble |
VII-G Transfer metric and transfer learning
In this section, we evaluate how quickly L-MVAE learns a new task when presented with a new database for training. The learning of the probabilistic representation of a new dataset by L-MVAE can be interpreted as a knowledge transfer process from one domain to another. This results in mixing the information being learnt by the expert from the new database with the information already stored in the network's parameters, corresponding to the previously learnt tasks. In this paper, we propose a new metric assessing the ability to transfer information during lifelong learning when learning each new task :
where is the performance score of the -th mixture component of parameters for the -th task, and represents a given batch of images sampled from the -th database, and is the performance metric, considered as either the Mean Square Error (MSE), or it can be the classification accuracy, depending on the application of each task. represents the image reconstructed by the L-MVAE model considering the given batch of images corresponding to the -th task. The proposed metric can measure the training efficiency when a model is trained with a new task, representing the information transfer ability of the model when learning new tasks.
We train the L-MVAE model under the MNIST, Fashion, SVHN and CIFAR10 lifelong learning setting. The transfer learning ability during the lifelong learning is evaluated in Fig. 11, by considering the MSE in Eq. (27). It can be observed that L-MVAE converges quickly when learning the probabilistic representation of a new database. The baseline is our model trained on a single dataset, MNIST. The average reconstruction errors, calculated using Eq. (27), are provided in Figures 12 a-c for the Fashion, SVHN and CIFAR10 databases. The proposed approach adapts quickly to learning a new task when compared to the baseline. We further investigate how the knowledge transfer ability differs when learning the tasks in a different order. We train our model under the CelebA to CACD and the CelebA to Omniglot sequences, respectively. Then we measure the negative log-likelihood of the model for the second task and present the results in Fig. 13. It can be observed that learning CACD as the prior task can significantly accelerate the convergence when the future task shares similar visual concepts with the prior task.
VII-H Studying the over-regularization factors during training
In this section, we discuss the over-regularization problem in the proposed L-MVAE mixture model. A strong penalty on the KL divergence term in the VAE framework can force the variational distribution to match the prior distribution exactly, so . However, this may lead to a poor representation of the underlying data structure for . To solve this problem, we implement each expert by using β-VAE, which includes a penalty term on the KL divergence, expressed as :
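For orientation, the β-VAE objective for a single model is commonly written in the form below. The symbols (encoder $q_{\phi}$, decoder $p_{\theta}$, prior $p(\mathbf{z})$) are assumed notation chosen for this sketch and need not match the per-expert formulation of Eq. (28).

```latex
\mathcal{L}_{\beta}(\mathbf{x}) =
  \mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}\mid\mathbf{z})\right]
  - \beta \, D_{KL}\left(q_{\phi}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\right)
```

Setting β = 1 gives the usual VAE evidence lower bound, while a smaller β weakens the pull of the variational distribution towards the prior.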
At the beginning of the training, we use a small β which gradually increases up to 1.0 during each task training, in the mixture objective function from Eq. (2), after replacing by using Eq. (28). We train the mixture model L-MVAE under the MNIST, Fashion, SVHN and CIFAR10 lifelong setting (MFSC sequence) as well as when learning these databases in the reversed order, denoted as CSFM. We evaluate the Inception Score (IS) on 5,000 testing samples from CIFAR10 and the corresponding reconstructions obtained by L-MVAE-MFSC and L-MVAE-CSFM, denoting the models trained with the “MFSC” and “CSFM” sequences, respectively. The reconstruction results measured by Mean Squared Error (MSE), the structural similarity index measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR), provided in Table VII, show that L-MVAE achieves competitive results when compared to BatchEnsemble, a state-of-the-art ensemble model, trained only on CIFAR10. The results also show that the order of learning the four databases does not have a significant impact on the L-MVAE training.
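The gradual increase of the KL weight up to 1.0 within each task could be scheduled as in the sketch below. The text only states that the weight starts small and grows to 1.0; the linear shape, the starting value `1e-3` and the function name `beta_schedule` are assumptions made for illustration.

```python
def beta_schedule(step, total_steps, beta_start=1e-3, beta_end=1.0):
    """Linearly increase the KL weight from beta_start to beta_end, then hold.

    step counts training steps within the current task, so the schedule
    restarts from beta_start whenever a new task begins.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return beta_start + (beta_end - beta_start) * fraction


# Hypothetical usage: weight for the KL term at a few steps of one task.
weights = [beta_schedule(s, total_steps=1000) for s in (0, 250, 500, 1000, 1500)]
```

Restarting the schedule for every task keeps the early updates on a new database dominated by the reconstruction term before the KL penalty reaches full strength.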
VII-I The results for the expandable mixture model
In this section, we evaluate the performance of the proposed expansion mechanism and compare it with other models, such as BatchEnsemble. In order to allow BatchEnsemble to perform unsupervised learning tasks, we implement each ensemble member as a VAE. We use MSE, SSIM and PSNR to evaluate the image reconstruction quality. We train L-MVAE and BatchEnsemble under MNIST, Fashion, SVHN and CIFAR10 lifelong learning. We consider and L-MVAE adds three new components in the mixture model following the training. The reconstruction performance is provided in Table VIII, where L-MVAE-Dynamic outperforms BatchEnsemble on all three criteria. The classification results when for the lifelong learning of MNIST, Fashion, SVHN, and CIFAR10 are provided in Table IX. After the training, L-MVAE-Dynamic has four components and outperforms BatchEnsemble. We also consider a long sequence of tasks: MNIST, SVHN, Fashion, InverseFashion (IFashion), InverseMNIST (IMNIST), RotatedFashion (RFashion) and CIFAR10 (MSFIIRC), where we consider in Eq. (25) for MSFIIRC and provide the results in Table X, where L-MVAE-Dynamic has five components after the lifelong learning. The first and the third components are reused when learning RMNIST and RFashion, respectively, which demonstrates that the expert sharing similar knowledge with a future task is chosen. L-MVAE-Dynamic also achieves the best results in each task when compared to BatchEnsemble.
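Two of the reconstruction criteria used above, MSE and PSNR, can be computed as in the following sketch. It assumes images flattened to lists of pixel intensities in the range [0, 1] (hence `max_value=1.0`), which is an assumption about preprocessing rather than something stated in the text. SSIM is omitted because it requires local windowed statistics.

```python
import math


def mse(x, x_hat):
    """Mean squared error between an image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def psnr(x, x_hat, max_value=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for a perfect match."""
    err = mse(x, x_hat)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

Lower MSE and higher PSNR both indicate reconstructions that are closer to the original images.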
This paper proposes a novel mixture system able to learn several tasks successively, called the Lifelong Mixtures of VAEs (L-MVAE) model. Each time a new database becomes available, the L-MVAE model adapts its weights in order to learn the corresponding probabilistic representation, without forgetting the information learnt from the previous tasks. A mixing coefficient is used to determine which experts are activated or deactivated during the lifelong learning, preventing catastrophic forgetting. The L-MVAE model is also equipped with a component expansion mechanism, which depends on the complexity of the new task with respect to those already learnt. The proposed lifelong learning framework is applied to supervised, unsupervised and semi-supervised learning.
-  (2018) Life-long disentangled representation learning with cross-domain latent homologies. In Advances in Neural Inf. Proc. Systems (NeurIPS), pp. 9873–9883. Cited by: §I, §I, §II.
-  (2017) Expert gate: lifelong learning with a network of experts. In , pp. 3366–3375. Cited by: §II, §IV.
-  (2019) Gradient based sample selection for online continual learning. In Advances in Neural Inf. Proc. Systems (NeurIPS), pp. 11816–11825. Cited by: §I.
-  Generalization and equilibrium in generative adversarial nets (GANs). In Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 70, pp. 224–232. Cited by: TABLE VII.
-  (2018) Understanding disentangling in -VAE. In NIPS Workshop on Learning Disentangled Representations, External Links: Cited by: §II, §V.
-  On tiny episodic memories in continual learning. In Proc. ICML Workshop on Multi-Task and Lifelong Reinforcement Learning. Cited by: §I.
-  (2019) Efficient lifelong learning with A-GEM. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: §II.
-  (2017) Symmetric variational autoencoder and connections to adversarial learning. External Links: Cited by: TABLE VII.
-  (2017) AdaNet: adaptive structural learning of artificial neural networks. In Proc. of Int. Conf. on Machine Learning (ICML), vol. PMLR 70, pp. 874–883. Cited by: §I.
-  (2007) Boosting for transfer learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 193–200. Cited by: §II.
-  (2018) Deep unsupervised clustering with Gaussian mixture variational autoencoders. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: 1st item, §II.
-  (2017) Adversarially learned inference. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: TABLE VII.
-  (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §II.
-  (2019) Auto-encoding total correlation explanation. In Proc. Int. Conf. on Art. Intel. and Stat. (AISTATS), vol. PMLR 89, pp. 1157–1166. Cited by: §II.
-  (2014) Generative adversarial nets. In Advances in Neural Inf. Proc. Systems (NIPS), pp. 2672–2680. Cited by: §II.
-  (1954) Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series 33. Cited by: §V.
-  (2020) Improved schemes for episodic memory-based lifelong learning. In Advances in Neural Information Processing Systems (NeurIPS), External Links: Cited by: §I, §II.
-  (2017) -VAE: learning basic visual concepts with a constrained variational framework. In Proc. Int. Conf. on Learning Representations (ICLR), Cited by: §II, §V, §VII-H.
-  (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, External Links: Cited by: §I, §I, §II.
-  (2010) Image quality metrics: PSNR vs. SSIM. In Proc. Int. Conf. on Pattern Recognition (ICPR), pp. 2366–2369. Cited by: §VII-H.
-  (2017) Categorical reparameterization with Gumbel-Softmax. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: §V.
-  (2016) Less-forgetting learning in deep neural networks. In Proc. AAAI Conf. on Artif. Intel., pp. 3358–3365. Cited by: §I, §II.
-  (2018) Learning disentangled joint continuous and discrete representations. In Proc. Int. Conf. on Machine Learning (ICML), PMLR 80, pp. 2649–2658. Cited by: §II.
-  (2014) Semi-supervised learning with deep generative models. In Advances in Neural Inf. Proc. Systems (NIPS), pp. 3581–3589. Cited by: §VII-B, TABLE II.
-  (2013) Auto-encoding variational Bayes. External Links: Cited by: §II, §III-B, §VII-H.
-  (2017) Overcoming catastrophic forgetting in neural networks. Proc. of the National Academy of Sciences (PNAS) 114 (13), pp. 3521–3526. Cited by: §I, §II, §II.
-  (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §VII-A.
-  Multi-source neural variational inference. In Proc. AAAI Conf. on Artificial Intelligence, pp. 4114–4121. Cited by: 1st item, §II, §III-C.
-  (2020) Continual learning with bayesian neural networks for non-stationary data. In Proc. Int. Conf. on Learning Representations (ICLR), Cited by: §II.
-  (1998) Gradient-based learning applied to document recognition. Proc. of the IEEE 86 (11), pp. 2278–2324. Cited by: §VII-A.
-  (2017) Learning without forgetting. IEEE Trans. on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947. Cited by: §I, §II.
-  (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §II.
-  (2019) Disentangling disentanglement in variational autoencoders. In Proc. Int. Conf. on Machine Learning (ICML), PMLR 97, pp. 4402–4412. Cited by: §II.
-  (2017) Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Inf. Proc. Systems (NeurIPS), pp. 5925–5935. Cited by: §V, §VII-B, TABLE II.
-  (2006) Variational learning for Gaussian mixtures. IEEE Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics) 36 (4), pp. 849–862. Cited by: §III-B.
-  (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, pp. . Cited by: §VII-A.
-  (2018) Variational continual learning. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: §II.
-  Continual lifelong learning with neural networks: a review. In Proc. of the ACM India Joint Int. Conf. on Data Science and Management of Data, pp. 362–365. Cited by: §I.
-  (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Trans. on Systems Man and Cybernetics, Part C 31 (4), pp. 497–508. Cited by: §I, §I, §II.
-  (2017) Adversarial symmetric variational autoencoder. In Advances in Neural Inf. Proc. Systems (NeurIPS), pp. 4333–4342. Cited by: TABLE VII.
-  (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: TABLE VII.
-  (2018) Lifelong generative modeling. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: §I, §I, §II, §VII-A, §VII-E, §VII-E, TABLE I, TABLE V.
-  (2019) Continual unsupervised representation learning. In Advances in Neural Inf. Proc. Systems (NeurIPS), External Links: Cited by: §I, §VII-A, §VII-B, §VII-E, §VII-E, TABLE I, TABLE II, TABLE III, TABLE IV, TABLE V, TABLE VI.
-  (2017) Life-long learning based on dynamic combination model. Applied Soft Computing 56, pp. 398–404. Cited by: §I, §II.
-  (2018) Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31, pp. 3742–3752. Cited by: §II, §II.
-  (2016) Progressive neural networks. External Links: Cited by: §I.
-  (2016) Improved techniques for training GANs. In Advances in Neural Inf. Proc. Systems (NIPS), pp. 2234–2242. Cited by: §VII-E.
-  (2017) PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: TABLE VII.
-  (2019) Variational mixture-of-experts autoencoders for multi-modal deep generative models. In Advances in Neural Inf. Proc. Systems (NeurIPS), pp. 15718–15729. Cited by: 1st item, §II, §III-C.
-  (2017) Continual learning with deep generative replay. In Advances in Neural Inf. Proc. Systems (NeurIPS), pp. 2990–2999. Cited by: §I, §VII-A, §VII-A, TABLE I.
-  (2020) Calibrating CNNs for lifelong learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §I.
-  (2018) Three scenarios for continual learning. In NeurIPS Continual Learning workshop, External Links: Cited by: §I.
-  (2020) BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. In Proc. Int. Conf. on Learning Representations (ICLR), External Links: Cited by: §VII-H, §VII-I, TABLE X, TABLE VIII, TABLE IX.
-  (2015) Sparse multivariate Gaussian mixture regression. IEEE Trans. on Neural Networks and Learning Systems 26 (5), pp. 1098–1108. Cited by: §III-B.
-  (2018) Memory replay gans: learning to generate new categories without forgetting. In Advances In Neural Inf. Proc. Systems (NeurIPS), pp. 5962–5972. Cited by: §I.
-  (2018) Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pp. 5575–5585. Cited by: 1st item, §II.
-  (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. External Links: Cited by: §VII-A.
-  Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proc. of ACM Int. Conf. on Multimedia, pp. 177–186. Cited by: §I.
-  (2021) Deep mixture generative autoencoders. IEEE Trans. on Neural Networks and Learning Systems. Cited by: §III-A.
-  (2021) Lifelong Teacher-Student network learning. IEEE Trans. on Pattern Analysis and Machine Intelligence. Cited by: §II.
-  (2020) Learning latent representations across multiple data domains using lifelong VAEGAN. In Proc. European Conf. on Computer Vision (ECCV), vol LNCS 12365, pp. 777–795. Cited by: §II.
-  (2017) Continual learning through synaptic intelligence. In Proc. of Int. Conf. on Machine Learning (ICML), vol. PMLR 70, pp. 3987–3995. Cited by: §I.
-  (2019) Lifelong GAN: continual learning for conditional image generation. In Proc. IEEE Int. Conf. on Computer Vision (ICCV), pp. 2759–2768. Cited by: §I.
-  Online incremental feature learning with denoising autoencoders. In Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), vol. PMLR 22, pp. 1453–1461. Cited by: §I.