Deep Bayesian Unsupervised Lifelong Learning

by Tingting Zhao, et al.

Lifelong Learning (LL) refers to the ability to continually learn and solve new problems with incrementally available information over time while retaining previous knowledge. Much attention has been given lately to Supervised Lifelong Learning (SLL) with a stream of labelled data. In contrast, we focus on resolving challenges in Unsupervised Lifelong Learning (ULL) with streaming unlabelled data, where the data distribution and the unknown class labels evolve over time. A Bayesian framework is a natural way to incorporate past knowledge and sequentially update the belief with new data. We develop a fully Bayesian inference framework for ULL with a novel end-to-end Deep Bayesian Unsupervised Lifelong Learning (DBULL) algorithm, which can progressively discover new clusters from unlabelled data without forgetting the past, while learning latent representations. To efficiently maintain past knowledge, we develop a novel knowledge preservation mechanism via sufficient statistics of the latent representation of the raw data. To detect potential new clusters on the fly, we develop an automatic cluster discovery and redundancy removal strategy in our inference, inspired by nonparametric Bayesian statistics techniques. We demonstrate the effectiveness of our approach on benchmark image and text datasets in both LL and batch settings.




1 Introduction

With exposure to a continuous stream of information, human beings are able to learn and discover novel clusters continually by incorporating past knowledge; traditional machine learning algorithms, however, mainly focus on static data distributions. Training a model with new information often interferes with previously learned knowledge, which typically compromises performance on previous datasets (mccloskey1989catastrophic). In order to empower algorithms with the ability to adapt to emerging data while preserving performance on seen data, a new machine learning paradigm called Lifelong Learning (LL) has recently gained attention.

LL, also known as continual learning, was first proposed in thrun1995lifelong. It provides a paradigm to exploit past knowledge and learn continually by transferring previously learned knowledge to solve similar but new problems in a dynamic environment, without performance degradation on old tasks. LL is still an emerging field, and most existing work (thrun1996learning; ruvolo2013ella; chen2015lifelong) has focused on SLL, where the boundaries between different tasks are known and each task is a supervised learning problem with output labels provided. The term task can have different meanings in different contexts. For example, different tasks can represent different subsets of classes or labels within the same supervised learning problem (sarwar2019incremental), or supervised learning problems from different fields, where researchers aim to perform continual learning across domains (hou2018lifelong) or lifelong transfer learning (ruvolo2013active; isele2016using).

While most research on LL has focused on resolving challenges in SLL problems with class labels provided, we instead consider ULL problems, where the learning system interacts with a non-stationary stream of unlabelled data and the cluster labels are unknown. One objective of ULL is to discover new clusters by interacting with the environment dynamically and adapting to changes in the unlabelled data without external supervision or knowledge. To avoid confusion, it is worth pointing out that our work assumes the non-stationary streaming unlabelled data come from a single domain, and we aim to develop a single dynamic model that performs well on all sequential data at the end of each training stage without forgetting previous knowledge. This setting serves as a reasonable starting point for ULL; we leave the challenges of ULL across different problem domains for future work.

To retain knowledge from past data and learn new clusters from new data continually, good representations of the raw data make it easier to extract information and, in turn, support effective learning. It is therefore computationally appealing to discover new clusters in a low-dimensional latent space rather than in the complex original data space while performing representation learning. To achieve this, we propose DBULL, a flexible probabilistic generative model that can adapt to new data and expand with new clusters while seamlessly learning deep representations in a Bayesian framework.

A critical objective of LL is to achieve consistently good performance incrementally as new data arrive in a streaming fashion, without performance degradation on previous data, even if those data may have been completely overwritten. Compared with a traditional batch learning setting, an LL setting poses additional challenges. One important question is how to design a knowledge preservation scheme that efficiently maintains previously learned information. Another challenge is how to design a dynamic model that can expand with incoming data to perform unsupervised learning and discover new clusters automatically from streaming, unlabelled data. The last challenge is how to design an end-to-end inference algorithm that obtains good performance in an incremental learning fashion. To answer these questions, we make the following contributions:

  • To solve the challenges in ULL, we provide a fully Bayesian formulation that performs representation learning, clustering and automatic new cluster discovery simultaneously via our novel end-to-end variational inference strategy DBULL.

  • To efficiently extract and maintain knowledge from earlier data, our incremental inference strategy is, to our knowledge, the first to use sufficient statistics in the latent space in an LL context.

  • To discover new clusters in emerging data, we choose a nonparametric Bayesian prior that allows the model to grow dynamically. We develop a sequential Bayesian inference strategy that performs representation learning simultaneously with our proposed Cluster Expansion and Redundancy Removal (CERR) strategy to discover new clusters on the fly, without imposing a bound on the number of clusters, unlike most existing algorithms that use a truncated Dirichlet Process (DP) (blei2006variational).

  • To show the effectiveness of DBULL, we conduct experiments on image and text benchmarks. DBULL achieves superior performance compared with state-of-the-art methods in both ULL and classical batch settings.

2 Related Work

2.1 Alleviating Catastrophic Forgetting in Lifelong Learning

Research on LL aims to learn knowledge in a continual fashion without performance degradation on previous tasks when the model is trained on new data. parisi2019continual provides a comprehensive review of LL with neural networks. The main challenge of LL using deep neural networks (DNNs) is that they often suffer from a phenomenon called catastrophic forgetting or catastrophic interference, where a model's performance on previous tasks may decrease abruptly due to the interference of training with new information (mccloskey1989catastrophic; mcclelland1995there). Recent work aims at adapting a learned model to new information while ensuring that performance on previous data does not decrease. Currently, there is no universal past-knowledge preservation scheme across different algorithms and settings in LL (chen2018lifelong). Regularization methods aim to reduce interference from new learning by minimizing changes to parameters that are important to previous learning tasks (kirkpatrick2017overcoming; zenke2017continual). Alternative approaches based on rehearsal have also been proposed to alleviate catastrophic forgetting while training DNNs sequentially. Rehearsal methods use past data (robins1995catastrophic; xu2018lifelong), coreset data summarization (nguyen2017variational) or a generative model (shin2017continual) to capture the distribution of previously seen data.

However, most existing LL methods focus on supervised learning tasks (kirkpatrick2017overcoming; nguyen2017variational; hou2018lifelong; shin2017continual), where each learning task performs supervised learning with output labels provided. In comparison, we propose a novel knowledge preservation scheme for an unsupervised learning context via sufficient statistics. This contrasts with existing work that uses previous model parameters (chen2015lifelong; shu2016lifelong), representative items extracted from previous models (shu2017lifelong), past raw data, or coreset data summarization (robins1995catastrophic; xu2018lifelong; nguyen2017variational) as previous knowledge for new tasks. Our use of sufficient statistics is novel and preserves past knowledge without storing previous data, while allowing incremental updates as new data arrive thanks to the additive property of sufficient statistics.

2.2 Comparable Methods in Unsupervised Lifelong Learning

Recently, rao2019continual proposed CURL to deal with a fully ULL setting with unknown cluster labels. We developed our idea independently, in parallel with CURL, but in a fully Bayesian framework. CURL is the most closely related and comparable method in the literature. It focuses on learning representations and discovering new clusters using a threshold method. One major drawback of CURL is that it suffers from over-clustering, as shown in its real-data experiments; we also show this empirically and demonstrate the improvement of our method over CURL in our experiment section. In contrast to CURL, we provide a probabilistic framework with a nonparametric Bayesian prior that allows the model to expand without bound automatically, instead of relying on an ad hoc threshold method. We develop a novel end-to-end variational inference strategy that learns deep representations and detects novel clusters in ULL simultaneously.

2.3 Bayesian Lifelong Learning

A Bayesian formulation is a natural choice for LL since it provides a systematic way to incorporate previously learned information into the prior distribution and obtain a posterior distribution that combines both the prior belief and the new information. The sequential nature of Bayes' theorem also paves the way to recursively update an approximation of the posterior distribution and then use it as a new prior to guide learning on new data in LL.

In nguyen2017variational, the authors propose a Bayesian formulation of LL. Although both works adopt a Bayesian framework, ours differs from nguyen2017variational in its objectives, inference strategy and knowledge preservation technique. nguyen2017variational provides a variational online inference framework for deep discriminative and deep generative models, studying the approximate posterior distribution of DNN parameters in a continual fashion. However, their method cannot uncover the latent clustering structure of the data or detect new clusters in emerging data. In contrast, we develop a novel Bayesian framework for representation learning, discovering latent clustering structure and finding new clusters on the fly, together with a novel end-to-end variational inference strategy in a ULL context.

2.4 Deep Generative Unsupervised Learning Methods in a Batch Setting

Recent research has focused on combining deep generative models to learn good representations of the original data and conduct clustering analysis in an unsupervised learning context (kingma2014auto; johnson2016composing; xie2016unsupervised; jiang2017variational; goyal2017nonparametric). However, these existing methods are designed for an independent and identically distributed (i.i.d.) batch training mode rather than an LL context. The majority of them operate in a static unsupervised learning setting where the number of clusters is fixed in advance. Thus, these methods cannot detect potential new clusters when new data arrive or the data distribution changes, and they do not adapt to an LL setting.

To summarize, our work fills the gap by providing a fully Bayesian framework for ULL, which has the unique capacity to use a deep generative model for representation learning while performing new cluster discovery on the fly with a nonparametric Bayesian prior and our proposed CERR technique. To alleviate the catastrophic forgetting challenge in LL, we propose sufficient statistics as a novel alternative to existing knowledge maintenance methods. We further develop an end-to-end Bayesian inference strategy, DBULL, to achieve our goal.

3 Model

3.1 Problem Formulation

In our ULL setting, a sequence of datasets arrives in a streaming order. When a new dataset arrives in memory, the previous dataset is no longer available. Our goal is to automatically learn the clusters (unlabeled classes) in each dataset.

Consider the unlabeled observations of the current dataset in memory, which may lie in a high-dimensional data space. We assume that a low-dimensional latent representation can be learned from each observation and, in turn, can be used to reconstruct it, and that the variation among observations is captured by their latent representations. Each observation is also associated with an unknown cluster membership.

We aim to find: (1) a good low-dimensional latent representation that efficiently extracts knowledge from the original data; (2) the clustering structure within the new dataset, with the capacity to discover potentially novel clusters without forgetting the previously learned clusters of seen datasets; and (3) an incremental learning strategy that optimizes clustering performance on a new dataset without dramatically degrading clustering performance on seen datasets.

We summarize our work in a flow chart in Fig. 1, provide its graphical model representation in Fig. 2, and describe the generative process of our graphical model for DBULL in the next section.

Figure 1: Flow chart of DBULL, best viewed in color. Given observations, DBULL learns their latent representations via an encoder, performs clustering under a Dirichlet Process Gaussian mixture model, and reconstructs the original observations via a decoder, where the encoder and decoder are each governed by their own parameters. To perform Lifelong Learning in an incremental fashion on streaming data, we introduce two novel components: Sufficient Statistics for knowledge preservation and Cluster Expansion and Redundancy Removal to create and merge clusters.

Figure 2: Graphical model representation of DBULL. Nodes denote random variables, edges denote possible dependence, and plates denote replication. Solid lines denote the generative model; dashed lines denote the variational approximation.

3.2 Generative Process of DBULL

The generative process for DBULL is as follows.

  • (a) Draw a latent cluster membership from a categorical distribution whose probability vector comes from the stick-breaking construction of a Dirichlet Process (DP).

  • (b) Draw a latent representation vector from the Gaussian mixture component indexed by the cluster membership sampled in (a).

  • (c) Generate the observation in the original data space from the latent representation via the decoder.

In (a), the categorical distribution is parameterized by the vector of cluster probabilities, whose kth element is the probability of cluster k. These probabilities depend on a vector of scalars coming from the stick-breaking construction of a DP (sethuraman1982convergence); we describe an iterative process to draw them in Section 4.2. CURL uses a latent mixture of Gaussian components to capture the clustering structure in an unsupervised learning context. In comparison, we adopt the DP mixture model in the latent space, with the advantage that the number of mixture components can be random and grow without bound as new data arrive, an appealing property for LL. We further explain in Section 4.2 why the DP is an appropriate prior for our problem. In (b), the latent vector is considered a low-dimensional representation of the original observation. We describe in Section 4.2 how a DP Gaussian mixture is used to model the latent space, since it is often assumed that the variation in the latent representation reflects the variation within the data; the presentation in (b) is simplified for ease of understanding. In (c), we assume the generative model is parameterized by a decoder network, chosen as a DNN due to its powerful function approximation and good feature learning capabilities (hornik1991approximation; kingma2014semi; nalisnick2016approximate).

Under this generative process, the joint probability density function can be factorized as

    p(x, z, c, v, η) = p(x | z) p(z | c, η) p(c | v) p(v) p(η),    (1)

where x denotes the observations, z the latent representations, c the cluster memberships, v the stick-breaking variables, and η the parameters of the mixture components (clusters); p(v) and p(η) are the prior distributions for v and η.

Next, we discuss how to choose appropriate priors p(v) and p(η) to endow our model with the flexibility to grow the number of mixture components without bound as new data arrive in an LL setting.

4 Why Bayesian for DBULL

In this section, we illustrate why a Bayesian framework is a natural choice for our ULL setting. Recall that we have a sequence of datasets from a single domain arriving in a streaming order. To mimic an LL setting, we assume that only one dataset fits in memory at a time. One key question in LL is how to efficiently maintain past knowledge to guide future learning.

4.1 Bayesian Reasoning for Lifelong Learning

A Bayesian framework is a suitable solution to this type of learning since it learns a posterior distribution, or an approximation thereof, that takes advantage of both the prior belief and the additional information in the new dataset. The sequential nature of Bayes' theorem ensures valid recursive updates of an approximation to the posterior distribution given the observations. The approximate posterior then serves as the new prior to guide future learning on new data in LL. Before describing our inference strategy, we first explain why the Bayesian updating rule is valid for our problem.

Given datasets D_1, ..., D_T, the posterior over all model parameters Θ after considering the tth dataset is

    p(Θ | D_1, ..., D_t) ∝ p(D_t | Θ) p(Θ | D_1, ..., D_{t-1}),    (2)

which reflects that the posterior after tasks and datasets 1, ..., t-1 can be considered as the prior for the next task and dataset. If we knew the normalizing constant exactly at every step, repeatedly updating (2) would be fully streaming, with no need to reuse past data. However, it is often intractable to compute the normalizing constant exactly. Thus, an approximation of the posterior distribution is necessary to update (2), since the exact posterior is infeasible to obtain.
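As a toy illustration of this recursive updating (not part of DBULL itself), a conjugate Beta-Bernoulli model makes the streaming update exact: each batch's posterior becomes the next batch's prior, and the result matches the batch posterior computed on all data at once.

```python
def beta_update(prior, batch):
    """Posterior (a, b) after observing a batch of 0/1 outcomes under a Beta(a, b) prior."""
    a, b = prior
    ones = sum(batch)
    return (a + ones, b + len(batch) - ones)

# Stream of three "datasets"; only the current batch is ever in memory.
stream = [[1, 0, 1], [1, 1, 1, 0], [0, 0, 1]]
posterior = (1.0, 1.0)  # Beta(1, 1) prior before any data
for batch in stream:
    posterior = beta_update(posterior, batch)  # yesterday's posterior is today's prior

# The recursion matches the batch posterior computed on all data at once.
all_data = [x for batch in stream for x in batch]
assert posterior == beta_update((1.0, 1.0), all_data)
```

For conjugate models the recursion is exact; for DBULL the normalizing constant is intractable, which is why an approximate posterior must play the role of the prior at each stage.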

4.2 Dirichlet Process Prior

The DP is often used as a nonparametric prior for partitioning exchangeable observations into discrete clusters. The DP mixture is a flexible mixture model in which the number of mixture components can be random and grow without bound as more data arrive. These properties make it a natural choice for our LL setting. In practice, we show in Section 5.6 how our inference expands and merges mixture components as new data arrive, starting from only one cluster. Next, we briefly review the DP and introduce our DP Gaussian mixture model to derive the joint probability density defined in (1).

A DP is characterized by a base distribution G_0 and a concentration parameter α. A constructive definition of the DP via a stick-breaking process is of the form G = Σ_{k=1}^∞ π_k δ_{η_k}, where δ_{η_k} is a discrete measure concentrated at η_k, a random sample from the base distribution, and π_k is its mixing proportion (ishwaran2001gibbs). In a DP, the weights π_k are random, independent of G_0, and satisfy 0 ≤ π_k ≤ 1 and Σ_k π_k = 1. The weights can be drawn through the iterative process

    v_k ~ Beta(1, α),    π_k = v_k ∏_{j<k} (1 - v_j),

where k = 1, 2, ....

Under the generative process of DBULL in Section 3.2, the weights π_k represent the probabilities of each cluster (mixture component) used in step (a), and the atoms η_k can be seen as the parameters of the Gaussian mixture over the latent representations in step (b). Thus, given our generative process, the corresponding joint probability density for our model is

    p(x, z, c, v, η) = ∏_k p(η_k) p(v_k | α) ∏_i p(c_i | v) p(z_i | c_i, η) p(x_i | z_i),    (3)

where i indexes observations and k indexes mixture components. For a Gaussian mixture model, the base distribution is often chosen as the Normal-Wishart (NW) distribution to generate the mixture parameters η_k, i.e., the mean and precision of each Gaussian component, with hyper-parameters that depend on the dimension of the latent vector. The hyper-parameter values are conventional choices in the Bayesian nonparametric literature for Gaussian mixtures. Moreover, the performance of our method is robust to the hyper-parameter values.

5 Inference for DBULL

There are several new challenges in developing an end-to-end inference algorithm for our problem under the ULL setting compared with the batch setting: one has to deal with catastrophic forgetting, provide mechanisms for past knowledge preservation, and support dynamic model expansion for novel cluster discovery. For pedagogical reasons, we first describe our general parameter learning strategy via variational inference for DBULL in a standard batch setting. We then describe how we resolve the additional challenges in the lifelong (streaming) learning setting. Our novel components in the inference algorithm are a new knowledge preservation scheme via sufficient statistics, described in Section 5.4, and an automatic Cluster Expansion and Redundancy Removal (CERR) strategy, described in Section 5.6. A summary of our algorithm in the LL setting is provided in Algorithm 1. Our implementation is available online, and the implementation details are provided in Appendix C. We explain the contribution of the sufficient statistics to the probabilistic density function of our problem and to knowledge preservation in Section 5.5.

5.1 Variational Inference and ELBO Derivation

In practice, it is often infeasible to obtain the exact posterior distribution since the normalizing constant in the posterior is intractable. MCMC methods provide a systematic way to sample from the posterior distribution but are often slow in high-dimensional parameter spaces, so effective alternatives are needed. Variational inference is a promising alternative, which approximates the posterior distribution by casting inference as an optimization problem. It seeks, within a class of tractable distributions, the surrogate distribution closest to the distribution of interest, i.e., the one minimizing the Kullback-Leibler (KL) divergence to the exact posterior. In our setting, minimizing the KL divergence between the variational posterior and the true posterior is equivalent to maximizing the Evidence Lower Bound (ELBO). To make the core idea easier to follow, we provide a high-level explanation of variational inference here; mathematical details can be found in Appendix A.
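In generic form (writing x for the data and z for all latent variables, rather than the exact factorization used by DBULL), the identity behind this equivalence is

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z)}{q(z)}\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\!\left(q(z) \,\middle\|\, p(z \mid x)\right)
  \;\ge\; \mathrm{ELBO}(q).
```

Since the KL term is nonnegative and log p(x) does not depend on q, maximizing the ELBO over q is equivalent to minimizing the KL divergence to the exact posterior.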

Given the generative process in Section 3.2 and using Jensen's inequality, we obtain a lower bound on the marginal log-likelihood of the data. For simplicity, we assume the variational posterior factorizes; the resulting ELBO is given in equation (5). We assume our variational distribution takes the form of equation (6), where the encoder is a neural network parameterized by φ, K is the number of mixture components in the DP of the variational distribution, and the variational distribution over cluster memberships is a Multinomial distribution. The notation used in equations (5), (6) and (7) is defined in Table 1. Our inference strategy starts with only one mixture component and uses the CERR technique described in Section 5.6 to either increase or merge the number of clusters.

5.2 General Parameter Learning Strategy

In equation (5), there are two main types of parameters to optimize. The first type includes the decoder and encoder parameters θ and φ of the neural networks. The other type involves the latent cluster memberships and the parameters of the DP Gaussian mixture model.

To perform joint inference for both types of parameters, we adopt an alternating optimization strategy. First, we update the neural network parameters (θ and φ) to learn the latent representation given the DP Gaussian mixture parameters. This is achieved by optimizing only the first three terms of equation (5), which are the terms that contribute to the optimization of θ and φ. Under our variational distribution assumptions in (6), by taking advantage of the reparameterization trick (kingma2014auto) and Monte Carlo estimates of expectations, we obtain the objective summarized in equation (7).

Notations in the ELBO
θ: parameters in the decoder.
φ: parameters in the encoder.
N: the total number of observations.
K: the total number of clusters.
d: the dimension of the latent representation z.
x_ij: the jth dimension of the ith observation.
c_i: cluster membership for the ith observation.
L: the number of Monte Carlo samples in Stochastic Gradient Variational Bayes (SGVB).
β_k: the posterior scalar precision in the NW distribution.
m_k: the posterior mean of cluster k.
ν_k: the kth posterior degrees of freedom of the NW.

Table 1: Notations in the ELBO.

We provide the notation in Table 1; derivation details are in Appendix A. Then, we update the DP Gaussian mixture parameters and the cluster memberships given the current neural network parameters and the latent representations. This allows us to use the improved latent representations to infer latent cluster memberships, and the updated clustering in turn facilitates learning the latent representations. The update equations for the DP mixture model parameters can be found in blei2006variational. We describe the core idea of automatic CERR in our inference in Section 5.6, explaining how we start with only one cluster and achieve dynamic model expansion by creating new mixture components (clusters) as new data arrive in LL.

Our general parameter learning strategy via variational inference may seem straightforward for a batch setting at first glance. However, both the derivation and the implementation are nontrivial, especially when incorporating our new components into the end-to-end inference procedure to address the additional challenges of an LL setting. For illustration purposes, we describe the high-level core idea of our inference procedure. The main difficulty lies in adapting our inference algorithm from a batch setting to an LL setting, which requires overcoming catastrophic forgetting, maintaining past knowledge, and developing a dynamic model with automatic cluster discovery and redundancy removal capacity. Next, we describe our novel solutions.
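A minimal sketch of the reparameterization trick and Monte Carlo expectation estimation used in the neural network update above, in plain NumPy (outside any autodiff framework, so the gradient flow is only indicated in comments):

```python
import numpy as np

rng = np.random.default_rng(1)

def reparam_sample(mu, log_var, L):
    """z = mu + sigma * eps with eps ~ N(0, I); randomness is isolated in eps,
    so gradients could flow through mu and log_var in an autodiff framework."""
    eps = rng.standard_normal((L,) + mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def mc_expectation(f, mu, log_var, L=1000):
    """Monte Carlo estimate of E_{z ~ N(mu, diag(exp(log_var)))}[f(z)]."""
    zs = reparam_sample(mu, log_var, L)
    return np.mean([f(z) for z in zs], axis=0)
```

For example, with mu = 0 and log_var = 0 (a standard normal), `mc_expectation(lambda z: z ** 2, ...)` approaches 1 as L grows; in the ELBO, f is the relevant term of the variational objective.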

5.3 Our Ingredients for Alleviating Catastrophic Forgetting

Catastrophic forgetting or catastrophic interference is a dramatic issue for DNNs, as witnessed in SLL (kirkpatrick2017overcoming; shin2017continual). In our ULL setting, the issue is even more challenging, since there are more sources of interference than in SLL that may lead to abrupt decreases in model performance when training on new data. The first source is the same as in SLL: the DNNs forget previously learned information upon learning new information. Additionally, in an unsupervised setting, the model cannot recover the learned cluster memberships and clustering-related parameters of the DP mixture model once the previous data are no longer available, or once the previously learned information of the DNNs has been wiped out by new learning, because the learned clustering structure depends on the latent representations of the raw data, which are determined by the DNNs' parameters and the data distributions.

To resolve these issues, we develop a novel solution that combines two ingredients: (1) generating and replaying a fixed small number of samples based on our generative process in Section 3.2, given the current DNN and DP Gaussian mixture parameter estimates, which is a computationally effective byproduct of our algorithm; and (2) a novel hierarchical sufficient-statistics knowledge preservation strategy to remember the clustering information in an unsupervised setting.

We choose to replay a number of generative samples to preserve the previous data distribution instead of using a subset of past real data, since storing past data may require large memory, and such data storage and replay may not be feasible in real big-data applications. More details on replaying deep generative samples rather than real data in LL are discussed in shin2017continual. Moreover, our use of sufficient statistics is novel and allows incremental updates of the clustering information as new data arrive, without access to previous data, thanks to the additive property of sufficient statistics. We introduce this strategy in the next section.

5.4 Sufficient Statistics for Knowledge Preservation

As LL is an emerging field, there is no well-accepted definition of knowledge or an appropriate representation scheme for efficiently maintaining past knowledge from seen data. Researchers have adopted prior distributions (nguyen2017variational) or model parameters (lee2019learning) to represent past knowledge in most SLL problems, where incrementally achieving high prediction accuracy is the main objective. However, there is no guidance on preserving past knowledge in an unsupervised learning setup.

We propose a novel knowledge preservation strategy in DBULL. In our problem, there are two types of knowledge to maintain. The first comprises the previously learned DNN parameters needed to encode the latent representation of the raw data and to reconstruct the raw data from it. The other involves the DP Gaussian mixture parameters, which represent the characteristics and mixing proportions of the different clusters. Our knowledge representation scheme uses hierarchical sufficient statistics to preserve the information related to the DP Gaussian mixture, and we develop a sequential updating rule to update this knowledge.

Assume that we have encountered several datasets, and that only one dataset can be in memory at a time. While in memory, each dataset can be divided into mini-batches. To define the sufficient statistics, we first define the global parameters of the DP Gaussian mixture as the probability of each mixture component (cluster) together with the mixture parameters of each cluster. We define the local parameters as the cluster memberships of the observations in memory. To remember the characteristics of all encountered data and the local information of the current dataset, we memorize three levels of sufficient statistics: mini-batch sufficient statistics for each mini-batch of the current dataset, stream sufficient statistics for the current dataset, and overall sufficient statistics for all encountered datasets. For each cluster, the mini-batch sufficient statistics aggregate, over the observations in a mini-batch, the estimated probability of each observation belonging to that cluster together with the sufficient statistics of its latent representation under an exponential-family distribution (the Gaussian distribution, a member of the exponential family, in our case).

To efficiently maintain and update our knowledge, we develop our updating algorithm as: (1) substract the old summary of each mini-batch and update the local parameters; (2) compute a new summary for each mini-batch; and, (3) update the stream sufficient statistics for each cluster learned in the current dataset.


For the dataset in the learning phase, we repeat the updating process for multiple iterations to refine the training while learning the DP Gaussian mixture parameters and the cluster memberships. Finally, we update the overall sufficient statistics by . The correctness of the algorithm is guaranteed by the additive property of sufficient statistics.
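The subtract-and-replace scheme above can be sketched as follows. This is a minimal illustration with names of our own choosing, not the paper's implementation: each cluster's statistics are the responsibility-weighted count, sum, and sum of outer products, and their additivity is exactly what makes removing a stale mini-batch summary and adding a refreshed one valid.

```python
import numpy as np

class StreamSufficientStats:
    """Additive sufficient statistics for a Gaussian mixture. For each
    cluster k we track (N_k, sum_i r_ik x_i, sum_i r_ik x_i x_i^T), where
    r[i, k] are responsibilities. Additivity lets mini-batch summaries be
    subtracted and re-added as local assignments are refined."""

    def __init__(self, n_clusters, dim):
        self.counts = np.zeros(n_clusters)              # N_k
        self.sums = np.zeros((n_clusters, dim))         # sum_i r_ik x_i
        self.outers = np.zeros((n_clusters, dim, dim))  # sum_i r_ik x_i x_i^T

    @staticmethod
    def summarize(x, resp):
        """Summary of one mini-batch: x is (n, d), resp is (n, K)."""
        counts = resp.sum(axis=0)
        sums = resp.T @ x
        outers = np.einsum('ik,id,ie->kde', resp, x, x)
        return counts, sums, outers

    def subtract(self, summary):
        """Remove a stale mini-batch summary before re-summarizing it."""
        c, s, o = summary
        self.counts -= c
        self.sums -= s
        self.outers -= o

    def add(self, summary):
        """Fold a fresh mini-batch summary into the stream statistics."""
        c, s, o = summary
        self.counts += c
        self.sums += s
        self.outers += o
```

In a refinement pass, each mini-batch's cached summary is `subtract`ed, the responsibilities are recomputed under the current mixture, and the new summary is `add`ed back; the overall statistics stay consistent by additivity.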

5.5 Contribution of Sufficient Statistics to Alleviate Forgetting

The sufficient statistics alleviate forgetting by preserving data characteristics, and they allow sequential updates in LL without the need to store raw data. To be precise, the sufficient statistics allow us to update the log-likelihood and the ELBO sequentially, since both terms are linear functions of the expected sufficient statistics. Given the expected sufficient statistics, we can evaluate the first two terms of in equation (7) and in the joint probability density function of our model in equation (3). Next, we provide mathematical derivations to illustrate this.

Define the sufficient statistics of all data as , where is the number of clusters. Define , where and denotes the probability of the th observation belonging to cluster , , and is the total number of observations. Given the sufficient statistics and the current mixture parameters and , we can evaluate in the joint probability density function in equation (3) without storing the latent representation of each observation. Similarly, we can evaluate the first two terms of in equation (7) with the notation defined in Table 1.
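To make this linearity concrete, the sketch below (illustrative names, assuming known cluster means and covariances) evaluates the expected Gaussian log-likelihood term from the per-cluster statistics alone, via the standard exponential-family identity; no individual latent representation needs to be stored.

```python
import numpy as np

def expected_gauss_loglik(counts, sums, outers, mus, sigmas):
    """Evaluate sum_i sum_k r_ik log N(x_i | mu_k, Sigma_k) from the
    per-cluster sufficient statistics alone:
      N_k = sum_i r_ik, s_k = sum_i r_ik x_i, S_k = sum_i r_ik x_i x_i^T."""
    d = mus.shape[1]
    total = 0.0
    for N_k, s_k, S_k, mu, Sig in zip(counts, sums, outers, mus, sigmas):
        prec = np.linalg.inv(Sig)
        # sum_i r_ik (x_i - mu)^T prec (x_i - mu)
        #   = tr(prec (S_k - mu s_k^T - s_k mu^T + N_k mu mu^T))
        quad = S_k - np.outer(mu, s_k) - np.outer(s_k, mu) + N_k * np.outer(mu, mu)
        total += N_k * (-0.5 * d * np.log(2 * np.pi)
                        - 0.5 * np.linalg.slogdet(Sig)[1])
        total -= 0.5 * np.trace(prec @ quad)
    return total
```

Because the expression is linear in (N_k, s_k, S_k), the same call works whether the statistics summarize one mini-batch, one stream, or all data encountered so far.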

5.6 Cluster Expansion and Redundancy Removal Strategy

Our model starts with one cluster and, as new data with different characteristics arrive, we expect the model either to grow the number of clusters dynamically or to merge clusters with similar characteristics. To achieve this, we perform birth and merge moves in a similar fashion to the Nonparametric Bayesian literature (hughes2013memoized) to allow automatic cluster expansion and redundancy removal. However, we emphasize that our work differs from hughes2013memoized in that our merge moves have extra constraints. To avoid losing information about clusters learned earlier, we only allow merge moves between two novel clusters from the birth move, or between one existing cluster and a newborn cluster; two previously existing clusters cannot be merged. Reference hughes2013memoized is designed for batch learning and thus does not require this constraint, but in LL the constraint is important for avoiding information loss.

It is challenging to give birth to new clusters with streaming data, since the number of observations may not be sufficient to inform good proposals. To resolve this issue, we follow hughes2013memoized by collecting a subsample of data for each learned cluster . We cache a sample in the subsample if the probability of the th observation being assigned to cluster exceeds a threshold of 0.1, a value suggested by hughes2013memoized. In this paper, we choose commonly used parameter values from the literature and avoid dataset-specific tuning as much as possible. We fit the DP Gaussian mixture to the cached samples starting with one cluster and expand the model with 10 novel clusters. However, adopting only birth moves may overcluster the observations. After the birth move, we therefore merge clusters by (1) selecting candidate clusters to merge and (2) merging two selected clusters if the ELBO improves. A pair of clusters is selected as a candidate if the marginal likelihood of the merged cluster is larger than the marginal likelihood of keeping the two clusters separate.
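The constrained candidate selection can be sketched as follows. The scoring callables `merged_score` and `separate_score` are hypothetical stand-ins for the marginal-likelihood computations, which we do not reproduce here; the point of the sketch is the lifelong-learning eligibility rule.

```python
from itertools import combinations

def merge_candidates(cluster_ids, newborn_ids, merged_score, separate_score):
    """Select merge candidates under the lifelong-learning constraint:
    a pair is eligible only if at least one member is a newborn cluster
    from the birth move; two previously existing clusters never merge,
    which protects knowledge learned on earlier tasks. An eligible pair
    becomes a candidate when the (approximate) marginal likelihood of the
    merged cluster exceeds that of keeping the two clusters separate."""
    newborn = set(newborn_ids)
    candidates = []
    for a, b in combinations(cluster_ids, 2):
        if a not in newborn and b not in newborn:
            continue  # both clusters predate the birth move: never merge
        if merged_score(a, b) > separate_score(a) + separate_score(b):
            candidates.append((a, b))
    return candidates
```

Each candidate pair would then be merged only if the resulting ELBO improves, as described above.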

1:  Initialization: Initialize the DNN, the variational distributions and the hyperparameters for

2:  for  datasets in memory do
3:     for  do
4:         for  iterations do
5:            Update the weights of the encoder and decoder via to maximize in equation (7) given the current DPMM parameters, where is the learning rate used in stochastic gradient descent.

6:         end for
7:         Compute the deep representation of observations using the encoder .
8:         for  mini-batch of dataset  do
9:            while the ELBO of the DPMM has not converged do
10:               Visit the th mini-batch of the th dataset in a full pass.
11:               Update the mini-batch sufficient statistics and the stream sufficient statistics of Section 5.4 via equations (8) and (9).
12:               Update the local and global parameters of the DPMM using mini-batch of dataset via standard variational inference for the DPMM (blei2006variational), provided in Appendix A.
13:               Perform cluster expansion described in Section 5.6 to propose new clusters.
14:               Perform redundancy removal described in Section 5.6 to merge clusters if the ELBO improves.
15:            end while
16:         end for
17:         Update the overall sufficient statistics defined in Section 5.4 via equation (10).
18:     end for
19:  end for
20:  Output: Learned DNN, variational approximation to the posterior, deep latent representations, cluster assignment for each observation, and cluster representatives in the latent space.
Algorithm 1 Variational Inference for DBULL

6 Experiments

Datasets. We adopt the most common text and image benchmark datasets in LL to evaluate the performance of our method. The MNIST database of 70,000 handwritten digit images (lecun1998gradient) is widely used to evaluate deep generative models (kingma2014auto; xie2016unsupervised; johnson2016composing; goyal2017nonparametric; jiang2017variational) for representation learning, as well as LL models in both supervised (kirkpatrick2017overcoming; nguyen2017variational; shin2017continual) and unsupervised (rao2019continual) contexts. To provide a fair comparison with state-of-the-art competing methods and easy interpretation, we mainly use MNIST to evaluate the performance of our method and interpret the results with intuitive visualizations. To examine our method on more complex data, we use the text database Reuters10k (lewis2004rcv1) and the image database STL-10 (coates2011analysis). STL-10 is at least as hard as the well-known image database CIFAR-10 (krizhevsky2009learning), since STL-10 has fewer labeled training examples per class. Summary statistics for the datasets are provided in Table 2.

Dataset # Samples Dimension # Classes
MNIST 70000 784 10
Reuters10k 10000 2000 4
STL-10 13000 2048 10
Table 2: Summary statistics for benchmark datasets.

We adopt the same neural network architecture as in jiang2017variational. All tuning parameter values and implementation details for the DNN are provided in Appendix C. Our implementation is publicly available at

Competing Methods in Unsupervised Lifelong Learning. CURL is currently the only unsupervised lifelong learning method with both representation learning and new cluster discovery capabilities, which makes it the most recent, most closely related, and most directly comparable method to ours. We use CURL-D to denote CURL without the true number of clusters provided, which instead detects it Dynamically from unlabelled streaming data.

Competing Methods in a Classic Batch Setting. Although DBULL is designed for LL, to show its generality in a batch training mode we compare it against recent deep (generative) methods with representation learning and clustering capabilities designed for batch settings, including DEC (xie2016unsupervised), VaDE (jiang2017variational), CURL-F (rao2019continual) and VAE+DP. CURL-F denotes CURL with the true number of clusters provided as a Fixed value. VAE+DP first fits a VAE (kingma2014auto) to learn latent representations and then uses a DP to learn the clustering, in two separate steps. We list the capabilities of the different methods in Table 3.

Lifelong Learning Batch Setting
Representation Learning yes yes yes yes yes yes
Learns # of Clusters yes yes no no yes no
Dynamic Expansion yes yes no no yes no
Overcome Forgetting yes yes no no no yes
Table 3: DBULL and competing methods capacity comparison.

Evaluation Metrics. One of the main objectives of our method is to perform new cluster discovery with streaming non-stationary data. It is thus desirable for our method to achieve high clustering quality in both the ULL and batch settings. We adopt the clustering quality metrics Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Homogeneity Score (HS), Completeness Score (CS) and V-measure Score (VM). These are all normalized metrics ranging from zero to one, with larger values indicating better clustering quality. NMI, ARI and VM values of one represent clustering identical to the ground truth; CS is the symmetric counterpart of HS. Detailed definitions of these metrics can be found in rosenberg2007v.
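All five metrics are available in scikit-learn. For instance, a predicted clustering that matches the ground truth up to a relabelling of clusters scores one under each of them:

```python
# Clustering-quality metrics used in the experiments, via scikit-learn.
from sklearn.metrics import (adjusted_rand_score, completeness_score,
                             homogeneity_score, normalized_mutual_info_score,
                             v_measure_score)

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 2]  # a pure relabelling of the ground truth

scores = {
    "NMI": normalized_mutual_info_score(true_labels, pred_labels),
    "ARI": adjusted_rand_score(true_labels, pred_labels),
    "HS": homogeneity_score(true_labels, pred_labels),
    "CS": completeness_score(true_labels, pred_labels),
    "VM": v_measure_score(true_labels, pred_labels),
}
# all five scores equal 1.0 here, since cluster labels are only
# compared up to permutation
```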

6.1 Lifelong Learning Performance Comparison

Experiment Objective. LL methods should be able to adapt a learned model to new data while retaining information learned earlier. The objective of this experiment is to demonstrate that DBULL has this capacity and is effective compared with state-of-the-art LL methods: there is no dramatic performance decrease on past data even after the model has been updated with new information.

Experiment Setup. To evaluate the performance of DBULL, we adopt the most common experimental setup in LL, called Split MNIST, which uses images from MNIST (zenke2017continual; nguyen2017variational). We divide MNIST into 5 disjoint subsets, each containing 10,000 random samples of two digit classes in the order 0-1, 2-3, 4-5, 6-7 and 8-9, denoted as , , , , and . Each dataset is divided into 20 subsets that arrive in sequential order to mimic an LL setting. We denote by the union of all data from to , where .
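A Split MNIST stream of this shape can be constructed roughly as follows; this is a sketch with illustrative names (`images`, `labels`, the chunking scheme), not the exact sampling code used in our experiments.

```python
import numpy as np

def split_tasks(images, labels,
                pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9)),
                samples_per_task=10000, n_chunks=20, seed=0):
    """Build a Split MNIST stream: five disjoint two-digit tasks, each a
    random sample of `samples_per_task` images, delivered as `n_chunks`
    sequentially arriving subsets."""
    rng = np.random.default_rng(seed)
    tasks = []
    for a, b in pairs:
        idx = np.flatnonzero((labels == a) | (labels == b))
        idx = rng.choice(idx, size=min(samples_per_task, idx.size), replace=False)
        tasks.append(np.array_split(images[idx], n_chunks))
    return tasks
```

Each element of `tasks` is the list of sequentially arriving subsets for one task, so the learner sees one chunk at a time and never revisits earlier tasks' raw data.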

Discussion on Performance. To check whether our method suffers dramatic performance loss due to catastrophic forgetting, we sequentially train our method DBULL and its LL competitor CURL-D on . We define as training on , where . We measure the performance of after training , , , , with datasets , and , the performance of after training with datasets , etc. We report the LL clustering quality for each task after sequentially training the five tasks in Fig. 3 and Fig. 4.

Figure 3: Clustering quality performance in terms of ARI and CS measured on each task after sequential training from to across every 500 iterations for each task.

Figure 4: Clustering quality performance in terms of HS, NMI and VM measured on each task after sequential training from to across every 500 iterations for each task.

Fig. 3 and Fig. 4 show that DBULL handles catastrophic forgetting better than CURL-D, since DBULL suffers a slightly smaller performance drop on previous tasks in almost all scenarios under nearly all clustering metrics.

Figure 5: Number of clusters detected by CURL-D and DBULL after sequential training from to across every 500 iterations for each task, compared with the ground truth.

Fig. 5 shows that DBULL handles overclustering better than CURL-D. Since each task contains two digits, the true number of clusters seen after training each task sequentially is 2, 4, 6, 8, 10. The number of clusters automatically detected by DBULL after training each task is 4, 6, 8, 10, 12: DBULL splits digit 1 into three clusters of different handwritten patterns in . For the other digits, DBULL assigns each new digit to exactly one cluster, matching the ground truth. In contrast, CURL-D clusters digits 0-1 into 14-16 clusters of and obtains 23-25 clusters for the 10 digits after training the five tasks sequentially. We provide a visualization of the reconstructed cluster means from the DP mixture model via the trained decoder of DBULL in Fig. 6.

Figure 6: Decoded images using the DP Gaussian mixture posterior means after sequential training from to using DBULL.

Besides the overall clustering quality, we also report the precision and recall of DBULL for each digit after sequentially training all tasks. CURL-D overclusters the 10 digits into 25 clusters, making it hard to report per-digit precision and recall. To visualize the results, the three sub-clusters of digit 1 found by DBULL have been merged into one cluster. Overall, there is no significant performance loss on previous tasks after sequentially training multiple tasks for digits 0, 1, 3, 4, 6, 8 and 9. Digit 2 experiences a precision decrease after training on the task of digits 6 and 7, since DBULL has trouble differentiating some samples of digits 2 and 7.

Figure 7: Precision and recall for each digit of DBULL evaluated after sequentially training from to across every 500 iterations for each task.

6.2 Batch Setting Clustering Performance Comparison

Experiment Objective. The goal of this experiment is to demonstrate the generality of our LL method DBULL, which achieves performance comparable to competing methods in an unsupervised batch setting.

Experiment Setup. To examine the performance of our method in a batch setting, we test it on the more complex datasets Reuters10k, obtained from the original Reuters corpus (lewis2004rcv1), and the image dataset STL-10 (coates2011analysis). We use the same Reuters10k and STL-10 datasets as xie2016unsupervised and jiang2017variational; their details are provided in Appendix B. For all datasets and all methods, we randomly select 80% of the samples for training and evaluate performance on the remaining 20%.

Discussion on Performance. The true number of clusters is provided to the competing methods DEC, VaDE and CURL-F in advance, since they require the total number of clusters. DBULL, CURL-D and VAE+DP have less information than DEC, VaDE and CURL-F, since they have no knowledge of the true number of clusters: all three start with one cluster and detect the number of clusters on the fly. Thus, if DBULL achieves performance similar to DEC, VaDE and CURL-F and outperforms its LL counterpart CURL-D, this demonstrates DBULL's effectiveness. Table 4 shows that DBULL performs best in NMI and VM for MNIST and in NMI, ARI and VM for STL-10, and outperforms CURL-D on MNIST and STL-10. Moreover, DBULL and DEC are more stable in terms of all evaluation metrics, with smaller standard errors than the other methods. We also report the number of clusters found by DBULL and CURL-D for MNIST, Reuters10k and STL-10 over five replications in Table 5, which shows that DBULL handles overclustering better than CURL-D. In summary, Tables 4 and 5 demonstrate DBULL's effectiveness in a batch setting. If we fix the number of clusters to the true value for VaDE, DEC and CURL-F during training, the clustering accuracy can be interpreted as classification accuracy. However, we cannot set the number of clusters to the true value in DBULL: as Table 5 shows, over five replications the number of clusters found by DBULL ranges from 11 to 15. To compute a clustering accuracy for DBULL, taking MNIST as an example, we count the number of correctly clustered samples in the 10 biggest clusters and divide by the total number of samples in the testing phase. We report the accuracy comparison in Table 6.

Dataset Method NMI ARI
MNIST DEC 84.67 (2.25) 83.67 (4.53)
VaDE 80.35 (4.68) 74.06 (9.11)
VAE+DP 81.70 (0.825) 70.49 (1.654)
CURL-F 69.76 (2.51) 56.47 (4.11)
CURL-D 63.51 (1.32) 36.84 (1.98)
DBULL 85.72 (1.02) 83.53 (2.35)
Reuters10k DEC 46.56 (5.36) 46.86 (7.98)
VaDE 41.64 (4.73) 38.49 (5.44)
VAE + DP 41.62 (2.99) 37.93 (4.57)
CURL-F 51.92 (3.22) 47.72 (4.00)
CURL-D 46.31 (1.83) 22.00 (3.60)
DBULL 45.32 (1.79) 42.66 (5.73)
STL10 DEC 71.92 (2.66) 58.73 (5.09)
VaDE 68.35 (3.85) 59.42 (6.84)
VAE+DP 43.18 (1.41) 26.58 (1.32)
CURL-F 66.98 (3.38) 51.24 (4.06)
CURL-D 65.71 (1.33) 37.96 (4.69)
DBULL 75.26 (0.53) 70.72 (0.81)

Dataset Method HS VM
MNIST DEC 84.67 (2.25) 84.67 (2.25)
VaDE 79.86 (4.93) 80.36 (4.69)
VAE+DP 91.27 (0.215) 81.19 (0.904)
CURL-F 68.60 (2.56) 69.75 (2.51)
CURL-D 76.35 (1.53) 62.45 (1.32)
DBULL 89.34 (0.25) 85.65 (0.51)
Reuters10k DEC 48.44 (5.44) 46.52 (5.36)
VaDE 43.64 (4.88) 41.60 (4.73)
VAE + DP 46.64 (3.85) 41.34 (2.94)
CURL-D 66.90 (2.09) 43.34 (2.00)
CURL-F 54.38 (3.49) 51.86 (3.21)
DBULL 48.88 (1.86) 45.40 (2.04)
STL10 DEC 68.47 (3.48) 71.83 (2.72)
VaDE 67.24 (4.23) 68.37 (3.92)
VAE+DP 42.28 (1.03) 43.16 (1.39)
CURL-F 65.46 (3.27) 66.96 (3.37)
CURL-D 80.86 (2.94) 64.31 (1.24)
DBULL 77.61 (1.29) 75.22 (0.52)
Table 4: Clustering quality (%) comparison averaged over five replications with both the average value and the standard error (in the parenthesis) provided.
Datasets True # of Clusters DBULL CURL-D
MNIST 10 11-15 34
Reuters10k 4 5-10 40
STL-10 10 12-15 50
Table 5: Number of clusters found by DBULL and CURL-D over five replications, where the upper bounds on the number of clusters for Reuters10k and STL-10 are set to 40 and 50.
Datasets DEC VaDE CURL-F (best) CURL-F (average) DBULL
MNIST 84.30% 94.46% 84% 79.38% (4.26%) 92.27%
Table 6: Clustering accuracy for VaDE, DEC, CURL-F and DBULL. We report the best accuracy for DBULL since DEC (xie2016unsupervised, ) and VaDE (jiang2017variational, ) only report their best accuracy. CURL-F (rao2019continual, ) reports both the best accuracy and the average accuracy with the standard error.

7 Conclusion

In this work, we introduce our approach DBULL for unsupervised LL problems. DBULL is a novel end-to-end approximate Bayesian inference algorithm that performs automatic new task discovery via our dynamic model expansion strategy, adapts to changes in evolving data distributions, and overcomes forgetting through our information preservation mechanism based on summary sufficient statistics, all while simultaneously learning the underlying representation. Experiments on MNIST, Reuters10k and STL-10 demonstrate that DBULL has competitive performance compared with state-of-the-art methods in both a batch setting and an unsupervised LL setting.


The work described was supported in part by Award Numbers U01 HL089856 from the National Heart, Lung, and Blood Institute and NIH/NCI R01 CA199673.


  • (1) M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation, Vol. 24, Elsevier, 1989, pp. 109–165.
  • (2) S. Thrun, T. M. Mitchell, Lifelong robot learning, Robotics and autonomous systems 15 (1-2) (1995) 25–46.
  • (3) S. Thrun, Is learning the n-th thing any easier than learning the first?, in: Advances in neural information processing systems, 1996, pp. 640–646.
  • (4) P. Ruvolo, E. Eaton, Ella: An efficient lifelong learning algorithm, in: International Conference on Machine Learning, 2013, pp. 507–515.
  • (5) Z. Chen, N. Ma, B. Liu, Lifelong learning for sentiment classification, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 750–756.
  • (6) S. S. Sarwar, A. Ankit, K. Roy, Incremental learning in deep convolutional neural networks using partial network sharing, IEEE Access.
  • (7) S. Hou, X. Pan, C. Change Loy, Z. Wang, D. Lin, Lifelong learning via progressive distillation and retrospection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 437–452.
  • (8) P. Ruvolo, E. Eaton, Active task selection for lifelong machine learning, in: Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
  • (9) D. Isele, M. Rostami, E. Eaton, Using task features for zero-shot knowledge transfer in lifelong learning., in: IJCAI, 2016, pp. 1620–1626.
  • (10) D. M. Blei, M. I. Jordan, et al., Variational inference for dirichlet process mixtures, Bayesian analysis 1 (1) (2006) 121–143.
  • (11) G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: A review, Neural Networks.
  • (12) J. L. McClelland, B. L. McNaughton, R. C. O’Reilly, Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory., Psychological review 102 (3) (1995) 419.
  • (13) Z. Chen, B. Liu, Lifelong Machine Learning, Morgan & Claypool Publishers, 2018.
  • (14) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the national academy of sciences 114 (13) (2017) 3521–3526.
  • (15) F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org, 2017, pp. 3987–3995.
  • (16) A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7 (2) (1995) 123–146.
  • (17) H. Xu, B. Liu, L. Shu, P. S. Yu, Lifelong domain word embedding via meta-learning, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4510–4516.
  • (18) C. V. Nguyen, Y. Li, T. D. Bui, R. E. Turner, Variational continual learning, in: International Conference on Learning Representations (ICLR), 2018.
  • (19) H. Shin, J. K. Lee, J. Kim, J. Kim, Continual learning with deep generative replay, in: Advances in Neural Information Processing Systems, 2017, pp. 2990–2999.
  • (20) L. Shu, B. Liu, H. Xu, A. Kim, Lifelong-rl: Lifelong relaxation labeling for separating entities and aspects in opinion targets, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, Vol. 2016, NIH Public Access, 2016, p. 225.
  • (21) L. Shu, H. Xu, B. Liu, Lifelong learning crf for supervised aspect extraction, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017.
  • (22) D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, R. Hadsell, Continual unsupervised representation learning, in: Advances in Neural Information Processing Systems, 2019, pp. 7645–7655.
  • (23) D. P. Kingma, M. Welling, Auto-encoding variational bayes, in: International Conference on Learning Representations (ICLR), 2014.
  • (24) M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, S. R. Datta, Composing graphical models with neural networks for structured representations and fast inference, in: Advances in neural information processing systems, 2016, pp. 2946–2954.
  • (25) J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: International conference on machine learning, 2016, pp. 478–487.
  • (26) Z. Jiang, Y. Zheng, H. Tan, B. Tang, H. Zhou, Variational deep embedding: an unsupervised and generative approach to clustering, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, AAAI Press, 2017, pp. 1965–1972.
  • (27) P. Goyal, Z. Hu, X. Liang, C. Wang, E. P. Xing, Nonparametric variational auto-encoders for hierarchical representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5094–5102.
  • (28) J. Sethuraman, R. C. Tiwari, Convergence of dirichlet measures and the interpretation of their parameter, in: Statistical decision theory and related topics III, Elsevier, 1982, pp. 305–315.
  • (29) K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural networks 4 (2) (1991) 251–257.
  • (30) D. P. Kingma, S. Mohamed, D. J. Rezende, M. Welling, Semi-supervised learning with deep generative models, in: Advances in neural information processing systems, 2014, pp. 3581–3589.
  • (31) E. Nalisnick, L. Hertel, P. Smyth, Approximate inference for deep latent gaussian mixtures, in: NIPS Workshop on Bayesian Deep Learning, Vol. 2, 2016.
  • (32) H. Ishwaran, L. F. James, Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association 96 (453) (2001) 161–173.
  • (33) S. Lee, J. Stokes, E. Eaton, Learning shared knowledge for deep lifelong learning using deconvolutional networks, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 2837–2844.
  • (34) M. C. Hughes, E. Sudderth, Memoized online variational inference for dirichlet process mixture models, in: Advances in Neural Information Processing Systems, 2013, pp. 1133–1141.
  • (35) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • (36) D. D. Lewis, Y. Yang, T. G. Rose, F. Li, Rcv1: A new benchmark collection for text categorization research, Journal of machine learning research 5 (Apr) (2004) 361–397.
  • (37) A. Coates, A. Ng, H. Lee, An analysis of single-layer networks in unsupervised feature learning, in: Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 215–223.
  • (38) A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images.
  • (39) A. Rosenberg, J. Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), 2007.
  • (40) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • (41) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

Appendix A Derivation of

Here we provide the derivation of described in Section 5.1. The notations are defined in Table 7.

Recall that in Section 5.1, the ELBO is


only involves the first three terms of equation (11) that make a contribution to optimize , and since we adopt the alternating optimization technique.

We assume our variational distribution takes the form of


where we denote , which is a neural network, is the number of mixture components in the DP of the variational distribution, , , and is a Multinomial distribution. Under the assumptions in equation (12), we derive each of the first three terms of equation (11) to obtain .

Notations in the ELBO
: the total number of observations.
: the number of Monte Carlo samples in Stochastic
Gradient Variational Bayes (SGVB).
: the th observation.
: cluster membership for the th observation.
: the scalar precision in NW distribution.
: the posterior mean of cluster .
: the th posterior degrees of freedom of NW.
: variational parameters of the th NW components.
: variational parameters of the Beta distribution for the th component in equation (12).
: variational parameters of the NW distribution for .
: the variational parameters of a categorical distribution
for the cluster membership for each observation.
Table 7: Notations in the ELBO.

(1) :
We assume in the generative model that and is parameterized by a neural network and . Using the reparameterization trick (kingma2014auto) and the Monte Carlo estimate of expectations, we have



Recall that , where and where is a neural network. Following kingma2014auto, we use the reparameterization and sampling trick to allow backpropagation: for

where is the number of Monte Carlo samples, we have

Define We have