1 Introduction
With exposure to a continuous stream of information, human beings are able to learn and discover novel clusters continually by incorporating past knowledge; traditional machine learning algorithms, however, mainly focus on static data distributions. Training a model with new information often interferes with previously learned knowledge, which typically compromises performance on previous datasets (mccloskey1989catastrophic). To empower algorithms with the ability to adapt to emerging data while preserving performance on seen data, a new machine learning paradigm called lifelong learning (LL) has recently gained attention. LL, also known as continual learning, was first proposed in thrun1995lifelong. It provides a paradigm to exploit past knowledge and learn continually by transferring previously learned knowledge to solve similar but new problems in a dynamic environment, without performance degradation on old tasks. LL is still an emerging field and most existing research (thrun1996learning; ruvolo2013ella; chen2015lifelong)
has focused on supervised lifelong learning (SLL), where the boundaries between different tasks are known and each task is a supervised learning problem with output labels provided. The term task can have different meanings in different contexts. For example, different tasks can represent different subsets of classes or labels within the same supervised learning problem (sarwar2019incremental), or supervised learning problems in different fields, where researchers aim to perform continual learning across different domains (hou2018lifelong) or lifelong transfer learning (ruvolo2013active; isele2016using). While most research on LL has focused on resolving challenges in SLL problems with class labels provided, we instead consider unsupervised lifelong learning (ULL) problems, where the learning system interacts with a nonstationary stream of unlabelled data and the cluster labels are unknown. One objective of ULL is to discover new clusters by interacting with the environment dynamically and adapting to changes in the unlabelled data without external supervision or knowledge. To avoid confusion, it is worth pointing out that our work assumes the nonstationary stream of unlabelled data comes from a single domain; we aim to develop a single dynamic model that performs well on all sequential data at the end of each training stage without forgetting previous knowledge. This setting serves as a reasonable starting point for ULL. We leave the challenges of ULL across different problem domains for future work.
To retain knowledge from past data and learn new clusters from new data continually, good representations of the raw data make it easier to extract information and, in turn, support effective learning. It is therefore computationally appealing to discover new clusters in a low-dimensional latent space rather than in the complex original data space while performing representation learning. To achieve this, we propose DBULL (Deep Bayesian Unsupervised Lifelong Learning), a flexible probabilistic generative model that can adapt to new data and expand with new clusters while seamlessly learning deep representations in a Bayesian framework.
A critical objective of LL is to achieve consistently good performance incrementally as new data arrive in a streaming fashion, without performance degradation on previous data, even if those data may have been completely overwritten. Compared with a traditional batch learning setting, an LL setting poses additional challenges. One important question is how to design a knowledge preservation scheme that efficiently maintains previously learned information. To discover new clusters automatically from streaming, unlabelled data, another challenge is how to design a dynamic model that can expand with incoming data to perform unsupervised learning. The last challenge is how to design an end-to-end inference algorithm that achieves good performance incrementally. To answer these questions, we make the following contributions:

To solve the challenges in ULL, we provide a fully Bayesian formulation that performs representation learning, clustering and automatic new cluster discovery simultaneously via our novel end-to-end variational inference strategy, DBULL.

To efficiently extract and maintain knowledge from earlier data, our incremental inference strategy is, to our knowledge, the first to use sufficient statistics in the latent space in an LL context.

To discover new clusters in emerging data, we choose a nonparametric Bayesian prior that allows the model to grow dynamically. We develop a sequential Bayesian inference strategy that performs representation learning simultaneously with our proposed cluster expansion and redundancy removal (CERR) trick to discover new clusters on the fly without imposing bounds on the number of clusters, unlike most existing algorithms, which use a truncated Dirichlet process (DP) (blei2006variational).

To show the effectiveness of DBULL, we conduct experiments on image and text benchmarks. DBULL achieves superior performance compared with state-of-the-art methods in both ULL and classical batch settings.
2 Related Work
2.1 Alleviating Catastrophic Forgetting in Lifelong Learning
Research on LL aims to learn knowledge in a continual fashion without performance degradation on previous tasks when training on new data. parisi2019continual provide a comprehensive review of LL with neural networks. The main challenge of LL using deep neural networks (DNNs) is that they often suffer from a phenomenon called catastrophic forgetting or catastrophic interference, where a model's performance on previous tasks may decrease abruptly due to the interference of training with new information (mccloskey1989catastrophic; mcclelland1995there). Recent work aims to adapt a learned model to new information while ensuring that performance on previous data does not decrease. Currently, there is no universal past-knowledge preservation scheme across different algorithms and settings in LL (chen2018lifelong). Regularization methods aim to reduce interference from new learning by minimizing changes to parameters that are important to previous learning tasks (kirkpatrick2017overcoming; zenke2017continual). Alternative approaches based on rehearsal have also been proposed to alleviate catastrophic forgetting while training DNNs sequentially. Rehearsal methods use past data (robins1995catastrophic; xu2018lifelong), coreset data summarization (nguyen2017variational) or a generative model (shin2017continual) to capture the distribution of previously seen data.
However, most existing LL methods focus on supervised learning tasks (kirkpatrick2017overcoming; nguyen2017variational; hou2018lifelong; shin2017continual), where each learning task performs supervised learning with output labels provided. In comparison, we propose a novel alternative knowledge preservation scheme in an unsupervised learning context via sufficient statistics. This is in contrast to existing work, which uses previous model parameters (chen2015lifelong; shu2016lifelong), representative items extracted from previous models (shu2017lifelong), past raw data, or coreset data summarization (robins1995catastrophic; xu2018lifelong; nguyen2017variational) as previous knowledge for new tasks. Our use of sufficient statistics is novel and has the advantage of preserving past knowledge without storing previous data, while allowing incremental updates as new data arrive by exploiting the additive property of sufficient statistics.
2.2 Comparable Methods in Unsupervised Lifelong Learning
Recently, rao2019continual proposed CURL to address a fully ULL setting with unknown cluster labels. We developed our idea independently and in parallel with CURL, but in a fully Bayesian framework. CURL is the most related and comparable method to ours in the literature. CURL focuses on learning representations and discovering new clusters using a threshold method. One major drawback of CURL is that it suffers from over-clustering, as shown in their real-data experiment. We also show this empirically and demonstrate the improvement of our method over CURL in our experiment section. In contrast to CURL, we provide a novel probabilistic framework with a nonparametric Bayesian prior that allows the model to expand without bound automatically, instead of using an ad hoc threshold method as in CURL. We develop a novel end-to-end variational inference strategy for simultaneously learning deep representations and detecting novel clusters in ULL.
2.3 Bayesian Lifelong Learning
A Bayesian formulation is a natural choice for LL since it provides a systematic way to incorporate previously learned information into the prior distribution and obtain a posterior distribution that combines both prior belief and new information. The sequential nature of Bayes' theorem also paves the way to recursively update an approximation of the posterior distribution and then use it as a new prior to guide the learning for new data in LL.
In nguyen2017variational, the authors propose a Bayesian formulation for LL. Although both are Bayesian, our work differs from nguyen2017variational in its objectives, inference strategies and knowledge preservation techniques. nguyen2017variational provides a variational online inference framework for deep discriminative and deep generative models, studying the approximate posterior distribution of DNN parameters in a continual fashion. However, their method does not have the capacity to uncover the latent clustering structure of the data or to detect new clusters in emerging data. In contrast, we develop a novel Bayesian framework for representation learning and for discovering latent clustering structure and new clusters on the fly, together with a novel end-to-end variational inference strategy in a ULL context.
2.4 Deep Generative Unsupervised Learning Methods in a Batch Setting
Recent research has focused on combining deep generative models to learn good representations of the original data and to conduct clustering analysis in an unsupervised learning context (kingma2014auto; johnson2016composing; xie2016unsupervised; jiang2017variational; goyal2017nonparametric). However, these methods are designed for an independent and identically distributed (i.i.d.) batch training mode rather than an LL context. The majority of them assume a static unsupervised learning setting, where the number of clusters is fixed in advance. Thus, these methods cannot detect potential new clusters when new data arrive or the data distribution changes, and they cannot adapt to an LL setting. To summarize, our work fills the gap by providing a fully Bayesian framework for ULL, which has the unique capacity to use a deep generative model for representation learning while performing new cluster discovery on the fly with a nonparametric Bayesian prior and our proposed CERR technique. To alleviate the catastrophic forgetting challenge in LL, we propose to use sufficient statistics to maintain knowledge as a novel alternative to existing methods. We further develop an end-to-end Bayesian inference strategy, DBULL, to achieve our goal.
3 Model
3.1 Problem Formulation
In our ull setting, a sequence of datasets , arrive in a streaming order. When a new dataset arrives in memory, the previous dataset is no longer available. Our goal is to automatically learn the clusters (unlabeled classes) in each dataset.
Let x = {x_i}, i = 1, ..., N, represent the unlabeled observations of the current dataset in memory, where each x_i lies in a potentially high-dimensional data space. We assume that a low-dimensional latent representation z_i can be learned from x_i and in turn can be used to reconstruct x_i. We assume that the variation among observations can be captured by their latent representations z. Accordingly, we let c_i represent the unknown cluster membership of z_i for observation i. We aim to find: (1) a good low-dimensional latent representation z to efficiently extract knowledge from the original data; (2) the clustering structure within the new dataset, with the capacity to discover potentially novel clusters without forgetting the previously learned clusters of the seen datasets; and (3) an incremental learning strategy that optimizes the cluster learning performance for a new dataset without dramatically degrading the clustering performance on seen datasets.
3.2 Generative Process of DBULL
The generative process for DBULL is as follows.

(a) Draw a latent cluster membership c_i ~ Cat(pi), where the vector pi comes from the stick-breaking construction of a Dirichlet process (DP).

(b) Draw a latent representation vector z_i ~ N(mu_{c_i}, Sigma_{c_i}), where c_i is the cluster membership sampled from (a).

(c) Generate data x_i from p(x_i | f(z_i; theta)) in the original data space.

In (a), Cat(pi) is the categorical distribution parameterized by pi, where we denote the kth element of pi as pi_k, which is the probability for cluster k. The value of pi_k depends on a vector of scalars v coming from the stick-breaking construction of a DP (sethuraman1982convergence), and we describe an iterative process to draw pi in Section 4.2. CURL uses a latent mixture of Gaussian components to capture the clustering structure in an unsupervised learning context. In comparison, we adopt the DP mixture model in the latent space, with the advantage that the number of mixture components can be random and grow without bound as new data arrive, an appealing property for LL. We further explain in Section 4.2 why the DP is an appropriate prior for our problem. In (b), z_i is considered a low-dimensional latent representation of the original data x_i. We describe in Section 4.2 that a DP Gaussian mixture is used to model z, since it is often assumed that the variation in z reflects the variation within x. The current representation in (b) is for ease of understanding. In (c), we assume that the generative model is parameterized by f and theta, where f is chosen to be a DNN due to its powerful function approximation and feature learning capabilities (hornik1991approximation; kingma2014semi; nalisnick2016approximate). Under this generative process, the joint probability density function can be factorized as
p(x, z, c, pi, beta) = p_theta(x | z) p(z | c, beta) p(c | pi) p(pi) p(beta),   (1)

where beta = {beta_k} and beta_k = (mu_k, Sigma_k) represents the parameters of the kth mixture component (or cluster), and p(pi) and p(beta) represent the prior distributions for pi and beta.
Next, we discuss how to choose appropriate priors to endow our model with the flexibility to grow the number of mixture components without bound as new data arrive in an LL setting.
4 Why Bayesian for DBULL
In this section, we illustrate why a Bayesian framework is a natural choice for our ULL setting. Recall that we have a sequence of datasets from a single domain arriving in a streaming order. To mimic an LL setting, we assume that only one dataset at a time can fit in memory. One key question in LL is how to efficiently maintain past knowledge to guide future learning.
4.1 Bayesian Reasoning for Lifelong Learning
A Bayesian framework is a suitable solution to this type of learning since it learns a posterior distribution, or an approximation of a posterior distribution, that takes advantage of both the prior belief and the additional information in the new dataset. The sequential nature of Bayes' theorem ensures valid recursive updates on an approximation of the posterior distribution given the observations. The approximation of the posterior distribution then serves as the new prior to guide future learning for new data in LL. Before describing our inference strategy, we first explain why utilizing the Bayesian updating rule is valid for our problem.
Given datasets D_1, ..., D_T, where t = 1, ..., T, the posterior over the model parameters Theta after observing the tth dataset is

p(Theta | D_1, ..., D_t) ∝ p(D_t | Theta) p(Theta | D_1, ..., D_{t-1}),   (2)

which reflects that the posterior of tasks and datasets 1, ..., t-1 can be considered as the prior for the next task and dataset. If we knew the normalizing constant p(D_t | D_1, ..., D_{t-1}) exactly, repeatedly updating (2) would be a streaming procedure with no need to reuse past data. However, it is often intractable to compute the normalizing constant exactly. Thus, an approximation of the posterior distribution is necessary to update (2), since the exact posterior is infeasible to obtain.
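As a toy illustration of this recursive updating, consider a conjugate Beta-Bernoulli model, where the normalizing constant is tractable and the streamed posterior provably matches the batch posterior. The model and the data streams here are illustrative and are not part of DBULL; they only demonstrate the posterior-becomes-prior recursion of equation (2).

```python
# Beta(a, b) prior on a coin's bias; datasets arrive as streams of 0/1 outcomes.
def update(a, b, dataset):
    """Posterior after one dataset; it becomes the prior for the next dataset."""
    heads = sum(dataset)
    return a + heads, b + len(dataset) - heads

a, b = 1, 1                       # Beta(1, 1) prior
streams = [[1, 0, 1], [1, 1], [0, 0, 1]]
for D in streams:
    a, b = update(a, b, D)        # recursive update; past data never revisited

# The streamed posterior matches the batch posterior over all data pooled at once.
a_batch, b_batch = update(1, 1, [x for D in streams for x in D])
assert (a, b) == (a_batch, b_batch)
```

In DBULL the exact recursion is intractable, so an approximate posterior plays the role of the Beta distribution above.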
4.2 Dirichlet Process Prior
The DP is often used as a nonparametric prior for partitioning exchangeable observations into discrete clusters. The DP mixture is a flexible mixture model in which the number of mixture components can be random and grow without bound as more data arrive. These properties make it a natural choice for our LL setting. In practice, we show in our inference how we expand and merge mixture components as new data arrive, starting from only one cluster, in Section 5.6. Next, we briefly review the DP and introduce our DP Gaussian mixture model to derive the joint probability density defined in (1).
A DP is characterized by a base distribution G_0 and a concentration parameter alpha. A constructive definition of the DP via a stick-breaking process is of the form G = sum_{k=1}^inf pi_k delta_{beta_k}, where G is a discrete measure concentrated at the atoms beta_k, which are random samples from the base distribution, with mixing proportions pi_k (ishwaran2001gibbs). In a DP, the pi_k's are random weights independent of G_0 but satisfy pi_k >= 0 and sum_k pi_k = 1. The weights can be drawn through an iterative process:

v_k ~ Beta(1, alpha),   pi_k = v_k prod_{j=1}^{k-1} (1 - v_j),

where k = 1, 2, ...
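The iterative process above can be sketched directly. This is a minimal sketch of drawing the first T stick-breaking weights; the truncation level T and the value of alpha below are illustrative choices, not part of our model specification.

```python
import numpy as np

def stick_breaking(alpha, T, rng):
    """Draw the first T stick-breaking weights of a DP with concentration alpha."""
    v = rng.beta(1.0, alpha, size=T)                          # v_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                                      # pi_k = v_k * prod_{j<k} (1 - v_j)

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, T=50, rng=rng)
# Weights are nonnegative and sum to at most 1; smaller alpha concentrates
# mass on the first few components.
```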
Under the generative process of DBULL in Section 3.2, these pi_k's represent the probabilities of each cluster (mixture component) used in step (a), and the beta_k = (mu_k, Sigma_k) can be seen as the parameters of the Gaussian mixture for z in step (b). Thus, given our generative process, the corresponding joint probability density for our model is

p(x, z, c, pi, beta) = p_theta(x | z) N(z | mu_c, Sigma_c) Cat(c | pi) p(pi | alpha) p(beta | G_0).   (3)

For a Gaussian mixture model, the base distribution G_0 is often chosen as the Normal-Wishart (NW) distribution to generate the mixture parameters beta_k = (mu_k, Sigma_k). The values of the NW hyperparameters are conventional choices from the Bayesian nonparametric literature for Gaussian mixtures. Moreover, the performance of our method is robust to the hyperparameter values.
5 Inference for DBULL
Developing an end-to-end inference algorithm for our problem under the ULL setting poses several new challenges compared with the batch setting: one has to deal with catastrophic forgetting, mechanisms for past knowledge preservation, and dynamic model expansion capacity for novel cluster discovery. For pedagogical reasons, we first describe our general parameter learning strategy via variational inference for DBULL in a standard batch setting. We then describe how we resolve the additional challenges in the lifelong (streaming) setting. We present the novel components of our inference algorithm: a new knowledge preservation scheme via sufficient statistics in Section 5.4 and an automatic cluster expansion and redundancy removal (CERR) strategy in Section 5.6. A summary of our algorithm in the LL setting is provided in Algorithm 1. Our implementation is available at https://github.com/KingSpencer/DBULL, with implementation details in Appendix C. We explain the contribution of the sufficient statistics to the probability density function of our problem and to knowledge preservation in Section 5.5.
5.1 Variational Inference and ELBO Derivation
In practice, it is often infeasible to obtain the exact posterior distribution since the normalizing constant of the posterior is intractable. Markov chain Monte Carlo (MCMC) methods provide a systematic way to sample from the posterior distribution but are often slow in high-dimensional parameter spaces, so effective alternatives are needed. Variational inference is a promising alternative, which approximates the posterior distribution by casting inference as an optimization problem. It seeks, over a class of tractable distributions, the surrogate distribution q most similar to the distribution of interest, i.e., the one minimizing the Kullback-Leibler (KL) divergence to the exact posterior. Minimizing the KL divergence between the variational posterior q and the exact posterior in our setting is equivalent to maximizing the evidence lower bound (ELBO), where q is the variational posterior distribution used to approximate the true posterior. To make the core idea easier to follow, we provide a high-level explanation of variational inference here; the mathematical details can be found in Appendix A.
Given the generative process in Section 3.2 and using Jensen's inequality,

log p(x) >= E_q[ log p(x, z, c, pi, beta) - log q(z, c, pi, beta | x) ] = L_ELBO.   (4)

For simplicity, we assume that the variational posterior factorizes over its components. Thus, the ELBO is

L_ELBO = E_q[log p_theta(x | z)] + E_q[log p(z | c, beta)] - E_q[log q_phi(z | x)] + E_q[log p(c | pi)] - E_q[log q(c)] - KL(q(pi) || p(pi)) - KL(q(beta) || p(beta)).   (5)
We assume our variational distribution takes the form

q(z, c, pi, beta | x) = q_phi(z | x) q(c) prod_{k=1}^{T} q(pi_k) q(beta_k),   (6)

where q_phi(z | x) denotes a Gaussian whose mean and variance are produced by a neural network (the encoder) parameterized by phi, T is the number of mixture components in the DP of the variational distribution, and q(c) is a multinomial distribution. The notation used in equations (5), (6) and (7) is defined in Table 1. Our inference strategy starts with only one mixture component and uses the CERR technique described in Section 5.6 to either increase or merge the number of clusters.
5.2 General Parameter Learning Strategy
In equation (5), there are two main types of parameters to optimize. The first type comprises the parameters theta and phi of the neural networks. The other involves the latent cluster memberships c and the parameters of the DP Gaussian mixture model.
To perform joint inference for both types of parameters, we adopt an alternating optimization strategy. First, we update the neural network parameters (theta and phi) to learn the latent representation given the DP Gaussian mixture parameters. This is achieved by optimizing L(theta, phi), which involves only the first three terms of equation (5), the ones that contribute to optimizing theta and phi. Under our variational distribution assumptions in (6), by taking advantage of the reparameterization trick (kingma2014auto) and Monte Carlo estimates of the expectations, we obtain

L(theta, phi) ≈ (1/L) sum_{l=1}^{L} [ log p_theta(x | z^(l)) + E_{q(c)} log p(z^(l) | c, beta) - log q_phi(z^(l) | x) ],   (7)

where z^(l) = mu_phi(x) + sigma_phi(x) ⊙ eps^(l) with eps^(l) ~ N(0, I).
Table 1: Notations in the ELBO

theta: parameters in the decoder.
phi: parameters in the encoder.
N: the total number of observations.
K: the total number of clusters.
d: the dimension of the latent representation z.
x_ij: the jth dimension of the ith observation.
c_i: the cluster membership of the ith observation.
L: the number of Monte Carlo samples in Stochastic Gradient Variational Bayes (SGVB).
kappa_k: the posterior scalar precision in the NW distribution.
m_k: the posterior mean of cluster k.
nu_k: the kth posterior degrees of freedom of the NW.
The notations are summarized in Table 1, and the derivation details are provided in Appendix A. Then, we update the DP Gaussian mixture parameters and the cluster memberships given the current neural network parameters theta and phi and the latent representations z. This allows us to use the improved latent representations to infer latent cluster memberships, and the updated clustering in turn facilitates learning the latent knowledge representation. The update equations for the DP mixture model parameters can be found in blei2006variational. We describe the core idea of automatic CERR in our inference in Section 5.6, explaining how we start with only one cluster and achieve dynamic model expansion by creating new mixture components (clusters) given new data in LL.
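The alternating scheme can be illustrated with a deliberately simplified stand-in: a linear "autoencoder" updated by gradient steps on reconstruction error, alternated with hard-assignment updates of latent cluster means. Everything here is a toy substitute, not the DBULL networks, ELBO, or DP mixture updates; it only shows the alternation between network updates (mixture fixed) and mixture updates (network fixed).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated groups in 5-D.
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(4, 1, (100, 5))])

# Linear "encoder"/"decoder" standing in for the DNNs parameterized by phi, theta.
E = rng.normal(scale=0.1, size=(5, 2))
D = rng.normal(scale=0.1, size=(2, 5))
mu = rng.normal(size=(2, 2))       # latent cluster means (stand-in for the DP mixture)
lr, losses = 1e-3, []

for step in range(300):
    # (i) Gradient step on the network parameters, mixture held fixed.
    Z = X @ E
    R = Z @ D - X                  # reconstruction residual
    losses.append(float((R ** 2).mean()))
    E -= lr * (X.T @ (R @ D.T)) / len(X)
    D -= lr * (Z.T @ R) / len(X)

    # (ii) Update cluster means and hard memberships given the current latents.
    Z = X @ E
    c = ((Z[:, None, :] - mu[None]) ** 2).sum(-1).argmin(1)
    for k in range(2):
        if (c == k).any():
            mu[k] = Z[c == k].mean(0)
```

Each half-step improves its own objective while holding the other block of parameters fixed, mirroring the structure (though not the content) of our variational updates.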
Our general parameter learning strategy via variational inference may seem straightforward for a batch setting at first glance. However, both the derivation and the implementation are nontrivial, especially when incorporating our new components into the end-to-end inference procedure to address the additional challenges of an LL setting. For illustration purposes, we describe the high-level core idea of our inference procedure. The main difficulty lies in adapting our inference algorithm from a batch setting to an LL setting, which requires us to overcome catastrophic forgetting, maintain past knowledge, and develop a dynamic model with automatic cluster discovery and redundancy removal capacity. Next, we describe our novel solutions.
5.3 Our Ingredients for Alleviating Catastrophic Forgetting
Catastrophic forgetting or catastrophic interference is a dramatic issue for DNNs, as witnessed in SLL (kirkpatrick2017overcoming; shin2017continual). In our ULL setting, the issue is even more challenging since there are more sources than in SLL that may lead to abrupt drops in model performance due to the interference of training with new data. The first source is the same as in SLL: the DNNs forget previously learned information upon learning new information. Additionally, in an unsupervised setting, the model cannot recover the learned cluster memberships and clustering-related parameters in the DP mixture model once the previous data are no longer available, or once the previously learned information in the DNNs has been wiped out by new learning, since the learned clustering structure depends on the latent representation of the raw data, which is in turn determined by the DNNs' parameters and the data distributions.
To resolve these issues, we develop our own novel solution via a combination of two ingredients: (1) generating and replaying a fixed, small number of samples based on our generative process in Section 3.2 given the current DNN and DP Gaussian mixture parameter estimates, a computationally effective byproduct of our algorithm; and (2) a novel hierarchical sufficient statistics knowledge preservation strategy to remember the clustering information in an unsupervised setting.
We choose to replay a number of generated samples to preserve the previous data distribution, instead of using a subset of past real data, since storing past data may require large memory, and such data storage and replay may not be feasible in real big-data applications. More details on replaying deep generative samples rather than real data in LL are discussed in shin2017continual. Moreover, our use of sufficient statistics is novel and allows incremental updates of the clustering information as new data arrive, without access to previous data, because of the additive property of sufficient statistics. We introduce this strategy in the next section.
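Ingredient (1), generative replay, can be sketched as follows. The mixture parameters and the linear "decoder" below are illustrative stand-ins for the current DP Gaussian mixture and DNN estimates; only the mechanism (sample from the current generative model, then mix the pseudo-samples into each new training batch) reflects our procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the current model estimates (illustrative values, not learned).
pi = np.array([0.6, 0.4])                  # current mixing proportions
mu = np.array([[0.0, 0.0], [5.0, 5.0]])    # current latent cluster means
W = rng.normal(size=(8, 2))                # stand-in decoder f(.; theta)

def replay(n):
    """Generate n pseudo-samples from the current generative model."""
    c = rng.choice(len(pi), size=n, p=pi)  # step (a): cluster memberships
    z = mu[c] + rng.normal(size=(n, 2))    # step (b): latent representations
    return z @ W.T                         # step (c): decoded pseudo-data

def make_training_batch(new_data, n_replay=32):
    """Mix a fixed, small number of replayed samples into each new batch."""
    return np.vstack([new_data, replay(n_replay)])

batch = make_training_batch(rng.normal(size=(64, 8)))
```

Because the pseudo-samples come from the model itself, no past raw data need to be stored.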
5.4 Sufficient Statistics for Knowledge Preservation
As LL is an emerging field, there is no well-accepted definition of knowledge or an appropriate representation scheme for efficiently maintaining past knowledge from seen data. Researchers have adopted prior distributions (nguyen2017variational) or model parameters (lee2019learning) to represent past knowledge in most SLL problems, where achieving high prediction accuracy incrementally is the main objective. However, there is no guidance on preserving past knowledge in an unsupervised learning setup.
We propose a novel knowledge preservation strategy in DBULL. In our problem, there are two types of knowledge to maintain. The first comprises the previously learned DNN parameters needed to encode the latent knowledge representation z of the raw data and to reconstruct the real data x from z. The other involves the DP Gaussian mixture parameters representing the characteristics and mixing proportions of the different clusters. Our knowledge representation scheme uses hierarchical sufficient statistics to preserve the information related to the DP Gaussian mixture, and we develop a sequential rule to update this knowledge.
Assume that we have encountered datasets D_1, ..., D_t and that only one dataset at a time can be in memory. While in memory, each dataset can be divided into mini-batches b = 1, ..., B. To define the sufficient statistics, we first define the global parameters of the DP Gaussian mixture as the probability pi_k of each mixture component (cluster) and the mixture parameters (mu_k, Sigma_k) of each cluster k. We define the local parameters as the cluster memberships c_i of the observations in memory. To remember the characteristics of all encountered data as well as the local information of the current dataset, we memorize three levels of sufficient statistics: the mini-batch sufficient statistics of the bth mini-batch of the current dataset, S_k^b = sum_{i in b} r_ik t(z_i), where t(z_i) is the sufficient statistic of a distribution within the exponential family (the Gaussian distribution is within the exponential family, and t(z_i) = (z_i, z_i z_i^T) in our case) and r_ik is the estimated probability that the ith observation in mini-batch b belongs to cluster k; the stream sufficient statistics S_k^stream of the current dataset; and the overall sufficient statistics S_k^all of all encountered datasets. To efficiently maintain and update our knowledge, our updating algorithm proceeds as follows: (1) subtract the old summary of each mini-batch and update the local parameters; (2) compute a new summary for each mini-batch; and (3) update the stream sufficient statistics for each cluster learned in the current dataset:
S_k^stream ← S_k^stream - S_k^b,   (8)
S_k^b ← sum_{i in b} r_ik t(z_i),   (9)
S_k^stream ← S_k^stream + S_k^b.   (10)
For the dataset in the learning phase, we repeat this updating process for multiple iterations to refine training while learning the DP Gaussian mixture parameters and the cluster memberships. Finally, we update the overall sufficient statistics by adding the stream sufficient statistics of the current dataset. The correctness of the algorithm is guaranteed by the additive property of sufficient statistics.
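The three-step update and its additivity can be sketched as follows. The latents and responsibilities are random stand-ins rather than outputs of our encoder and mixture; the final check confirms that the incrementally maintained stream statistics equal the statistics computed from all data at once, which is the property our knowledge preservation scheme relies on.

```python
import numpy as np

def summarize(Z, R):
    """Per-cluster statistics (soft counts, sum r*z, sum r*z*z^T) of a mini-batch."""
    N = R.sum(0)                                # soft counts per cluster
    s1 = R.T @ Z                                # first moments
    s2 = np.einsum('ik,ij,il->kjl', R, Z, Z)    # second moments
    return N, s1, s2

rng = np.random.default_rng(0)
K, d = 3, 2
Z = rng.normal(size=(500, d))                   # stand-in latent representations
R = rng.dirichlet(np.ones(K), size=500)         # stand-in responsibilities r_ik

stream = [np.zeros(K), np.zeros((K, d)), np.zeros((K, d, d))]
batches = np.array_split(np.arange(500), 5)
cache = {}
for it in range(2):                             # multiple refinement passes
    for b, idx in enumerate(batches):
        if b in cache:                          # (8) subtract the stale batch summary
            stream = [s - c for s, c in zip(stream, cache[b])]
        cache[b] = summarize(Z[idx], R[idx])    # (9) recompute the batch summary
        stream = [s + c for s, c in zip(stream, cache[b])]   # (10) add it back

# Additivity: stream statistics equal the statistics of all data at once.
full = summarize(Z, R)
for s, f in zip(stream, full):
    assert np.allclose(s, f)
```

In DBULL the same subtract/recompute/add pattern lets clustering knowledge be refined over passes and carried across datasets without revisiting raw data.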
5.5 Contribution of Sufficient Statistics to Alleviate Forgetting
The sufficient statistics alleviate forgetting by preserving data characteristics and allowing sequential updates in LL without saving real data. To be precise, the sufficient statistics allow us to update the log-likelihood and the ELBO sequentially, since both are linear functions of the expectation of the sufficient statistics. Given the expected sufficient statistics, we can evaluate the first two terms of L(theta, phi) in equation (7) and the corresponding factors in the joint probability density function of our model in equation (3). Next, we provide mathematical derivations to illustrate this.
Define the sufficient statistics of all data as S = (S_1, ..., S_K), where K is the number of clusters. Define S_k = sum_{i=1}^{N} r_ik t(z_i), where t(z_i) = (z_i, z_i z_i^T), r_ik denotes the probability that the ith observation belongs to cluster k, and N is the total number of observations. Given the sufficient statistics and the current mixture parameters mu_k and Sigma_k, we can evaluate the Gaussian mixture factors of the joint probability density function in equation (3) without storing the latent representation z_i of every observation. Similarly, we can evaluate the first two terms of L(theta, phi) in equation (7), with notation as defined in Table 1.
5.6 Cluster Expansion and Redundancy Removal Strategy
Our model starts with one cluster and, as new data with different characteristics arrive, we expect the model either to dynamically grow the number of clusters or to merge clusters with similar characteristics. To achieve this, we perform birth and merge moves in a similar fashion to the nonparametric Bayesian literature (hughes2013memoized) to allow automatic CERR. However, our work differs from hughes2013memoized in that our merge moves have extra constraints. To avoid losing information about clusters learned earlier, we only allow merge moves between two novel clusters from the birth move, or between one existing cluster and a newborn cluster; two previously existing clusters cannot be merged. hughes2013memoized is designed for batch learning and thus does not require this constraint, but in LL it is important for avoiding information loss.
It is challenging to give birth to new clusters with streaming data since the number of observations may not be sufficient to inform good proposals. To resolve this issue, we follow hughes2013memoized by collecting a subsample of data for each learned cluster k: we cache a sample if the probability of that observation being assigned to cluster k is greater than a threshold of 0.1, the value suggested by hughes2013memoized. In this paper, we try to choose commonly used parameter values from the literature and avoid dataset-specific tuning as much as possible. We fit the DP Gaussian mixture to the cached samples starting with one cluster and expand the model with 10 novel clusters. However, adopting only birth moves may over-cluster the observations. After the birth move, we therefore merge clusters by (1) selecting candidate clusters to merge and (2) merging two selected clusters if the ELBO improves. Candidate clusters are selected if the marginal likelihood of the merged cluster is greater than the marginal likelihood of keeping the two clusters separate.
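The constrained merge logic can be sketched as follows. The cluster contents and the score function are toy stand-ins for the cached statistics and the ELBO/marginal-likelihood criterion; what the sketch does reflect is the constraint that two clusters which both pre-date the current birth move are never merged.

```python
import numpy as np

def merge_candidates(existing, newborn):
    """Allowed merge pairs: newborn-newborn or existing-newborn, never
    two clusters that both pre-date the current birth move."""
    pairs = [(i, j) for i in newborn for j in newborn if i < j]
    pairs += [(e, n) for e in existing for n in newborn]
    return pairs

def score(values, cost=1.0):
    """Toy stand-in for a cluster's objective contribution: fit minus a
    per-cluster complexity cost (so merging similar clusters pays off)."""
    return -np.var(values) * len(values) - cost

def merge_pass(clusters, existing, newborn):
    """Merge an allowed pair whenever the objective improves."""
    for i, j in merge_candidates(existing, newborn):
        if i in clusters and j in clusters:
            merged = clusters[i] + clusters[j]      # pooled contents of the pair
            if score(merged) > score(clusters[i]) + score(clusters[j]):
                clusters[i] = merged
                del clusters[j]
    return clusters

# Clusters 0, 1 pre-exist; clusters 2, 3 were just born. Cluster 2 is
# redundant with cluster 0 and gets absorbed; 0 and 1 are never compared.
clusters = {0: [0.0, 0.1], 1: [9.9, 10.0], 2: [0.05, -0.05], 3: [20.0]}
out = merge_pass(clusters, existing=[0, 1], newborn=[2, 3])
```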
6 Experiments
Datasets. We adopt the most common text and image benchmark datasets in LL to evaluate the performance of our method. The MNIST database of 70,000 handwritten digit images (lecun1998gradient, ) is widely used to evaluate deep generative models (kingma2014auto, ; xie2016unsupervised, ; johnson2016composing, ; goyal2017nonparametric, ; jiang2017variational, ) for representation learning, as well as LL models in both supervised (kirkpatrick2017overcoming, ; nguyen2017variational, ; shin2017continual, ) and unsupervised (rao2019continual, ) learning contexts. To provide a fair comparison with state-of-the-art competing methods and easy interpretation, we mainly use MNIST to evaluate the performance of our method and interpret our results with intuitive visualization patterns. To examine our method on more complex datasets, we use the text database Reuters10k (lewis2004rcv1, ) and the image database STL10 (coates2011analysis, ). STL10 is at least as hard as the well-known image database CIFAR10 (krizhevsky2009learning, ) since STL10 has fewer labeled training examples within each class. Summary statistics for the datasets are provided in Table 2.

Dataset      # Samples  Dimension  # Classes
MNIST        70000      784        10
Reuters10k   10000      2000       4
STL10        13000      2048       10
We adopt the same neural network architecture as in jiang2017variational . All tuning-parameter values and implementation details of the deep neural networks are provided in Appendix C. Our implementation is publicly available at https://github.com/KingSpencer/DBULL.
Competing Methods in Unsupervised Lifelong Learning. CURL is currently the only unsupervised lifelong learning method with both representation learning and new-cluster discovery capacity, which makes it the latest, most related, and most comparable method to ours. We use CURLD to denote CURL without the true number of clusters provided, detecting it Dynamically from the unlabelled streaming data.
Competing Methods in a Classic Batch Setting. Although DBULL is designed for LL, to show its generality in a batch training mode we compare it against recent deep (generative) methods with representation learning and clustering capacity designed for batch settings, including DEC (xie2016unsupervised, ), VaDE (jiang2017variational, ), CURLF (rao2019continual, ) and VAE+DP. CURLF denotes CURL with the true number of clusters provided as a Fixed value. VAE+DP first fits a VAE kingma2014auto to learn latent representations and then uses a DP mixture to learn a clustering, in two separate steps. We list the capabilities of the different methods in Table 3.
                         Lifelong Learning   Batch Setting
                         DBULL   CURLD       DEC   VaDE   VAE+DP   CURLF
Representation Learning  yes     yes         yes   yes    yes      yes
Learns # of Clusters     yes     yes         no    no     yes      no
Dynamic Expansion        yes     yes         no    no     yes      no
Overcome Forgetting      yes     yes         no    no     no       yes
Evaluation Metrics. One of the main objectives of our method is to discover new clusters from streaming nonstationary data, so our method should achieve high clustering quality in both the ULL and batch settings. We adopt the clustering quality metrics Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), Homogeneity Score (HS), Completeness Score (CS) and V-measure Score (VM). These are all normalized metrics ranging from zero to one, with larger values indicating better clustering quality; NMI, ARI and VM equal to one represent a clustering identical to the ground truth. CS is symmetric to HS. Detailed definitions of these metrics can be found in rosenberg2007v .
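As a reference for how HS, CS and VM relate, here is a minimal NumPy sketch of the entropy-based definitions from rosenberg2007v ; the helper names are ours, and a perfect clustering (any relabelling of the truth) scores 1.0.

```python
import numpy as np

def _entropy(labels):
    """Shannon entropy (nats) of a discrete label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def _cond_entropy(a, b):
    """H(a | b): entropy of labels `a` within each cluster of `b`."""
    return sum((b == v).mean() * _entropy(a[b == v]) for v in np.unique(b))

def homogeneity(true, pred):
    h_c = _entropy(true)
    return 1.0 if h_c == 0 else 1.0 - _cond_entropy(true, pred) / h_c

def completeness(true, pred):
    # CS is symmetric to HS: swap the roles of true classes and clusters.
    return homogeneity(pred, true)

def v_measure(true, pred):
    h, c = homogeneity(true, pred), completeness(true, pred)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)

true = np.array([0, 0, 1, 1, 2, 2])
perfect = np.array([5, 5, 7, 7, 9, 9])   # a relabelling of the truth
print(v_measure(true, perfect))          # 1.0: perfect clustering
```

Putting every sample into one cluster is perfectly complete but has zero homogeneity, so its V-measure is zero, which is why a single scalar like VM is a convenient summary.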
6.1 Lifelong Learning Performance Comparison
Experiment Objective. LL methods should adapt a learned model to new data while retaining the information learned earlier. The objective of this experiment is to demonstrate that DBULL has this capacity and is effective compared with state-of-the-art LL methods: there should be no dramatic performance decrease on past data even after the model has been updated with new information.
Experiment Setup. To evaluate the performance of DBULL, we adopt the most common experiment setup in LL, called Split MNIST, which uses images from MNIST (zenke2017continual, ; nguyen2017variational, ). We divide MNIST into 5 disjoint subsets, each containing 10,000 random samples of two digit classes in the order of digits 0-1, 2-3, 4-5, 6-7 and 8-9, denoted as D1, D2, D3, D4, and D5. Each dataset is divided into 20 subsets that arrive in sequential order to mimic an LL setting. We denote by D1:t all data from D1 to Dt, where t = 1, ..., 5.
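The Split MNIST stream above can be sketched as follows; the helper and the toy labels are hypothetical, with only the digit pairs, the 10,000-sample task size, and the 20 sequential subsets taken from the setup.

```python
import numpy as np

def split_tasks(labels, pairs=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9)),
                n_per_task=10_000, n_chunks=20, seed=0):
    """Split a label array into 5 two-digit tasks, each delivered as
    20 sequential chunks of sample indices, mimicking the Split MNIST
    stream. Sizes follow the paper; the helper itself is illustrative."""
    rng = np.random.default_rng(seed)
    tasks = []
    for a, b in pairs:
        idx = np.flatnonzero((labels == a) | (labels == b))
        idx = rng.permutation(idx)[:n_per_task]
        tasks.append(np.array_split(idx, n_chunks))
    return tasks

# Toy labels standing in for the real MNIST targets.
labels = np.random.default_rng(1).integers(0, 10, size=60_000)
tasks = split_tasks(labels)
```

A model would consume `tasks[0]` chunk by chunk, then `tasks[1]`, and so on, never revisiting earlier chunks.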
Discussion on Performance. To check whether our method suffers dramatic performance loss due to catastrophic forgetting, we sequentially train DBULL and its LL competitor CURLD on D1 through D5. We define Tt as training on Dt, where t = 1, ..., 5. After training each task Tt, we measure performance on the datasets seen so far: after training T1 we evaluate on D1, after training T2 we evaluate on D1 and D2, and so on. We report the LL clustering quality for each task after sequentially training the five tasks in Fig. 3 and Fig. 4.
Fig. 3 and Fig. 4 show that DBULL handles catastrophic forgetting better than CURLD: DBULL exhibits a slightly smaller performance drop than CURLD on previous tasks in almost all scenarios and in terms of nearly all clustering metrics.
Fig. 5 shows that DBULL has advantages over CURLD in handling overclustering. Since each task contains two digits, the true number of clusters seen after sequentially training each task is 2, 4, 6, 8, 10. The number of clusters automatically detected by DBULL after training each task is 4, 6, 8, 10, 12: DBULL splits digit 1 into three clusters with different handwritten patterns, while it discovers each other new digit as exactly one cluster, matching the ground truth. In contrast, CURLD clustered digits 0-1 into 14-16 clusters and obtained 23-25 clusters for the 10 digits after sequentially training all five tasks. We provide a visualization of the reconstructed cluster means from the DP mixture model via the trained decoder of DBULL in Fig. 6.
Besides the overall clustering quality, we also report the per-digit precision and recall of DBULL after sequentially training all tasks. CURLD overclusters the 10 digits into 25 clusters, making per-digit precision and recall hard to report. For visualization, the three subclusters of digit 1 found by DBULL have been merged into one cluster. Overall, there is no significant performance loss on previous tasks after sequentially training multiple tasks for digits 0, 1, 3, 4, 6, 8 and 9. Digit 2 experiences a precision decrease after training on digits 6 and 7, since DBULL has trouble differentiating some samples of digits 2 and 7.
6.2 Batch Setting Clustering Performance Comparison
Experiment Objective. The goal of this experiment is to demonstrate the generality of our LL method DBULL, which can achieve performance comparable to competing methods in an unsupervised batch setting.
Experiment Setup. To examine the performance of our method in a batch setting, we test it on the more complex datasets Reuters10k, obtained from the original Reuters corpus (lewis2004rcv1, ), and the image dataset STL10 (coates2011analysis, ). We use the same Reuters10k and STL10 datasets as xie2016unsupervised ; jiang2017variational . Details of Reuters10k and STL10 are provided in Appendix B. For all datasets and all methods, we randomly select 80% of the samples for training and evaluate performance on the remaining 20%.
Discussion on Performance. The true number of clusters is provided in advance to the competing methods DEC, VaDE and CURLF, which require it. DBULL, CURLD and VAE+DP have less information, since they have no knowledge of the true number of clusters: all three start with one cluster and detect the number of clusters on the fly. Thus, if DBULL achieves performance similar to DEC, VaDE and CURLF and outperforms its LL counterpart CURLD, this demonstrates DBULL's effectiveness. Table 4 shows that DBULL performs best in NMI and VM for MNIST and in NMI, ARI and VM for STL10, and outperforms CURLD on MNIST and STL10. Moreover, DBULL and DEC are more stable in terms of all evaluation metrics, with smaller standard errors than the other methods. We also report the number of clusters found by DBULL and CURLD for MNIST, Reuters10k and STL10 over five replications in Table 5, which shows that DBULL handles overclustering better than CURLD. In summary, Tables 4 and 5 demonstrate DBULL's effectiveness in a batch setting. If we fix the number of clusters to the true value for VaDE, DEC and CURLF during training, the clustering accuracy can be interpreted as classification accuracy. However, we cannot fix the number of clusters in DBULL; as Table 5 shows, the number of clusters found by DBULL over five replications ranges from 11 to 15 on MNIST. To compute a clustering accuracy for DBULL, taking MNIST as an example, we count the number of correctly clustered samples from the 10 biggest clusters and divide by the total number of samples in the testing phase. We report the accuracy comparison in Table 6.

Dataset     Method  NMI            ARI

MNIST       DEC     84.67 (2.25)   83.67 (4.53)
            VaDE    80.35 (4.68)   74.06 (9.11)
            VAE+DP  81.70 (0.825)  70.49 (1.654)
            CURLF   69.76 (2.51)   56.47 (4.11)
            CURLD   63.51 (1.32)   36.84 (1.98)
            DBULL   85.72 (1.02)   83.53 (2.35)
Reuters10k  DEC     46.56 (5.36)   46.86 (7.98)
            VaDE    41.64 (4.73)   38.49 (5.44)
            VAE+DP  41.62 (2.99)   37.93 (4.57)
            CURLF   51.92 (3.22)   47.72 (4.00)
            CURLD   46.31 (1.83)   22.00 (3.60)
            DBULL   45.32 (1.79)   42.66 (5.73)
STL10       DEC     71.92 (2.66)   58.73 (5.09)
            VaDE    68.35 (3.85)   59.42 (6.84)
            VAE+DP  43.18 (1.41)   26.58 (1.32)
            CURLF   66.98 (3.38)   51.24 (4.06)
            CURLD   65.71 (1.33)   37.96 (4.69)
            DBULL   75.26 (0.53)   70.72 (0.81)
Dataset     Method  HS             VM
MNIST       DEC     84.67 (2.25)   84.67 (2.25)
            VaDE    79.86 (4.93)   80.36 (4.69)
            VAE+DP  91.27 (0.215)  81.19 (0.904)
            CURLF   68.60 (2.56)   69.75 (2.51)
            CURLD   76.35 (1.53)   62.45 (1.32)
            DBULL   89.34 (0.25)   85.65 (0.51)
Reuters10k  DEC     48.44 (5.44)   46.52 (5.36)
            VaDE    43.64 (4.88)   41.60 (4.73)
            VAE+DP  46.64 (3.85)   41.34 (2.94)
            CURLF   54.38 (3.49)   51.86 (3.21)
            CURLD   66.90 (2.09)   43.34 (2.00)
            DBULL   48.88 (1.86)   45.40 (2.04)
STL10       DEC     68.47 (3.48)   71.83 (2.72)
            VaDE    67.24 (4.23)   68.37 (3.92)
            VAE+DP  42.28 (1.03)   43.16 (1.39)
            CURLF   65.46 (3.27)   66.96 (3.37)
            CURLD   80.86 (2.94)   64.31 (1.24)
            DBULL   77.61 (1.29)   75.22 (0.52)
Datasets    True # of Clusters  DBULL  CURLD
MNIST       10                  11-15  34
Reuters10k  4                   5-10   40
STL10       10                  12-15  50
Datasets  DEC     VaDE    CURLF (best)  CURLF (average)  DBULL
MNIST     84.30%  94.46%  84%           79.38% (4.26%)   92.27%
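The clustering accuracy described above (counting correctly clustered samples among the 10 biggest clusters) can be sketched as follows; the helper name, the majority-label mapping, and the toy labels are our assumptions about the procedure.

```python
import numpy as np

def cluster_accuracy(true, pred, top_k=10):
    """Keep the `top_k` largest predicted clusters, map each to its
    majority true label, and count samples in a kept cluster that
    carry that cluster's majority label, divided by all samples."""
    ids, sizes = np.unique(pred, return_counts=True)
    kept = ids[np.argsort(sizes)[::-1][:top_k]]
    correct = 0
    for k in kept:
        _, counts = np.unique(true[pred == k], return_counts=True)
        correct += counts.max()  # samples matching the majority label
    return correct / len(true)

true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
pred = np.array([5, 5, 5, 6, 6, 7, 8, 8])  # digit 1 split into two clusters
print(cluster_accuracy(true, pred, top_k=3))  # 0.875
```

Samples falling outside the kept clusters (here the singleton cluster 7) simply count as errors, which mirrors dropping small spurious clusters.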
7 Conclusion
In this work, we introduce DBULL, our approach to unsupervised LL problems. DBULL is a novel end-to-end approximate Bayesian inference algorithm that performs automatic new-task discovery via our dynamic model expansion strategy, adapts to changes in the evolving data distribution, and overcomes forgetting through an information extraction mechanism based on summary sufficient statistics, while simultaneously learning the underlying representation. Experiments on MNIST, Reuters10k and STL10 demonstrate that DBULL achieves competitive performance compared with state-of-the-art methods in both the batch and unsupervised LL settings.
Acknowledgment
The work described was supported in part by Award Numbers U01 HL089856 from the National Heart, Lung, and Blood Institute and NIH/NCI R01 CA199673.
References
 (1) M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of learning and motivation, Vol. 24, Elsevier, 1989, pp. 109–165.
 (2) S. Thrun, T. M. Mitchell, Lifelong robot learning, Robotics and Autonomous Systems 15 (1–2) (1995) 25–46.
 (3) S. Thrun, Is learning the nth thing any easier than learning the first?, in: Advances in neural information processing systems, 1996, pp. 640–646.
 (4) P. Ruvolo, E. Eaton, ELLA: An efficient lifelong learning algorithm, in: International Conference on Machine Learning, 2013, pp. 507–515.
 (5) Z. Chen, N. Ma, B. Liu, Lifelong learning for sentiment classification, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 750–756.
 (6) S. S. Sarwar, A. Ankit, K. Roy, Incremental learning in deep convolutional neural networks using partial network sharing, IEEE Access.
 (7) S. Hou, X. Pan, C. Change Loy, Z. Wang, D. Lin, Lifelong learning via progressive distillation and retrospection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 437–452.
 (8) P. Ruvolo, E. Eaton, Active task selection for lifelong machine learning, in: Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
 (9) D. Isele, M. Rostami, E. Eaton, Using task features for zero-shot knowledge transfer in lifelong learning, in: IJCAI, 2016, pp. 1620–1626.
 (10) D. M. Blei, M. I. Jordan, et al., Variational inference for Dirichlet process mixtures, Bayesian Analysis 1 (1) (2006) 121–143.
 (11) G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, S. Wermter, Continual lifelong learning with neural networks: A review, Neural Networks.
 (12) J. L. McClelland, B. L. McNaughton, R. C. O’Reilly, Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory., Psychological review 102 (3) (1995) 419.
 (13) Z. Chen, B. Liu, Lifelong Machine Learning, Morgan & Claypool Publishers, 2018.
 (14) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al., Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences 114 (13) (2017) 3521–3526.
 (15) F. Zenke, B. Poole, S. Ganguli, Continual learning through synaptic intelligence, in: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017, pp. 3987–3995.
 (16) A. Robins, Catastrophic forgetting, rehearsal and pseudorehearsal, Connection Science 7 (2) (1995) 123–146.
 (17) H. Xu, B. Liu, L. Shu, P. S. Yu, Lifelong domain word embedding via meta-learning, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4510–4516.
 (18) C. V. Nguyen, Y. Li, T. D. Bui, R. E. Turner, Variational continual learning, in: International Conference on Learning Representations (ICLR), 2018.
 (19) H. Shin, J. K. Lee, J. Kim, J. Kim, Continual learning with deep generative replay, in: Advances in Neural Information Processing Systems, 2017, pp. 2990–2999.
 (20) L. Shu, B. Liu, H. Xu, A. Kim, Lifelong-RL: Lifelong relaxation labeling for separating entities and aspects in opinion targets, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Vol. 2016, NIH Public Access, 2016, p. 225.
 (21) L. Shu, H. Xu, B. Liu, Lifelong learning CRF for supervised aspect extraction, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017.
 (22) D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, R. Hadsell, Continual unsupervised representation learning, in: Advances in Neural Information Processing Systems, 2019, pp. 7645–7655.
 (23) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International Conference on Learning Representations (ICLR), 2014.
 (24) M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, S. R. Datta, Composing graphical models with neural networks for structured representations and fast inference, in: Advances in neural information processing systems, 2016, pp. 2946–2954.
 (25) J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: International conference on machine learning, 2016, pp. 478–487.
 (26) Z. Jiang, Y. Zheng, H. Tan, B. Tang, H. Zhou, Variational deep embedding: an unsupervised and generative approach to clustering, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, AAAI Press, 2017, pp. 1965–1972.
 (27) P. Goyal, Z. Hu, X. Liang, C. Wang, E. P. Xing, Nonparametric variational autoencoders for hierarchical representation learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5094–5102.
 (28) J. Sethuraman, R. C. Tiwari, Convergence of Dirichlet measures and the interpretation of their parameter, in: Statistical Decision Theory and Related Topics III, Elsevier, 1982, pp. 305–315.
 (29) K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural networks 4 (2) (1991) 251–257.

 (30) D. P. Kingma, S. Mohamed, D. J. Rezende, M. Welling, Semi-supervised learning with deep generative models, in: Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
 (31) E. Nalisnick, L. Hertel, P. Smyth, Approximate inference for deep latent Gaussian mixtures, in: NIPS Workshop on Bayesian Deep Learning, Vol. 2, 2016.
 (32) H. Ishwaran, L. F. James, Gibbs sampling methods for stick-breaking priors, Journal of the American Statistical Association 96 (453) (2001) 161–173.
 (33) S. Lee, J. Stokes, E. Eaton, Learning shared knowledge for deep lifelong learning using deconvolutional networks, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 2837–2844.
 (34) M. C. Hughes, E. Sudderth, Memoized online variational inference for dirichlet process mixture models, in: Advances in Neural Information Processing Systems, 2013, pp. 1133–1141.
 (35) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradientbased learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
 (36) D. D. Lewis, Y. Yang, T. G. Rose, F. Li, Rcv1: A new benchmark collection for text categorization research, Journal of machine learning research 5 (Apr) (2004) 361–397.
 (37) A. Coates, A. Ng, H. Lee, An analysis of singlelayer networks in unsupervised feature learning, in: Proceedings of the fourteenth international conference on artificial intelligence and statistics, 2011, pp. 215–223.
 (38) A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images.
 (39) A. Rosenberg, J. Hirschberg, V-measure: A conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

 (40) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 (41) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
Appendix A Derivation of
Recall that in Section 5.1, the ELBO is
(11) 
Since we adopt the alternating optimization technique, the derivation only involves the first three terms of equation (11), which are the terms that contribute to the optimization of the encoder, decoder and variational parameters.
We assume our variational distribution takes the form of
(12) 
where we denote , which is a neural network, is the number of mixture components in the DP of the variational distribution, , , and is a Multinomial distribution. Under the assumptions in (12), we derive each of the first three terms in Equation 11 to obtain .
Notations in the ELBO
: the total number of observations.
: the number of Monte Carlo samples in Stochastic Gradient Variational Bayes (SGVB).
.
: the th observation.
: cluster membership for the th observation.
.
.
: the scalar precision in the NW distribution.
: the posterior mean of cluster .
.
: the th posterior degrees of freedom of the NW.
: variational parameters of the th NW component.
: variational parameters of a Beta distribution for the th component in Equation 12.
: variational parameters of the NW distribution for .
: the variational parameters of a categorical distribution for the cluster membership of each observation.
(1) :
We assume in the generative model that and is parameterized by a neural network and . Using the reparameterization trick (kingma2014auto, ) and the Monte Carlo estimate of expectations, we have
(13) 
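The reparameterization and Monte Carlo estimation step can be sketched numerically as follows; the encoder outputs and the integrand f below are illustrative placeholders rather than the paper's actual decoder likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder outputs for one observation: mean and log-variance of q(z | x).
mu, log_var = np.array([0.5, -1.0]), np.array([0.1, -0.3])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow through mu and log_var during backpropagation.
L = 1000  # number of Monte Carlo samples
eps = rng.standard_normal((L, mu.size))
z = mu + np.exp(0.5 * log_var) * eps

# Monte Carlo estimate of E_q[f(z)] for any integrand f, e.g. the
# reconstruction term log p(x | z); here f is a placeholder.
f = lambda z: -0.5 * np.sum(z**2, axis=1)
estimate = f(z).mean()
```

Averaging over L samples of eps gives an unbiased estimate of the expectation while keeping mu and log_var inside the computation graph.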
(2)
Recall that , where and where is a neural network. Following kingma2014auto
, we use the reparameterization and sampling trick to allow backpropagation, for
where is the number of Monte Carlo samples, we have
Define
We have