1 Introduction
Humans have the impressive ability to learn many different concepts and perform different tasks in a sequential lifelong setting. For example, infants learn to interact with objects in their environment without clear specification of tasks (task-agnostic), in a sequential fashion without forgetting (non-stationary), from temporally correlated visual inputs (non-i.i.d.), and with minimal external supervision (unsupervised). For a learning system such as a robot deployed in the real world, it is highly desirable to satisfy these desiderata as well. In contrast, learning algorithms often require input samples to be shuffled in order to satisfy the i.i.d. assumption, and have been shown to perform poorly when trained on sequential data, with newer tasks or concepts overwriting older ones, a phenomenon known as catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2013).
As a result, there has been renewed research focus on the continual learning problem in recent years (e.g. Kirkpatrick et al., 2017; Nguyen et al., 2017; Zenke et al., 2017; Shin et al., 2017), with several approaches addressing catastrophic forgetting as well as backwards or forwards transfer, that is, using the current task to improve performance on past or future tasks. However, most of these techniques have focused on a sequence of tasks in which both the identity of the task (task label) and boundaries between tasks are provided; moreover, they often focus on the supervised learning setting, where class labels for each data point are given. Thus, many of these methods fail to capture some of the aforementioned properties of real-world continual learning, with unknown task labels or poorly defined task boundaries, or when abundant class-labelled data is not available.
In this paper, we propose to address the more general unsupervised continual learning setting (also suggested separately by Smith et al. (2019)), in which task labels and boundaries are not provided to the learner, and hence the focus is on unsupervised task learning. The tasks could correspond to either unsupervised representation learning, or learning skills without extrinsic reward if applied to the reinforcement learning domain. In this sense, the problem setting is “unsupervised” in two ways: in terms of the absence of task labels (or indeed well-defined tasks themselves), and in terms of the absence of external supervision such as class labels, regression targets, or external rewards. The two aspects may seem independent, but considering the unsupervised learning problem encourages solutions that aim to capture all fundamental properties of the data, which in turn might encourage, or reinforce, particular ways of addressing the task boundary problem. Hence the two aspects are connected through the type of solutions they necessitate, and it is beneficial to consider them jointly. We argue that this is an important and challenging open problem, as it enables continual learning in environments without clearly defined tasks and goals, and with minimal external supervision. Relaxing these constraints is crucial to performing lifelong learning in the real world.
Our approach, named Continual Unsupervised Representation Learning (CURL), learns a task-specific representation on top of a larger set of shared parameters, and deals with task ambiguity by performing task inference within the model. We endow the model with the ability to dynamically expand its capacity to capture new tasks, and suggest methods to minimise catastrophic forgetting. The model is experimentally evaluated in a variety of unsupervised settings: when tasks or classes are presented sequentially, when training data are shuffled, and with ambiguous task boundaries when transitions are continuous rather than discrete. We also demonstrate that despite focusing on unsupervised learning, the method can be trivially adapted to supervised learning while removing the reliance on task knowledge and class labels. The experiments demonstrate competitive performance with respect to previous work, with the additional ability to learn without supervision in a continual learning setting, and indicate the efficacy of the different components of the proposed method.
2 Model
We begin by defining the CURL model and training loss, then introduce methods to perform dynamic expansion, and propose a generative replay mechanism to combat forgetting.
2.1 Inference over tasks
To address the problem, we utilise the following generative model (Figure 2):
$$y \sim \mathrm{Cat}(\pi), \qquad z \sim \mathcal{N}\big(\mu_z(y), \sigma_z^2(y)\big), \qquad x \sim p(x \mid z), \tag{1}$$
with the joint probability factorising as $p(x, y, z) = p(y)\,p(z \mid y)\,p(x \mid z)$. Here, the categorical variable $y$ indicates the current task, which is then used to instantiate the task-specific Gaussian parameters for latent variable $z$, which is then decoded to produce the input $x$. $p(y)$ is a fixed uniform prior, with component weights specified by $\pi$. In the representation learning scenario, $y$ can be interpreted as representing some discrete clusters in the data, with $z$ then representing a mixture of Gaussians which encodes both the inter- and intra-cluster variation. Posterior inference of $p(y, z \mid x)$ in this model is intractable, so we employ an approximate variational posterior $q(y, z \mid x) = q(y \mid x)\,q(z \mid x, y)$.
Each of these components is parameterised by a neural network: the input is encoded to a shared representation, the mixture probabilities $q(y \mid x)$ are determined by an output softmax “task inference” head, and the Gaussian parameters for $q(z \mid x, y = k)$ are produced by the output of a component-specific latent encoding head (one for each component $k$). The component-specific prior parameters $\mu_z(y)$ and $\sigma_z(y)$ are parameterised as a linear layer (followed by a softplus nonlinearity for the latter) using a one-hot representation of $y$ as the input. Finally, the decoder is a single network that maps from the mixture-of-Gaussians latent space $z$ to the reconstruction $\hat{x}$. The architecture is shown in Figure 2, where for simplicity, we denote the parameters of the $k$-th Gaussian by $\{\mu^{(k)}, \sigma^{(k)}\}$. The loss for this model is the evidence lower bound (ELBO) given by:
$$\log p(x) \geq \mathcal{L} = \mathbb{E}_{q(y, z \mid x)}\big[\log p(x, y, z) - \log q(y, z \mid x)\big] = \mathbb{E}_{q(y \mid x)\,q(z \mid x, y)}\big[\log p(x \mid z)\big] - \mathbb{E}_{q(y \mid x)}\big[\mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z \mid y)\big)\big] - \mathrm{KL}\big(q(y \mid x)\,\|\,p(y)\big). \tag{2}$$
The expectation over $q(y \mid x)$ can be computed exactly by marginalising over the $K$ categorical options, but the expectation over $q(z \mid x, y)$ is intractable, and requires sampling. The resulting Monte Carlo approximation comprises a set of familiar terms, some of which correspond clearly to the single-component VAE (Kingma & Welling, 2013; Rezende et al., 2014):
$$\mathcal{L} \approx \sum_{k=1}^{K} q(y = k \mid x)\,\log p\big(x \mid \tilde{z}^{(k)}\big) - \sum_{k=1}^{K} q(y = k \mid x)\,\mathrm{KL}\big(q(z \mid x, y = k)\,\|\,p(z \mid y = k)\big) - \mathrm{KL}\big(q(y \mid x)\,\|\,p(y)\big), \tag{3}$$
where $\tilde{z}^{(k)} \sim q(z \mid x, y = k)$ is sampled using the reparametrisation trick. Of course, this can be generalised to multiple samples in a similar fashion to the Importance-Weighted Autoencoder (IWAE) (Burda et al., 2015).
Intuitively, this loss encourages the model to reconstruct the data and perform clustering where possible. For a given data point, the model can choose to have high entropy over $q(y \mid x)$, in which case all of the component-wise losses must be low, or assign high $q(y = k \mid x)$ for some $k$, and use that component to model the datum well. By exploiting diversity in the input data, the model can learn to utilise different components for different discrete structures (such as classes) in the data.
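To make the structure of the Monte Carlo estimate in Eqn. 3 concrete, the following is a minimal NumPy sketch for a single input. It assumes that the encoder heads have already produced the component posterior `q_y` and the per-component Gaussian parameters, that the prior over components is uniform, and that the decoder outputs Bernoulli means; the Bernoulli likelihood, the function names, and the array layout are illustrative assumptions rather than the implementation used in the experiments.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL between diagonal Gaussians N(mu_q, sigma_q^2) and N(mu_p, sigma_p^2), summed over dims."""
    var_ratio = (sigma_q / sigma_p) ** 2
    mean_term = ((mu_q - mu_p) / sigma_p) ** 2
    return 0.5 * np.sum(var_ratio + mean_term - 1.0 - np.log(var_ratio), axis=-1)

def bernoulli_log_lik(x, probs, eps=1e-7):
    """Log-likelihood of binary x under Bernoulli means (an assumed decoder likelihood)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return np.sum(x * np.log(probs) + (1.0 - x) * np.log(1.0 - probs), axis=-1)

def mixture_elbo(x, q_y, mu_q, sigma_q, mu_p, sigma_p, decode, rng):
    """Single-sample estimate of the mixture ELBO for one input.

    x: (D,) input; q_y: (K,) posterior over components;
    mu_q, sigma_q: (K, Z) posterior parameters, one row per component;
    mu_p, sigma_p: (K, Z) prior parameters, one row per component;
    decode: maps (K, Z) latents to (K, D) Bernoulli means.
    """
    num_components = q_y.shape[0]
    # Reparametrisation trick: one latent sample per component.
    z = mu_q + sigma_q * rng.standard_normal(mu_q.shape)
    recon = bernoulli_log_lik(x, decode(z))            # (K,)
    kl_z = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)   # (K,)
    # Exact marginalisation over the categorical options.
    componentwise = np.sum(q_y * (recon - kl_z))
    # KL(q(y|x) || uniform) = sum_k q_k (log q_k + log K).
    log_q = np.log(np.clip(q_y, 1e-12, None))
    kl_y = np.sum(q_y * (log_q + np.log(num_components)))
    return componentwise - kl_y
```

When the posterior equals the prior for every component and `q_y` is uniform, both KL terms vanish and the estimate reduces to the reconstruction log-likelihood, which is a convenient sanity check for such a sketch.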
2.2 Component-constrained learning
While our main aim is to operate in an unsupervised setting, there may be cases in which one may wish to train a specific component, or when labels can be generated in a self-supervised fashion. In such cases where labels are available, we can use a supervised loss, adapted from Eqn. 3:
$$\mathcal{L}_{\mathrm{sup}}(y_{\mathrm{obs}}) = \log p\big(x \mid \tilde{z}^{(y_{\mathrm{obs}})}\big) - \mathrm{KL}\big(q(z \mid x, y = y_{\mathrm{obs}})\,\|\,p(z \mid y = y_{\mathrm{obs}})\big) + \log q(y = y_{\mathrm{obs}} \mid x). \tag{4}$$
Here, instead of marginalising over $y$ as in Equation 3, the component-wise ELBO (the first two terms) is computed only for the known label $y_{\mathrm{obs}}$. Furthermore, the final term in the original ELBO is replaced with a supervised cross-entropy term encouraging $q(y \mid x)$ to match the label, which reduces to the log posterior probability of the observed label. This loss will be utilised and further discussed in Sections 2.3 and 2.4.
2.3 Dynamic expansion
To determine the number of mixture components, we opt for a dynamic expansion approach in which capacity is added as needed, by maintaining a small set of poorly-modelled samples and then initialising and fitting a new component to this set when it reaches a critical size. In a similar fashion to existing techniques such as the Forget-Me-Not process (Milan et al., 2016) and Dirichlet process (Teh, 2010), we rely on a threshold to determine when to instantiate a new component. More concretely, we denote a subset of parameters $\theta^{(k)}$ corresponding to the parameters unique to each component $k$ (i.e. the $k$-th softmax output in $q(y \mid x)$ and the $k$-th Gaussian component in $p(z \mid y)$ and $q(z \mid x, y)$). During training, any sample with a log-likelihood less than a threshold $c_{\mathrm{new}}$ is added to set $\mathcal{D}_{\mathrm{new}}$ (where the log-likelihood is approximated by the ELBO). Then, when the set $\mathcal{D}_{\mathrm{new}}$ reaches size $N_{\mathrm{new}}$, we initialise the parameters of the new component to the current component $k^*$ that has greatest probability over $\mathcal{D}_{\mathrm{new}}$:
$$\theta^{(K+1)} = \theta^{(k^*)}, \qquad k^* = \arg\max_{k \in \{1, \ldots, K\}} \sum_{x \in \mathcal{D}_{\mathrm{new}}} q(y = k \mid x). \tag{5}$$
The new component is then tuned to $\mathcal{D}_{\mathrm{new}}$, by performing a small fixed number of iterations of gradient descent on all parameters $\theta$, using the component-constrained ELBO (Eqn. 4) with label $K+1$.
Intuitively, this process encourages forward transfer, by initialising new concepts to the “closest” existing concept learned by the model and then fine-tuning to a small number of instances. The additional capacity used for each expansion is only in the top-most layer of the encoder, with parameters, compared to for the rest of the shared model. That is, while dynamic expansion incorporates a new high-level concept, the underlying low-level representations in the encoder, and the entire decoder, are both shared among all tasks.
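The buffering and initialisation steps of dynamic expansion can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names, the dictionary layout of the per-component parameters, and the argument names `threshold` and `max_size` are hypothetical, and the subsequent fine-tuning with the component-constrained loss (and any clearing of the buffer) is omitted.

```python
import numpy as np

class ExpansionBuffer:
    """Collects poorly-modelled samples and signals when a component should be added."""

    def __init__(self, threshold, max_size):
        self.threshold = threshold  # log-likelihood threshold (hypothetical name)
        self.max_size = max_size    # critical buffer size (hypothetical name)
        self.samples = []

    def observe(self, x, elbo):
        # The ELBO stands in for the log-likelihood of the sample.
        if elbo < self.threshold:
            self.samples.append(x)
        return len(self.samples) >= self.max_size


def expand(component_params, q_y_on_buffer):
    """Append a new component copied from the best-matching existing one.

    component_params: list of dicts of arrays, one dict per component.
    q_y_on_buffer: (N, K) posterior probabilities for the buffered samples.
    Returns the index of the component that was copied.
    """
    # Component with the greatest total posterior probability over the buffer.
    k_star = int(np.argmax(q_y_on_buffer.sum(axis=0)))
    # Copy, so that later updates to the new component leave the source untouched.
    new_params = {name: value.copy() for name, value in component_params[k_star].items()}
    component_params.append(new_params)
    return k_star
```

Copying rather than aliasing the parameters matters here, since the new component is meant to diverge from its source during fine-tuning.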
2.4 Combatting forgetting via mixture generative replay
A shared low-level representation can mean that learning new tasks interferes with previous ones, leading to forgetting. One relevant technique to address this is Deep Generative Replay (DGR) (Shin et al., 2017), in which samples from a learned generative model are reused in learning. We propose to adapt and extend DGR to the mixture setting to perform unsupervised learning without forgetting. In contrast to the original DGR work, our approach is inherently generative, such that a generative replay-based approach can be incorporated holistically into the framework at minimal cost. We note that many other existing methods (e.g., Kirkpatrick et al. (2017)) could straightforwardly be adapted to our approach, but our experiments demonstrated generative replay to be simple and effective.
To be more precise, during training, the model alternates between batches of real data, with samples drawn from the current training distribution, and generated data, with samples produced by the previous snapshot of the model (with parameters $\theta^{\mathrm{prev}}$):
$$p_{\theta^{\mathrm{prev}}}(x, y, z) = p_{\theta^{\mathrm{prev}}}(x \mid z)\,p_{\theta^{\mathrm{prev}}}(z \mid y)\,p_{\theta^{\mathrm{prev}}}(y), \tag{6}$$
where $p_{\theta^{\mathrm{prev}}}(y)$ represents a choice of prior distribution for the categorical $y$. While the uniform prior is a natural choice, this fails to consider the degree to which different components are used, and can therefore result in poor sample quality. To address this, the model maintains a count over components by accumulating the mean of posterior $q(y \mid x)$ over all previous timesteps, thereby favouring the components that have been used the most. We refer to this process as mixture generative replay (MGR).
While MGR ensures tasks or concepts that have been previously learned by the model are reused for learning, it places no constraint on which components are used to model them. Given that each generated datum is conditioned on a sampled $y$, we can use $y$ as a self-supervised learning signal and encourage mixture components to remain consistent with respect to the model snapshot, by using the component-constrained loss from Eqn. 4.
The only remaining question is when to update the previous model snapshot $\theta^{\mathrm{prev}}$. For this, we explore two cases, with snapshots taken at periodic fixed intervals, or immediately before performing dynamic expansion. The intuition behind the latter is that dynamic expansion is performed when there is a sufficient shift in the input distribution, and consolidating previously learned information is beneficial prior to adding a newly observed concept. This is also advantageous as it eliminates the additional snapshot period hyperparameter.
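The replay sampling described in this section can be sketched in NumPy as below. The sketch assumes the snapshot exposes its per-component prior parameters as arrays and its decoder as a callable; the names `UsageCounter` and `sample_replay` are hypothetical, returning the decoder output directly (rather than sampling from a likelihood) is a simplification, and growing the counter when a component is added is omitted.

```python
import numpy as np

class UsageCounter:
    """Running sum of batch-mean posteriors over components, one entry per component."""

    def __init__(self, num_components):
        self.counts = np.zeros(num_components)

    def update(self, q_y_batch):
        # q_y_batch: (B, K) posterior probabilities for a batch of real data.
        self.counts += q_y_batch.mean(axis=0)

    def prior(self):
        total = self.counts.sum()
        if total == 0.0:
            # Fall back to a uniform prior before any data has been seen.
            return np.full(self.counts.shape, 1.0 / self.counts.size)
        return self.counts / total


def sample_replay(prior_y, mu_p, sigma_p, decode, batch_size, rng):
    """Draw (x, y) pairs from a snapshot: y from the usage-weighted prior, z from p(z | y)."""
    y = rng.choice(len(prior_y), size=batch_size, p=prior_y)
    z = mu_p[y] + sigma_p[y] * rng.standard_normal((batch_size, mu_p.shape[1]))
    # The sampled y is returned so it can serve as a self-supervised label.
    return decode(z), y
```

Returning `y` alongside the generated data is what allows the same routine to serve both standard MGR, which ignores the label, and the self-supervised variant, which passes it to the component-constrained loss.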
3 Related Work
Generative models
A number of related approaches aim to learn a discriminative latent space using generative models. Building on the original VAE (Kingma & Welling, 2013), Nalisnick et al. (2016) utilise a latent mixture of Gaussians, aiming to capture class structure in an unsupervised fashion, and propose a Bayesian nonparametric prior, further developed in Nalisnick & Smyth (2017). Similarly, Joo et al. (2019) suggest a Dirichlet posterior in latent space to avoid some of the previously observed component-collapsing phenomena. Lastly, Jiang et al. (2017) propose Variational Deep Embedding (VaDE) focused on the goal of clustering in an i.i.d. setting. While VaDE has the same generative process as CURL, it assumes a mean-field approximation, with $y$ and $z$ conditionally independent given the input. In the case of CURL, conditioning $z$ on $y$ ensures we can adequately capture the inter- and intra-class uncertainty of a sample within the same structured latent space $z$.
Continual learning
A large body of work has addressed the continual learning problem (Parisi et al., 2019). Regularisation-based methods minimise changes to parameters that are crucial for earlier tasks, with some parameter-wise weight to measure importance (Kirkpatrick et al., 2017; Nguyen et al., 2017; Zenke et al., 2017; Aljundi et al., 2018; Schwarz et al., 2018). Related techniques seek to ensure the performance on previous data does not decrease, by employing constrained optimisation (Lopez-Paz et al., 2017; Chaudhry et al., 2018) or distilling the information from old models or tasks (Li & Hoiem, 2018). In a similar vein, other methods encourage new tasks to utilise previously unused parameters, either by finding “free” linear parameter subspaces (He & Jaeger, 2018); learning an attention mask over parameters (Serra et al., 2018); or using an agent to find new activation paths through a network (Fernando et al., 2017). Expansion-based models dynamically increase capacity to allow for additional tasks (Rusu et al., 2016; Yoon et al., 2017; Draelos et al., 2017), and optionally prune the network to constrain capacity (Zhou et al., 2012; Golkar et al., 2019). Another popular approach is that of rehearsal-based methods (Robins, 1995), where the data distribution from earlier tasks is captured by samples from a generative model trained concurrently (Shin et al., 2017; van de Ven & Tolias, 2018; Ostapenko et al., 2018). Farquhar & Gal (2018) combine such methods with regularisation-based approaches under a Bayesian interpretation. Alternatively, Rebuffi et al. (2017) learn class-specific exemplars instead of a generative model. However, these methods usually require task identities, rely on well-defined task boundaries, and are often evaluated on a sequence of supervised learning tasks.
Task-agnostic continual learning
Some recent work has investigated continual learning without task labels or boundaries. Hsu et al. (2018) and van de Ven & Tolias (2019) identify the scenarios of incremental task, domain, and class learning, which operate without task labels in the latter cases, but all focus on supervised learning tasks. Aljundi et al. (2019) propose a task-free approach to continual learning related to ours, which mitigates forgetting using the regularisation-based Memory Aware Synapses (MAS) approach (Aljundi et al., 2018), maintains a hard example buffer to better estimate the regularisation weights, and detects when to update these weights (usually performed at known task boundaries in previous work). Zeno et al. (2018) propose a Bayesian task-agnostic learning update rule for the mean and variance of each parameter, and demonstrate its ability to handle ambiguous task boundaries. However, it is only applied to supervised tasks, and can exploit the “label” trick, inferring the task based on the class label. In contrast, Achille et al. (2018) address the problem of unsupervised learning in a sequential setting by learning a disentangled latent space with task-specific attention masks, but the main focus is on learning across datasets, and the method relies on abrupt shifts in data distribution between datasets. Our approach builds upon this existing body of work, addressing the full unsupervised continual learning problem, where task labels and boundaries are unknown, and the tasks themselves are without class supervision. We argue that addressing this problem is critical in order to tackle continual learning in challenging, real-world scenarios.
4 Experiments
In the following sections, we empirically evaluate a) whether our method learns a meaningful class-discriminable latent space in the unsupervised sequential learning setting, without forgetting, even when task boundaries are unclear; b) the importance of the dynamic expansion and generative replay techniques to performance; and c) how CURL performs on external benchmarks when trained i.i.d. or adapted to learn in a supervised fashion. Code for all experiments can be found at https://github.com/deepmind/deepmind-research/.
4.1 Evaluation settings and datasets
One desired outcome of our approach is the ability to learn class-discriminative latent representations from non-stationary input data. We evaluate this using cluster accuracy (the accuracy obtained when assigning each mixture component to its most represented class), and with the accuracy of a k-Nearest Neighbours (kNN) classifier in latent space. The former measures the amount of class-relevant information encoded into the categorical variable $y$, while the latter measures the discriminability of the entire latent space without imposing structure (such as a linear boundary).
For the evaluation we extensively utilise the MNIST (LeCun et al., 2010) and Omniglot (Lake et al., 2011) datasets, and further information can be found in Appendix B. We investigate a number of different evaluation settings: i.i.d., where the model sees shuffled training data; sequential, where the model sees classes sequentially; and continuous drift, similar to the sequential case, but with classes gradually introduced by slowly increasing the number of samples from the new class within a batch.
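The cluster accuracy defined above can be computed directly from the most probable component of each test sample and its ground-truth class. The following NumPy sketch illustrates one way to do so; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def cluster_accuracy(component_ids, labels, num_components, num_classes):
    """Accuracy obtained when each mixture component predicts its most represented class.

    component_ids: (N,) integer index of the most probable component per sample.
    labels: (N,) integer ground-truth class per sample.
    """
    counts = np.zeros((num_components, num_classes), dtype=np.int64)
    # Contingency table: counts[k, c] = number of samples of class c assigned to component k.
    np.add.at(counts, (component_ids, labels), 1)
    # Each component contributes the count of its majority class.
    return counts.max(axis=1).sum() / len(labels)
```

Note that this measure can only increase when a component is split, which is why it is reported alongside the number of components used.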
4.2 Continual class-discriminative representation learning
We begin by analysing our approach, and follow this with evaluation on external benchmarks in later sections. First, we measure the ability to perform class-discriminative representation learning in the sequential setting on MNIST, where each of the classes is observed for training steps (further experimental details can be found in Appendix C.1). Figure 4a shows the cluster accuracy for a number of variants of CURL. We observe the importance of both dynamic expansion and mixture generative replay (MGR) to learn a coherent representation without forgetting. Figure 4b shows the class-wise accuracies during training, for the model with MGR and expansion. Interestingly, while many existing continual learning approaches appear to forget earlier classes (see e.g. Nguyen et al. (2017)), these classes are well modelled by CURL, and confusion is mostly observed between similar classes (such as s and s; or s and s). Indeed, this is reflected in the class-confusion matrix after training (Figure 4c). This implies the model adequately addresses catastrophic forgetting, but could improve in terms of plasticity, i.e., learning new concepts. Further analysis can be found in Appendix A.1, showing generated samples; and Appendix A.2, analysing the dynamic expansion buffers.
4.3 Ablation studies
Next, we perform an ablation study to gauge the impact of the expansion threshold for continual learning, in terms of cluster accuracy and number of components used, as shown in Figure 3. As the threshold value is increased, samples are more frequently stored into the “poorly-modelled” buffer, and the model expands more aggressively throughout learning. Consequently, for sequential learning, the number of components ranges from to , the cluster accuracy varies up to a maximum of , and the NN error also marginally decreases over this range. Furthermore, without any dynamic expansion, the result is significantly poorer at accuracy, and when discovering the same number of components with dynamic expansion (, obtained with an expansion threshold of ), the equivalent performance is at . Thus, the dynamic expansion threshold conveniently provides a tuning parameter to perform capacity estimation, trading off cluster accuracy with the memory cost of using additional components in the latent mixture. Interestingly, if we perform the same analysis for i.i.d. data (also in Figure 3), we observe a similar trade-off, though the final performance is slightly poorer than when starting with an equivalent, fixed number of mixture components ().
We also further analyse mixture generative replay (MGR) with an ablation study in Table 1. We evaluate standard and self-supervised MGR (SMGR), and compare between the case where snapshots are taken on expansion (i.e., no task information is needed), or at fixed intervals (either at $T$, the duration of training on each class, or $0.1T$, ten times more frequently). Intuitively, the period is important as it determines how quickly a shifting data distribution is consolidated into the model: if too short, the generated data will drift with the model, leading to forgetting. The results in Table 1 point to a number of interesting observations. First, both MGR and SMGR are sensitive to the fixed snapshot period: the performance is unsurprisingly optimal when snapshots are taken as the training class changes, but drops significantly when performed more frequently, and also uses a greater number of clusters in the process. Second, by taking snapshots before dynamic expansion instead, this performance can largely be recovered, and without any knowledge of the task boundaries. Third, perhaps surprisingly, SMGR harms performance compared to MGR. This may be due to the fact that mixture components already tend to be consistent in latent space throughout learning, and SMGR may be reducing plasticity; further analysis can be found in Appendix A.3. Lastly, we can also observe the benefits of MGR, with the MNIST case without MGR exhibiting far poorer performance and utilising many more components in the process. Interestingly, the Omniglot case without MGR performs well, but at the cost of significantly more components: expansion itself is able to partly address catastrophic forgetting by effectively over-segmenting the data.
Benchmark  MNIST  Omniglot  

Scenario  # clusters  Cluster acc (%)  NN error (%)  # clusters  Cluster acc (%)  NN error (%) 
MGR (fixed, T)  
MGR (fixed, 0.1T)  
MGR (dyn)  
SMGR (fixed, T)  
SMGR (fixed, 0.1T)  
SMGR (dyn)  
CURL (no MGR) 
4.4 Learning with poorly-defined task boundaries
Next, we evaluate CURL in the continuous drift setting, and compare to the standard sequential setting. The overall performance on MNIST and Omniglot is shown in Table 2, using MGR with either fixed or dynamic snapshots. We observe that despite having unclear task boundaries, where classes are gradually introduced, the continuous case generally exhibits better performance than the case with well-defined task boundaries. We also closely investigate the mixture component dynamics during learning, by obtaining the top components (most used over the course of learning) and plotting their posterior probabilities over time (Figure 5). From the discrete task-change domain (left), we observe that probabilities change sharply with the hard task boundaries (every steps); and many mixture components are quite sparsely activated, modelling either a single class, or a few classes. Some of the mixture components also observe “echoes”, where the sharp change to a new class in the data distribution activates the component temporarily before dynamic expansion is performed. In the continuous drift case (right of Figure 5), the mixture probabilities exhibit similar behaviours, but are much smoother in response to the gradually changing data distribution. Further, without a sharp distributional shift, the “echoes” are not observed.
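As an illustration of what a continuous drift schedule might look like, the sketch below assigns a class to each slot of a batch so that the share of the next class grows over the course of the current class's window. The linear ramp, the function name, and all argument names are assumptions made for this example; the exact schedule used in the experiments is not specified here.

```python
import numpy as np

def drift_batch_labels(step, steps_per_class, num_classes, batch_size):
    """Class index for each slot of a batch under a linear continuous-drift schedule."""
    current = min(step // steps_per_class, num_classes - 1)
    if current == num_classes - 1:
        # The final class has no successor, so the batch is drawn from it alone.
        return np.full(batch_size, current)
    # Fraction of the way through the current class's window.
    progress = (step % steps_per_class) / steps_per_class
    num_new = int(round(progress * batch_size))
    return np.concatenate([
        np.full(batch_size - num_new, current),
        np.full(num_new, current + 1),
    ])
```

Under such a schedule there is no single step at which the training distribution changes abruptly, which is the property that distinguishes this setting from the sequential one.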
Benchmark  MNIST  Omniglot  

Scenario  # clusters  Cluster acc (%)  NN error (%)  # clusters  Cluster acc (%)  NN error (%) 
Seq. w/ MGR (fixed)  
Seq. w/ MGR (dyn)  
Cont. w/ MGR (fixed)  
Cont. w/ MGR (dyn) 
4.5 External benchmarks
Supervised continual learning
While focused on task-agnostic continual learning in unsupervised settings, CURL can also be trivially adapted to supervised tasks simply by training with the supervised loss in Eqn. 4. We evaluate on the split MNIST benchmark, where the data are split into five tasks, each classifying between two classes, and the model is trained on each task sequentially. If we evaluate the overall accuracy after training, this is called incremental class learning; and if we provide the model with the appropriate task label and evaluate the binary classification accuracy for each task, this is incremental task learning (Hsu et al., 2018; van de Ven & Tolias, 2019). Experimental details can be found in Appendix C.2. The results in Table 3 demonstrate that the proposed unsupervised approach can easily and effectively be adapted to supervised tasks, achieving competitive results for both scenarios. While all methods perform quite well on incremental task learning, CURL is outperformed only by iCaRL (Rebuffi et al., 2017) on incremental class learning, which was specifically proposed for this task. Interestingly, the result is also better than DGR, suggesting that by holistically incorporating the generative process and classifier into the same model, and focusing on the broader unsupervised, task-agnostic perspective, CURL is still effective in the supervised domain.
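The difference between the two evaluation protocols can be illustrated with a short NumPy sketch. It assumes the model produces a score per class for each test sample (for instance the component posterior, with one component per class), that the five tasks pair consecutive digits, and that the task label is used by restricting the prediction to that task's two classes; these choices and all names are illustrative assumptions.

```python
import numpy as np

# Five two-class tasks; pairing consecutive digits is an assumption of this sketch.
TASKS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]

def incremental_class_accuracy(scores, labels):
    """Overall accuracy over all classes, with no task label provided."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

def incremental_task_accuracy(scores, labels):
    """Binary accuracy averaged over samples, with each prediction restricted to its task."""
    correct = 0
    for task in TASKS:
        in_task = np.isin(labels, task)
        classes = np.array(task)
        # Only the two classes belonging to the known task compete.
        pred = classes[np.argmax(scores[in_task][:, classes], axis=1)]
        correct += int(np.sum(pred == labels[in_task]))
    return correct / len(labels)
```

Because the task label removes all but one competing class, the incremental task accuracy of a given model is never lower than its incremental class accuracy under this scheme.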
Benchmark  MNIST (latent dimension 50)  Omniglot (latent dimension 100)
Evaluation  NN error  NN error  NN error  NN error  NN error  NN error
VAE†
SBVAE†
DirVAE†
CURL (i.i.d.)
VaDE (bigger net)
CURL w/ MGR (seq)
Raw pixels†
† Performance numbers are obtained from Joo et al. (2019), with consistent architectures and hyperparameters.
Unsupervised i.i.d. learning
We also demonstrate the ability of the underlying model to learn in a more traditional setting with the entire dataset shuffled, and compare with existing work in clustering and representation learning: the VAE (Kingma & Welling, 2013), Dirichlet-VAE (Joo et al., 2019), SBVAE (Nalisnick & Smyth, 2017), and VaDE (Jiang et al., 2017). We utilise the same architecture and hyperparameter settings as in Joo et al. (2019) for consistency, with latent spaces of dimension 50 and 100 for MNIST and Omniglot respectively; full details of the experimental setup can be found in Appendix C.3. We note that the NN error values are much better here than in Section 4.3; this is due to a higher dimensional latent space and hence they cannot be directly compared (see Appendix A.4).
The uppermost group in Table 4 shows the results on i.i.d. MNIST and Omniglot. The CURL generative model trained i.i.d. (without MGR, and with dynamic expansion) is competitive with the state-of-the-art on MNIST (bettered only by VaDE, which incorporates a larger architecture) and Omniglot (bettered only by DirVAE). While not the main focus of this paper, this demonstrates the ability of the proposed generative model to learn a structured, discriminable latent space, even in more standard learning settings with shuffled data. Table 4 also shows the performance of CURL trained in the sequential setting. We observe that, despite learning from sequential data, these results are competitive with the state-of-the-art approaches that operate on i.i.d. data.
5 Conclusions
In this work, we introduced an approach to address the unsupervised continual learning problem, in which task labels and boundaries are unknown, and the tasks themselves lack class labels or other external supervision. Our approach, named CURL, performs task inference via a mixture-of-Gaussians latent space, and uses dynamic expansion and mixture generative replay (MGR) to instantiate new concepts and minimise catastrophic forgetting. Experiments on MNIST and Omniglot showed that CURL was able to learn meaningful class-discriminative representations without forgetting in a sequential class setting (even with poorly defined task boundaries). External benchmarks also demonstrated the method to be competitive with respect to previous work when adapted to unsupervised learning from i.i.d. data, and to supervised incremental class learning. Future directions will investigate additional techniques to alleviate forgetting, and the extension to the reinforcement learning domain.
References

Achille et al. (2018)
Achille, Alessandro, Eccles, Tom, Matthey, Loic, Burgess, Chris, Watters,
Nicholas, Lerchner, Alexander, and Higgins, Irina.
Lifelong disentangled representation learning with crossdomain latent homologies.
In Advances in Neural Information Processing Systems, pp. 9873–9883, 2018. 
Aljundi et al. (2018)
Aljundi, Rahaf, Babiloni, Francesca, Elhoseiny, Mohamed, Rohrbach, Marcus, and
Tuytelaars, Tinne.
Memory aware synapses: Learning what (not) to forget.
In
Proceedings of the European Conference on Computer Vision (ECCV)
, pp. 139–154, 2018.  Aljundi et al. (2019) Aljundi, Rahaf, Tuytelaars, Tinne, et al. Taskfree continual learning. Proceedings CVPR 2019, 2019.
 Burda et al. (2015) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Chaudhry et al. (2018) Chaudhry, Arslan, Ranzato, Marc’Aurelio, Rohrbach, Marcus, and Elhoseiny, Mohamed. Efficient lifelong learning with agem. arXiv preprint arXiv:1812.00420, 2018.

Draelos et al. (2017)
Draelos, Timothy J, Miner, Nadine E, Lamb, Christopher C, Cox, Jonathan A,
Vineyard, Craig M, Carlson, Kristofor D, Severa, William M, James, Conrad D,
and Aimone, James B.
Neurogenesis deep learning: Extending deep networks to accommodate new classes.
In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 526–533. IEEE, 2017.  Farquhar & Gal (2018) Farquhar, Sebastian and Gal, Yarin. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018.
 Fernando et al. (2017) Fernando, Chrisantha, Banarse, Dylan, Blundell, Charles, Zwols, Yori, Ha, David, Rusu, Andrei A, Pritzel, Alexander, and Wierstra, Daan. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.
 Golkar et al. (2019) Golkar, Siavash, Kagan, Michael, and Cho, Kyunghyun. Continual learning via neural pruning. arXiv preprint arXiv:1903.04476, 2019.
 Goodfellow et al. (2013) Goodfellow, Ian J, Mirza, Mehdi, Xiao, Da, Courville, Aaron, and Bengio, Yoshua. An empirical investigation of catastrophic forgetting in gradientbased neural networks. arXiv preprint arXiv:1312.6211, 2013.

He & Jaeger (2018)
He, Xu and Jaeger, Herbert.
Overcoming catastrophic interference using conceptoraided backpropagation.
2018.  Hsu et al. (2018) Hsu, YenChang, Liu, YenCheng, and Kira, Zsolt. Reevaluating continual learning scenarios: A categorization and case for strong baselines. arXiv preprint arXiv:1810.12488, 2018.

Jiang et al. (2017)
Jiang, Zhuxi, Zheng, Yin, Tan, Huachun, Tang, Bangsheng, and Zhou, Hanning.
Variational deep embedding: an unsupervised and generative approach
to clustering.
In
Proceedings of the 26th International Joint Conference on Artificial Intelligence
, pp. 1965–1972. AAAI Press, 2017.  Joo et al. (2019) Joo, Weonyoung, Lee, Wonsung, Park, Sungrae, and Moon, IlChul. Dirichlet variational autoencoder. arXiv preprint arXiv:1901.02739, 2019.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kirkpatrick et al. (2017) Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, GrabskaBarwinska, Agnieszka, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
 Lake et al. (2011) Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33, 2011.
 LeCun et al. (2010) LeCun, Yann, Cortes, Corinna, and Burges, CJ. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
 Li & Hoiem (2018) Li, Zhizhong and Hoiem, Derek. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2018.
 Lopez-Paz et al. (2017) Lopez-Paz, David et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476, 2017.
 McCloskey & Cohen (1989) McCloskey, Michael and Cohen, Neal J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pp. 109–165. Elsevier, 1989.
 Milan et al. (2016) Milan, Kieran, Veness, Joel, Kirkpatrick, James, Bowling, Michael, Koop, Anna, and Hassabis, Demis. The forget-me-not process. In Advances in Neural Information Processing Systems, pp. 3702–3710, 2016.
 Nalisnick & Smyth (2017) Nalisnick, Eric and Smyth, Padhraic. Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.
 Nalisnick et al. (2016) Nalisnick, Eric, Hertel, Lars, and Smyth, Padhraic. Approximate inference for deep latent gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.
 Nguyen et al. (2017) Nguyen, Cuong V, Li, Yingzhen, Bui, Thang D, and Turner, Richard E. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
 Ostapenko et al. (2018) Ostapenko, Oleksiy, Puscas, Mihai, Klein, Tassilo, and Nabi, Moin. Learning to remember: Dynamic generative memory for continual learning. 2018.
 Parisi et al. (2019) Parisi, German I, Kemker, Ronald, Part, Jose L, Kanan, Christopher, and Wermter, Stefan. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

 Rebuffi et al. (2017) Rebuffi, Sylvestre-Alvise, Kolesnikov, Alexander, Sperl, Georg, and Lampert, Christoph H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.
 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Robins (1995) Robins, Anthony. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
 Rusu et al. (2016) Rusu, Andrei A, Rabinowitz, Neil C, Desjardins, Guillaume, Soyer, Hubert, Kirkpatrick, James, Kavukcuoglu, Koray, Pascanu, Razvan, and Hadsell, Raia. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

 Schwarz et al. (2018) Schwarz, Jonathan, Czarnecki, Wojciech, Luketina, Jelena, Grabska-Barwinska, Agnieszka, Teh, Yee Whye, Pascanu, Razvan, and Hadsell, Raia. Progress & compress: A scalable framework for continual learning. In International Conference on Machine Learning, pp. 4535–4544, 2018.
 Serra et al. (2018) Serra, Joan, Suris, Didac, Miron, Marius, and Karatzoglou, Alexandros. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning, pp. 4555–4564, 2018.
 Shin et al. (2017) Shin, Hanul, Lee, Jung Kwon, Kim, Jaehong, and Kim, Jiwon. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999, 2017.
 Smith et al. (2019) Smith, James, Baer, Seth, Kira, Zsolt, and Dovrolis, Constantine. Unsupervised continual learning and self-taught associative memory hierarchies. arXiv preprint arXiv:1904.02021, 2019.
 Teh (2010) Teh, Yee Whye. Dirichlet process. Encyclopedia of machine learning, pp. 280–287, 2010.
 van de Ven & Tolias (2018) van de Ven, Gido M and Tolias, Andreas S. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.
 van de Ven & Tolias (2019) van de Ven, Gido M and Tolias, Andreas S. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734, 2019.
 Yoon et al. (2017) Yoon, Jaehong, Yang, Eunho, Lee, Jeongtae, and Hwang, Sung Ju. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
 Zenke et al. (2017) Zenke, Friedemann, Poole, Ben, and Ganguli, Surya. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. JMLR.org, 2017.
 Zeno et al. (2018) Zeno, Chen, Golan, Itay, Hoffer, Elad, and Soudry, Daniel. Task agnostic continual learning using online variational bayes. arXiv preprint arXiv:1803.10123, 2018.

 Zhou et al. (2012) Zhou, Guanyu, Sohn, Kihyuk, and Lee, Honglak. Online incremental feature learning with denoising autoencoders. In Artificial intelligence and statistics, pp. 1453–1461, 2012.
Appendix A Additional experiments
A.1 Generated samples
The primary aim of this paper is to learn a meaningful class-discriminative representation of the data. We use mixture generative replay to counteract catastrophic forgetting, hence sample quality is not our main interest. Nonetheless, we are interested in checking whether the model is able to capture the variety of the data and what level of sample quality is sufficient to retain what has been learned. We illustrate samples generated after sequentially learning on each class in Figure 7. We observe from the later samples (bottom rows) that classes that are observed early on are still preserved within the model. Interestingly, while most of the samples are clear, some are degraded but still identifiable versions of previous symbols (such as some of the zeros after other classes are introduced). Given that the learned latent representations are still class-discriminative, we hypothesise that merely capturing the “essence” of a particular class may be more useful for representation learning than striving for a pixel-perfect reconstruction. We leave a thorough analysis of this idea to future work.
A.2 Data buffers for dynamic expansion
We can also observe samples from the data buffers used for each dynamic expansion, to understand when the model chooses to expand capacity (Figure 7). We note that some buffers hold samples from multiple classes: this occurs when there are only a few outliers from a single class, insufficient to initialise a new component on their own. Nonetheless, we observe an intuitive trend over most of the buffers: in many cases for a class, the first expansion denotes the change in distribution that corresponds to the introduction of the class, and the second expansion responds to outlying or challenging examples from the class. This can be observed, for instance, for the initialisation of a component for “twos” and then for “curly twos”, or for “threes” and “challenging threes”. This highlights the ability of the approach to incorporate new components to account for both data distributional shift and hard example modelling.
A.3 Latent structure
To analyse the structure of our latent space, we encode our test set to latent space, and observe how the latent space changes during training (Figure 8). Figure 8(a) colours the points by ground truth class label (not available to the model during learning), and Figure 8(b) colours them by the most probable mixture component (i.e. ). Each subplot represents a particular step during training, and only points from classes seen so far are used in the plot. We observe from Figure 8(a) that as classes are incrementally introduced, they occupy a relatively disjoint region of latent space, and are consistent over time. That is, the introduction of new classes does not appear to catastrophically interfere with the learned representations of previous ones.
We also observe similar properties in Figure 8(b), with the clusters covering individual classes with reasonable accuracy, and maintaining a consistent position over time. This may offer some explanation for the inefficacy of SMGR: there is little value in incorporating the self-supervised loss for component consistency given that the mixture components are already consistent in which classes they model.
A.4 Discussion on kNN accuracy measure
We use the kNN accuracy in our experiments to measure the class-discriminability of the latent space, without imposing any specific parametric structure on the boundary between classes. As a simple non-parametric approach, it serves this purpose, and it has been used extensively in previous evaluations as a result (Nalisnick & Smyth, 2017; Joo et al., 2019). However, there are some interesting properties that are worth considering. Firstly, as demonstrated by Table 5, the measure is highly dependent on dimensionality, making it difficult to compare across different latent space sizes. In our experiments, all architectural aspects, including latent size, are kept fixed for an experiment to account for this. Secondly, simple baselines like classifying on raw pixels perform surprisingly well (e.g., approximately kNN error on MNIST); this is due in part to the much larger dimensionality than the latent spaces used for evaluation, but we postulate that it is also likely due to the image statistics of the datasets used: given that MNIST and Omniglot have many low-variance black pixels and a relatively small amount of intra-class variance within the centre pixels, the raw pixels are themselves discriminative. Given the inability to directly compare between different dimensionalities, this simple baseline is only provided for context.
Latent space dimension ():  8  16  32  64  128  256  784  1568
kNN error:
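As a rough illustration of the measure discussed in A.4, the following is a minimal sketch of computing a kNN classification error on latent codes. The function name `knn_error`, the choice of Euclidean distance, the value of `k`, and the toy two-dimensional points are all assumptions made for illustration; they are not taken from the experiments.

```python
import math
from collections import Counter


def knn_error(train_z, train_y, test_z, test_y, k=3):
    """Fraction of test points misclassified by a k-nearest-neighbour vote."""
    errors = 0
    for z, y in zip(test_z, test_y):
        # Indices of the k training codes closest to the query (Euclidean).
        neighbours = sorted(
            range(len(train_z)), key=lambda i: math.dist(z, train_z[i])
        )[:k]
        # Majority vote over the neighbours' labels.
        votes = Counter(train_y[i] for i in neighbours)
        predicted = votes.most_common(1)[0][0]
        errors += predicted != y
    return errors / len(test_z)


# Hypothetical 2-D "latent codes" forming two well-separated clusters.
train_z = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_y = [0, 0, 0, 1, 1, 1]
test_z = [(0.05, 0.05), (5.05, 5.0)]
test_y = [0, 1]

error = knn_error(train_z, train_y, test_z, test_y, k=3)
```

Because the vote depends only on distances between codes, the result is sensitive to the dimensionality of the space in which those distances are measured, which is the comparability caveat raised above.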
Appendix B Datasets
The MNIST dataset comprises handwritten samples of ten digits, split into training samples, validation samples, and test samples. The Omniglot dataset comprises samples from each of characters, grouped into different alphabets. While multiple splits are possible, we utilise a common method from previous work, with samples from each character in the training set, and in the test set. In this case, we use the alphabets as the class labels for evaluation. For all experimental runs, we train models for training steps. In the sequential cases, this is equally divided between all classes, resulting in and steps per class for MNIST and Omniglot, respectively.
Appendix C Experimental setup
We used a single main setup for most of the experiments, with the exception of the external benchmarks, which are described separately in the following sections. For all experiments, we train models for steps and report / plot the mean and standard deviation over different random seeds. Each random seed for an experiment is trained using a single Tesla V100 GPU. Each run takes approximately hour for the supervised SplitMNIST scenario, hours for the main unsupervised MNIST sequential run, and hours for Omniglot. The main bottleneck in run time is marginalising over all components, so this could be optimised in future implementations.
C.1 Main setup
MNIST
For MNIST, we employed an MLP encoder with layer sizes of to form the shared representation; followed by a softmax layer for , with a maximum capacity of components; and a linear layer with output dimensions, for each component, to output the posterior parameters of the corresponding dimensional latent Gaussian. Half of the output dimensions were used directly for the mean, while the other half were passed through a softplus activation to produce the variance. The decoder consisted of two fully-connected layers of dimensions, followed by a Bernoulli output layer for the reconstruction. Both encoder and decoder networks employed ReLU nonlinearities, and training was performed using the ADAM optimiser with learning rate . This architecture was obtained by performing a small hyperparameter sweep, also considering an encoder with two dimensional fully-connected layers, and a decoder with layer sizes , but we observed only small differences in performance.
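The output heads described above can be sketched as follows: a softmax over component logits, and a per-component linear output whose first half is used directly as the mean and whose second half is passed through a softplus to give a positive variance. This is only an illustrative sketch; the helper names, the three components, and the two-dimensional latent are hypothetical and do not reflect the sizes used in the experiments.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); always strictly positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_posterior_params(head_output):
    """Split one component's linear-head output into (mean, variance)."""
    d = len(head_output) // 2
    mean = head_output[:d]                              # first half: used directly
    variance = [softplus(v) for v in head_output[d:]]   # second half: softplus
    return mean, variance


# Hypothetical values: 3 mixture components and a 2-dimensional latent,
# so each component head emits 2 * 2 = 4 numbers.
component_logits = [2.0, 0.5, -1.0]
component_probs = softmax(component_logits)

head_output = [0.3, -1.2, 0.0, 4.0]
mean, variance = split_posterior_params(head_output)
```

Using a softplus rather than an exponential keeps the variance positive while growing only linearly for large inputs, which is one common motivation for this parameterisation.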
For dynamic expansion, we set the threshold for the log-likelihood (approximated by the ELBO) at , and anything below this value is considered a poorly explained sample and added to the buffer. This was obtained by performing a sweep over values of , where this parameter can be used to directly modulate the expansion rate, and hence the capacity of the model (i.e. it is a way to automatically estimate the required number of components K). A fixed consolidation period of steps was used after each expansion before the model was re-eligible for expansion: this ensures that the model is able to learn from the data and fit a new component, and only flag poorly defined samples once learning has matured. For all hyperparameter sweeps, model selection was performed using the validation set.
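The thresholding and consolidation logic described above can be sketched roughly as below. The class name `ExpansionTracker`, the rule that an expansion fires once the buffer reaches a fixed size, and all numeric values are assumptions for illustration only; the actual threshold, buffer size, and consolidation period are not reproduced here.

```python
class ExpansionTracker:
    """Toy bookkeeping for threshold-based dynamic expansion (illustrative)."""

    def __init__(self, threshold, buffer_size, consolidation_steps):
        self.threshold = threshold                # ELBO below this => poorly explained
        self.buffer_size = buffer_size            # assumed trigger: expand when full
        self.consolidation_steps = consolidation_steps
        self.buffer = []
        self.steps_since_expansion = consolidation_steps  # eligible from the start
        self.num_components = 1

    def observe(self, sample, elbo):
        """Process one sample; return True if an expansion was triggered."""
        self.steps_since_expansion += 1
        if self.steps_since_expansion <= self.consolidation_steps:
            return False  # still consolidating: nothing is flagged
        if elbo < self.threshold:
            self.buffer.append(sample)
        if len(self.buffer) >= self.buffer_size:
            self.num_components += 1   # a new component would be initialised here
            self.buffer.clear()
            self.steps_since_expansion = 0
            return True
        return False


# Hypothetical settings and a hypothetical stream of per-sample ELBO values.
tracker = ExpansionTracker(threshold=-200.0, buffer_size=2, consolidation_steps=2)
elbos = [-150.0, -250.0, -300.0, -400.0, -400.0, -400.0, -350.0]
events = [tracker.observe(f"x{i}", e) for i, e in enumerate(elbos)]
```

In this toy run the two samples seen immediately after the first expansion are ignored even though their ELBO is low, which mirrors the role of the consolidation period: a lower threshold or a longer consolidation period would both reduce the expansion rate.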
Omniglot
For Omniglot, we employed an MLP encoder with layer sizes of to form the shared representation; followed by a softmax layer for , with components; and a linear layer with output dimensions, for each component, to output the posterior parameters of the corresponding dimensional latent Gaussian. Half of the output dimensions were used directly for the mean, while the other half were passed through a softplus activation to produce the variance. The decoder consisted of one fully-connected layer of dimensions, followed by a Bernoulli output layer for the reconstruction. Both encoder and decoder networks employed ReLU nonlinearities, and training was performed using the ADAM optimiser with learning rate
For dynamic expansion, the same hyperparameters were used as in the MNIST case.
C.2 Supervised continual learning benchmark
For this external comparison, we employ the same hyperparameters as those reported in the previous work (Hsu et al., 2018; van de Ven & Tolias, 2019). The encoder comprised two fully-connected ReLU layers with dimensions, and we use a dimensional latent space with a capacity of components (matching the number of labels). The decoder also comprises two fully-connected ReLU layers with dimensions, and Bernoulli outputs. Dynamic expansion is performed in a supervised fashion (i.e. expanding when new class labels are introduced), and MGR is also used, with snapshotting at the end of each task.
C.3 Unsupervised i.i.d. learning benchmark
For the external comparison, we employ the same hyperparameters as those reported in the previous work (Joo et al., 2019; Nalisnick & Smyth, 2017). For MNIST, the encoder comprised two fully-connected ReLU layers with dimensions, and a dimensional latent space with a maximum mixture capacity of components. The decoder comprised a single fully-connected ReLU layer with dimensions, and Bernoulli outputs. For Omniglot, the same architecture was used, but with a dimensional latent space, and with maximum mixture components. By default, we employ dynamic expansion, and use MGR for the sequential learning case, using the “dynamic” snapshot approach. Training was performed using the ADAM optimiser with learning rate .