1 Introduction
Nonparametric Bayesian priors, such as the Dirichlet Process (DP), have been widely adopted in the probabilistic graphical modeling community. Their ability to generate an infinite number of probability distributions using a discrete latent variable makes them well suited to automatic model selection. The best-known applications of the DP have, however, been limited to classical probabilistic graphical models such as Dirichlet Process Mixture Models and Hierarchical Dirichlet Process Hidden Markov Models [1, 4, 20].
Recently, deep generative models such as Deep Latent Gaussian Models (DLGMs) and Variational AutoEncoders (VAEs) [9, 17] have shown considerable success in modeling and generating complex data such as images. Various proposals to generalize these models to the mixture and nonparametric mixture cases have been made [15, 14, 3, 7]. Introducing such priors on top of the deep generative model can improve its generative capabilities, preserve class structure in the latent representation space, and offer a nonparametric way of performing model selection with respect to the size of the generative model.
The main challenge posed by such models lies in the inference process. Deep generative models with continuous latent variables owe their success mainly to the reparameterization trick [9, 17]. This approach provides an efficient and scalable method for obtaining low-variance estimates of the gradient of the variational lower bound with respect to the variational posterior parameters. Applying this approach directly to the variational posterior of the DP is not straightforward, because a reparameterization trick for the beta distribution is hard to obtain [18]. One approach to bypass this issue was proposed by [14], where the authors used the Kumaraswamy distribution [11] as a higher-entropy alternative to the beta distribution in the variational posterior. However, by deriving the form of the variational posterior directly from the variational lower bound, we can show that the appropriate distribution is in fact the beta distribution.
In this paper we provide an alternative treatment of the variational posterior of the DPDLGMM, where we combine classical variational inference, to derive the variational posteriors of the beta distributions and cluster hidden variables, with neural variational inference for the hidden variables of the latent Gaussian model. This leads to gradient ascent updates over the parameters present in nonlinear transformations, where the reparameterization trick can be applied knowing the cluster assignment. For the remaining parameters, closed-form solutions can be obtained by maximizing the evidence lower bound.
2 Dirichlet Process Deep Latent Gaussian Mixture Models
Deep latent Gaussian models can be generalized to the Dirichlet process mixture case by adding a Dirichlet process prior on the hidden cluster assignments. We denote these cluster assignments by . Following the assignment of a cluster hidden variable, a deep latent Gaussian model is defined for the assigned cluster, similarly to [17]. We adopt the stick-breaking construction of the Dirichlet Process [19]. The generative process of the model (Figure 1) is given by:
where is the layer hidden representation constructed using a nonlinear transformation represented by a neural network for the cluster assignment . For simplicity, we consider diagonal covariance matrices for each layer where the diagonal elements are , hence represents the element-wise product. The generalization to full covariance matrices is straightforward using the Cholesky decomposition.
We denote by the concentration parameter of the Dirichlet process, which is a hyperparameter to be tuned manually. The term represents the emission distribution of the observable, usually chosen to be a normal distribution for continuous variables or a Bernoulli distribution for binary variables. We denote the parameters of the generative model by:
The model thus has an infinite number of parameters due to the Dirichlet process prior. Furthermore, the posterior distribution of the hidden variables cannot be computed in closed form. In order to perform inference on the model we need to use approximate methods such as Markov Chain Monte Carlo (MCMC) or Variational Inference. MCMC methods are not well suited to high-dimensional models such as the DPDLGMM, where convergence of the Markov chain to the true posterior can prove to be slow and hard to diagnose [2].
In the next section, we develop a structured variational inference algorithm for the DPDLGMM. We show that by choosing a suitable structure for the variational posterior, closed-form solutions can be obtained for the updates of the truncated variational posteriors of the beta distributions, the variational posteriors of the cluster hidden variables, and the optimal prior parameters maximizing the evidence lower bound.
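As a concrete illustration of the stick-breaking construction adopted in this section, the following minimal Python sketch draws mixture weights from a truncated stick-breaking prior. The names `alpha` (concentration parameter) and `truncation`, and the convention of fixing the last stick fraction to one so that the weights sum to one, are assumptions made for illustration and are not taken from the model definition.

```python
import random


def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw mixture weights from a truncated stick-breaking construction.

    Each fraction v_k is drawn from Beta(1, alpha); the last fraction is
    fixed to 1 (an illustrative truncation convention) so the weights sum to 1.
    """
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for k in range(truncation):
        v = 1.0 if k == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


# Hypothetical settings, chosen only for the example.
weights = stick_breaking_weights(alpha=1.0, truncation=10, rng=random.Random(0))
```

Smaller values of `alpha` concentrate the mass on the first few components, while larger values spread it over more components.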
3 Structured Variational Inference
For a brief review of variational methods, we denote by the samples present in the dataset, assumed to be independent and identically distributed. The log-likelihood of the model is intractable due to the required marginalization over all the hidden variables. To bypass this marginalization, we introduce an approximate distribution and use Jensen’s inequality to obtain a lower bound [8]:
(1) 
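The bound obtained from Jensen's inequality can be checked numerically on a toy model with a single discrete latent variable. The sketch below is only an illustration: the table `joint` of values p(x, z = k) is hypothetical, and the functions are not part of the proposed model.

```python
import math


def log_evidence(joint):
    """log p(x) = log sum_k p(x, z=k) for a discrete latent variable z."""
    return math.log(sum(joint))


def elbo(joint, q):
    """E_q[log p(x, z) - log q(z)], skipping states where q(z) = 0."""
    return sum(
        qz * (math.log(pz) - math.log(qz))
        for pz, qz in zip(joint, q)
        if qz > 0.0
    )


joint = [0.10, 0.05, 0.25]  # hypothetical values of p(x, z=k) for one x
posterior = [p / sum(joint) for p in joint]  # exact posterior p(z | x)
uniform = [1.0 / len(joint)] * len(joint)  # an arbitrary approximate posterior

gap_uniform = log_evidence(joint) - elbo(joint, uniform)
gap_posterior = log_evidence(joint) - elbo(joint, posterior)
```

In this toy case the gap is non-negative for the uniform choice and vanishes, up to floating-point error, when the approximate distribution equals the exact posterior.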
We can show that if the distribution is a good approximation of the true posterior, maximizing the evidence lower bound (ELBO) with respect to the model parameters is equivalent to maximizing the log-likelihood. For deep generative models, most state-of-the-art methods use inference networks to construct the posterior distribution [17, 14]. For deep mixture models with discrete latent variables, this approach leads to a mixture density variational posterior where the reparameterization trick requires additional investigation [5]. Our approach combines standard variational Bayes and neural variational inference. We approximate the true posterior using the following structured variational posterior:
(2) 
where is a truncation level for the variational posterior of the beta distributions obtained by supposing that [1]. We assume a factorized posterior over the hidden layers , where the intra-layer dependencies are preserved.
3.1 Deriving the variational posteriors and
Deriving the nature of the posterior distributions of the hidden layers using the variational approach is intractable due to the nonlinearities present in the model. Thus, we take a similar approach to [17], and we assume that the variational posterior is specified by an inference network, where the parameters of the distribution are the outputs of deep neural networks and of parameters for the layer and the cluster:
In contrast to the hidden layers, we can use the proposed variational posterior of equation (2) to derive closed-form solutions for and . Let us consider the Kullback–Leibler definition of the ELBO:
By plugging the variational posterior and isolating terms and terms, we can analytically derive the optimal distributions and maximizing :
where the fixed point equations for the variational parameters and are:
(3)  
(4) 
(5) 
The fixed point equation of requires evaluating an expectation over the hidden layers. This can be done by sampling from the variational posterior of each hidden layer and then forwarding the sample through the generative model:
(6) 
A key insight here is the following: if a cluster is incapable of reconstructing a sample from the variational posterior, this will reinforce the belief that should not be assigned to that cluster. Furthermore, the estimation of the expectation can be performed using the same reparameterization trick that we will develop in section 3.3.
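To make the fixed point updates more concrete, the sketch below assumes that the beta parameters follow the standard truncated updates for Dirichlet process mixtures given in [1], and that the cluster responsibilities are obtained by normalizing per-cluster log scores. The names `phi`, `alpha`, `gamma1`, `gamma2`, and `log_scores` are illustrative, and the log scores (which would include the Monte Carlo reconstruction term) are taken as given rather than computed.

```python
import math


def update_beta_params(phi, alpha):
    """Truncated beta updates in the style of [1] (assumed form).

    phi[n][k] is the responsibility of cluster k for sample n, with T columns.
    Returns the two parameter lists of the T - 1 beta posteriors.
    """
    num_clusters = len(phi[0])
    counts = [sum(row[k] for row in phi) for k in range(num_clusters)]
    gamma1, gamma2 = [], []
    for k in range(num_clusters - 1):
        gamma1.append(1.0 + counts[k])  # mass assigned to cluster k
        gamma2.append(alpha + sum(counts[k + 1:]))  # mass assigned beyond k
    return gamma1, gamma2


def normalize_log_scores(log_scores):
    """Turn per-cluster log scores into responsibilities that sum to one."""
    shift = max(log_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A cluster with a very low log score, for instance one that reconstructs the sample poorly, receives a responsibility close to zero after normalization, which matches the intuition stated above.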
3.2 Closed-form updates for and
In addition to the variational posteriors of the beta distributions and the cluster assignments, closed-form solutions can be obtained for the updates of and . Let us reconsider the evidence lower bound of equation (3), where we isolate only the terms that depend on the prior parameters. We have:
where represents the covariance matrix of the layer. By setting the derivative of with respect to the parameters to zero, we obtain:
(7) 
(8) 
where to extract the diagonal elements we perform an element-wise multiplication by the identity matrix I. The update rules obtained are similar to the M-step of a classical Gaussian Mixture Model, except that in this case the updates are performed on the last hidden layer of the generative model, and the E-step of equation (5) takes into account all the hidden layers. Detailed derivations of the previous equations are presented in the supplementary material.
3.3 Stochastic Backpropagation
We next show how to perform stochastic backpropagation in order to maximize with respect to the parameters and . Similarly to the previous section, we isolate the terms in the evidence lower bound dependent on and . We have:
(9)
By taking the expectation over the hidden cluster variables , we obtain conditional expectations over the hidden layers knowing the cluster assignment. In order to backpropagate gradients of and , it suffices to perform a reparameterization trick for each cluster assignment at each hidden layer (proof in Appendix A). We can achieve this by sampling:
a sample from the posterior of the hidden layer can then be obtained by the following transformation:
where is supposed to be a diagonal matrix for simplicity. Following the previous analysis, we can derive an algorithm to perform inference on the proposed model, where between iterations of the fixed point update steps, epochs of gradient ascent are performed to obtain a local maximum of the ELBO with respect to and . Algorithm 1 summarizes the process.
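The per-cluster reparameterization described in this section can be sketched as follows, assuming a diagonal Gaussian posterior whose mean and log-variance would be produced by the inference network of a given cluster and layer. The function names and the log-variance parameterization are illustrative choices, not details taken from the model.

```python
import math
import random


def reparameterized_sample(mean, log_var, rng):
    """Return h = mean + std * eps with eps ~ N(0, I) (diagonal covariance).

    The noise eps is returned as well, so the sample can be seen as a
    deterministic function of (mean, log_var) once eps is fixed.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mean]
    sample = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mean, log_var, eps)]
    return sample, eps


def expectation_over_clusters(phi_n, per_cluster_terms):
    """Weight per-cluster Monte Carlo terms by the cluster responsibilities."""
    return sum(p * term for p, term in zip(phi_n, per_cluster_terms))


rng = random.Random(0)
# Hypothetical posterior parameters for one cluster and one hidden layer.
sample, eps = reparameterized_sample([1.0, -2.0], [0.0, math.log(4.0)], rng)
```

Because the randomness is isolated in `eps`, gradients with respect to the mean and variance parameters can flow through the sample for each cluster assignment, and the per-cluster terms are then combined using the responsibilities.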
4 Semi-Supervised Learning (SSL)
4.1 SSL using the DPDLGMM
In this section, similarly to [10] we consider a partially labeled dataset , where is the labeled part, represents the label of the sample , and represents the unlabeled part. The log likelihood can be divided for the labeled and unlabeled parts as:
The last equation follows from the fact that . By dividing the labeled and unlabeled parts of the dataset, we can follow the same approach presented in section 3 in order to derive a variational inference algorithm. In this case, the fixed point updates and the gradient ascent steps remain unchanged if we set for a labeled sample.
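One plausible way to read the treatment of labeled samples is that their cluster responsibilities are clamped to a one-hot vector on the observed label, while unlabeled samples keep their inferred responsibilities. The sketch below illustrates this reading only; the function name and the direct mapping from a label to a cluster index are assumptions.

```python
def clamp_responsibilities(phi_n, label=None):
    """Return one-hot responsibilities for a labeled sample.

    phi_n is the inferred responsibility vector of one sample. When a label
    is given, it is assumed (for illustration) to index a cluster directly.
    """
    if label is None:
        return list(phi_n)  # unlabeled sample: keep inferred responsibilities
    return [1.0 if k == label else 0.0 for k in range(len(phi_n))]
```

With this convention the same update code can be run on the whole dataset, since labeled and unlabeled samples differ only in how their responsibilities are obtained.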
4.2 The predictive distribution
In order to make predictions using the model, we need to evaluate the predictive distribution. Given a new sample , the objective is to evaluate the following quantity . This task requires an intractable marginalization over all the other hidden variables. However, similarly to [1], we can use the variational posterior to approximate the true posterior, which in turn leads to simpler expectation terms:
(10) 
where represents the forward pass over the generative model. The expectation with respect to the beta terms can be computed in closed form as a product of expectations over the beta posteriors. The second expectation can be evaluated using the Monte Carlo estimator of equation (6).
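The closed-form part of the predictive distribution can be illustrated with a short sketch. It assumes independent beta posteriors over the stick fractions, so that the expected weight of a cluster is a product of beta means, and it takes the per-cluster likelihood estimates as given. The names `gamma1`, `gamma2`, and `likelihoods` are illustrative.

```python
def expected_weights(gamma1, gamma2):
    """Expected stick-breaking weights under independent beta posteriors.

    gamma1[k], gamma2[k] parameterize the posterior of the k-th stick
    fraction; the last cluster receives the leftover mass (truncation).
    """
    weights = []
    remaining = 1.0  # product of the expected values of (1 - v_j) so far
    for a, b in zip(gamma1, gamma2):
        mean_v = a / (a + b)  # mean of a Beta(a, b) distribution
        weights.append(mean_v * remaining)
        remaining *= 1.0 - mean_v
    weights.append(remaining)
    return weights


def predictive_probabilities(weights, likelihoods):
    """Combine expected weights with per-cluster likelihood estimates."""
    unnormalized = [w * lik for w, lik in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

In practice the entries of `likelihoods` would come from Monte Carlo estimates obtained by sampling the hidden layers and running the forward pass of the generative model for each cluster.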
5 Experiments
5.1 Evaluation of the semi-supervised classification
We evaluate the semi-supervised classification capabilities of the model. We train our DPDLGMM model on the MNIST dataset [12] with train/validation/test splits equal to similarly to [14], with 10% of the labels provided, drawn at random. We run the process for 5 iterations, and we evaluate our model on the test set. We report the mean and standard deviation of the classification error in percentages in Table 1. Our method produces a score competitive with existing state-of-the-art methods: Deep Generative Models (DGM) [10] and Stick-Breaking Deep Generative Models (SBDGM) [14]. Unlike the previous approaches, the loss was not upweighted for the labeled samples. Figure 2 shows the t-SNE projections [13] obtained with 10% of the labels provided. We notice that by introducing a small fraction of labels the class structure was largely preserved in the latent space.
5.2 Data generation and visualization
To further test our model, we generate samples for each cluster from the models trained on both the MNIST and SVHN [16] datasets. The MNIST model is trained in an unsupervised manner, and the SVHN model is trained with semi-supervision where we provide 1000 randomly generated labels. The samples obtained are shown in Figure 4. For the unsupervised model, we notice that the clusters are representative of the shape of each digit. We plot the t-SNE projections of the MNIST test set for the unsupervised model in Figure 3. We notice that the digits belonging to the same true class tend to group with each other. However, two groups of the same class can be far apart in the embedding space. Our interpretation of this effect is that the DPDLGMM tends to separate the latent space in order to distinguish between the variations of hidden representations of the same class. The clusters obtained are not always representative of the true classes, which is a common effect with infinite mixture models. In a fully unsupervised setting, data can be explained by multiple correct clusterings. This effect can be countered by adding a small amount of supervision (Figure 2).
6 Conclusion
In this paper, we have presented a variational inference method for Dirichlet Process Deep Latent Gaussian Mixture Models. Our approach combines classical variational inference and neural variational inference. The algorithm derived is thus a standard variational inference algorithm, with fixed point updates over a subset of the parameters presenting linear dependencies. The parameters present in nonlinear transformations are updated using standard gradient ascent, where the reparameterization trick can be applied for the variational posterior of the stochastic hidden layers knowing the cluster assignments. Our approach shows promising results in both the unsupervised and semi-supervised cases. In future work, stochastic variational inference can be explored to speed up the training procedure. Our approach can also be generalized to other types of deep probabilistic graphical models.
Appendix A Proof of the reparameterization trick knowing the cluster assignment
The evidence lower bound of our model can be written in its general form as:
By introducing the following transformation:
and using the density transformation lemma:
we have:
where is a sample drawn from , thus we can backpropagate stochastic gradients for each class assignment.
Appendix B Stochastic Variational Inference
Updating the and parameters using epochs of gradient ascent significantly adds to the complexity of Algorithm 1. One possible approach is to perform stochastic variational inference [6] for fixed point update equations. This allows for the use of the same batch of data for the gradient ascent steps of and and the stochastic updates of the fixed point equations. Let us consider a batch , the updates in this case are:
where , , , , are computed for the minibatch using equations (3),(4),(7), and (8) respectively. In order to guarantee convergence must satisfy:
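A common way to meet the convergence requirement on the step size in stochastic variational inference [6] is a polynomially decaying schedule, combined with an interpolation between the current parameters and their minibatch estimates. The sketch below follows that general recipe; the names `tau` and `kappa`, their default values, and the flat-list representation of the parameters are assumptions for illustration.

```python
def step_size(t, tau=1.0, kappa=0.7):
    """Decaying step size rho_t = (t + tau) ** (-kappa).

    For kappa in (0.5, 1], the step sizes sum to infinity while their
    squares have a finite sum, as required by the Robbins-Monro conditions.
    """
    if not 0.5 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0.5, 1]")
    return (t + tau) ** (-kappa)


def stochastic_update(current, minibatch_estimate, rho):
    """Interpolate between current parameters and their minibatch estimate."""
    return [(1.0 - rho) * c + rho * m for c, m in zip(current, minibatch_estimate)]
```

Early iterations then move the parameters strongly toward the minibatch estimates, while later iterations average over many minibatches.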
References
[1] Blei, D. M., Jordan, M. I., et al. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.
[2] Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[3] Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[4] Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, pp. 312–319, 2008.
[5] Graves, A. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
[6] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[7] Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
[8] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[9] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[10] Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
[11] Kumaraswamy, P. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1–2):79–88, 1980.
[12] LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
[13] Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[14] Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016.
[15] Nalisnick, E., Hertel, L., and Smyth, P. Approximate inference for deep latent Gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.
[16] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
[17] Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[18] Ruiz, F. R., Titsias, M. K., and Blei, D. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pp. 460–468, 2016.
[19] Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, pp. 639–650, 1994.
[20] Zhang, A., Gultekin, S., and Paisley, J. Stochastic variational inference for the HDP-HMM. In Artificial Intelligence and Statistics, pp. 800–808, 2016.