On the Variational Posterior of Dirichlet Process Deep Latent Gaussian Mixture Models

June 16, 2020 · Amine Echraibi et al.

Thanks to the reparameterization trick, deep latent Gaussian models have recently shown tremendous success in learning latent representations. The ability to couple them with nonparametric priors such as the Dirichlet Process (DP) has not, however, seen similar success, due to its non-parameterizable nature. In this paper, we present an alternative treatment of the variational posterior of the Dirichlet Process Deep Latent Gaussian Mixture Model (DP-DLGMM), in which we show that the prior cluster parameters and the variational posteriors of the beta distributions and cluster hidden variables can be updated in closed form. This leads to a standard reparameterization trick on the Gaussian latent variables knowing the cluster assignments. We demonstrate our approach on standard benchmark datasets and show that our model is capable of generating realistic samples for each cluster obtained, and manifests competitive performance in a semi-supervised setting.


1 Introduction

Nonparametric Bayesian priors, such as the Dirichlet Process (DP), have been widely adopted in the probabilistic graphical modeling community. Their ability to generate an infinite number of probability distributions from a discrete latent variable makes them ideally suited for automatic model selection. However, the best-known applications of the DP have been limited to classical probabilistic graphical models such as Dirichlet Process Mixture Models and Hierarchical Dirichlet Process Hidden Markov Models [1, 4, 20].

Recently, deep generative models such as Deep Latent Gaussian Models (DLGMs) and Variational AutoEncoders (VAEs) [9, 17] have shown huge success in modeling and generating complex data structures such as images. Various proposals have been made to generalize these models to the mixture and nonparametric mixture cases [15, 14, 3, 7]. Introducing such priors on top of the deep generative model can improve its generative capabilities, preserve class structure in the latent representation space, and offer a nonparametric way of performing model selection with respect to the size of the generative model.

The main challenge posed by such models lies in the inference process. Deep generative models with continuous latent variables owe their success mainly to the reparameterization trick [9, 17]. This approach provides an efficient and scalable method for obtaining low-variance estimates of the gradient of the variational lower bound with respect to the variational posterior parameters. Applying this approach directly to the variational posterior of the DP is not straightforward, because a reparameterization trick for the beta distributions is hard to obtain [18]. One approach to bypass this issue has been proposed by [14], where the authors used the Kumaraswamy distribution [11] as a higher-entropy alternative to the beta distribution in the variational posterior. However, by deriving the nature of the variational posterior directly from the variational lower bound, we can show that the appropriate distribution is in fact the beta distribution.

In this paper we provide an alternative treatment of the variational posterior of the DP-DLGMM, where we combine classical variational inference to derive the variational posteriors of the beta distributions and cluster hidden variables, and neural variational inference for the hidden variables of the latent Gaussian model. This leads to gradient ascent updates over the parameters present in nonlinear transformations where the reparameterization trick can be applied knowing the cluster assignment. As for the remaining parameters, closed-form solutions can be obtained by maximization of the evidence lower bound.

2 Dirichlet Process Deep Latent Gaussian Mixture Models

Generalizing deep latent Gaussian models to the Dirichlet process mixture case can be obtained by adding a Dirichlet process prior on the hidden cluster assignments. We denote these cluster assignments by $z_n$. Following the assignment of a cluster hidden variable, a deep latent Gaussian model is defined for the assigned cluster similar to [17]. We adopt the stick-breaking construction of the Dirichlet Process [19]. The generative process of the model (figure 1) is given by:

\begin{align*}
v_k \mid \eta &\sim \mathrm{Beta}(1, \eta), \qquad \pi_k(\mathbf{v}) = v_k \prod_{j=1}^{k-1} (1 - v_j), \\
z_n \mid \mathbf{v} &\sim \mathrm{Cat}\big(\pi(\mathbf{v})\big), \\
h_n^{(L)} \mid z_n = k &\sim \mathcal{N}\big(\mu_k^{(L)}, \Sigma_k^{(L)}\big), \\
h_n^{(l)} \mid h_n^{(l+1)}, z_n = k &= T_k^{(l)}\big(h_n^{(l+1)}\big) + \sigma_k^{(l)} \odot \epsilon_n^{(l)}, \qquad \epsilon_n^{(l)} \sim \mathcal{N}(0, I), \quad l = 1, \dots, L-1, \\
x_n \mid h_n^{(1)}, z_n = k &\sim p\big(x_n \mid T_k^{(0)}(h_n^{(1)})\big),
\end{align*}

where $h_n^{(l)}$ is the $l$-th layer hidden representation, constructed using a nonlinear transformation $T_k^{(l)}$ represented by a neural network for the cluster assignment $z_n = k$. For simplicity, we consider diagonal covariance matrices for each layer where the diagonal elements are $\sigma_k^{(l)}$; hence $\odot$ represents the element-wise product. The generalization to full covariance matrices is straightforward using the Cholesky decomposition.
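As a concrete illustration, the following is a minimal NumPy sketch of ancestral sampling from this generative process, assuming a finite truncation K of the stick-breaking construction and a single linear "layer" per cluster standing in for the networks T_k; the names, shapes, and Bernoulli emission are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L_dim, D = 20, 8, 784       # truncation level, latent dim, data dim (illustrative)
eta = 1.0                      # DP concentration parameter

# Stick-breaking weights pi(v) from v_k ~ Beta(1, eta)
v = rng.beta(1.0, eta, size=K)
pi = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))

# Per-cluster prior parameters and placeholder decoders (stand-ins for T_k)
mu_L = rng.normal(size=(K, L_dim))             # top-layer means mu_k^(L)
sigma_L = np.ones((K, L_dim))                  # top-layer standard deviations
W = rng.normal(scale=0.1, size=(K, D, L_dim))  # one linear map per cluster

def sample(n):
    z = rng.choice(K, size=n, p=pi / pi.sum())              # cluster assignments z_n
    h = mu_L[z] + sigma_L[z] * rng.normal(size=(n, L_dim))  # h_n^(L) | z_n
    logits = np.einsum('ndl,nl->nd', W[z], h)               # T_{z_n}(h_n), Bernoulli emission
    x = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    return x, z

x, z = sample(5)
```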

We denote by $\eta$ the concentration parameter of the Dirichlet process, which is a hyperparameter to be tuned manually. The term $p\big(x_n \mid T_k^{(0)}(h_n^{(1)})\big)$ represents the emission distribution of the observable $x_n$, usually chosen to be a normal distribution for continuous variables or the Bernoulli distribution for binary variables. We denote the parameters of the generative model by:

\[
\theta = \Big\{ \mu_k^{(L)},\ \Sigma_k^{(L)},\ \sigma_k^{(l)},\ W_k^{(l)} \Big\}_{k \ge 1,\ l},
\]

where $W_k^{(l)}$ are the weights of the neural networks $T_k^{(l)}$.

The model thus has an infinite number of parameters due to the Dirichlet process prior. Furthermore, the posterior distribution of the hidden variables cannot be computed in closed-form. In order to perform inference on the model we need to use approximate methods such as Markov Chain Monte Carlo (MCMC) or variational inference. MCMC methods are not suitable for high-dimensional models such as the DP-DLGMM, where convergence of the Markov chain to the true posterior can prove to be slow and hard to diagnose [2].

In the next section, we develop a structured variational inference algorithm for DP-DLGMM. We show that by choosing a suitable structure for the variational posterior, closed-form solutions can be obtained for the updates of the truncated variational posteriors of the beta distributions, the variational posteriors of the cluster hidden variables, and the optimal prior parameters maximizing the evidence lower bound.

Figure 1: The graphical representation of the generative process of the model, with the convention $h_n^{(0)} = x_n$.

3 Structured Variational Inference

For a brief review of variational methods, we denote by $\mathbf{x} = \{x_n\}_{n=1}^{N}$ the samples present in the dataset, supposed to be independent and identically distributed. The log-likelihood of the model is intractable due to the required marginalization over all the hidden variables. In order to bypass this marginalization, we introduce an approximate distribution $q$ and use Jensen's inequality to obtain a lower bound [8]:

\[
\log p_\theta(\mathbf{x}) \;\ge\; \mathbb{E}_q\!\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{h}, \mathbf{z}, \mathbf{v})}{q(\mathbf{h}, \mathbf{z}, \mathbf{v})}\right] \;=\; \mathcal{L}[q, \theta]. \tag{1}
\]

We can show that if the distribution $q$ is a good approximation of the true posterior, maximizing the evidence lower bound (ELBO) with respect to the model parameters is equivalent to maximizing the log-likelihood. For deep generative models, most state-of-the-art methods use inference networks to construct the posterior distribution [17, 14]. For deep mixture models with discrete latent variables, this approach leads to a mixture density variational posterior where the reparameterization trick requires additional investigation [5]. Our approach combines standard variational Bayes and neural variational inference. We approximate the true posterior using the following structured variational posterior:

\[
q(\mathbf{v}, \mathbf{z}, \mathbf{h} \mid \mathbf{x}) \;=\; \prod_{k=1}^{K-1} q(v_k) \prod_{n=1}^{N} q(z_n \mid x_n) \prod_{l=1}^{L} q_\phi\big(h_n^{(l)} \mid x_n, z_n\big), \tag{2}
\]

where $K$ is a truncation level for the variational posterior of the beta distributions, obtained by supposing that $q(v_K = 1) = 1$ [1]. We assume a factorized posterior over the hidden layers $h_n^{(l)}$, where the intra-layer dependencies are conserved.

3.1 Deriving the variational posteriors $q(z_n)$ and $q(v_k)$

Deriving the nature of the posterior distributions of the hidden layers using the variational approach is intractable due to the nonlinearities present in the model. Thus, we take a similar approach to [17], and we assume that the variational posterior is specified by an inference network, where the parameters of the distribution are the outputs of deep neural networks $\mu_\phi^{(l,k)}$ and $\sigma_\phi^{(l,k)}$ of parameters $\phi$ for the $l$-th layer and the $k$-th cluster:

\[
q_\phi\big(h_n^{(l)} \mid x_n, z_n = k\big) \;=\; \mathcal{N}\Big(h_n^{(l)} \,\Big|\, \mu_\phi^{(l,k)}(x_n),\ \mathrm{diag}\big(\sigma_\phi^{(l,k)}(x_n)\big)^2\Big).
\]
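For concreteness, one way to parameterize such a per-cluster inference network is sketched below in PyTorch for a single stochastic layer; the module structure, layer sizes, and the one-encoder-per-cluster layout are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ClusterEncoder(nn.Module):
    """Outputs (mu_phi^(l,k)(x), log sigma_phi^(l,k)(x)) for one layer l and one cluster k."""
    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_sigma = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        e = self.body(x)
        return self.mu(e), self.log_sigma(e)

# One encoder per cluster k = 1..K (truncation level K assumed)
K = 20
encoders = nn.ModuleList([ClusterEncoder() for _ in range(K)])
```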

In contrast to the hidden layers, we can use the proposed variational posterior of equation (2) to derive closed-form solutions for $q(z_n)$ and $q(v_k)$. Let us consider the Kullback–Leibler definition of the ELBO $\mathcal{L}[q, \theta]$:

\[
\mathcal{L}[q, \theta] \;=\; \log p_\theta(\mathbf{x}) \;-\; \mathrm{KL}\big[\, q(\mathbf{v}, \mathbf{z}, \mathbf{h} \mid \mathbf{x}) \,\big\|\, p_\theta(\mathbf{v}, \mathbf{z}, \mathbf{h} \mid \mathbf{x}) \,\big].
\]

By plugging in the variational posterior and isolating the $v_k$ terms and the $z_n$ terms, we can analytically derive the optimal distributions $q^*(v_k)$ and $q^*(z_n)$ maximizing $\mathcal{L}$:

\[
q^*(v_k) = \mathrm{Beta}\big(v_k \mid \gamma_{k,1}, \gamma_{k,2}\big), \qquad q^*(z_n = k) = r_{n,k},
\]

where the fixed point equations for the variational parameters $\gamma_{k,1}$, $\gamma_{k,2}$, and $r_{n,k}$ are:

\begin{align}
\gamma_{k,1} &= 1 + \sum_{n=1}^{N} r_{n,k}, \tag{3} \\
\gamma_{k,2} &= \eta + \sum_{n=1}^{N} \sum_{j=k+1}^{K} r_{n,j}, \tag{4} \\
r_{n,k} &\propto \exp\Big( \mathbb{E}_q[\log v_k] + \sum_{j=1}^{k-1} \mathbb{E}_q[\log(1 - v_j)] + \mathbb{E}_{q_\phi}\big[\log p_\theta\big(x_n, h_n \mid z_n = k\big) - \log q_\phi\big(h_n \mid x_n, z_n = k\big)\big] \Big). \tag{5}
\end{align}

The fixed-point equation (5) for $r_{n,k}$ requires the evaluation of an expectation over the hidden layers; this can be performed by sampling from the variational posterior of each hidden layer and then forwarding the sample through the generative model:

\[
\mathbb{E}_{q_\phi}\big[\cdot\big] \;\approx\; \frac{1}{S} \sum_{s=1}^{S} \Big[ \log p_\theta\big(x_n, \tilde{h}_n^{(s)} \mid z_n = k\big) - \log q_\phi\big(\tilde{h}_n^{(s)} \mid x_n, z_n = k\big) \Big], \qquad \tilde{h}_n^{(s)} \sim q_\phi\big(h_n \mid x_n, z_n = k\big). \tag{6}
\]

A key insight here is the following: if a cluster $k$ is incapable of reconstructing a sample $x_n$ from the variational posterior, this will reinforce the belief that $x_n$ should not be assigned to that cluster. Furthermore, the estimation of the expectation can be performed using the same reparameterization trick that we will develop in section 3.3.
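For illustration, here is a minimal NumPy sketch of the fixed-point updates (3)-(5), assuming the per-sample, per-cluster Monte-Carlo estimate of equation (6) is already available in an array `elbo_nk`; the array names and the digamma-based expectations of the log-beta terms are illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.special import digamma

def update_beta_params(r, eta):
    """Fixed-point updates (3)-(4) for the truncated beta posteriors.
    r: (N, K) responsibilities; eta: DP concentration."""
    s = r.sum(axis=0)                                    # per-cluster soft counts
    gamma1 = 1.0 + s                                     # eq. (3)
    gamma2 = eta + np.flip(np.cumsum(np.flip(s))) - s    # eta + sum over j > k, eq. (4)
    return gamma1, gamma2

def update_responsibilities(gamma1, gamma2, elbo_nk):
    """Fixed-point update (5): r_nk proportional to exp(E[log pi_k(v)] + per-cluster ELBO term)."""
    e_log_v = digamma(gamma1) - digamma(gamma1 + gamma2)     # E[log v_k]
    e_log_1mv = digamma(gamma2) - digamma(gamma1 + gamma2)   # E[log(1 - v_k)]
    e_log_pi = e_log_v + np.concatenate(([0.0], np.cumsum(e_log_1mv[:-1])))
    log_r = e_log_pi[None, :] + elbo_nk                      # (N, K) unnormalized log r_nk
    log_r -= log_r.max(axis=1, keepdims=True)                # numerical stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```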

3.2 Closed-Form updates for $\mu_k^{(L)}$ and $\Sigma_k^{(L)}$

In addition to the variational posteriors of the beta distributions and the cluster assignments, closed-form solutions can be obtained for the updates of $\mu_k^{(L)}$ and $\Sigma_k^{(L)}$. Let us reconsider the evidence lower bound of equation (1), where we isolate only the terms dependent on the prior parameters. We have:

\[
\mathcal{L}\big[\mu^{(L)}, \Sigma^{(L)}\big] \;=\; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{n,k}\, \mathbb{E}_{q_\phi}\Big[\log \mathcal{N}\big(h_n^{(L)} \,\big|\, \mu_k^{(L)}, \Sigma_k^{(L)}\big)\Big] \;+\; \mathrm{const},
\]

where $\Sigma_k^{(L)}$ represents the covariance matrix of the $L$-th layer. By setting the derivative of $\mathcal{L}$ with respect to the parameters to zero, we obtain:

\begin{align}
\mu_k^{(L)} &= \frac{\sum_{n=1}^{N} r_{n,k}\, \mathbb{E}_{q_\phi}\big[h_n^{(L)}\big]}{\sum_{n=1}^{N} r_{n,k}}, \tag{7} \\
\Sigma_k^{(L)} &= \frac{\sum_{n=1}^{N} r_{n,k}\, \mathbb{E}_{q_\phi}\Big[\big(h_n^{(L)} - \mu_k^{(L)}\big)\big(h_n^{(L)} - \mu_k^{(L)}\big)^{\top}\Big]}{\sum_{n=1}^{N} r_{n,k}} \odot I, \tag{8}
\end{align}

where, to extract the diagonal elements, we perform an element-wise multiplication by the identity matrix $I$. The update rules obtained are similar to the M-step of a classical Gaussian mixture model, except that in this case the updates are performed on the last hidden layer of the generative model, and the E-step of equation (5) takes into account all the hidden layers. Detailed derivations of the previous equations are presented in the supplementary material.
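A minimal NumPy sketch of updates (7)-(8) follows, assuming `h_L` holds one posterior sample (or the posterior mean) of the top hidden layer for each sample and each cluster; the array layout is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def update_prior_params(r, h_L):
    """Closed-form M-step-like updates (7)-(8).
    r:   (N, K)    responsibilities r_nk
    h_L: (N, K, d) samples/means of h_n^(L) under q_phi(. | x_n, z_n = k)"""
    Nk = r.sum(axis=0) + 1e-10                                  # effective cluster counts
    mu = np.einsum('nk,nkd->kd', r, h_L) / Nk[:, None]          # eq. (7)
    diff = h_L - mu[None, :, :]                                 # centered top-layer representations
    var = np.einsum('nk,nkd->kd', r, diff ** 2) / Nk[:, None]   # diagonal of eq. (8)
    return mu, var                                              # var = diag(Sigma_k^(L))
```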

3.3 Stochastic Backpropagation

We next show how to perform stochastic backpropagation in order to maximize $\mathcal{L}$ with respect to the parameters $W$ of the generative networks and $\phi$ of the inference networks. Similarly to the previous section, we isolate the terms in the evidence lower bound dependent on $W$ and $\phi$. We have:

\[
\mathcal{L}[W, \phi] \;=\; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{n,k}\, \mathbb{E}_{q_\phi}\!\left[\log \frac{p_\theta\big(x_n, h_n \mid z_n = k\big)}{q_\phi\big(h_n \mid x_n, z_n = k\big)}\right] \;+\; \mathrm{const}. \tag{9}
\]

By taking the expectation over the hidden cluster variables $z_n$, we obtain conditional expectations over the hidden layers knowing the cluster assignment. In order to backpropagate gradients of $W$ and $\phi$, it suffices to perform a reparameterization trick for each cluster assignment at each hidden layer (proof in Appendix A). We can achieve this by sampling:

\[
\epsilon_n^{(l)} \sim \mathcal{N}(0, I);
\]

a sample from the posterior of the hidden layer can then be obtained by the following transformation:

\[
h_n^{(l)} \;=\; \mu_\phi^{(l,k)}(x_n) \;+\; \sigma_\phi^{(l,k)}(x_n) \odot \epsilon_n^{(l)},
\]

where the variational covariance matrix is supposed to be diagonal for simplicity. Following the previous analysis, we can derive an algorithm to perform inference on the proposed model, where between iterations of the fixed point update steps, epochs of gradient ascent are performed to obtain a local maximum of the ELBO with respect to $W$ and $\phi$. Algorithm 1 summarizes the process.

  Input: dataset $\mathbf{x} = \{x_n\}_{n=1}^{N}$, truncation level $K$, concentration $\eta$
  Initialize $W$, $\phi$, $\{r_{n,k}\}$
  while not converged do
     $\gamma_{k,1}, \gamma_{k,2} \leftarrow$ fixed-point updates {(3), (4)}
     $\mu_k^{(L)} \leftarrow$ closed-form update {(7)}
     $\Sigma_k^{(L)} \leftarrow$ closed-form update {(8)}
     for each epoch do
        $W \leftarrow W + \rho\, \nabla_W \mathcal{L}$
        $\phi \leftarrow \phi + \rho\, \nabla_\phi \mathcal{L}$
     end for
     $r_{n,k} \leftarrow$ fixed-point update {(5)}
  end while
Algorithm 1 Variational Inference for the DP-DLGMM
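The per-cluster reparameterized gradient step of Algorithm 1 can be sketched as follows in PyTorch for a single stochastic layer; the `encoders`, `log_joint`, and `optimizer` callables are simplifying assumptions made for illustration, not the authors' code.

```python
import torch

def elbo_grad_step(x, r, encoders, log_joint, optimizer):
    """One gradient-ascent step on the terms of eq. (9), with a reparameterization
    trick per cluster assignment (single stochastic layer for brevity).
    x: (N, D) minibatch; r: (N, K) responsibilities, treated as constants here.
    encoders[k](x) -> (mu, log_sigma) of q_phi(h | x, z = k).
    log_joint(x, h, k) -> per-sample log p_theta(x, h | z = k)."""
    loss = torch.zeros(())
    for k in range(len(encoders)):
        mu, log_sigma = encoders[k](x)
        eps = torch.randn_like(mu)                 # eps ~ N(0, I)
        h = mu + log_sigma.exp() * eps             # reparameterized sample of h | z = k
        log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(h).sum(-1)
        elbo_k = log_joint(x, h, k) - log_q        # per-sample ELBO term for cluster k
        loss = loss - (r[:, k] * elbo_k).sum()     # weight by r_nk; negate for minimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return -loss.item()
```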

4 Semi-Supervised Learning (SSL)

4.1 SSL using the DP-DGLMM

In this section, similarly to [10], we consider a partially labeled dataset $\{(x_n, y_n)\}_{n \in \mathcal{L}} \cup \{x_n\}_{n \in \mathcal{U}}$, where $\mathcal{L}$ indexes the labeled part, $y_n$ represents the label of the sample $x_n$, and $\mathcal{U}$ indexes the unlabeled part. The log-likelihood can be divided between the labeled and unlabeled parts as:

\[
\log p_\theta(\mathbf{x}, \mathbf{y}) \;=\; \sum_{n \in \mathcal{L}} \log p_\theta\big(x_n, z_n = y_n\big) \;+\; \sum_{n \in \mathcal{U}} \log p_\theta(x_n).
\]

The last equation follows from the fact that $p_\theta(x_n, y_n) = p_\theta(x_n, z_n = y_n)$, i.e. the label is identified with the cluster assignment. By dividing the labeled and unlabeled parts of the dataset, we can follow the same approach presented in section 3 in order to derive a variational inference algorithm. In this case, the fixed point updates and the gradient ascent steps remain unchanged if we set $r_{n,k} = \mathbb{1}[y_n = k]$ for a labeled sample.
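As a small illustration of this modification, and under the assumption (ours) that the responsibilities are stored in a NumPy array, clamping the labeled rows to one-hot vectors could look like the following sketch.

```python
import numpy as np

def clamp_labeled(r, labels):
    """Fix r_nk to a one-hot encoding of y_n for labeled samples.
    r: (N, K) responsibilities; labels: (N,) integers in {0,...,K-1}, or -1 if unlabeled."""
    labeled = labels >= 0
    r[labeled] = 0.0
    r[labeled, labels[labeled]] = 1.0
    return r
```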

4.2 The predictive distribution

In order to make predictions using the model, we need to evaluate the predictive distribution. Given a new sample $x^{*}$, the objective is to evaluate the quantity $p(z^{*} = k \mid x^{*}, \mathbf{x})$. This task requires an intractable marginalization over all the other hidden variables. However, similarly to [1], we can use the variational posterior to approximate the true posterior, which in turn leads to simpler expectation terms:

\[
p(z^{*} = k \mid x^{*}, \mathbf{x}) \;\propto\; \mathbb{E}_{q}\big[\pi_k(\mathbf{v})\big] \, \mathbb{E}_{q_\phi}\Big[p_\theta\big(x^{*} \mid h^{*}, z^{*} = k\big)\Big], \tag{10}
\]

where $p_\theta\big(x^{*} \mid h^{*}, z^{*} = k\big)$ represents the forward pass over the generative model. The expectation with respect to the beta terms can be computed in closed form as a product of expectations over the beta posteriors. The second expectation can be evaluated using the Monte-Carlo estimator of equation (6).
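A minimal NumPy sketch of this predictive approximation follows, assuming the per-cluster Monte-Carlo estimates of the emission likelihood are already available in `lik_k` and using the closed-form expectations of the independent beta posteriors; the variable names are illustrative assumptions.

```python
import numpy as np

def predictive_cluster_probs(gamma1, gamma2, lik_k):
    """Approximate p(z* = k | x*, x) as in eq. (10).
    gamma1, gamma2: (K,) beta posterior parameters
    lik_k:          (K,) Monte-Carlo estimates of E_q[p(x* | h*, z* = k)]"""
    e_v = gamma1 / (gamma1 + gamma2)                                   # E[v_k]
    e_pi = e_v * np.concatenate(([1.0], np.cumprod(1.0 - e_v[:-1])))   # E[v_k] * prod_{j<k} E[1 - v_j]
    probs = e_pi * lik_k
    return probs / probs.sum()
```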

5 Experiments

5.1 Evaluation of the semi-supervised classification

We evaluate the semi-supervised classification capabilities of the model. We train our DP-DLGMM model on the MNIST dataset [12] with train-valid-test splits similar to [14], with 10% of the labels provided at random. We run the process for 5 iterations and evaluate our model on the test set. We report the mean and standard deviation of the classification error in percentages in Table 1. Our method produces a competitive score with existing state-of-the-art methods: Deep Generative Models (DGM) [10] and Stick-Breaking Deep Generative Models (SB-DGM) [14]. Unlike the previous approaches, the loss was not up-weighted for the labeled samples. Figure 2 shows the t-SNE projections [13] obtained with 10% of the labels provided. We notice that by introducing a small fraction of labels, the class structure is highly preserved in the latent space.

Figure 2: t-SNE plot of the second stochastic hidden layer on the MNIST test set for the semi-supervised (10% labels) version of the DP-DLGMM.
Figure 3: t-SNE plot of the second stochastic hidden layer on the MNIST test set for the unsupervised version of the DP-DLGMM.
Table 1: Semi-supervised classification error (%) on the MNIST test set with 10% of labels provided; comparison of kNN (k=5), DGM [10], SB-DGM [14], and DP-DLGMM.

5.2 Data generation and visualization

Figure 4: Generated samples from the DP-DLGMM model for the unsupervised version on the MNIST dataset (left) and the semi-supervised version on the SVHN dataset (right).

To further test our model, we generate samples for each cluster from the models trained on both the MNIST and SVHN [16] datasets. The MNIST model is trained in an unsupervised manner, and the SVHN model is trained with semi-supervision, where we provide 1000 randomly drawn labels. The samples obtained are shown in figure 4. For the unsupervised model, we notice that the clusters are representative of the shape of each digit. We plot the t-SNE projections of the MNIST test set for the unsupervised model in Figure 3. We notice that digits belonging to the same true class tend to group together; however, two groups of the same class can be far apart in the embedding space. The interpretation we draw from this effect is that the DP-DLGMM tends to partition the latent space in order to distinguish between variations of hidden representations of the same class. The clusters obtained are not always representative of the true classes, which is a common effect with infinite mixture models: in a fully unsupervised setting, the data can be explained by multiple correct clusterings. This effect can simply be countered by adding a small amount of supervision (figure 2).

6 Conclusion

In this paper, we have presented a variational inference method for Dirichlet Process Deep Latent Gaussian Mixture Models. Our approach combines classical variational inference and neural variational inference. The algorithm derived is thus a standard variational inference algorithm, with fixed point updates over a subset of the parameters presenting linear dependencies. The parameters present in nonlinear transformations are updated using standard gradient ascent where the reparameterization trick can be applied for the variational posterior of the stochastic hidden layers knowing the cluster assignments. Our approach shows promising results both for the unsupervised and semi-supervised cases. In future work, stochastic variational inference can be explored to speed-up the training procedure. Our approach can also be generalized to other types of deep probabilistic graphical models.

Appendix A Proof of the reparameterization trick knowing the cluster assignment

The evidence lower bound of our model can be written in its general form as:

\[
\mathcal{L} \;=\; \sum_{n=1}^{N} \sum_{k=1}^{K} r_{n,k}\, \mathbb{E}_{q_\phi(h_n \mid x_n, z_n = k)}\!\left[\log \frac{p_\theta\big(x_n, h_n \mid z_n = k\big)}{q_\phi\big(h_n \mid x_n, z_n = k\big)}\right] \;+\; \mathrm{const}.
\]

By introducing the following transformation:

\[
h_n^{(l)} \;=\; g_\phi^{(l,k)}\big(\epsilon_n^{(l)}\big) \;=\; \mu_\phi^{(l,k)}(x_n) + \sigma_\phi^{(l,k)}(x_n) \odot \epsilon_n^{(l)}, \qquad \epsilon_n^{(l)} \sim \mathcal{N}(0, I),
\]

and using the density transformation lemma:

\[
q_\phi\big(h_n^{(l)} \mid x_n, z_n = k\big)\, \mathrm{d}h_n^{(l)} \;=\; \mathcal{N}\big(\epsilon_n^{(l)} \mid 0, I\big)\, \mathrm{d}\epsilon_n^{(l)},
\]

we have:

\[
\nabla_\phi\, \mathbb{E}_{q_\phi(h_n \mid x_n, z_n = k)}\big[f(h_n)\big] \;=\; \mathbb{E}_{\epsilon_n \sim \mathcal{N}(0, I)}\Big[\nabla_\phi\, f\big(g_\phi^{(k)}(\epsilon_n)\big)\Big] \;\approx\; \frac{1}{S} \sum_{s=1}^{S} \nabla_\phi\, f\big(g_\phi^{(k)}(\epsilon_n^{(s)})\big),
\]

where $\epsilon_n^{(s)}$ is a sample drawn from $\mathcal{N}(0, I)$; thus we can backpropagate stochastic gradients for each class assignment.
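A small PyTorch sanity check of this identity for a single, fixed cluster assignment is sketched below, using a toy quadratic integrand; the setup is purely illustrative and not part of the paper.

```python
import torch

# Toy check of the reparameterized gradient: differentiate an MC estimate of
# E_{q}[||h||^2] through h = mu + sigma * eps and compare with the analytic gradient.
mu = torch.zeros(5, requires_grad=True)
log_sigma = torch.zeros(5, requires_grad=True)

S = 100_000
eps = torch.randn(S, 5)                        # eps ~ N(0, I)
h = mu + log_sigma.exp() * eps                 # reparameterized samples for one cluster
estimate = (h ** 2).sum(dim=1).mean()          # MC estimate of ||mu||^2 + sum(sigma^2)
estimate.backward()

print(mu.grad)         # approx. 2 * mu = 0
print(log_sigma.grad)  # approx. 2 * sigma^2 = 2 elementwise, matching the analytic gradient
```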

Appendix B Stochastic Variational Inference

Updating the $W$ and $\phi$ parameters using epochs of gradient ascent significantly adds to the complexity of Algorithm 1. One possible approach is to perform stochastic variational inference [6] for the fixed point update equations. This allows the use of the same batch of data for the gradient ascent steps of $W$ and $\phi$ and for the stochastic updates of the fixed point equations. Let us consider a batch $\mathbf{x}^{(t)}$ at step $t$; the updates in this case are:

\begin{align*}
\gamma_{k,1}^{(t+1)} &= (1 - \rho_t)\, \gamma_{k,1}^{(t)} + \rho_t\, \hat{\gamma}_{k,1}, &
\gamma_{k,2}^{(t+1)} &= (1 - \rho_t)\, \gamma_{k,2}^{(t)} + \rho_t\, \hat{\gamma}_{k,2}, \\
\mu_k^{(L, t+1)} &= (1 - \rho_t)\, \mu_k^{(L, t)} + \rho_t\, \hat{\mu}_k^{(L)}, &
\Sigma_k^{(L, t+1)} &= (1 - \rho_t)\, \Sigma_k^{(L, t)} + \rho_t\, \hat{\Sigma}_k^{(L)},
\end{align*}

where $\hat{\gamma}_{k,1}$, $\hat{\gamma}_{k,2}$, $\hat{\mu}_k^{(L)}$, and $\hat{\Sigma}_k^{(L)}$ are computed for the minibatch using equations (3), (4), (7), and (8), respectively. In order to guarantee convergence, the step sizes $\rho_t$ must satisfy:

\[
\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty.
\]
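A minimal NumPy sketch of such a stochastic update with a Robbins-Monro step-size schedule follows; the schedule rho_t = (t + tau)^(-kappa) and the parameter names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def step_size(t, tau=1.0, kappa=0.7):
    """Robbins-Monro schedule: sum(rho_t) diverges and sum(rho_t^2) converges for kappa in (0.5, 1]."""
    return (t + tau) ** (-kappa)

def svi_update(old, minibatch_estimate, t):
    """Convex combination of the previous global parameter and its minibatch estimate."""
    rho = step_size(t)
    return (1.0 - rho) * old + rho * minibatch_estimate

# Example: stochastic update of the beta posterior parameter gamma_{k,1}
gamma1 = np.ones(20)
gamma1_hat = np.full(20, 3.0)   # minibatch estimate from eq. (3), rescaled to the full dataset size
gamma1 = svi_update(gamma1, gamma1_hat, t=1)
```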

References