1 Introduction
Clustering is the process of grouping similar objects together, and it is one of the most fundamental tasks in machine learning and artificial intelligence. Over the past decades, a large family of clustering algorithms has been developed and successfully applied to a wide range of real-world tasks Ng et al. (2002); Ye et al. (2008); Yang et al. (2010); Xie et al. (2016). Generally speaking, clustering methods fall into two families: similarity-based clustering and feature-based clustering. Similarity-based clustering builds models upon a distance matrix, which is an $N \times N$ matrix that measures the distance between each pair of the $N$ samples. One of the most famous similarity-based clustering methods is Spectral Clustering (SC) Von Luxburg (2007), which leverages the Laplacian spectra of the distance matrix to reduce dimensionality before clustering. Similarity-based clustering methods have the advantage that domain-specific similarity or kernel functions can be easily incorporated into the models. However, these methods suffer from scalability issues due to the super-quadratic running time for computing spectra.

Different from similarity-based methods, a feature-based method takes an $N \times D$ matrix as input, where $N$ is the number of samples and $D$ is the feature dimension. One popular feature-based clustering method is $K$-means, which aims to partition the $N$ samples into $K$ clusters so as to minimize the within-cluster sum of squared errors. Another representative feature-based clustering model is the Gaussian Mixture Model (GMM), which assumes that the data points are generated from a Mixture-of-Gaussians (MoG), and whose parameters are optimized by the Expectation Maximization (EM) algorithm. One advantage of GMM over $K$-means is that a GMM can generate samples by estimating the data density. Although $K$-means, GMM and their variants Ye et al. (2008); Liu et al. (2010) have been extensively used, learning good representations most suitable for clustering tasks is left largely unexplored.

Recently, deep learning has achieved widespread success in numerous machine learning tasks Krizhevsky et al. (2012); Zheng et al. (2014b); Szegedy et al. (2015); Zheng et al. (2014a); He et al. (2016); Zheng et al. (2015, 2016), where learning good representations by deep neural networks (DNN) lies at the core. Taking a similar approach, it is conceivable to conduct clustering analysis on good representations instead of raw data points. In a recent work, Deep Embedded Clustering (DEC) Xie et al. (2016) was proposed to simultaneously learn feature representations and cluster assignments by deep neural networks. Although DEC performs well in clustering, similar to $K$-means, DEC cannot model the generative process of data and hence is not able to generate samples. Some recent works, e.g. VAE Kingma and Welling (2014), GAN Goodfellow et al. (2014), PixelRNN Oord et al. (2016), InfoGAN Chen et al. (2016) and PPGN Nguyen et al. (2016), have shown that neural networks can be trained to generate meaningful samples. The motivation of this work is to develop a clustering model based on neural networks that 1) learns good representations that capture the statistical structure of the data, and 2) is capable of generating samples.

In this paper, we propose a clustering framework, Variational Deep Embedding (VaDE), that combines VAE Kingma and Welling (2014) and a Gaussian Mixture Model for clustering tasks. VaDE models the data generative process by a GMM and a DNN $f$: 1) a cluster $c$ is picked by the GMM; 2) from which a latent representation $\mathbf{z}$ is sampled; 3) the DNN $f$ decodes $\mathbf{z}$ to an observation $\mathbf{x}$. Moreover, VaDE is optimized by using another DNN $g$ to encode the observed data $\mathbf{x}$ into the latent embedding $\mathbf{z}$, so that the Stochastic Gradient Variational Bayes (SGVB) estimator and the reparameterization trick Kingma and Welling (2014) can be used to maximize the evidence lower bound (ELBO). VaDE generalizes VAE in that a Mixture-of-Gaussians prior replaces the single Gaussian prior. Hence, VaDE is by design more suitable for clustering tasks (although VaDE can also be used for unsupervised feature learning or semi-supervised learning, we focus only on clustering tasks in this work). Specifically, the main contributions of the paper are:
We propose an unsupervised generative clustering framework, VaDE, that combines VAE and GMM;

We show how to optimize VaDE by maximizing the ELBO using the SGVB estimator and the reparameterization trick;

Experimental results show that VaDE outperforms the state-of-the-art clustering models on datasets from various modalities by a large margin;

We show that VaDE can generate highly realistic samples for any specified cluster, without using supervised information during training.
The diagram of VaDE is illustrated in Figure 1.
2 Related Work
Recently, it has been found that learning good representations plays an important role in clustering tasks. For example, DEC Xie et al. (2016) was proposed to learn feature representations and cluster assignments simultaneously by deep neural networks. In fact, DEC learns a mapping from the observed space to a lower-dimensional latent space, where it iteratively optimizes a KL divergence to minimize the within-cluster distance of each cluster. DEC achieved impressive performance on clustering tasks. However, the feature embedding in DEC is designed specifically for clustering and fails to uncover the real underlying structure of the data, so the model lacks the ability to extend to tasks beyond clustering, such as generating samples.
Deep generative models have recently attracted much attention because they can capture the data distribution with neural networks, from which unseen samples can be generated. GAN and VAE are among the most successful deep generative models in recent years. Both are appealing unsupervised generative models, and their variants have been extensively studied and applied in various tasks such as semi-supervised classification Kingma et al. (2014); Maaløe et al. (2016); Salimans et al. (2016); Makhzani et al. (2016); Abbasnejad et al. (2016), clustering Makhzani et al. (2016) and image generation Radford et al. (2016); Dosovitskiy and Brox (2016).
For example, Abbasnejad et al. (2016) proposed to use a mixture of VAEs for semi-supervised classification tasks, where the mixing coefficients of these VAEs are modeled by a Dirichlet process to adapt its capacity to the input data. SBVAE Nalisnick and Smyth (2016) also applied Bayesian nonparametric techniques to VAE, deriving a stochastic latent dimensionality from a stick-breaking prior, and achieved good performance on semi-supervised classification tasks. VaDE differs from SBVAE in that the cluster assignment and the latent representation are jointly considered in the Gaussian mixture prior, whereas SBVAE models the latent representation and the class variable separately, which fails to capture the dependence between them. Additionally, VaDE does not need class labels during training, while the labels of data are required by SBVAE due to its semi-supervised setting. Among the variants of VAE, Adversarial Auto-Encoder (AAE) Makhzani et al. (2016) can also do unsupervised clustering tasks. Different from VaDE, AAE uses a GAN to match the aggregated posterior with the prior of VAE, which makes its training procedure much more complex than that of VaDE. We will compare AAE with VaDE in the experiments.
Similar to VaDE, Nalisnick et al. (2016) proposed DLGMM to combine VAE and GMM. The crucial difference, however, is that VaDE uses a Mixture-of-Gaussians prior to replace the single Gaussian prior of VAE, which is suitable for clustering tasks by nature, while DLGMM uses a Mixture-of-Gaussians distribution as the approximate posterior of VAE and does not model the class variable. Hence, VaDE generalizes VAE to clustering tasks, whereas DLGMM is used to improve the capacity of the original VAE and is not suitable for clustering tasks by design. The recently proposed GMCVAE Shu et al. (2016) also combines VAE with GMM. However, the GMM in GMCVAE is used to model the transitions between video frames, which is the main difference from VaDE.
3 Variational Deep Embedding
In this section, we describe Variational Deep Embedding (VaDE), a model for the probabilistic clustering problem within the framework of the Variational Auto-Encoder (VAE).
3.1 The Generative Process
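Since VaDE is an unsupervised generative approach to clustering, we first describe its generative process. Specifically, suppose there are $K$ clusters; an observed sample $\mathbf{x} \in \mathbb{R}^D$ is generated by the following process:

1. Choose a cluster $c \sim \mathrm{Cat}(\boldsymbol{\pi})$

2. Choose a latent vector $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2 \mathbf{I})$

3. Choose a sample $\mathbf{x}$:

   (a) If $\mathbf{x}$ is binary

       i. Compute the expectation vector $\boldsymbol{\mu}_x$
          $$\boldsymbol{\mu}_x = f(\mathbf{z}; \boldsymbol{\theta}) \tag{1}$$
       ii. Choose a sample $\mathbf{x} \sim \mathrm{Ber}(\boldsymbol{\mu}_x)$

   (b) If $\mathbf{x}$ is real-valued

       i. Compute $\boldsymbol{\mu}_x$ and $\boldsymbol{\sigma}_x^2$
          $$[\boldsymbol{\mu}_x; \log \boldsymbol{\sigma}_x^2] = f(\mathbf{z}; \boldsymbol{\theta}) \tag{2}$$
       ii. Choose a sample $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x^2 \mathbf{I})$

where $K$ is a predefined parameter, $\pi_k$ is the prior probability for cluster $k$, $\boldsymbol{\pi} \in \mathbb{R}_+^K$, $\sum_{k=1}^{K} \pi_k = 1$, $\mathrm{Cat}(\boldsymbol{\pi})$ is the categorical distribution parametrized by $\boldsymbol{\pi}$, $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c^2$ are the mean and the variance of the Gaussian distribution corresponding to cluster $c$, $\mathbf{I}$ is an identity matrix, $f(\mathbf{z}; \boldsymbol{\theta})$ is a neural network whose input is $\mathbf{z}$ and which is parametrized by $\boldsymbol{\theta}$, and $\mathrm{Ber}(\boldsymbol{\mu}_x)$ and $\mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x^2 \mathbf{I})$ are the multivariate Bernoulli distribution and Gaussian distribution parametrized by $\boldsymbol{\mu}_x$ and $\{\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x\}$, respectively. The generative process is depicted in Figure 1.

According to the generative process above, the joint probability $p(\mathbf{x}, \mathbf{z}, c)$ can be factorized as:
$$p(\mathbf{x}, \mathbf{z}, c) = p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z}|c)\, p(c), \tag{3}$$
since $\mathbf{x}$ and $c$ are independent conditioned on $\mathbf{z}$. The probabilities are defined as:
$$p(c) = \mathrm{Cat}(c|\boldsymbol{\pi}) \tag{4}$$
$$p(\mathbf{z}|c) = \mathcal{N}(\mathbf{z}|\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2 \mathbf{I}) \tag{5}$$
$$p(\mathbf{x}|\mathbf{z}) = \mathrm{Ber}(\mathbf{x}|\boldsymbol{\mu}_x) \quad \text{or} \quad \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x^2 \mathbf{I}) \tag{6}$$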
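The generative process above amounts to ancestral sampling, which can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: `toy_decoder` is a hypothetical stand-in for the network $f(\mathbf{z}; \boldsymbol{\theta})$, and the mixture parameters are made-up values, not ones learned by VaDE.

```python
import math
import random


def sample_vade(pi, mu, sigma2, decoder, rng, binary=True):
    """Draw one (c, z, x) triple by ancestral sampling."""
    # Step 1: c ~ Cat(pi)
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    # Step 2: z ~ N(mu_c, sigma_c^2 I), sampled one dimension at a time
    z = [rng.gauss(m, math.sqrt(s2)) for m, s2 in zip(mu[c], sigma2[c])]
    # Step 3: decode z and sample the observation
    if binary:
        mu_x = decoder(z)  # Equation 1
        x = [1 if rng.random() < p else 0 for p in mu_x]
    else:
        mu_x, log_sigma2_x = decoder(z)  # Equation 2
        x = [rng.gauss(m, math.exp(0.5 * ls))
             for m, ls in zip(mu_x, log_sigma2_x)]
    return c, z, x


def toy_decoder(z):
    """Hypothetical stand-in for f(z; theta): fixed linear map plus sigmoid."""
    weights = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # 3 outputs, 2 latent dims
    return [1.0 / (1.0 + math.exp(-sum(w * zj for w, zj in zip(row, z))))
            for row in weights]


rng = random.Random(0)
pi = [0.3, 0.7]                      # illustrative mixing weights
mu = [[-2.0, 0.0], [2.0, 1.0]]       # illustrative cluster means
sigma2 = [[1.0, 1.0], [0.5, 0.5]]    # illustrative cluster variances
c, z, x = sample_vade(pi, mu, sigma2, toy_decoder, rng)
```

Fixing `c` instead of sampling it gives samples conditioned on a chosen cluster.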
3.2 Variational Lower Bound
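A VaDE instance is tuned to maximize the likelihood of the given data points. Given the generative process in Section 3.1, by using Jensen’s inequality, the log-likelihood of VaDE can be written as:
$$\log p(\mathbf{x}) = \log \int_{\mathbf{z}} \sum_{c} p(\mathbf{x}, \mathbf{z}, c)\, d\mathbf{z} \ge E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z}, c)}{q(\mathbf{z}, c|\mathbf{x})}\right] = \mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) \tag{7}$$
where $\mathcal{L}_{\mathrm{ELBO}}$ is the evidence lower bound (ELBO), and $q(\mathbf{z}, c|\mathbf{x})$ is the variational posterior used to approximate the true posterior $p(\mathbf{z}, c|\mathbf{x})$. In VaDE, we assume $q(\mathbf{z}, c|\mathbf{x})$ to be a mean-field distribution that can be factorized as:
$$q(\mathbf{z}, c|\mathbf{x}) = q(\mathbf{z}|\mathbf{x})\, q(c|\mathbf{x}) \tag{8}$$
In VaDE, similar to VAE, we use a neural network $g$ to model $q(\mathbf{z}|\mathbf{x})$:
$$[\tilde{\boldsymbol{\mu}}; \log \tilde{\boldsymbol{\sigma}}^2] = g(\mathbf{x}; \boldsymbol{\phi}) \tag{10}$$
$$q(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I}) \tag{11}$$
where $\boldsymbol{\phi}$ is the parameter of network $g$.
By substituting Equations 4, 5, 6 and 11 into the ELBO of Equation 7, and using the SGVB estimator and the reparameterization trick, $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be rewritten as follows (this is the case when the observation is binary; for the real-valued situation, the ELBO can be obtained in a similar way):
$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{D} \left[ x_i \log \boldsymbol{\mu}_x^{(l)}|_i + (1 - x_i) \log\left(1 - \boldsymbol{\mu}_x^{(l)}|_i\right) \right] - \frac{1}{2} \sum_{c=1}^{K} \gamma_c \sum_{j=1}^{J} \left( \log \boldsymbol{\sigma}_c^2|_j + \frac{\tilde{\boldsymbol{\sigma}}^2|_j}{\boldsymbol{\sigma}_c^2|_j} + \frac{\left(\tilde{\boldsymbol{\mu}}|_j - \boldsymbol{\mu}_c|_j\right)^2}{\boldsymbol{\sigma}_c^2|_j} \right) + \sum_{c=1}^{K} \gamma_c \log \frac{\pi_c}{\gamma_c} + \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \tilde{\boldsymbol{\sigma}}^2|_j \right) \tag{12}$$
where $L$ is the number of Monte Carlo samples in the SGVB estimator, $D$ is the dimensionality of $\mathbf{x}$ and $\boldsymbol{\mu}_x^{(l)}$, $x_i$ is the $i$-th element of $\mathbf{x}$, $J$ is the dimensionality of $\boldsymbol{\mu}_c$, $\boldsymbol{\sigma}_c^2$, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}^2$, $*|_j$ denotes the $j$-th element of $*$, $K$ is the number of clusters, $\pi_c$ is the prior probability of cluster $c$, and $\gamma_c$ denotes $q(c|\mathbf{x})$ for simplicity.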
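In Equation 12, we compute $\boldsymbol{\mu}_x^{(l)}$ as
$$\boldsymbol{\mu}_x^{(l)} = f(\mathbf{z}^{(l)}; \boldsymbol{\theta}) \tag{13}$$
where $\mathbf{z}^{(l)}$ is the $l$-th sample from $q(\mathbf{z}|\mathbf{x})$ in Equation 11, used to produce the Monte Carlo samples. According to the reparameterization trick, $\mathbf{z}^{(l)}$ is obtained by
$$\mathbf{z}^{(l)} = \tilde{\boldsymbol{\mu}} + \tilde{\boldsymbol{\sigma}} \circ \boldsymbol{\epsilon}^{(l)} \tag{14}$$
where $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\circ$ is element-wise multiplication, and $\tilde{\boldsymbol{\mu}}$, $\tilde{\boldsymbol{\sigma}}$ are derived by Equation 10.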
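As a minimal sketch of the reparameterization step in Equation 14, the snippet below draws the noise separately and then forms the latent sample as a deterministic function of the encoder outputs. The function name `reparameterize` and the numeric inputs are assumptions made for illustration. In an actual implementation the same expression would be written in an automatic-differentiation framework so that gradients flow through $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}$.

```python
import math
import random


def reparameterize(mu_tilde, log_sigma2_tilde, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, I), plus the noise used."""
    # All randomness is isolated in eps, which does not depend on the encoder.
    eps = [rng.gauss(0.0, 1.0) for _ in mu_tilde]
    # The encoder outputs log-variance, so sigma = exp(0.5 * log sigma^2).
    sigma = [math.exp(0.5 * ls) for ls in log_sigma2_tilde]
    z = [m + s * e for m, s, e in zip(mu_tilde, sigma, eps)]
    return z, eps


rng = random.Random(0)
mu_tilde = [0.5, -1.0]                   # illustrative encoder mean
log_sigma2_tilde = [0.0, math.log(4.0)]  # i.e. sigma = [1.0, 2.0]
z, eps = reparameterize(mu_tilde, log_sigma2_tilde, rng)
```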
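We now describe how to formulate $\gamma_c \triangleq q(c|\mathbf{x})$ in Equation 12 to maximize the ELBO. Specifically, $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be rewritten as:
$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = \int_{\mathbf{z}} \sum_{c} q(c|\mathbf{x})\, q(\mathbf{z}|\mathbf{x}) \left[ \log \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})} + \log \frac{p(c|\mathbf{z})}{q(c|\mathbf{x})} \right] d\mathbf{z} = \int_{\mathbf{z}} q(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\, d\mathbf{z} - \int_{\mathbf{z}} q(\mathbf{z}|\mathbf{x})\, D_{KL}\left(q(c|\mathbf{x}) \,\|\, p(c|\mathbf{z})\right) d\mathbf{z} \tag{15}$$
In Equation 15, the first term does not depend on $c$ and the second term is non-negative. Hence, to maximize $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$, $D_{KL}(q(c|\mathbf{x}) \,\|\, p(c|\mathbf{z})) \equiv 0$ should be satisfied. As a result, we use the following equation to compute $q(c|\mathbf{x})$ in VaDE:
$$q(c|\mathbf{x}) = p(c|\mathbf{z}) \equiv \frac{p(c)\, p(\mathbf{z}|c)}{\sum_{c'=1}^{K} p(c')\, p(\mathbf{z}|c')} \tag{16}$$
By using Equation 16, the information loss induced by the mean-field approximation can be mitigated, since $p(c|\mathbf{z})$ captures the relationship between $c$ and $\mathbf{z}$. It is worth noting that $p(c|\mathbf{z})$ is only an approximation to $q(c|\mathbf{x})$, and we find it works well in practice (we approximate $q(c|\mathbf{x})$ by: 1) sampling a $\mathbf{z}^{(i)} \sim q(\mathbf{z}|\mathbf{x})$; 2) computing $q(c|\mathbf{x}) = p(c|\mathbf{z}^{(i)})$ according to Equation 16).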
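Equation 16 is a normalized product of the mixing weight and a diagonal Gaussian density, so it can be evaluated stably in log space. The following sketch assumes diagonal covariances as in Equation 5. The helper names and the toy parameters are illustrative, not taken from the released code.

```python
import math


def log_gaussian_diag(z, mu, sigma2):
    """log N(z | mu, diag(sigma2)) for a diagonal-covariance Gaussian."""
    return sum(-0.5 * math.log(2.0 * math.pi * s2) - (zj - m) ** 2 / (2.0 * s2)
               for zj, m, s2 in zip(z, mu, sigma2))


def cluster_posterior(z, pi, mu, sigma2):
    """gamma_c = p(c) p(z|c) / sum_c' p(c') p(z|c')  (Equation 16)."""
    log_joint = [math.log(p) + log_gaussian_diag(z, m, s2)
                 for p, m, s2 in zip(pi, mu, sigma2)]
    # Subtract the maximum before exponentiating (log-sum-exp trick)
    # so that far-away clusters underflow to 0 instead of producing NaN.
    shift = max(log_joint)
    unnorm = [math.exp(v - shift) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


pi = [0.5, 0.5]                      # illustrative mixing weights
mu = [[-2.0, 0.0], [2.0, 0.0]]       # illustrative cluster means
sigma2 = [[1.0, 1.0], [1.0, 1.0]]    # illustrative cluster variances
gamma = cluster_posterior([2.0, 0.0], pi, mu, sigma2)
```

With these toy values the sample sits on the mean of the second cluster, so almost all of the responsibility goes to that cluster.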
3.3 Understanding the ELBO of VaDE
In this section, we provide some intuitions about the ELBO of VaDE. More specifically, the ELBO in Equation 7 can be further rewritten as:
$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - D_{KL}\left(q(\mathbf{z}, c|\mathbf{x}) \,\|\, p(\mathbf{z}, c)\right) \tag{17}$$
The first term in Equation 17 is the reconstruction term, which encourages VaDE to explain the dataset well. The second term is the Kullback-Leibler divergence from the Mixture-of-Gaussians (MoG) prior $p(\mathbf{z}, c)$ to the variational posterior $q(\mathbf{z}, c|\mathbf{x})$, which regularizes the latent embedding $\mathbf{z}$ to lie on a MoG manifold.

To demonstrate the importance of the KL term in Equation 17, we first train an Auto-Encoder (AE) with the same network architecture as VaDE, and then apply GMM on the latent representations from the learned AE, since a VaDE model without the KL term is almost equivalent to an AE. We refer to this model as AE+GMM. We also show the performance of using GMM directly on the observed space (GMM), of using VAE on the observed space and then GMM on the latent space from VAE (VAE+GMM; in this setting, VAE and GMM are optimized separately), as well as the performance of LDMGI Yang et al. (2010), AAE Makhzani et al. (2016) and DEC Xie et al. (2016), in Figure 2. The fact that VaDE significantly outperforms AE+GMM (without the KL term) and VAE+GMM confirms the importance of the regularization term and the advantage of jointly optimizing VAE and GMM in VaDE. We also illustrate the clusters and how they change over training epochs on the MNIST dataset in Figure 3, where we map the latent representations into 2D space by t-SNE Maaten and Hinton (2008).
4 Experiments
In this section, we evaluate the performance of VaDE on 5 benchmarks from different modalities: MNIST LeCun et al. (1998), HHAR Stisen et al. (2015), Reuters-10K Lewis et al. (2004), Reuters Lewis et al. (2004) and STL-10 Coates et al. (2011). We provide quantitative comparisons of VaDE with other clustering methods including GMM, AE+GMM, VAE+GMM, LDMGI Yang et al. (2010), AAE Makhzani et al. (2016) and the strong baseline DEC Xie et al. (2016). We use the same network architecture as DEC for a fair comparison. The experimental results show that VaDE achieves state-of-the-art performance on all these benchmarks. Additionally, we provide quantitative comparisons with other variants of VAE on the discriminative quality of the latent representations. The code of VaDE is available at https://github.com/slim1017/VaDE.
4.1 Datasets Description
The following datasets are used in our empirical experiments.

MNIST: The MNIST dataset consists of handwritten digits. The images are centered and of size 28 by 28 pixels. We reshaped each image to a 784-dimensional vector.

HHAR: The Heterogeneity Human Activity Recognition (HHAR) dataset contains sensor records from smart phones and smart watches. All samples are partitioned into categories of human activities and each sample is of dimensions.

REUTERS: There are around English news stories labeled with a category tree in the original Reuters dataset. Following DEC, we used root categories: corporate/industrial, government/social, markets, and economics as labels and discarded all documents with multiple labels, which results in a article dataset. We computed tf-idf features on the most frequent words to represent all articles. Similar to DEC, a random subset of documents is sampled, which is referred to as Reuters-10K, since some spectral clustering methods (e.g. LDMGI) cannot scale to the full Reuters dataset.

STL-10: The STL-10 dataset consists of color images of 96-by-96 pixel size. There are classes with examples each. Since clustering directly from raw pixels of high-resolution images is rather difficult, we extracted features of the STL-10 images by ResNet-50 He et al. (2016), which were then used to test the performance of VaDE and all baselines. More specifically, we applied a average pooling over the last feature map of ResNet-50 and the dimensionality of the features is .
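Table 1: Dataset statistics.

| Dataset | Samples | Input Dim | Clusters |
|---|---|---|---|
| MNIST | | | |
| HHAR | | | |
| REUTERS-10K | | | |
| REUTERS | | | |
| STL-10 | | | |

Table 2: Clustering accuracy of VaDE and the baselines on all datasets.

| Method | MNIST | HHAR | REUTERS-10K | REUTERS | STL-10 |
|---|---|---|---|---|---|
| GMM | | | | | |
| AE+GMM | | | | | |
| VAE+GMM | | | | | |
| LDMGI | | | | N/A | |
| AAE | | | | | |
| DEC | | | | | |
| VaDE | | | | | |

Table 3: MNIST error rate (%) for kNN on latent space.

| Method | k=3 | k=5 | k=10 |
|---|---|---|---|
| VAE | | | |
| DLGMM | | | |
| SBVAE | | | |
| VaDE | | | |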
4.2 Experimental Setup
As mentioned before, the same network architecture as DEC is adopted by VaDE for a fair comparison. Specifically, the architectures of $f$ and $g$ in Equation 1 and Equation 10 are  and , respectively, where $D$ is the input dimensionality. All layers are fully connected. The Adam optimizer Kingma and Ba (2015) is used to maximize the ELBO of Equation 7, and the mini-batch size is . The learning rate for MNIST, HHAR, Reuters-10K and STL-10 is and decreases every epochs with a decay rate of , and the learning rate for Reuters is with a decay rate of for every epoch. As for the generative process in Section 3.1, the multivariate Bernoulli distribution is used for the MNIST dataset, and the multivariate Gaussian distribution is used for the others. The number of clusters is fixed to the number of classes for each dataset, similar to DEC. We will vary the number of clusters in Section 4.6.
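The learning-rate schedule described above is a step decay: the rate is multiplied by a fixed factor once every fixed number of epochs. A minimal sketch follows. The function name and the numeric values in the example are hypothetical placeholders, not the settings used in the experiments.

```python
def step_decay(base_lr, decay_rate, decay_every, epoch):
    """Learning rate after `epoch` epochs under step decay."""
    # Integer division counts how many decay steps have been completed.
    return base_lr * decay_rate ** (epoch // decay_every)


# Hypothetical values, chosen only to illustrate the shape of the schedule.
schedule = [step_decay(0.01, 0.5, 10, epoch) for epoch in (0, 9, 10, 25)]
```

Setting `decay_every` to 1 gives the per-epoch decay described for Reuters.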
Similar to other VAE-based models Sønderby et al. (2016); Kingma and Salimans (2016), VaDE suffers from the problem that the reconstruction term in Equation 17 is so weak at the beginning of training that the model might get stuck in an undesirable local minimum or saddle point, from which it is hard to escape. In this work, pretraining is used to avoid this problem. Specifically, we use a Stacked Auto-Encoder to pretrain the networks $f$ and $g$. Then all data points are projected into the latent space $\mathbf{z}$ by the pretrained network $g$, where a GMM is applied to initialize the parameters $\{\boldsymbol{\pi}, \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c\}$, $c \in \{1, \dots, K\}$. In practice, a few epochs of pretraining are enough to provide a good initialization of VaDE. We find that VaDE is not sensitive to hyperparameters after pretraining. Hence, we did not spend much effort tuning them.
4.3 Quantitative Comparison
Following DEC, the performance of VaDE is measured by unsupervised clustering accuracy (ACC), which is defined as:
$$ACC = \max_{m \in \mathcal{M}} \frac{\sum_{i=1}^{N} \mathbb{1}\{l_i = m(c_i)\}}{N}$$
where $N$ is the total number of samples, $l_i$ is the ground-truth label, $c_i$ is the cluster assignment obtained by the model, and $\mathcal{M}$ is the set of all possible one-to-one mappings between cluster assignments and labels. The best mapping can be obtained by using the Kuhn–Munkres algorithm Munkres (1957). Similar to DEC, we perform random restarts when initializing all clustering models and pick the result with the best objective value. As for LDMGI, AAE and DEC, we use the same configurations as their original papers. Table 2 compares the performance of VaDE with other baselines over all datasets. It can be seen that VaDE outperforms all these baselines by a large margin on all datasets. Specifically, on the MNIST, HHAR, Reuters-10K, Reuters and STL-10 datasets, VaDE achieves ACC of , , , and , which outperforms DEC with a relative increase ratio of , , , and , respectively.
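To make the ACC definition concrete, the sketch below evaluates it on a toy example by brute force over all one-to-one mappings. This exhaustive search is a deliberate simplification that is only feasible for a handful of clusters. The Kuhn–Munkres algorithm mentioned above finds the same optimum in polynomial time. The function name and the toy data are illustrative, and labels and cluster indices are both assumed to lie in 0..K-1.

```python
from itertools import permutations


def clustering_accuracy(labels, assignments, num_clusters):
    """ACC = max over one-to-one mappings m of mean(l_i == m(c_i))."""
    # counts[c][l] = number of samples in cluster c whose true label is l
    counts = [[0] * num_clusters for _ in range(num_clusters)]
    for l, c in zip(labels, assignments):
        counts[c][l] += 1
    best = 0
    # mapping[c] is the label that cluster c is mapped to.
    for mapping in permutations(range(num_clusters)):
        matched = sum(counts[c][mapping[c]] for c in range(num_clusters))
        best = max(best, matched)
    return best / len(labels)


labels = [0, 0, 0, 1, 1, 2, 2, 2]
assignments = [1, 1, 0, 2, 2, 0, 0, 0]
acc = clustering_accuracy(labels, assignments, 3)  # 7 of 8 matched -> 0.875
```

Here the best mapping sends cluster 0 to label 2, cluster 1 to label 0 and cluster 2 to label 1, so only the third sample is counted as an error.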
We also compare VaDE with SBVAE Nalisnick and Smyth (2016) and DLGMM Nalisnick et al. (2016) on the discriminative power of the latent representations, since these two baselines cannot do clustering tasks. Following SBVAE, the discriminative power of the models’ latent representations is assessed by running a k-Nearest Neighbors classifier (kNN) on the latent representations of MNIST. Table 3 shows the error rate of the kNN classifier on the latent representations. It can be seen that VaDE outperforms SBVAE and DLGMM significantly. (We use the same network architecture for VaDE and SBVAE in Table 3 for fair comparisons. Since there is no code available for DLGMM, we take the numbers of DLGMM directly from Nalisnick et al. (2016). Note that Nalisnick and Smyth (2016) has already shown that the performance of SBVAE is comparable to DLGMM.)

Note that although VaDE can learn discriminative representations of samples, the training of VaDE is totally unsupervised. Hence, we did not compare VaDE with other supervised models.
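The kNN evaluation protocol can be sketched as follows: latent vectors of labeled training samples act as the reference set, and each test latent vector is classified by a majority vote among its nearest neighbors. This is an illustrative pure-Python sketch with made-up two-dimensional latent vectors, using squared Euclidean distance as an assumed metric. It is not the evaluation code used for Table 3.

```python
from collections import Counter


def knn_predict(train_z, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(z, query)), y)
                   for z, y in zip(train_z, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]


def knn_error_rate(train_z, train_y, test_z, test_y, k):
    """Fraction of test points whose kNN prediction differs from the label."""
    wrong = sum(knn_predict(train_z, train_y, q, k) != y
                for q, y in zip(test_z, test_y))
    return wrong / len(test_y)


# Made-up latent vectors forming two well-separated groups.
train_z = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
test_z = [[0.2, 0.1], [5.5, 5.5], [4.0, 4.0]]
test_y = [0, 1, 0]
err = knn_error_rate(train_z, train_y, test_z, test_y, k=3)
```

The third test point is labeled 0 but lies next to the second group, so it is the only error and `err` is 1/3.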
4.4 Generating Samples by VaDE
One major advantage of VaDE over DEC Xie et al. (2016) is that it is by nature a generative clustering model and can generate highly realistic samples for any specified cluster (class). In this section, we provide some qualitative comparisons on generating samples among VaDE, GMM, VAE and the state-of-the-art generative method InfoGAN Chen et al. (2016).
Figure 4 illustrates the generated samples for classes 0 to 9 of MNIST by GMM, VAE, InfoGAN and VaDE, respectively. It can be seen that the digits generated by VaDE are smooth and diverse. Note that the classes of the samples from VAE cannot be specified. We can also see that the performance of VaDE is comparable with InfoGAN.
4.5 Visualization of Learned Embeddings
In this section, we visualize the learned representations of VAE, DEC and VaDE on the MNIST dataset. To this end, we use t-SNE Maaten and Hinton (2008) to reduce the dimensionality of the latent representation from to , and plot randomly sampled digits in Figure 5. The first row of Figure 5 illustrates the ground-truth labels for each digit, where different colors indicate different labels. The second row of Figure 5 demonstrates the clustering results, where correctly clustered samples are colored green and incorrect ones red.
From Figure 5 we can see that the original VAE which used a single Gaussian prior does not perform well in clustering tasks. It can also be observed that the embeddings learned by VaDE are better than those by VAE and DEC, since the number of incorrectly clustered samples is smaller. Furthermore, incorrectly clustered samples by VaDE are mostly located at the border of each cluster, where confusing samples usually appear. In contrast, a lot of the incorrectly clustered samples of DEC appear in the interior of the clusters, which indicates that DEC fails to preserve the inherent structure of the data. Some mistakes made by DEC and VaDE are also marked in Figure 5.
4.6 The Impact of the Number of Clusters
So far, the number of clusters for VaDE has been set to the number of classes for each dataset, which is prior knowledge. To demonstrate VaDE’s representation power as an unsupervised clustering model, we deliberately choose different numbers of clusters $K$. Each row in Figure 6 illustrates the samples from a cluster grouped by VaDE on the MNIST dataset, where $K$ is set to and in Figure 6(a) and Figure 6(b), respectively. We can see that, if $K$ is smaller than the number of classes, digits with similar appearances will be clustered together, such as and , and in Figure 6(a). On the other hand, if $K$ is larger than the number of classes, some digits will fall into subclasses by VaDE, such as the fatter and thinner , and the upright and oblique in Figure 6(b).
5 Conclusion
In this paper, we proposed Variational Deep Embedding (VaDE), which embeds the probabilistic clustering problem into a Variational Auto-Encoder (VAE) framework. VaDE models the data generative procedure by a GMM and a neural network, and is optimized by maximizing the evidence lower bound (ELBO) of the log-likelihood of data using the SGVB estimator and the reparameterization trick. We compared the clustering performance of VaDE with strong baselines on 5 benchmarks from different modalities, and the experimental results showed that VaDE outperforms the state-of-the-art methods by a large margin. We also showed that VaDE can generate highly realistic samples conditioned on cluster information without using any supervised information during training. Note that although we use a MoG prior for VaDE in this paper, other mixture models can also be adopted flexibly in this framework, which we leave as future work.
Acknowledgments
We thank the School of Mechanical Engineering of BIT (Beijing Institute of Technology) and Collaborative Innovation Center of Electric Vehicles in Beijing for their support. This work was supported by the National Natural Science Foundation of China (61620106002, 61271376). We also thank the anonymous reviewers.
References
Abbasnejad et al. [2016] Ehsan Abbasnejad, Anthony Dick, and Anton van den Hengel. Infinite variational autoencoder for semi-supervised learning. arXiv preprint arXiv:1611.07800, 2016.
Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
Coates et al. [2011] Adam Coates, Andrew Y Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.
Dosovitskiy and Brox [2016] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Kingma and Salimans [2016] Diederik P Kingma and Tim Salimans. Improving variational autoencoders with inverse autoregressive flow. In NIPS, 2016.
Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
Kingma et al. [2014] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
Lewis et al. [2004] David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004.
Liu et al. [2010] Jialu Liu, Deng Cai, and Xiaofei He. Gaussian mixture model with local consistency. In AAAI, 2010.
Maaløe et al. [2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In ICML, 2016.
Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
Makhzani et al. [2016] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In NIPS, 2016.
Munkres [1957] James Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 1957.
Nalisnick and Smyth [2016] Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, 2016.
Nalisnick et al. [2016] Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent Gaussian mixtures. 2016.
Ng et al. [2002] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. In NIPS, 2002.
Nguyen et al. [2016] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.
Oord et al. [2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
Radford et al. [2016] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.
Shu et al. [2016] Rui Shu, James Brofos, Frank Zhang, Hung Hai Bui, Mohammad Ghavamzadeh, and Mykel Kochenderfer. Stochastic video prediction with conditional density estimation. In ECCV Workshop on Action and Anticipation for Visual Learning, 2016.
Sønderby et al. [2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In NIPS, 2016.
Stisen et al. [2015] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, 2015.
Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
Von Luxburg [2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 2007.
Xie et al. [2016] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
Yang et al. [2010] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 2010.
Ye et al. [2008] Jieping Ye, Zheng Zhao, and Mingrui Wu. Discriminative k-means for clustering. In NIPS, 2008.
Zheng et al. [2014a] Yin Zheng, Richard S Zemel, Yu-Jin Zhang, and Hugo Larochelle. A neural autoregressive approach to attention-based recognition. International Journal of Computer Vision, 113(1):67–79, 2014.
Zheng et al. [2014b] Yin Zheng, Yu-Jin Zhang, and H. Larochelle. Topic modeling of multimodal data: An autoregressive approach. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1370–1377, June 2014.
Zheng et al. [2015] Y. Zheng, Yu-Jin Zhang, and H. Larochelle. A deep and autoregressive approach for topic modeling of multimodal data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PP(99):1–1, 2015.
Zheng et al. [2016] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33rd International Conference on Machine Learning, pages 764–773, 2016.
Appendix A
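In this section, we provide the derivation of $q(c|\mathbf{x})$.
The evidence lower bound $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ can be rewritten as:
$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z}, c)}{q(\mathbf{z}, c|\mathbf{x})}\right] = \int_{\mathbf{z}} q(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{x}|\mathbf{z})\, p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}\, d\mathbf{z} - \int_{\mathbf{z}} q(\mathbf{z}|\mathbf{x})\, D_{KL}\left(q(c|\mathbf{x}) \,\|\, p(c|\mathbf{z})\right) d\mathbf{z} \tag{18}$$
In Equation 18, the first term does not depend on $c$ and the second term is non-negative. Thus, maximizing the lower bound $\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x})$ with respect to $q(c|\mathbf{x})$ requires that $D_{KL}(q(c|\mathbf{x}) \,\|\, p(c|\mathbf{z})) = 0$. Thus, we have
$$\frac{q(c|\mathbf{x})}{p(c|\mathbf{z})} = \nu$$
where $\nu$ is a constant.
Since $\sum_{c} q(c|\mathbf{x}) = 1$ and $\sum_{c} p(c|\mathbf{z}) = 1$, we have:
$$\frac{q(c|\mathbf{x})}{p(c|\mathbf{z})} = 1$$
Taking the expectation on both sides, we can obtain:
$$q(c|\mathbf{x}) = E_{q(\mathbf{z}|\mathbf{x})}\left[p(c|\mathbf{z})\right]$$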
Appendix B
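Lemma 1 Given two multivariate Gaussian distributions $q(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I})$ and $p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}^2 \mathbf{I})$, we have:
$$\int q(\mathbf{z}) \log p(\mathbf{z})\, d\mathbf{z} = \sum_{j=1}^{J} \left[ -\frac{1}{2} \log(2\pi\sigma_j^2) - \frac{\tilde{\sigma}_j^2}{2\sigma_j^2} - \frac{(\tilde{\mu}_j - \mu_j)^2}{2\sigma_j^2} \right] \tag{19}$$
where $\mu_j$, $\sigma_j$, $\tilde{\mu}_j$ and $\tilde{\sigma}_j$ simply denote the $j$-th element of $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}$, respectively, and $J$ is the dimensionality of $\mathbf{z}$.
Proof (of Lemma 1).
$$\begin{aligned}
\int q(\mathbf{z}) \log p(\mathbf{z})\, d\mathbf{z} &= \sum_{j=1}^{J} \int \mathcal{N}(z_j; \tilde{\mu}_j, \tilde{\sigma}_j^2) \log \mathcal{N}(z_j; \mu_j, \sigma_j^2)\, dz_j \\
&= C - \sum_{j=1}^{J} \frac{1}{2\sigma_j^2} \int \mathcal{N}(z_j; \tilde{\mu}_j, \tilde{\sigma}_j^2) \left[ (z_j - \tilde{\mu}_j)^2 + 2 (z_j - \tilde{\mu}_j)(\tilde{\mu}_j - \mu_j) + (\tilde{\mu}_j - \mu_j)^2 \right] dz_j \\
&= C - \sum_{j=1}^{J} \frac{\tilde{\sigma}_j^2 + (\tilde{\mu}_j - \mu_j)^2}{2\sigma_j^2} \\
&= \sum_{j=1}^{J} \left[ -\frac{1}{2} \log(2\pi\sigma_j^2) - \frac{\tilde{\sigma}_j^2}{2\sigma_j^2} - \frac{(\tilde{\mu}_j - \mu_j)^2}{2\sigma_j^2} \right]
\end{aligned}$$
where $C$ denotes $\sum_{j=1}^{J} -\frac{1}{2} \log(2\pi\sigma_j^2)$ for simplicity.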
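Lemma 1 can be checked numerically. Because both Gaussians are diagonal, the integral factorizes over dimensions, so each one-dimensional integral can be approximated with a midpoint rule and compared against the closed form of Equation 19. The snippet below is only a sanity-check sketch: the function names, the test parameters and the integration settings (step count, integration width) are arbitrary choices.

```python
import math


def lemma1_closed_form(mu_q, var_q, mu_p, var_p):
    """Right-hand side of Equation 19 for diagonal Gaussians."""
    return sum(-0.5 * math.log(2.0 * math.pi * vp)
               - vq / (2.0 * vp)
               - (mq - mp) ** 2 / (2.0 * vp)
               for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p))


def lemma1_numeric_1d(mq, vq, mp, vp, steps=20000, width=12.0):
    """Midpoint-rule estimate of the 1-D integral of N(z; mq, vq) log N(z; mp, vp)."""
    sq = math.sqrt(vq)
    lo = mq - width * sq
    h = 2.0 * width * sq / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        q = math.exp(-(z - mq) ** 2 / (2.0 * vq)) / math.sqrt(2.0 * math.pi * vq)
        log_p = -0.5 * math.log(2.0 * math.pi * vp) - (z - mp) ** 2 / (2.0 * vp)
        total += q * log_p * h
    return total


mu_q, var_q = [0.3, -1.0], [0.5, 2.0]   # parameters of q(z)
mu_p, var_p = [0.0, 1.0], [1.0, 0.8]    # parameters of p(z)
closed = lemma1_closed_form(mu_q, var_q, mu_p, var_p)
numeric = sum(lemma1_numeric_1d(*args)
              for args in zip(mu_q, var_q, mu_p, var_p))
```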
Appendix C
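In this section, we describe how to compute the evidence lower bound of VaDE. Specifically, the evidence lower bound can be rewritten as:
$$\mathcal{L}_{\mathrm{ELBO}}(\mathbf{x}) = E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] + E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{z}|c)\right] + E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(c)\right] - E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log q(\mathbf{z}|\mathbf{x})\right] - E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log q(c|\mathbf{x})\right] \tag{20}$$
Equation 20 can be computed by substituting Equations 4, 5, 6, 11 and 16 into it and using Lemma 1 in Appendix B. Specifically, each term of Equation 20 can be obtained as follows:

$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]$:
Recall that the observation $\mathbf{x}$ can be modeled as either a multivariate Bernoulli distribution or a multivariate Gaussian distribution. We provide the derivation of $E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]$ for the multivariate Bernoulli distribution; the derivation for the multivariate Gaussian case can be obtained in a similar way.
Using the SGVB estimator, we can approximate $E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right]$ as:
$$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] \approx \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{D} \left[ x_i \log \boldsymbol{\mu}_x^{(l)}|_i + (1 - x_i) \log\left(1 - \boldsymbol{\mu}_x^{(l)}|_i\right) \right]$$
where $\boldsymbol{\mu}_x^{(l)} = f(\mathbf{z}^{(l)}; \boldsymbol{\theta})$, $\mathbf{z}^{(l)} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I})$ and $[\tilde{\boldsymbol{\mu}}; \log \tilde{\boldsymbol{\sigma}}^2] = g(\mathbf{x}; \boldsymbol{\phi})$. $L$ is the number of Monte Carlo samples in the SGVB estimator and can be set to $1$. $D$ is the dimensionality of $\mathbf{x}$.
Since the Monte Carlo estimate of the expectation above is non-differentiable w.r.t. $\boldsymbol{\phi}$ when $\mathbf{z}^{(l)}$ is directly sampled from $\mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I})$, we use the reparameterization trick to obtain a differentiable estimate:
$$\mathbf{z}^{(l)} = \tilde{\boldsymbol{\mu}} + \tilde{\boldsymbol{\sigma}} \circ \boldsymbol{\epsilon}^{(l)}, \quad \boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
where $\circ$ denotes the element-wise product.

$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{z}|c)\right]$:
According to Lemma 1 in Appendix B, we have:
$$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(\mathbf{z}|c)\right] = -\sum_{c=1}^{K} q(c|\mathbf{x}) \left[ \frac{J}{2} \log(2\pi) + \frac{1}{2} \sum_{j=1}^{J} \left( \log \sigma_{cj}^2 + \frac{\tilde{\sigma}_j^2}{\sigma_{cj}^2} + \frac{(\tilde{\mu}_j - \mu_{cj})^2}{\sigma_{cj}^2} \right) \right]$$

$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(c)\right]$:
$$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log p(c)\right] = \sum_{c=1}^{K} q(c|\mathbf{x}) \log \pi_c$$

$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log q(\mathbf{z}|\mathbf{x})\right]$:
According to Lemma 1 in Appendix B, we have:
$$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log q(\mathbf{z}|\mathbf{x})\right] = -\frac{J}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \tilde{\sigma}_j^2 \right)$$

$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log q(c|\mathbf{x})\right]$:
$$E_{q(\mathbf{z}, c|\mathbf{x})}\left[\log q(c|\mathbf{x})\right] = \sum_{c=1}^{K} q(c|\mathbf{x}) \log q(c|\mathbf{x})$$
where $\mu_{cj}$, $\sigma_{cj}$, $\tilde{\mu}_j$ and $\tilde{\sigma}_j$ simply denote the $j$-th element of $\boldsymbol{\mu}_c$, $\boldsymbol{\sigma}_c$, $\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\sigma}}$ described in Section 3, respectively. $J$ is the dimensionality of $\mathbf{z}$ and $K$ is the number of clusters.
For all the above equations, $q(c|\mathbf{x})$ is computed as in Appendix A and can be approximated by the SGVB estimator and the reparameterization trick as follows:
$$q(c|\mathbf{x}) = E_{q(\mathbf{z}|\mathbf{x})}\left[p(c|\mathbf{z})\right] \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p(c)\, p(\mathbf{z}^{(l)}|c)}{\sum_{c'=1}^{K} p(c')\, p(\mathbf{z}^{(l)}|c')}$$
where $\mathbf{z}^{(l)} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^2 \mathbf{I})$, $[\tilde{\boldsymbol{\mu}}; \log \tilde{\boldsymbol{\sigma}}^2] = g(\mathbf{x}; \boldsymbol{\phi})$, $\mathbf{z}^{(l)} = \tilde{\boldsymbol{\mu}} + \tilde{\boldsymbol{\sigma}} \circ \boldsymbol{\epsilon}^{(l)}$ and $\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.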
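Putting the five terms together gives the single-sample ($L = 1$) ELBO of Equation 12 for binary observations. The sketch below assumes that the decoder output `mu_x` and the responsibilities `gamma` have already been computed and are passed in as plain lists. The function name and all numeric values are illustrative. As a sanity check, with a single cluster whose prior is the standard normal, the expression should reduce to the usual VAE objective, i.e. the reconstruction term minus $D_{KL}(q(\mathbf{z}|\mathbf{x}) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I}))$.

```python
import math


def vade_elbo_binary(x, mu_x, mu_tilde, log_var_tilde, gamma, pi, mu_c, var_c):
    """Single-sample estimate of the VaDE ELBO for binary x."""
    # Reconstruction term: Bernoulli log-likelihood of x under mu_x.
    recon = sum(xi * math.log(m) + (1 - xi) * math.log(1.0 - m)
                for xi, m in zip(x, mu_x))
    var_tilde = [math.exp(lv) for lv in log_var_tilde]
    # E[log p(z|c)]; the -(J/2) log(2 pi) constant is omitted because it
    # cancels against the same constant in -E[log q(z|x)].
    prior = 0.0
    for g, mc, vc in zip(gamma, mu_c, var_c):
        prior -= 0.5 * g * sum(math.log(v) + vt / v + (mt - m) ** 2 / v
                               for v, vt, mt, m
                               in zip(vc, var_tilde, mu_tilde, mc))
    # E[log p(c)] - E[log q(c|x)]
    cat = sum(g * math.log(p / g) for g, p in zip(gamma, pi) if g > 0.0)
    # -E[log q(z|x)] with the 2 pi constant omitted as noted above.
    entropy = 0.5 * sum(1.0 + lv for lv in log_var_tilde)
    return recon + prior + cat + entropy


# Illustrative inputs; K = 1 with a standard-normal cluster.
x = [1, 0, 1]
mu_x = [0.9, 0.2, 0.7]
mu_tilde = [0.5, -0.3]
log_var_tilde = [-0.2, 0.1]
elbo = vade_elbo_binary(x, mu_x, mu_tilde, log_var_tilde,
                        gamma=[1.0], pi=[1.0],
                        mu_c=[[0.0, 0.0]], var_c=[[1.0, 1.0]])
```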