1 Concrete GMVAE
Variational Autoencoders (VAEs) [6, 11] are popular latent-variable probabilistic models for unsupervised learning that are well suited to deep neural networks. The standard VAE formulation has a single continuous latent vector in its probabilistic model, whereas traditional clustering models such as the Gaussian Mixture Model contain a discrete latent variable representing the cluster id. One can cluster with a standard VAE by, for example, first training the VAE and then running k-means on the latent variables inferred for each training example. However, it may be beneficial from a modelling and computational point of view to train, end-to-end, a VAE capable of clustering via a discrete latent variable.
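The Gaussian Mixture VAE (GMVAE) [2, 3] is one of a number of VAE variants with discrete latent variables that can be used for unsupervised clustering and semi-supervised learning [7]. The GMVAE defines the following generative model of the observed data $x$, with a discrete cluster label $y$ and a continuous latent vector $z$:

$$p(x, y, z) = p(x \mid z)\, p(z \mid y)\, p(y) \tag{1}$$

$$y \sim \mathrm{Cat}(\pi) \tag{2}$$

$$z \mid y \sim \mathcal{N}\big(\mu_z(y), \sigma_z^2(y)\big) \tag{3}$$

$$x \mid z \sim \mathrm{Bernoulli}\big(\mu_x(z)\big) \tag{4}$$

where $\pi$ is the prior over the $K$ cluster labels, and $\mu_z(y)$, $\sigma_z^2(y)$ and $\mu_x(z)$ are chosen to be deep neural networks. When $x$ is not Bernoulli distributed, a different likelihood model is used. As is standard for VAEs, we introduce a factorized inference model $q(y, z \mid x) = q(z \mid x, y)\, q(y \mid x)$ for the latents, whose factors are also deep networks. The collection of networks is then trained end-to-end to maximize a single-sample Monte Carlo estimate of the ELBO $\mathcal{L}(x)$, a lower bound on $\log p(x)$:

$$\mathcal{L}(x) = \mathbb{E}_{q(y, z \mid x)}\left[\log \frac{p(x, y, z)}{q(y, z \mid x)}\right] \tag{5}$$

$$\approx \sum_{k=1}^{K} q(y = k \mid x)\left[\log p(x \mid \tilde{z}_k) + \log \frac{p(\tilde{z}_k \mid y = k)}{q(\tilde{z}_k \mid x, y = k)}\right] - \mathrm{KL}\big(q(y \mid x)\,\|\,p(y)\big) \tag{6}$$

with $\tilde{z}_k \sim q(z \mid x, y = k)$. Thus even the single-sample Monte Carlo estimate of the ELBO requires a summation over all possible settings of the cluster label $y$, making training time linear in the number of clusters $K$. When one wishes to use a large number of clusters, this linear training time complexity is prohibitive, especially when combined with the large datasets and deep neural networks to which the GMVAE is designed to be applied.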
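To make the linear cost concrete, the following NumPy sketch computes the marginalized estimate with one encoder, prior and decoder evaluation per cluster label. It is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function names (`marginalized_elbo`, `encode_z`, `prior_z`, `decode_x`), the toy linear maps standing in for the deep networks, the unit-variance Gaussians and the uniform prior over cluster labels are all introduced here for the example.

```python
import numpy as np


def log_normal(z, mean, log_var):
    """Log density of a diagonal Gaussian, summed over dimensions."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mean) ** 2 / np.exp(log_var)
    )


def log_bernoulli(x, probs, eps=1e-7):
    """Log likelihood of binary x under independent Bernoulli means."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return np.sum(x * np.log(probs) + (1.0 - x) * np.log(1.0 - probs))


def marginalized_elbo(x, q_y, encode_z, prior_z, decode_x, rng):
    """Single-sample ELBO estimate with the cluster label summed out.

    q_y holds q(y=k|x) for k = 0..K-1. One z sample is drawn per label,
    so the cost is K passes through the (hypothetical) networks.
    Assumes a uniform prior over the K labels.
    """
    num_clusters = q_y.shape[0]
    expected_terms = 0.0
    for k in range(num_clusters):
        y = np.eye(num_clusters)[k]                    # one-hot label
        q_mean, q_log_var = encode_z(x, y)             # parameters of q(z|x,y=k)
        noise = rng.standard_normal(q_mean.shape)
        z = q_mean + np.exp(0.5 * q_log_var) * noise   # reparameterized sample
        p_mean, p_log_var = prior_z(y)                 # parameters of p(z|y=k)
        term = (
            log_bernoulli(x, decode_x(z))
            + log_normal(z, p_mean, p_log_var)
            - log_normal(z, q_mean, q_log_var)
        )
        expected_terms += q_y[k] * term
    # KL(q(y|x) || uniform prior), computed in closed form.
    kl_y = np.sum(q_y * (np.log(q_y + 1e-10) + np.log(num_clusters)))
    return expected_terms - kl_y


# Toy stand-ins for the deep networks: random linear maps, unit variance.
rng = np.random.default_rng(0)
D, H, K = 6, 3, 4  # data dimension, latent dimension, number of clusters
W_enc = 0.1 * rng.normal(size=(D + K, H))
W_prior = rng.normal(size=(K, H))
W_dec = 0.1 * rng.normal(size=(H, D))
decoder_calls = []


def encode_z(x, y):
    return np.concatenate([x, y]) @ W_enc, np.zeros(H)


def prior_z(y):
    return y @ W_prior, np.zeros(H)


def decode_x(z):
    decoder_calls.append(1)  # count decoder evaluations
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))


x = rng.integers(0, 2, size=D).astype(float)
q_y = np.full(K, 1.0 / K)
elbo_value = marginalized_elbo(x, q_y, encode_z, prior_z, decode_x, rng)
```

The decoder is evaluated once per cluster label, so `len(decoder_calls)` equals `K` after a single call to `marginalized_elbo`.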
However, the summation over cluster ids is only required because sampling from the categorical distribution $q(y \mid x)$ is non-differentiable. To avoid the linear scaling of the training time we could use a REINFORCE-style gradient estimator [12], but such methods tend to provide high-variance gradient estimates and correspondingly slow practical convergence. We instead propose the Concrete GMVAE, which continuously relaxes $q(y \mid x)$ using the Concrete (also known as Gumbel-Softmax) distribution [10, 4]. As $q(y \mid x)$ is now a continuous approximation to the discrete Categorical distribution, sampling from $q(y \mid x)$ is differentiable using the reparameterization trick [6, 11, 10, 4]. It follows that the single-sample Monte Carlo estimate of the ELBO can be obtained in a time independent of the number of clusters used:

$$\mathcal{L}(x) \approx \log p(x \mid \tilde{z}) + \log \frac{p(\tilde{z} \mid \tilde{y})}{q(\tilde{z} \mid x, \tilde{y})} - \mathrm{KL}\big(q(y \mid x)\,\|\,p(y)\big) \tag{7}$$

where $\tilde{y} \sim q(y \mid x)$ and $\tilde{z} \sim q(z \mid x, \tilde{y})$, which can be obtained through ancestral sampling. In practice, as has been found when applying VAEs to text [1, 13], we found that the direct use of eq. (7) leads to the network ignoring the latent variable $y$ by reducing the KL divergence $\mathrm{KL}(q(y \mid x)\,\|\,p(y))$ to zero. We adopt a similar strategy to [1] and introduce a weight $\lambda$ on $\mathrm{KL}(q(y \mid x)\,\|\,p(y))$, which we anneal from zero to one during the course of training. Our MC estimate of the ELBO thus becomes:

$$\mathcal{L}(x) \approx \log p(x \mid \tilde{z}) + \log \frac{p(\tilde{z} \mid \tilde{y})}{q(\tilde{z} \mid x, \tilde{y})} - \lambda\, \mathrm{KL}\big(q(y \mid x)\,\|\,p(y)\big) \tag{8}$$

We find this modification to be sufficient to encourage use of $y$.
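The relaxed sampling step and the KL-weighted objective can be sketched in NumPy as follows. This shows only the forward computation. In an automatic-differentiation framework the same operations would let gradients flow through the sample to the logits. The function names, the uniform prior over cluster labels and the linear ramp in `linear_kl_weight` are assumptions made for illustration. In particular, the ramp is a hypothetical schedule, not the one used in the experiments.

```python
import numpy as np


def sample_concrete(logits, temperature, rng, eps=1e-20):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    u = rng.uniform(size=logits.shape)
    gumbel = -np.log(-np.log(u + eps) + eps)
    scores = (logits + gumbel) / temperature
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def kl_categorical_uniform(q_y, eps=1e-10):
    """KL(q(y|x) || uniform prior over K labels); the uniform prior is assumed."""
    num_clusters = q_y.shape[-1]
    return np.sum(q_y * (np.log(q_y + eps) + np.log(num_clusters)), axis=-1)


def linear_kl_weight(step, warmup_steps):
    """Hypothetical linear ramp of the KL weight from 0 to 1."""
    return min(1.0, step / warmup_steps)


def annealed_elbo(log_px_z, log_pz_y, log_qz_xy, kl_y, kl_weight):
    """Single-sample ELBO estimate with a weight on the cluster-label KL."""
    return log_px_z + log_pz_y - log_qz_xy - kl_weight * kl_y


rng = np.random.default_rng(0)
logits = np.log(np.array([0.7, 0.2, 0.1]))  # log q(y|x) for K = 3
y_relaxed = sample_concrete(logits, temperature=0.3, rng=rng)
```

Here `y_relaxed` is a point on the probability simplex rather than a one-hot vector. A single draw replaces the sum over all `K` labels, so the networks downstream of `y` are evaluated once per training example.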
2 Experiments
We test our Concrete GMVAE on the binarized MNIST and CIFAR100 datasets [9, 8]. For MNIST we set K=10 clusters, one for each class. For CIFAR100, in order to keep the standard GMVAE training time reasonable, we use K=20 clusters, one for each “superclass” in the dataset [8]. We anneal the KL weight following: . Further experimental details and neural network architectures are given in appendix A. We note that at test time, for all models, we fully discretize $y$ by one-hot encoding the argmax of $q(y \mid x)$.

We can see from table 1 that for both datasets there is no significant difference in test set log-likelihood between the GMVAE and the Concrete GMVAE, i.e. our proposed method learns as good a model of the data as the standard model. Without KL annealing we observed that at convergence $\mathrm{KL}(q(y \mid x)\,\|\,p(y))$ was close to zero and thus the model failed to learn to cluster the dataset. The computational advantage of the Concrete GMVAE is clear from the training times: training the Concrete GMVAE is approximately 4x and 8x faster than the standard GMVAE for K=10 and K=20 respectively. As K grows this gap would increase further.
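The test-time discretization mentioned above replaces the cluster assignment by a one-hot vector at the most probable label. A minimal NumPy sketch, where `discretize` is a name introduced here for illustration:

```python
import numpy as np


def discretize(q_y):
    """One-hot encode the argmax of q(y|x) along the last axis."""
    num_clusters = q_y.shape[-1]
    return np.eye(num_clusters)[np.argmax(q_y, axis=-1)]


# Two examples with K = 3 cluster probabilities each.
q_y = np.array([[0.1, 0.7, 0.2],
                [0.5, 0.3, 0.2]])
y_hard = discretize(q_y)
```

For this input, `y_hard` assigns the first example to cluster 1 and the second to cluster 0.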
| Dataset | K | GMVAE: test log-likelihood | GMVAE: training time (hours) | Concrete GMVAE: test log-likelihood | Concrete GMVAE: training time (hours) |
|---|---|---|---|---|---|
| MNIST | 10 | | | | |
| CIFAR100 | 20 | | | | |

Table 1: Comparison of Concrete GMVAE vs. standard GMVAE. Mean and standard deviation of test set log-likelihood and training time (in hours) over 11 training runs are shown. Training was done on an Amazon EC2 p3.2xlarge instance with 1 NVIDIA V100 Tensor Core GPU and 8 virtual CPUs.
These results demonstrate that the theoretical reduction in training time complexity from linear to constant scaling in K has, in practice, reduced training time substantially in a realistic training setup (see appendix A) with early stopping, etc. Despite the significant speedup in training, our introduction of a continuous relaxation of the latent variable in a GMVAE has had no significant negative impact on test set log-likelihood. We hope that the introduction of the Concrete GMVAE will enable the application of GMVAEs to large-scale problems and to problems requiring a large number of clusters, where the required training time was previously considered prohibitive.
References
 [1] (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21.
 [2] (2016) Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
 [3] Gaussian mixture VAE: lessons in variational inference, generative models, and deep nets. http://ruishu.io/2016/12/25/gmvae/. Accessed: 2019-03-23.
 [4] (2016) Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
 [5] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [6] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 [7] (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
 [8] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
 [9] (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2, pp. 18.
 [10] (2016) The Concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
 [11] (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286.
 [12] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32.
 [13] (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3881–3890.
Appendix A Experimental Details
For both datasets we train using the Adam optimizer [5] with initial learning rate of , for up to 200 epochs, stopping training early if the validation set loss has not improved for 10 consecutive epochs. Each image is passed through a shared encoder convolutional neural network with 20 5x5 filters with stride 1 followed by 40 5x5 filters with stride 1. Each convolutional layer is followed by a 2x2 max pooling layer with a stride of 2 and the ReLU activation function, and the encoder ends with a fully-connected layer with 512 units and ReLU activation. From this shared encoding of the raw image, we compute $q(y \mid x)$ with two further fully-connected layers of 512 and 256 units, also with ReLU activation. Whether sampling or marginalizing out $y$, $q(z \mid x, y)$ is computed by concatenating the output of the shared encoder with the one-hot encoded $y$ and passing it through two fully-connected layers of 512 and 256 units with ReLU activation. The prior $p(z \mid y)$ is implemented as a single fully-connected layer from the one-hot encoded $y$. The likelihood $p(x \mid z)$ is implemented with a single fully-connected layer followed by two transposed-convolution layers mirroring those used in the shared encoder. The temperature of the Concrete distribution is set to 0.3 for both datasets throughout training.
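The early-stopping rule stated at the start of this appendix (at most 200 epochs, stopping once the validation loss has not improved for 10 consecutive epochs) can be sketched as below. The callback `run_epoch`, which is assumed to train for one epoch and return the validation loss, and the synthetic loss curve `fake_validation_loss` are hypothetical and only serve to make the example runnable.

```python
def train_with_early_stopping(run_epoch, max_epochs=200, patience=10):
    """Run up to max_epochs, stopping after `patience` epochs without improvement.

    run_epoch is a hypothetical callback: epoch index -> validation loss.
    Returns (best_epoch, best_loss, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = None
    stale_epochs = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run += 1
        if val_loss < best_loss:
            # Strict improvement resets the patience counter.
            best_loss, best_epoch, stale_epochs = val_loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_epoch, best_loss, epochs_run


def fake_validation_loss(epoch):
    """Synthetic loss: improves until epoch 5, then plateaus at 1.0."""
    return max(1.0, 6.0 - epoch)


result = train_with_early_stopping(fake_validation_loss)
```

With this synthetic curve the best loss is reached at epoch 5 and training stops after 16 epochs, once the ten epochs that follow bring no improvement.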