Unsupervised learning is generally considered one of the greatest challenges of machine learning research. In recent years, there has been great progress in modeling data distributions using deep generative models [1, 2], and while this progress has influenced the clustering literature, the full potential has yet to be reached.
Consider a latent variable model
$$p(\mathbf{x}) = \int p(\mathbf{x} \,|\, \mathbf{z})\, p(\mathbf{z})\, \mathrm{d}\mathbf{z},$$
where latent variables $\mathbf{z} \in \mathbb{R}^d$ provide a low-dimensional representation of the data $\mathbf{x} \in \mathbb{R}^D$, with $d \ll D$. In general, the prior $p(\mathbf{z})$ will determine if clustering of the latent variables is successful; e.g. the common Gaussian prior, $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbb{I}_d)$, tends to move clusters closer together, making post hoc clustering difficult (see Fig. 1). This problem is particularly evident in deep generative models such as variational autoencoders (VAEs) [3, 4] that pick
$$p_\theta(\mathbf{x} \,|\, \mathbf{z}) = \mathcal{N}\big(\mathbf{x} \,|\, \mu_\theta(\mathbf{z}),\ \mathbb{I}_D \sigma^2_\theta(\mathbf{z})\big),$$
where the mean $\mu_\theta$ and variance $\sigma^2_\theta$ are parametrized by deep neural networks with parameters $\theta$. The flexibility of such networks ensures that the latent variables can be made to follow almost any prior $p(\mathbf{z})$, implying that the latent variables can be forced to show almost any structure, including ones not present in the data. This does not influence the distribution $p(\mathbf{x})$, but it can be detrimental for clustering.
These concerns indicate that one should be very careful when computing distances in the latent space of deep generative models. As these models (informally) span a manifold embedded in the data space, one can consider measuring distances along this manifold; an idea that shares intuitions with classic approaches such as spectral clustering [5]. Arvanitidis et al. [6] have recently shown that measuring along the data manifold associated with a deep generative model can be achieved by endowing the latent space with a Riemannian metric and measuring distances accordingly. Unfortunately, their approach requires numerical solutions to a system of ordinary differential equations, which cannot readily be evaluated using standard frameworks for deep learning (see Sec. II). In this paper, we propose an efficient algorithm for evaluating these distances and demonstrate its usefulness for clustering tasks.
II Related Work
Clustering, as a fundamental problem in machine learning, depends highly on the quality of the data representation. Recently, deep neural networks have become useful for learning clustering-friendly representations. We see four categories of work based on network structure: autoencoders (AE), deep neural networks (DNN), generative adversarial networks (GAN), and variational autoencoders (VAE).
In AE-based methods, Deep Clustering Networks [7] directly combine the loss functions of autoencoders and $k$-means, while Deep Embedding Network [8] also revised the loss function by adding locality-preserving and group-sparsity constraints to guide the network toward clustering. Deep Multi-Manifold Clustering [9] introduced a manifold locality-preserving loss and proximity to cluster centroids, while Deep Embedded Regularized Clustering [10] established a non-trivial structure of convolutional and softmax autoencoders and proposed an entropy loss function with clustering regularization. Deep Continuous Clustering [11] inherited the continuity property of Robust Continuous Clustering [12], a formulation with a clear continuous objective and no prior knowledge of the number of clusters, to integrate network parameter learning and clustering altogether. The AE-based methods are easy to implement, but they introduce hyper-parameters to the loss and are very limited in network depth.
In DNN-based methods, the networks can be very flexible, such as convolutional neural networks [13] or deep belief networks [14], and often involve pre-training and fine-tuning stages. Deep Nonparametric Clustering [15] and Deep Embedded Clustering [16] are representative works. Since the result is sensitive to network initialization, Clustering Convolutional Neural Networks [17] were proposed with initial cluster centroids. To get rid of pre-training, Joint Unsupervised Learning [18] and Deep Adaptive Image Clustering [19] were proposed for hierarchical clustering and binary pairwise relationships of images, respectively.
In VAE-based methods, because the VAE is a generative model, Variational Deep Embedding [20] and Gaussian Mixture VAE [21] designed special prior distributions over the latent representation and inferred data classes corresponding to the modes of the different priors.
In GAN-based methods, as another family of generative models, Deep Adversarial Clustering [22] was inspired by the ideas behind Variational Deep Embedding [20], but with a GAN structure. The Information Maximizing Generative Adversarial Network [23] can disentangle both discrete and continuous latent representations, and it models a clustering function with categorical values for those latent codes. AE-based and DNN-based methods are designed specifically for clustering, but they do not consider the underlying structure of the data and consequently cannot generate data. VAE-based and GAN-based methods can generate samples and infer the structure of the data, but by changing the latent space to favor clustering, they may conversely distort the true intrinsic structure of the data.
Our work is based on a recent observation that deep generative models immerse random Riemannian manifolds [6]. This implies a change in the way distances are measured in the latent space, which reveals a clustering structure. Unfortunately, practical algorithms for actually computing such distances are missing; providing one is the main focus of the present paper. With such an algorithm in hand, clustering can be performed with high accuracy in the latent space of an off-the-shelf VAE.
The paper is organized as follows: Section III introduces the usual VAE network, along with its interpretation as a stochastic Riemannian manifold. In Sec. IV we derive an efficient algorithm for computing geodesics (shortest paths) over this manifold, and in Sec. V we demonstrate its usefulness for clustering tasks. The paper is concluded in Sec. VI.
III Background on Variational Autoencoders
Deep generative modeling is an area of machine learning that deals with models of distributions over a potentially high-dimensional data space $\mathbb{R}^D$. Deep generative models can capture data dependencies by learning low-dimensional latent variables that form a latent space $\mathbb{R}^d$. In recent years, the variational autoencoder (VAE) [3, 4] has emerged as one of the most popular deep generative models, because it can be built on top of deep neural networks and trained with stochastic gradient descent. The VAE aims to maximize the probability of data samples $\mathbf{x}$ generated as
$$p(\mathbf{x}) = \int p_\theta(\mathbf{x} \,|\, \mathbf{z})\, p(\mathbf{z})\, \mathrm{d}\mathbf{z}.$$
Here, the latent variables $\mathbf{z}$ are sampled according to a probability density function $p(\mathbf{z})$ defined over the latent space, and $p_\theta(\mathbf{x} \,|\, \mathbf{z})$ denotes the likelihood parametrized by $\theta$. In VAEs this is often Gaussian,
$$p_\theta(\mathbf{x} \,|\, \mathbf{z}) = \mathcal{N}\big(\mathbf{x} \,|\, \mu_\theta(\mathbf{z}),\ \mathbb{I}_D \sigma^2_\theta(\mathbf{z})\big),$$
where $\mu_\theta$ is the mean function and $\sigma^2_\theta$ is the covariance function.
III-A Inference and Generator
The VAE consists of two parts: an inference network and a generator network, that serve almost the same roles as encoders and decoders in classic autoencoders.
III-A1 The inference network
is trained to map the training data samples to the latent space while forcing the latent variables to comply with the prior $p(\mathbf{z})$. However, both the posterior distribution $p(\mathbf{z} \,|\, \mathbf{x})$ and the marginal $p(\mathbf{x})$ are unknown. The VAE therefore approximates the posterior by a variational distribution $q_\phi(\mathbf{z} \,|\, \mathbf{x})$, computed by a network with parameters $\phi$. In order to make $q_\phi(\mathbf{z} \,|\, \mathbf{x})$ accord with the prior $p(\mathbf{z})$, the Kullback-Leibler (KL) divergence [24] is used, that is:
$$\mathrm{KL}\big(q_\phi(\mathbf{z} \,|\, \mathbf{x}) \,\|\, p(\mathbf{z})\big).$$
III-A2 The generator network
is trained to map latent variables to generated data samples that resemble the true samples from the data space. According to the purpose of this network, it should maximize the marginal likelihood over the whole latent space; in practice this is processed with a logarithm and computed by a multi-layer network with parameters $\theta$:
$$\mathbb{E}_{q_\phi(\mathbf{z} \,|\, \mathbf{x})}\big[\log p_\theta(\mathbf{x} \,|\, \mathbf{z})\big].$$
From these parts, a VAE is jointly trained as
$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(\mathbf{z} \,|\, \mathbf{x})}\big[\log p_\theta(\mathbf{x} \,|\, \mathbf{z})\big] - \mathrm{KL}\big(q_\phi(\mathbf{z} \,|\, \mathbf{x}) \,\|\, p(\mathbf{z})\big). \quad (7)$$
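As an illustration (not part of the original derivation), the two terms of Eq. 7 can be sketched in numpy for diagonal Gaussians; all function names here are ours, and the decoder is passed in as callables:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def gaussian_log_likelihood(x, mu, log_var):
    """log N(x | mu, diag(exp(log_var))) -- the reconstruction term."""
    return -0.5 * np.sum(log_var + np.log(2 * np.pi) + (x - mu)**2 / np.exp(log_var))

def elbo(x, enc_mu, enc_log_var, dec_mu, dec_log_var, rng):
    """Single-sample Monte Carlo estimate of Eq. 7 via the re-parametrization trick."""
    eps = rng.standard_normal(enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_log_var) * eps  # z ~ q_phi(z|x)
    recon = gaussian_log_likelihood(x, dec_mu(z), dec_log_var(z))
    return recon - kl_to_standard_normal(enc_mu, enc_log_var)
```

When $q_\phi(\mathbf{z}\,|\,\mathbf{x}) = p(\mathbf{z})$ the KL term vanishes, leaving only the reconstruction term.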
III-B The Random Riemannian Interpretation
The inference network should force the latent variables to approximately follow the pre-specified unit Gaussian prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbb{I}_d)$, which implies that the latent space gives a highly distorted view of the original data. Fortunately, this distortion is fairly easy to characterize [6]. First, observe that the generative model of the VAE can be written as (using the so-called re-parametrization trick; see also Fig. 2)
$$f(\mathbf{z}) = \mu_\theta(\mathbf{z}) + \sigma_\theta(\mathbf{z}) \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbb{I}_D).$$
Now let $\mathbf{z}$ be a latent variable and let $\delta\mathbf{z}$ be infinitesimal. Then we can measure the distance between $f(\mathbf{z})$ and $f(\mathbf{z} + \delta\mathbf{z})$ in the input space using Taylor's theorem,
$$\big\|f(\mathbf{z} + \delta\mathbf{z}) - f(\mathbf{z})\big\|^2 \approx \delta\mathbf{z}^\top \big(\mathbf{J}_\mathbf{z}^\top \mathbf{J}_\mathbf{z}\big)\, \delta\mathbf{z},$$
where $\mathbf{J}_\mathbf{z}$ denotes the Jacobian of $f$ at $\mathbf{z}$. This implies that $\mathbf{M}_\mathbf{z} = \mathbf{J}_\mathbf{z}^\top \mathbf{J}_\mathbf{z}$ defines a local inner product under which we can define curve lengths through integration,
$$\mathrm{Length}(\gamma) = \int_0^1 \sqrt{\dot{\gamma}_t^\top \mathbf{M}_{\gamma_t} \dot{\gamma}_t}\; \mathrm{d}t.$$
Here $\gamma_t$ is a curve in the latent space and $\dot{\gamma}_t$ is its velocity. Distances can then be defined as the length of the shortest curve (geodesic) connecting two points,
$$\mathrm{dist}(\mathbf{z}_0, \mathbf{z}_1) = \min_\gamma \mathrm{Length}(\gamma), \qquad \gamma_0 = \mathbf{z}_0,\ \gamma_1 = \mathbf{z}_1. \quad (11)$$
This is the traditional Riemannian analysis associated with embedded surfaces [25]. From this, it is well known that length-minimizing curves are also minimizers of the curve energy
$$E(\gamma) = \frac{1}{2} \int_0^1 \dot{\gamma}_t^\top \mathbf{M}_{\gamma_t} \dot{\gamma}_t\; \mathrm{d}t,$$
which is easier to optimize than Eq. 11.
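For intuition, the curve length under a latent metric can be approximated on a discretized curve; the following numpy sketch is our own illustration (the `metric` callable stands in for $\mathbf{M}_\mathbf{z} = \mathbf{J}_\mathbf{z}^\top \mathbf{J}_\mathbf{z}$):

```python
import numpy as np

def curve_length(gamma, metric):
    """Discrete Riemannian curve length: sum of sqrt(dz^T M(z) dz) over the
    segments of a discretized latent curve `gamma` (an (N, d) array).
    `metric` maps a latent point to the d x d matrix M_z."""
    length = 0.0
    for z, z_next in zip(gamma[:-1], gamma[1:]):
        dz = z_next - z
        M = metric(0.5 * (z + z_next))  # evaluate the metric at the segment midpoint
        length += np.sqrt(dz @ M @ dz)
    return length
```

With the identity metric this reduces to ordinary Euclidean length, which provides a quick sanity check.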
For generative models, the analysis is complicated by the fact that $f$ is a stochastic mapping, implying that the Jacobian is stochastic, geodesics are stochastic, distances are stochastic, etc. Arvanitidis et al. [6] propose to replace the stochastic metric with its expectation $\overline{\mathbf{M}}_\mathbf{z} = \mathbb{E}[\mathbf{M}_\mathbf{z}]$, which is equivalent to minimizing the expected energy $\mathbb{E}[E(\gamma)]$. While this is shown to work well, the practical algorithm proposed by Arvanitidis et al. amounts to solving a nonlinear differential equation numerically, which requires us to evaluate both the Jacobian and its derivatives. Unfortunately, modern deep learning frameworks such as TensorFlow rely on reverse-mode automatic differentiation [27], which does not efficiently provide full Jacobians. This renders the algorithm of Arvanitidis et al. impractical. A key contribution of this paper is a practical algorithm for computing geodesics that fits within modern deep learning frameworks.
IV Proposed Algorithm to Compute Geodesics
To develop an efficient algorithm for computing geodesics, we first note that the expected curve energy can be written as
$$\mathbb{E}\big[E(\gamma)\big] = \frac{1}{2} \int_0^1 \mathbb{E}\Big[\big\|\partial_t f(\gamma_t)\big\|^2\Big]\, \mathrm{d}t.$$
If we discretize the curve at $N$ points $\mathbf{z}_1, \ldots, \mathbf{z}_N$ (Fig. 3), then this integral can be approximated as
$$\mathbb{E}\big[E(\gamma)\big] \approx \frac{1}{2} \sum_{i=2}^{N} \mathbb{E}\Big[\big\|f(\mathbf{z}_i) - f(\mathbf{z}_{i-1})\big\|^2\Big].$$
Since $f(\mathbf{z}_i)$ and $f(\mathbf{z}_{i-1})$ are independent Gaussian variables, the expectation can be evaluated in closed form as
$$\mathbb{E}\Big[\big\|f(\mathbf{z}_i) - f(\mathbf{z}_{i-1})\big\|^2\Big] = \big\|\mu_\theta(\mathbf{z}_i) - \mu_\theta(\mathbf{z}_{i-1})\big\|^2 + \mathbf{1}^\top\big(\sigma^2_\theta(\mathbf{z}_i) + \sigma^2_\theta(\mathbf{z}_{i-1})\big),$$
and the approximated expected energy can be written
$$\mathbb{E}\big[E(\gamma)\big] \approx \frac{1}{2} \sum_{i=2}^{N} \Big[\big\|\mu_\theta(\mathbf{z}_i) - \mu_\theta(\mathbf{z}_{i-1})\big\|^2 + \mathbf{1}^\top\big(\sigma^2_\theta(\mathbf{z}_i) + \sigma^2_\theta(\mathbf{z}_{i-1})\big)\Big]. \quad (18)$$
This energy is easily interpretable: the first term of the sum corresponds to the curve energy along the expected data manifold, while the second term penalizes curves for traversing highly uncertain regions on the manifold. This implies that geodesics will be attracted to regions of high data density in the latent space.
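The closed-form energy of Eq. 18 is straightforward to implement; a minimal numpy sketch (our notation: `mu` and `var` hold the generator mean and variance evaluated at the discretization points):

```python
import numpy as np

def expected_energy(mu, var):
    """Discretized expected curve energy (Eq. 18).

    mu  : (N, D) array of generator means     mu_theta(z_i)      along the curve.
    var : (N, D) array of generator variances sigma^2_theta(z_i) along the curve.
    """
    diff = np.diff(mu, axis=0)             # mu_theta(z_i) - mu_theta(z_{i-1})
    mean_term = np.sum(diff ** 2)          # energy along the expected manifold
    var_term = np.sum(var[1:] + var[:-1])  # penalty for traversing uncertain regions
    return 0.5 * (mean_term + var_term)
```

Note how a curve of the same mean path but larger variances receives a strictly larger energy, which is exactly why geodesics avoid uncertain regions.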
Unlike the ordinary differential equations of Arvanitidis et al. [6], Eq. 18 can readily be optimized using automatic differentiation as implemented in TensorFlow. We can thus compute geodesics by picking a parametrization of the latent curve and optimizing Eq. 18 with respect to the curve parameters.
IV-A Curve Parametrization
There are many common choices for parametrizing curves, e.g. splines, Gaussian processes or point collections. In the interest of speed, we propose to use the restricted class of quadratic functions, i.e.
$$\gamma_t = \mathbf{a} t^2 + \mathbf{b} t + \mathbf{c}, \qquad t \in [0, 1].$$
A curve thus has free parameters $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$. In practice, we are concerned with geodesic curves that connect two pre-specified points $\mathbf{z}_0$ and $\mathbf{z}_1$, so the quadratic function should be constrained to satisfy $\gamma_0 = \mathbf{z}_0$ and $\gamma_1 = \mathbf{z}_1$, which is easily achieved for quadratics: the constraints fix $\mathbf{c} = \mathbf{z}_0$ and $\mathbf{b} = \mathbf{z}_1 - \mathbf{z}_0 - \mathbf{a}$. Under this constraint, only the entries of $\mathbf{a}$ remain as free parameters to estimate when optimizing (18). Here we perform the optimization using standard gradient descent.
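The constrained quadratic can be sketched as a small helper (a hypothetical function of ours, with the endpoint algebra spelled out):

```python
import numpy as np

def quadratic_curve(a, z0, z1, t):
    """Quadratic latent curve gamma(t) = c + b t + a t^2 constrained so that
    gamma(0) = z0 and gamma(1) = z1; only the vector `a` is free, since the
    constraints fix c = z0 and b = z1 - z0 - a."""
    t = np.asarray(t, dtype=float)[:, None]
    b = z1 - z0 - a  # endpoint constraint gamma(1) = z1
    return z0 + b * t + a * t ** 2
```

In use, one would discretize `t`, push the resulting latent points through the generator, and run gradient descent on `a` against the discretized expected energy (Eq. 18).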
IV-B Specifying Uncertainty
When training the VAE model, the reconstruction term of Eq. 7 ensures that we can expect high-quality reconstructions of the training data. Interpolations between latent training data usually give high-quality reconstructions in densely sampled regions of the latent space, but low-quality reconstructions in regions with low sample density. Ideally, the generator variance $\sigma^2_\theta$ should reflect this.
From the point of view of computing geodesics, the generator variance is important, as it appears directly in the expected curve energy (18). If $\sigma^2_\theta$ is small near the latent data and large away from the data, then geodesics will follow the trend of the data [26], which is a useful property in a clustering context.
In practice, the neural network used to model $\sigma^2_\theta$ is only trained where there is data, and its behavior in-between is governed by the activation functions of the network; e.g. common smooth activations such as softplus imply that variances are smoothly interpolated in-between the training data. This is a most unfortunate property for a variance function; e.g. if the optimal variance is low at all latent training points, then the predicted variance will be low at all points in the latent space. To ensure that variance increases away from latent data, Arvanitidis et al. [6] proposed to model the inverse variance (precision) with an RBF network [30] with isotropic kernels. This is reported to provide meaningful variance estimates.
We found the isotropy assumption too limiting, and instead apply anisotropic kernels. Specifically, we propose to use a rescaled Gaussian mixture model (GMM) to represent the inverse variance function,
$$\frac{1}{\sigma^2_\theta(\mathbf{z})} = \sum_{k=1}^{K} w_k\, \mathcal{N}(\mathbf{z} \,|\, \mathbf{m}_k, \mathbf{C}_k) + w_0,$$
where $\mathbf{m}_k$ and $\mathbf{C}_k$ are the component-wise mean and covariance, and $w_k$ and $w_0$ are positive weights. For simplicity, each component has its own single variance. For all the latent variables, we use the usual EM algorithm [24] to obtain the weights and the mean and covariance of each component; the rescaling weights are trained by solving Eq. 7. Figure 4 gives an example showing the inverse output of the GMM: in regions without data, the GMM gives low values, such that the variance is large, as one would expect.
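Evaluating such a rescaled GMM precision is simple; a numpy sketch with illustrative argument names (in practice `means`, `covs`, and `weights` would come from EM on the latent codes):

```python
import numpy as np

def gmm_precision(z, means, covs, weights, w0=1e-2):
    """Rescaled anisotropic GMM model of the inverse variance (precision).
    The floor weight w0 keeps the precision small far from data, so the
    implied variance 1/precision grows away from the training points."""
    z = np.asarray(z, dtype=float)
    prec = w0
    for m, C, w in zip(means, covs, weights):
        d = z - m
        norm = 1.0 / np.sqrt((2 * np.pi) ** len(m) * np.linalg.det(C))
        prec += w * norm * np.exp(-0.5 * d @ np.linalg.solve(C, d))
    return prec
```

Near a component mean the precision is dominated by the Gaussian term; far away only `w0` remains, i.e. the variance saturates at a large value.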
IV-C Curve Initialization
Once the VAE is fully trained, we can compute geodesics in the latent space. As previously mentioned, we use gradient descent to minimize (18). To improve convergence speed, we here propose an initialization heuristic that we have found to work well.
Since geodesics generally follow the trend of the data, we seek an initial curve with this property. As it can be expensive to evaluate the generator network, we propose to first seek a curve that minimizes the inverse of the GMM model, i.e. the variance. We do this with a simple stochastic optimization akin to a particle filter [31]. This is written explicitly in Algorithm 1.
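Algorithm 1 is not reproduced here; the following is our own sketch of a particle-style initialization consistent with the description above (the `variance` callable stands in for the inverse of the GMM precision):

```python
import numpy as np

def init_curve(z0, z1, variance, n_points=10, n_particles=50, n_iters=20,
               step=0.5, seed=0):
    """Heuristic curve initialization: start from the straight line, randomly
    perturb the interior points, and keep the candidate whose summed variance
    along the curve is lowest, so the initial curve stays near the data."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    best = z0 + t * (z1 - z0)  # straight line as the first candidate
    best_cost = sum(variance(p) for p in best)
    for _ in range(n_iters):
        for _ in range(n_particles):
            cand = best.copy()
            cand[1:-1] += step * rng.standard_normal(cand[1:-1].shape)
            cost = sum(variance(p) for p in cand)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best
```

The endpoints are never perturbed, so the boundary conditions of the geodesic problem are preserved by construction.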
V Experiments
V-A Experimental Pipeline
Throughout the experiments, we use the same three-stage pipeline, which is illustrated in Fig. 6. In the first stage we train a VAE with fixed constant output variance; this VAE has five layers in total (H-enc, M-enc, S-enc, H-dec, M-dec), which are optimized according to Eq. 7. In the second stage, we fit the generator variance represented by the GMM (Sec. IV-B) according to Eq. 7. Finally, in the third stage, we compute geodesics parametrized by quadratics, and compute clusters accordingly. Here we use the $k$-medoids algorithm [32], which only relies on pairwise distances. This decision was made to illustrate the information captured by geodesic distances.
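Since $k$-medoids needs only a pairwise-distance matrix, geodesic distances plug in directly; a basic numpy implementation for illustration (a standard alternating scheme, not the specific implementation used in the experiments):

```python
import numpy as np

def k_medoids(D, k, n_iters=100, seed=0):
    """Basic k-medoids clustering from a pairwise-distance matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iters):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]  # most central member
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids
```

Replacing `D` with the matrix of pairwise geodesic distances is the only change needed to cluster in the curved latent space.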
V-B Visualizing Curvature
A useful visualization tool for the curvature of generative models is the magnification factor [33], which corresponds to the Riemannian volume measure associated with the metric $\mathbf{M}_\mathbf{z}$. For a given Jacobian $\mathbf{J}_\mathbf{z}$, this is defined as
$$\mathrm{MF}(\mathbf{z}) = \sqrt{\det\big(\mathbf{J}_\mathbf{z}^\top \mathbf{J}_\mathbf{z}\big)}.$$
In practice, the Jacobian is a stochastic object, so previous work [6, 34] has proposed to visualize $\sqrt{\det(\mathbb{E}[\mathbf{J}_\mathbf{z}^\top \mathbf{J}_\mathbf{z}])}$. Here we argue that the expectation should be taken as late in the process as possible, and instead visualize the expected volume measure,
$$\mathbb{E}\Big[\sqrt{\det\big(\mathbf{J}_\mathbf{z}^\top \mathbf{J}_\mathbf{z}\big)}\Big].$$
To compute this measure, we split the latent space into small quadrilateral pieces, as in Fig. 7. As the figure shows, each piece is spanned by two vectors $\Delta\mathbf{z}_1$ and $\Delta\mathbf{z}_2$, with corresponding vectors $\Delta\mathbf{x}_1$ and $\Delta\mathbf{x}_2$ in the data space, where $\Delta\mathbf{x}_i = f(\mathbf{z} + \Delta\mathbf{z}_i) - f(\mathbf{z})$. Writing $\mathbf{A} = [\Delta\mathbf{x}_1, \Delta\mathbf{x}_2]$, the volume measure of the piece is
$$\mathbb{E}\Big[\sqrt{\det\big(\mathbf{A}^\top \mathbf{A}\big)}\Big].$$
Here we compute the right-hand side expectation using sampling. As an example visualization, Fig. 8 shows the logarithm of the volume measure associated with the model from Fig. 4. In areas of small volume measure (blue), distances will generally be small, while they will be large in regions of large volume measure (red).
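The sampling-based estimate of the expected volume measure can be sketched as follows (`sample_A` is a user-supplied callable producing one stochastic draw of the patch-edge matrix; the name is ours):

```python
import numpy as np

def expected_volume_measure(sample_A, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[sqrt(det(A^T A))], where the columns of the
    (D x 2) matrix A are stochastic images in data space of the two vectors
    spanning a small latent patch."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        A = sample_A(rng)  # one draw of the stochastic edge vectors
        total += np.sqrt(np.linalg.det(A.T @ A))
    return total / n_samples
```

For a deterministic generator the estimate reduces to the usual magnification factor of the patch.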
V-C Experimental Results
V-C1 The Two-Moon Dataset
As a first illustration, we consider the classic “Two Moon” dataset shown in Fig. 4. For the H-enc and H-dec layers, we use two hidden fully-connected layers with softplus activations; for the S-enc layer, we use one fully-connected layer, again with softplus; and for the M-enc and M-dec layers we use fully-connected layers.
Figure 9 shows the latent space of the resulting VAE along with several quadratic geodesics. We see that the geodesics nicely follow the structure of the data. This also influences the observed clustering structure. Figure 10 shows all pairwise distances using both geodesic and Euclidean distances. It should be noted that the first 50 points belong to the first “moon” while the remaining points belong to the other. From the figure, we see that the geodesic distance reveals the cluster structure much more clearly than the Euclidean counterpart. We validate this by performing $k$-medoids clustering using the two distances. As a baseline, we also apply standard spectral clustering (SC) [5] to the original data. We report clustering accuracy (the ratio of correctly clustered samples to the total number of observations) in Fig. 11 and Table I. It is evident that the geodesic distance reveals the intrinsic structure of the data.
[Figure panels: data samples, reconstructed data, latent variables, original data]
V-C2 Synthetic Anisotropically Distributed Data
[Figure panels: data samples, reconstructed data, latent variables, original data]
Using the same setup as for the two-moon dataset, we generate 100 samples from clusters with anisotropic distributions. Figure 12 shows both the volume measure and the pairwise distances. Again, $k$-medoids clustering shows that the geodesic distance does a much better job at capturing the data structure than the baselines. Clustering accuracies are given in Table II and the found clusters are shown in Fig. 13.
V-C3 The MNIST Dataset
From the well-known MNIST dataset, we take the hand-written digits ’0’, ’1’ and ’2’ to test 2-class and 3-class clustering. For the H-enc and H-dec layers, we use two hidden fully-connected layers with ReLU activations (H-enc layer sizes: 784 to 500 and 500 to 2; H-dec: 2 to 500 and 500 to 784); for the S-enc layer, we use one fully-connected layer with a sigmoid activation function; and for the M-enc and M-dec layers we use fully-connected layers with identity activation functions. Images generated by both networks are shown in Fig. 14.
For the 2-class situation, we use digits ’0’ and ’1’. We select 50 samples from each class and compute their pairwise distances, which are shown in Fig. 15. For the 3-class situation, we select 30 samples from each class and show pairwise distances in Fig. 16. In both cases, the geodesic distance reveals a clear clustering structure. We also see this in $k$-medoids clustering, which outperforms the baselines (Table III).
[Figure panels: data samples, reconstructed data, latent variables, original data]
V-C4 The Fashion-MNIST Dataset
Fashion-MNIST [35] is a dataset of Zalando’s article images; each is a gray-scale image. We consider the classes ’T-shirt’, ’Sandal’ and ’Bag’ to test 2-class and 3-class clustering. For the H-enc and H-dec layers, we use three hidden fully-connected layers with ReLU activations (H-enc layer sizes: 784 to 500, 500 to 200, and 200 to 100; H-dec: 100 to 200, 200 to 500, and 500 to 784); for the S-enc layer, we use one fully-connected layer with a sigmoid activation function; and for M-enc and M-dec we use fully-connected layers with identity and sigmoid activation functions, respectively. Images generated by the networks are shown in Fig. 17.
For the 2-class situation, we use the ’T-shirt’ and ’Sandal’ samples to train the VAE. We select 50 samples from the ’T-shirt’ and ’Sandal’ classes respectively, and compute pairwise distances (see Fig. 18). For the 3-class situation, we select 30 samples from each class and compute distances (Fig. 19). As before, we see that $k$-medoids clustering with geodesic distances significantly outperforms the baselines; see Table IV for numbers.
[Figure panels: data samples, generated data, latent variables, original data]
V-C5 The EMNIST-Letter Dataset
The EMNIST-letter dataset [36] is a set of handwritten alphabet characters derived from the NIST Special Database and converted to gray-scale images. We select the characters ’D’ and ’d’ as two classes, and fit a VAE with the same network architecture as the one used for Fashion-MNIST. Generated images are shown in Fig. 20.
We select 50 samples from ’D’ and ’d’ respectively and show pairwise distances in Fig. 21. Again, $k$-medoids clustering shows that the geodesic distance reflects the intrinsic structure, which improves clustering over the baselines; cf. Table V.
[Figure panels: data samples, reconstructed data, latent variables, original data]
VI Conclusion
In this paper, we have proposed an efficient algorithm for computing shortest paths (geodesics) along the data manifolds spanned by deep generative models. Unlike previous work, the proposed algorithm is easy to implement and fits well with modern deep learning frameworks. We have also proposed a new network architecture for representing variances in variational autoencoders. With these two tools in hand, we have shown that simple distance-based clustering works remarkably well in the latent space of a deep generative model, even if the model is not trained for clustering tasks. Still, the dimensionality of the latent space, the form of the curve parametrization, and the modeling of the generator variance are worth developing further to obtain a more robust geodesic computation algorithm.
TY was supported by the National Key R&D Program of China (No. 2017YFB0702104). SH was supported by a research grant (15334) from VILLUM FONDEN. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 757360).
[1] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Proceedings of the 28th Neural Information Processing Systems (NIPS), Montréal, Canada, 2014, pp. 3581–3589.
[2] E. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a Laplacian pyramid of adversarial networks,” in Proceedings of the 29th Neural Information Processing Systems (NIPS), Montréal, Canada, 2015, pp. 1486–1494.
[3] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, Canada, 2014.
[4] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 2014.
[5] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, pp. 395–416, 2007.
[6] G. Arvanitidis, L. Hansen, and S. Hauberg, “Latent space oddity: on the curvature of deep generative models,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018.
[7] B. Yang, X. Fu, N. Sidiropoulos, and M. Hong, “Towards k-means-friendly spaces: simultaneous deep learning and clustering,” in Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017, pp. 3861–3870.
[8] P. Huang, Y. Huang, W. Wang, and L. Wang, “Deep embedding network for clustering,” in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR), Stockholm, Sweden, 2014, pp. 1532–1537.
[9] D. Chen, J. Lv, and Z. Yi, “Unsupervised multi-manifold clustering by learning deep representation,” in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2017, pp. 385–391.
[10] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, “Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 5736–5745.
[11] S. A. Shah and V. Koltun, “Deep continuous clustering,” arXiv:1803.01449, 2018.
[12] S. A. Shah and V. Koltun, “Robust continuous clustering,” Proceedings of the National Academy of Sciences of the United States of America, vol. 114, pp. 9814–9819, 2017.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the 26th Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, 2012.
[14] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief networks for natural language understanding,” ACM Transactions on Audio, Speech & Language Processing, vol. 22, pp. 778–784, 2014.
[15] G. Chen, “Deep learning with nonparametric clustering,” arXiv:1501.03084, 2015.
[16] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, 2016, pp. 478–487.
[17] C. C. Hsu and C. W. Lin, “CNN-based joint clustering and representation learning with feature drift compensation for large-scale image data,” IEEE Transactions on Multimedia, vol. 20, pp. 421–429, 2018.
[18] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016, pp. 5147–5156.
[19] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep adaptive image clustering,” in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, 2017, pp. 5879–5887.
[20] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, “Variational deep embedding: an unsupervised and generative approach to clustering,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 2017, pp. 1965–1972.
[21] N. Dilokthanakul, P. Mediano, and M. Garnelo, “Deep unsupervised clustering with Gaussian mixture variational autoencoders,” arXiv:1611.02648, 2016.
[22] W. Harchaoui, P. A. Mattei, and C. Bouveyron, “Deep adversarial Gaussian mixture auto-encoder for clustering,” in Workshop of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017.
[23] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: interpretable representation learning by information maximizing generative adversarial nets,” in Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016, pp. 2172–2180.
[24] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[25] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian Geometry. Springer, 1990, vol. 3.
[26] S. Hauberg, “Only Bayes should learn a manifold,” 2018.
[27] L. B. Rall, “Automatic differentiation: techniques and applications,” 1981.
[28] P. Hennig and S. Hauberg, “Probabilistic solutions to differential equations and their application to Riemannian statistics,” in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 33, 2014.
[29] S. Laine, “Feature-based metrics for exploring the latent space of generative models,” ICLR workshops, 2018.
[30] Q. Que and M. Belkin, “Back to the future: radial basis function networks revisited,” in Artificial Intelligence and Statistics (AISTATS), 2016.
[31] O. Cappé, S. J. Godsill, and E. Moulines, “An overview of existing methods and recent advances in sequential Monte Carlo,” Proceedings of the IEEE, vol. 95, no. 5, pp. 899–924, 2007.
[32] L. Kaufman and P. Rousseeuw, Clustering by Means of Medoids. North-Holland, 1987.
[33] C. M. Bishop, M. Svensen, and C. K. Williams, “Magnification factors for the GTM algorithm,” 1997.
[34] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence, “Metrics for probabilistic geometries,” in The Conference on Uncertainty in Artificial Intelligence (UAI), Jul. 2014.
[35] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” 2017.
[36] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” arXiv:1702.05373, 2017.