Geodesic Clustering in Deep Generative Models

09/13/2018 ∙ by Tao Yang, et al. ∙ DTU

Deep generative models are tremendously successful in learning low-dimensional latent representations that describe the data well. These representations, however, tend to distort relationships between points considerably, i.e. pairwise distances often do not reflect semantic similarities. This renders unsupervised tasks, such as clustering, difficult when working with the latent representations. We demonstrate that taking the geometry of the generative model into account is sufficient to make simple clustering algorithms work well over latent representations. Leaning on the recent finding that deep generative models constitute stochastically immersed Riemannian manifolds, we propose an efficient algorithm for computing geodesics (shortest paths) and distances in the latent space, while taking its distortion into account. We further propose a new architecture for modeling uncertainty in variational autoencoders, which is essential for understanding the geometry of deep generative models. Experiments show that the resulting geodesic distance reflects the internal structure of the data.


I Introduction

Unsupervised learning is generally considered one of the greatest challenges of machine learning research. In recent years, there has been great progress in modeling data distributions using deep generative models [1, 2], and while this progress has influenced the clustering literature, its full potential has yet to be reached.

Consider a latent variable model

p(x) = \int p(x \mid z) \, p(z) \, \mathrm{d}z,    (1)

where the latent variables z ∈ Z provide a low-dimensional representation of the data x ∈ X. In general, the prior p(z) determines whether clustering of the latent variables is successful; e.g. the common Gaussian prior, p(z) = N(0, I), tends to move clusters closer together, making post hoc clustering difficult (see Fig. 1). This problem is particularly evident in deep generative models such as variational autoencoders (VAEs) [3, 4] that pick

p_\theta(x \mid z) = \mathcal{N}\!\left( x \mid \mu_\theta(z), \, \mathbb{I}_D \sigma^2_\theta(z) \right),    (2)

where the mean μ_θ(z) and variance σ²_θ(z) are parametrized by deep neural networks with parameters θ. The flexibility of such networks ensures that the latent variables can be made to follow almost any prior p(z), implying that the latent variables can be forced to show almost any structure, including structure not present in the data. This does not influence the distribution p(x), but it can be detrimental for clustering.

Fig. 1: The latent space of a generative model is highly distorted and will often lose clustering structure.

These concerns indicate that one should be very careful when computing distances in the latent space of deep generative models. As these models (informally) span a manifold embedded in the data space, one can consider measuring distances along this manifold; an idea that shares intuitions with classic approaches such as spectral clustering [5]. Arvanitidis et al. [6] have recently shown that measuring along the data manifold associated with a deep generative model can be achieved by endowing the latent space with a Riemannian metric and measuring distances accordingly. Unfortunately, the approach of Arvanitidis et al. requires numerical solutions to a system of ordinary differential equations, which cannot be readily evaluated using standard frameworks for deep learning (see Sec. II). In this paper, we propose an efficient algorithm for evaluating these distances and demonstrate its usefulness for clustering tasks.

II Related Work

Clustering, as a fundamental problem in machine learning, depends heavily on the quality of the data representation. Recently, deep neural networks have become useful for learning clustering-friendly representations. We see four categories of work based on network structure: autoencoders (AE), deep neural networks (DNN), generative adversarial networks (GAN) and variational autoencoders (VAE).

Among AE-based methods, Deep Clustering Networks [7] directly combine the loss functions of autoencoders and k-means, while Deep Embedding Network [8] also revises the loss function by adding locality-preserving and group-sparsity constraints to guide the network towards clustering. Deep Multi-Manifold Clustering [9] introduces a manifold locality-preserving loss and proximity to cluster centroids, while Deep Embedded Regularized Clustering [10] establishes a non-trivial structure of convolutional and softmax autoencoders and proposes an entropy loss with a clustering regularizer. Deep Continuous Clustering [11] inherits the continuity property of Robust Continuous Clustering [12], a formulation with a clear continuous objective and no prior knowledge of the number of clusters, to integrate network parameter learning and clustering. The AE-based methods are easy to implement but introduce hyper-parameters to the loss and are very limited in network depth.

Among DNN-based methods, the networks can be very flexible, such as Convolutional Neural Networks [13] or Deep Belief Networks [14], and often involve pre-training and fine-tuning stages. Deep Nonparametric Clustering [15] and Deep Embedded Clustering [16] are representative works. Since the result is sensitive to the network initialization, Clustering Convolutional Neural Networks [17] were proposed with initial cluster centroids. To get rid of pre-training, Joint Unsupervised Learning [18] and Deep Adaptive Image Clustering [19] were proposed for hierarchical clustering and binary relationships between images, respectively.

Among VAE-based methods, because the VAE is a generative model, Variational Deep Embedding [20] and the Gaussian Mixture VAE [21] design special prior distributions over the latent representation and infer data classes corresponding to the modes of the different priors.

Among GAN-based methods, Deep Adversarial Clustering [22] was inspired by the ideas behind Variational Deep Embedding [20], but with a GAN structure. The Information Maximizing Generative Adversarial Network [23] can disentangle both discrete and continuous latent representations and models a clustering function with categorical values over the latent codes. AE-based and DNN-based methods are designed specifically for clustering but do not consider the underlying structure of the data and cannot generate data. VAE-based and GAN-based methods can generate samples and infer the structure of the data, but because they change the latent space for clustering, they may distort the true intrinsic structure of the data.

Our work is based on the recent observation that deep generative models immerse random Riemannian manifolds [6]. This implies a change in the way distances are measured in the latent space, which reveals a clustering structure. Unfortunately, practical algorithms for actually computing such distances have been missing, and providing one is the main focus of the present paper. With such an algorithm in hand, clustering can be performed with high accuracy in the latent space of an off-the-shelf VAE.

The paper is organized as follows: Section III introduces the usual VAE, along with its interpretation as a stochastic Riemannian manifold. In Sec. IV we derive an efficient algorithm for computing geodesics (shortest paths) over this manifold, and in Sec. V we demonstrate its usefulness for clustering tasks. The paper is concluded in Sec. VI.

III Background on Variational Autoencoders

Deep generative modeling is an area of machine learning that deals with models of a distribution p(x) defined over a potentially high-dimensional space X. Deep generative models capture data dependencies by learning low-dimensional latent variables z that form a latent space Z. In recent years, the variational autoencoder (VAE) has emerged as one of the most popular deep generative models because it can be built on top of deep neural networks and trained with stochastic gradient descent. The VAE aims to maximize the probability of the data samples, generated as

p_\theta(x) = \int p_\theta(x \mid z) \, p(z) \, \mathrm{d}z.    (3)

Here, the latent variables z are sampled according to a probability density function p(z) defined over Z, and the distribution p_θ(x | z) denotes the likelihood parametrized by θ. In VAEs the likelihood is often Gaussian,

p_\theta(x \mid z) = \mathcal{N}\!\left( x \mid \mu_\theta(z), \, \mathbb{I}_D \sigma^2_\theta(z) \right),    (4)

where μ_θ(z) is the mean function and σ²_θ(z) is the covariance function.

III-A Inference and Generator

The VAE consists of two parts: an inference network and a generator network, that serve almost the same roles as encoders and decoders in classic autoencoders.

III-A1 The Inference Network

The inference network is trained to map the training data samples x to the latent space Z while forcing the latent variables to comply with the prior p(z). However, both the posterior distribution p(z | x) and p(x) are unknown. The VAE therefore approximates the posterior with a variational distribution q_φ(z | x), computed by a network with parameters φ. To make q_φ(z | x) accord with the prior p(z), the Kullback-Leibler (KL) divergence [24] is used, that is:

\mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log q_\phi(z \mid x) - \log p(z) \right].    (5)
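For the common case where q_φ(z | x) is a diagonal Gaussian and p(z) = N(0, I), this KL term has a well-known closed form. The following numpy sketch illustrates it; the function name and array shapes are ours, not part of the original implementation.

import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions.
    # mu, log_var: arrays of shape (batch, latent_dim) produced by the inference network.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)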

III-A2 The Generator Network

The generator network is trained to map the latent variables z to generated data samples that resemble the true samples from the data space X. This network should maximize the (log) marginal likelihood over the whole latent space; in practice the log-likelihood is computed by a multi-layer network with parameters θ:

\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right].    (6)

From these parts a VAE is jointly trained as

\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right] - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right).    (7)
Fig. 2: Expectation of samples from the generative distribution.
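To make the joint objective concrete, the sketch below assembles (the negative of) Eq. 7 for a Gaussian likelihood, reusing the KL helper above. The Gaussian log-likelihood is written up to an additive constant, and all names and shapes are illustrative assumptions rather than the authors' code.

def negative_elbo(x, enc_mu, enc_log_var, dec_mu, dec_log_var):
    # Reconstruction term (Eq. 6): -log N(x | dec_mu, diag(exp(dec_log_var))), up to a constant.
    recon = 0.5 * np.sum((x - dec_mu) ** 2 / np.exp(dec_log_var) + dec_log_var, axis=-1)
    # Regularization term (Eq. 5) in closed form.
    kl = kl_to_standard_normal(enc_mu, enc_log_var)
    # Eq. 7 maximizes the ELBO; here we return the quantity to be minimized.
    return np.mean(recon + kl)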

Iii-B The Random Riemannian Interpretation

The inference network should force the latent variables to approximately follow the pre-specified unit Gaussian prior p(z) = N(0, I), which implies that the latent space gives a highly distorted view of the original data. Fortunately, this distortion is fairly easy to characterize [6]. First, observe that the generative model of the VAE can be written as (using the so-called re-parametrization trick; see also Fig. 2)

x = f(z) = \mu_\theta(z) + \sigma_\theta(z) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbb{I}_D).    (8)

Now let z be a latent variable and let Δz be infinitesimal. Then we can measure the distance between f(z) and f(z + Δz) in the input space using Taylor's theorem,

\| f(z + \Delta z) - f(z) \|^2 \approx \| J_z \Delta z \|^2    (9)
= \Delta z^\top \left( J_z^\top J_z \right) \Delta z,    (10)

where J_z = ∂f/∂z denotes the Jacobian of f at z. This implies that M_z = J_zᵀ J_z defines a local inner product under which we can define curve lengths through integration,

\mathrm{Length}(\gamma) = \int_0^1 \sqrt{ \dot{\gamma}_t^\top M_{\gamma_t} \dot{\gamma}_t } \, \mathrm{d}t.    (11)

Here γ: [0, 1] → Z is a curve in the latent space and \dot{\gamma}_t is its velocity. Distances can then be defined as the length of the shortest curve (geodesic) connecting two points,

\mathrm{dist}(z_0, z_1) = \min_\gamma \mathrm{Length}(\gamma)    (12)
\text{subject to} \quad \gamma_0 = z_0, \quad \gamma_1 = z_1.    (13)

This is the traditional Riemannian analysis associated with embedded surfaces [25]. From this, it is well-known that length-minimizing curves are minimizers of energy

E(\gamma) = \int_0^1 \dot{\gamma}_t^\top M_{\gamma_t} \dot{\gamma}_t \, \mathrm{d}t,    (14)

which is easier to optimize than Eq. 11.
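For completeness, the standard argument is a one-line application of the Cauchy-Schwarz inequality:

\mathrm{Length}(\gamma)^2 = \left( \int_0^1 \sqrt{ \dot{\gamma}_t^\top M_{\gamma_t} \dot{\gamma}_t } \, \mathrm{d}t \right)^{\!2} \le \left( \int_0^1 1 \, \mathrm{d}t \right) \int_0^1 \dot{\gamma}_t^\top M_{\gamma_t} \dot{\gamma}_t \, \mathrm{d}t = E(\gamma),

with equality exactly when the integrand is constant, i.e. when the curve has constant speed. A minimizer of the energy is therefore a constant-speed minimizer of the length.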

For generative models, the analysis is complicated by the fact that f is a stochastic mapping, implying that the Jacobian is stochastic, geodesics are stochastic, distances are stochastic, etc. Arvanitidis et al. [6] propose to replace the stochastic metric with its expectation, which is equivalent to minimizing the expected energy [26]. While this is shown to work well, the practical algorithm proposed by Arvanitidis et al. amounts to numerically solving a nonlinear differential equation, which requires us to evaluate both the Jacobian J_z and its derivatives. Unfortunately, modern deep learning frameworks such as TensorFlow rely on reverse-mode automatic differentiation [27], which does not efficiently provide full Jacobians. This renders the algorithm of Arvanitidis et al. impractical. A key contribution of this paper is a practical algorithm for computing geodesics that fits within modern deep learning frameworks.

Fig. 3: Discretization of the curve parameter.

IV Proposed Algorithm to Compute Geodesics

To develop an efficient algorithm for computing geodesics, we first note that the expected curve energy can be written as

\bar{E}(\gamma) = \mathbb{E}\!\left[ \int_0^1 \left\| \frac{\partial f(\gamma_t)}{\partial t} \right\|^2 \mathrm{d}t \right].    (15)

If we discretize the curve at N points t_1 < ... < t_N (Fig. 3), then this integral can be approximated as

\bar{E}(\gamma) \approx \frac{1}{\Delta t} \sum_{i=1}^{N-1} \mathbb{E}\!\left[ \left\| f(\gamma_{t_{i+1}}) - f(\gamma_{t_i}) \right\|^2 \right], \qquad \Delta t = t_{i+1} - t_i.    (16)

Since f(z) = μ_θ(z) + σ_θ(z) ⊙ ε for a single draw of ε ~ N(0, I_D), the expectation can be evaluated in closed form as

\mathbb{E}\!\left[ \left\| f(\gamma_{t_{i+1}}) - f(\gamma_{t_i}) \right\|^2 \right] = \left\| \mu_\theta(\gamma_{t_{i+1}}) - \mu_\theta(\gamma_{t_i}) \right\|^2 + \left\| \sigma_\theta(\gamma_{t_{i+1}}) - \sigma_\theta(\gamma_{t_i}) \right\|^2,    (17)

and the approximated expected energy can be written

\bar{E}(\gamma) \approx \frac{1}{\Delta t} \sum_{i=1}^{N-1} \left( \left\| \mu_\theta(\gamma_{t_{i+1}}) - \mu_\theta(\gamma_{t_i}) \right\|^2 + \left\| \sigma_\theta(\gamma_{t_{i+1}}) - \sigma_\theta(\gamma_{t_i}) \right\|^2 \right).    (18)

This energy is easily interpretable: the first term of the sum corresponds to the curve energy along the expected data manifold, while the second term penalizes curves for traversing highly uncertain regions on the manifold. This implies that geodesics will be attracted to regions of high data density in the latent space.

Unlike the ordinary differential equations of Arvanitidis et al. [6], Eq. 18 can readily be optimized using automatic differentiation as implemented in TensorFlow. We can thus compute geodesics by picking a parametrization of the latent curve and optimizing Eq. 18 with respect to the curve parameters.
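As an illustration, a minimal numpy sketch of the discretized objective in Eq. 18 as written above; mu_fn and sigma_fn stand for the generator's mean and standard-deviation functions and are assumptions of this sketch, not the authors' implementation.

def expected_curve_energy(curve_pts, mu_fn, sigma_fn):
    # curve_pts: (N, d) discretization of the latent curve gamma at t_1 < ... < t_N.
    mu = mu_fn(curve_pts)        # (N, D) decoded means
    sigma = sigma_fn(curve_pts)  # (N, D) decoded standard deviations
    dt = 1.0 / (len(curve_pts) - 1)
    # First term of Eq. 18: energy along the expected manifold; second term: uncertainty penalty.
    d_mu = np.sum((mu[1:] - mu[:-1]) ** 2, axis=-1)
    d_sigma = np.sum((sigma[1:] - sigma[:-1]) ** 2, axis=-1)
    return np.sum(d_mu + d_sigma) / dt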

IV-A Curve Parametrization

There are many common choices for parametrizing curves, e.g. splines [6], Gaussian processes [28] or point collections [29]. In the interest of speed, we propose to use the restricted class of quadratic functions, i.e.

\gamma(t) = a t^2 + b t + c, \qquad a, b, c \in \mathbb{R}^d.    (19)

A curve thus has free parameters a, b and c. In practice, we are concerned with geodesic curves that connect two pre-specified points z_0 and z_1, so the quadratic function should be constrained to satisfy γ(0) = z_0 and γ(1) = z_1, which is easily achieved for quadratics. Under this constraint, only d free parameters remain to be estimated when optimizing Eq. 18. Here we perform the optimization using standard gradient descent.
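A sketch of how such a constrained quadratic curve can be parametrized and optimized follows; the endpoint constraints are enforced by construction, the gradient is taken numerically here to keep the sketch framework-free (the paper uses automatic differentiation), and expected_curve_energy is reused from the previous sketch.

def quadratic_curve(t, z0, z1, w):
    # gamma(t) = (1 - t) z0 + t z1 + t (1 - t) w satisfies gamma(0) = z0 and gamma(1) = z1;
    # w in R^d holds the remaining free parameters of the quadratic.
    t = t[:, None]
    return (1.0 - t) * z0 + t * z1 + t * (1.0 - t) * w

def fit_geodesic(z0, z1, mu_fn, sigma_fn, n_pts=32, n_steps=200, lr=0.1, eps=1e-4):
    t = np.linspace(0.0, 1.0, n_pts)
    w = np.zeros_like(z0)  # w = 0 gives the straight line between z0 and z1
    for _ in range(n_steps):
        base = expected_curve_energy(quadratic_curve(t, z0, z1, w), mu_fn, sigma_fn)
        grad = np.zeros_like(w)
        for i in range(len(w)):  # finite-difference gradient w.r.t. the curve parameters
            w_eps = w.copy()
            w_eps[i] += eps
            grad[i] = (expected_curve_energy(quadratic_curve(t, z0, z1, w_eps), mu_fn, sigma_fn) - base) / eps
        w -= lr * grad
    return quadratic_curve(t, z0, z1, w)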

IV-B Specifying Uncertainty

When training the VAE, the reconstruction term of Eq. 7 ensures that we can expect high-quality reconstructions of the training data. Interpolations between latent training points usually give high-quality reconstructions in densely sampled regions of the latent space, but low-quality reconstructions in regions with low sample density. Ideally, the generator variance σ²_θ(z) should reflect this.

From the point of view of computing geodesics, the generator variance is important as it appears directly in the expected curve energy (18). If σ²_θ(z) is small near the latent data and large away from the data, then geodesics will follow the trend of the data [26], which is a useful property in a clustering context.

In practice, the neural network used to model σ²_θ(z) is only trained where there is data, and its behavior in between is governed by the activation functions of the network; common activations such as softplus imply that variances are smoothly interpolated in between the training data. This is a most unfortunate property for a variance function; e.g. if the optimal variance is low at all latent training points, then the predicted variance will be low at all points in the latent space. To ensure that the variance increases away from the latent data, Arvanitidis et al. [6] proposed to model the inverse variance (precision) with an RBF network [30] with isotropic kernels, which is reported to provide meaningful variance estimates.

We found the isotropic assumption to be too limiting, and instead apply anisotropic kernels. Specifically, we propose to use a rescaled Gaussian Mixture Model (GMM) to represent the inverse variance function,

\sigma_\theta^{-2}(z) = \sum_{k=1}^{K} w_k \, \mathcal{N}(z \mid m_k, C_k) + \zeta,    (20)

where m_k and C_k are the component-wise means and covariances, and w_k and ζ are positive weights. For simplicity, each component is given its own single variance. For all latent variables z, we use the usual EM algorithm [24] to obtain the weights, means and covariances of the components; the rescaling is then trained by solving Eq. 7. Figure 4 gives an example, showing the inverse output of σ_θ^{-2}(z): in regions without data the model gives low values, so the variance is large, as one would expect.

Fig. 4: (a) The latent samples of the two-moon data. (b) The GMM fit in the latent space. (c) The logarithm of the variance.
Fig. 5: Construction of the unified VAE and geodesic-computation network.
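A possible realization of this uncertainty model with scikit-learn is sketched below. The global scale a and the constant eps stand in for the positive weights w_k and ζ of Eq. 20 (which the paper fits via Eq. 7) and are simplifications of this illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_precision_model(latent_Z, n_components=10):
    # Fit an anisotropic GMM to the latent codes with the usual EM algorithm.
    return GaussianMixture(n_components=n_components, covariance_type='full').fit(latent_Z)

def variance_at(z, gmm, a=1.0, eps=1e-3):
    # Rescaled mixture density as the inverse variance: high density -> small variance.
    precision = a * np.exp(gmm.score_samples(np.atleast_2d(z))) + eps
    return 1.0 / precision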

IV-C Curve Initialization

Once the VAE is fully trained, we can compute geodesics in the latent space. As previously mentioned, we use gradient descent to minimize Eq. 18. To improve convergence speed, we here propose an initialization heuristic that we have found to work well.

Since geodesics generally follow the trend of the data [6], we seek an initial curve with this property. As it can be expensive to evaluate the generator network, we propose to first seek a curve that minimizes the inverse of the GMM model (20), i.e. the variance, along the curve. We do this with a simple stochastic optimization akin to a particle filter [31]. This is written explicitly in Algorithm 1.

1:  Set the initial curve parameters to the straight line between the two endpoints
2:  for each optimization step do
3:     Let the current best parameters and cost be those of the previous step
4:     Sample a set of candidate parameters around the current best
5:     for each candidate index do
6:        Discretize the candidate curve and evaluate its cost, i.e. the summed variance (Eq. 20) along the curve
7:        Record the candidate cost
8:     end for
9:     Keep the best parameters over all indexes
10:  end for
11:  Initialize the curve parameters of Eq. 19 from the best candidate found
Algorithm 1 The pseudo-code for initializing the curve parameters
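A possible reading of Algorithm 1 in Python is given below; it perturbs the free curve parameter, keeps the best candidate under the GMM-based variance cost, and is only a sketch of the heuristic (reusing quadratic_curve and variance_at from the sketches above).

def initialize_curve(z0, z1, gmm, n_pts=32, n_steps=20, n_candidates=16, scale=0.5):
    t = np.linspace(0.0, 1.0, n_pts)
    def cost(w):
        # Summed variance estimate along the discretized candidate curve.
        return sum(variance_at(p, gmm)[0] for p in quadratic_curve(t, z0, z1, w))
    best_w = np.zeros_like(z0)  # start from the straight line
    best_c = cost(best_w)
    for _ in range(n_steps):
        # Sample candidate parameter sets around the current best and keep the winner.
        for w in best_w + scale * np.random.randn(n_candidates, len(z0)):
            c = cost(w)
            if c < best_c:
                best_w, best_c = w, c
    return best_w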

V Experiments and Implementation Details

V-A Experimental Pipeline

Throughout the experiments, we use the same three-stage pipeline, illustrated in Fig. 6. In the first stage, we train a VAE with a fixed constant output variance; this VAE has five blocks in total (H-enc, M-enc, S-enc, H-dec, M-dec), which are optimized according to Eq. 7. In the second stage, we fit the generator variance, represented by the GMM model of Sec. IV-B, according to Eq. 7. Finally, in the third stage, we compute geodesics parametrized as in Eq. 19 and compute clusters accordingly. Here we use the k-medoids algorithm [32], which relies only on pairwise distances; this decision was made to illustrate the information captured by geodesic distances.

Fig. 6: The three stages of our pipeline. See text for details.
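Because the clustering step only needs a matrix of pairwise distances, a small self-contained k-medoids (Voronoi-iteration style) suffices; the sketch below is a generic implementation for illustration, not the one used in the experiments.

def k_medoids(D, k, n_iter=100, seed=0):
    # D: (n, n) symmetric matrix of pairwise (geodesic) distances; returns labels and medoid indices.
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign each point to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                # the new medoid minimizes the total distance to the other cluster members
                new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids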

V-B Visualizing Curvature

A useful visualization tool for the curvature of generative models is the magnification factor [33], which corresponds to the Riemannian volume measure associated with the metric [34]. For a given Jacobian J_z, this is defined as

\mathrm{MF}(z) = \sqrt{ \det\!\left( J_z^\top J_z \right) }.    (21)

In practice, the Jacobian is a stochastic object, so previous work [6, 34] has proposed to visualize √det(E[J_zᵀ J_z]). Here we argue that the expectation should be taken as late in the process as possible, and instead visualize the expected volume measure,

\mathrm{VM}(z) = \mathbb{E}\!\left[ \sqrt{ \det\!\left( J_z^\top J_z \right) } \right].    (22)

To compute this measure, we split the latent space into small square pieces, as in Fig. 7.

Fig. 7: The volume measure describes the volume of an infinitesimal box in the latent space measured in the input space.

As we can see from the figure, a small square at z is spanned by two vectors Δz_1 and Δz_2 in the latent space; the corresponding vectors in the input space are Δx_1 = f(z + Δz_1) − f(z) and Δx_2 = f(z + Δz_2) − f(z). Writing ΔX = [Δx_1, Δx_2], the volume measure is approximated as:

\mathrm{VM}(z) \approx \mathbb{E}\!\left[ \sqrt{ \det\!\left( \Delta X^\top \Delta X \right) } \right].    (23)

Here we compute the right-hand-side expectation using sampling. As an example visualization, Fig. 8 shows the logarithm of the volume measure associated with the model from Fig. 4. In areas of small volume measure (blue), distances will generally be small, while they will be large in regions of large volume measure (red).

Fig. 8: The logarithmic results of the volume measure
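A minimal Monte Carlo sketch of the estimate in Eq. 23 for a 2-dimensional latent space; sample_decoder stands for one stochastic draw f(z) = mu_theta(z) + sigma_theta(z) * eps and, like the step size h, is an assumption of this illustration.

def expected_volume_measure(z, sample_decoder, h=1e-2, n_samples=50):
    # Map the two edges of a small latent square at z into the input space and
    # average sqrt(det(dX^T dX)) over stochastic draws of the generator (Eq. 23).
    e1, e2 = np.array([h, 0.0]), np.array([0.0, h])
    vols = []
    for _ in range(n_samples):
        dx1 = sample_decoder(z + e1) - sample_decoder(z)
        dx2 = sample_decoder(z + e2) - sample_decoder(z)
        G = np.array([[dx1 @ dx1, dx1 @ dx2],
                      [dx2 @ dx1, dx2 @ dx2]])
        vols.append(np.sqrt(max(np.linalg.det(G), 0.0)))
    return float(np.mean(vols))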

V-C Experimental Results

V-C1 The Two-Moon Dataset

As a first illustration, we consider the classic "two-moon" data set shown in Fig. 4. For the H-enc and H-dec layers, we use two hidden fully-connected layers with softplus activations; for the S-enc layer, we use one fully-connected layer, again with softplus; and for the M-enc and M-dec layers we use fully-connected layers.

Fig. 9: Examples of the optimized quadratic curves in the latent space for the two-moon dataset.

Figure 9 shows the latent space of the resulting VAE along with several quadratic geodesics. We see that the geodesics nicely follow the structure of the data. This also influences the observed clustering structure. Figure 10 shows all pairwise distances using both geodesic and Euclidean distances. It should be noted that the first 50 points belong to the first "moon" while the remaining points belong to the other. From the figure, we see that the geodesic distance reveals the cluster structure much more clearly than the Euclidean counterpart. We validate this by performing k-medoids clustering using the two distances. As a baseline, we also apply standard spectral clustering (SC) [5] to the original data. We report clustering accuracy (the ratio of correctly clustered samples to the total number of observations) in Fig. 11 and Table I. It is evident that the geodesic distance reveals the intrinsic structure of the data.


Fig. 10: Comparison of distance matrices. Left: geodesic distances optimized in the reconstructed data space. Right: Euclidean distances in the latent space.
Fig. 11: (a) The two-moon clustering result of k-medoids with geodesic distances. (b) The result of k-medoids with Euclidean distances in the latent space. (c) The clustering result of SC.
method        | k-medoids          | k-medoids        | SC
data samples  | reconstructed data | latent variables | original data
distance      | Geodesic           | Euclidean        | Euclidean
accuracy      | 1.00               | 0.86             | 0.92
TABLE I: Two-moon dataset clustering accuracy

V-C2 Synthetic Anisotropically Distributed Data

Fig. 12: (a): The logarithm of the volume measure in the latent space. (b) Left: The optimized geodesic pair-wise distance. Right: The Euclidean pair-wise distance in latent space.
method        | k-medoids          | k-medoids        | SC
data samples  | reconstructed data | latent variables | original data
distance      | Geodesic           | Euclidean        | Euclidean
accuracy      | 1.00               | 0.80             | 0.96
TABLE II: Anisotropic data samples clustering accuracy
Fig. 13: (a) The anisotropically distributed samples clustered by k-medoids with geodesic distances. (b) The result of k-medoids with Euclidean distances in the latent space. (c) The clustering result of SC.

Using the same setup as for the two-moon dataset, we generate 100 samples from clusters with anisotropic distributions. Figure 12 shows both the volume measure and the pairwise distances. Again, k-medoids clustering shows that the geodesic distance does a much better job of capturing the data structure than the baselines. Clustering accuracy is reported in Table II and the found clusters are shown in Fig. 13.

V-C3 The MNIST Dataset

From the well-known MNIST dataset, we take the hand-written digits '0', '1' and '2' to test 2-class and 3-class clustering. For the H-enc and H-dec layers, we use two hidden fully-connected layers with ReLU activations (H-enc: 784 to 500 and 500 to 2 nodes; H-dec: 2 to 500 and 500 to 784 nodes); for the S-enc layer, we use one fully-connected layer with a sigmoid activation function; and for the M-enc and M-dec layers we use fully-connected layers with identity activation functions. Images generated by both networks are shown in Fig. 14.

Fig. 14: Generated ’0’, ’1’ and ’2’ examples for MNIST dataset by the VAE used in this paper.

For the 2-class situation, we use digits '0' and '1'. We select 50 samples from each class and compute their pair-wise distances, which are shown in Fig. 15. For the 3-class situation, we select 30 samples from each class and show pair-wise distances in Fig. 16. In both cases, the geodesic distance reveals a clear clustering structure. We also see this in k-medoids clustering, which outperforms the baselines (Table III).

Fig. 15: (a): The logarithm of the volume measure in latent space for ’0’ and ’1’ digit images. (b) Left: pair-wise geodesic distances. Right: pair-wise Euclidean distances in latent space.
Fig. 16: (a): The logarithm of the volume measure in latent space for ’0’, ’1’ and ’2’ digit images. (b) Left: The optimized geodesic pair-wise distance. Right: The Euclidean pair-wise distance in latent space.
method               | k-medoids          | k-medoids        | SC
data samples         | reconstructed data | latent variables | original data
distance             | Geodesic           | Euclidean        | Euclidean
'0'-'1' accuracy     | 1.00               | 0.93             | 0.69
'0'-'1'-'2' accuracy | 1.00               | 0.80             | 0.36
TABLE III: MNIST dataset 2-class and 3-class clustering accuracy

V-C4 The Fashion-MNIST Dataset

Fashion-MNIST [35] is a dataset of Zalando's article images; each image is a 28×28 gray-scale image. We consider the classes 'T-shirt', 'Sandal' and 'Bag' to test 2-class and 3-class clustering. For the H-enc and H-dec layers, we use three hidden fully-connected layers with ReLU activations (H-enc: 784 to 500, 500 to 200 and 200 to 100 nodes; H-dec: 100 to 200, 200 to 500 and 500 to 784 nodes); for the S-enc layer, we use one fully-connected layer with a sigmoid activation function; and for the M-enc and M-dec layers we use fully-connected layers with identity and sigmoid activation functions, respectively. Images generated by the networks are shown in Fig. 17.

Fig. 17: Generated ’T-shirt’, ’Sandal’ and ’Bag’ examples for Fashion-MNIST dataset by VAE used in this paper.

For the 2-class situation, we use the 'T-shirt' and 'Sandal' samples to train the VAE. We select 50 samples from the 'T-shirt' and 'Sandal' classes, respectively, and compute pair-wise distances (see Fig. 18). For the 3-class situation, we select 30 samples from each class and compute distances (Fig. 19). As before, we see that k-medoids clustering with geodesic distances significantly outperforms the baselines; see Table IV for the numbers.

Fig. 18: (a): The logarithm of the volume measure in latent space for ’T-shirt’ and ’Sandal’ images. (b) Left: The optimized geodesic pair-wise distance. Right: The Euclidean pair-wise distance in latent space.
Fig. 19: (a): The logarithm of the volume measure in latent space for ’T-shirt’ ,’Sandal’ and ’Bag’ images. (b) Left: The optimized geodesic pair-wise distance. Right: The Euclidean pair-wise distance in latent space.
method                            | k-medoids      | k-medoids        | SC
data samples                      | generated data | latent variables | original data
distance                          | Geodesic       | Euclidean        | Euclidean
'T-shirt'-'Sandal' accuracy       | 1.00           | 0.98             | 0.48
'T-shirt'-'Sandal'-'Bag' accuracy | 1.00           | 0.93             | 0.28
TABLE IV: Fashion-MNIST dataset 2-class and 3-class clustering accuracy

V-C5 The EMNIST-Letter Dataset

The EMNIST-letter dataset [36] is a set of handwritten alphabet characters derived from the NIST Special Database and converted to gray-scale images. We select the characters 'D' and 'd' as two classes, and fit a VAE with the same network architecture as the one used for Fashion-MNIST. Generated images are shown in Fig. 20.

We select 50 samples from 'D' and 'd', respectively, and show pair-wise distances in Fig. 21. Again, k-medoids clustering shows that the geodesic distance reflects the intrinsic structure, which improves clustering over the baselines, cf. Table V.

Fig. 20: Generated 'D' and 'd' examples for the EMNIST-letter dataset by the VAE used in this paper.
Fig. 21: (a) The logarithm of the volume measure in the latent space for 'D' and 'd' images. (b) Left: The optimized geodesic pair-wise distance. Right: The Euclidean pair-wise distance in latent space.
method           | k-medoids          | k-medoids        | SC
data samples     | reconstructed data | latent variables | original data
distance         | Geodesic           | Euclidean        | Euclidean
'D'-'d' accuracy | 1.00               | 0.76             | 0.48
TABLE V: EMNIST-letter dataset 2-class clustering accuracy

VI Conclusion

In this paper, we have proposed an efficient algorithm for computing shortest paths (geodesics) along the data manifolds spanned by deep generative models. Unlike previous work, the proposed algorithm is easy to implement and fits well with modern deep learning frameworks. We have also proposed a new network architecture for representing variances in variational autoencoders. With these two tools in hand, we have shown that simple distance-based clustering works remarkably well in the latent space of a deep generative model, even if the model is not trained for clustering tasks. Still, the dimension of the latent space, the form of the curve parametrization, and the modeling of the generator variance are worth developing further to obtain a more robust geodesic computation algorithm.

Acknowledgments

TY was supported by the National Key R&D Program of China (No. 2017YFB0702104). SH was supported by a research grant (15334) from VILLUM FONDEN. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 757360).

References

  • [1] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montréal, Canada, 2014, pp. 3581–3589.
  • [2] E. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” in Proceedings of the 29th Neural Information Processing Systems (NIPS), Montréal Canada, 2015, pp. 1486–1494.
  • [3] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, Canada, 2014.
  • [4] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 2014.
  • [5] U. V. Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, pp. 395–416, 2007.
  • [6] G. Arvanitidis, L. Hansen, and S. Hauberg, “Latent space oddity: on the curvature of deep generative models,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018.
  • [7] B. Yang, X. Xiao, N. Sidiropoulos, and M. Hong, “Towards k-means-friendly spaces: Simultaneous deep learning and clustering,” in Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017, pp. 3861–3870.
  • [8] P. Huang, Y. Huang, W. Wang, and L. Wang, “Deep embedding network for clustering,” in Proceedings of the 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 2014, pp. 1532–1537.
  • [9] D. Chen, J. Lv, and Z. Yi, “Unsupervised multi-manifold clustering by learning deep representation,” in Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 2017, pp. 385–391.
  • [10] K. G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, “Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 5736–5745.
  • [11] S. A. Shah and V. Koltun, “Deep continuous clustering,” arXiv:1803.01449, 2018.
  • [12] ——, “Robust continuous clustering,” Proceedings of the National Academy of Sciences of the United States of America, vol. 114, pp. 9814–9819, 2017.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 26th Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, 2012.
  • [14] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief networks for natural language understanding,” ACM Transactions on Audio, Speech & Language Processing, vol. 22, pp. 778–784, 2014.
  • [15] G. Chen, “Deep learning with nonparametric clustering,” arXiv:1501.03084, 2015.
  • [16] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, 2016, pp. 478–487.
  • [17] C. C. Hsu and C. W. Lin, “Cnn-based joint clustering and representation learning with feature drift compensation for large-scale image data,” IEEE Transactions on Multimedia, vol. 20, pp. 421 – 429, 2018.
  • [18] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016, pp. 5147–5156.
  • [19] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep adaptive image clustering,” in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, 2017, pp. 5879–5887.
  • [20] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou, “Variational deep embedding: an unsupervised and generative approach to clustering,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, 2017, pp. 1965–1972.
  • [21] N. Dilokthanakul, P. Mediano, and M. Garnelo, “Deep unsupervised clustering with gaussian mixture variational autoencoders,” arXiv:1611.02648, 2016.
  • [22] W. Harchaoui, P. A. Mattei, and C. Bouveyron, “Deep adversarial gaussian mixture auto-encoder for clustering,” in Workshop of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017.
  • [23] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016, pp. 2172–2180.
  • [24] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics).   Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
  • [25] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian geometry.   Springer, 1990, vol. 3.
  • [26] S. Hauberg, “Only bayes should learn a manifold,” 2018.
  • [27] L. B. Rall, “Automatic differentiation: Techniques and applications,” 1981.
  • [28] P. Hennig and S. Hauberg, “Probabilistic solutions to differential equations and their application to riemannian statistics,” in Proceedings of the 17th international Conference on Artificial Intelligence and Statistics (AISTATS), vol. 33, 2014.
  • [29] S. Laine, “Feature-based metrics for exploring the latent space of generative models,” ICLR workshops, 2018.
  • [30] Q. Que and M. Belkin, “Back to the future: Radial basis function networks revisited,” in Artificial Intelligence and Statistics (AISTATS), 2016.
  • [31] O. Cappé, S. J. Godsill, and E. Moulines, “An overview of existing methods and recent advances in sequential monte carlo,” Proceedings of the IEEE, vol. 95, no. 5, pp. 899–924, 2007.
  • [32] L. Kaufman and P. Rousseeuw, Clustering by means of medoids.   North-Holland, 1987.
  • [33] C. M. Bishop, M. Svensen, and C. K. Williams, “Magnification factors for the gtm algorithm,” 1997.
  • [34] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence, “Metrics for Probabilistic Geometries,” in The Conference on Uncertainty in Artificial Intelligence (UAI), Jul. 2014.
  • [35] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.
  • [36] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” arXiv:1702.05373, 2017.