Finite Gaussian Mixture Models (GMMs), also called Mixtures of Gaussians (MoG), are powerful, parametric, and probabilistic tools that are widely used as flexible models for multivariate density estimation in applications spanning machine learning, computer vision, and signal/image analysis. GMMs have been utilized for image representation [5, 17] to generate feature signatures, point set registration, adaptive contrast enhancement, inverse problems including super-resolution and deblurring [19, 55], time series classification, texture segmentation, and robotic visuomotor transformations, among many others.
As a special case of general latent variable models, finite GMM parameters can serve as a concise embedding that provides a compressed representation of the data. Moreover, GMMs can approximate any density defined on $\mathbb{R}^d$ given a large enough number of mixture components. To fit a finite GMM to the observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components. Several techniques have been introduced to provide an answer for the first question. The focus of this paper is on the latter question.
The existing methods to estimate the GMM parameters are based on minimizing the negative log-likelihood (NLL) of the data with respect to the parameters. The Expectation Maximization (EM) algorithm is the prominent way of minimizing the NLL (though see, e.g., [38, 22] for alternatives). While EM remains the most popular method for estimating GMMs, it only guarantees convergence to a stationary point of the likelihood function. On the other hand, various studies have shown that the likelihood function has bad local maxima that can have arbitrarily worse log-likelihood values compared to any of the global maxima [22, 25, 2]. More importantly, Jin et al. proved that with random initialization, the EM algorithm converges to a bad critical point with high probability. This issue makes the EM algorithm sensitive to the choice of initial parameters.
In the limit (i.e., with infinitely many i.i.d. samples), minimizing the NLL function is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution and the GMM, with respect to the GMM parameters. Here, we propose, alternatively, to minimize the p-Wasserstein distance, and more specifically the sliced p-Wasserstein distance, between the data distribution and the GMM. The Wasserstein distance and its variations have lately attracted a lot of attention from the Machine Learning (ML) and signal processing communities [28, 3, 16]. It has been shown that optimizing with respect to the Wasserstein loss has various practical benefits over the KL-divergence loss [44, 16, 39, 3, 20]. Importantly, unlike the KL-divergence and its related dissimilarity measures (e.g. the Jensen-Shannon divergence), the Wasserstein distance can provide a meaningful notion of closeness (i.e. distance) for distributions supported on non-overlapping low-dimensional manifolds. This motivates our proposed formulation for estimating GMMs.
To overcome the computational burden of the Wasserstein minimization for high-dimensional distributions, we propose to use the sliced Wasserstein distance [6, 30, 28]. Our method slices the high-dimensional data distribution via random projections and minimizes the Wasserstein distance between the projected one-dimensional distributions with respect to the GMM parameters. We note that the idea of characterizing a high-dimensional distribution via its random projections has been studied in various works before [52, 26]. One of these works, for instance, minimizes a norm between the slices of the data distribution and the GMM with respect to the parameters. This method, however, suffers from the same shortcomings as the KL-divergence based methods.
The p-Wasserstein distances and more generally the optimal mass transportation problem have recently gained plenty of attention from the computer vision and machine learning communities [28, 49, 42, 29, 54, 47, 3]. We note that the p-Wasserstein distances have also been used in regard to GMMs, however, as a distance metric to compare various GMM models [11, 34, 45]. Our proposed method, on the other hand, is an alternative framework for fitting a GMM to data via sliced p-Wasserstein distances.
In Section 3, we reiterate the connection between the K-means problem and the Wasserstein means problem, extend it to GMMs, and formulate the Sliced Wasserstein means problem. Our numerical experiments are presented in Section 4. Finally, we conclude our paper in Section 5.
2 Preliminaries

In this section we review the preliminary concepts and formulations needed to develop our framework.

2.1 p-Wasserstein distance

Let $P_p(\Omega)$ be the set of Borel probability measures with finite $p$'th moment defined on a given metric space $(\Omega, d)$, and let $\mu$ and $\nu$ be probability measures defined on $\Omega \subseteq \mathbb{R}^d$ with corresponding probability density functions $I_x$ and $I_y$, i.e. $d\mu(x) = I_x(x)dx$ and $d\nu(y) = I_y(y)dy$. The $p$-Wasserstein distance for $p \in [1, \infty)$ between $\mu$ and $\nu$ is defined as the optimal mass transportation (OMT) problem with cost function $c(x, y) = d^p(x, y)$, such that:

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\Omega \times \Omega} d^p(x, y)\, d\gamma(x, y) \right)^{\frac{1}{p}}, \tag{1}$$

where $\Gamma(\mu, \nu)$ is the set of all transportation plans, i.e. joint measures $\gamma \in P(\Omega \times \Omega)$, whose marginals satisfy the following:

$$\gamma(A \times \Omega) = \mu(A), \qquad \gamma(\Omega \times B) = \nu(B),$$

for any Borel subsets $A, B \subseteq \Omega$.
Due to Brenier's theorem, for absolutely continuous probability measures $\mu$ and $\nu$ (with respect to the Lebesgue measure), the $p$-Wasserstein distance can be equivalently obtained from

$$W_p(\mu, \nu) = \left( \inf_{f \in MP(\mu, \nu)} \int_{\Omega} d^p(f(x), x)\, d\mu(x) \right)^{\frac{1}{p}}, \tag{2}$$

where $MP(\mu, \nu) = \{ f : \Omega \to \Omega \mid f_\#\mu = \nu \}$ and $f_\#\mu$ represents the pushforward of measure $\mu$, characterized by

$$\int_{f^{-1}(A)} d\mu(x) = \int_{A} d\nu(y) \quad \text{for any Borel subset } A \subseteq \Omega.$$

When a transport map exists, the transport plan and the transport map are related via $\gamma = (\mathrm{Id} \times f)_\# \mu$. Note that in most engineering and computer science applications $\Omega$ is a compact subset of $\mathbb{R}^d$ and $d(x, y) = \|x - y\|$ is the Euclidean distance. By abuse of notation, we will use $W_p(\mu, \nu)$ and $W_p(I_x, I_y)$ interchangeably throughout the manuscript. For a more detailed explanation of the Wasserstein distances and the optimal mass transport problem, we refer the reader to the recent review article by Kolouri et al. and the references therein.
One-dimensional distributions: The case of one-dimensional continuous probability measures is especially interesting, as the $p$-Wasserstein distance has a closed-form solution. More precisely, for one-dimensional probability measures there exists a unique monotonically increasing transport map that pushes one measure into another. Let $J_\mu(x) = \mu((-\infty, x])$ be the cumulative distribution function (CDF) of $\mu$ and define $J_\nu$ to be the CDF of $\nu$. The transport map is then uniquely defined as $f(x) = J_\nu^{-1}(J_\mu(x))$, and consequently the $p$-Wasserstein distance is calculated as:

$$W_p(\mu, \nu) = \left( \int_\Omega | J_\nu^{-1}(J_\mu(x)) - x |^p\, d\mu(x) \right)^{\frac{1}{p}} = \left( \int_0^1 | J_\nu^{-1}(z) - J_\mu^{-1}(z) |^p\, dz \right)^{\frac{1}{p}}, \tag{3}$$

where in the second equality we used the change of variable $z = J_\mu(x)$. The closed-form solution of the $p$-Wasserstein distance in one dimension is an attractive property that gives rise to the Sliced-Wasserstein (SW) distances. Next we review the Radon transform, which enables the definition of the Sliced $p$-Wasserstein distance.
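The quantile form of Equation (3) is easy to check numerically. The sketch below (an illustration of ours, not from the paper) evaluates the quantile integral with a midpoint rule for two Gaussians, for which $W_2$ is known analytically: $W_2(\mathcal{N}(m_1, s_1^2), \mathcal{N}(m_2, s_2^2)) = \sqrt{(m_1 - m_2)^2 + (s_1 - s_2)^2}$.

```python
import math
from statistics import NormalDist

def wasserstein_1d(quantile_a, quantile_b, p=2, n_grid=20000):
    """Approximate W_p via the quantile form of Equation (3):
    W_p^p = integral over z in (0,1) of |F_a^{-1}(z) - F_b^{-1}(z)|^p dz,
    evaluated with a midpoint rule on a uniform grid."""
    total = 0.0
    for i in range(n_grid):
        z = (i + 0.5) / n_grid
        total += abs(quantile_a(z) - quantile_b(z)) ** p
    return (total / n_grid) ** (1.0 / p)

# Two Gaussians with a known closed-form W_2 distance.
a = NormalDist(mu=0.0, sigma=1.0)
b = NormalDist(mu=2.0, sigma=2.0)
w2_numeric = wasserstein_1d(a.inv_cdf, b.inv_cdf, p=2)
w2_exact = math.sqrt((0.0 - 2.0) ** 2 + (1.0 - 2.0) ** 2)  # sqrt(5)
```

The small residual between the two values comes only from truncating the quantile grid near $z = 0$ and $z = 1$.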
2.2 Radon transform
The $d$-dimensional Radon transform, $\mathcal{R}$, maps a function $I \in L^1(\mathbb{R}^d)$, where $L^1(\mathbb{R}^d) = \{ I : \mathbb{R}^d \to \mathbb{R} \mid \int_{\mathbb{R}^d} |I(x)|\, dx < \infty \}$, to the set of its integrals over the hyperplanes of $\mathbb{R}^d$, and is defined as

$$\mathcal{R}I(t, \theta) = \int_{\mathbb{R}^d} I(x)\, \delta(t - x \cdot \theta)\, dx, \tag{4}$$

for all $(t, \theta) \in \mathbb{R} \times \mathbb{S}^{d-1}$, where $\mathbb{S}^{d-1}$ is the unit sphere in $\mathbb{R}^d$. Note that $\mathcal{R} : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$. For the sake of completeness, we note that the Radon transform is an invertible, linear transform, and we denote its inverse as $\mathcal{R}^{-1}$, which is also known as the filtered back-projection algorithm and is defined as:

$$I(x) = \mathcal{R}^{-1}(\mathcal{R}I)(x) = \int_{\mathbb{S}^{d-1}} \left( \mathcal{R}I(\cdot, \theta) * \eta(\cdot) \right)(x \cdot \theta)\, d\theta, \tag{5}$$

where $\eta$ is a one-dimensional filter with corresponding Fourier transform $\mathcal{F}\eta(\omega) = c|\omega|^{d-1}$ (it appears due to the Fourier slice theorem) and '$*$' denotes convolution. The Radon transform and its inverse are extensively used in Computerized Axial Tomography (CAT) scans in the field of medical imaging, where X-ray measurements integrate tissue-absorption levels along 2D hyperplanes to provide a tomographic image of the internal organs. Note that in practice acquiring an infinite number of projections is not feasible; therefore the integration in Equation (5) is replaced with a finite summation over projection angles. A formal measure-theoretic definition of the Radon transform for probability measures can be found in the literature.
Radon transform of empirical PDFs: The Radon transform of $I_y$ simply follows Equation (4). However, in most machine learning applications we do not have access to the distribution itself but to its samples, $y_n$ for $n = 1, \dots, N$. Kernel density estimation can be used in such scenarios to approximate $I_y$ from its samples,

$$I_y(y) \approx \frac{1}{N} \sum_{n=1}^N \phi(y - y_n),$$

where $\phi$ is a density kernel with $\int \phi(y)\, dy = 1$ (e.g. a Gaussian kernel). The Radon transform of $I_y$ can then be approximated from its samples via:

$$\mathcal{R}I_y(t, \theta) \approx \frac{1}{N} \sum_{n=1}^N \mathcal{R}\phi(t - y_n \cdot \theta, \theta).$$

Note that certain density kernels have an analytic Radon transform. For instance, when $\phi(y) = \delta(y)$, the Radon transform is $\mathcal{R}I_y(t, \theta) \approx \frac{1}{N} \sum_{n=1}^N \delta(t - y_n \cdot \theta)$.
Radon transform of multivariate Gaussians: Let $\mathcal{N}_d(\mu, \Sigma)$ be a $d$-dimensional multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$. A slice/projection of the Radon transform of its density is then a one-dimensional normal distribution, $\mathcal{N}_1(\theta \cdot \mu,\, \theta^T \Sigma\, \theta)$. Given the linearity of the Radon transform, this indicates that a slice of a $d$-dimensional GMM is a one-dimensional GMM with component means $\theta \cdot \mu_k$, variances $\theta^T \Sigma_k \theta$, and unchanged mixture coefficients.
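This slicing property is easy to verify empirically. The sketch below (illustrative; the parameter values are ours) projects samples from a 2D Gaussian onto a direction $\theta$ and checks that the projected samples have mean $\theta \cdot \mu$ and variance $\theta^T \Sigma \theta$.

```python
import math
import random

random.seed(0)

# Parameters of a 2D Gaussian (chosen for illustration).
mu = (1.0, -2.0)
# Covariance Sigma given via its Cholesky factor L, so x = mu + L z, z ~ N(0, I).
L = ((1.5, 0.0), (0.6, 0.8))  # Sigma = L L^T

# A projection direction theta on the unit circle.
angle = 0.7
theta = (math.cos(angle), math.sin(angle))

# Project N samples: t_n = theta . x_n.
N = 100_000
proj = []
for _ in range(N):
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    x0 = mu[0] + L[0][0] * z0 + L[0][1] * z1
    x1 = mu[1] + L[1][0] * z0 + L[1][1] * z1
    proj.append(theta[0] * x0 + theta[1] * x1)

emp_mean = sum(proj) / N
emp_var = sum((t - emp_mean) ** 2 for t in proj) / N

# Predicted 1D slice parameters: theta . mu and theta^T Sigma theta.
pred_mean = theta[0] * mu[0] + theta[1] * mu[1]
Sigma = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
pred_var = sum(theta[i] * Sigma[i][j] * theta[j] for i in range(2) for j in range(2))
```

The empirical mean and variance of the projections match the predicted one-dimensional Gaussian parameters up to Monte Carlo error.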
2.3 Sliced p-Wasserstein Distance
The idea behind the sliced $p$-Wasserstein distance is to first obtain a family of marginal distributions (i.e. one-dimensional distributions) for a higher-dimensional probability distribution through linear projections (via the Radon transform), and then calculate the distance between two input distributions as a functional of the $p$-Wasserstein distances between their marginal distributions. In this sense, the distance is obtained by solving several one-dimensional optimal transport problems, which have closed-form solutions. More precisely, the Sliced $p$-Wasserstein distance between $I_x$ and $I_y$ is defined as

$$SW_p(I_x, I_y) = \left( \int_{\mathbb{S}^{d-1}} W_p^p\big( \mathcal{R}I_x(\cdot, \theta),\, \mathcal{R}I_y(\cdot, \theta) \big)\, d\theta \right)^{\frac{1}{p}}.$$

The Sliced $p$-Wasserstein distance as defined above is symmetric and satisfies sub-additivity and the coincidence axioms; hence it is a true metric.
The sliced $p$-Wasserstein distance is especially useful when one only has access to samples of a high-dimensional PDF $I_y$ and kernel density estimation is required to estimate it. One-dimensional kernel density estimation of PDF slices is a much simpler task compared to direct estimation of $I_y$ from its samples. The catch, however, is that as the dimensionality grows, a larger number of projections is needed to estimate $I_y$ from its slices. In short, if a reasonably smooth two-dimensional distribution can be approximated by $L$ of its projections (up to an acceptable reconstruction error), then one would require on the order of $L^{d-1}$ projections to approximate a similarly smooth $d$-dimensional distribution (for $d > 2$). In later sections we show that the projections can be randomized in a stochastic gradient descent fashion for learning Gaussian mixture models.
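For empirical distributions, the slice integral can be approximated by Monte Carlo sampling of directions, with each one-dimensional distance computed in closed form by sorting (the 1D quantile solution of Equation (3)). A minimal sketch of ours, using random unit vectors in 2D:

```python
import math
import random

random.seed(1)

def sliced_w2(xs, ys, n_proj=200):
    """Monte Carlo estimate of the sliced 2-Wasserstein distance between two
    equal-size empirical samples in 2D. Each slice's W_2 is computed in closed
    form by matching sorted projections (the 1D quantile solution)."""
    assert len(xs) == len(ys)
    total = 0.0
    for _ in range(n_proj):
        a = random.uniform(0.0, 2.0 * math.pi)
        th = (math.cos(a), math.sin(a))
        px = sorted(th[0] * x0 + th[1] * x1 for x0, x1 in xs)
        py = sorted(th[0] * y0 + th[1] * y1 for y0, y1 in ys)
        total += sum((u - v) ** 2 for u, v in zip(px, py)) / len(xs)
    return math.sqrt(total / n_proj)

# Two isotropic Gaussian clouds whose means differ by Delta = (2, 0); for this
# pair SW_2 is close to |Delta| / sqrt(2), since E[(theta . Delta)^2] = |Delta|^2 / 2.
xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(4000)]
ys = [(random.gauss(2, 1), random.gauss(0, 1)) for _ in range(4000)]
est = sliced_w2(xs, ys)
```

Each projection costs only a sort, which is what makes the sliced distance attractive in high dimensions.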
3 Sliced Wasserstein Means and Gaussian Mixture Models
Here we first reiterate the connection between the K-means clustering algorithm and the Wasserstein means problem, and then extend this connection to GMMs and state the need for the sliced Wasserstein distance. Let $y_n \in \mathbb{R}^d$ for $n = 1, \dots, N$ be the samples and $Y = [y_1, \dots, y_N]$. The K-means clustering algorithm seeks the $K$ best points, $\mu_k$ for $k = 1, \dots, K$, that represent $Y$. Formally,

$$\min_{\mu, s}\ \frac{1}{N} \sum_{n=1}^N \sum_{k=1}^K s_{k,n}\, \| y_n - \mu_k \|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^K s_{k,n} = 1,\ s_{k,n} \in \{0, 1\},$$

where the $s_{k,n}$'s contain the one-hot labels of the $y_n$'s.
Let $I_y(y) = \frac{1}{N} \sum_{n=1}^N \phi(y - y_n)$ be the empirical distribution of $Y$, where $\phi$ is a kernel density estimator (e.g. a radial basis function kernel, or the Dirac delta function in its limit). Then the K-means problem can be alternatively solved by minimizing a statistical distance/divergence between $I_y$ and $I_x(x) = \frac{1}{K} \sum_{k=1}^K \phi(x - \mu_k)$. A common choice for such a distance/divergence is the Kullback-Leibler divergence (KL-divergence) [4, 10]. Alternatively, the $p$-Wasserstein distance can be used to estimate the parameters of $I_x$,

$$\min_{\mu}\ W_p(I_x, I_y). \tag{9}$$
We discuss the benefits of the $p$-Wasserstein distance over the KL-divergence in the next sub-section. The above minimization is known as the Wasserstein means problem and is closely related to the Wasserstein barycenter problem [1, 46, 14, 21]. The main difference is that in those works the goal is to find a measure that minimizes the sum of Wasserstein distances to a set of given low-dimensional distributions (2D or 3D images or point clouds). The strategy in [1, 46, 14] could also be extended into a clustering problem, though the two formulations remain significantly different given that the inputs to the Wasserstein distance are very different. Note also that K-means is equivalent to a variational EM approximation of a GMM with isotropic Gaussians; therefore, a natural extension of the Wasserstein means problem can be formulated to fit a general GMM to $I_y$. To do so, we let the distribution $I_x$ be the parametric GMM:
$$I_x(x) = \sum_{k=1}^K \alpha_k\, \mathcal{N}(x;\, \mu_k, \Sigma_k), \quad \text{where} \quad \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0,$$

and Equation (9) is solved to find the $\alpha_k$'s, $\mu_k$'s, and $\Sigma_k$'s. Next we describe the benefits of using the Wasserstein distance in Equation (9) to fit a general GMM to the observed data, compared to the common log-likelihood maximization schemes.
3.1 Wasserstein Means vs. Maximum Log-Likelihood
General GMMs are often fitted to the observed data points, the $y_n$'s, via maximizing the log-likelihood of the samples with respect to the parameters. Formally, this is written as:

$$\max_{\alpha, \mu, \Sigma}\ \frac{1}{N} \sum_{n=1}^N \log\big( I_x(y_n) \big).$$

It is straightforward to show that in the limit, as the number of samples grows to infinity, $N \to \infty$, maximizing the log-likelihood becomes equivalent to minimizing the KL-divergence between $I_y$ and $I_x$ (see the supplementary material for a proof):

$$\lim_{N \to \infty}\ \operatorname*{argmax}_{\alpha, \mu, \Sigma}\ \frac{1}{N} \sum_{n=1}^N \log\big( I_x(y_n) \big) = \operatorname*{argmin}_{\alpha, \mu, \Sigma}\ KL(I_y \| I_x).$$
The p-Wasserstein distance has been shown to have certain benefits over the commonly used KL-divergence and its related distances/divergences (i.e., other examples of Bregman divergences, including the Jensen-Shannon (JS) divergence and the Itakura-Saito distance). For a general GMM, the model $I_x$ is continuous and smooth (i.e. infinitely differentiable) in its parameters and is locally Lipschitz; therefore, $W_p(I_x, I_y)$ is continuous and differentiable everywhere, while this is not true for the KL-divergence. In addition, in scenarios where the distributions are supported on low-dimensional manifolds, the KL-divergence and other Bregman divergences may be difficult cost functions to optimize given their limited capture range. This limitation is due to their 'Eulerian' nature, in the sense that the distributions are compared at fixed spatial coordinates (i.e., bin-to-bin comparison for discrete measures), as opposed to the p-Wasserstein distance, which is 'Lagrangian' and morphs one distribution to match another by finding correspondences in the domain of these distributions (i.e., Wasserstein distances perform cross-bin comparisons).
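The Eulerian-vs.-Lagrangian point can be seen with two disjoint histograms: the KL-divergence is infinite and stays infinite as the supports move apart, while the one-dimensional Wasserstein distance grows with the separation. A toy illustration of ours, not from the paper:

```python
import math

def kl(p, q):
    """Discrete KL(p || q); infinite when q has a zero where p has mass."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def w1(p, q):
    """1D Wasserstein-1 between histograms on the same unit-spaced grid:
    the L1 distance between their cumulative sums."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

# Unit point masses at bin 2 versus bins 5 and 8 on a 10-bin grid.
grid = 10
p = [1.0 if i == 2 else 0.0 for i in range(grid)]
q_near = [1.0 if i == 5 else 0.0 for i in range(grid)]
q_far = [1.0 if i == 8 else 0.0 for i in range(grid)]
```

Here `kl(p, q_near)` and `kl(p, q_far)` are both infinite, giving no gradient signal, while `w1` returns 3 and 6, tracking the actual separation.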
To get a practical sense of the benefits of the Wasserstein means problem over maximum log-likelihood estimation, we study two simple scenarios. In the first scenario, we generate one-dimensional samples, $y_n$, from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where we assume $\sigma$ to be known, and visualize the negative log-likelihood (NLL) and the Wasserstein means (WM) objective as a function of $\mu$. Figure 1 (a) shows the first scenario and the corresponding energy landscapes for the negative log-likelihood and the Wasserstein means problem. It can be seen that while the global optimum is the same for both problems, the Wasserstein means landscape is less sensitive to the initial point; hence a gradient descent approach would easily converge to the optimal point regardless of the starting point. In the second scenario, we generated samples, $y_n$, from a mixture of two one-dimensional Gaussian distributions. Next, we assumed that the mixture coefficients $\alpha_k$'s and the standard deviations $\sigma_k$'s, for $k \in \{1, 2\}$, are known and visualized the corresponding energy landscapes for NLL and WM as a function of the $\mu_k$'s (see Figure 1 (b)). It can be clearly seen that although the global optimum of both problems is the same, the energy landscape of the Wasserstein means problem does not suffer from local minima and is much smoother.
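The flattening of the NLL landscape is easy to reproduce. In the sketch below (our own toy version of the second scenario, with assumed values $\sigma = 0.5$ and true means $\pm 2$), one mean is swept far from the data while the other is held at its true value: the NLL becomes nearly constant (vanishing gradient), while the empirical 1D Wasserstein-2 objective keeps growing, so its gradient keeps pointing back toward the data.

```python
import math
import random

random.seed(2)

SIGMA = 0.5
# Data: equal mixture of N(-2, SIGMA^2) and N(+2, SIGMA^2).
data = sorted(random.gauss(-2 if i % 2 == 0 else 2, SIGMA) for i in range(1000))

def nll(mu1, mu2):
    """Negative log-likelihood of the data under 0.5 N(mu1) + 0.5 N(mu2)."""
    c = 1.0 / (SIGMA * math.sqrt(2 * math.pi))
    total = 0.0
    for y in data:
        lik = 0.5 * c * (math.exp(-(y - mu1) ** 2 / (2 * SIGMA ** 2))
                         + math.exp(-(y - mu2) ** 2 / (2 * SIGMA ** 2)))
        total -= math.log(lik + 1e-300)
    return total / len(data)

def w2sq(mu1, mu2, n_model=1000):
    """Squared 1D W_2 between data and model samples, via sorted matching."""
    model = sorted(random.gauss(mu1 if i % 2 == 0 else mu2, SIGMA)
                   for i in range(n_model))
    return sum((a - b) ** 2 for a, b in zip(model, data)) / n_model

# Sweep mu1 far from the data while mu2 stays at its true value (+2).
nll_10, nll_20 = nll(10.0, 2.0), nll(20.0, 2.0)
w_10, w_20 = w2sq(10.0, 2.0), w2sq(20.0, 2.0)
```

Moving the stray mean from 10 to 20 changes the NLL by a vanishingly small amount, while the Wasserstein objective increases substantially.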
3.2 Sliced Wasserstein Means
We propose to use an approximation of the $p$-Wasserstein distance between $I_x$ and $I_y$ using the SW distance. Substituting the Wasserstein distance in Equation (9) with the SW distance leads to the Sliced $p$-Wasserstein Means (SWM) problem,

$$\min_{\alpha, \mu, \Sigma}\ \int_{\mathbb{S}^{d-1}} W_p^p\big( \mathcal{R}I_x(\cdot, \theta),\, \mathcal{R}I_y(\cdot, \theta) \big)\, d\theta,$$

which can be written as:

$$\min_{\alpha, \mu, \Sigma}\ \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} | f_\theta(t) - t |^p\, \mathcal{R}I_x(t, \theta)\, dt\, d\theta,$$

where for a fixed $\theta$, $f_\theta$ is the optimal transport map between $\mathcal{R}I_x(\cdot, \theta)$ and $\mathcal{R}I_y(\cdot, \theta)$. Note that, since $I_x$ is an absolutely continuous PDF, an optimal transport map exists even when $I_y$ is not an absolutely continuous PDF (e.g. when $\phi(y) = \delta(y)$). Moreover, since the slices/projections are one-dimensional, the transport map $f_\theta$ is uniquely defined, and the minimization over $f_\theta$ has a closed-form solution calculated from Equation (3). The Radon transforms appearing in the SWM problem are:

$$\mathcal{R}I_x(t, \theta) = \sum_{k=1}^K \alpha_k\, \mathcal{N}(t;\, \theta \cdot \mu_k,\, \theta^T \Sigma_k \theta), \qquad \mathcal{R}I_y(t, \theta) \approx \frac{1}{N} \sum_{n=1}^N \mathcal{R}\phi(t - y_n \cdot \theta, \theta).$$
The new formulation avoids the optimization required for calculating the Wasserstein distance and enables an efficient implementation for clustering high-dimensional data. Figure 2 demonstrates the high-level idea behind slicing the high-dimensional PDFs $I_x$ and $I_y$ and minimizing the $p$-Wasserstein distance between these slices over the GMM components. Moreover, given the high-dimensional nature of the problem, estimating the density $I_y$ in $\mathbb{R}^d$ requires a large number of samples; the projections $\mathcal{R}I_y(\cdot, \theta)$, however, are one-dimensional, and therefore a large number of samples may not be critical for estimating these one-dimensional densities.
Optimization scheme: To obtain a numerical optimization scheme that minimizes the SWM problem, we first discretize the set of directions/projections. This corresponds to the use of a finite set $\Theta = \{\theta_1, \dots, \theta_L\} \subset \mathbb{S}^{d-1}$ and a minimization of the following energy function:

$$\min_{\alpha, \mu, \Sigma}\ \sum_{l=1}^L \int_{\mathbb{R}} | f_{\theta_l}(t) - t |^p\, \mathcal{R}I_x(t, \theta_l)\, dt. \tag{13}$$

A fine sampling of $\mathbb{S}^{d-1}$ is required for Equation (13) to be a good approximation of the SWM objective. Such sampling, however, becomes prohibitively expensive for high-dimensional data. Alternatively, following prior work, we utilize random samples of $\mathbb{S}^{d-1}$ at each minimization step to approximate the SWM objective. This leads to a stochastic gradient descent scheme where, instead of randomly sampling the input data, we randomly sample the projection angles. Finally, the GMM parameters are updated through an EM-like approach where, for fixed GMM parameters, we calculate the optimal transport maps between random slices of $I_x$ and $I_y$, followed by updating the GMM parameters for fixed transport maps. Below we describe these steps:
1. Generate $L$ random samples from $\mathbb{S}^{d-1}$, $\{\theta_l\}_{l=1}^L$.
2. Fix the GMM, $I_x$, and calculate the optimal transport map $f_{\theta_l}$ between the slices $\mathcal{R}I_x(\cdot, \theta_l)$ and $\mathcal{R}I_y(\cdot, \theta_l)$ via:

$$f_{\theta_l}(t) = J_{y, \theta_l}^{-1}\big( J_{x, \theta_l}(t) \big), \tag{14}$$

where $J_{x, \theta_l}$ (resp. $J_{y, \theta_l}$) is the CDF of $\mathcal{R}I_x(\cdot, \theta_l)$ (resp. $\mathcal{R}I_y(\cdot, \theta_l)$).
3. For fixed transport maps, update the GMM parameters via a gradient descent step on Equation (13).
4. Project the updated $\Sigma_k$'s onto the positive semidefinite cone, and renormalize the $\alpha_k$'s to satisfy $\sum_{k} \alpha_k = 1$.
Notice that the derivative with respect to the $k$'th component of the mixture model in Equation (13) is independent of the other components. In addition, the transport map for each projection, $f_{\theta_l}$, in Equation (14) is calculated independently of the other projections. Therefore the optimization can be heavily parallelized in each iteration. We note that we also experimented with the Adam optimizer but did not see any improvements over RMSProp. The detailed update equations are included in the supplementary material. In what follows we show the SWM solver for estimating GMM parameters in action.
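A stripped-down version of this loop is sketched below (our own illustration, not the paper's implementation): it fits only the component means of a 2D mixture with fixed isotropic unit covariances and equal weights, using one random projection per iteration, sorted-quantile transport maps, and plain gradient steps instead of RMSProp.

```python
import math
import random

random.seed(3)

K, D, STEP, ITERS = 2, 2, 0.2, 400
N = 400  # model/data samples used per iteration

# Data: two unit-variance clusters centered at (-4, 0) and (4, 0).
data = [(random.gauss(-4 if i % 2 == 0 else 4, 1), random.gauss(0, 1))
        for i in range(2000)]

mu = [[-1.0, 0.0], [1.0, 0.0]]  # deliberately poor initialization

for _ in range(ITERS):
    a = random.uniform(0, 2 * math.pi)
    th = (math.cos(a), math.sin(a))
    # Model samples with remembered component labels (reparameterization:
    # x = mu_k + z, so d(theta . x)/d(mu_k) = theta).
    model = []
    for i in range(N):
        k = i % K
        x = (mu[k][0] + random.gauss(0, 1), mu[k][1] + random.gauss(0, 1))
        model.append((th[0] * x[0] + th[1] * x[1], k))
    model.sort()
    batch = sorted(th[0] * y0 + th[1] * y1 for y0, y1 in random.sample(data, N))
    # Sorted matching is the 1D optimal transport plan; accumulate the gradient
    # of (1/N) * sum (t_model - t_data)^2 with respect to each component mean.
    grad = [[0.0] * D for _ in range(K)]
    for (t, k), t_data in zip(model, batch):
        g = 2.0 * (t - t_data) / N
        grad[k][0] += g * th[0]
        grad[k][1] += g * th[1]
    for k in range(K):
        for d in range(D):
            mu[k][d] -= STEP * grad[k][d]

means = sorted(tuple(m) for m in mu)  # sort components by first coordinate
```

Despite the poor initialization, the random-slice updates pull the two means toward the two true cluster centers.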
4 Numerical Experiments
We ran various experiments on three datasets to test our proposed method for learning GMM parameters. The first dataset is a two-dimensional data-point distribution consisting of a ring, a square, and a connecting line (see Figure 3). To show the applicability of our method to higher-dimensional datasets, we also used the MNIST dataset and the CelebFaces Attributes Dataset (CelebA).
4.1 Robustness to initialization
We started by running a simple experiment to demonstrate the robustness of our proposed formulation to different initializations. In this test, we used a two-dimensional dataset consisting of a ring, a square, and a line connecting them. For a fixed number of modes, $K$, we randomly initialized the GMM. Next, for each initialization, we optimized the GMM parameters using the EM algorithm as well as our proposed method. We repeated this experiment 100 times.
Figure 3 shows sample results of the fitted GMM models for both algorithms (top row). Moreover, we calculated the histograms of the negative log-likelihood of the fitted GMM and of the sliced-Wasserstein distance between the fitted GMM and the empirical data distribution (bottom row). It can be seen that our proposed formulation provides a consistent model regardless of the initialization. Our method achieved the optimal negative log-likelihood in a substantially larger fraction of the initializations than EM-GMM. In addition, both the negative log-likelihood and the sliced-Wasserstein distance for our method are smaller than those of the EM algorithm, indicating that our solution is closer to the global optimum (up to permutations of the modes).
4.2 High-dimensional datasets
For the high-dimensional experiments, to capture the nonlinearity of the image data and boost the applicability of GMMs, we trained an adversarial deep convolutional autoencoder (Figure 4, left) on the image data. Next, we modeled the distribution of the data in the embedded space via a GMM. The GMM was then used to generate samples in the embedding space, which were subsequently decoded to generate synthetic (i.e. 'fake') images. In learning the GMM, we compared the EM algorithm with our proposed method, SW-GMM. We note that the entire pipeline is in an unsupervised learning setting. Figure 4 demonstrates the steps of our experiment (left) and provides a qualitative measure of the generated samples (right) for the MNIST dataset. It can be seen that the SW-GMM model leads to more visually appealing samples compared to EM-GMM. In addition, we trained a CNN classifier on the MNIST training data. We then generated 10,000 samples from each GMM component and classified these samples to measure the fidelity/purity of each component. Ideally, each component should only be assigned to a single digit. We found that the SW-GMM components were on average purer than the EM-GMM components.
Similarly, a deep convolutional autoencoder was learned for the CelebA face dataset, and a GMM was trained in the embedding space. Figure 5 shows samples generated from GMM components learned by EM and by our proposed method (the samples generated from all components are included in the supplementary material). We note that Figures 4 and 5 only provide qualitative measures of how well the GMM fits the dataset. Next we provide quantitative measures of the fitness of the GMMs for both methods.
We used adversarial training of neural networks [18, 33] to provide a goodness-of-fit measure of the GMM to the data distribution. In short, we use success in fooling an adversary network as an evaluation metric for the goodness of fit of a GMM. A deep discriminator/classifier was trained to distinguish whether a data point was sampled from the actual data distribution or from the GMM. The fooling rate (i.e. error rate) of such a discriminator is a good measure of fitness for the GMM, as a higher error rate translates to a better fit to the distribution of the data. Figure 6 shows the idea behind this experiment and reports the fooling rates for all three datasets used in our experiments. Note that SW-GMM consistently provides a higher fooling rate, indicating a better fit to the datasets. Moreover, we point out that while both methods provide reasonable models for the low-dimensional ring-square-line dataset, SW-GMM significantly outperforms EM-GMM on the higher-dimensional datasets (i.e. the MNIST and CelebA datasets).
The details of the architectures used in our experiments are included in the supplementary material.
5 Conclusion

In this paper, we proposed a novel algorithm for estimating the parameters of a GMM via minimization of the sliced p-Wasserstein distance. In each iteration, our method projects the high-dimensional data distribution into a small set of one-dimensional distributions utilizing random projections/slices of the Radon transform and estimates the GMM parameters from these one-dimensional projections. While we did not provide a theoretical guarantee that the new method is convex, or that it has fewer local minima, the empirical results suggest that our method is more robust than KL-divergence-based methods, including the EM algorithm, for maximizing the log-likelihood function. Consistent with this finding, we showed that the p-Wasserstein metrics result in more well-behaved energy landscapes. We demonstrated the robustness of our method on three datasets: a two-dimensional ring-square-line distribution and the high-dimensional MNIST and CelebA face datasets. Finally, while we used deep convolutional encoders to provide embeddings for two of the datasets and learned GMMs in these embeddings, we emphasize that our method could be applied to other embeddings, including the original data space.
6 Acknowledgments

This work was partially supported by NSF (CCF 1421502). The authors gratefully appreciate countless fruitful conversations with Drs. Charles H. Martin and Dejan Slepčev.
-  M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
-  C. Améndola, M. Drton, and B. Sturmfels. Maximum likelihood estimates for gaussian mixtures are transcendental. In International Conference on Mathematical Aspects of Computer and Information Sciences, pages 579–590. Springer, 2015.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
-  A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with bregman divergences. Journal of machine learning research, 6(Oct):1705–1749, 2005.
-  C. Beecks, A. M. Ivanescu, S. Kirchhoff, and T. Seidl. Modeling image similarity by gaussian mixture models and the signature quadratic form distance. In Computer Vision (ICCV), 2011 IEEE International Conference On, pages 1754–1761. IEEE, 2011.
-  N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
-  Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
-  W. M. Campbell, D. E. Sturim, and D. A. Reynolds. Support vector machines using gmm supervectors for speaker verification. IEEE signal processing letters, 13(5):308–311, 2006.
-  T. Celik and T. Tjahjadi. Automatic image equalization and contrast enhancement using gaussian mixture modeling. IEEE Transactions on Image Processing, 21(1):145–156, 2012.
-  K. Chaudhuri and A. McGregor. Finding metric structure in information theoretic clustering. In COLT, volume 8, page 10, 2008.
-  Y. Chen, T. T. Georgiou, and A. Tannenbaum. Optimal transport for gaussian mixture models. arXiv preprint arXiv:1710.07876, 2017.
-  F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
-  M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
-  M. Cuturi and A. Doucet. Fast computation of wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
-  A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
-  C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
-  J. Goldberger, S. Gordon, and H. Greenspan. An efficient image similarity measure based on approximations of kl-divergence between two gaussian mixtures. In Proceedings of the IEEE International Conference on Computer Vision, page 487. IEEE, 2003.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  J. A. Guerrero-Colón, L. Mancera, and J. Portilla. Image restoration using space-variant gaussian scale mixtures in overcomplete pyramids. IEEE Transactions on Image Processing, 17(1):27–41, 2008.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
-  N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung. Multilevel clustering via wasserstein means. arXiv preprint arXiv:1706.03883, 2017.
-  H. Hoffmann. Unsupervised Learning of Visuomotor Associations, volume 11 of MPI Series in Biological Cybernetics. Logos Verlag Berlin, 2005.
-  H. Hoffmann, W. Schenck, and R. Möller. Learning visuomotor transformations for gaze-control and grasping. Biological Cybernetics, 93:119–130, 2005.
-  B. Jian and B. C. Vemuri. Robust point set registration using gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1633–1645, 2011.
-  C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan. Local maxima in the likelihood of gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4116–4124, 2016.
-  A. T. Kalai, A. Moitra, and G. Valiant. Disentangling gaussians. Communications of the ACM, 55(2):113–120, 2012.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.
-  S. Kolouri and G. K. Rohde. Transport-based single frame super resolution of very low resolution face images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2015.
-  S. Kolouri, Y. Zou, and G. K. Rohde. Sliced-Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2016.
-  Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
-  B. Lévy. A numerical algorithm for semi-discrete optimal transport in 3D. ESAIM Math. Model. Numer. Anal., 49(6):1693–1715, 2015.
-  J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
-  P. Li, Q. Wang, and L. Zhang. A novel earth mover’s distance methodology for image matching with gaussian mixture models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1689–1696, 2013.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  J. Lücke and D. Forster. k-means is a variational em approximation of gaussian mixture models. arXiv preprint arXiv:1704.04812, 2017.
-  G. J. McLachlan and S. Rathnayake. On the number of components in a gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5):341–355, 2014.
-  R. Möller and H. Hoffmann. An extension of neural gas to local PCA. Neurocomputing, 62:305–326, 2004.
-  G. Montavon, K.-R. Müller, and M. Cuturi. Wasserstein training of restricted boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726, 2016.
-  K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
-  F. Natterer. The mathematics of computerized tomography, volume 32. SIAM, 1986.
-  S. R. Park, S. Kolouri, S. Kundu, and G. K. Rohde. The cumulative distribution transform and linear pattern classification. Applied and Computational Harmonic Analysis, 2017.
-  H. Permuter, J. Francos, and I. Jermyn. A study of gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognition, 39(4):695–706, 2006.
-  G. Peyré, J. Fadili, and J. Rabin. Wasserstein active contours. In Image Processing (ICIP), 2012 19th IEEE International Conference on, pages 2541–2544. IEEE, 2012.
-  Y. Qian, E. Vazquez, and B. Sengupta. Deep geometric retrieval. arXiv preprint arXiv:1702.06383, 2017.
-  J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2012.
-  A. Rolet, M. Cuturi, and G. Peyré. Fast dictionary learning with a smoothed wasserstein loss. In Artificial Intelligence and Statistics, pages 630–638, 2016.
-  J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (Proc. SIGGRAPH 2015), 2015.
-  M. Thorpe, S. Park, S. Kolouri, G. K. Rohde, and D. Slepčev. A transportation l^ p distance for signal analysis. Journal of Mathematical Imaging and Vision, 59(2):187–210, 2017.
-  T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
-  M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11:443–482, 1999.
-  S. S. Vempala. The random projection method, volume 65. American Mathematical Soc., 2005.
-  C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
-  S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha. Wasserstein learning of deep generative point process models. arXiv preprint arXiv:1705.08051, 2017.
-  G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: From gaussian mixture models to structured sparsity. IEEE Transactions on Image Processing, 21(5):2481–2499, 2012.
7 Supplementary material
7.1 Maximum log-likelihood and KL-divergence
The KL-divergence between $I_y$ and $I_x$ is defined as:

$$KL(I_y \| I_x) = \int I_y(y) \log\left( \frac{I_y(y)}{I_x(y)} \right) dy.$$

For the maximum log-likelihood, in the limit as the number of samples grows to infinity, $N \to \infty$, we have:

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^N \log\big( I_x(y_n) \big) = \mathbb{E}_{y \sim I_y}\big[ \log I_x(y) \big] = \int I_y(y) \log\big( I_x(y) \big)\, dy = -KL(I_y \| I_x) + \int I_y(y) \log\big( I_y(y) \big)\, dy,$$

where the last term is the negative entropy of $I_y$ and does not depend on the GMM parameters; hence maximizing the log-likelihood is equivalent to minimizing $KL(I_y \| I_x)$.
7.2 RMSProp update equations
SGD often suffers from oscillatory behavior across the slopes of a ravine while making only incremental progress towards the optimal point. Various momentum-based methods have been introduced to adaptively change the learning rate of SGD and dampen this oscillatory behavior. In our work, we used RMSProp, introduced by Tieleman and Hinton, which is a momentum-based technique for SGD. Let $\tau$ be the learning rate, $\gamma$ the decay parameter, and $\beta$ the momentum parameter, and write $g_t = \partial E / \partial \mu_k$ for the gradient of the energy with respect to a GMM parameter, here $\mu_k$. The update is then calculated from:

$$v_t = \gamma\, v_{t-1} + (1 - \gamma)\, g_t^2, \qquad m_t = \beta\, m_{t-1} - \frac{\tau}{\sqrt{v_t + \epsilon}}\, g_t, \qquad \mu_k \leftarrow \mu_k + m_t,$$

where $m_t$ and $v_t$ track the first and second moments (uncentered variance) of the gradient updates, respectively. Similar update equations hold for $\alpha_k$ and $\Sigma_k$.
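A minimal RMSProp-with-momentum sketch (a generic textbook version, not the paper's exact update; the hyperparameter values are illustrative) applied to a toy quadratic objective:

```python
def rmsprop_step(param, grad, state, lr=0.005, decay=0.9, momentum=0.5, eps=1e-8):
    """One RMSProp update with momentum: a running average of squared gradients
    rescales the step, and a momentum term smooths the update direction."""
    v = decay * state["v"] + (1.0 - decay) * grad * grad
    delta = momentum * state["delta"] - lr * grad / (v + eps) ** 0.5
    state["v"], state["delta"] = v, delta
    return param + delta

# Toy objective E(x) = (x - 3)^2 with gradient 2 (x - 3).
x, state = 0.0, {"v": 0.0, "delta": 0.0}
for _ in range(3000):
    x = rmsprop_step(x, 2.0 * (x - 3.0), state)
```

Because the step is normalized by the gradient's running magnitude, progress is steady even where the raw gradient is small, which is the behavior exploited in the SWM updates.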
7.3 Experimental Details
Here we provide the detailed implementation and architecture of the adversarial autoencoders we used in our paper. Figure 8 shows the detailed architectures of the deep adversarial autoencoders for the MNIST and CelebA datasets. The architectures of the deep binary classifiers used for scoring the fitness of the GMMs are shown in Figure 7. We used Keras for the implementation of our experiments.
For the loss function of the autoencoder, we used the mean absolute error between the input image and the decoded image, together with the adversarial loss of the decoded image (equally weighted). The loss functions for training the adversarial networks and the binary classifiers were chosen to be cross-entropy. Finally, we used RMSProp as the default optimizer for all the models, and trained the models for a fixed number of epochs with a fixed batch size.
7.4 CelebA Generated Images
Figure 9 shows all the GMM components learned by EM and our SWM formulation.