1 Introduction
Finite Gaussian Mixture Models (GMMs), also called Mixtures of Gaussians (MoG), are powerful, parametric, and probabilistic tools that are widely used as flexible models for multivariate density estimation in various applications in machine learning, computer vision, and signal/image analysis. GMMs have been utilized for: image representation [5, 17] to generate feature signatures, point set registration [24], adaptive contrast enhancement [9], inverse problems including super-resolution and deblurring [19, 55], time series classification [8], texture segmentation [43], and robotic visuomotor transformations [23], among many others. As a special case of general latent variable models, finite GMM parameters can serve as a concise embedding [40], which provides a compressed representation of the data. Moreover, GMMs can be used to approximate any density defined on $\mathbb{R}^d$ with a large enough number of mixture components. To fit a finite GMM to observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components. Several techniques have been introduced to answer the first question [37]. The focus of this paper is on the latter question.
The existing methods to estimate the GMM parameters are based on minimizing the negative log-likelihood (NLL) of the data with respect to the parameters [51]. The Expectation Maximization (EM) algorithm [15] is the prominent way of minimizing the NLL (though see, e.g., [38, 22] for alternatives). While EM remains the most popular method for estimating GMMs, it only guarantees convergence to a stationary point of the likelihood function. Various studies have shown that the likelihood function has bad local maxima that can have arbitrarily worse log-likelihood values compared to any of the global maxima [22, 25, 2]. More importantly, Jin et al. [25] proved that with random initialization, the EM algorithm converges to a bad critical point with high probability. This issue makes the EM algorithm sensitive to the choice of initial parameters.
In the limit (i.e. having infinitely many i.i.d. samples), minimizing the NLL is equivalent to minimizing the Kullback–Leibler (KL) divergence between the data distribution and the GMM, with respect to the GMM parameters. Here, we propose, alternatively, to minimize the p-Wasserstein distance, and more specifically the sliced p-Wasserstein distance [28], between the data distribution and the GMM. The Wasserstein distance and its variations have attracted a lot of attention from the machine learning (ML) and signal processing communities lately [28, 3, 16]. It has been shown that optimizing with respect to the Wasserstein loss has various practical benefits over the KL-divergence loss [44, 16, 39, 3, 20]. Importantly, unlike the KL-divergence and its related dissimilarity measures (e.g. the Jensen–Shannon divergence), the Wasserstein distance provides a meaningful notion of closeness (i.e. distance) for distributions supported on non-overlapping low-dimensional manifolds. This motivates our proposed formulation for estimating GMMs. To overcome the computational burden of Wasserstein minimization for high-dimensional distributions, we propose to use the sliced Wasserstein distance [6, 30, 28]. Our method slices the high-dimensional data distribution via random projections and minimizes the Wasserstein distance between the projected one-dimensional distributions with respect to the GMM parameters. We note that the idea of characterizing a high-dimensional distribution via its random projections has been studied in various prior works [52, 26]. The work in [26], for instance, minimizes a norm between the slices of the data distribution and the GMM with respect to the parameters. This method, however, suffers from the same shortcomings as the KL-divergence-based methods.
The p-Wasserstein distances, and more generally the optimal mass transportation problem, have recently gained plenty of attention from the computer vision and machine learning communities [28, 49, 42, 29, 54, 47, 3]. We note that the p-Wasserstein distances have also been used in regard to GMMs, however, as a distance metric to compare GMM models [11, 34, 45]. Our proposed method, on the other hand, is an alternative framework for fitting a GMM to data via the sliced p-Wasserstein distance.
In what follows, we first formulate the p-Wasserstein distance, the Radon transform, and the sliced p-Wasserstein distance in Section 2. In Section 3, we reiterate the connection between the K-means problem and the Wasserstein means problem [21], extend it to GMMs, and formulate the sliced Wasserstein means problem. Our numerical experiments are presented in Section 4. Finally, we conclude our paper in Section 5.

2 Preliminary
2.1 p-Wasserstein distance
In this section we review the preliminary concepts and formulations needed to develop our framework. Let $P_p(\Omega)$ be the set of Borel probability measures with finite $p$'th moment defined on a given metric space $(\Omega, d)$, and let $\mu \in P_p(X)$ and $\nu \in P_p(Y)$ be probability measures defined on $X, Y \subseteq \Omega$ with corresponding probability density functions $I_\mu$ and $I_\nu$, i.e. $d\mu(x) = I_\mu(x)\,dx$ and $d\nu(y) = I_\nu(y)\,dy$. The $p$-Wasserstein distance for $p \in [1, \infty)$ between $\mu$ and $\nu$ is defined as the optimal mass transportation (OMT) problem [53] with cost function $c(x, y) = d^p(x, y)$, such that:

$$W_p(\mu, \nu) = \Big( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} d^p(x, y)\, d\gamma(x, y) \Big)^{1/p}, \qquad (1)$$

where $\Gamma(\mu, \nu)$ is the set of all transportation plans, i.e. joint measures $\gamma \in P(X \times Y)$ whose marginals satisfy $\gamma(A \times Y) = \mu(A)$ and $\gamma(X \times B) = \nu(B)$ for any Borel subsets $A \subseteq X$ and $B \subseteq Y$.
Due to Brenier's theorem [7], for absolutely continuous probability measures $\mu$ and $\nu$ (with respect to the Lebesgue measure), the $p$-Wasserstein distance can be equivalently obtained from

$$W_p(\mu, \nu) = \Big( \inf_{f \in MP(\mu, \nu)} \int_X d^p(x, f(x))\, d\mu(x) \Big)^{1/p}, \qquad (2)$$

where $MP(\mu, \nu) = \{ f : X \to Y \mid f_{\#}\mu = \nu \}$ and $f_{\#}\mu$ represents the pushforward of measure $\mu$, characterized by $f_{\#}\mu(B) = \mu(f^{-1}(B))$ for any Borel subset $B \subseteq Y$. When a transport map exists, the transport plan and the transport map are related via $\gamma = (Id \times f)_{\#}\mu$. Note that in most engineering and computer science applications $\Omega$ is a compact subset of $\mathbb{R}^d$ and $d(x, y) = |x - y|$ is the Euclidean distance. By abuse of notation we will use $W_p(\mu, \nu)$ and $W_p(I_\mu, I_\nu)$ interchangeably throughout the manuscript. For a more detailed explanation of the Wasserstein distances and the optimal mass transport problem, we refer the reader to the recent review article by Kolouri et al. [28] and the references therein.
One-dimensional distributions: The case of one-dimensional continuous probability measures is specifically interesting, as the $p$-Wasserstein distance has a closed-form solution. More precisely, for one-dimensional probability measures there exists a unique monotonically increasing transport map that pushes one measure onto another. Let $F_\mu(x) = \mu((-\infty, x])$ be the cumulative distribution function (CDF) of $I_\mu$, and define $F_\nu$ to be the CDF of $I_\nu$. The transport map is then uniquely defined as $f(x) = F_\nu^{-1}(F_\mu(x))$, and consequently the $p$-Wasserstein distance is calculated as:

$$W_p(\mu, \nu) = \Big( \int_X |F_\nu^{-1}(F_\mu(x)) - x|^p\, d\mu(x) \Big)^{1/p} = \Big( \int_0^1 |F_\nu^{-1}(t) - F_\mu^{-1}(t)|^p\, dt \Big)^{1/p}, \qquad (3)$$

where in the second equality we used the change of variable $t = F_\mu(x)$. The closed-form solution of the $p$-Wasserstein distance in one dimension is an attractive property that gives rise to the Sliced-Wasserstein (SW) distance. Next we review the Radon transform, which enables the definition of the Sliced Wasserstein distance.
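To make the closed form in Equation (3) concrete, the following is a minimal sketch, assuming NumPy; `wasserstein_1d` is our own illustrative helper, not part of the paper. For two one-dimensional empirical distributions with equally many samples, composing one empirical CDF with the other quantile function reduces to matching sorted samples.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Closed-form 1-D p-Wasserstein distance (cf. Equation (3)) between two
    empirical distributions with the same number of samples: the optimal
    (monotone) transport map simply matches sorted samples, i.e. composes
    the empirical quantile functions."""
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)
```

For instance, translating a sample set by a constant $c$ yields a distance of exactly $|c|$, as expected from the monotone matching.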
2.2 Radon transform
The $d$-dimensional Radon transform, $R$, maps a function $I \in L^1(\mathbb{R}^d)$ to the set of its integrals over the hyperplanes of $\mathbb{R}^d$ and is defined as

$$R I(t, \theta) = \int_{\mathbb{R}^d} I(x)\, \delta(t - x \cdot \theta)\, dx, \qquad (4)$$

for all $(t, \theta) \in \mathbb{R} \times \mathbb{S}^{d-1}$, where $\mathbb{S}^{d-1}$ is the unit sphere in $\mathbb{R}^d$ and $\delta$ is the one-dimensional Dirac delta function. Note that $R : L^1(\mathbb{R}^d) \to L^1(\mathbb{R} \times \mathbb{S}^{d-1})$. For the sake of completeness, we note that the Radon transform is an invertible, linear transform, and we denote its inverse as $R^{-1}$, which is also known as the filtered back-projection algorithm and is defined as:

$$I(x) = R^{-1}(R I(t, \theta)) = \int_{\mathbb{S}^{d-1}} \big( R I(\cdot, \theta) * \eta(\cdot) \big)(x \cdot \theta)\, d\theta, \qquad (5)$$
where $\eta$ is a one-dimensional filter with corresponding Fourier transform $\mathcal{F}\eta(\omega) \propto |\omega|^{d-1}$ (it appears due to the Fourier slice theorem; see [41] for more details) and '$*$' denotes convolution. The Radon transform and its inverse are extensively used in Computerized Axial Tomography (CAT) scans in the field of medical imaging, where X-ray measurements integrate the tissue-absorption levels along 2D hyperplanes to provide a tomographic image of the internal organs. Note that in practice, acquiring an infinite number of projections is not feasible; therefore the integration in Equation (5) is replaced with a finite summation over projection angles. A formal measure-theoretic definition of the Radon transform for probability measures can be found in [6].

Radon transform of empirical PDFs: The Radon transform of $I_\mu$ simply follows Equation (4). However, in most machine learning applications we do not have access to the distribution $I_\mu$ but to its samples, $x_n$, $n = 1, \ldots, N$. Kernel density estimation can be used in such scenarios to approximate $I_\mu$ from its samples, $I_\mu(x) \approx \frac{1}{N} \sum_{n=1}^N \phi(x - x_n)$, where $\phi$ is a density kernel with $\int \phi(x)\, dx = 1$ (e.g. a Gaussian kernel). The Radon transform of $I_\mu$ can then be approximated from its samples via:

$$R I_\mu(t, \theta) \approx \frac{1}{N} \sum_{n=1}^N R\phi(t - x_n \cdot \theta,\ \theta). \qquad (6)$$

Note that certain density kernels have an analytic Radon transform. For instance, when $\phi(x) = \delta(x)$, the Radon transform is $R I_\mu(t, \theta) \approx \frac{1}{N} \sum_{n=1}^N \delta(t - x_n \cdot \theta)$.
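As a small illustration of Equation (6), the following sketch (our own helper, assuming NumPy and an isotropic Gaussian kernel with bandwidth `h`) evaluates the Radon slice of an empirical KDE along a unit direction `theta`: the slice is simply a one-dimensional Gaussian KDE of the projected samples.

```python
import numpy as np

def radon_slice_kde(X, theta, t, h=0.3):
    """Empirical Radon slice of a kernel density estimate: with an isotropic
    Gaussian kernel of bandwidth h, the slice along unit direction theta is a
    1-D Gaussian KDE of the projected samples X @ theta, evaluated at t."""
    proj = X @ theta                                   # one-dimensional projections
    z = (t[:, None] - proj[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(proj) * h * np.sqrt(2.0 * np.pi))
```

The returned slice is a proper one-dimensional density: it integrates to one and concentrates around the projected sample locations.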
Radon transform of multivariate Gaussians: Let $I_\mu = \mathcal{N}(\mu, \Sigma)$ be a $d$-dimensional multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$. A slice/projection of the Radon transform of $I_\mu$ along a unit direction $\theta$ is then a one-dimensional normal distribution $\mathcal{N}(\theta \cdot \mu,\ \theta^T \Sigma\, \theta)$. Given the linearity of the Radon transform, this indicates that a slice of a $d$-dimensional GMM is a one-dimensional GMM with component means $\theta \cdot \mu_k$ and variances $\theta^T \Sigma_k\, \theta$.
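The slice parameters above are a simple quadratic form; the following sketch (an illustrative helper of ours, assuming NumPy and a unit-norm direction) computes them:

```python
import numpy as np

def gaussian_slice(theta, mu, Sigma):
    """Slice of a d-dimensional Gaussian N(mu, Sigma) along a unit direction
    theta: a 1-D normal with mean theta . mu and variance theta^T Sigma theta."""
    theta = np.asarray(theta, float)
    m = float(theta @ np.asarray(mu, float))
    v = float(theta @ np.asarray(Sigma, float) @ theta)
    return m, v
```

By linearity of the Radon transform, applying `gaussian_slice` to every component of a GMM (keeping the mixture weights unchanged) yields the corresponding one-dimensional GMM slice.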
2.3 Sliced Wasserstein Distance
The idea behind the sliced p-Wasserstein distance is to first obtain a family of marginal distributions (i.e. one-dimensional distributions) for a higher-dimensional probability distribution through linear projections (via the Radon transform), and then calculate the distance between two input distributions as a functional of the p-Wasserstein distances of their marginal distributions. In this sense, the distance is obtained by solving several one-dimensional optimal transport problems, which have closed-form solutions. More precisely, the Sliced Wasserstein distance between $I_\mu$ and $I_\nu$ is defined as:

$$SW_p(I_\mu, I_\nu) = \Big( \int_{\mathbb{S}^{d-1}} W_p^p\big( R I_\mu(\cdot, \theta),\ R I_\nu(\cdot, \theta) \big)\, d\theta \Big)^{1/p}. \qquad (7)$$
The Sliced Wasserstein distance as defined above is symmetric and satisfies subadditivity and the coincidence axiom; hence it is a true metric [30].
The sliced Wasserstein distance is especially useful when one only has access to samples of a high-dimensional PDF and kernel density estimation is required to estimate $I_\mu$. One-dimensional kernel density estimation of PDF slices is a much simpler task compared to direct estimation of $I_\mu$ from its samples. The catch, however, is that as the dimensionality grows, a larger number of projections is required to estimate $I_\mu$ from $R I_\mu$. In short, if a reasonably smooth two-dimensional distribution can be approximated by $L$ of its projections (up to an acceptable reconstruction error), then one would require on the order of $L^{d-1}$ projections to approximate a similarly smooth $d$-dimensional distribution (for $d \geq 2$). In later sections we show that the projections can be randomized in a stochastic gradient descent fashion for learning Gaussian mixture models.
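A Monte-Carlo estimate of Equation (7) for two sample sets follows directly from the one-dimensional closed form: project onto random unit directions and average the sorted-sample distances. The sketch below is our own illustrative helper, assuming NumPy and equal sample counts.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=0):
    """Monte-Carlo estimate of the sliced p-Wasserstein distance between two
    equal-size d-dimensional sample sets: average the closed-form 1-D
    p-Wasserstein distance (sorted-sample matching) over random directions."""
    rng = np.random.default_rng(seed)
    sw_p = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)            # uniform direction on S^{d-1}
        xp, yp = np.sort(X @ theta), np.sort(Y @ theta)
        sw_p += np.mean(np.abs(xp - yp) ** p)     # 1-D W_p^p via sorted samples
    return (sw_p / n_proj) ** (1.0 / p)
```

For example, shifting a 2-D sample set by one unit along a coordinate axis gives an $SW_2$ value of about $1/\sqrt{2}$, since the shift is attenuated by the random projections.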
3 Sliced Wasserstein Means and Gaussian Mixture Models
Here we first reiterate the connection between the K-means clustering algorithm and the Wasserstein means problem, and then extend this connection to GMMs and state the need for the sliced Wasserstein distance. Let $x_n \in \mathbb{R}^d$, for $n \in \{1, \ldots, N\}$, be the observed samples. The K-means clustering algorithm seeks the $K$ points, $\mu_k$ for $k \in \{1, \ldots, K\}$, that best represent the samples. Formally,

$$\min_{\mu, C}\ \frac{1}{N} \sum_{n=1}^N \sum_{k=1}^K C_{n,k}\, |x_n - \mu_k|^2, \qquad (8)$$

where $C \in \{0, 1\}^{N \times K}$ with $\sum_k C_{n,k} = 1$ contains the one-hot labels of the samples. Let $I_N(x) = \frac{1}{N} \sum_{n=1}^N \phi(x - x_n)$ be the empirical distribution of the samples, where $\phi$ is a kernel density estimator (e.g. a radial basis function kernel, or the Dirac delta function in its limit). Then the K-means problem can alternatively be solved by minimizing a statistical distance/divergence between $I_N$ and $I_\mu(x) = \frac{1}{K} \sum_{k=1}^K \phi(x - \mu_k)$. A common choice for such a distance/divergence is the Kullback–Leibler divergence (KL-divergence) [4, 10]. Alternatively, the p-Wasserstein distance can be used to estimate the parameters of $I_\mu$:

$$\min_{\mu}\ W_p^p(I_N, I_\mu). \qquad (9)$$
We discuss the benefits of the p-Wasserstein distance over the KL-divergence in the next subsection. The minimization above is known as the Wasserstein means problem and is closely related to the Wasserstein barycenter problem [1, 46, 14, 21]. The main difference is that in those works the goal is to find a measure $\nu$ that minimizes the sum of Wasserstein distances to a set of given low-dimensional distributions (2D or 3D images or point clouds). The strategy in [1, 46, 14] could also be extended into a clustering problem, though the two formulations are still significantly different, given that the inputs to the Wasserstein distance are very different. Note also that K-means is equivalent to a variational EM approximation of a GMM with isotropic Gaussians [36]; therefore, a natural extension of the Wasserstein means problem can be formulated to fit a general GMM to $I_N$. To do so, we let the distribution $I_\mu$ be the parametric GMM:

$$I_\mu(x) = \sum_{k=1}^K \alpha_k\, \mathcal{N}(x;\ \mu_k, \Sigma_k), \qquad \sum_{k=1}^K \alpha_k = 1,$$

and Equation (9) is solved to find the $\mu_k$s, $\Sigma_k$s, and $\alpha_k$s. Next, we describe the benefits of using the Wasserstein distance in Equation (9) to fit a general GMM to the observed data, compared to common log-likelihood maximization schemes.
3.1 Wasserstein Means vs. Maximum Log-Likelihood
General GMMs are often fitted to the observed data points, $x_n$, by maximizing the log-likelihood of the samples with respect to the GMM parameters. Formally, this is written as:

$$\max_{\alpha, \mu, \Sigma}\ \frac{1}{N} \sum_{n=1}^N \log I_\mu(x_n). \qquad (10)$$

It is straightforward to show that in the limit, as the number of samples grows to infinity, $N \to \infty$, maximum log-likelihood becomes equivalent to minimizing the KL-divergence between the data distribution $I_X$ and $I_\mu$ (see the supplementary material for a proof):

$$\lim_{N \to \infty}\ \arg\max_{\mu}\ \frac{1}{N} \sum_{n=1}^N \log I_\mu(x_n) = \arg\min_{\mu}\ KL(I_X \| I_\mu).$$
The p-Wasserstein distance has been shown to have certain benefits over the commonly used KL-divergence and its related distances/divergences (i.e., other examples of Bregman divergences, including the Jensen–Shannon (JS) divergence and the Itakura–Saito distance) [3]. For a general GMM, the model $I_\mu$ is continuous and smooth (i.e. infinitely differentiable) in its parameters and is locally Lipschitz; therefore, the p-Wasserstein distance between the data and the model is continuous and differentiable everywhere, while this is not true for the KL-divergence. In addition, in scenarios where the distributions are supported on low-dimensional manifolds, the KL-divergence and other Bregman divergences may be difficult cost functions to optimize given their limited capture range. This limitation is due to their 'Eulerian' nature, in the sense that the distributions are compared at fixed spatial coordinates (i.e., bin-to-bin comparison for discrete measures), as opposed to the p-Wasserstein distance, which is 'Lagrangian' and morphs one distribution to match another by finding correspondences in the domain of these distributions (i.e., Wasserstein distances perform cross-bin comparisons).
To get a practical sense of the benefits of the Wasserstein means problem over maximum log-likelihood estimation, we study two simple scenarios. In the first scenario, we generate one-dimensional samples, $x_n$, from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where we assume $\sigma$ is known, and visualize the negative log-likelihood (NLL) and the Wasserstein means (WM) objective as a function of $\mu$. Figure 1 (a) shows the first scenario and the corresponding energy landscapes for the negative log-likelihood and the Wasserstein means problem. It can be seen that while the global optimum is the same for both problems, the Wasserstein means landscape is less sensitive to the initial point; hence a gradient descent approach would easily converge to the optimal point regardless of the starting point. In the second scenario, we generated samples, $x_n$, from a mixture of two one-dimensional Gaussian distributions. Next, we assumed that the mixture coefficients $\alpha_k$s and the standard deviations $\sigma_k$s are known and visualized the corresponding energy landscapes for NLL and WM as a function of the means $\mu_k$s (see Figure 1 (b)). It can clearly be seen that although the global optimum of both problems is the same, the energy landscape of the Wasserstein means problem does not suffer from local minima and is much smoother.

3.2 Sliced Wasserstein Means
We propose to approximate the p-Wasserstein distance between $I_N$ and $I_\mu$ using the SW distance. Substituting the Wasserstein distance in Equation (9) with the SW distance leads to the Sliced p-Wasserstein Means (SWM) problem, which can be written as:

$$\min_{\mu}\ \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} |f_\theta(t) - t|^p\, R I_N(t, \theta)\, dt\, d\theta, \qquad (11)$$

where, for a fixed $\theta$, $f_\theta$ is the optimal transport map between $R I_N(\cdot, \theta)$ and $R I_\mu(\cdot, \theta)$ and satisfies $\frac{\partial f_\theta(t)}{\partial t}\, R I_\mu(f_\theta(t), \theta) = R I_N(t, \theta)$. Note that, since $I_\mu$ is an absolutely continuous PDF, an optimal transport map will exist even when $I_N$ is not an absolutely continuous PDF (e.g. when $\phi(x) = \delta(x)$). Moreover, since the slices/projections are one-dimensional, the transport map $f_\theta$ is uniquely defined and has a closed-form solution calculated from Equation (3). The Radon transforms in Equation (11) are:

$$R I_N(t, \theta) \approx \frac{1}{N} \sum_{n=1}^N R\phi(t - x_n \cdot \theta,\ \theta), \qquad R I_\mu(t, \theta) = \sum_{k=1}^K \alpha_k\, \mathcal{N}(t;\ \theta \cdot \mu_k,\ \theta^T \Sigma_k\, \theta). \qquad (12)$$
The new formulation avoids the optimization required for calculating the Wasserstein distance and enables an efficient implementation for clustering high-dimensional data. Figure 2 demonstrates the high-level idea behind slicing the high-dimensional PDFs $I_N$ and $I_\mu$ and minimizing the p-Wasserstein distance between these slices over the GMM components. Moreover, given the high-dimensional nature of the problem, estimating the density $I_N$ in $\mathbb{R}^d$ requires a large number of samples; however, the projections of $I_N$ are one-dimensional, and therefore it may not be critical to have a large number of samples to estimate these one-dimensional densities.
Optimization scheme: To obtain a numerical optimization scheme that minimizes the problem in Equation (11), we first discretize the set of directions/projections. This corresponds to the use of a finite set $\Theta \subset \mathbb{S}^{d-1}$ and a minimization of the following energy function:

$$\frac{1}{|\Theta|} \sum_{\theta \in \Theta} \int_{\mathbb{R}} |f_\theta(t) - t|^p\, R I_N(t, \theta)\, dt. \qquad (13)$$

A fine sampling of $\mathbb{S}^{d-1}$ is required for Equation (13) to be a good approximation of (11). Such a sampling, however, becomes prohibitively expensive for high-dimensional data. Alternatively, following the approach presented in [6], we utilize random samples of $\mathbb{S}^{d-1}$ at each minimization step to approximate Equation (11). This leads to a stochastic gradient descent scheme where, instead of randomly sampling the input data, we randomly sample the projection angles. Finally, the GMM parameters are updated through an EM-like approach where, for fixed GMM parameters, we calculate the optimal transport maps between random slices of $I_N$ and $I_\mu$, followed by updating the GMM parameters for fixed transport maps $f_\theta$. Below we describe these steps:

1. Generate $L$ random samples, $\{\theta_1, \ldots, \theta_L\}$, from $\mathbb{S}^{d-1}$.

2. Fix the GMM, $I_\mu$, and calculate the optimal transport map $f_{\theta_l}$ between each pair of slices $R I_N(\cdot, \theta_l)$ and $R I_\mu(\cdot, \theta_l)$ via:

$$f_{\theta_l}(t) = F^{-1}_{R I_\mu(\cdot, \theta_l)}\big( F_{R I_N(\cdot, \theta_l)}(t) \big), \qquad (14)$$

where $F_{R I_\mu(\cdot, \theta_l)}$ is the CDF of $R I_\mu(\cdot, \theta_l)$.

3. Fix the transport maps, $f_{\theta_l}$, and update the GMM parameters by a gradient descent step on the energy in Equation (13) (we used RMSProp; see the supplementary material).

4. Project the updated $\Sigma_k$s onto the positive semidefinite cone, and renormalize the $\alpha_k$s to satisfy $\sum_k \alpha_k = 1$.
Notice that the derivative of the energy in Equation (13) with respect to the $k$'th component of the mixture model is independent of the other components. In addition, the transport map for each projection, $f_{\theta_l}$, in Equation (14) is calculated independently of the other projections. Therefore, the optimization can be heavily parallelized in each iteration. We note that we also experimented with the Adam optimizer [27], but did not see any improvement over RMSProp. The detailed update equations are included in the supplementary material. In what follows, we show the SWM solver for estimating GMM parameters in action.
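The stochastic scheme above can be sketched in a few lines. The following is a deliberately simplified, hypothetical illustration of the SWM idea, not the paper's full solver: it fits only the component means of a GMM with fixed uniform weights and fixed isotropic covariance `sigma**2 * I`, and it represents the projected model by samples drawn from the current GMM rather than by its analytic one-dimensional CDF; `swm_fit_means` and all its parameters are our own naming.

```python
import numpy as np

def swm_fit_means(X, mu0, sigma=0.5, n_iter=400, n_proj=10, lr=0.2, seed=0):
    """Toy sliced-Wasserstein-means sketch: stochastic gradient descent on
    random 1-D slices, fitting the means of a K-component isotropic GMM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = np.array(mu0, dtype=float)
    K = len(mu)
    for _ in range(n_iter):
        # draw model samples, remembering which component produced each
        labels = rng.integers(0, K, size=N)
        Z = mu[labels] + sigma * rng.standard_normal((N, d))
        grad = np.zeros_like(mu)
        for _ in range(n_proj):
            theta = rng.standard_normal(d)
            theta /= np.linalg.norm(theta)
            # sort-matching the projections gives the 1-D optimal transport plan
            ix, iz = np.argsort(X @ theta), np.argsort(Z @ theta)
            resid = (Z @ theta)[iz] - (X @ theta)[ix]
            for k in range(K):
                mask = labels[iz] == k
                if mask.any():
                    # the projected model samples depend on mu_k through theta
                    grad[k] += resid[mask].mean() * theta
        mu -= lr * grad / n_proj
    return mu
```

As in the paper's scheme, each projection's matching is independent of the others, so the inner loop parallelizes trivially; unlike the paper's scheme, weights and covariances are held fixed here for brevity.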
4 Numerical Experiments
We ran various experiments on three datasets to test our proposed method for learning GMM parameters. The first dataset is a two-dimensional point distribution consisting of a ring, a square, and a connecting line (see Figure 3). To show the applicability of our method to higher-dimensional datasets, we also used the MNIST dataset [31] and the CelebFaces Attributes Dataset (CelebA) [35].
4.1 Robustness to initialization
We started by running a simple experiment to demonstrate the robustness of our proposed formulation to different initializations. In this test, we used a two-dimensional dataset consisting of a ring, a square, and a line connecting them. For a fixed number of modes in our experiment, we randomly initialized the GMM. Next, for each initialization, we optimized the GMM parameters using the EM algorithm as well as our proposed method. We repeated this experiment 100 times.
Figure 3 shows sample results of the fitted GMM models for both algorithms (top row). Moreover, we calculated the histograms of the negative log-likelihood of the fitted GMM and the sliced Wasserstein distance between the fitted GMM and the empirical data distribution (bottom row). It can be seen that our proposed formulation provides a consistent model regardless of the initialization. In of initializations, our method achieved the optimal negative log-likelihood, compared to only for EM-GMM. In addition, both the negative log-likelihood and the sliced Wasserstein distance for our method are smaller than those of the EM algorithm, indicating that our solution is closer to the global optimum (up to permutations of the modes).
4.2 Highdimensional datasets
We analyzed the performance of our proposed method in modeling high-dimensional data distributions using the MNIST dataset [31] and the CelebA dataset [35]. To capture the nonlinearity of the image data and boost the applicability of GMMs, we trained an adversarial deep convolutional autoencoder (Figure 4, left) on the image data. Next, we modeled the distribution of the data in the embedded space via a GMM. The GMM was then used to generate samples in the embedding space, which were subsequently decoded to generate synthetic (i.e. 'fake') images. In learning the GMM, we compared the EM algorithm with our proposed method, SW-GMM. We note that the entire pipeline is in an unsupervised learning setting. Figure 4 demonstrates the steps of our experiment (left) and provides a qualitative measure of the generated samples (right) for the MNIST dataset. It can be seen that the SW-GMM model leads to more visually appealing samples compared to the EM-GMM. In addition, we trained a CNN classifier on the MNIST training data. We then generated 10,000 samples from each GMM component and classified these samples to measure the fidelity/pureness of each component. Ideally, each component should only be assigned to a single digit. We found that the EM-GMM components were on average pure, compared to for the SW-GMM components. Similarly, a deep convolutional autoencoder was learned for the CelebA face dataset, and a GMM was trained in the embedding space. Figure 5 shows samples generated from GMM components learned by EM and by our proposed method (the samples generated from all components are included in the supplementary material). We note that Figures 4 and 5 only provide qualitative measures of how well the GMM fits the dataset. Next, we provide quantitative measures of the fitness of the GMMs for both methods.
We used adversarial training of neural networks [18, 33] to provide a goodness-of-fit measure for the GMM with respect to the data distribution. In short, we use success in fooling an adversary network as an evaluation metric for the goodness of fit of a GMM. A deep discriminator/classifier was trained to distinguish whether a data point was sampled from the actual data distribution or from the GMM. The fooling rate (i.e. error rate) of such a discriminator is a good measure of fitness for the GMM, as a higher error rate translates to a better fit to the distribution of the data. Figure 6 shows the idea behind this experiment and reports the fooling rates for all three datasets used in our experiments. Note that the SW-GMM consistently provides a higher fooling rate, indicating a better fit to the datasets. Moreover, we point out that while for the low-dimensional ring-square-line dataset both methods provide reasonable models, the SW-GMM significantly outperforms EM-GMM on the higher-dimensional datasets (i.e. the MNIST and CelebA datasets). The details of the architectures used in our experiments are included in the supplementary material.
5 Discussion
In this paper, we proposed a novel algorithm for estimating the parameters of a GMM via minimization of the sliced p-Wasserstein distance. In each iteration, our method projects the high-dimensional data distribution onto a small set of one-dimensional distributions utilizing random projections/slices of the Radon transform and estimates the GMM parameters from these one-dimensional projections. While we did not provide a theoretical guarantee that the new objective is convex or that it has fewer local minima, the empirical results suggest that our method is more robust than KL-divergence-based methods, including EM-based maximization of the log-likelihood function. Consistent with this finding, we showed that the p-Wasserstein metrics result in better-behaved energy landscapes. We demonstrated the robustness of our method on three datasets: a two-dimensional ring-square-line distribution and the high-dimensional MNIST and CelebA face datasets. Finally, while we used deep convolutional autoencoders to provide embeddings for two of the datasets and learned GMMs in these embeddings, we emphasize that our method could be applied in other embeddings, including the original data space.
6 Acknowledgement
This work was partially supported by NSF (CCF 1421502). The authors gratefully appreciate countless fruitful conversations with Drs. Charles H. Martin and Dejan Slepčev.
References
 [1] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
 [2] C. Améndola, M. Drton, and B. Sturmfels. Maximum likelihood estimates for Gaussian mixtures are transcendental. In International Conference on Mathematical Aspects of Computer and Information Sciences, pages 579–590. Springer, 2015.
 [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
 [4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.

 [5] C. Beecks, A. M. Ivanescu, S. Kirchhoff, and T. Seidl. Modeling image similarity by Gaussian mixture models and the signature quadratic form distance. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1754–1761. IEEE, 2011.
 [6] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

 [7] Y. Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417, 1991.
 [8] W. M. Campbell, D. E. Sturim, and D. A. Reynolds. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, 2006.
 [9] T. Celik and T. Tjahjadi. Automatic image equalization and contrast enhancement using Gaussian mixture modeling. IEEE Transactions on Image Processing, 21(1):145–156, 2012.
 [10] K. Chaudhuri and A. McGregor. Finding metric structure in information theoretic clustering. In COLT, volume 8, page 10, 2008.
 [11] Y. Chen, T. T. Georgiou, and A. Tannenbaum. Optimal transport for Gaussian mixture models. arXiv preprint arXiv:1710.07876, 2017.
 [12] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
 [13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
 [14] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.
 [15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
 [16] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
 [17] J. Goldberger, S. Gordon, and H. Greenspan. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In Proceedings of the IEEE International Conference on Computer Vision, page 487. IEEE, 2003.
 [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [19] J. A. Guerrero-Colón, L. Mancera, and J. Portilla. Image restoration using space-variant Gaussian scale mixtures in overcomplete pyramids. IEEE Transactions on Image Processing, 17(1):27–41, 2008.
 [20] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
 [21] N. Ho, X. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung. Multilevel clustering via Wasserstein means. arXiv preprint arXiv:1706.03883, 2017.
 [22] H. Hoffmann. Unsupervised Learning of Visuomotor Associations, volume 11 of MPI Series in Biological Cybernetics. Logos Verlag Berlin, 2005.
 [23] H. Hoffmann, W. Schenck, and R. Möller. Learning visuomotor transformations for gaze-control and grasping. Biological Cybernetics, 93:119–130, 2005.
 [24] B. Jian and B. C. Vemuri. Robust point set registration using Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1633–1645, 2011.
 [25] C. Jin, Y. Zhang, S. Balakrishnan, M. J. Wainwright, and M. I. Jordan. Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Advances in Neural Information Processing Systems, pages 4116–4124, 2016.
 [26] A. T. Kalai, A. Moitra, and G. Valiant. Disentangling gaussians. Communications of the ACM, 55(2):113–120, 2012.
 [27] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [28] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machinelearning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.

 [29] S. Kolouri and G. K. Rohde. Transport-based single frame super resolution of very low resolution face images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2015.
 [30] S. Kolouri, Y. Zou, and G. K. Rohde. Sliced-Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4876–4884, 2016.

 [31] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
 [32] B. Lévy. A numerical algorithm for semi-discrete optimal transport in 3D. ESAIM Math. Model. Numer. Anal., 49(6):1693–1715, 2015.
 [33] J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
 [34] P. Li, Q. Wang, and L. Zhang. A novel Earth Mover's Distance methodology for image matching with Gaussian mixture models. In Proceedings of the IEEE International Conference on Computer Vision, pages 1689–1696, 2013.
 [35] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
 [36] J. Lücke and D. Forster. k-means is a variational EM approximation of Gaussian mixture models. arXiv preprint arXiv:1704.04812, 2017.
 [37] G. J. McLachlan and S. Rathnayake. On the number of components in a Gaussian mixture model. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5):341–355, 2014.
 [38] R. Möller and H. Hoffmann. An extension of neural gas to local PCA. Neurocomputing, 62:305–326, 2004.

 [39] G. Montavon, K.-R. Müller, and M. Cuturi. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726, 2016.
 [40] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
 [41] F. Natterer. The mathematics of computerized tomography, volume 32. Siam, 1986.
 [42] S. R. Park, S. Kolouri, S. Kundu, and G. K. Rohde. The cumulative distribution transform and linear pattern classification. Applied and Computational Harmonic Analysis, 2017.
 [43] H. Permuter, J. Francos, and I. Jermyn. A study of Gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognition, 39(4):695–706, 2006.
 [44] G. Peyré, J. Fadili, and J. Rabin. Wasserstein active contours. In Image Processing (ICIP), 2012 19th IEEE International Conference on, pages 2541–2544. IEEE, 2012.
 [45] Y. Qian, E. Vazquez, and B. Sengupta. Deep geometric retrieval. arXiv preprint arXiv:1702.06383, 2017.
 [46] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2012.
 [47] A. Rolet, M. Cuturi, and G. Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In Artificial Intelligence and Statistics, pages 630–638, 2016.
 [48] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (Proc. SIGGRAPH 2015), 2015.
 [49] M. Thorpe, S. Park, S. Kolouri, G. K. Rohde, and D. Slepčev. A transportation L^p distance for signal analysis. Journal of Mathematical Imaging and Vision, 59(2):187–210, 2017.
 [50] T. Tieleman and G. Hinton. Lecture 6.5, RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
 [51] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11:443–482, 1999.
 [52] S. S. Vempala. The random projection method, volume 65. American Mathematical Soc., 2005.
 [53] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
 [54] S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha. Wasserstein learning of deep generative point process models. arXiv preprint arXiv:1705.08051, 2017.
 [55] G. Yu, G. Sapiro, and S. Mallat. Solving inverse problems with piecewise linear estimators: From Gaussian mixture models to structured sparsity. IEEE Transactions on Image Processing, 21(5):2481–2499, 2012.
7 Supplementary material
7.1 Maximum log-likelihood and KL-divergence
The KL-divergence between $I_X$ and $I_\mu$ is defined as:

$$KL(I_X \| I_\mu) = \int_X I_X(x)\, \log\Big( \frac{I_X(x)}{I_\mu(x)} \Big)\, dx.$$

For the maximum log-likelihood, in the limit as the number of samples grows to infinity, $N \to \infty$, we have:

$$\lim_{N \to \infty}\ \arg\max_{\mu}\ \frac{1}{N} \sum_{n=1}^N \log I_\mu(x_n) = \arg\max_{\mu} \int_X I_X(x)\, \log I_\mu(x)\, dx = \arg\max_{\mu} \int_X I_X(x)\, \log\Big( \frac{I_\mu(x)}{I_X(x)} \Big)\, dx = \arg\min_{\mu}\ KL(I_X \| I_\mu),$$

where the first equality follows from the law of large numbers and the second from adding the $\mu$-independent term $-\int_X I_X(x) \log I_X(x)\, dx$.
7.2 RMSProp update equations
SGD often suffers from oscillatory behavior across the slopes of a ravine while making only incremental progress towards the optimal point. Various momentum-based methods have been introduced to adaptively change the learning rate of SGD and dampen this oscillatory behavior. In our work, we used RMSProp, introduced by Tieleman and Hinton [50], which is a momentum-based technique for SGD. Let $\tau$ be the learning rate, $\gamma$ be the decay parameter, and $\beta$ be the momentum parameter. The update equation for a GMM parameter, here $\mu_k$, is then calculated from:

$$v_{i+1} = \gamma\, v_i + (1 - \gamma)\, \big( \nabla_{\mu_k} E \big)^2, \qquad m_{i+1} = \beta\, m_i + \frac{\tau}{\sqrt{v_{i+1}} + \epsilon}\, \nabla_{\mu_k} E, \qquad \mu_k^{(i+1)} = \mu_k^{(i)} - m_{i+1}, \qquad (15)$$

where $m$ and $v$ are running estimates based on the first and second moments (uncentered variance) of the gradient of the SWM energy $E$ with respect to $\mu_k$, respectively. Similar update equations hold for $\Sigma_k$ and $\alpha_k$.
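A single RMSProp-with-momentum step of the kind described above can be sketched as follows. This is our own illustrative helper, assuming NumPy; the exact update form used in the paper may differ in detail.

```python
import numpy as np

def rmsprop_step(param, grad, v, m, lr=0.05, gamma=0.9, beta=0.5, eps=1e-8):
    """One RMSProp-with-momentum update (cf. Tieleman & Hinton [50]):
    v is a running average of squared gradients, m is a momentum buffer
    on the rescaled gradient."""
    v = gamma * v + (1.0 - gamma) * grad ** 2
    m = beta * m + lr * grad / (np.sqrt(v) + eps)
    return param - m, v, m
```

Iterating this step on a simple quadratic objective drives the parameter toward the minimum while the gradient normalization keeps the step size roughly constant regardless of the gradient's magnitude.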
7.3 Experimental Details
Here we provide the detailed implementation and architecture of the adversarial autoencoders we used in our paper. Figure 8 shows the detailed architectures of the deep adversarial autoencoder for MNIST and CelebA datasets. The architecture of the deep binary classifiers used for scoring the fitness of the GMMs are shown in Figure 7. We used Keras [12] for implementation of our experiments.
For the loss function of the autoencoder, we used the mean absolute error between the input image and the decoded image, together with the adversarial loss of the decoded image (equally weighted). The loss functions for training the adversarial networks and the binary classifiers were chosen to be cross-entropy. Finally, we used RMSProp [50] as the default optimizer for all the models, and trained the models over epochs, with batch size of .

7.4 CelebA Generated Images
Figure 9 shows all the GMM components learned by EM and our SWM formulation.