Permutation invariant networks to learn Wasserstein metrics

by Arijit Sehanobish et al.
Yale University

Understanding the space of probability measures on a metric space equipped with a Wasserstein distance is one of the fundamental questions in mathematical analysis. The Wasserstein metric has received a lot of attention in the machine learning community especially for its principled way of comparing distributions. In this work, we use a permutation invariant network to map samples from probability measures into a low-dimensional space such that the Euclidean distance between the encoded samples reflects the Wasserstein distance between probability measures. We show that our network can generalize to correctly compute distances between unseen densities. We also show that these networks can learn the first and the second moments of probability distributions.






1 Introduction

The Wasserstein distance is a distance function between probability measures on a metric space X. It is a natural way to compare the probability distributions of two variables, where one variable is derived from the other by small, non-uniform perturbations, while strongly reflecting the metric of the underlying space. It can also be used to compare discrete distributions. The Wasserstein distance enjoys a number of useful properties, which likely contribute to its widespread interest amongst mathematicians and computer scientists Bobkov and Ledoux (2019); Ambrosio et al. (2005); Bigot et al. (2013); Canas and Rosasco (2012); del Barrio et al. (1999); Givens and Shortt (1984); Villani (2003, 2008); Arjovsky et al. (2017). However, despite its broad use, the Wasserstein distance has several problems. First, it is computationally expensive to calculate. Second, the Wasserstein distance is not Hadamard differentiable, which can present serious challenges when trying to use it in machine learning. Third, the distance is not robust. To alleviate these problems, one can use entropic regularization to compute an approximation of the Wasserstein distance. Such an approach is more tractable and also enjoys several nice properties Altschuler et al. (2018); Cuturi (2013); Peyré and Cuturi (2020).

In this short article, we are interested in learning about the Wasserstein space of order p, i.e., the infinite-dimensional space of all probability measures with finite moments up to order p on a complete and separable metric space X. More specifically, we ask: (1) can we propose a neural network that correctly computes the Wasserstein distance between measures, even if both of them are not in our training examples? (2) What properties of the measures does such a network learn? For example, does it learn something about the moments of these distributions? (3) What properties of the original Wasserstein space can we preserve in our encoded space?

There has been a lot of work in understanding the space of Gaussian processes Mallasto and Feragen (2017); Takatsu (2011), but our work is closer to approaches that attempt to understand Wasserstein spaces with neural networks Courty et al. (2017); Frogner et al. (2019). However, unlike Courty et al. (2017), we use a permutation invariant network to compare and contrast various densities. Furthermore, unlike Frogner et al. (2019), we approximate the Wasserstein space by learning a mapping from it to a low-dimensional Euclidean space.

2 Theory

Let X be a complete and separable metric space. For simplicity, we take X to be R^n or a closed and bounded subset of R^n. Let P(X) be the space of all probability measures on X. One can endow the space P_p(X) of measures with finite p-th moments with a family of metrics called the Wasserstein metrics W_p:

W_p(μ, ν) = ( inf_{π ∈ Π(μ, ν)} ∫_{X × X} d(x, y)^p dπ(x, y) )^{1/p},    (1)

where Π(μ, ν) is the set of couplings of μ and ν (see Appendix C).
We use the notations W_p(μ, ν) and W_p(X, Y) interchangeably whenever X ∼ μ and Y ∼ ν. We also assume that the p-th moment of μ (and ν) is finite. Most of the following properties regarding the space P_p(X) and W_p are well-known, but we summarize them for the convenience of the reader Santambrogio (2015); Panaretos and Zemel (2019).

Theorem 2.1.

(i) P_p(X) equipped with W_p is a complete and separable metric space.
(ii) If μ and ν are degenerate at x_0 and y_0 respectively, then W_p(μ, ν) = d(x_0, y_0).
(iii) (Scaling law) For any a ∈ R, W_p(aX, aY) = |a| W_p(X, Y).
(iv) (Translation invariance) For any u ∈ R^n, W_p(X + u, Y + u) = W_p(X, Y).
(v) P_2(R) is a flat metric space under W_2, but for n ≥ 2 the sectional curvature of P_2(R^n) is non-negative under W_2.


Proof. See Section 2 in Panaretos and Zemel (2019). ∎
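The distributional properties above can be checked numerically in one dimension, where the empirical W_p between equal-size samples has a closed form via order statistics. The sketch below is illustrative: the distributions, sample size, and p = 2 are our own choices, not values from the paper.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Empirical p-Wasserstein distance between equal-size 1D samples.

    In one dimension the optimal coupling sorts both samples, so W_p
    reduces to the l_p distance between order statistics.
    """
    xs, ys = np.sort(x), np.sort(y)
    return (np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)
y = rng.normal(3.0, 1.0, 1000)

# (ii) Degenerate (Dirac) measures at 1 and 4: distance equals |1 - 4| = 3.
print(wasserstein_1d(np.full(1000, 1.0), np.full(1000, 4.0)))   # 3.0

# (iii) Scaling law: W_p(aX, aY) = |a| W_p(X, Y), here with a = 2.
print(wasserstein_1d(2 * x, 2 * y) / wasserstein_1d(x, y))      # ~2.0

# (iv) Translation invariance: a common shift u leaves W_p unchanged.
print(np.isclose(wasserstein_1d(x + 5.0, y + 5.0), wasserstein_1d(x, y)))
```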

Theorem 2.2.

(Topology generated by W_p) (i) If X is compact and p ≥ 1, then in the space P_p(X) we have W_p(μ_n, μ) → 0 iff μ_n → μ weakly.
(ii) If X = R^n, then W_p(μ_n, μ) → 0 iff μ_n → μ weakly and the p-th moments of μ_n converge to those of μ.


Proof. See the proofs of the corresponding theorems in Santambrogio (2015). ∎

The measures μ and ν are rarely known in practice. Instead, one has access to finite samples X = {x_1, …, x_n} and Y = {y_1, …, y_m}. We then construct discrete measures μ_n = Σ_i a_i δ_{x_i} and ν_m = Σ_j b_j δ_{y_j}, where a and b are vectors in the probability simplex, and the pairwise costs can be compactly represented as an n × m matrix C, i.e., C_ij = d(x_i, y_j)^p, where d is the metric of the underlying space X. Since the marginals here are fixed to be the laws of X and Y, the problem is to find a copula Sklar (1959) that couples X and Y together as "tightly" as possible in an L^p-sense, on average; if p = 2, then the copula one seeks is the one that maximizes the correlation (or covariance) between X and Y, i.e., the copula inducing maximal linear dependence. Solving the above problem scales cubically in the sample sizes and is extremely difficult in practice. Adding an entropic regularization leads to a problem that can be solved much more efficiently Altschuler et al. (2018); Cuturi (2013); Peyré and Cuturi (2020). In this article, we use the Sinkhorn distance and its computation, as in Cuturi (2013). For more details about the entropic regularization, please see Appendix C. The Sinkhorn distance, however, is not a true metric Cuturi (2013), as it fails to satisfy the identity of indiscernibles. Moreover, the Sinkhorn distance requires discretizing the space, which alters the metric. The goal of this paper is to see how well a neural network, trained using the Sinkhorn distance, can capture the topology of P_p(X) under W_p.
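To make the difficulty of the unregularized problem concrete: with uniform weights and equal sample sizes the optimal coupling is a permutation, so a tiny instance can be solved exactly by enumerating all n! assignments. The sketch below is ours, not the paper's pipeline; exact solvers scale roughly cubically, and naive enumeration is factorial, which is why such approaches are impractical at scale.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

n, p = 6, 2
x = rng.normal(0.0, 1.0, (n, 2))   # samples from one measure
y = rng.normal(2.0, 1.0, (n, 2))   # samples from another measure

# Cost matrix C_ij = ||x_i - y_j||^p of the underlying Euclidean metric.
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p

# Brute force over all n! permutations: exact optimal transport cost.
best = min(np.mean([C[i, s[i]] for i in range(n)])
           for s in itertools.permutations(range(n)))
print(best ** (1.0 / p))   # exact empirical W_2 between the two point clouds
```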

Figure 1: (A) Our distributional encoder. (B) Low-dimensional embedding of encoded distributions.

3 Neural Networks to understand P_p(X)

We draw random samples with replacement of a fixed size from various distributions in P_p(R^n). For technical reasons (described in Appendix D), we only use continuous distributions during the training process. We use the DeepSets architecture Zaheer et al. (2017) to encode this set of elements, as we want an encoding that is invariant to permutations of the samples. If μ and ν are two such distributions (μ = ν is allowed, but the two sample sets are drawn independently) and we denote the set of samples drawn from μ as S_μ (similarly S_ν), we train the encoder φ such that,

‖φ(S_μ) − φ(S_ν)‖_2 ≈ S(μ, ν),

where S(μ, ν) is the Sinkhorn distance between the corresponding empirical measures. Thus, the loss function becomes the mean squared error,

L = (1/K) Σ_{(i,j)} ( ‖φ(S_{μ_i}) − φ(S_{ν_j})‖_2 − S(μ_i, ν_j) )^2,

where the sum runs over the K pairs of sets picked at random from the mini-batch to compare distances. Our work can be thought of as next-generation functional data analysis Wang et al. (2015) (Section 6). More details about the network architecture can be found in Appendix B.
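The structural point — that sum pooling makes the encoding independent of sample order — can be sketched in a few lines of NumPy. The weights below are random stand-ins (the trained parameters, hidden sizes, and output dimension here are our illustrative choices, not those of the paper), but permutation invariance holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal DeepSets encoder phi(S) = rho(sum_i h(x_i)) for 1D sample sets.
W1 = rng.normal(size=(1, 32))   # h: per-sample feature map
W2 = rng.normal(size=(32, 2))   # rho: pooled features -> 2D embedding

def encode(samples):
    h = np.tanh(samples[:, None] @ W1)   # (n, 32) per-element features
    pooled = h.sum(axis=0)               # sum pooling: order cannot matter
    return np.tanh(pooled @ W2)          # (2,) embedding of the whole set

s = rng.normal(0.0, 1.0, 300)
# Shuffling the samples leaves the embedding unchanged.
print(np.allclose(encode(s), encode(rng.permutation(s))))
```

In training, the Euclidean distance between two such embeddings is regressed onto the Sinkhorn distance between the corresponding sample sets.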

3.1 Regularizers for ensuring better properties

If S_μ = {x_1, …, x_N} is a set of samples from μ, then S_μ + u = {x_1 + u, …, x_N + u} is the set of samples after translation by a vector u (this similarly applies to S_ν and S_ν + u). To ensure the translation properties of W_p are reflected in our computed Euclidean distance, we demand that,

  1. ‖φ(S_μ + u) − φ(S_ν + u)‖_2 = ‖φ(S_μ) − φ(S_ν)‖_2.

These constraints are added as penalty terms to the loss function.
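One way such a constraint can enter training is as a squared penalty on the change in encoded distance under a common translation. The sketch below is our own illustration with a toy mean/std "encoder" (hypothetical, not the paper's network), for which the penalty vanishes identically.

```python
import numpy as np

def translation_penalty(encode, S, T, u):
    """Penalty encouraging ||phi(S+u) - phi(T+u)|| = ||phi(S) - phi(T)||."""
    d_before = np.linalg.norm(encode(S) - encode(T))
    d_after = np.linalg.norm(encode(S + u) - encode(T + u))
    return (d_after - d_before) ** 2

# Toy encoder keeping only mean and standard deviation: a common shift u
# cancels in the difference of encodings, so the penalty is (near) zero.
toy_encode = lambda s: np.array([s.mean(), s.std()])

rng = np.random.default_rng(0)
S, T, u = rng.normal(0, 1, 100), rng.normal(3, 2, 100), 5.0
print(translation_penalty(toy_encode, S, T, u))   # ~0.0
```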

4 Experiments

In this section we describe our toy examples and show the discriminative behavior of our neural networks and the interesting properties of the space they can uncover. Our datasets are the following: (1) random samples of a fixed size drawn independently many times from uniform, Normal, Beta, Gamma, Exponential, Laplace, and Log Normal distributions with varying parameters; (2) random samples of a fixed size drawn independently many times from 2D Normal distributions with various parameters. Fig 1B shows the embedding of our datasets by our model. In Fig 2, we show how well the neural network approximates the Sinkhorn distances from samples drawn from our test densities. All the results shown here are with the W_2 metric. Plots showing how well our network respects the scaling law ((iii) in Theorem 2.1) and our results with the W_1 metric are shown in Appendix A; some of our results with the W_1 metric are slightly weaker than the ones with the W_2 metric. This may be due to the following reasons: first, P(R) under W_1 is no longer flat, and the Sinkhorn distance changes the metric differently than it changes the space under W_2; secondly, since our target is a Euclidean space, which is flat, we lose more structural information when mapping from the W_1-Wasserstein space.

Figure 2: (A-C) Pearson's r correlation coefficient for association between embedded and Sinkhorn distances (color code in Appendix A). (D) Correlation after translations. (E) Samples from a multivariate normal distribution translated around a circular path.

Generalizing to out-of-sample densities: We also show that our model can generalize well to densities outside our training set. These densities are primarily constructed from the training densities by changing the parameters (Fig 2B,C). Even more interestingly, we found that our model can correctly measure the distance between Dirac measures and the distance between Binomial densities, even though they are not part of the training densities.
Translating samples: Given samples S_μ and S_ν, we can translate them by random vectors to create new samples which, under the translation property in Theorem 2.1, would form a parallelogram with the originals. Fig 2D shows the exact relationship between the distances of the encoded translated samples and the encoded samples. Furthermore, we took samples from a Normal distribution and translated them around a circular path, i.e., created new samples by adding translation vectors lying on a circle, and we found that the encoded translated samples also formed a circular pattern around the original encoded sample. Thus our simple examples show that our metric preserves the translation invariance property and some geometry of the space (Figure 2E).
Learning statistical properties of the measures: Surprisingly, for encoded 1D distributions, we found a strong correlation between the means (resp. variances) of the distributions and the x-coordinates (resp. y-coordinates) of the encoded points (Fig 3A,B). That explains why the encoded Dirac distribution at a point and the encoded Normal distribution with the same mean and a small standard deviation have x-coordinates close to each other. An open question and an interesting direction for future work is to understand whether we can capture higher moments as we increase the output dimension.
Respecting the topology of the space: We know that the Dirac delta measure is the limit of Gaussian measures under weak convergence of measures. Choosing samples drawn from Normal distributions with shrinking standard deviation, we see that our encoded points converge to the point encoded by the Dirac measure (Fig 3C). This gives empirical evidence that our neural network may be continuous with respect to the Wasserstein metric.
Wasserstein barycenters: Given two densities μ and ν, if ρ is their Wasserstein barycenter Anderes et al. (2015); Agueh and Carlier (2011); Zemel and Panaretos (2017); Karcher (1977, 2014), our aim is to show that the encoding of ρ can be approximated by the midpoint of the line joining the encodings of μ and ν. Fig 3D shows the following examples of this claim: (1) samples drawn from two Normal distributions; (2) two Dirac measures; (3) two uniform distributions on different intervals.

Figure 3: Person’s r comparing embedding axes to means (A) and standard deviations (B). (C) Convergence of samples from Normal distributions with various standard deviations to the Dirac distribution encoding. (D) Barycenters of distributions (left) and midpoints drawn between lines connecting the encoded samples (right).

We also note that none of the measures used above are in the set of our training measures. Finally, observe that the figure also shows the correlation between the x-coordinates and the means of the chosen measures.

5 Conclusion and Future Work

In this work, we showed that, by approximating the Wasserstein distance with the Sinkhorn distance, we learned a metric that obeys translation invariance and also generalizes to some unseen measures. For 1D measures, we found a strong correlation between the x-coordinates (resp. y-coordinates) of the encoded vectors and the means (resp. variances) of the samples. We are excited by these toy results, and we would like to prove continuity properties of our neural network, investigate whether it can learn higher moments as we increase the output dimension, and quantify the distortion of the original Wasserstein metric by our embedding.


The first author wants to thank Alex Cloninger for helpful suggestions and for suggesting the study of the geometry of the Wasserstein space via simple translations and scalings.


  • [1] M. Agueh and G. Carlier (2011) Barycenters in the Wasserstein Space. SIAM Journal on Mathematical Analysis 43 (2), pp. 904–924. External Links: Document, Link, Cited by: §4.
  • [2] J. Altschuler, J. Weed, and P. Rigollet (2018) Near-linear time approximation algorithms for Optimal Transport via Sinkhorn iteration. External Links: 1705.09634 Cited by: Appendix C, §1, §2.
  • [3] L. Ambrosio, N. Gigli, and G. Savare (2005-01) Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhauser. External Links: Document Cited by: §1.
  • [4] E. Anderes, S. Borgwardt, and J. Miller (2015) Discrete Wasserstein Barycenters: Optimal Transport for Discrete Data. External Links: 1507.07218 Cited by: §4.
  • [5] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein Gan. External Links: 1701.07875 Cited by: §1.
  • [6] J. Bigot, R. Gouet, T. Klein, and A. López (2013) Geodesic PCA in the Wasserstein space. External Links: 1307.7721 Cited by: §1.
  • [7] S. Bobkov and M. Ledoux (2019) One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Memoirs of the American Mathematical Society 261, pp. 0–0. Cited by: §1.
  • [8] N. Bonneel, G. Peyré, and M. Cuturi (2016-04) Wasserstein Barycentric Coordinates: Histogram Regression Using Optimal Transport. ACM Transactions on Graphics 35 (4), pp. 71:1–71:10. External Links: Link, Document
  • [9] G. D. Canas and L. Rosasco (2012) Learning Probability Measures with respect to Optimal Transport Metrics. External Links: 1209.1077 Cited by: §1.
  • [10] D. Clevert, T. Unterthiner, and S. Hochreiter (2016) Fast and Accurate Deep Network Learning by Exponential Linear Units (elus). External Links: 1511.07289 Cited by: Appendix B.
  • [11] N. Courty, R. Flamary, and M. Ducoffe (2017) Learning Wasserstein Embeddings. External Links: 1710.07457 Cited by: §1.
  • [12] M. Cuturi and A. Doucet (2014) Fast computation of Wasserstein Barycenters. External Links: 1310.4375 Cited by: Appendix E.
  • [13] M. Cuturi (2013) Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. External Links: 1306.0895 Cited by: Appendix C, Appendix D, §1, §2.
  • [14] F. de Goes, K. Breeden, V. Ostromoukhov, and M. Desbrun (2012-11) Blue Noise through Optimal Transport. ACM Trans. Graph. 31 (6). External Links: ISSN 0730-0301, Link, Document
  • [15] E. del Barrio, E. Giné, and C. Matrán (1999-04) Central limit theorems for the wasserstein distance between the empirical and the true distributions. Ann. Probab. 27 (2), pp. 1009–1071. External Links: Document, Link Cited by: §1.
  • [16] J. Feydy, T. Séjourné, F. Vialard, S. Amari, A. Trouvé, and G. Peyré (2018) Interpolating between Optimal Transport and mmd using Sinkhorn Divergences. External Links: 1810.08278
  • [17] C. Frogner, F. Mirzazadeh, and J. Solomon (2019) Learning Embeddings into Entropic Wasserstein Spaces. External Links: 1905.03329 Cited by: §1.
  • [18] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. Poggio (2015) Learning with a Wasserstein Loss. External Links: 1506.05439
  • [19] A. Genevay Entropy-regularized optimal transport for machine learning. Note: PhD Thesis
  • [20] C. R. Givens and R. M. Shortt (1984) A class of wasserstein metrics for probability distributions.. Michigan Math. J. 31 (2), pp. 231–240. External Links: Document, Link Cited by: §1.
  • [21] L. Kantorovich (2006-03) On the Translocation of Masses. Journal of Mathematical Sciences 133, pp. . External Links: Document Cited by: Appendix C.
  • [22] H. Karcher (1977) Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics 30 (5), pp. 509–541. External Links: Document, Link, Cited by: §4.
  • [23] H. Karcher (2014) Riemannian Center of Mass and so called karcher mean. External Links: 1407.2087 Cited by: §4.
  • [24] B. R. Kloeckner (2014-05) A geometric study of wasserstein spaces: ultrametrics. Mathematika 61 (1), pp. 162–178. External Links: ISSN 2041-7942, Link, Document
  • [25] A. Mallasto and A. Feragen (2017) Learning from Uncertain Curves: The 2-Wasserstein Metric for Gaussian Processes. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp. 5665–5674. External Links: ISBN 9781510860964 Cited by: §1.
  • [26] S. Neumayer and G. Steidl (2020) From Optimal Transport to Discrepancy. External Links: 2002.01189
  • [27] V. M. Panaretos and Y. Zemel (2019) Statistical Aspects of Wasserstein Distances. Annual Review of Statistics and Its Application 6 (1), pp. 405–431. External Links: Document, Link, Cited by: §2, §2.
  • [28] V. Panaretos and Y. Zemel (2020-01) An Invitation to Statistics in Wasserstein Space. Springer Briefs in Probability and Mathematical Statistics. External Links: ISBN 978-3-030-38437-1, Document
  • [29] G. Peyré and M. Cuturi (2020) Computational Optimal Transport. External Links: 1803.00567 Cited by: Appendix C, §1, §2.
  • [30] F. Santambrogio (2015) Wasserstein distances and curves in the wasserstein spaces. In Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling, pp. 177–218. External Links: ISBN 978-3-319-20828-2, Document, Link Cited by: §2, §2.
  • [31] S. Shirdhonkar and D. W. Jacobs (2008) Approximate earth mover's distance in linear time. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  • [32] M. Sklar (1959) Fonctions de repartition a n dimensions et leurs marges. Publications de l’Institut Statistique de l’Université de Paris, pp. 229–231. Cited by: §2.
  • [33] A. Takatsu (2011-12) Wasserstein geometry of gaussian measures. Osaka J. Math. 48 (4), pp. 1005–1026. External Links: Link Cited by: §1.
  • [34] C. Villani (2003) Topics in Optimal Transportation. Am. Math. Soc. Cited by: §1.
  • [35] C. Villani (2008) Optimal Transport: Old and New. Berlin: Springer. Cited by: §1.
  • [36] J. Wang, J. Chiou, and H. Mueller (2015) Review of Functional Data Analysis. External Links: 1507.05135 Cited by: §3.
  • [37] L. Wasserman (2016) Topological Data Analysis. External Links: 1609.08227
  • [38] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017) Deep Sets. External Links: 1703.06114 Cited by: Appendix B, §3.
  • [39] Y. Zemel and V. M. Panaretos (2017) Fréchet Means and Procrustes Analysis in Wasserstein Space. External Links: 1701.06876 Cited by: §4.

Appendix A Additional Figures

Figure 4: Pearson’s r correlation coefficient for association between embedded and Sinkhorn distances

In this section we show some additional figures: the correlation plots colored by the densities (Fig 2 in the main text). We also show our results for the same experiments discussed earlier with the W_1 metric (Fig 5). We found a strong correlation between the variances of the densities and the y-coordinates of the encoded points. However, unlike in the W_2 case, we found no such relation between the means and the x-coordinates. Finally, we show in Figure 6 how our network has learnt to respect the scaling law ((iii) in Theorem 2.1).

Figure 5: Experiments for the network trained to measure W_1. (A-D) Correlations between Sinkhorn and embedded distances, for 1D distributions (A, B), under translations (C), and for 2D Normal distributions (D). (E-F) Interpretation of embedding axes showing Pearson's r correlation between standard deviation and the y-axis (E), and convergence of samples from 1D Normal distributions with various standard deviations to the encoded sample of the Dirac distribution (F). (G) Samples of a 2D Normal distribution translated around a circle, with the black dot representing the un-translated embedding. (H) Barycenters of distributions (left) and midpoints drawn between lines connecting the encoded samples (right) after training the network on W_1 distances.
Figure 6: Correlation after scaling, empirically validating property (iii) in Theorem 2.1. Left: network trained with the W_2 metric; Right: network trained with the W_1 metric.

Appendix B Details about Network Architecture

We use the DeepSets architecture [38], which consists of two blocks of linear layers (with non-linearities between them). The first block consists of linear layers with Elu non-linearities [10]. We sum over the outputs of this block, as in [38], to ensure permutation invariance before passing the result to the second block, which consists of linear layers with a low-dimensional output layer; an Elu activation follows each hidden layer. We trained the model with the Adam optimizer.

Appendix C Optimal Transport

The optimization problem defining the distance (1) is popularly known as the optimal transport or Monge–Kantorovich problem. The Kantorovich formulation [21] of the transportation problem is:

inf_{π ∈ Π(μ, ν)} ∫_{X × X} c(x, y) dπ(x, y),

where c : X × X → [0, ∞) is a cost function and the set of couplings Π(μ, ν) consists of joint probability distributions over the product space X × X with marginals μ and ν,

Π(μ, ν) = { π ∈ P(X × X) : (p_1)#π = μ, (p_2)#π = ν },

where p_i is the projection map from X × X to the i-th factor of X and (p_i)#π is the pushforward of the measure π onto X. The cost function generally reflects the metric of the space X, and in our case is just c(x, y) = d(x, y)^p for some p ≥ 1. However, as noted in the main text, solving the above problem scales cubically in the sample sizes and is extremely difficult in practice. Adding an entropic regularization leads to a problem that can be solved much more efficiently [13, 2, 29]. For the convenience of the reader, let us recall the entropic regularization as in [13]. We first construct discrete measures μ_n = Σ_i a_i δ_{x_i} and ν_m = Σ_j b_j δ_{y_j}, where a and b are vectors in the probability simplex, and let C be the cost matrix given by C_ij = d(x_i, y_j)^p; then the optimization problem can be succinctly written as

min_{P ∈ U(a, b)} ⟨P, C⟩,

where U(a, b) = { P ∈ R_+^{n × m} : P 1_m = a, P^T 1_n = b }.
The entropy regularized version of this problem reads:

min_{P ∈ U(a, b)} ⟨P, C⟩ − ε H(P),    where H(P) = −Σ_{ij} P_ij (log P_ij − 1).

Due to the strong convexity introduced by the regularizer, the above problem has a unique solution and can be efficiently solved by the Sinkhorn algorithm, which is what we use in our work.
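The Sinkhorn iterations for the regularized problem can be sketched in a few lines of NumPy. This is an illustrative implementation following the matrix-scaling scheme of [13]; the sample sizes, regularization strength eps, and iteration count are our own choices, not the values used in the experiments.

```python
import numpy as np

def sinkhorn_distance(x, y, p=2, eps=0.1, n_iter=300):
    """Entropy-regularized OT cost between two empirical measures.

    x, y : (n, d) and (m, d) sample arrays; uniform weights are assumed.
    eps  : entropic regularization strength (illustrative choice).
    """
    n, m = len(x), len(y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    # Cost matrix C_ij = ||x_i - y_j||^p of the underlying metric.
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** p
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(n_iter):            # Sinkhorn matrix-scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]    # regularized optimal coupling
    return np.sum(P * C) ** (1.0 / p)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, (200, 1))
y = rng.normal(2.0, 1.0, (200, 1))
print(sinkhorn_distance(x, y))   # roughly the true W_2 = 2 for these measures
```

For larger cost ranges or smaller eps, a log-domain implementation is preferable to avoid underflow in K.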

Appendix D Some technical considerations

Note that the Sinkhorn distance is not a true metric, as it does not satisfy the identity of indiscernibles for all sets of samples [13]. However, it is symmetric and satisfies the triangle inequality. We circumvent this issue by only using continuous measures during training. This ensures that any two sets of samples drawn from a given measure are distinct with probability 1. Thus, during training we never encounter the same set twice, so the degenerate case never arises, and we end up learning a metric space where the distances between different samples are approximately equal to the Sinkhorn distance.

Appendix E Wasserstein Barycenters

Given measures μ_1, …, μ_N, we define the Wasserstein barycenter as the minimizer of the functional

F(ρ) = Σ_i λ_i W_p^p(μ_i, ρ),

where λ_i are some fixed weights with Σ_i λ_i = 1. For simplicity, we will take the weights to be equal. We use the algorithm in [12], as well as the Geomloss library, to compute the barycenters.
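In one dimension the W_2 barycenter also has a closed form via quantile averaging, which gives a cheap sanity check on computed barycenters. The sketch below is ours (not the algorithm of [12]); the Normal parameters and equal-size samples are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# With equal-size sorted samples, the 1D W_2 barycenter's samples are the
# pointwise averages of the order statistics (averaged quantile functions).
x = np.sort(rng.normal(-2.0, 0.5, 1000))   # samples from N(-2, 0.5^2)
y = np.sort(rng.normal(4.0, 1.5, 1000))    # samples from N( 4, 1.5^2)
bary = 0.5 * (x + y)                        # equal weights

# For Gaussians the W_2 barycenter is Gaussian with averaged mean and std,
# so here the barycenter should be close to N(1, 1).
print(bary.mean(), bary.std())
```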
Our experiments with the barycenters suggest a natural way to embed measures in our low-dimensional encoded space: take random samples of a fixed size, repeat this process several times, pass each sample set through the encoder to produce a cloud of points, and take the centroid of these points as the representation of the measure.

Appendix F Sampling sizes

The size of the sample plays a crucial role here. What is the right sample size to pick? If the sample size is large, our method works well, but picking a large sample size is computationally very expensive. We found that a sufficiently large sample size yields good results, while small sample sizes yield inconsistent results (the variance is high).
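The effect can be checked cheaply in one dimension, where the empirical W_2 has a closed form via sorted samples. The sizes 30 and 500 below are illustrative stand-ins for "small" and "large", not the exact sizes used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def w2_1d(x, y):
    # 1D empirical W_2: the optimal coupling matches order statistics.
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))

# Spread of the distance estimate over repeated draws from N(0,1) and N(2,1);
# the larger sample size gives a visibly more stable (lower-variance) estimate.
stds = {}
for n in (30, 500):
    d = [w2_1d(rng.normal(0, 1, n), rng.normal(2, 1, n)) for _ in range(200)]
    stds[n] = float(np.std(d))
    print(n, stds[n])
```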