Using set encoders to encode Wasserstein Spaces
Understanding the space of probability measures on a metric space equipped with a Wasserstein distance is one of the fundamental questions in mathematical analysis. The Wasserstein metric has received a lot of attention in the machine learning community, especially for its principled way of comparing distributions. In this work, we use a permutation invariant network to map samples from probability measures into a low-dimensional space such that the Euclidean distance between the encoded samples reflects the Wasserstein distance between the underlying probability measures. We show that our network can generalize to correctly compute distances between unseen densities. We also show that these networks can learn the first and the second moments of probability distributions.
The Wasserstein distance is a distance function between probability measures on a metric space $\Omega$. It is a natural way to compare the probability distributions of two variables $X$ and $Y$, where one variable is derived from the other by small, non-uniform perturbations, while strongly reflecting the metric of the underlying space $\Omega$. It can also be used to compare discrete distributions. The Wasserstein distance enjoys a number of useful properties, which likely contribute to its widespread interest amongst mathematicians and computer scientists Bobkov and Ledoux (2019); Ambrosio et al. (2005); Bigot et al. (2013); Canas and Rosasco (2012); del Barrio et al. (1999); Givens and Shortt (1984); Villani (2003, 2008); Arjovsky et al. (2017). However, despite its broad use, the Wasserstein distance has several problems. First, it is computationally expensive to calculate. Second, it is not Hadamard differentiable, which can present serious challenges when trying to use it in machine learning. Third, it is not robust. To alleviate these problems, one can use entropic regularization to compute an approximation of the Wasserstein distance. Such an approach is more tractable and also enjoys several nice properties Altschuler et al. (2018); Cuturi (2013); Peyré and Cuturi (2020).
In this short article, we are interested in learning about the Wasserstein space of order $p$, i.e., the infinite dimensional space of all probability measures with finite moments up to order $p$ on a complete and separable metric space $\Omega$. More specifically, we ask: (1) Can we propose a neural network that correctly computes the Wasserstein distance between measures, even if both of them are not in our training examples? (2) What properties of the measures does such a network learn? For example, does it learn something about the moments of these distributions? (3) What properties of the original Wasserstein space can we preserve in our encoded space?
There has been a lot of work in understanding the space of Gaussian processes Mallasto and Feragen (2017); Takatsu (2011), but our work is more similar to efforts that attempt to understand Wasserstein spaces with neural networks Courty et al. (2017); Frogner et al. (2019). However, unlike Courty et al. (2017), we use a permutation invariant network to compare and contrast various densities. Furthermore, unlike Frogner et al. (2019), we try to approximate the Wasserstein space by learning a mapping from the space to a low-dimensional Euclidean space.
Let $\Omega$ be a complete and separable metric space. For simplicity, we take $\Omega$ to be $\mathbb{R}^n$ or a closed and bounded subset of $\mathbb{R}^n$. Let $\mathcal{P}(\Omega)$ be the space of all probability measures on $\Omega$. One can endow this space with a family of metrics called the Wasserstein metrics $W_p$, $p \geq 1$ (defined in Appendix C).
We use the notations $W_p(\mu, \nu)$ and $W_p(X, Y)$ interchangeably whenever $X \sim \mu$ and $Y \sim \nu$. We also assume that the $p$-th moment of $\mu$ (and of $\nu$) is finite, so that $W_p(\mu, \nu)$ is finite. Most of the following properties regarding the space $\mathcal{P}_p(\Omega)$ and $W_p$ are well known, but we summarize them for the convenience of the reader Santambrogio (2015); Panaretos and Zemel (2019).
(i) $\mathcal{P}_p(\Omega)$ equipped with $W_p$ is a complete and separable metric space.
(ii) If $\mu$ and $\nu$ are degenerate (Dirac) at $x_1$ and $x_2$ respectively, then $W_p(\mu, \nu) = d(x_1, x_2)$.
(iii) (Scaling law) For any $a \in \mathbb{R}$, $W_p(aX, aY) = |a| \, W_p(X, Y)$.
(iv) (Translation invariance) For any $v \in \mathbb{R}^n$, $W_p(X + v, Y + v) = W_p(X, Y)$.
(v) $\mathcal{P}_p(\Omega)$ is a flat metric space under $W_1$, but the sectional curvature is non-negative under $W_2$.
See Section 2 in Panaretos and Zemel (2019). ∎
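As a concrete check of properties (iii) and (iv), it helps to recall the closed form of $W_2$ between Gaussian measures (see Givens and Shortt (1984)); this is a standard identity we include here only for illustration:

```latex
% Closed form of W_2 between Gaussians (Givens and Shortt, 1984):
W_2^2\!\left(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\right)
  = \lVert m_1 - m_2 \rVert^2
  + \operatorname{tr}\!\left(\Sigma_1 + \Sigma_2
      - 2\left(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\right)^{1/2}\right).
% Translating both measures by v shifts m_1 and m_2 equally and leaves
% W_2 unchanged (property (iv)); scaling both by a multiplies the means
% by a and the covariances by a^2, so W_2 scales by |a| (property (iii)).
```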
(Topology generated by $W_p$) (i) If $\Omega$ is compact and $\mu_n, \mu \in \mathcal{P}_p(\Omega)$, then in the space $\mathcal{P}_p(\Omega)$ we have $W_p(\mu_n, \mu) \to 0$ iff $\mu_n \rightharpoonup \mu$ weakly.
(ii) If $\Omega = \mathbb{R}^n$, then $W_p(\mu_n, \mu) \to 0$ iff $\mu_n \rightharpoonup \mu$ weakly and $\int |x|^p \, d\mu_n \to \int |x|^p \, d\mu$.
See the proofs of the corresponding theorems in Santambrogio (2015). ∎
The measures $\mu$ and $\nu$ are rarely known in practice. Instead, one has access to finite samples $\{x_i\}_{i=1}^n \sim \mu$ and $\{y_j\}_{j=1}^m \sim \nu$. We then construct discrete measures $\hat{\mu} = \sum_{i=1}^n a_i \delta_{x_i}$ and $\hat{\nu} = \sum_{j=1}^m b_j \delta_{y_j}$, where $a$ and $b$ are vectors in the probability simplex, and the pairwise costs can be compactly represented as an $n \times m$ matrix $M$, i.e., $M_{ij} = d(x_i, y_j)$, where $d$ is the metric of the underlying space $\Omega$. Since the marginals here are fixed to be the laws of $X$ and $Y$, the problem is to find a copula Sklar (1959) that couples $X$ and $Y$ together as "tightly" as possible in an $L^p$-sense, on average; if $p = 2$, then the copula one seeks is the one that maximizes the correlation (or covariance) between $X$ and $Y$, i.e., the copula inducing maximal linear dependence. Solving the above problem scales cubically in the sample sizes and is extremely difficult in practice. Adding an entropic regularization leads to a problem that can be solved much more efficiently Altschuler et al. (2018); Cuturi (2013); Peyré and Cuturi (2020). In this article, we use the Sinkhorn distance and its computation as in Cuturi (2013). For more details about the entropic regularization, please see Appendix C. The Sinkhorn distance, however, is not a true metric Cuturi (2013), as it fails to satisfy $S(\mu, \mu) = 0$. Moreover, the Sinkhorn distance requires discretizing the space, which alters the metric. The goal of this paper is to see how well a neural network, trained using the Sinkhorn distance, can capture the topology of $\mathcal{P}_p(\Omega)$ under $W_p$.
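As an illustration of the Sinkhorn computation used throughout this paper, here is a minimal sketch using the Geomloss library (which we also use for barycenters in Appendix E); the sample sizes and the entropic blur value are placeholder choices, not the settings used in our experiments:

```python
# Minimal sketch: Sinkhorn approximation of W_1 between two sample sets.
# The blur (entropic length scale) and sample sizes are illustrative only.
import torch
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=1, blur=0.05)

x = torch.randn(500, 1)              # 500 samples from N(0, 1)
y = torch.randn(500, 1) * 2.0 + 1.0  # 500 samples from N(1, 4)

# Uniform weights on the samples are implied; the call returns a scalar
# approximation of W_1 between the two empirical measures.
print(sinkhorn(x, y).item())
```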
We draw random samples with replacement of size $n$ from various distributions in $\mathcal{P}_p(\Omega)$. For technical reasons (described in Appendix D), we only use continuous distributions during the training process. We use the DeepSets architecture Zaheer et al. (2017) to encode each set of samples, as we want an encoding that is invariant to permutations of the samples. If $X \sim \mu$ and $Y \sim \nu$ ($\mu = \nu$ is allowed, but $X$ and $Y$ are drawn independently), and we denote the set of samples drawn from $\mu$ as $S_X$ (similarly $S_Y$), we train the encoder $\phi$ such that
$$\| \phi(S_X) - \phi(S_Y) \|_2 \approx W_p(\mu, \nu).$$
Thus, the loss function becomes
$$\mathcal{L} = \sum_{(i,j)} \big( \| \phi(S_i) - \phi(S_j) \|_2 - \widehat{W}_p(S_i, S_j) \big)^2,$$
where $\widehat{W}_p$ denotes the Sinkhorn approximation of $W_p$ computed from the samples, $B$ is the size of the mini-batch, and we pick pairs of sets at random from the mini-batch to compare distances. Our work can be thought of as next-generation functional data analysis Wang et al. (2015) (Section 6). More details about the network architecture can be found in Appendix B.
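To make the objective concrete, here is a minimal PyTorch sketch of this distance-matching loss; the function name, the encoder interface, and the number of sampled pairs are illustrative assumptions rather than our exact implementation:

```python
import itertools
import random
import torch

def distance_matching_loss(encoder, sample_sets, sinkhorn, num_pairs=32):
    """Match Euclidean distances between encodings to Sinkhorn distances.

    sample_sets: list of (n, d) tensors, one set of samples per measure.
    sinkhorn:    callable returning the Sinkhorn distance of two sets.
    """
    codes = [encoder(s) for s in sample_sets]          # each code: (out_dim,)
    pairs = list(itertools.combinations(range(len(sample_sets)), 2))
    pairs = random.sample(pairs, min(num_pairs, len(pairs)))
    loss = 0.0
    for i, j in pairs:
        target = sinkhorn(sample_sets[i], sample_sets[j])  # approx. W_p
        pred = torch.norm(codes[i] - codes[j])             # Euclidean dist.
        loss = loss + (pred - target) ** 2
    return loss / len(pairs)
```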
If $S_X = \{x_1, \dots, x_n\}$, then $S_X + v = \{x_1 + v, \dots, x_n + v\}$ is the set of samples after translation by $v$ (and similarly for $S_Y$ and for scaling by $a$). To ensure the properties of Theorem 2.1 are reflected in our computed Euclidean distance, we demand that
$$\| \phi(S_X + v) - \phi(S_Y + v) \|_2 = \| \phi(S_X) - \phi(S_Y) \|_2, \qquad \| \phi(a S_X) - \phi(a S_Y) \|_2 = |a| \, \| \phi(S_X) - \phi(S_Y) \|_2.$$
These constraints are added to the loss function as penalty terms on the squared deviations from the identities above.
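One possible way to impose these constraints as penalty terms, sketched under the same assumptions as the loss above (the penalty form and the ranges of the random shift and scale are placeholder choices):

```python
import random
import torch

def invariance_penalties(encoder, s_x, s_y):
    """Penalize violations of translation invariance and the scaling law.

    s_x, s_y: (n, d) sample tensors drawn from two measures.
    """
    v = torch.randn(s_x.shape[1])    # random translation vector
    a = random.uniform(0.5, 2.5)     # random positive scale

    base = torch.norm(encoder(s_x) - encoder(s_y))
    shifted = torch.norm(encoder(s_x + v) - encoder(s_y + v))
    scaled = torch.norm(encoder(a * s_x) - encoder(a * s_y))

    # Deviations from W_p(X+v, Y+v) = W_p(X, Y) and W_p(aX, aY) = a W_p(X, Y).
    return (shifted - base) ** 2 + (scaled - a * base) ** 2
```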
In this section, we describe our toy examples and show the discriminative behavior of our neural network and the interesting properties of the space it can uncover. Our datasets are the following: (1) Random samples of size $n$ drawn independently many times from Uniform, Normal, Beta, Gamma, Exponential, Laplace, and Log-Normal distributions with varying parameters. (2) Random samples of size $n$ drawn independently many times from Normal distributions with various means and variances.
Fig 1B shows the embedding of our datasets by our model. In Fig 2, we show how well the neural network approximates the Sinkhorn distances on samples drawn from our test densities. All the results shown here are with the $W_1$ metric; plots showing how well our network respects the scaling law ((iii) in Theorem 2.1) and our results with the $W_2$ metric are shown in Appendix A. Some of our results with the $W_2$ metric are slightly weaker than the ones with $W_1$. This may be due to the following reasons: first, the space under $W_2$ is no longer flat, and the Sinkhorn distance changes the metric differently than it changes the space under $W_1$; secondly, since our target is a Euclidean space, which is flat, we lose more structural information when mapping from the 2-Wasserstein space.
Generalizing to out-of-sample densities: We also show that our model can generalize well to densities outside our training set. These densities are primarily constructed from the training families by changing their parameters (Fig 2B,C). Even more interestingly, we found that our model can correctly measure the distance between Dirac measures and the distance between Binomial densities, even though they are not part of the training densities.
Translating samples: Given samples $S_X$ and $S_Y$, we can translate them by a random vector $v$ to create new samples $S_X + v$ and $S_Y + v$, which, under property (iv) in Theorem 2.1, should form a parallelogram with the original samples in the encoded space. Fig 2D shows exactly this relationship between the distances of the encoded translated samples and the encoded samples. Furthermore, we took samples from a Normal distribution and translated them around a circle, i.e., created new samples via $S_X + (\cos\theta, \sin\theta)$, and we found that the encoded translated samples also formed a circular pattern around the original encoded sample. Thus, our simple examples show that our metric preserves the translation invariance property and some geometry of the space (Figure 2E).
Learning statistical properties of the measures: Surprisingly, for encoded Normal distributions, we found a strong correlation between the means (and variances) of the distributions and the x-coordinates (and y-coordinates) of the encoded points (Fig 3A,B). That explains why the encoded Dirac distribution at a point and the Normal distribution with that mean have x-coordinates close to each other. An open question, and interesting future work, will be to understand whether we can capture higher moments as we increase the output dimension.
We also note that none of the measures used above are in the set of our training measures. Finally, observe that the figure also shows the correlation between the x-coordinates and the means of the chosen measures.
In this work, we showed that, by approximating the Wasserstein distance with the Sinkhorn distance, we can learn a metric that obeys translation invariance and also generalizes to some unseen measures. For Normal measures, we found a strong correlation between the x-coordinates (resp. y-coordinates) of the encoded vectors and the means (resp. variances) of the samples. We are excited by these toy results; we would like to prove continuity properties of our neural network, investigate whether it can learn higher moments as we increase the output dimension, and finally quantify the distortion of the original Wasserstein metric by our embedding.
The first author wants to thank Alex Cloninger for helpful suggestions and for suggesting to study the geometry of the Wasserstein space by simple translations and scalings.
In this section we show some additional figures: the correlation plots colored by the densities (Fig 2 in the main text). We also show our results for the same experiments discussed earlier with the $W_2$ metric (Fig 5). We found a strong correlation between the variances of the densities and the y-coordinates of the encoded points. However, unlike in the $W_1$ case, we found no such relation between the means and the x-coordinates. Finally, we show in Figure 6 how our network has learnt to respect the scaling law ((iii) in Theorem 2.1).
We use the DeepSets architecture Zaheer et al. (2017), which consists of two blocks of linear layers (with non-linearities between them), $\phi$ and $\rho$. The block $\phi$ consists of linear layers with ELU non-linearities between them. We sum over the outputs of $\phi$, as in Zaheer et al. (2017), to ensure permutation invariance before passing the result to the next network $\rho$, which consists of linear layers with a low-dimensional output layer; an ELU activation is added to each hidden layer. We trained the model with the Adam optimizer.
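A minimal PyTorch sketch of this architecture; the hidden sizes and the 2-dimensional output are placeholders, since the exact values are not recoverable from the text:

```python
import torch
import torch.nn as nn

class DeepSetsEncoder(nn.Module):
    """Permutation-invariant encoder: rho(sum_i phi(x_i)).

    Hidden sizes and the 2-dimensional output are illustrative choices.
    """
    def __init__(self, in_dim=1, hidden=128, out_dim=2):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.rho = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):                 # x: (n_samples, in_dim)
        pooled = self.phi(x).sum(dim=0)   # sum-pool => permutation invariance
        return self.rho(pooled)           # encoded point: (out_dim,)

encoder = DeepSetsEncoder()
code = encoder(torch.randn(500, 1))       # encode a set of 500 scalar samples
```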
The Wasserstein distance of order $p$ is defined as
$$W_p^p(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} c(x, y)^p \, d\pi(x, y),$$
where $c$ is a cost function and the set of couplings $\Pi(\mu, \nu)$ consists of joint probability distributions over the product space $\Omega \times \Omega$ with marginals $\mu$ and $\nu$,
$$\Pi(\mu, \nu) = \{ \pi \in \mathcal{P}(\Omega \times \Omega) : (P_1)_{\#}\pi = \mu, \ (P_2)_{\#}\pi = \nu \},$$
where $P_i$ are the projection maps from $\Omega \times \Omega$ to the $i$-th factor and $(P_i)_{\#}\pi$ is the pushforward of the measure $\pi$ onto $\Omega$. The cost function generally reflects the metric of the underlying space and in our case is just $c(x, y) = d(x, y)$. However, as noted in the main text, solving the above problem scales cubically in the sample sizes and is extremely difficult in practice. Adding an entropic regularization leads to a problem that can be solved much more efficiently Altschuler et al. (2018); Cuturi (2013); Peyré and Cuturi (2020). For the convenience of the reader, let us recall the entropic regularization as in Cuturi (2013). We first construct discrete measures $\hat{\mu} = \sum_{i=1}^n a_i \delta_{x_i}$ and $\hat{\nu} = \sum_{j=1}^m b_j \delta_{y_j}$, where $a$ and $b$ are vectors in the probability simplex, and let $M$ be the cost matrix given by $M_{ij} = d(x_i, y_j)^p$; then the optimization problem can be succinctly written as
$$d_M(a, b) = \min_{P \in U(a, b)} \langle P, M \rangle, \qquad U(a, b) = \{ P \in \mathbb{R}_{+}^{n \times m} : P \mathbf{1}_m = a, \ P^{\top} \mathbf{1}_n = b \}.$$
The entropy-regularized version of this problem reads:
$$d_M^{\lambda}(a, b) = \min_{P \in U(a, b)} \langle P, M \rangle - \frac{1}{\lambda} h(P), \qquad h(P) = -\sum_{i,j} P_{ij} \log P_{ij}.$$
Due to the strong convexity introduced by the regularizer, the above problem has a unique solution and can be efficiently solved by the Sinkhorn algorithm. In our work, we use the Sinkhorn distance and its computation as in Cuturi (2013).
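For concreteness, here is a small NumPy sketch of the Sinkhorn iterations solving the regularized problem above, following Cuturi (2013); the regularization strength and iteration count are illustrative defaults:

```python
import numpy as np

def sinkhorn(a, b, M, lam=10.0, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations (Cuturi, 2013).

    a, b: probability vectors; M: cost matrix; lam: regularization strength.
    Returns the regularized transport cost <P, M>.
    """
    K = np.exp(-lam * M)             # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # scale columns to match marginal b
        u = a / (K @ v)              # scale rows to match marginal a
    P = u[:, None] * K * v[None, :]  # optimal regularized coupling
    return np.sum(P * M)
```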
Note that $d_M^{\lambda}$ is not a true metric, as it does not satisfy $d_M^{\lambda}(X, X) = 0$ for all sets of samples $X$. However, it is symmetric and satisfies the triangle inequality. We circumvent this issue by only using continuous measures during training. This ensures that any two sets of samples drawn from a given measure are distinct with probability $1$. Thus, during training we never encounter the same set twice, so a case where we must compare a set to itself never arises. We thus end up learning a metric space where the distances between different samples are approximately equal to the Sinkhorn distance.
Given measures $\mu_1, \dots, \mu_N$, we define the Wasserstein barycenter as the minimizer of the functional
$$F(\nu) = \sum_{i=1}^{N} w_i \, W_p^p(\nu, \mu_i),$$
where $w_1, \dots, w_N$ are some fixed weights with $\sum_i w_i = 1$. For simplicity, we take the weights to be uniform, $w_i = 1/N$. We use the Geomloss library to compute the barycenters.
Our experiments with the barycenters suggest a natural way to embed measures in our low-dimensional encoded space. Take random samples of size $n$ and repeat this process $k$ times. Our encoder takes in these sample sets and produces $k$ points. We can take the centroid of these points and use it as a representation of our measure.
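A sketch of this centroid construction, reusing the illustrative encoder interface from Appendix B; the sampler, sample size, and repeat count are placeholders:

```python
import torch

def embed_measure(encoder, sampler, sample_size=500, repeats=20):
    """Embed a measure as the centroid of several encoded sample sets.

    sampler(n) should return an (n, d) tensor of i.i.d. samples.
    """
    codes = torch.stack([encoder(sampler(sample_size)) for _ in range(repeats)])
    return codes.mean(dim=0)          # centroid in the encoded space

# Example: embed a standard Normal with the DeepSets encoder from Appendix B.
# point = embed_measure(encoder, lambda n: torch.randn(n, 1))
```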
The size of the sample plays a crucial role here. What is the right sample size to pick? If the sample size is large, our method works well, but picking a large sample size is computationally very expensive. We found that a sufficiently large sample size yields good results, while smaller sample sizes yield inconsistent results (the variance is high).