# Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance

Minimum expected distance estimation (MEDE) algorithms have been widely used for probabilistic models with intractable likelihood functions and they have become increasingly popular due to their use in implicit generative modeling (e.g. Wasserstein generative adversarial networks, Wasserstein autoencoders). Emerging from computational optimal transport, the Sliced-Wasserstein (SW) distance has become a popular choice in MEDE thanks to its simplicity and computational benefits. While several studies have reported empirical success on generative modeling with SW, the theoretical properties of such estimators have not yet been established. In this study, we investigate the asymptotic properties of estimators that are obtained by minimizing SW. We first show that convergence in SW implies weak convergence of probability measures in general Wasserstein spaces. Then we show that estimators obtained by minimizing SW (and also an approximate version of SW) are asymptotically consistent. We finally prove a central limit theorem, which characterizes the asymptotic distribution of the estimators and establishes a convergence rate of $\sqrt{n}$, where $n$ denotes the number of observed data points. We illustrate the validity of our theory on both synthetic data and neural networks.

## 1 Introduction

Minimum distance estimation (MDE) is a generalization of maximum-likelihood inference, where the goal is to minimize a distance between the empirical distribution of a set of independent and identically distributed (i.i.d.) observations $Y_{1:n} = (Y_1, \dots, Y_n)$ and a family of distributions indexed by a parameter $\theta$. The problem is formally defined as follows [1, 2]:

$$\hat{\theta}_n = \operatorname*{argmin}_{\theta \in \Theta} \mathbf{D}(\hat{\mu}_n, \mu_\theta), \tag{1}$$

where $\mathbf{D}$ denotes a distance (or a divergence in general) between probability measures, $\mu_\theta$ denotes a probability measure indexed by $\theta$, $\Theta$ denotes the parameter space, and

$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i} \tag{2}$$

denotes the empirical measure of $Y_{1:n}$, with $\delta_Y$ being the Dirac distribution with mass on the point $Y$. When $\mathbf{D}$ is chosen as the Kullback-Leibler divergence, this formulation coincides with maximum likelihood estimation (MLE) [2].

While MDE provides a fruitful framework for statistical inference, when working with generative models, solving the optimization problem in (1) might be intractable since it might be impossible to evaluate the probability density function associated with $\mu_\theta$.

Nevertheless, in various settings, even if the density is not available, one can still generate samples from the distribution $\mu_\theta$, and such samples turn out to be useful for inference. More precisely, under such settings, a natural alternative to (1) is the minimum expected distance estimator, which is defined as follows [3]:

$$\hat{\theta}_{n,m} = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{D}(\hat{\mu}_n, \hat{\mu}_{\theta,m}) \,\middle|\, Y_{1:n} \right]. \tag{3}$$

Here,

$$\hat{\mu}_{\theta,m} = \frac{1}{m} \sum_{i=1}^{m} \delta_{Z_i} \tag{4}$$

denotes the empirical distribution of $Z_{1:m}$, which is a sequence of i.i.d. random variables with distribution $\mu_\theta$.

This algorithmic framework has computationally favorable properties since one can replace the expectation with a simple Monte-Carlo average in practical applications.
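As a rough illustration of this Monte Carlo replacement, the sketch below approximates the objective in (3) for a toy one-dimensional Gaussian location model and minimizes it over a grid. The model $\mu_\theta = \mathcal{N}(\theta, 1)$, the grid search, the sorted-sample cost used as $\mathbf{D}$, and all function names are assumptions made for this example; none of them is taken from the paper.

```python
import numpy as np


def sorted_sample_distance(x, y):
    """Mean absolute gap between sorted samples of equal size (a 1-D transport cost)."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))


def expected_distance(theta, y_obs, m, n_datasets, rng):
    """Monte Carlo average of D(mu_hat_n, mu_hat_{theta,m}) over generated datasets."""
    values = []
    for _ in range(n_datasets):
        # Hypothetical sampler for the toy model mu_theta = N(theta, 1).
        z = theta + rng.standard_normal(m)
        values.append(sorted_sample_distance(y_obs, z))
    return float(np.mean(values))


def grid_mede(y_obs, grid, m, n_datasets, seed=0):
    """Return the grid point with the smallest Monte Carlo objective."""
    rng = np.random.default_rng(seed)
    objective = [expected_distance(t, y_obs, m, n_datasets, rng) for t in grid]
    return float(grid[int(np.argmin(objective))])


rng = np.random.default_rng(1)
y_obs = 2.0 + rng.standard_normal(500)  # observations from the toy model with theta = 2
grid = np.linspace(0.0, 4.0, 17)        # candidate parameters, spaced by 0.25
theta_hat = grid_mede(y_obs, grid, m=500, n_datasets=5)
print(theta_hat)
```

The sorted-sample cost requires generated datasets of the same size as the observed one, which is why `m` equals the number of observations here. A grid search is used only to keep the sketch short; any gradient-free optimizer could take its place.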

In the context of MDE, distances that are based on optimal transport (OT) have become increasingly popular due to their computational and theoretical properties [4, 5, 6, 7, 8]. For instance, if we replace the distance $\mathbf{D}$ in (3) with the Wasserstein distance (defined in Section 2 below), we obtain the minimum expected Wasserstein estimator [3]. In the classical statistical inference setting, the typical use of such an estimator is to infer the parameters of a measure whose density does not admit an analytical closed-form formula [2]. On the other hand, in the implicit generative modeling (IGM) setting, this estimator forms the basis of two popular IGM strategies: Wasserstein generative adversarial networks (GAN) [4] and Wasserstein variational auto-encoders (VAE) [5] (cf. [9] for their relation). The goal of these two methods is to find the best parametric transport map that transforms a simple distribution (e.g. standard Gaussian or uniform) into a potentially complicated data distribution, by minimizing the Wasserstein distance between the data distribution and the transported distribution, i.e. the push-forward of the simple distribution by the map (the push-forward operator is defined in the next section). In practice, the map is typically chosen as a neural network, for which it is often impossible to evaluate the induced density. However, one can easily generate samples from the transported distribution by first generating a sample from the simple distribution and then applying the map to that sample, which makes minimum expected distance estimation (3) feasible in this setting. Motivated by its practical success, the theoretical properties of this estimator have recently been investigated [10, 11], and Bernton et al. [3] have established the consistency (for the general setting) and the asymptotic distribution (for the one-dimensional setting) of this estimator.

Even though estimation with the Wasserstein distance has served as a fertile ground for many generative modeling applications, except for the case when the measures are supported on $\mathbb{R}$, the computational complexity of minimum Wasserstein estimators rapidly becomes excessive as the problem dimension increases, and developing accurate and efficient approximations is a highly non-trivial task. Therefore, there have been several attempts to use more practical alternatives to the Wasserstein distance [12, 6]. In this context, the Sliced-Wasserstein (SW) distance [13, 14, 15] has become an increasingly popular alternative: it is defined as an average of one-dimensional Wasserstein distances, which allows it to be computed efficiently.

While several studies have reported empirical success on generative modeling with SW [16, 17, 18, 19], the theoretical properties of such estimators have not yet been fully established. Bonnotte [14] proved that SW is a proper metric, and in compact domains SW is equivalent to the Wasserstein distance, hence convergence in SW implies weak convergence in compact domains. [14] also analyzed the gradient flows based on SW, which then served as a basis for a recently proposed IGM algorithm [18]. Finally, recent studies [16, 20] investigated the sample complexity of SW and established bounds for the SW distance between two measures and their empirical instantiations.

In this paper, we investigate the asymptotic properties of the estimators given in (1) and (3) when $\mathbf{D}$ is replaced with the SW distance. We first prove that convergence in SW implies weak convergence of probability measures defined on general domains, which generalizes the results given in [14]. Then, by using techniques similar to the ones given in [3], we show that the estimators defined by (1) and (3) are consistent, meaning that as the number of observations increases the estimates get closer to the data-generating parameters. We finally prove a central limit theorem (CLT) in the multidimensional setting, which characterizes the asymptotic distribution of these estimators and establishes a convergence rate of $\sqrt{n}$. The CLT that we prove is stronger than the one given in [3] in the sense that it is not restricted to the one-dimensional setting.

We support our theory with experiments that are conducted on both synthetic and real data. We first consider a more classical statistical inference setting, with a Gaussian model and a multidimensional $\alpha$-stable model whose density is not available in closed form. In both models, the experiments validate our consistency and CLT results. We further observe that, especially for high-dimensional problems, the estimators obtained by minimizing SW have significantly better computational properties than the ones obtained by minimizing the Wasserstein distance, as expected. In the IGM setting, we consider the neural network-based generative modeling algorithm proposed in [16] and show that our results also hold in the real data setting.

## 2 Preliminaries and Technical Background

We consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with associated expectation operator $\mathbb{E}$, on which all the random variables are defined. Let $(Y_k)_{k \geq 1}$ be a sequence of random variables associated with observations, where each observation takes value in $\mathsf{Y} \subset \mathbb{R}^d$. We assume that these observations are i.i.d. according to $\mu_\star \in \mathcal{P}(\mathsf{Y})$, where $\mathcal{P}(\mathsf{Y})$ stands for the set of probability measures on $\mathsf{Y}$.

A statistical model is a family of distributions on $\mathsf{Y}$ and is denoted by $\{\mu_\theta \in \mathcal{P}(\mathsf{Y}),\ \theta \in \Theta\}$, where $\Theta$ is the parametric space. In this paper, we focus on parameter inference for purely generative models: for all $\theta \in \Theta$, we can generate i.i.d. samples from $\mu_\theta$, but the associated likelihood is numerically intractable. In the sequel, $(Z_k)_{k \geq 1}$ denotes an i.i.d. sequence from $\mu_\theta$ with $\theta \in \Theta$, and for any $m \geq 1$, $\hat{\mu}_{\theta,m} = (1/m) \sum_{i=1}^{m} \delta_{Z_i}$ denotes the corresponding empirical distribution.

Throughout our study, we assume that the following conditions hold: (1) $\mathsf{Y}$, endowed with the Euclidean distance, is a Polish space, (2) $\Theta$, endowed with the distance $\rho_\Theta$, is a Polish space, (3) $\Theta$ is a $\sigma$-compact space, i.e. the union of countably many compact subspaces, and (4) parameters are identifiable, i.e. $\mu_\theta = \mu_{\theta'}$ implies $\theta = \theta'$. We endow $\mathcal{P}(\mathsf{Y})$ with the Lévy-Prokhorov distance, which metrizes the weak convergence by [21, Theorem 6.8] since $\mathsf{Y}$ is assumed to be a Polish space. Measurability on $\mathsf{Y}$ is understood with respect to its Borel $\sigma$-field.

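Wasserstein distance. For $p \geq 1$, we denote by $\mathcal{P}_p(\mathsf{Y})$ the set of probability measures on $\mathsf{Y}$ with finite $p$'th moment, i.e. the measures $\mu \in \mathcal{P}(\mathsf{Y})$ for which $\int_{\mathsf{Y}} \|y - y_0\|^p \, d\mu(y)$ is finite for some $y_0 \in \mathsf{Y}$. The Wasserstein distance of order $p$ between any $\mu, \nu \in \mathcal{P}_p(\mathsf{Y})$ is defined by [22],

$$\mathbf{W}_p^p(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \left\{ \int_{\mathsf{Y} \times \mathsf{Y}} \|x - y\|^p \, d\gamma(x, y) \right\}, \tag{5}$$

where $\Gamma(\mu, \nu)$ is the set of probability measures $\gamma$ on $\mathsf{Y} \times \mathsf{Y}$ satisfying $\gamma(A \times \mathsf{Y}) = \mu(A)$ and $\gamma(\mathsf{Y} \times A) = \nu(A)$ for any Borel set $A \subset \mathsf{Y}$. The space $\mathcal{P}_p(\mathsf{Y})$ endowed with the distance $\mathbf{W}_p$ is a Polish space by [22, Theorem 6.18] since $\mathsf{Y}$ is assumed to be Polish.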

The one-dimensional case is a favorable scenario for which computing the Wasserstein distance of order $p$ between $\mu, \nu \in \mathcal{P}_p(\mathbb{R})$ becomes relatively easy since it has a closed-form formula, given by [23, Theorem 3.1.2.(a)]:

$$\mathbf{W}_p^p(\mu, \nu) = \int_0^1 \left| F_\mu^{-1}(t) - F_\nu^{-1}(t) \right|^p dt = \int_{\mathbb{R}} \left| s - F_\nu^{-1}(F_\mu(s)) \right|^p d\mu(s), \tag{6}$$

where $F_\mu$ and $F_\nu$ denote the cumulative distribution functions (CDF) of $\mu$ and $\nu$ respectively, and $F_\mu^{-1}$ and $F_\nu^{-1}$ are the quantile functions of $\mu$ and $\nu$ respectively.

For empirical distributions, (6) is calculated by simply sorting the samples drawn from each distribution and computing the average cost between the sorted samples.
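For two samples of equal size $n$ with order statistics $x_{(1)} \leq \dots \leq x_{(n)}$ and $y_{(1)} \leq \dots \leq y_{(n)}$, formula (6) reduces to $\mathbf{W}_p^p = \frac{1}{n} \sum_{i=1}^{n} |x_{(i)} - y_{(i)}|^p$. The following minimal sketch implements this special case; the function name `wasserstein_1d` and the equal-size restriction are choices made for this example, not part of the paper.

```python
import numpy as np


def wasserstein_1d(x, y, p=1):
    """Order-p Wasserstein distance between two 1-D samples of equal size."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch only handles samples of equal size")
    # The i-th smallest point of x is matched with the i-th smallest point of y.
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))


a = [0.0, 2.0, 1.0]
b = [3.0, 1.0, 2.0]
# After sorting, every point of a is moved by exactly 1 to reach b.
print(wasserstein_1d(a, b, p=1))
print(wasserstein_1d(a, b, p=2))
```

Samples of different sizes would require comparing the two empirical quantile functions on a common grid, which is left out here to keep the example short.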

Sliced-Wasserstein distance. The analytical form of the Wasserstein distance for one-dimensional distributions is an attractive property that gives rise to an alternative metric referred to as the Sliced-Wasserstein (SW) distance [13, 15]. The idea behind SW is to first obtain a family of one-dimensional representations of a higher-dimensional probability distribution through linear projections, and then compute the average of the Wasserstein distance between these one-dimensional representations.

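More formally, let $\mathbb{S}^{d-1} = \{u \in \mathbb{R}^d : \|u\| = 1\}$ be the unit sphere of $\mathbb{R}^d$, and denote by $\langle \cdot, \cdot \rangle$ the Euclidean inner-product. For any $u \in \mathbb{S}^{d-1}$, we define $u^\star$, the linear form associated with $u$, for any $y \in \mathsf{Y}$ by $u^\star(y) = \langle u, y \rangle$. The Sliced-Wasserstein distance of order $p$ is defined for any $\mu, \nu \in \mathcal{P}_p(\mathsf{Y})$ as,

$$\mathbf{SW}_p^p(\mu, \nu) = \int_{\mathbb{S}^{d-1}} \mathbf{W}_p^p(u^\star_\sharp \mu, u^\star_\sharp \nu) \, d\sigma(u), \tag{7}$$

where $\sigma$ is the uniform distribution on $\mathbb{S}^{d-1}$ and for any measurable function $f : \mathsf{Y} \to \mathbb{R}$ and $\zeta \in \mathcal{P}(\mathsf{Y})$, $f_\sharp \zeta$ is the push-forward measure of $\zeta$ by $f$, i.e. for any Borel set $A \subset \mathbb{R}$, $f_\sharp \zeta(A) = \zeta(f^{-1}(A))$, where $f^{-1}(A) = \{y \in \mathsf{Y} : f(y) \in A\}$.

$\mathbf{SW}_p$ is a distance on $\mathcal{P}_p(\mathsf{Y})$ [14] and has important practical implications: in practice, the integration in (7) is approximated using a Monte-Carlo scheme that randomly draws a finite set of samples from $\sigma$ on $\mathbb{S}^{d-1}$ and replaces the integral with a finite-sample average. Therefore, the evaluation of the SW distance between $\mu, \nu \in \mathcal{P}_p(\mathsf{Y})$ has significantly lower computational requirements than the Wasserstein distance, since it consists in solving several one-dimensional optimal transport problems, which have closed-form solutions.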
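The Monte-Carlo scheme described above can be sketched as follows for two samples of equal size. The function name `sliced_wasserstein`, the default number of projections, and the use of normalized Gaussian vectors to draw directions are assumptions of this example rather than details taken from the paper.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=100, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two samples of shape (n, d)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch only handles samples of equal shape")
    rng = np.random.default_rng(seed)
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    u = rng.standard_normal((n_projections, x.shape[1]))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    # Column j holds the sorted one-dimensional projection of a sample along u[j].
    x_proj = np.sort(x @ u.T, axis=0)
    y_proj = np.sort(y @ u.T, axis=0)
    # Average the 1-D costs over sample points and over projections.
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))


rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3))
shift = np.array([1.0, 0.0, 0.0])
sw = sliced_wasserstein(x, x + shift, n_projections=5000, p=2)
print(sw, 1.0 / np.sqrt(3.0))
```

For a pure translation by a vector $v$, each projected cost equals $\langle u, v \rangle^2$ when $p = 2$, so the estimate should be close to $\|v\| / \sqrt{d}$; the two printed numbers give a quick sanity check of that property.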

## 3 Asymptotic Guarantees for Minimum Sliced-Wasserstein Estimators

We define the minimum Sliced-Wasserstein estimator (MSWE) of order $p$ as the estimator obtained by plugging $\mathbf{SW}_p$ in place of $\mathbf{D}$ in (1). Similarly, we define the minimum expected Sliced-Wasserstein estimator (MESWE) of order $p$ as the estimator obtained by plugging $\mathbf{SW}_p$ in place of $\mathbf{D}$ in (3). In the rest of the paper, MSWE and MESWE will be denoted by $\hat{\theta}_n$ and $\hat{\theta}_{n,m}$, respectively. We provide all the proofs in Appendix C.

### 3.1 Topology induced by the Sliced-Wasserstein distance

We begin this section with a useful result which we believe is interesting on its own and implies that the topology induced by $\mathbf{SW}_p$ on $\mathcal{P}_p(\mathbb{R}^d)$ is finer than the weak topology induced by the Lévy-Prokhorov metric.

###### Theorem 1 ($\mathbf{SW}_p$ metrizes the weak convergence in $\mathcal{P}_p(\mathbb{R}^d)$).

Let $p \in [1, +\infty)$. The convergence in $\mathbf{SW}_p$ implies the weak convergence in $\mathcal{P}(\mathbb{R}^d)$. In other words, if $(\mu_k)_{k \in \mathbb{N}}$ is a sequence of measures in $\mathcal{P}_p(\mathbb{R}^d)$ satisfying $\lim_{k \to +\infty} \mathbf{SW}_p(\mu_k, \mu) = 0$, with $\mu \in \mathcal{P}_p(\mathbb{R}^d)$, then $(\mu_k)_{k \in \mathbb{N}}$ converges weakly to $\mu$.

The property that convergence in $\mathbf{SW}_p$ implies weak convergence has already been proven in [14], but only for compact domains. While the implication of weak convergence is one of the most crucial requirements that a distance metric should satisfy, to the best of our knowledge, this implication has not been proved for general domains before. In [14], the main proof technique was based on showing that $\mathbf{SW}_p$ is equivalent to $\mathbf{W}_p$ in compact domains, whereas we follow a different path and use the Lévy characterization.

### 3.2 Existence and consistency of MSWE and MESWE

In our next set of results, we will show that both MSWE and MESWE are consistent, in the sense that, when the number of observations increases, the estimators will converge to a parameter that minimizes the ideal problem $\theta \mapsto \mathbf{SW}_p(\mu_\star, \mu_\theta)$. Before we make this argument more precise, let us first present the assumptions that will imply our results.

###### A 1.

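The map $\theta \mapsto \mu_\theta$ is continuous from $(\Theta, \rho_\Theta)$ to $\mathcal{P}(\mathsf{Y})$ endowed with the weak topology, i.e. for any sequence $(\theta_n)_{n \in \mathbb{N}}$ in $\Theta$ satisfying $\lim_{n \to +\infty} \rho_\Theta(\theta_n, \theta) = 0$, the sequence $(\mu_{\theta_n})_{n \in \mathbb{N}}$ converges weakly to $\mu_\theta$.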

###### A 2.

The data-generating process is such that , -almost surely.

###### A 3.

There exists , such that setting , the set is bounded.

These assumptions are mostly related to the identifiability of the statistical model and the regularity of the data generating process. They are arguably mild assumptions and have already been considered in the literature [3]. In the next result, we establish the consistency of MSWE.

###### Theorem 2 (Existence and consistency of MSWE).

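Assume A 1, A 2 and A 3. There exists an event $\mathsf{E}$ with $\mathbb{P}(\mathsf{E}) = 1$ such that, for all $\omega \in \mathsf{E}$,

$$\lim_{n \to +\infty} \inf_{\theta \in \Theta} \mathbf{SW}_p(\hat{\mu}_n(\omega), \mu_\theta) = \inf_{\theta \in \Theta} \mathbf{SW}_p(\mu_\star, \mu_\theta), \text{ and} \tag{8}$$

$$\limsup_{n \to +\infty} \operatorname*{argmin}_{\theta \in \Theta} \mathbf{SW}_p(\hat{\mu}_n(\omega), \mu_\theta) \subset \operatorname*{argmin}_{\theta \in \Theta} \mathbf{SW}_p(\mu_\star, \mu_\theta), \tag{9}$$

where $\hat{\mu}_n$ is defined by (2). Besides, for all $\omega \in \mathsf{E}$, there exists $n(\omega)$ such that, for all $n \geq n(\omega)$, the set $\operatorname*{argmin}_{\theta \in \Theta} \mathbf{SW}_p(\hat{\mu}_n(\omega), \mu_\theta)$ is non-empty.

Our proof technique is similar to the one given in [3]. This result shows that, when the number of observations goes to infinity, the estimate $\hat{\theta}_n$ will converge to a global minimizer of the problem $\min_{\theta \in \Theta} \mathbf{SW}_p(\mu_\star, \mu_\theta)$.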

In our next result, we prove a similar property for MESWEs as both $n$ and $m$ go to infinity. In order to increase clarity, and without loss of generality, in this setting, we consider $m$ as a function of $n$ such that $\lim_{n \to +\infty} m(n) = +\infty$. Now, we derive an analogous version of Theorem 2 for MESWE. For this result, we need to introduce another continuity assumption.

###### A 4.

If , then .

The next theorem establishes the consistency of MESWE.

###### Theorem 3 (Existence and consistency of MESWE).

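Assume A 1, A 2, A 3 and A 4. Let $(m(n))_{n \geq 1}$ be an increasing sequence satisfying $\lim_{n \to +\infty} m(n) = +\infty$. There exists an event $\mathsf{E}$ with $\mathbb{P}(\mathsf{E}) = 1$ such that, for all $\omega \in \mathsf{E}$,

$$\lim_{n \to +\infty} \inf_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{SW}_p(\hat{\mu}_n, \hat{\mu}_{\theta,m(n)}) \,\middle|\, Y_{1:n} \right] = \inf_{\theta \in \Theta} \mathbf{SW}_p(\mu_\star, \mu_\theta), \text{ and} \tag{10}$$

$$\limsup_{n \to +\infty} \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{SW}_p(\hat{\mu}_n, \hat{\mu}_{\theta,m(n)}) \,\middle|\, Y_{1:n} \right] \subset \operatorname*{argmin}_{\theta \in \Theta} \mathbf{SW}_p(\mu_\star, \mu_\theta), \tag{11}$$

where $\hat{\mu}_n$ and $\hat{\mu}_{\theta,m(n)}$ are defined by (2) and (4) respectively. Besides, for all $\omega \in \mathsf{E}$, there exists $n(\omega)$ such that, for all $n \geq n(\omega)$, the set $\operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{SW}_p(\hat{\mu}_n, \hat{\mu}_{\theta,m(n)}) \,\middle|\, Y_{1:n} \right]$ is non-empty.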

Similar to Theorem 2, this theorem shows that, when the number of observations goes to infinity, the estimator obtained with the expected distance will converge to a global minimizer.

### 3.3 Convergence of MESWE to MSWE

In practical applications, we can only use a finite number of generated samples $Z_{1:m}$. In this subsection, we analyze the case where the observations $Y_{1:n}$ are kept fixed while the number of generated samples increases, i.e. $m \to +\infty$, and we show that in this scenario MESWE converges to MSWE, assuming the latter exists.

Before deriving this result, we formulate a technical assumption below.

###### A 5.

For some and , the set is bounded almost surely.

###### Theorem 4 (MESWE converges to MSWE as m→+∞).

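Assume A 1, A 4 and A 5. Then,

$$\lim_{m \to +\infty} \inf_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{SW}_p(\hat{\mu}_n, \hat{\mu}_{\theta,m}) \,\middle|\, Y_{1:n} \right] = \inf_{\theta \in \Theta} \mathbf{SW}_p(\hat{\mu}_n, \mu_\theta), \tag{12}$$

$$\limsup_{m \to +\infty} \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{SW}_p(\hat{\mu}_n, \hat{\mu}_{\theta,m}) \,\middle|\, Y_{1:n} \right] \subset \operatorname*{argmin}_{\theta \in \Theta} \mathbf{SW}_p(\hat{\mu}_n, \mu_\theta). \tag{13}$$

Besides, for any sufficiently large $m$, the set $\operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[ \mathbf{SW}_p(\hat{\mu}_n, \hat{\mu}_{\theta,m}) \,\middle|\, Y_{1:n} \right]$ is non-empty.

This result shows that MESWE is indeed promising in practice, as one can get more accurate estimates by increasing $m$.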

### 3.4 Rate of convergence and the asymptotic distribution

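In our last set of theoretical results, we investigate the asymptotic distribution of MSWE and we establish a rate of convergence. We now suppose that we are in the well-specified setting, i.e. there exists $\theta_\star$ in the interior of $\Theta$ such that $\mu_{\theta_\star} = \mu_\star$, and we consider the following two assumptions. For any $u \in \mathbb{S}^{d-1}$ and $t \in \mathbb{R}$, we define $F_\theta(u, t) = \mu_\theta(\{y \in \mathsf{Y} : \langle u, y \rangle \leq t\})$. (Note that for any $u \in \mathbb{S}^{d-1}$, $F_\theta(u, \cdot)$ is the cumulative distribution function (CDF) associated to the measure $u^\star_\sharp \mu_\theta$.)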

###### A 6.

For all , there exists such that

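Let $\mathcal{L}^1(\mathbb{S}^{d-1} \times \mathbb{R})$ denote the class of functions that are absolutely integrable on the domain $\mathbb{S}^{d-1} \times \mathbb{R}$, with respect to the measure $d\sigma \otimes \mathrm{Leb}$, where $\mathrm{Leb}$ denotes the Lebesgue measure.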

###### A 7.

Assume that there exists a measurable function such that for each , and

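$$\int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| F_\theta(u, t) - F_{\theta_\star}(u, t) - \langle \theta - \theta_\star, D_\star(u, t) \rangle \right| dt \, d\sigma(u) = \epsilon(\rho_\Theta(\theta, \theta_\star)),$$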

where satisfies . Besides, are linearly independent in .

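For any $n \geq 1$, $u \in \mathbb{S}^{d-1}$ and $t \in \mathbb{R}$, define: $\hat{F}_n(u, t) = n^{-1} \, \mathrm{card}\{i \in \{1, \dots, n\} : \langle u, Y_i \rangle \leq t\}$, where $\mathrm{card}$ denotes the cardinality of a set. Note that for any $u \in \mathbb{S}^{d-1}$, $\hat{F}_n(u, \cdot)$ is the CDF associated to the measure $u^\star_\sharp \hat{\mu}_n$.

###### A 8.

There exists a random element $G_\star$ such that the stochastic process $\sqrt{n}(\hat{F}_n - F_{\theta_\star})$ converges weakly in $\mathcal{L}^1(\mathbb{S}^{d-1} \times \mathbb{R})$ to $G_\star$. (Footnote: under mild assumptions on the tails of $u^\star_\sharp \mu_\star$ for any $u \in \mathbb{S}^{d-1}$, we believe that one can prove that A 8 holds in general by extending [24, Proposition 3.5] and [25, Theorem 2.1a].)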

###### Theorem 5.

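Assume A 1, A 2, A 3, A 6, A 7 and A 8. Then, the asymptotic distribution of the goodness-of-fit statistic is given by

$$\sqrt{n} \inf_{\theta \in \Theta} \mathbf{SW}_1(\hat{\mu}_n, \mu_\theta) \xrightarrow{w} \inf_{\theta \in \Theta} \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| G_\star(u, t) - \langle \theta, D_\star(u, t) \rangle \right| dt \, d\sigma(u), \quad \text{as } n \to +\infty,$$

where $\hat{\mu}_n$ is defined by (2).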

###### Theorem 6.

Assume A 1, A 2, A 3, A 6, A 7 and A 8. Suppose also that the random map $\theta \mapsto \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| G_\star(u, t) - \langle \theta, D_\star(u, t) \rangle \right| dt \, d\sigma(u)$ has a unique infimum almost surely. Then, MSWE with $p = 1$ satisfies

$$\sqrt{n}(\hat{\theta}_n - \theta_\star) \xrightarrow{w} \operatorname*{argmin}_{\theta \in \Theta} \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| G_\star(u, t) - \langle \theta, D_\star(u, t) \rangle \right| dt \, d\sigma(u), \quad \text{as } n \to +\infty,$$

where $\hat{\theta}_n$ is defined by (1) with $\mathbf{SW}_1$ in place of $\mathbf{D}$.

These results show that the estimator and the associated goodness-of-fit statistic converge in distribution to a random variable, where the rate of convergence is $\sqrt{n}$. We note that this result is also inspired by [3], which identified the asymptotic distribution associated to the minimum Wasserstein estimator. However, since $\mathbf{W}_p$ admits an analytical form only when $d = 1$, their result is restricted to the scalar case. In contrast, since $\mathbf{SW}_p$ is defined in terms of one-dimensional $\mathbf{W}_p$ distances, we circumvent that issue and hence our result holds for general $d$.

## 4 Experiments

We conduct experiments on synthetic and real data to empirically confirm our theorems. We explain the different optimization methods used to approximate the estimators in Appendix D.

Multivariate Gaussian distributions:

We consider the task of estimating the parameters of a 10-dimensional Gaussian distribution using our SW estimators, based on i.i.d. observations drawn from the model. The advantage of this simple setting is that the density of the generated data has a closed-form expression, which makes MSWE tractable. We empirically verify our central limit theorem: for different values of $n$, we compute MSWE of order 1 500 times using 100 random projections, then we estimate the density of the resulting estimates with a kernel density estimator.

Figure 1 shows the distributions centered and rescaled by $\sqrt{n}$ for each $n$, and confirms the convergence rate that we derived (Theorem 6). To illustrate the consistency property in Theorem 2, we approximate MSWE of order 2 for different numbers of observed data points using 1000 random projections and we report for each $n$ the mean squared error between the estimated mean and variance and the data-generating parameters. We proceed the same way to study the consistency of MESWE (Theorem 3), which we approximate using 30 random projections and 20 ‘generated datasets’ of size $m(n)$ for different values of $n$. We also verify the convergence of MESWE to MSWE (Theorem 4): we compute these estimators on a fixed set of observations for different $m$, and we measure the error between them for each $m$. Results are shown in Figure 5. We see that our estimators indeed converge to the data-generating parameters as the number of observations increases (Figures (a) and (b)), and on a fixed observed dataset, MESWE converges to MSWE as we generate more samples (Figure (c)).
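To make the tractability remark for Gaussian models concrete, here is a short worked computation under an assumption that is ours and not stated in the text: both measures are isotropic Gaussians, $\mu = \mathcal{N}(m_1, s_1^2 I_d)$ and $\nu = \mathcal{N}(m_2, s_2^2 I_d)$ with standard deviations $s_1, s_2 \geq 0$. Their projections along any $u \in \mathbb{S}^{d-1}$ are one-dimensional Gaussians, for which the Wasserstein distance of order 2 is known in closed form, and the average over the sphere uses the identity $\int_{\mathbb{S}^{d-1}} \langle u, v \rangle^2 \, d\sigma(u) = \|v\|^2 / d$.

```latex
\begin{aligned}
u^\star_\sharp \mu &= \mathcal{N}(\langle u, m_1 \rangle, s_1^2), \qquad
u^\star_\sharp \nu = \mathcal{N}(\langle u, m_2 \rangle, s_2^2), \\
\mathbf{W}_2^2(u^\star_\sharp \mu, u^\star_\sharp \nu)
  &= \langle u, m_1 - m_2 \rangle^2 + (s_1 - s_2)^2, \\
\mathbf{SW}_2^2(\mu, \nu)
  &= \int_{\mathbb{S}^{d-1}} \langle u, m_1 - m_2 \rangle^2 \, d\sigma(u) + (s_1 - s_2)^2
   = \frac{\|m_1 - m_2\|^2}{d} + (s_1 - s_2)^2 .
\end{aligned}
```

Under this assumption, the distance between a candidate model and the data-generating Gaussian is an explicit function of the mean and scale parameters, which shows directly that the population objective is minimized only at the data-generating parameters.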

Multivariate elliptically contoured stable distributions: We focus on parameter inference for a subclass of multivariate stable distributions, called elliptically contoured stable distributions [26]. Stable distributions refer to a family of heavy-tailed probability distributions that generalize Gaussian laws and appear as the limit distributions in the generalized central limit theorem [27]. These distributions have many attractive theoretical properties and have proven useful in modeling financial data [28] or audio signals [29]. While special univariate cases include Gaussian, Lévy and Cauchy distributions, the density of stable distributions has no general analytic form, which restricts their practical application, especially in the multivariate case.

If $Y \in \mathbb{R}^d$ follows an elliptically contoured stable distribution, then its joint characteristic function is defined for any $t \in \mathbb{R}^d$ as $\mathbb{E}[\exp(i t^\top Y)] = \exp(-(t^\top \Sigma t)^{\alpha/2} + i t^\top m)$, where $\Sigma$ is a positive definite matrix (akin to a correlation matrix), $m \in \mathbb{R}^d$ is a location vector (equal to the mean if it exists) and $\alpha \in (0, 2)$ controls the thickness of the tail. Even though their densities cannot be evaluated easily, it is straightforward to sample from these distributions [26]; it is therefore particularly relevant here to apply MESWE instead of MLE.

To demonstrate the computational advantage of MESWE over the minimum expected Wasserstein estimator (MEWE) [3], we consider observations in $\mathbb{R}^d$ drawn i.i.d. from an elliptically contoured stable distribution whose location vector has every component equal to 2. The Wasserstein distance on multivariate data is either computed exactly by solving the linear program in (5), or approximated by solving a regularized version of this problem with Sinkhorn’s algorithm [12]. The MESWE is approximated using 10 random projections and 10 sets of generated samples. Then, following the approach in [3], we use the gradient-free optimization method Nelder-Mead to minimize the Wasserstein and SW distances. We report in Figure (a) the mean squared error between each estimate and the data-generating parameter, as well as their average computational time for different values of dimension $d$. We see that MESWE provides the same quality of estimation as its Wasserstein-based counterparts while considerably reducing the computational time, especially in higher dimensions. We then focus on this model in a fixed dimension and illustrate the consistency of the MESWE in the same way as for the Gaussian model: see Figure (b). To confirm the convergence of $\hat{\theta}_{n,m}$ to the MSWE $\hat{\theta}_n$, we fix 100 observations and we compute the mean squared error between the two approximate estimators (using 1 random projection and 1 generated dataset) for different values of $m$ (Figure (c)). Note that the MSWE is approximated with the MESWE obtained for a large enough value of $m$.

High-dimensional real data using GANs: Finally, we run experiments on image generation using the Sliced-Wasserstein Generator (SWG), an alternative GAN formulation based on the minimization of the SW distance [16]. Specifically, the generative modeling approach consists in introducing a latent random variable with a fixed distribution, and then transforming it through a neural network. This defines a parametric function that is able to produce images from a distribution $\mu_\theta$, and the goal is to optimize the neural network parameters $\theta$ such that the generated images are close to the observed ones. [16] proposes to minimize the SW distance between $\mu_\theta$ and the real data distribution over $\theta$ as the generator objective, and to train on MESWE in practice. For our experiments, we design a neural network with the fully-connected configuration given in [16, Appendix D] and we use the MNIST dataset, made of 60 000 training images and 10 000 test images of size $28 \times 28$. Our training objective is MESWE of order 2 approximated with 20 random projections and 20 different generated datasets. We study the consistent behavior of the MESWE by training the neural network on different sizes $n$ of training data and different numbers $m$ of generated samples, and by comparing the final training loss and test loss to the ones obtained when learning on the whole training dataset ($n = 60\,000$) and a fixed value of $m$. Results are shown in Figure 10 and we observe that they also confirm Theorem 3.

## 5 Conclusion

The Sliced-Wasserstein distance has been an attractive metric choice for learning in generative models, where the densities cannot be computed directly. In this study, we investigated the asymptotic properties of estimators that are obtained by minimizing SW and the expected SW. We showed that (i) convergence in SW implies weak convergence of probability measures in general Wasserstein spaces, (ii) the estimators are consistent, (iii) the estimators converge to a random variable in distribution with a rate of $\sqrt{n}$. We validated our mathematical results on both synthetic data and neural networks. We believe that our techniques can be further applied to extensions of SW such as [20, 30, 31].

## Acknowledgements

This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014) and by the industrial chair Machine Learning for Big Data from Télécom ParisTech.

## References

• [1] J. Wolfowitz. The minimum distance method. Ann. Math. Statist., 28(1):75–88, 03 1957.
• [2] A. Basu, H. Shioya, and C. Park. Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2011.
• [3] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, Jan 2019.
• [4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
• [5] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
• [6] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. arXiv preprint arXiv:1706.00292, 2017.
• [7] Giorgio Patrini, Marcello Carioni, Patrick Forre, Samarth Bhargav, Max Welling, Rianne van den Berg, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. arXiv preprint arXiv:1810.01118, 2018.
• [8] Jonas Adler and Sebastian Lunz. Banach Wasserstein GAN. In Advances in Neural Information Processing Systems, pages 6754–6763, 2018.
• [9] Aude Genevay, Gabriel Peyré, and Marco Cuturi. GAN and VAE from an optimal transport point of view. arXiv preprint arXiv:1706.01807, 2017.
• [10] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the vegan cookbook. arXiv preprint arXiv:1705.07642, 2017.
• [11] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pages 5545–5553, 2017.
• [12] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
• [13] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In Alfred M. Bruckstein, Bart M. ter Haar Romeny, Alexander M. Bronstein, and Michael M. Bronstein, editors, Scale Space and Variational Methods in Computer Vision, pages 435–446, 2012.
• [14] Nicolas Bonnotte. Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Paris 11, 2013.
• [15] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
• [16] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3483–3491, 2018.
• [17] Soheil Kolouri, Phillip E. Pope, Charles E. Martin, and Gustavo K. Rohde. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019.
• [18] Antoine Liutkus, Umut Şimşekli, Szymon Majewski, Alain Durmus, and Fabian-Robert Stoter. Sliced-Wasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, 2019.
• [19] Jiqing Wu, Zhiwu Huang, Wen Li, Janine Thoma, and Luc Van Gool. Sliced Wasserstein generative models. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
• [20] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander Schwing. Max-Sliced Wasserstein distance and its use for GANs. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
• [21] Patrick Billingsley. Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1999. A Wiley-Interscience Publication.
• [22] Cédric Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2009 edition, September 2008.
• [23] S. T. Rachev and L. Rüschendorf. Mass transportation problems. Vol. I. Probability and its Applications (New York). Springer-Verlag, New York, 1998. Theory.
• [24] Sophie Dede. An empirical central limit theorem in L1 for stationary sequences. Stochastic Processes and their Applications, 119(10):3494–3515, 2009.
• [25] Eustasio del Barrio, Evarist Giné, and Carlos Matrán. Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab., 27(2):1009–1071, 04 1999.
• [26] John P. Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational Statistics, 28(5):2067–2089, Oct 2013.
• [27] G. Samorodnitsky and M.S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Stochastic Modeling Series. Taylor & Francis, 1994.
• [28] B. B. Mandelbrot. Fractals and Scaling in Finance: Discontinuity, Concentration, Risk. Selecta Volume E. Springer Science & Business Media, 2013.
• [29] U. Şimşekli, A. Liutkus, and A. T. Cemgil. Alpha-stable matrix factorization. IEEE Signal Processing Letters, 22(12):2289–2293, 2015.
• [30] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In International Conference on Machine Learning, 2019.
• [31] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo K Rohde. Generalized sliced Wasserstein distances. arXiv preprint arXiv:1902.00434, 2019.
• [32] R.T. Rockafellar and R.J.B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2009.
• [33] L. D. Brown and R. Purves. Measurable selections of extrema. Ann. Statist., 1(5):902–912, 09 1973.
• [34] Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.
• [35] V.I. Bogachev. Measure Theory. Number vol. 1 in Measure Theory. Springer Berlin Heidelberg, 2007.
• [36] O. Kallenberg. Foundations of modern probability. Probability and its Applications (New York). Springer-Verlag, New York, 1997.
• [37] G. B. Folland. Real analysis. Pure and Applied Mathematics (New York). John Wiley & Sons, Inc., New York, second edition, 1999. Modern techniques and their applications, A Wiley-Interscience Publication.
• [38] C.D. Aliprantis and K.C. Border. Infinite Dimensional Analysis: A Hitchhiker’s Guide. Studies in economic theory. Springer, 1999.
• [39] D. Pollard. The minimum distance method of testing. Metrika, 27(1):43–70, Dec 1980.
• [40] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein Barycenters of Measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

## Appendix A Preliminaries

### A.1 Convergence and lower semi-continuity

###### Definition (Weak convergence).

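Let $(\mu_k)_{k \in \mathbb{N}}$ be a sequence of probability measures on the underlying space. We say that $(\mu_k)_{k \in \mathbb{N}}$ converges weakly to a probability measure $\mu$ on the same space if, for any continuous and bounded real-valued function $f$, we have

$$\lim_{k \to +\infty} \int f \, \mathrm{d}\mu_k = \int f \, \mathrm{d}\mu.$$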
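
As a small worked illustration of this definition (a hypothetical example on the real line, not taken from the paper), consider Dirac masses at the points $1/k$. They converge weakly to the Dirac mass at $0$, even though they never charge the point $0$ itself:

```latex
% Hypothetical example: \mu_k = \delta_{1/k} and \mu = \delta_0 on \mathbb{R}.
\[
  \int f \,\mathrm{d}\mu_k = f(1/k)
  \xrightarrow[k \to +\infty]{} f(0) = \int f \,\mathrm{d}\mu
  \quad \text{for every continuous and bounded } f,
\]
% whereas \mu_k(\{0\}) = 0 for all k and \mu(\{0\}) = 1:
% weak convergence does not require convergence on every measurable set.
```
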
###### Definition (Epi-convergence).

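Let $\Theta$ be a metric space and $f : \Theta \to \mathbb{R}$. Consider a sequence $(f_k)_{k \in \mathbb{N}}$ of functions from $\Theta$ to $\mathbb{R}$. We say that the sequence $(f_k)_{k \in \mathbb{N}}$ epi-converges to $f$ if, for each $\theta \in \Theta$,

$$\begin{aligned}
\liminf_{k \to \infty} f_k(\theta_k) &\geq f(\theta) \quad \text{for every sequence } (\theta_k)_{k \in \mathbb{N}} \text{ such that } \lim_{k \to +\infty} \theta_k = \theta, \text{ and} \\
\limsup_{k \to \infty} f_k(\theta_k) &\leq f(\theta) \quad \text{for some sequence } (\theta_k)_{k \in \mathbb{N}} \text{ such that } \lim_{k \to +\infty} \theta_k = \theta.
\end{aligned}$$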

An equivalent and useful characterization of epi-convergence is given in [32, Proposition 7.29], which we paraphrase in Proposition A.1 after recalling the definition of lower semi-continuous functions.

###### Definition (Lower semi-continuity).

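Let $\Theta$ be a metric space and $f : \Theta \to \mathbb{R}$. We say that $f$ is lower semi-continuous (l.s.c.) on $\Theta$ if for any $\theta_0 \in \Theta$,

$$\liminf_{\theta \to \theta_0} f(\theta) \geq f(\theta_0).$$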
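
To make the definition concrete, here is a hypothetical pair of step functions on $\Theta = \mathbb{R}$ (chosen for illustration only). Whether the jump point takes the lower or the upper value decides lower semi-continuity:

```latex
% Hypothetical example on \Theta = \mathbb{R}: two indicator functions with a jump at 0.
\[
  f = \mathbf{1}_{(0,+\infty)}: \quad
  \liminf_{\theta \to 0} f(\theta) = 0 \geq 0 = f(0)
  \quad \Rightarrow \quad f \text{ is l.s.c. at } 0,
\]
\[
  g = \mathbf{1}_{[0,+\infty)}: \quad
  \liminf_{\theta \to 0} g(\theta) = 0 < 1 = g(0)
  \quad \Rightarrow \quad g \text{ is not l.s.c. at } 0.
\]
```
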
###### Proposition (Characterization of epi-convergence via minimization, Proposition 7.29 of [32]).

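Let $\Theta$ be a metric space and $f : \Theta \to \mathbb{R}$ be a l.s.c. function. The sequence $(f_k)_{k \in \mathbb{N}}$, with $f_k : \Theta \to \mathbb{R}$ for any $k \in \mathbb{N}$, epi-converges to $f$ if and only if

• $\liminf_{k \to \infty} \inf_{\theta \in K} f_k(\theta) \geq \inf_{\theta \in K} f(\theta)$ for every compact set $K \subset \Theta$;

• $\limsup_{k \to \infty} \inf_{\theta \in O} f_k(\theta) \leq \inf_{\theta \in O} f(\theta)$ for every open set $O \subset \Theta$.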

[32, Theorem 7.31], paraphrased below, gives asymptotic properties for the infimum and argmin of epiconvergent functions and will be useful to prove the existence and consistency of our estimators.

###### Theorem 7 (Inf and argmin in epiconvergence, Theorem 7.31 of [32]).

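Let $\Theta$ be a metric space, $f : \Theta \to \mathbb{R}$ be a l.s.c. function and $(f_k)_{k \in \mathbb{N}}$ be a sequence with $f_k : \Theta \to \mathbb{R}$ for any $k \in \mathbb{N}$. Suppose $(f_k)_{k \in \mathbb{N}}$ epi-converges to $f$ with $\inf_{\theta \in \Theta} f(\theta)$ finite.

• It holds $\lim_{k \to \infty} \inf_{\theta \in \Theta} f_k(\theta) = \inf_{\theta \in \Theta} f(\theta)$ if and only if for every $\eta > 0$ there exists a compact set $K \subset \Theta$ and $N \in \mathbb{N}$ such that for any $k \geq N$,

$$\inf_{\theta \in K} f_k(\theta) \leq \inf_{\theta \in \Theta} f_k(\theta) + \eta.$$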
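
As a minimal numerical sketch of this statement, consider the hypothetical sequence $f_k(\theta) = (\theta - 1/k)^2 + 1/k$ on $\Theta = \mathbb{R}$. It converges uniformly on compact sets to the continuous function $f(\theta) = \theta^2$, and hence epi-converges to it. The functions, the compact set $K = [-1, 1]$ and the grid resolution below are illustrative assumptions, not objects from the paper.

```python
import numpy as np


def f_k(theta, k):
    """Hypothetical sequence: f_k(theta) = (theta - 1/k)^2 + 1/k."""
    return (theta - 1.0 / k) ** 2 + 1.0 / k


def f(theta):
    """Epi-limit of the sequence above: f(theta) = theta^2."""
    return theta ** 2


# Discretization of the compact set K = [-1, 1] (assumed for illustration).
grid = np.linspace(-1.0, 1.0, 20001)


def inf_and_argmin(values):
    """Return the minimum of `values` over the grid and the grid point attaining it."""
    i = int(np.argmin(values))
    return float(values[i]), float(grid[i])


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        inf_k, argmin_k = inf_and_argmin(f_k(grid, k))
        print(f"k={k:5d}  inf_K f_k={inf_k:.6f}  argmin={argmin_k:.4f}")
    print("limit:", inf_and_argmin(f(grid)))
```

Here the infimum of each $f_k$ over $\mathbb{R}$ is attained at $1/k \in K$, so the compactness condition of the theorem holds for every $\eta > 0$, and both the infima and the minimizers approach those of $f$.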

## Appendix B Preliminary results

In this section, we gather technical results regarding lower semi-continuity of (expected) Sliced-Wasserstein distances and measurability of MSWE which will be needed in our proofs.

### B.1 Lower semi-continuity of Sliced-Wasserstein distances

###### Lemma (Lower semi-continuity of SWp).

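Let $p \in [1, +\infty)$. The Sliced-Wasserstein distance of order $p$ is lower semi-continuous on its domain endowed with the topology of weak convergence, i.e. for any sequences $(\mu_k)_{k \in \mathbb{N}}$ and $(\nu_k)_{k \in \mathbb{N}}$ in this domain which converge weakly to $\mu$ and $\nu$ respectively, we have:

$$\mathrm{SW}_p(\mu, \nu) \leq \liminf_{k \to +\infty} \mathrm{SW}_p(\mu_k, \nu_k).$$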
###### Proof.

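First, by the continuous mapping theorem, if a sequence $(\mu_k)_{k \in \mathbb{N}}$ of probability measures converges weakly to $\mu$, then for any continuous function $f$, $(f_\sharp \mu_k)_{k \in \mathbb{N}}$ converges weakly to $f_\sharp \mu$. In particular, for any $u \in \mathbb{S}^{d-1}$, $(u^\star_\sharp \mu_k)_{k \in \mathbb{N}}$ converges weakly to $u^\star_\sharp \mu$, since $u^\star$ is a bounded linear form and thus continuous.

Let $p \in [1, +\infty)$. We introduce two sequences $(\mu_k)_{k \in \mathbb{N}}$ and $(\nu_k)_{k \in \mathbb{N}}$ of probability measures that converge weakly to $\mu$ and $\nu$ respectively. We show that for any $u \in \mathbb{S}^{d-1}$,

$$W_p^p(u^\star_\sharp \mu, u^\star_\sharp \nu) \leq \liminf_{k \to +\infty} W_p^p(u^\star_\sharp \mu_k, u^\star_\sharp \nu_k). \tag{14}$$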

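Indeed, if (14) holds, then the proof is completed using the definition of the Sliced-Wasserstein distance (7) and Fatou’s Lemma. Let $u \in \mathbb{S}^{d-1}$. For any $k \in \mathbb{N}$, let $\gamma_k$ be an optimal transference plan between $u^\star_\sharp \mu_k$ and $u^\star_\sharp \nu_k$ for the Wasserstein distance of order $p$, which exists by [22, Theorem 4.1], i.e.

$$W_p^p(u^\star_\sharp \mu_k, u^\star_\sharp \nu_k) = \int_{\mathbb{R} \times \mathbb{R}} |a - b|^p \, \mathrm{d}\gamma_k(a, b).$$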

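Note that by [22, Lemma 4.4] and Prokhorov’s Theorem, $(\gamma_k)_{k \in \mathbb{N}}$ is sequentially compact in the set of probability measures on $\mathbb{R} \times \mathbb{R}$ for the topology associated with the weak convergence. Now, consider a subsequence $(\gamma_{\phi_1(k)})_{k \in \mathbb{N}}$, where $\phi_1 : \mathbb{N} \to \mathbb{N}$ is increasing, such that

$$\lim_{k \to +\infty} \int_{\mathbb{R} \times \mathbb{R}} |a - b|^p \, \mathrm{d}\gamma_{\phi_1(k)}(a, b) = \lim_{k \to +\infty} W_p^p(u^\star_\sharp \mu_{\phi_1(k)}, u^\star_\sharp \nu_{\phi_1(k)}) = \liminf_{k \to +\infty} W_p^p(u^\star_\sharp \mu_k, u^\star_\sharp \nu_k). \tag{15}$$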

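Since $(\gamma_k)_{k \in \mathbb{N}}$ is sequentially compact, $(\gamma_{\phi_1(k)})_{k \in \mathbb{N}}$ is sequentially compact as well, and therefore there exist an increasing function $\phi_2 : \mathbb{N} \to \mathbb{N}$ and a probability distribution $\gamma$ on $\mathbb{R} \times \mathbb{R}$ such that $(\gamma_{\phi_2(\phi_1(k))})_{k \in \mathbb{N}}$ converges weakly to $\gamma$. Then, we obtain by (15),

$$\int_{\mathbb{R} \times \mathbb{R}} |a - b|^p \, \mathrm{d}\gamma(a, b) = \lim_{k \to +\infty} \int_{\mathbb{R} \times \mathbb{R}} |a - b|^p \, \mathrm{d}\gamma_{\phi_2(\phi_1(k))}(a, b) = \liminf_{k \to +\infty} W_p^p(u^\star_\sharp \mu_k, u^\star_\sharp \nu_k).$$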

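If we show that $\gamma$ is a transference plan between $u^\star_\sharp \mu$ and $u^\star_\sharp \nu$, i.e. that it has these two measures as marginals, it will conclude the proof of (14) by definition of the Wasserstein distance (5). But for any continuous and bounded function $f : \mathbb{R} \to \mathbb{R}$, since for any $k \in \mathbb{N}$, $\gamma_k$ is a transference plan between $u^\star_\sharp \mu_k$ and $u^\star_\sharp \nu_k$, and $(u^\star_\sharp \mu_k)_{k \in \mathbb{N}}$ and $(u^\star_\sharp \nu_k)_{k \in \mathbb{N}}$ converge weakly to $u^\star_\sharp \mu$ and $u^\star_\sharp \nu$ respectively, we have

$$\int_{\mathbb{R} \times \mathbb{R}} f(a) \, \mathrm{d}\gamma(a, b) = \lim_{k \to +\infty} \int_{\mathbb{R} \times \mathbb{R}} f(a) \, \mathrm{d}\gamma_{\phi_2(\phi_1(k))}(a, b) = \lim_{k \to +\infty} \int_{\mathbb{R}} f(a) \, \mathrm{d}u^\star_\sharp \mu_{\phi_2(\phi_1(k))}(a) = \int_{\mathbb{R}} f(a) \, \mathrm{d}u^\star_\sharp \mu(a),$$

and similarly

$$\int_{\mathbb{R} \times \mathbb{R}} f(b) \, \mathrm{d}\gamma(a, b) = \int_{\mathbb{R}} f(b) \, \mathrm{d}u^\star_\sharp \nu(b).$$

This shows that $\gamma$ is a transference plan between $u^\star_\sharp \mu$ and $u^\star_\sharp \nu$ and therefore, (14) is true. We conclude by applying Fatou’s Lemma.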
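
The objects used in this proof, namely the projected measures $u^\star_\sharp \mu$ and one-dimensional Wasserstein distances between them, can be computed directly for empirical measures. The following is a minimal sketch, assuming that definition (7) is the usual average of $W_p^p$ over directions drawn uniformly on the unit sphere, and that both samples have the same number of points so that the one-dimensional optimal plan pairs sorted projections. The function names, the number of projections and the seed are hypothetical choices, not the authors' implementation.

```python
import numpy as np


def wasserstein_1d_pp(x, y, p=2):
    """W_p^p between two 1D empirical measures with the same number of atoms.

    In one dimension, an optimal transference plan pairs the sorted samples.
    """
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return float(np.mean(np.abs(x_sorted - y_sorted) ** p))


def sliced_wasserstein(X, Y, p=2, n_projections=200, seed=0):
    """Monte Carlo estimate of SW_p between the empirical measures of X and Y.

    X, Y: arrays of shape (n, d) with the same n (assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    U = rng.normal(size=(n_projections, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # Average W_p^p over the sampled directions, then take the p-th root.
    total = 0.0
    for u in U:
        total += wasserstein_1d_pp(X @ u, Y @ u, p)
    return (total / n_projections) ** (1.0 / p)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    shift = np.array([1.0, 0.0, 0.0])
    print("SW_2(X, X)         =", sliced_wasserstein(X, X))
    print("SW_2(X, X + shift) =", sliced_wasserstein(X, X + shift))
```

For a pure translation by a vector $s$ in dimension $d$, each projected distance equals $|\langle u, s \rangle|^p$, so for $p = 2$ the estimate should be close to $\|s\| / \sqrt{d}$, up to Monte Carlo error in the directions.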

By a direct application of the previous lemma, we have the following result.

###### Corollary .

Suppose Assumption 1 holds. Then, is lower semi-continuous in .

###### Lemma (Lower semi-continuity of ESWp).

Let and . Denote for any , , where are i.i.d. samples from . Then, the map is lower semi-continuous on endowed with the topology of weak convergence.

###### Proof.

We consider two sequences and of probability measures in , such that and , and we fix .

By Skorokhod’s representation theorem, there exists a probability space , a sequence of random variables and a random variable