1 Introduction
The Wasserstein distance is a metric between probability distributions which has received a surge of interest in statistics and machine learning (Panaretos and Zemel, 2018; Kolouri et al., 2017). This distance is a special case of the optimal transport problem (Villani, 2003), and measures the work required to couple one distribution with another. Specifically, let $\mathcal{P}(\mathbb{R}^d)$ denote the set of Borel probability measures supported on $\mathbb{R}^d$, for some integer $d \geq 1$, and let $\mathcal{P}_p(\mathbb{R}^d)$ denote the set of probability measures with finite $p$th moment, for some $p \geq 1$. Given $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$, the $p$th order Wasserstein distance between $\mu$ and $\nu$ is defined to be
$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^p \, d\pi(x, y) \right)^{1/p}, \qquad (1)$$
where $\Pi(\mu, \nu)$ denotes the set of joint probability distributions with marginals $\mu$ and $\nu$, known as couplings. The minimizer $\pi$ is called the optimal coupling between $\mu$ and $\nu$. The norm $\|\cdot\|$ is taken to be Euclidean in this paper, but may more generally be replaced by any metric on $\mathbb{R}^d$.
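As a concrete illustration of the one-dimensional case used repeatedly below, the optimal coupling between two equal-size empirical measures on the real line simply matches order statistics. The following is a minimal sketch (function name ours, not from the paper):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    # Empirical p-Wasserstein distance between two equal-size 1-D
    # samples: the optimal coupling matches the i-th smallest point of
    # x to the i-th smallest point of y, so the distance reduces to
    # the L_p norm of the difference of the sorted samples.
    xs, ys = np.sort(x), np.sort(y)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)
```

For instance, shifting a sample by a constant $c$ yields a distance of $|c|$ for every $p$, since all matched pairs are displaced equally.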
Despite the popularity of Wasserstein distances in recent machine learning methodologies, their high computational complexity often limits their applicability to large-scale problems. Developing efficient numerical approximations of the distance remains an active research area; see Peyré et al. (2019) for a recent review. A key exception to the high computational cost is the univariate case, in which the Wasserstein distance admits a closed form as the $L_p$ norm between the quantile functions of $\mu$ and $\nu$, which can be easily computed. This fact has led to the study of an alternate metric, known as the Sliced Wasserstein distance (Rabin et al., 2011; Bonneel et al., 2015), obtained by averaging the Wasserstein distance between one-dimensional projections of the distributions $\mu$ and $\nu$. Specifically, let $\mathbb{S}^{d-1} = \{\theta \in \mathbb{R}^d : \|\theta\| = 1\}$, and let $\sigma$ denote the uniform probability measure on $\mathbb{S}^{d-1}$. The $p$th order Sliced Wasserstein distance between $\mu$ and $\nu$ is given by
$$SW_p(\mu, \nu) = \left( \int_{\mathbb{S}^{d-1}} W_p^p(\mu_\theta, \nu_\theta) \, d\sigma(\theta) \right)^{1/p},$$
where for any $\theta \in \mathbb{S}^{d-1}$, $\mu_\theta$ and $\nu_\theta$ denote the respective probability distributions of $\langle \theta, X \rangle$ and $\langle \theta, Y \rangle$, where $X \sim \mu$ and $Y \sim \nu$. Furthermore, $F_{\mu_\theta}$ is the cumulative distribution function (cdf) of $\mu_\theta$, and $F_{\mu_\theta}^{-1}$ denotes its quantile function (and similarly for $\nu_\theta$). The Sliced Wasserstein distance is a generally weaker metric than the Wasserstein distance, as shown by the inequality $SW_p(\mu, \nu) \leq W_p(\mu, \nu)$ for all $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$ (Bonnotte, 2013). It nonetheless preserves many properties qualitatively similar to those of the Wasserstein distance, making it an attractive, easily computable alternative in many applications. It is well-known that the Wasserstein distance and its sliced analogue are sensitive to outliers and thick tails; thus, inspired by Munk and Czado (1998) and Álvarez-Esteban et al. (2008), we also define, for a trimming constant $\delta \in [0, 1/2)$, the trimmed Sliced Wasserstein distance
$$SW_{p,\delta}(\mu, \nu) = \left( \int_{\mathbb{S}^{d-1}} \frac{1}{1 - 2\delta} \int_\delta^{1-\delta} \big| F_{\mu_\theta}^{-1}(t) - F_{\nu_\theta}^{-1}(t) \big|^p \, dt \, d\sigma(\theta) \right)^{1/p}. \qquad (2)$$
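The trimmed Sliced Wasserstein distance lends itself to a simple Monte Carlo scheme: average trimmed one-dimensional distances over random projection directions. The sketch below is our own illustration (function names and the quantile-grid discretization are assumptions, not the paper's code):

```python
import numpy as np

def trimmed_w1d(x, y, p=2, delta=0.0):
    # Trimmed 1-D distance: discretize the integral of
    # |F^{-1}(t) - G^{-1}(t)|^p over quantile levels t in (delta, 1 - delta).
    t = np.linspace(delta, 1.0 - delta, 200, endpoint=False)[1:]
    qx, qy = np.quantile(x, t), np.quantile(y, t)
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

def sliced_wasserstein(X, Y, p=2, delta=0.0, N=100, rng=None):
    # Monte Carlo estimate: draw N directions theta uniformly on the
    # unit sphere, average the p-th powers of the projected 1-D
    # distances, and take the 1/p-th root.
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(N):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        total += trimmed_w1d(X @ theta, Y @ theta, p, delta) ** p
    return (total / N) ** (1.0 / p)
```

Normalizing a standard Gaussian vector produces a direction uniform on the sphere, so the average of the projected distances is an unbiased Monte Carlo estimate of the $p$th power of the sliced distance.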
Despite the increasing applications of Wasserstein-type distances in machine learning and related fields, uncertainty quantification for these distances has received significantly less attention. In this paper, we derive confidence intervals for the Sliced Wasserstein distance which make either no assumptions or mild moment assumptions on the unknown distributions $\mu$ and $\nu$. Specifically, given a pre-specified level $\alpha \in (0, 1)$ and iid samples $X_1, \dots, X_n \sim \mu$ and $Y_1, \dots, Y_m \sim \nu$, we derive confidence sets $C_{n,m}$ such that
$$\mathbb{P}\big( SW_{p,\delta}(\mu, \nu) \in C_{n,m} \big) \geq 1 - \alpha. \qquad (3)$$
Our proposed choices of $C_{n,m}$ are adaptive in length to the smoothness of the distributions $\mu$ and $\nu$, and have optimal length in a minimax sense. To this end, we derive minimax rates for estimating the Sliced Wasserstein distance between two distributions, noting that the minimax length of a confidence interval is always bounded below by the corresponding minimax estimation rate. Specifically, we provide lower bounds on the minimax risk
$$\inf_{\widehat{SW}} \sup_{(\mu, \nu) \in \mathcal{F}} \mathbb{E}\big| \widehat{SW} - SW_{p,\delta}(\mu, \nu) \big|, \qquad (4)$$
where the infimum is over all estimators $\widehat{SW}$ of the Sliced Wasserstein distance based on $n$ samples from $\mu$ and $m$ samples from $\nu$, and $\mathcal{F}$ is an appropriately chosen collection of pairs of distributions.
To motivate our choice of families $\mathcal{F}$, we begin by studying the minimax rates for estimating a distribution under the Sliced Wasserstein distance; that is, we begin by bounding the quantity
$$\inf_{\widehat\mu} \sup_{\mu \in \mathcal{G}} \mathbb{E}\big[ SW_{p,\delta}(\widehat\mu, \mu) \big], \qquad (5)$$
where the infimum is over all estimators $\widehat\mu$ of a Borel probability distribution $\mu$ based on a sample of size $n$, and $\mathcal{G}$ is an appropriately chosen class of distributions. The rates we obtain for both problems (4) and (5) are dimension-independent, thus showing that estimation of distributions under the Sliced Wasserstein distance, and estimation of the Sliced Wasserstein distance between two distributions, are simpler statistical problems than their analogues for the Wasserstein distance (Singh and Póczos, 2018; Liang, 2019).
Our Contributions and Related Work. The Sliced Wasserstein distance is one of many approximations of the Wasserstein distance based on one-dimensional projections. We mention here the Generalized Sliced (Kolouri et al., 2019), Tree-Sliced (Le et al., 2019), and Subspace Robust (Paty and Cuturi, 2019) Wasserstein distances.
We are not aware of any existing work regarding statistical inference, in particular confidence intervals, for any of these distances, except in the special case when they all coincide with the one-dimensional Wasserstein distance. In this case, Munk and Czado (1998) study limiting distributions of the empirical (plug-in) Wasserstein distance estimator, and Freitag et al. (2003, 2007) establish sufficient conditions for the validity of the bootstrap in estimating the distribution of the empirical Wasserstein distance. These results, however, are only valid under certain smoothness assumptions on $\mu$ and $\nu$, and imply different inferential procedures at the null $\mu = \nu$ and away from the null $\mu \neq \nu$. In contrast, the confidence intervals derived in the present paper are valid under either no assumptions or mild moment assumptions on $\mu$ and $\nu$, and apply more generally to the Sliced Wasserstein distance in arbitrary dimension. Other inferential results for Wasserstein distances include those of Sommerfeld and Munk (2018); Tameling et al. (2017); Klatt et al. (2018) when the supports of $\mu$ and $\nu$ are finite or countable sets, and the work of Rippl et al. (2016) when $\mu$ and $\nu$ only differ by a location-scale transformation.
To the best of our knowledge, the only existing results regarding the minimax rate of estimating the Wasserstein distance between two distributions are those of Liang (2019), which are restricted to the case $p = 1$. As we show below, the dependence of the minimax rate on $p$ is nontrivial, and in the special case $p = 1$ our work generalizes that of Liang. The rate of estimating a distribution under the Wasserstein distance has received significantly more attention. Upper bounds on the convergence rate of the empirical measure have been established by Fournier and Guillin (2015); Boissard and Le Gouic (2014); Bobkov and Ledoux (2014); Weed and Bach (2018); Lei (2018). Singh and Póczos (2018) obtain corresponding minimax lower bounds.
Paper Outline. The rest of this paper is organized as follows. Section 2 establishes minimax rates of estimating a distribution under the Sliced Wasserstein distance. In Section 3, we derive confidence intervals for the onedimensional and Sliced Wasserstein distances, and we establish lower bounds on the minimax risk of estimating the Sliced Wasserstein distance in Section 4. We illustrate the performance of our confidence intervals via a brief simulation study in Section 5.
Notation. For any $a, b \in \mathbb{R}$, $a \vee b$ denotes the maximum of $a$ and $b$, and $a \wedge b$ denotes the minimum of $a$ and $b$. For any sequences of real numbers $(a_n)$ and $(b_n)$, we write $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n \leq C b_n$ for all $n$, and we write $a_n \asymp b_n$ if $a_n \lesssim b_n \lesssim a_n$. For any $x \in \mathbb{R}^d$, $\delta_x$ denotes the Dirac delta measure with mass at $x$.
2 Minimax Estimation of a Distribution under the Sliced Wasserstein Distance
Let $p \geq 1$, $\mu \in \mathcal{P}_p(\mathbb{R}^d)$, and let $X_1, \dots, X_n \sim \mu$ be an iid sample. Let $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ denote the corresponding empirical measure. In this section, we first establish upper bounds on the rate of convergence of $\mathbb{E}[SW_{p,\delta}(\mu_n, \mu)]$, extending the comprehensive treatment by Bobkov and Ledoux (2014) of this quantity in dimension one ($d = 1$). We then use these results to bound the minimax risk in (5) of estimating a distribution under the trimmed Sliced Wasserstein distance.
For any $\theta \in \mathbb{S}^{d-1}$, let $f_\theta$ denote the density of the absolutely continuous component of $\mu_\theta$, where recall that $\mu_\theta$ denotes the probability distribution of $\langle \theta, X \rangle$ for $X \sim \mu$, with cdf $F_\theta$. Define the functional
$$J_{p,\delta}(\mu) = \int_{\mathbb{S}^{d-1}} \int_{F_\theta^{-1}(\delta)}^{F_\theta^{-1}(1-\delta)} \frac{\big[ F_\theta(x) (1 - F_\theta(x)) \big]^{p/2}}{f_\theta(x)^{p-1}} \, dx \, d\sigma(\theta), \qquad (6)$$
with the convention $0/0 = 0$. When $d = 1$, the outer integral over $\mathbb{S}^{d-1}$ is omitted, and in the untrimmed case $\delta = 0$, we omit the subscript and write $J_p(\mu)$. When $d = 1$ and $\delta = 0$, Bobkov and Ledoux (2014) prove that $\mathbb{E}[W_p(\mu_n, \mu)]$ decays at the standard rate $n^{-1/2}$ if and only if $J_p(\mu) < \infty$, and otherwise decays at most at the rate $n^{-1/(2p)}$ under mild moment assumptions. The following result generalizes their Theorems 5.3 and 7.16, showing that a similar conclusion is true of the trimmed Sliced Wasserstein distance, with respect to the functional $J_{p,\delta}$.
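To make this dichotomy concrete, the following self-contained numerical check (our own illustration, not code from the paper) estimates the one-dimensional distance $W_1$ between the empirical measure of a Uniform[0,1] sample and its population counterpart, a case in which the relevant functional is finite, so the standard rate applies:

```python
import numpy as np

def w1_to_uniform(x):
    # W_1 between the empirical measure of a sample in [0, 1] and the
    # Uniform[0, 1] distribution: in one dimension this is the integral
    # of |F_n(t) - t| over [0, 1], approximated on a fine grid.
    grid = np.linspace(0.0, 1.0, 2001)
    F_n = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    return float(np.mean(np.abs(F_n - grid)))

rng = np.random.default_rng(0)
errs = {n: w1_to_uniform(rng.uniform(size=n)) for n in (100, 10000)}
```

On a typical run, the error for $n = 10000$ is roughly an order of magnitude smaller than for $n = 100$, consistent with a square-root rate, though any single realization is noisy.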
Proposition 1
Let $p \geq 1$, $\delta \in [0, 1/2)$, and $\mu \in \mathcal{P}_p(\mathbb{R}^d)$.

Suppose $J_{p,\delta}(\mu) < \infty$. Then, there exists a universal constant depending only on $p$ such that

We have, , where
It can be seen that a necessary condition for the finiteness of $J_{p,\delta}(\mu)$ is that, for almost every $\theta \in \mathbb{S}^{d-1}$, the density $f_\theta$ is supported on a (possibly infinite) interval. When $\delta > 0$, this condition is also sufficient, independently of $p$. On the other hand, when $\delta = 0$, the value of $J_p(\mu)$ also depends on the tail behaviour of $\mu$ and the value of $p$. For example, if $\mu$ is the standard Gaussian distribution, it can be shown that $J_{p,\delta}(\mu) < \infty$ whenever $\delta > 0$, and that $J_p(\mu) < \infty$ if and only if $p < 2$. On the other hand, if , for some , where denotes the uniform distribution on the interval , then if and only if , for every . We now bound the minimax risk in (5). Consider the class of distributions
(7) 
for some . It can be seen that , thus the results of Proposition 1 provide an upper bound on the worst-case risk under the Sliced Wasserstein distance, over the family . In view of this fact and the result of Proposition 1, it is natural to further distinguish our minimax bounds over the following two classes of distributions,
We now state the main result of this section.
Theorem 2
Let $p \geq 1$ and $\delta \in [0, 1/2)$. Then,
Theorem 2 shows that the minimax risk of estimating a distribution under the Sliced Wasserstein distance achieves the parametric rate $n^{-1/2}$ for distributions in the first of these classes, and the generally slower but dimension-independent rate $n^{-1/(2p)}$ otherwise, under mild moment assumptions. In particular, the empirical measure achieves both rates, as shown in Proposition 1. These results contrast with the dimension-dependent rates of estimating a distribution under the Wasserstein distance, which may generically decay at the rate $n^{-1/d}$ (see for instance Weed and Bach (2018)). In what follows, we show that parametric rates of convergence are also achievable for the problem of estimating the Sliced Wasserstein distance between two distributions, and we begin by deriving confidence intervals whose lengths decay at such a rate.
3 Confidence Intervals for the Sliced Wasserstein Distance
In this section, we propose several confidence intervals for the two-sample problem (3), which have finite-sample validity under at most mild moment assumptions, and which are adaptive to whether or not the functional in (6) is finite. We begin by constructing confidence intervals for the one-dimensional Wasserstein distance, and we then extend these results to the Sliced Wasserstein distance.
3.1 Confidence Intervals for the One-Dimensional Wasserstein Distance
Throughout this subsection, let $p \geq 1$ and $\delta \in [0, 1/2)$ be given, let $\mu, \nu$ be probability distributions on $\mathbb{R}$ with respective cdfs $F$ and $G$, and let $X_1, \dots, X_n \sim \mu$ and $Y_1, \dots, Y_m \sim \nu$ be iid samples. Let $F_n$ and $G_m$ denote their corresponding empirical cdfs. We derive confidence intervals for the trimmed Wasserstein distance, with the following non-asymptotic coverage guarantee:
$$\mathbb{P}\big( W_{p,\delta}(\mu, \nu) \in C_{n,m} \big) \geq 1 - \alpha, \qquad (8)$$
for some pre-specified level $\alpha \in (0, 1)$, where the trimming constant $\delta$ may depend on the sample sizes $n$ and $m$. Our approach hinges on the fact that the one-dimensional Wasserstein distance may be expressed as the $L_p$ norm of the difference of the quantile functions of $\mu$ and $\nu$, suggesting that a confidence interval may be derived via uniform control of the empirical quantile process. Specifically, let $(L_n)$ and $(U_n)$ be sequences of functions such that, with probability at least $1 - \alpha$, $L_n \leq F \leq U_n$ holds uniformly on $\mathbb{R}$ (and similarly for $G$).
For example, inversion of the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (Dvoretzky et al., 1956; Massart, 1990),
$$\mathbb{P}\Big( \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| > \varepsilon \Big) \leq 2 e^{-2 n \varepsilon^2}, \quad \varepsilon > 0, \qquad (9)$$
leads to the choice $L_n = F_n - \varepsilon_n$ and $U_n = F_n + \varepsilon_n$, where $\varepsilon_n = \sqrt{\log(2/\alpha)/(2n)}$. Scale-dependent choices of $L_n$ and $U_n$ may also be obtained, for instance, via the relative Vapnik-Chervonenkis (VC) inequality (Vapnik, 2013). We may then define the confidence interval
(10) 
where for all ,
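To make the DKW-based envelope concrete, the following sketch (function name ours) inverts inequality (9) into a uniform confidence band for the cdf at level $1 - \alpha$:

```python
import numpy as np

def dkw_envelope(x, alpha=0.05):
    # Invert the DKW inequality: with probability at least 1 - alpha,
    # the true cdf F lies between F_n - eps and F_n + eps everywhere,
    # where eps = sqrt(log(2 / alpha) / (2 n)). The band is evaluated
    # at the order statistics and clipped to [0, 1].
    n = len(x)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    xs = np.sort(x)
    F_n = np.arange(1, n + 1) / n  # empirical cdf at the order statistics
    lower = np.clip(F_n - eps, 0.0, 1.0)
    upper = np.clip(F_n + eps, 0.0, 1.0)
    return xs, lower, upper
```

Away from the boundary values 0 and 1, the band has constant width $2\varepsilon_n$, which shrinks at the $n^{-1/2}$ rate regardless of the underlying distribution.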
Proposition 3 (Coverage of $C_{n,m}$)
Although the coverage of $C_{n,m}$ requires no assumptions on $\mu$ and $\nu$, we now show that it is adaptive in length to whether or not the functionals $J_{p,\delta}(\mu)$ and $J_{p,\delta}(\nu)$ are finite. In what follows, we state our results regarding the length of $C_{n,m}$ assuming $L_n$ and $U_n$ are chosen based on the DKW inequality in (9), but more general proofs are available in the supplement.
Let . For any function , we write , and
Finally, let . We then have the following result.
Theorem 4 (Length of $C_{n,m}$)
Let $L_n$ and $U_n$ satisfy the conditions of Proposition 3. Then,

There exists a universal constant $C > 0$ such that for every $n, m \geq 1$,
with probability at least , where
and

Assume . Assume further that the measures $\mu$ and $\nu$ have no singular components. Then, there exists a universal constant $C > 0$ such that for all $n, m \geq 1$,
with probability at least , where
where $f$ and $g$ are the respective densities of the absolutely continuous components of $\mu$ and $\nu$.
When the trimming constant $\delta$ is strictly positive and does not depend on $n$ or $m$, the quantities and are constant for large enough $n$ and $m$. In this case, a careful investigation of Theorem 4 shows that, whenever for some positive constant , regardless of the values of . On the other hand, in the finite-sample regime where , and in particular when , the minimax lower bounds derived in Section 2 suggest that the standard rate is not achievable when . Indeed, Theorem 4 implies
3.2 Confidence Intervals for the Sliced Wasserstein Distance
We now derive a confidence interval for $SW_{p,\delta}(\mu, \nu)$, where $\mu, \nu \in \mathcal{P}_p(\mathbb{R}^d)$. In analogy to the previous subsection, an immediate first approach is obtained by choosing functions $L_n$ and $U_n$ such that, with probability at least $1 - \alpha$,
$$L_n(x, \theta) \leq F_\theta(x) \leq U_n(x, \theta), \quad \text{for all } x \in \mathbb{R} \text{ and } \theta \in \mathbb{S}^{d-1}, \qquad (11)$$
where $F_\theta$ denotes the cdf of $\mu_\theta$ for all $\theta \in \mathbb{S}^{d-1}$. Such a bound can be obtained, for instance, by an application of the VC inequality (Vapnik, 2013) over the set of half-spaces. An assumption-free confidence interval for $SW_{p,\delta}(\mu, \nu)$ with finite-sample coverage may then be constructed by following the same lines as in the previous section. Due to the uniformity of (11) over the unit sphere, however, it can be seen that the length of such an interval is necessarily dimension-dependent. In what follows, we instead show that it is possible to obtain a dimension-independent confidence interval by exploiting the fact that the Sliced Wasserstein distance is a mean with respect to the uniform distribution $\sigma$ on $\mathbb{S}^{d-1}$.
Let $\theta_1, \dots, \theta_N$ be an iid sample from the distribution $\sigma$, for some integer $N \geq 1$, and consider the following Monte Carlo approximation of the Sliced Wasserstein distance between the distributions $\mu$ and $\nu$:
$$\widetilde{SW}_{p,\delta}(\mu, \nu) = \bigg( \frac{1}{N} \sum_{j=1}^N W_{p,\delta}^p(\mu_{\theta_j}, \nu_{\theta_j}) \bigg)^{1/p}.$$
For any $j = 1, \dots, N$, let $[\ell_j, u_j]$ be a confidence interval for $W_{p,\delta}(\mu_{\theta_j}, \nu_{\theta_j})$, as constructed in the previous section, and let
$$C_{n,m}^N = \Bigg[ \bigg( \frac{1}{N} \sum_{j=1}^N \ell_j^p \bigg)^{1/p},\ \bigg( \frac{1}{N} \sum_{j=1}^N u_j^p \bigg)^{1/p} \Bigg]. \qquad (12)$$
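Under one natural reading of this construction (names and details ours, offered only as a sketch), the per-direction interval endpoints are combined by averaging their $p$th powers over the Monte Carlo directions:

```python
import numpy as np

def combine_intervals(lowers, uppers, p=2):
    # Given per-direction confidence bounds [lowers[j], uppers[j]] for
    # the one-dimensional distances along the sampled directions, form
    # a single interval for the sliced distance by averaging p-th
    # powers over the directions and taking the 1/p-th root.
    lowers, uppers = np.asarray(lowers), np.asarray(uppers)
    lo = np.mean(lowers ** p) ** (1.0 / p)
    hi = np.mean(uppers ** p) ** (1.0 / p)
    return lo, hi
```

Because the endpoints are averaged rather than maximized over directions, the width of the combined interval does not inherit the dimension-dependent penalty of a band that is uniform over the whole sphere.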
In analogy to the quantities , and in the statement of Theorem 4, we write their averages with respect to the distributions and , , as,
We then have the following result, which shows that the length of $C_{n,m}^N$ decays at the same rate as the one-dimensional confidence interval (10), up to a polylogarithmic factor in $N$.
Theorem 5
Let $\alpha \in (0, 1)$, and assume the same conditions on $L_n$ and $U_n$ as in Theorem 4.

For all , and for any finite constant ,

Suppose . Then, there exists a universal constant such that for all , we have,
with probability at least $1 - \alpha$. Furthermore, if $\mu_\theta$ and $\nu_\theta$ have no singular components for $\sigma$-almost every $\theta \in \mathbb{S}^{d-1}$, and if , then with probability at least $1 - \alpha$,
4 Minimax Lower Bounds
In this section, we provide lower bounds on the minimax rates for estimating the Sliced Wasserstein distance between two distributions, as defined in (4). It should be noted that the upper bounds on the (worst-case) length of the confidence intervals for the Sliced Wasserstein distance, derived previously, provide an upper bound on (4). Given $p$ and $\delta$, recall the class in (7). Motivated by (5), we define
We lower bound the minimax risk of estimating the Sliced Wasserstein distance over each of these classes. The upper bounds on the length of $C_{n,m}^N$ in Theorem 5 match these lower bounds (up to a polylogarithmic factor in the number of Monte Carlo iterations $N$), provided the trimming constant $\delta$ is held fixed.
Theorem 6
Let $p \geq 1$, $\delta \in [0, 1/2)$, and $n, m \geq 1$.

Given , we have
(13) In particular, if $\delta$ does not depend on $n$ and $m$, we have
(14) 
We have, .
5 Simulation Study
We conduct a brief simulation study to illustrate the adaptivity of our confidence intervals to whether or not the functional $J_{p,\delta}$ is finite, and to the distance between the distributions being compared. In all simulations below, we generate 100 samples of varying size $n = m$, from the pair of distributions
and,
for varying values of $r$, where denotes the uniform probability distribution on the circle centered at the origin and with radius $r$ in $\mathbb{R}^2$, where we choose . It can be seen that , while , for all and . For each generated sample, we compute a confidence interval for , , with , and . For , we use the one-dimensional confidence interval (10), while for , we use the interval (12) with .
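The sampling mechanism for a uniform distribution on a circle of radius $r$ can be sketched as follows (function name ours):

```python
import numpy as np

def uniform_circle(n, r=1.0, rng=None):
    # Draw n points uniformly from the circle of radius r centered at
    # the origin in R^2 (the one-dimensional circle, not the disk),
    # by sampling angles uniformly on [0, 2*pi).
    rng = np.random.default_rng(rng)
    ang = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return r * np.column_stack([np.cos(ang), np.sin(ang)])
```

Every sampled point lies exactly at distance $r$ from the origin, so the resulting measure is singular with respect to Lebesgue measure on the plane, yet its one-dimensional projections admit densities.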
For varying values of $r$, we report the average length of the confidence intervals across all 100 replications for and , in Figure 1. We do not report the coverage of the intervals, as they covered the true distance in at least 95% of the replications in every setting. For , it can be seen that the average length monotonically decreases with the parameter $r$. This matches our findings in Theorem 4 and the minimax lower bound in Theorem 6, which show that estimation rates near the null case $\mu = \nu$ are slower than away from the null, when . On the other hand, Figure 1 shows that the average length of the confidence intervals for is nearly the same for all values of $r$, as predicted by Theorem 5, which shows that the parametric estimation rate is achievable by the confidence interval in (12) when .
References
 Trimmed comparison of distributions. Journal of the American Statistical Association 103 (482), pp. 697–704. Cited by: §1.
 One-dimensional empirical measures, order statistics and Kantorovich transport distances. Preprint. Cited by: §1, §2, §2, Appendix.
 On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. In Annales de l'IHP Probabilités et Statistiques, Vol. 50, pp. 539–563. Cited by: §1.
 Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision 51 (1), pp. 22–45. Cited by: §1.
 Unidimensional and evolution methods for optimal transportation. Ph.D. Thesis, Paris 11. Cited by: §1.
 Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics 27 (3), pp. 642–669. Cited by: §3.1.
 On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162 (3-4), pp. 707–738. Cited by: §1.
 Assessing structural relationships between distributions: a quantile process approach based on Mallows distance. In Recent Advances and Trends in Nonparametric Statistics, pp. 123–137. Cited by: §1.
 A nonparametric test for similarity of marginals, with applications to the assessment of population bioequivalence. Journal of Statistical Planning and Inference 137 (3), pp. 697–711. Cited by: §1.
 Empirical regularized optimal transport: statistical theory and applications. arXiv preprint arXiv:1810.09880. Cited by: §1.
 Generalized Sliced Wasserstein distances. arXiv preprint arXiv:1902.00434. Cited by: §1.
 Optimal mass transport: signal processing and machine-learning applications. IEEE Signal Processing Magazine 34 (4), pp. 43–59. Cited by: §1.
 Tree-Sliced approximation of Wasserstein distances. arXiv preprint arXiv:1902.00342. Cited by: §1.
 Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv preprint arXiv:1804.10556. Cited by: §1.
 On the minimax optimality of estimating the Wasserstein metric. arXiv preprint arXiv:1908.10324. Cited by: §1, §1.
 The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pp. 1269–1283. Cited by: §3.1.
 Nonparametric validation of similar distributions and assessment of goodness of fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60 (1), pp. 223–241. Cited by: §1, §1.
 Statistical aspects of wasserstein distances. Annual review of statistics and its application. Cited by: §1.
 Subspace robust wasserstein distances. arXiv preprint arXiv:1901.08949. Cited by: §1.
 Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607. Cited by: §1.
 Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446. Cited by: §1.
 Limit laws of the empirical Wasserstein distance: Gaussian distributions. Journal of Multivariate Analysis 151, pp. 90–109. Cited by: §1.
 Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855. Cited by: §1, §1.
 Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (1), pp. 219–238. Cited by: §1.
 Empirical optimal transport on countable metric spaces: distributional limits and statistical applications. arXiv preprint arXiv:1707.00973. Cited by: §1.
 The Nature of Statistical Learning Theory. Springer Science & Business Media. Cited by: §3.1, §3.2.
 Topics in Optimal Transportation. American Mathematical Soc. Cited by: §1.
 Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087. Cited by: §1, §2.
 Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423–435. Cited by: Appendix B: Proof of Theorem 6.
Appendix
We provide the proofs of Theorems 4 and 6. The proof of Proposition 1 follows along similar lines as the results of Bobkov and Ledoux (2014), and the proof of Proposition 3 is straightforward. The proofs of Theorems 2 and 5 follow similar lines to those of Theorems 6 and 4, respectively.
Appendix A: Proof of Theorem 4
We provide a general proof of Theorem 4 for functions $L_n$ and $U_n$ not necessarily chosen using the DKW inequality. In what follows, we assume $L_n$ and $U_n$ are both differentiable, invertible with differentiable inverses, and respectively increasing and decreasing as functions of . Furthermore, given , let be a sequence (depending on the fixed level $\alpha$) such that for any we have
and for any we have , and
(15) 
The dependence of on is ignored in the notation for ease of readability. We now turn to the proof of Theorem 4.
Proof of Theorem 4.(i). Let . With probability at least , we have both
(16) 
and
(17) 
The derivations which follow will be carried out on the event that the above two inequalities are satisfied, which has probability at least .
We will first show that
A similar argument can be used to bound this expression with replaced by , and will lead to the claim.
Note that for all , by a Taylor expansion of the map , there exist and such that