
Minimax Confidence Intervals for the Sliced Wasserstein Distance

The Wasserstein distance has risen in popularity in the statistics and machine learning communities as a useful metric for comparing probability distributions. We study the problem of uncertainty quantification for the Sliced Wasserstein distance–an easily computable approximation of the Wasserstein distance. Specifically, we construct confidence intervals for the Sliced Wasserstein distance which have finite-sample validity under no assumptions or mild moment assumptions, and are adaptive in length to the smoothness of the underlying distributions. We also bound the minimax risk of estimating the Sliced Wasserstein distance, and show that the length of our proposed confidence intervals is minimax optimal over appropriate distribution classes. To motivate the choice of these classes, we also study minimax rates of estimating a distribution under the Sliced Wasserstein distance. These theoretical findings are complemented with a simulation study.


1 Introduction

The Wasserstein distance is a metric between probability distributions which has received a surge of interest in statistics and machine learning (Panaretos and Zemel, 2018; Kolouri et al., 2017). This distance is a special case of the optimal transport problem (Villani, 2003), and measures the work required to couple one distribution with another. Specifically, let $\mathcal{P}(\mathcal{X})$ denote the set of Borel probability measures supported on a set $\mathcal{X} \subseteq \mathbb{R}^d$, for some integer $d \geq 1$, and let $\mathcal{P}_p(\mathcal{X})$ denote the set of probability measures in $\mathcal{P}(\mathcal{X})$ with finite $p$-th moment, for some $p \geq 1$. Given $P, Q \in \mathcal{P}_p(\mathcal{X})$, the $p$-th order Wasserstein distance between $P$ and $Q$ is defined to be

$$W_p(P, Q) = \left( \inf_{\pi \in \Gamma(P, Q)} \int \|x - y\|^p \, d\pi(x, y) \right)^{1/p}, \qquad (1)$$

where $\Gamma(P, Q)$ denotes the set of joint probability distributions with marginals $P$ and $Q$, known as couplings. The minimizing coupling is called the optimal coupling between $P$ and $Q$. The norm $\|\cdot\|$ is taken to be Euclidean in this paper, but may more generally be replaced by any metric on $\mathcal{X}$.

Despite the popularity of Wasserstein distances in recent machine learning methodologies, their high computational complexity often limits their applicability to large-scale problems. Developing efficient numerical approximations of the distance remains an active research area; see Peyré et al. (2019) for a recent review. A key exception to the high computational cost is the univariate case, in which the Wasserstein distance admits a closed form as the $L^p$ distance between the quantile functions of $P$ and $Q$, which can be easily computed. This fact has led to the study of an alternate metric, known as the Sliced Wasserstein distance (Rabin et al., 2011; Bonneel et al., 2015), obtained by averaging the Wasserstein distance between one-dimensional projections of the distributions $P$ and $Q$. Specifically, let $\mathbb{S}^{d-1} = \{ \theta \in \mathbb{R}^d : \|\theta\| = 1 \}$, and let $\mu$ denote the uniform probability measure on $\mathbb{S}^{d-1}$. The $p$-th order Sliced Wasserstein distance between $P$ and $Q$ is given by

$$SW_p(P, Q) = \left( \int_{\mathbb{S}^{d-1}} W_p^p(P_\theta, Q_\theta) \, d\mu(\theta) \right)^{1/p} = \left( \int_{\mathbb{S}^{d-1}} \int_0^1 \big| F_{P_\theta}^{-1}(t) - F_{Q_\theta}^{-1}(t) \big|^p \, dt \, d\mu(\theta) \right)^{1/p},$$

where for any $\theta \in \mathbb{S}^{d-1}$, $P_\theta$ and $Q_\theta$ denote the respective probability distributions of $\theta^\top X$ and $\theta^\top Y$, where $X \sim P$ and $Y \sim Q$. Furthermore, $F_{P_\theta}$ is the cumulative distribution function (cdf) of $P_\theta$, and $F_{P_\theta}^{-1}$ denotes its quantile function (and similarly for $F_{Q_\theta}$ and $F_{Q_\theta}^{-1}$).

The Sliced Wasserstein distance is a generally weaker metric than the Wasserstein distance, as shown by the inequality $SW_p(P, Q) \leq W_p(P, Q)$ for all $P, Q \in \mathcal{P}_p(\mathbb{R}^d)$ (Bonnotte, 2013). It nonetheless preserves many qualitatively similar properties to the Wasserstein distance, making it an attractive, easily computable alternative in many applications. It is well known that the Wasserstein distance and its sliced analogue are sensitive to outliers and thick tails; thus, inspired by Munk and Czado (1998) and Álvarez-Esteban et al. (2008), we also define the $\delta$-trimmed Sliced Wasserstein distance, for a trimming constant $\delta \in [0, 1/2)$,

$$SW_{p,\delta}(P, Q) = \left( \int_{\mathbb{S}^{d-1}} \frac{1}{1 - 2\delta} \int_{\delta}^{1-\delta} \big| F_{P_\theta}^{-1}(t) - F_{Q_\theta}^{-1}(t) \big|^p \, dt \, d\mu(\theta) \right)^{1/p}. \qquad (2)$$
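To make the preceding definitions concrete, the following is a minimal sketch of a Monte Carlo approximation of the (optionally trimmed) Sliced Wasserstein distance between two empirical samples, using the closed form of the one-dimensional distance in terms of quantile functions. The function name, the number of projections, and the quantile grid are illustrative choices of ours, not part of the paper.

import numpy as np

def sliced_wasserstein(X, Y, p=1, n_proj=200, delta=0.0, n_grid=500, rng=None):
    """Monte Carlo estimate of the (delta-trimmed) Sliced Wasserstein distance
    between the empirical distributions of X (n x d) and Y (m x d)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Directions drawn uniformly from the unit sphere S^{d-1}.
    thetas = rng.normal(size=(n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    # Quantile levels; averaging over this grid approximates the normalized
    # integral over t in [delta, 1 - delta].
    ts = np.linspace(delta, 1.0 - delta, n_grid)
    total = 0.0
    for theta in thetas:
        qx = np.quantile(X @ theta, ts)  # empirical quantiles of the projections
        qy = np.quantile(Y @ theta, ts)
        total += np.mean(np.abs(qx - qy) ** p)
    return (total / n_proj) ** (1.0 / p)

For samples X and Y of shapes (n, d) and (m, d), sliced_wasserstein(X, Y, p=2, delta=0.1) returns an estimate of the trimmed distance in (2); setting delta=0 recovers the untrimmed distance.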

Despite the increasing applications of Wasserstein-type distances in machine learning and related fields, uncertainty quantification for these distances has received significantly less attention. In this paper, we derive confidence intervals for the Sliced Wasserstein distance which make either no assumptions or mild moment assumptions on the unknown distributions $P$ and $Q$. Specifically, given a pre-specified level $\alpha \in (0, 1)$ and iid samples $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$, we derive confidence sets $C_{n,m}$ such that

$$\mathbb{P}\big( SW_p(P, Q) \in C_{n,m} \big) \geq 1 - \alpha. \qquad (3)$$

Our proposed choices of $C_{n,m}$ are adaptive in length to the smoothness of the distributions $P$ and $Q$, and have optimal length in a minimax sense. To this end, we derive minimax rates for estimating the Sliced Wasserstein distance between two distributions, noting that the minimax length of a confidence interval is always bounded below by the corresponding minimax estimation rate. Specifically, we provide lower bounds on the minimax risk

$$R_{n,m}(\mathcal{F}) = \inf_{\hat{T}_{n,m}} \sup_{(P, Q) \in \mathcal{F}} \mathbb{E} \big| \hat{T}_{n,m} - SW_p(P, Q) \big|, \qquad (4)$$

where the infimum is over all estimators $\hat{T}_{n,m}$ of the Sliced Wasserstein distance based on $n$ samples from $P$ and $m$ samples from $Q$, and $\mathcal{F}$ is an appropriately chosen collection of pairs of distributions.

To motivate our choice of families $\mathcal{F}$, we begin by studying the minimax rates for estimating a distribution under the Sliced Wasserstein distance; that is, we begin by bounding the quantity

$$R_n(\mathcal{G}) = \inf_{\hat{P}_n} \sup_{P \in \mathcal{G}} \mathbb{E} \big[ SW_p(\hat{P}_n, P) \big], \qquad (5)$$

where the infimum is over all estimators $\hat{P}_n$ of a Borel probability distribution $P$ based on a sample of size $n$, and $\mathcal{G}$ is a given class of distributions. The rates we obtain for both problems (4) and (5) are dimension-independent, thus showing that estimation of distributions under the Sliced Wasserstein distance, and estimation of the Sliced Wasserstein distance between two distributions, are simpler statistical problems than their analogues for the Wasserstein distance (Singh and Póczos, 2018; Liang, 2019).

Our Contributions and Related Work. The Sliced Wasserstein distance is one of many approximations of the Wasserstein distance based on one-dimensional projections. We mention here the Generalized Sliced (Kolouri et al., 2019), Tree-Sliced (Le et al., 2019), and Subspace Robust (Paty and Cuturi, 2019) Wasserstein distances.

We are not aware of any existing work regarding statistical inference, in particular confidence intervals, for any of these distances, except in the special case when they all coincide with the one-dimensional Wasserstein distance. In this case, Munk and Czado (1998) study limiting distributions of the empirical (plug-in) Wasserstein distance estimator, and Freitag et al. (2003, 2007) establish sufficient conditions for the validity of the bootstrap in estimating the distribution of the empirical Wasserstein distance. These results, however, are only valid under certain smoothness assumptions on $P$ and $Q$, and imply different inferential procedures at the null $P = Q$ and away from the null $P \neq Q$. In contrast, the confidence intervals derived in the present paper are valid under either no assumptions or mild moment assumptions on $P$ and $Q$, and apply more generally to the Sliced Wasserstein distance in arbitrary dimension. Other inferential results for Wasserstein distances include those of Sommerfeld and Munk (2018), Tameling et al. (2017), and Klatt et al. (2018) when the supports of $P$ and $Q$ are finite or countable sets, and the work of Rippl et al. (2016) when $P$ and $Q$ only differ by a location-scale transformation.

To the best of our knowledge, the only existing results regarding the minimax rate of estimating the Wasserstein distance between two distributions are those of Liang (2019), which are restricted to the case $p = 1$. As we show below, the dependence of the minimax rate on $p$ is nontrivial, and in the special case $p = 1$ our work generalizes that of Liang (2019). The rate of estimating a distribution under the Wasserstein distance has received significantly more attention. Upper bounds on the convergence rate of the empirical measure have been established by Fournier and Guillin (2015), Boissard and Le Gouic (2014), Bobkov and Ledoux (2014), Weed and Bach (2018), and Lei (2018). Singh and Póczos (2018) obtain corresponding minimax lower bounds.

Paper Outline. The rest of this paper is organized as follows. Section 2 establishes minimax rates of estimating a distribution under the Sliced Wasserstein distance. In Section 3, we derive confidence intervals for the one-dimensional and Sliced Wasserstein distances, and we establish lower bounds on the minimax risk of estimating the Sliced Wasserstein distance in Section 4. We illustrate the performance of our confidence intervals via a brief simulation study in Section 5.

Notation. For any $a, b \in \mathbb{R}$, $a \vee b$ denotes the maximum of $a$ and $b$, and $a \wedge b$ denotes the minimum of $a$ and $b$. For any sequences of real numbers $(a_n)$ and $(b_n)$, we write $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n \leq C b_n$ for all $n \geq 1$, and we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. For any $x \in \mathbb{R}^d$, $\delta_x$ denotes the Dirac delta measure with mass at $x$.

2 Minimax Estimation of a Distribution under the Sliced Wasserstein Distance

Let $p \geq 1$, $\delta \in [0, 1/2)$, and let $X_1, \dots, X_n$ be an iid sample from a distribution $P \in \mathcal{P}_p(\mathbb{R}^d)$. Let $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ denote the corresponding empirical measure. In this section, we first establish upper bounds on the rate of convergence of $\mathbb{E}\big[ SW_{p,\delta}(P_n, P) \big]$, extending the comprehensive treatment by Bobkov and Ledoux (2014) of this quantity in dimension one. We then use these results to bound the minimax risk in (5) of estimating a distribution under the $\delta$-trimmed Sliced Wasserstein distance.

For any $\theta \in \mathbb{S}^{d-1}$, let $p_\theta$ denote the density of the absolutely continuous component of $P_\theta$, where recall that $P_\theta$ denotes the probability distribution of $\theta^\top X$ for $X \sim P$, with cdf $F_{P_\theta}$. Define the functional

$$J_{p,\delta}(P) = \int_{\mathbb{S}^{d-1}} \int_{F_{P_\theta}^{-1}(\delta)}^{F_{P_\theta}^{-1}(1-\delta)} \frac{\big[ F_{P_\theta}(x) \big( 1 - F_{P_\theta}(x) \big) \big]^{p/2}}{p_\theta(x)^{p-1}} \, dx \, d\mu(\theta), \qquad (6)$$

with the convention $0/0 = 0$. When $d = 1$, the outer integral over $\mathbb{S}^{d-1}$ is superfluous and is omitted, and in the untrimmed case $\delta = 0$, we omit the subscript $\delta$ and simply write $J_p$. When $d = 1$ and $\delta = 0$, Bobkov and Ledoux (2014) prove that $\mathbb{E}\big[ W_p(P_n, P) \big]$ decays at the standard rate $n^{-1/2}$ if and only if $J_p(P) < \infty$, and otherwise decays at most at the rate $n^{-1/(2p)}$ under mild moment assumptions. The following result generalizes their Theorems 5.3 and 7.16, showing that a similar conclusion is true of the $\delta$-trimmed Sliced Wasserstein distance, with respect to the functional $J_{p,\delta}$.

Proposition 1

Let $p \geq 1$, $\delta \in [0, 1/2)$, and $P \in \mathcal{P}_p(\mathbb{R}^d)$.

  • Suppose $J_{p,\delta}(P) < \infty$. Then, there exists a universal constant depending only on $p$ such that

  • We have, , where

It can be seen that a necessary condition for the finiteness of $J_{p,\delta}(P)$ is that, for $\mu$-almost every $\theta \in \mathbb{S}^{d-1}$, the density $p_\theta$ is supported on a (possibly infinite) interval. When $\delta > 0$, this condition is also sufficient, independently of $p$. On the other hand, when $\delta = 0$, the value of $J_p(P)$ also depends on the tail behaviour of $P$ and the value of $p$. For example, if $P$ is the standard Gaussian distribution, it can be shown that $J_{p,\delta}(P) < \infty$ whenever $\delta > 0$, and that $J_p(P) < \infty$ if and only if $p < 2$. On the other hand, if $d = 1$ and $P$ is the uniform distribution on a bounded interval, then $J_p(P) < \infty$ for every $p \geq 1$.

We now bound the minimax risk in (5). Consider the class of distributions

(7)

for some . It can be seen that , thus the results of Proposition 1 provide an upper bound on the worst-case risk under the Wasserstein distance, over the family . In view of this fact and the result of Theorem 1, it is natural to further distinguish our minimax bounds over the following two classes of distributions,

We now state the main result of this section.

Theorem 2

Let and . Then,

Theorem 2 shows that the minimax risk of estimating a distribution under the Sliced Wasserstein distance achieves the parametric rate $n^{-1/2}$ for distributions over which the functional $J_{p,\delta}$ is uniformly bounded, and the generally slower but dimension-independent rate $n^{-1/(2p)}$ otherwise, under mild moment assumptions. In particular, the empirical measure achieves both rates, as shown in Proposition 1. These results contrast with the dimension-dependent rates of estimating a distribution under the Wasserstein distance, which may generically decay at the rate $n^{-1/d}$ (see, for instance, Weed and Bach (2017)). In what follows, we show that parametric rates of convergence are also achievable for the problem of estimating the Sliced Wasserstein distance between two distributions, and we begin by deriving confidence intervals whose lengths decay at such a rate.
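The contrast between these two rates can be probed numerically. The sketch below, which reuses the sliced_wasserstein function defined earlier, approximates $\mathbb{E}\big[ SW_p(P_n, P) \big]$ for increasing sample sizes by comparing against a large reference sample standing in for $P$; the reference-sample device and all parameter values are illustrative choices of ours, not the paper's.

import numpy as np

def empirical_sw_rate(sampler, ns, p=1, n_rep=20, ref_size=20000, seed=0):
    """Monte Carlo approximation of E[SW_p(P_n, P)] for each n in ns.

    sampler(n, rng) should return an (n, d) array of iid draws from P; the
    distribution P itself is approximated by one large reference sample, so
    the output is only meaningful when n is much smaller than ref_size."""
    rng = np.random.default_rng(seed)
    reference = sampler(ref_size, rng)  # stand-in for the true distribution P
    out = []
    for n in ns:
        vals = [sliced_wasserstein(sampler(n, rng), reference, p=p)
                for _ in range(n_rep)]
        out.append(np.mean(vals))
    return out

# Illustrative example: a standard Gaussian in dimension 3.
gauss = lambda n, rng: rng.normal(size=(n, 3))
print(empirical_sw_rate(gauss, ns=[100, 400, 1600]))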

3 Confidence Intervals for the Sliced Wasserstein Distance

In this section, we propose several confidence intervals for the two-sample problem (3), which have finite-sample validity under at most mild moment assumptions, and which are adaptive to whether or not the functional $J_{p,\delta}$ in (6) is finite. We begin by constructing confidence intervals for the one-dimensional Wasserstein distance, and we then extend these results to the Sliced Wasserstein distance.

3.1 Confidence Intervals for the One-Dimensional Wasserstein Distance

Throughout this subsection, let $p \geq 1$ and $\delta \in [0, 1/2)$ be given, let $P, Q \in \mathcal{P}_p(\mathbb{R})$ be probability distributions with respective cdfs $F$ and $G$, and let $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$ be iid samples. Let $F_n$ and $G_m$ denote their corresponding empirical cdfs. We derive confidence intervals for the $\delta$-trimmed Wasserstein distance $W_{p,\delta}(P, Q)$ (that is, the distance in (2) in dimension one), with the following non-asymptotic coverage guarantee

$$\mathbb{P}\big( W_{p,\delta}(P, Q) \in C_{n,m} \big) \geq 1 - \alpha, \qquad (8)$$

for some pre-specified level $\alpha \in (0, 1)$, where the trimming constant $\delta$ may depend on the sample sizes $n$ and $m$. Our approach hinges on the fact that the one-dimensional Wasserstein distance may be expressed as the $L^p$ distance between the quantile functions of $P$ and $Q$, suggesting that a confidence interval may be derived via uniform control of the empirical quantile process. Specifically, let $(L_n, U_n)$ and $(L_m, U_m)$ be sequences of functions forming bands that contain $F$ and $G$, respectively, uniformly over $\mathbb{R}$ with prescribed probability.

For example, inversion of the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (Dvoretzky et al., 1956; Massart, 1990),

$$\mathbb{P}\Big( \sup_{x \in \mathbb{R}} \big| F_n(x) - F(x) \big| > t \Big) \leq 2 e^{-2 n t^2}, \quad t > 0, \qquad (9)$$

leads to the choice $L_n = \max\{F_n - r_{n,\alpha}, 0\}$ and $U_n = \min\{F_n + r_{n,\alpha}, 1\}$, where $r_{n,\alpha} = \sqrt{\log(4/\alpha)/(2n)}$. Scale-dependent choices of $L_n$ and $U_n$ may also be obtained, for instance, via the relative Vapnik-Chervonenkis (VC) inequality (Vapnik, 2013). We may then define the confidence interval

$$C_{n,m} = \left[ \left( \frac{1}{1 - 2\delta} \int_{\delta}^{1-\delta} \ell_{n,m}(t)^p \, dt \right)^{1/p}, \ \left( \frac{1}{1 - 2\delta} \int_{\delta}^{1-\delta} u_{n,m}(t)^p \, dt \right)^{1/p} \right], \qquad (10)$$

where for all $t \in (\delta, 1 - \delta)$, $\ell_{n,m}(t)$ and $u_{n,m}(t)$ denote the smallest and largest values of $|F^{-1}(t) - G^{-1}(t)|$ that are consistent with the quantile envelopes obtained by inverting the bands $(L_n, U_n)$ and $(L_m, U_m)$.

Proposition 3 (Coverage of $C_{n,m}$)

Let $\delta \in [0, 1/2)$, and let $L_n, U_n, L_m, U_m$ be such that the above band conditions hold. Then, the interval in equation (10) satisfies the coverage guarantee (8).
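As a concrete illustration of the DKW-based construction, the sketch below inverts a DKW band for each sample into a simultaneous quantile envelope, and combines the two envelopes into a plug-in interval for the trimmed one-dimensional Wasserstein distance by taking, at each quantile level, the smallest and largest quantile differences consistent with the bands. The exact form of the interval in (10) and the split of the level alpha are not fully recoverable from the text, so this is our own reading of the approach rather than the authors' estimator.

import numpy as np

def dkw_quantile_envelope(sample, ts, alpha):
    """Envelope [lo(t), hi(t)] containing the population quantiles F^{-1}(t),
    simultaneously over t, with probability at least 1 - alpha; obtained by
    inverting the DKW band F_n(x) +/- sqrt(log(2/alpha) / (2n))."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    ecdf = np.arange(1, n + 1) / n
    lo = np.full(len(ts), -np.inf)
    hi = np.full(len(ts), np.inf)
    for i, t in enumerate(ts):
        if t - eps > 0:            # invert the upper cdf band
            lo[i] = x[np.searchsorted(ecdf, t - eps)]
        if t + eps <= 1:           # invert the lower cdf band
            hi[i] = x[np.searchsorted(ecdf, t + eps)]
    return lo, hi

def wasserstein_ci_1d(xs, ys, p=1, delta=0.1, alpha=0.05, n_grid=1000):
    """Plug-in confidence interval for the delta-trimmed one-dimensional
    Wasserstein distance, from two DKW envelopes at level alpha / 2 each."""
    ts = np.linspace(delta, 1.0 - delta, n_grid)
    lo_x, hi_x = dkw_quantile_envelope(xs, ts, alpha / 2)
    lo_y, hi_y = dkw_quantile_envelope(ys, ts, alpha / 2)
    # Smallest and largest pointwise quantile gaps consistent with the bands.
    gap_lo = np.maximum(0.0, np.maximum(lo_x - hi_y, lo_y - hi_x))
    gap_hi = np.maximum(np.abs(hi_x - lo_y), np.abs(hi_y - lo_x))
    return np.mean(gap_lo ** p) ** (1 / p), np.mean(gap_hi ** p) ** (1 / p)

When the trimming constant delta is smaller than the DKW band width, the upper envelope (and hence the upper endpoint) can be infinite, reflecting that the extreme quantiles are not identified at the given sample size.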

Although the coverage of $C_{n,m}$ requires no assumptions on $P$ and $Q$, we now show that its length is adaptive to whether or not the functionals $J_{p,\delta}(P)$ and $J_{p,\delta}(Q)$ are finite. In what follows, we state our results regarding the length of $C_{n,m}$ assuming $L_n, U_n, L_m, U_m$ are chosen based on the DKW inequality in (9), but more general proofs are available in the supplement.

Let . For any function , we write , and

Finally, let . We then have the following result.

Theorem 4 (Length of )

Let satisfy the conditions of Proposition 3. Then,

  1. There exists a universal constant such that for every ,

    with probability at least , where

    and

  2. Assume . Assume further that the measures and have no singular components. Then, there exists a universal constant such that for all ,

    with probability at least where

    and are the respective densities of the absolutely continuous components of and .

When the trimming constant is strictly positive and does not depend on or , the quantities and are constant for large enough and . In this case, a careful investigation of Theorem 4 shows that, whenever for some positive constant , regardless of the values of . On the other hand, in the finite-sample regime where , and in particular when , the minimax lower bounds derived in Section 2 suggest that the standard rate is not achievable when . Indeed, Theorem 4 implies

3.2 Confidence Intervals for the Sliced Wasserstein Distance

We now derive a confidence interval for $SW_{p,\delta}(P, Q)$, where $P, Q \in \mathcal{P}_p(\mathbb{R}^d)$. In analogy to the previous subsection, an immediate first approach is obtained by choosing functions $L_n(\cdot, \theta)$ and $U_n(\cdot, \theta)$ such that

$$\mathbb{P}\Big( L_n(x, \theta) \leq F_{P_\theta}(x) \leq U_n(x, \theta), \ \text{for all } x \in \mathbb{R} \text{ and } \theta \in \mathbb{S}^{d-1} \Big) \geq 1 - \alpha, \qquad (11)$$

where $F_{P_\theta}$ denotes the cdf of the projected distribution $P_\theta$, for all $\theta \in \mathbb{S}^{d-1}$. Such a bound can be obtained, for instance, by an application of the VC inequality (Vapnik, 2013) over the set of half-spaces. An assumption-free confidence interval for $SW_{p,\delta}(P, Q)$ with finite-sample coverage may then be constructed by following the same lines as in the previous section. Due to the uniformity of (11) over the unit sphere, however, it can be seen that the length of such an interval is necessarily dimension-dependent. In what follows, we instead show that it is possible to obtain a dimension-independent confidence interval by exploiting the fact that the Sliced Wasserstein distance is a mean with respect to the distribution $\mu$.

Let $\theta_1, \dots, \theta_N$ be an iid sample from the distribution $\mu$, for some integer $N \geq 1$, and consider the following Monte Carlo approximation of the Sliced Wasserstein distance between distributions $P$ and $Q$,

$$\widehat{SW}_{p,\delta}(P, Q) = \left( \frac{1}{N} \sum_{j=1}^{N} W_{p,\delta}^p\big( P_{\theta_j}, Q_{\theta_j} \big) \right)^{1/p}.$$

For any $j = 1, \dots, N$, let $C_j$ be a confidence interval for $W_{p,\delta}(P_{\theta_j}, Q_{\theta_j})$, as constructed in the previous section, and let

(12)

In analogy to the quantities , and in the statement of Theorem 4, we write their averages with respect to the distributions and , , as,

We then have the following result, which shows that the length of the interval in (12) decays at the same rate as the one-dimensional confidence interval (10), up to a polylogarithmic factor in $N$.

Theorem 5

Let , and assume the same conditions on as in Theorem 1.

  1. For all , and for any finite constant ,

  2. Suppose . Then, there exists a universal constant such that for all , we have,

    with probability at least . Furthermore, if and have no singular components for -almost every , and if , then with probability at least ,
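To show how the per-direction intervals can be aggregated, the sketch below averages the p-th powers of the one-dimensional endpoints over N random projections, reusing wasserstein_ci_1d from the previous sketch. It omits the additional Monte Carlo error term that a finite N would require for exact finite-sample coverage, and the level choices are illustrative, so it should be read as a simplified approximation of the interval in (12) rather than its exact form.

import numpy as np

def sliced_wasserstein_ci(X, Y, p=1, delta=0.1, alpha=0.05, n_proj=100, rng=None):
    """Approximate confidence interval for the trimmed Sliced Wasserstein
    distance, obtained by averaging per-direction one-dimensional intervals."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    lowers, uppers = [], []
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        lo, hi = wasserstein_ci_1d(X @ theta, Y @ theta, p=p, delta=delta, alpha=alpha)
        lowers.append(lo ** p)
        uppers.append(hi ** p)
    return np.mean(lowers) ** (1 / p), np.mean(uppers) ** (1 / p)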

4 Minimax Lower Bounds

In this section, we provide lower bounds on the minimax rates for estimating the Sliced Wasserstein distance between two distributions, as defined in (4). It should be noted that the upper bounds on the (worst-case) length of the confidence intervals for the Sliced Wasserstein distance, derived previously, provide an upper bound on the minimax risk (4). Recall the class defined in (7). Motivated by (5), we define

We lower bound the minimax risk of estimating the Sliced Wasserstein distance over each of these classes. The upper bounds on the length of the interval (12) in Theorem 5 match these lower bounds (up to a polylogarithmic factor in the number of Monte Carlo iterations $N$), provided the trimming constant $\delta$ is fixed.

Theorem 6

Let , , and .

  1. Given , we have

    (13)

    In particular, if does not depend on and , we have

    (14)
  2. We have, .

5 Simulation Study

We conduct a brief simulation study to illustrate the adaptivity of our confidence intervals to whether or not the functional $J_{p,\delta}$ is finite, and to the distance between the distributions being compared. In all simulations below, we generate 100 samples of varying size $n$, from the pair of distributions

and,

for varying values of , where denotes the uniform probability distribution on the circle centered at the origin and with radius in , where we choose . It can be seen that , while , for all and . For each generated sample, we compute a -confidence interval for , , with , and . For , we use the one-dimensional confidence interval (10), while for , we use the interval (12) with .

Figure 1: Average confidence interval length under the two pairs of distributions and , for varying values of the parameter .

For varying values of the parameter, we report the average length of the confidence intervals across all 100 replications for and , in Figure 1. We do not report the coverage of the intervals, as it was always above 95%. For , it can be seen that the average length monotonically decreases with the parameter. This matches our findings in Theorem 4 and the minimax lower bound in Theorem 6, which show that estimation rates near the null case are slower than away from the null, when . On the other hand, Figure 1 shows that the average length of the confidence intervals for is nearly the same for all values of the parameter, as predicted by Theorem 5, which shows that the parametric estimation rate is achievable by the confidence interval in (12) when .
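For readers who wish to run an experiment of this flavour, the sketch below carries out a scaled-down version of the study using the functions defined earlier. The pair of distributions (uniform distributions on circles of different radii in the plane), the number of replications, and all tuning values are illustrative stand-ins, since the exact configurations used above are not recoverable from the text.

import numpy as np

def sample_circle(n, radius, rng):
    """n iid points drawn uniformly from the circle of the given radius in R^2."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return radius * np.column_stack((np.cos(angles), np.sin(angles)))

rng = np.random.default_rng(0)
n, n_rep = 500, 20                       # scaled-down illustrative values
for radius in [1.0, 1.5, 2.0]:           # plays the role of the varying parameter
    lengths = []
    for _ in range(n_rep):
        X = sample_circle(n, 1.0, rng)
        Y = sample_circle(n, radius, rng)
        lo, hi = sliced_wasserstein_ci(X, Y, p=1, delta=0.1, alpha=0.05, n_proj=50)
        lengths.append(hi - lo)
    print(radius, np.mean(lengths))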

References

  • P. C. Álvarez-Esteban, E. Del Barrio, J. A. Cuesta-Albertos, and C. Matrán (2008) Trimmed comparison of distributions. Journal of the American Statistical Association 103 (482), pp. 697–704.
  • S. Bobkov and M. Ledoux (2014) One-dimensional empirical measures, order statistics and Kantorovich transport distances. Preprint.
  • E. Boissard and T. Le Gouic (2014) On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. Annales de l'IHP Probabilités et Statistiques 50, pp. 539–563.
  • N. Bonneel, J. Rabin, G. Peyré, and H. Pfister (2015) Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision 51 (1), pp. 22–45.
  • N. Bonnotte (2013) Unidimensional and evolution methods for optimal transportation. Ph.D. Thesis, Paris 11.
  • A. Dvoretzky, J. Kiefer, and J. Wolfowitz (1956) Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics 27 (3), pp. 642–669.
  • N. Fournier and A. Guillin (2015) On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162 (3-4), pp. 707–738.
  • G. Freitag, A. Munk, and M. Vogt (2003) Assessing structural relationships between distributions: a quantile process approach based on Mallows distance. In Recent Advances and Trends in Nonparametric Statistics, pp. 123–137.
  • G. Freitag, C. Czado, and A. Munk (2007) A nonparametric test for similarity of marginals, with applications to the assessment of population bioequivalence. Journal of Statistical Planning and Inference 137 (3), pp. 697–711.
  • M. Klatt, C. Tameling, and A. Munk (2018) Empirical regularized optimal transport: statistical theory and applications. arXiv preprint arXiv:1810.09880.
  • S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. K. Rohde (2019) Generalized sliced Wasserstein distances. arXiv preprint arXiv:1902.00434.
  • S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde (2017) Optimal mass transport: signal processing and machine-learning applications. IEEE Signal Processing Magazine 34 (4), pp. 43–59.
  • T. Le, M. Yamada, K. Fukumizu, and M. Cuturi (2019) Tree-sliced approximation of Wasserstein distances. arXiv preprint arXiv:1902.00342.
  • J. Lei (2018) Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. arXiv preprint arXiv:1804.10556.
  • T. Liang (2019) On the minimax optimality of estimating the Wasserstein metric. arXiv preprint arXiv:1908.10324.
  • P. Massart (1990) The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pp. 1269–1283.
  • A. Munk and C. Czado (1998) Nonparametric validation of similar distributions and assessment of goodness of fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60 (1), pp. 223–241.
  • V. M. Panaretos and Y. Zemel (2018) Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application.
  • F. Paty and M. Cuturi (2019) Subspace robust Wasserstein distances. arXiv preprint arXiv:1901.08949.
  • G. Peyré and M. Cuturi (2019) Computational optimal transport. Foundations and Trends in Machine Learning 11 (5-6), pp. 355–607.
  • J. Rabin, G. Peyré, J. Delon, and M. Bernot (2011) Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446.
  • T. Rippl, A. Munk, and A. Sturm (2016) Limit laws of the empirical Wasserstein distance: Gaussian distributions. Journal of Multivariate Analysis 151, pp. 90–109.
  • S. Singh and B. Póczos (2018) Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855.
  • M. Sommerfeld and A. Munk (2018) Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (1), pp. 219–238.
  • C. Tameling, M. Sommerfeld, and A. Munk (2017) Empirical optimal transport on countable metric spaces: distributional limits and statistical applications. arXiv preprint arXiv:1707.00973.
  • V. Vapnik (2013) The Nature of Statistical Learning Theory. Springer Science & Business Media.
  • C. Villani (2003) Topics in Optimal Transportation. American Mathematical Society.
  • J. Weed and F. Bach (2017) Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087.
  • J. Weed and F. Bach (2018) Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087.
  • B. Yu (1997) Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423–435.

Appendix

We provide the proofs of Theorems 4 and 6. The proof of Proposition 1 follows along similar lines as the results of Bobkov and Ledoux (2014), and the proof of Proposition 3 is straightforward. The proofs of Theorems 2 and 5 follow similar lines as the proofs of Theorems 6 and 4 respectively.

Appendix A: Proof of Theorem 4

We provide a general proof of Theorem 4 for functions $L_n$ and $U_n$ not necessarily chosen using the DKW inequality. In what follows, we assume $L_n$ and $U_n$ are both differentiable, invertible with differentiable inverses, and respectively increasing and decreasing as functions of . Furthermore, given $\alpha \in (0, 1)$, let be a sequence (depending on the fixed level $\alpha$) such that for any we have

and for any we have , and

(15)

The dependence of on is ignored in the notation for ease of readability. We now turn to the proof of Theorem 4.

Proof of Theorem 4.(i). Let . With probability at least , we have both

(16)

and,

(17)

The derivations which follow will be carried out on the event that the above two inequalities are satisfied, which has probability at least .

We will first show that

A similar argument can be used to bound this expression with replaced by , and will lead to the claim.

Note that for all , by Taylor expansion of the map , there exists and such that