The Wasserstein distance [vaserstein1969markov] has attracted significant attention as a criterion for the discrepancy between two probability measures. It is based on an intuitive mathematical formulation of optimal transportation [villani2008optimal]. Recently, the notion of transportation has been found to have suitable properties for pattern analysis and feature extraction from data. In particular, when two data distributions have disjoint supports, as is common with real images, the Wasserstein distance can still measure their discrepancy, while common measures such as the Kullback-Leibler divergence fail. Owing to this advantage, the Wasserstein distance is utilized extensively in machine learning and related fields, for example in generative models [arjovsky2017wasserstein, frogner2015learning], medical image analysis [ruttenberg2013quantifying], genomics [evans2012phylogenetic], and computer vision [ni2009local].
Despite its significance, statistical inference (e.g. a goodness-of-fit test) with the Wasserstein distance has been severely restricted. A goodness-of-fit test is a fundamental and important method to rigorously evaluate the uncertainty of observed distributions, and it has been studied extensively with various distances such as the Kullback-Leibler divergence [massey1951kolmogorov, vasicek1976test, song2002goodness, lloyd2015statistical]. However, statistical inference with the Wasserstein distance is available only for univariate data [munk1998nonparametric, del2000contributions, ramdas2017wasserstein, bernton2017inference, del2019central] or discrete-valued data [sommerfeld2018inference, tameling2017empirical, bigot2017central]. For general multivariate data, a strong assumption, such as Gaussianity of the data, is required to develop inference methods [rippl2016limit, del2019central]. The difficulty of statistical inference comes from an obscure limit distribution of the Wasserstein distance, and addressing this difficulty has been an important open question (described in Section 3 of the review paper [panaretos2018statistical]).
In this paper, we solve the difficulty by developing a non-asymptotic Gaussian approximation [chernozhukov2013gaussian, chernozhukov2014gaussian]
for the empirical Wasserstein distance. Intuitively, we approximate the distance by a supremum of a Gaussian process and prove that it is a consistent approximator. Importantly, the approximation does not require a limit distribution, hence we can avoid the problem of an unavailable limit. As a consequence, we can evaluate the uncertainty of the empirical Wasserstein distance on general multivariate data without strong restrictions. For practical use, we also propose an efficient multiplier bootstrap algorithm to calculate the approximator. Based on the approximation scheme, we develop a goodness-of-fit test which can control the type I error at an arbitrary level, and we also demonstrate its performance by several experiments.
We summarize the contributions of this paper as follows:
Develop an approximation scheme for the empirical measure with the Wasserstein distance. It is applicable to general multivariate data without strong assumptions, and it solves an open question for statistical inference with the Wasserstein distance.
Provide a multiplier bootstrap method for the proposed inference methods, which is computationally more efficient than other bootstrap and numerical methods.
Develop one-sample and two-sample hypothesis test schemes and a model selection algorithm based on the Wasserstein distance. To our knowledge, this is the first study to develop such tests for general multivariate data.
Notation: The $j$-th element of a vector $b$ is denoted by $b_j$, and $\|b\|_q$ is the $\ell_q$-norm ($q \in [1, \infty]$). For a matrix $A$, $\|A\|_q := \|\mathrm{vec}(A)\|_q$, where $\mathrm{vec}(\cdot)$ is a vectorization operator. $\delta_x$ is the Dirac measure at $x$. For $a, b \in \mathbb{R}$, $a \vee b := \max\{a, b\}$. All proofs of theorems and lemmas are deferred to the supplementary material.
2. Problem and Preparation
Firstly, we provide a formal definition of the Wasserstein distance, which is a distance between probability measures defined through transportation between the measures. Let $(\mathcal{X}, d)$ be a complete metric space with metric $d$. The Wasserstein distance of order $p \in [1, \infty)$ between two Borel probability measures $\mu$ and $\nu$ is defined as
$$ W_p(\mu, \nu) := \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p \, d\pi(x, y) \right)^{1/p}, $$
where $\Pi(\mu, \nu)$ is the set of all Borel probability measures on $\mathcal{X} \times \mathcal{X}$ with marginals $\mu$ and $\nu$.
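For finitely supported measures, the infimum above is a linear program over transport plans. A minimal sketch (the function name is ours) that solves the Kantorovich problem directly with `scipy`:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(mu, nu, cost):
    """Order-1 Wasserstein distance between two discrete measures.

    mu, nu : probability vectors of lengths m and n.
    cost   : (m, n) matrix of ground-metric costs d(x_i, y_j).
    Solves the Kantorovich linear program min <cost, pi>
    subject to pi having marginals mu and nu.
    """
    m, n = len(mu), len(nu)
    A_eq = []
    for i in range(m):                      # row-sum constraints: sum_j pi_ij = mu_i
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):                      # column-sum constraints: sum_i pi_ij = nu_j
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([mu, nu])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```

For example, two uniform measures on $\{0, 1\}$ and $\{1, 2\}$ with the absolute-value ground metric have distance $1$, since every unit of mass must travel distance $1$.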
With the definition, a formal problem statement of this paper is as follows.
Problem Formulation: Our aim is to approximate the empirical Wasserstein distance. Let $P$ be a probability measure on a sample space $\mathcal{X}$, and let $X_1, \dots, X_n$ be a set of independent and identically distributed observations from $P$. Let $P_n := n^{-1} \sum_{i=1}^n \delta_{X_i}$ be the empirical probability measure. Needless to say, $W_p(P_n, P)$ is a random variable due to the randomness of $X_1, \dots, X_n$. We are interested in approximating the distribution of $W_p(P_n, P)$ by a tractable random variable. Rigorously, our aim is to find a random variable $Z$ and a scaling sequence $r_n$ such that
$$ r_n W_p(P_n, P) - Z \to 0 $$
as $n \to \infty$. Such an approximation for $W_p(P_n, P)$ is necessary for various statistical inference methods such as a goodness-of-fit test and confidence analysis.
Dual form of the Wasserstein distance: The Wasserstein distance of order $1$ has the following duality [villani2008optimal]:
$$ W_1(\mu, \nu) = \sup_{f \in \mathcal{F}_{\mathrm{Lip}}} \int_{\mathcal{X}} f \, d\mu - \int_{\mathcal{X}} f \, d\nu, \qquad (1) $$
where $\mathcal{F}_{\mathrm{Lip}}$ is the set of Lipschitz-continuous functions on $\mathcal{X}$ whose Lipschitz constants are at most $1$.
Functions by Deep Neural Networks:
We provide a class of functions of the form of deep neural networks (DNNs). Let $\mathcal{F}_{\mathrm{DNN}}$ be a class of functions realized by DNNs with $L$ hidden layers and at most $S$ non-zero parameters (edges). Namely, let $A_\ell$ and $b_\ell$ be the matrix and vector parameters of the $\ell$-th layer, and let $\sigma(x) := \max\{x, 0\}$ be the activation applied elementwise. Then $\mathcal{F}_{\mathrm{DNN}}$ is the set of functions of the form
$$ f(x) = A_{L+1} \, \sigma\bigl( A_L \, \sigma( \cdots \sigma( A_1 x + b_1 ) \cdots ) + b_L \bigr) + b_{L+1}, $$
such that the total number of non-zero entries of the parameters is at most $S$.
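As an illustration, the forward pass of such a network takes only a few lines; the function names are ours and the ReLU activation is an assumption of this sketch:

```python
import numpy as np

def relu(x):
    """Elementwise ReLU activation sigma(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

def dnn_forward(x, weights, biases):
    """Evaluate f(x) = A_{L+1} relu(... relu(A_1 x + b_1) ...) + b_{L+1}.

    weights : list of matrices A_1, ..., A_{L+1} (the non-zero parameters).
    biases  : list of vectors b_1, ..., b_{L+1}.
    The last layer is affine, without an activation.
    """
    h = x
    for A, b in zip(weights[:-1], biases[:-1]):
        h = relu(A @ h + b)
    return weights[-1] @ h + biases[-1]
```

For instance, a network with one hidden unit, `weights = [[[1, -1]], [[2]]]` and `biases = [[0], [1]]`, maps the input `[3, 1]` to `2 * relu(3 - 1) + 1 = 5`.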
2.2. Related Work and Difficulty of General Case
There are numerous studies on statistical inference with the Wasserstein distance in restricted settings. However, statistical inference for general multivariate data has remained unsolved, and our study addresses this question. We briefly review the related studies and describe a source of the difficulty.
Univariate case: When the $X_i$'s are univariate (i.e. $\mathcal{X} \subseteq \mathbb{R}$), the Wasserstein distance is described by inverses of distribution functions. Let $F$ be the distribution function of the data and $F_n$ be its empirical counterpart; then one can derive $W_p(P_n, P)^p = \int_0^1 |F_n^{-1}(t) - F^{-1}(t)|^p \, dt$ (see [ambrosio2008gradient]). Several studies [munk1998nonparametric, del1999tests, ramdas2017wasserstein, del2019central] derive an asymptotic distribution of the scaled distance as a limit of the quantile process. Another study [bernton2017inference] develops an inference method based on an extended version of the limit distribution. Obviously, the inverses $F^{-1}$ and $F_n^{-1}$ are valid only in the univariate case, hence this approach is not applicable to multivariate cases.
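For two equal-size univariate samples, the quantile representation above reduces to sorting both samples and averaging the $p$-th powers of the coordinatewise gaps. A minimal sketch (the function name is ours):

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """Order-p Wasserstein distance between two equal-size univariate samples.

    Uses the inverse-CDF representation: for empirical measures with n atoms
    each, W_p(P_n, Q_n)^p = (1/n) * sum_i |x_(i) - y_(i)|^p, where x_(i) and
    y_(i) denote the sorted samples (the empirical quantile functions).
    """
    xs, ys = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    return float(np.mean(np.abs(xs - ys) ** p)) ** (1.0 / p)
```

For example, the samples `[0, 1]` and `[2, 1]` have $W_1 = 1$: after sorting, each atom travels distance $1$.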
Discrete case: When the $X_i$'s are discrete random variables (i.e. $\mathcal{X}$ is a finite space), a study [sommerfeld2018inference] shows that the scaled empirical Wasserstein distance converges in distribution to the supremum of a centered Gaussian random variable over a suitable convex set. Some works [tameling2017empirical, bigot2017central] inherit this result and develop several inference methods.
Other cases: For the Wasserstein distance of order $2$ (hereafter $W_2$), a study [rippl2016limit] shows that the scaled $W_2$ distance converges to a Gaussian distribution when the data distribution is also Gaussian. For a general distribution, another study [del2019central] shows that the distance, centered by its expectation, converges to a Gaussian distribution, but the centering constant is generally unknown.
General case: For multivariate data with dimension $d \ge 2$, a limit distribution of the empirical Wasserstein distance is not available, hence it is difficult to evaluate its uncertainty. By the dual form (1), the empirical Wasserstein distance can be regarded as an empirical process indexed by the class of $1$-Lipschitz functions. However, by empirical process theory [vdVW1996], this class is too broad when $d \ge 2$, and thus the empirical process does not converge to a known limit distribution. Namely, the scaled empirical Wasserstein distance is not guaranteed to have a tractable limit distribution as $n \to \infty$. This is described by the theory of the Donsker class: the limit is intractable since the Lipschitz class does not belong to the Donsker class when $d \ge 2$ (summarized in Chapter 2 of [vdVW1996]). It is shown in Figure 1. We also note that existing bootstrap approaches (e.g. the empirical bootstrap) are not validated either, due to the same limit problem.
3. Non-Asymptotic Approximation
3.1. Approximation Theory
Our approximation scheme consists of three steps: i) approximate the Lipschitz class in (1) by DNNs, ii) develop a non-asymptotic Gaussian approximation for the resulting empirical process, and iii) combine them.
Step i. Approximation by DNNs
To calculate the supremum over the Lipschitz class in (1), we represent the Lipschitz functions by DNNs. We define a restricted functional class of DNNs as $\mathcal{F}^* := \{f \in \mathcal{F}_{\mathrm{DNN}} : f \text{ is } 1\text{-Lipschitz}\}$. Then we define an approximated Wasserstein distance as
$$ \widetilde{W}(\mu, \nu) := \sup_{f \in \mathcal{F}^*} \int_{\mathcal{X}} f \, d\mu - \int_{\mathcal{X}} f \, d\nu \qquad (2) $$
with given $L$ and $S$. In the following theorem, we show that $\widetilde{W}$ is arbitrarily close to $W_1$ for any pair of measures as $S$ increases with finite $L$.
For any probability measures $\mu$ and $\nu$ on $\mathcal{X}$, and for all $L$ and $S$, there exists a constant such that
We utilize DNNs for approximating the Lipschitz class for the following reasons. The approximation by DNNs is often employed in the field of generative models, especially for the Wasserstein-GAN [arjovsky2017wasserstein], hence several convenient computational algorithms have been developed, such as spectral normalization [miyato2018spectral]. Also, DNNs work well theoretically and computationally with high-dimensional data, unlike basis-function approaches such as Fourier and wavelet bases.
Step ii. Non-Asymptotic Gaussian approximation
As a key idea of this study, we develop a non-asymptotic Gaussian approximation by applying the approximation scheme of [chernozhukov2013gaussian, chernozhukov2014gaussian]. The scheme approximates the supremum of an empirical process on an index set by a tractable supremum of a Gaussian process. Importantly, the approximation scheme does not require a limit distribution of the empirical process.
We provide an overview of the scheme by [chernozhukov2014gaussian]. Let $Z$ be a random variable given by the supremum of an empirical process,
$$ Z = \sup_{f \in \mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^n \bigl( f(X_i) - \mathbb{E}[f(X)] \bigr), \qquad (3) $$
with some set of functions $\mathcal{F}$. To approximate $Z$, the scheme derives a stochastic process with zero mean and known covariance. Then, the scheme approximates the random variable $Z$ by a supremum of this process on $\mathcal{F}$. Theorem 2.1 in [chernozhukov2014gaussian] proves
holds for each $n$ with a constant and with probability approaching $1$. Here, the approximator does not depend on a limit distribution (note that it does not have to be a limit of the empirical process), hence the result holds regardless of a limit distribution of $Z$.
Table 1. Overview of our approximation strategy: Wasserstein distance → DNN Approx. (2) → Gaussian Approx. (4) → Algorithm 1.
We extend the scheme and approximate the empirical Wasserstein distance with DNNs
To approximate it, we derive a suitable stochastic process on the DNN class. We introduce a scaled Gaussian process with zero mean and a known covariance function. Rigorously, for any pair of functions in the class, the process yields a Gaussian random variable whose mean is zero and whose covariance matches that of the pair under the data distribution, up to a scaling term. Then, we define the following random variable
as an approximator for the scaled empirical Wasserstein distance. Its validity is proved in the following lemma:
Lemma 2 (Non-Asymptotic Approximation for ).
Set , and as as . Then, we obtain
where and is an existing constant.
Lemma 2 states that the approximator behaves sufficiently close to the scaled empirical Wasserstein distance with probability approaching $1$. The stated scaling guarantees that the bound converges to $0$ as $n$ increases. Note that the result holds without a limit distribution, hence we avoid the existence problem of limit distributions discussed in Section 2.2.
Step iii: Distributional Approximation
We combine the previous results and evaluate how well the approximator captures the empirical Wasserstein distance. Namely, we specify the network size to balance the approximation error of DNNs against that of the non-asymptotic Gaussian approximation. The result is provided in the following theorem:
Theorem 1 (Non-Asymptotic Approximation for ).
Set and satisfying with finite constants . Then, we have
where is an existing constant.
Here, the multiplier is a scaling sequence. Theorem 1 states that the approximator captures the randomness of the scaled empirical Wasserstein distance with arbitrary accuracy as $n \to \infty$. Note that the result is valid without utilizing a limit distribution, similarly to Lemma 2. Also note that the convergence rate in Theorem 1 is reasonably fast, because it is polynomial in $n$ although we handle an infinite-dimensional parameter.
To obtain a more convenient formulation, we introduce an assumption on .
Assumption 1 (Continuous CDF of ).
With the sample size and the data distribution fixed, the cumulative distribution function of the empirical Wasserstein distance is continuous.
It is not easy to verify this assumption, but the continuity of the Wasserstein distance (Corollary 6.11 of [villani2008optimal]) is helpful for understanding it. Further, we numerically validate this assumption and show that it empirically holds in Section C.1.
With the assumption, we can measure the performance of approximation by in terms of the Kolmogorov distance as follows:
3.2. Algorithm: Multiplier bootstrap
In this subsection, we provide an efficient algorithm to calculate the derived random variable. Though it is tractable, computing the supremum of a Gaussian process is computationally demanding in general, hence a fast algorithm is valuable for practical use.
The proposed algorithm employs a multiplier bootstrap method which generates a bootstrap Gaussian process with Gaussian random variables as multipliers [chernozhukov2015comparison]. We generate multiplier Gaussian variables which are independent of the data. Then, we define a bootstrap process
The convergence of the bootstrap process is provided by the following theorem. For the subsequent discussion of tests, we introduce Assumption 1 and present our result in terms of the Kolmogorov distance.
Theorem 2 (Validity of Gaussian Multiplier Bootstrap).
Theorem 2 shows that the conditional distribution of the bootstrap process given the data can approximate the distribution of the target random variable. The result is obtained by combining all the approximators developed in Section 3.1. We provide Table 1 as an overview of our strategy for approximating the distribution of the empirical Wasserstein distance.
From a computational aspect in terms of the sample size, Algorithm 1 is sufficiently faster than empirical bootstrap and sub-sampling bootstrap methods; the latter require many subsets of sub-samples to obtain the same accuracy as our method.
We summarize the multiplier bootstrap method in Algorithm 1. Here, the number of bootstrap iterations is a hyper-parameter.
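The core multiplier step can be sketched as follows, assuming the critic class has been reduced to a finite set of trained DNN critics evaluated on the sample (the function name and interface are ours, not the paper's):

```python
import numpy as np

def multiplier_bootstrap(F_values, B=1000, rng=None):
    """Multiplier-bootstrap draws of the supremum of an empirical process.

    F_values : (n, K) array; column k holds f_k(X_1), ..., f_k(X_n) for a
               finite set of candidate critic functions f_1, ..., f_K
               (in the paper these would be DNN critics standing in for
               the Lipschitz class).
    Returns B draws of max_k n^{-1/2} * sum_i xi_i * (f_k(X_i) - mean_k),
    where the i.i.d. N(0, 1) multipliers xi_i are shared across k within
    each draw, so the correlation structure over critics is preserved.
    """
    rng = np.random.default_rng(rng)
    n, K = F_values.shape
    centered = F_values - F_values.mean(axis=0, keepdims=True)
    draws = np.empty(B)
    for b in range(B):
        xi = rng.standard_normal(n)                 # Gaussian multipliers
        draws[b] = (xi @ centered / np.sqrt(n)).max()
    return draws
```

Each draw costs one matrix-vector product, which is why the multiplier bootstrap avoids refitting the critics on resampled data as an empirical bootstrap would.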
4. Applications to Hypothesis Testing
We develop one-sample and two-sample tests by applying the distributional approximation. Such a test is also known as a goodness-of-fit test, one of the most fundamental tools of statistical inference; however, such tests in terms of the Wasserstein distance have not been available. In the following, fix an arbitrary significance level for the test.
One-sample Test: We test whether the probability measure which generates the observations is identical to a pre-specified measure. Namely, we consider the following null/alternative hypotheses,
We measure the divergence between the two probability measures using the Wasserstein distance as a test statistic, with the network size given. The distribution of the test statistic can be approximated by the Gaussian multiplier bootstrap in Algorithm 1. A procedure of the one-sample test is as follows:
Calculate the bootstrap distribution by Algorithm 1 and obtain its quantile at the significance level as the critical value.
If holds, reject . Otherwise, accept .
The validity of the test is shown in the following theorem, which shows that the Type I error converges to the arbitrarily specified significance level.
Also, we can construct a confidence interval which contains the true Wasserstein distance at an arbitrary significance level. Namely, the interval contains the true distance with the prescribed probability asymptotically.
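The one-sample decision rule above can be sketched as follows; the function name and interface are ours, and we assume the scaled test statistic and the bootstrap draws from Algorithm 1 are given:

```python
import numpy as np

def one_sample_test(stat, boot_draws, alpha=0.05):
    """Decision rule for the one-sample Wasserstein test (a sketch).

    stat       : the scaled empirical Wasserstein distance between the
                 observed sample and the null measure P0 (assumed computed).
    boot_draws : multiplier-bootstrap draws approximating its distribution
                 under the null hypothesis P = P0.
    Rejects H0 when the statistic exceeds the (1 - alpha)-quantile of the
    bootstrap distribution; returns (reject, critical_value).
    """
    q = np.quantile(boot_draws, 1.0 - alpha)   # bootstrap critical value
    return stat > q, q
```

A two-sided confidence interval for the true distance follows the same pattern, shifting the statistic by the bootstrap quantiles instead of thresholding it.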
Two-sample Test: We consider a two-sample test to investigate the correspondence of two probability measures from two sets of observations. In addition to the first sample, we suppose that we observe a set of independent observations from another distribution, together with its empirical distribution. Let one bootstrap output of Algorithm 1 be computed from the first sample, and another from the second sample. Here, to calculate them, we use a DNN with fixed numbers of layers and parameters. As a preparation, we define the quantities used in the test below.
In the two-sample test, we consider the following null/alternative hypotheses.
Here, we employ the empirical Wasserstein distance between the two samples as a test statistic, with the network size fixed. Define the quantiles of the two bootstrap distributions, and define the critical value as their scaled combination. A procedure of the two-sample test is as follows:
Calculate from and .
If holds, reject . Otherwise, accept .
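The combination of the two bootstrap quantiles can be sketched as follows; the equal split of the level between the two samples and the $1/\sqrt{n}$, $1/\sqrt{m}$ scalings are assumptions of this sketch, chosen to match the triangle-inequality argument:

```python
import numpy as np

def two_sample_critical_value(boot_x, boot_y, n, m, alpha=0.05):
    """Conservative two-sample critical value (a sketch under assumptions).

    boot_x, boot_y : bootstrap draws from Algorithm 1 for the two samples,
                     of sizes n and m respectively.
    By the triangle inequality, under the null the test statistic is bounded
    by the sum of the two one-sample deviations, so combining the
    (1 - alpha/2)-quantiles of each bootstrap distribution, on each sample's
    own scale, controls the overall level at alpha (conservatively).
    """
    q_x = np.quantile(boot_x, 1.0 - alpha / 2.0)
    q_y = np.quantile(boot_y, 1.0 - alpha / 2.0)
    return q_x / np.sqrt(n) + q_y / np.sqrt(m)
```

The union bound over the two one-sample events is what makes this test conservative, consistent with the remark following the theorem.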
The following theorem gives validity of the test by proving a convergence of the Type I error.
Suppose as and Assumption 1 holds. Set and . Then, under the null hypothesis , we have the Type I error as
Note that the proposed two-sample test is conservative unlike the one-sample test, because the triangle inequality is used to bound the distribution of the test statistic. Also, similarly to the one-sample test, we can develop a confidence interval which asymptotically contains the true distance.
5. Experiments
We provide experiments of the proposed method (codes are attached as supplementary material). To calculate the supremum over DNNs for our proposal, we utilize a fully-connected network with ReLU activations [nair2010rectified]. We employ vanilla SGD for optimization with a fixed learning rate. Also, we employ spectral normalization [miyato2018spectral] in every epoch to restrict the Lipschitz constants of the functions. To derive the bootstrap distribution, we iterate the procedure a fixed number of times.
5.1. Performance of Distributional Approximation
We validate experimentally how well our proposal approximates the true distribution. To this end, we simulate the true distribution of the statistic by repetition (details are provided in Section C.2). For the data distribution, we generate multivariate random vectors whose elements independently follow (i) a normal distribution, (ii) an exponential distribution, and (iii) a mixture of Gaussians. For the proposed method, we fix the number of bootstrap iterations. We also calculate a naive bootstrap method for comparison.
5.2. Two-sample Test with Hand-Written Letter Images
We carry out the two-sample test for distinguishing hand-written digit images from the Modified National Institute of Standards and Technology (MNIST) dataset (http://yann.lecun.com/exdb/mnist/) [lecun1998gradient]. The dataset contains $28 \times 28$-pixel images of hand-written digits from $0$ to $9$. We set the two samples as images of particular digits sampled from the dataset. Then, our two-sample test investigates whether the two samples represent the same digit. The rejection probability is derived from repetitions for same and different pairs, respectively.
The result in Figure 5 shows that our test works well. Namely, if the digits are identical (the null hypothesis is true), the null is not rejected in most cases. Otherwise (the alternative is true), the null is rejected in all cases. Note that the two-sample test is conservative, hence the type I error is no more than the nominal level asymptotically.
We also discuss the interpretability of the test based on the Wasserstein distance. Figure 5 plots a pair of the critical value and the test statistic for each repetition. The result shows that some digit pairs are harder to distinguish than others. This is consistent with visual intuition from the images, and the Wasserstein distance succeeds in capturing that intuition. We provide further analysis of MNIST in Section C.3 with an illustration of the MNIST images.
5.3. Comparison with Other Hypothesis Tests
We compare the performance of the test with the Wasserstein distance against other tests, by reporting the type I error, the type II error, and the computational time of the methods for a two-sample test. As other methods, we consider the Maximum Mean Discrepancy (MMD) test [gretton2012kernel] with the Gaussian kernel (we utilize the code at https://github.com/emanuele/kernel_two_sample_test) and the crossmatch test [rosenbaum2005exact] (we utilize the package at https://cran.r-project.org/web/packages/crossmatch/index.html). To this end, we generate samples from two distributions with different parameters: (i) two-dimensional Gaussian distributions with different means and identity covariance, and (ii) two products of exponential distributions with different parameters.
We plot the results in Tables 6 and 7. In both settings, the type I errors of all the methods are controlled below the nominal level. Regarding the type II errors, the proposed Wasserstein test works slightly better than the other methods as the sample size increases. This result shows that the Wasserstein test can distinguish two close distributions because the Wasserstein distance metrizes a weak topology. Regarding computational time, the proposed Wasserstein test runs in reasonable time, while the computational times of the MMD test and the crossmatch test increase rapidly with the sample size. This is because the proposed method has a lower computational complexity.
6. Conclusion
We develop a hypothesis test with the Wasserstein distance. Since the Wasserstein distance between empirical measures does not have a valid limit distribution, developing a test with the Wasserstein distance on multi-dimensional spaces has been an open question. To this end, we utilize the non-asymptotic Gaussian approximation and develop a valid hypothesis test. For practical use, we develop an approximation with deep neural networks and a multiplier bootstrap method. We numerically validate the performance of the proposed test. The experimental results verify that the proposed method controls the errors. Also, the method works well for distinguishing similar but different distributions, owing to the weak topology induced by the Wasserstein distance.
– Supplementary Material –
Appendix A Supportive results
Theorem 5 (Theorem 5 in [schmidt2017nonparametric], adapted to our setting).
For any function , there exists a constant such that for any , there is a neural-network with at most layers and all its parameters are bounded by , such as
Theorem 6 (Corollary 2.2 in [chernozhukov2016empirical], adapted to our setting).
Let be a supremum of an empirical process as in (3). Also, let be a functional class with an envelope function such that with existing constants for all . Suppose for some and , for and holds. Then, for every , there exists a random variable such that
where and are constants depends on only .
Lemma 3 (-entropy number of , Lemma 5 in [schmidt2017nonparametric]).
Given a neural network, let be a product of a number of nodes in layers. For any , we have
where is any finite measure on .
Lemma 4 (Uniform entropy integral for ).
Let be an envelope function of . For each integer , define the uniform entropy integral as
where the supremum is taken over all finitely discrete probability measures on . Then, for each integer , there exists a global constant such that
Appendix B Proofs
B.1. Proof of Lemma 1
Fix and arbitrary, and set .
Firstly, we show . By the definition of , we have , hence obviously we obtain
thus we obtain the inequality.
Secondly, we show that . We have
where is the supremum in .
By Theorem 5, there exists which satisfies
where is an existing constant. Here, all parameters for are bounded by , hence , and thus we have . By using , we continue the inequality as
Also, since is a probability measure, we obtain
For , we obtain the same inequality respectively. Hence, we obtain that
Combining the first inequality and setting , then we obtain the statement. ∎
B.2. Proof of Lemma 3
Firstly, by the definition of , we have
for any and . Then, from Lemma 9.22 in [kosorok2008introduction], for any norm dominated by , we have
Finally, we apply Lemma 5 in [schmidt2017nonparametric] which bounds directly, and obtain the statement. ∎
B.3. Proof of Lemma 4
For any finite measure on and ,