On the rate of convergence of empirical measure in ∞-Wasserstein distance for unbounded density function

07/22/2018 · by Anning Liu, et al. · Tsinghua University, Duke University

We consider a sequence of independent and identically distributed random samples from an absolutely continuous probability measure in one dimension with unbounded density. We establish a new rate of convergence of the ∞-Wasserstein distance between the empirical measure of the samples and the true distribution, which extends the previous convergence result of Trillos and Slepčev to the case of unbounded density.


1. Introduction

Consider a sequence of independent and identically distributed (i.i.d.) random variables X_1, X_2, …, X_n sampled from a given probability measure μ ∈ P(D) with probability density function ρ. Here P(D) denotes the space of all probability measures on the domain D. We define the empirical measure associated to the samples by

μ_n := (1/n) ∑_{i=1}^{n} δ_{X_i}.

The well-known Glivenko–Cantelli theorem [18] states that μ_n converges weakly to μ as n → ∞. In recent years, there has been growing interest in quantifying the rate of convergence of μ_n to μ with respect to Wasserstein distances. Recall that for 1 ≤ p < ∞, the p-Wasserstein distance between two probability measures μ and ν is defined as

W_p(μ, ν) := ( inf_{π ∈ Γ(μ, ν)} ∫_{D×D} |x − y|^p dπ(x, y) )^{1/p},

and

W_∞(μ, ν) := inf_{π ∈ Γ(μ, ν)} π-ess sup { |x − y| : (x, y) ∈ D × D },

where Γ(μ, ν) is the set of all probability measures on D × D with marginals μ and ν.
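As a worked micro-example (ours, not taken from the paper, and relying only on the standard one-dimensional theory): if the two measures are discrete with the same number n of equally weighted atoms, the ∞-Wasserstein distance is attained by the monotone (sorted) matching,

\[
W_\infty\Big(\tfrac{1}{n}\sum_{i=1}^{n}\delta_{X_i},\ \tfrac{1}{n}\sum_{i=1}^{n}\delta_{Y_i}\Big)
=\min_{\sigma\in S_n}\max_{1\le i\le n}\bigl|X_i-Y_{\sigma(i)}\bigr|
=\max_{1\le i\le n}\bigl|X_{(i)}-Y_{(i)}\bigr|,
\]

where X_{(1)} ≤ … ≤ X_{(n)} and Y_{(1)} ≤ … ≤ Y_{(n)} denote the sorted atoms and S_n is the set of permutations of {1, …, n}. This is exactly the min-max matching distance discussed in Section 1.1 below.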

The purpose of this paper is to prove a rate of convergence of μ_n to μ with respect to the ∞-Wasserstein distance when the density function of μ is unbounded. For simplicity, we focus on the one-dimensional case, but the arguments of the proof are expected to generalize to higher dimensions.


1.1. Motivation and Related Work

Estimating the distance between the empirical measure of a sequence of i.i.d. random variables and its true distribution is a highly important problem in probability and statistics. For example, in statistics it is usually impossible to access the true distribution, e.g. the posterior distribution in a Bayesian procedure. In order to extract useful information from the true distribution, a common approach is to generate i.i.d. samples from it via various sampling algorithms (Markov chain Monte Carlo, for instance), from which one can approximately compute many statistical quantities of interest, such as the mean or variance, by their empirical counterparts. Hence understanding the statistical error in estimating these statistics requires a quantification of the distance between the empirical measure and the true distribution.

The Wasserstein distance is a natural choice for measuring the closeness of two probability measures in the problem under consideration, since it allows the probability measures to be mutually singular and in particular accommodates Dirac masses and empirical measures. This is prohibited if the total variation distance or the Hellinger distance [12] is used. We are particularly interested in the ∞-Wasserstein distance for several reasons. First, the ∞-Wasserstein distance reduces to the so-called min-max matching distance [1, 2, 13] when both measures are discrete with the same number of Diracs. The min-max matching distance plays an important role in the analysis of shape matching problems in computer vision; see [9] and the references therein. Moreover, the ∞-Wasserstein distance is also useful in understanding the asymptotic performance of spectral clustering [16, 17]. In fact, in [16], the authors studied the consistency of spectral clustering algorithms in the large-graph limit. By formulating the clustering procedure in a variational framework, they characterized the convergence of eigenvalues and eigenvectors of a weighted graph Laplacian, and of the resulting spectral clustering, to their underlying continuum limits using Γ-convergence. One crucial ingredient in their proof is exactly a convergence rate estimate on the ∞-Wasserstein distance between the empirical measure and the true distribution, which was established in [17]. However, they made a strong assumption that the density function of the true distribution is strictly bounded from above and below. We aim to extend the result in [17] to the case where the true distribution has an unbounded density in one-dimensional space.

Let us briefly review some important previous works on the rate of convergence of W_p(μ_n, μ) for p < ∞. For p = 1, Dudley [10] established a dimension-dependent rate of convergence for E[W_1(μ_n, μ)]. Based on Sanov's theorem, Bolley, Guillin and Villani [6] proved a concentration estimate on W_p(μ_n, μ) in any dimension. Boissard [4] extended this result to spaces more general than Euclidean space and applied it to the occupation measure of a Markov chain. In [5], Boissard and Le Gouic gave the rate of convergence of the expected Wasserstein distance. Fournier and Guillin [11] presented sharper non-asymptotic moment estimates and concentration estimates than [6, 4]: they showed that if μ has a finite q-th moment, then corresponding moment and concentration bounds hold. (We only list one case here; for the other cases, one can refer to Theorems 1 and 2 in [11].) Weed and Bach gave a new definition of the upper Wasserstein dimension of a measure and proved a rate of convergence governed by this dimension.

As for W_∞(μ_n, μ), its rate of convergence is less studied than that of W_p with p < ∞. As far as we know, most results on W_∞ are obtained when both measures are discrete. As mentioned above, the ∞-Wasserstein distance between two discrete measures with the same number of atoms is closely linked to the min-max matching problem. Many results have been obtained for the latter when μ is a uniform distribution. Let S denote the underlying domain. Define a regularly spaced array of grid points on S (with n = k^d for some integer k) by {y_1, …, y_n}, and let {X_1, …, X_n} be i.i.d. random samples uniformly distributed on S. Leighton and Shor [13], and Yukich and Shor [14], showed that as n → ∞, the min-max matching distance min_σ max_{1≤i≤n} |X_i − y_{σ(i)}|, where σ ranges over permutations of {1, …, n}, attains a sharp dimension-dependent rate with high probability. Trillos and Slepčev [17] proved that the corresponding estimate still holds when the underlying measure has a strictly positive and bounded density.


1.2. Main Results

The purpose of this paper is to improve the results of [17] in one dimension by removing the boundedness constraint on the density ρ. Our first result is a rate of convergence in the case where the density function is bounded from below, but not necessarily from above.


Theorem 1.1.

Let μ be a probability measure in D with a density function ρ. Assume that there exists a constant m > 0 such that ρ(x) ≥ m for all x ∈ D.

Let X_1, …, X_n be i.i.d. random variables sampled from μ and let μ_n be the corresponding empirical measure. Then for any λ > 0,

In particular, for any constant δ > 0, except on a set with probability at most δ,

(1)

Remark 1.1.

Note that the right-hand side of (1) blows up as m → 0. This is why we assume in Theorem 1.1 that ρ has a uniform positive lower bound. Moreover, the exponent one half in the rate is sharp owing to the central limit theorem: the fluctuation of the empirical cumulative distribution function around the true one is of order n^{−1/2}.


We proceed to discuss the case when the density function is not strictly bounded away from zero. We first point out that if the density function of μ vanishes on a connected region of positive length, then the ∞-Wasserstein distance between μ_n and μ cannot go to zero as n goes to infinity. Indeed, consider a probability measure μ whose density vanishes on an interval of length 2ε, where ε is a small parameter, and is positive on either side of it. Let μ_n be the empirical measure of μ. Since μ_n depends on a sequence of random samples, there is no guarantee that μ_n assigns to each side of the gap exactly the same mass as μ does. Since W_∞(μ, μ_n) is also the maximal distance by which an optimal transportation map from μ to μ_n moves mass (as recalled later in Lemma 2.2), whenever the masses on the two sides do not match, some mass must be transported across the gap and hence W_∞(μ, μ_n) ≥ 2ε. Therefore, in Theorem 1.2 below, we assume that ρ has only a finite number of zero points.
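To make the obstruction quantitative, consider the following worked example (our own concrete choice of density, offered for illustration and not necessarily the one displayed above). Take, for a small parameter ε ∈ (0, 1/4),

\[
\rho(x)=\frac{1}{1-2\varepsilon}\,\mathbf{1}_{[0,\,\frac12-\varepsilon]\cup[\frac12+\varepsilon,\,1]}(x),
\qquad\text{so that}\qquad
\mu\big([0,\tfrac12-\varepsilon]\big)=\mu\big([\tfrac12+\varepsilon,1]\big)=\tfrac12 .
\]

The number of samples falling in [0, 1/2 − ε] is a Binomial(n, 1/2) random variable, which differs from n/2 with probability at least 1 − \sqrt{2/(\pi n)} (and always when n is odd). On that event the two sides of the gap carry unequal empirical mass, so any transportation map between μ and μ_n must move some mass across the gap, whence W_∞(μ, μ_n) ≥ 2ε no matter how large n is.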


Theorem 1.2.

Let μ be a probability measure in D with a density function ρ. Assume that there exists a constant such that, for all x ∈ D,

Suppose additionally that there are only N points x_1, …, x_N satisfying ρ(x_i) = 0. For each x_i, further assume that

(2)

holds on a small neighborhood of x_i, with the constants and exponents appearing in (2) positive. Let X_1, …, X_n be i.i.d. random variables sampled from μ and let μ_n be the corresponding empirical measure. Then there exists a positive constant C such that, except on a set with small probability,


We would like to sketch the proofs of the theorems above. To prove Theorem 1.1, we use the fact that in one dimension the ∞-Wasserstein distance between two measures can be written as the L^∞ norm of the difference of their quantile functions. Moreover, thanks to the (1/m)-Lipschitz continuity of the quantile function of μ, which follows from the assumption that ρ ≥ m, the L^∞ norm of the difference of the quantile functions can be bounded from above by the sup-norm of the difference between the cumulative distribution function of the true distribution μ and that of the empirical distribution μ_n. Finally, the latter can be bounded by using the Dvoretzky-Kiefer-Wolfowitz inequality [dvoretzky1956asymptotic].
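Schematically, writing F and F_n for the cumulative distribution functions of μ and μ_n and m for the lower bound on ρ (these symbols are our assumed notation), the chain of estimates just described reads

\[
W_\infty(\mu,\mu_n)
=\|F^{-1}-F_n^{-1}\|_{L^\infty((0,1))}
\le\frac{1}{m}\,\sup_{x\in D}|F(x)-F_n(x)|
\le\frac{1}{m}\sqrt{\frac{\log(2/\delta)}{2n}}
\qquad\text{with probability at least }1-\delta,
\]

where the equality is Lemma 2.3, the first inequality uses the (1/m)-Lipschitz continuity of F^{-1}, and the last step applies the Dvoretzky-Kiefer-Wolfowitz inequality with ε = \sqrt{\log(2/\delta)/(2n)}. This is only a sketch of the argument; the precise constants are those appearing in the statement of Theorem 1.1.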

For the proof of Theorem 1.2, we first divide the domain into a family of sub-domains according to the value of ρ. Then, we use a scaling equality on each sub-domain with an appropriate scaling parameter such that, after rescaling, the Lebesgue density of the rescaled measure is bounded from above and below. With the density being bounded, we can estimate the ∞-Wasserstein distance on each sub-domain by the same method as in [17]. However, the masses of μ and μ_n may not be equal on each sub-domain. To resolve this issue, we introduce an auxiliary measure whose mass agrees with that of the empirical measure on each sub-domain. Since the distance between the empirical measure and this auxiliary measure can be bounded by an argument similar to Theorem 1.1 in [17], it suffices to estimate the distance between the auxiliary measure and μ.
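To illustrate the first step of this strategy, the short Python sketch below partitions a domain into dyadic level sets of the density, i.e. sub-domains on which ρ is comparable to a constant 2^{-k}. The density ρ(x) = 2x on [0, 1] and the grid resolution are our own choices made for illustration; they are not objects from the paper.

import numpy as np

def dyadic_level_sets(rho, grid, k_min=-1, k_max=8):
    """Partition `grid` into sub-domains D_k = {x : 2^{-(k+1)} < rho(x) <= 2^{-k}}.

    On each D_k the density is comparable to the constant 2^{-k}, so after a
    suitable rescaling its values lie between fixed positive bounds, which is the
    situation covered by the bounded-density estimates of [17].
    """
    values = rho(grid)
    levels = {}
    for k in range(k_min, k_max + 1):
        mask = (values > 2.0 ** (-(k + 1))) & (values <= 2.0 ** (-k))
        if mask.any():
            levels[k] = grid[mask]
    return levels

if __name__ == "__main__":
    rho = lambda x: 2.0 * x                      # toy density on [0, 1], vanishing at x = 0
    grid = np.linspace(0.0, 1.0, 10001)
    for k, pts in sorted(dyadic_level_sets(rho, grid).items()):
        print(f"D_{k}: rho in (2^{-(k + 1)}, 2^{-k}], x in [{pts.min():.4f}, {pts.max():.4f}]")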

The following corollary is a direct consequence of Theorem 1.1 and Theorem 1.2.


Corollary 1.1.

Let μ be a probability measure in D with density ρ. Assume that there are only N points x_1, …, x_N satisfying ρ(x_i) = 0 and that, for each x_i, ρ satisfies (2). Let X_1, …, X_n be i.i.d. random variables sampled from μ. Then there exists a positive constant C such that, except on a set with small probability,

(3)

1.3. Discussion

As we mentioned earlier, quantifying the rate of convergence of μ_n to μ with respect to the Wasserstein distance is very useful for understanding the consistency of spectral clustering [16]. Our new convergence rate estimates extend the consistency analysis of spectral clustering to the case where the density of the true distribution is unbounded, as we discuss in what follows.

Let {X_1, …, X_n} be a set of data points sampled from a probability measure μ. For each pair of points X_i and X_j, we construct a weight w_ij between them to characterize their similarity. In general, the weight has the form

w_ij = η(|X_i − X_j| / ε),

where ε > 0 is the kernel (connectivity) width and η is an appropriate kernel function (for example, a Gaussian kernel). The weight matrix is then defined by W = (w_ij)_{i,j=1}^{n}, and we let d_i = ∑_j w_ij denote the degree of the i-th point. The discrete Dirichlet energy is built from the weights w_ij, and the corresponding continuum Dirichlet energy is built from the density function of the underlying measure μ. The unnormalized graph Laplacian is defined by

L = diag(d_1, …, d_n) − W.

The aim of spectral clustering is to partition the data points into meaningful groups. To do this, the spectrum of the unnormalized graph Laplacian L is used to embed the data points into a low-dimensional space, and then a clustering algorithm such as k-means is applied to the embedded points. For more details about spectral clustering, one can see [19].
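For concreteness, the following Python sketch assembles the objects just described: a Gaussian-kernel weight matrix, the degrees, the unnormalized graph Laplacian L = diag(d_1, …, d_n) − W, a spectral embedding, and k-means. It is a generic illustration of the pipeline rather than the construction analyzed in [16]; the kernel, the width eps and the number of clusters are arbitrary choices made for this example.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(X, eps, n_clusters):
    """Cluster the rows of X (n data points in R^d) via the unnormalized graph Laplacian."""
    # Gaussian-kernel weights w_ij = exp(-|x_i - x_j|^2 / eps^2), zero on the diagonal.
    dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-dist2 / eps ** 2)
    np.fill_diagonal(W, 0.0)

    # Degrees d_i = sum_j w_ij and unnormalized Laplacian L = diag(d) - W.
    L = np.diag(W.sum(axis=1)) - W

    # Embed the points with the eigenvectors of the n_clusters smallest eigenvalues.
    _, vecs = eigh(L, subset_by_index=[0, n_clusters - 1])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated one-dimensional clusters of 100 points each.
    X = np.concatenate([rng.normal(0.0, 0.05, 100), rng.normal(1.0, 0.05, 100)])[:, None]
    labels = unnormalized_spectral_clustering(X, eps=0.3, n_clusters=2)
    print(np.bincount(labels))   # expected: two groups of size 100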

In [16], the authors proved that when the density function of μ is bounded from above and below, the spectrum of the unnormalized graph Laplacian converges to the spectrum of the corresponding continuum operator, which implies the consistency of spectral clustering. They also gave a lower bound on the rate at which the connectivity radius ε is allowed to decay to zero as n → ∞. With our theorems, the results in [16] can be generalized to the case where ρ is unbounded. In particular, the kernel width ε should be chosen to be slightly bigger than the right-hand side of (3), which differs from the choice in [16].

The proof is not included in this paper since it is similar to the proof in [16]. We sketch its outline as follows. First, we prove the convergence of the discrete Dirichlet energy to its continuum counterpart; our theorems are used in this step to establish the probabilistic estimates and the constraint on ε. Next, by the min-max theorem, the eigenvalues of the graph Laplacian can be written as min-max values of the Dirichlet energy. Therefore, the convergence of the spectrum is equivalent to the convergence of the corresponding variational problems, which can be proved using the convergence and compactness properties of the Dirichlet energies. Finally, with the convergence of the spectrum, we can prove the consistency of spectral clustering.

The paper is organized as follows: In section 2, we introduce some preliminaries and notations. In section 3.1 and section 3.2, we prove Theorem 1.1 and Theorem 1.2 respectively. Finally, the proof of Corollary 1.1 is presented in section 3.3.

2. Preliminaries and notations

2.1. Notations

Let D ⊂ ℝ and let P(D) be the set of all probability measures on D. Given a probability measure μ ∈ P(D) and a Borel-measurable map T : D → D, we define the pushforward T#μ of μ under the map T by setting

T#μ(A) := μ(T^{−1}(A))

for any measurable set A ⊂ D. If ν = T#μ, we call T a transportation map between μ and ν.

The ∞-Wasserstein distance is defined by

W_∞(μ, ν) := inf_{π ∈ Γ(μ, ν)} π-ess sup { |x − y| : (x, y) ∈ D × D },

where Γ(μ, ν) is the set of all couplings between μ and ν, i.e. probability measures on D × D whose first and second marginals are μ and ν, respectively.


Remark 2.1.

Note that the definition of W_∞ can be generalized to the case where μ and ν have the same (not necessarily unit) total mass on D. Therefore, in the sequel, we still write W_∞(μ, ν) when μ(D) = ν(D), even though μ and ν are not necessarily probability measures.


It was proved in [7] that if μ is absolutely continuous with respect to the Lebesgue measure, then any optimal transport plan π for W_∞(μ, ν) is induced by a transportation map T, i.e. T#μ = ν and π = (Id × T)#μ. In particular, the optimal transportation plan for W_∞(μ, μ_n), with μ_n being the empirical measure of the absolutely continuous probability measure μ, is unique.

2.2. Useful lemmas

The following lemma collects some properties of W_∞ to be used in subsequent sections. The proofs are elementary and thus omitted.


Lemma 2.1.

Given measures μ, ν, σ defined on D with μ(D) = ν(D) = σ(D), the following properties hold:

  1. Triangle inequality: W_∞(μ, ν) ≤ W_∞(μ, σ) + W_∞(σ, ν).

  2. Scaling equality: , for .

  3. .

  4. If then


The following two lemmas give two different characterizations of W_∞.


Lemma 2.2 ([7]).

Let μ, ν be two Borel measures on D with μ absolutely continuous with respect to the Lebesgue measure and μ(D) = ν(D). Then there exists an optimal transportation map T such that T#μ = ν and

W_∞(μ, ν) = μ-ess sup_{x ∈ D} |T(x) − x|.

Furthermore, if ν is a discrete measure, ν = ∑_{i=1}^{k} a_i δ_{y_i}, with distinct points y_i ∈ D and positive numbers a_i, then there exists a unique transportation map T for which the above holds.


Lemma 2.3 ([villani2003topics, Remark 2.19]).

Let μ, ν be two probability measures on ℝ. Denote the cumulative distribution functions of μ and ν by F and G, and their generalized inverses (quantile functions) by F^{−1} and G^{−1}, respectively. Then the following equality holds:

W_∞(μ, ν) = ‖F^{−1} − G^{−1}‖_{L^∞((0,1))}.
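As a small numerical companion to this lemma (our own illustration, assuming the quantile-function identity above): for the empirical measure of a Uniform(0, 1) sample, the quantile function of μ is the identity, the empirical quantile function is a step function taking the i-th order statistic on ((i−1)/n, i/n], and the supremum of their difference is attained at an endpoint of one of these intervals. The column √(log n / n) printed below is only a reference scale, not a constant from the paper.

import numpy as np

def w_inf_to_uniform(samples):
    """W_inf between the empirical measure of `samples` and Uniform(0, 1) via Lemma 2.3.

    The empirical quantile function equals the (j+1)-th smallest sample on the
    interval (j/n, (j+1)/n], so sup_t |F_n^{-1}(t) - t| is reached as t approaches
    j/n or equals (j+1)/n for some j.
    """
    x = np.sort(samples)
    n = len(x)
    left = np.abs(x - np.arange(0, n) / n)        # t near the left endpoints j/n
    right = np.abs(x - np.arange(1, n + 1) / n)   # t at the right endpoints (j+1)/n
    return float(np.maximum(left, right).max())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    for n in (10**2, 10**3, 10**4):
        d = w_inf_to_uniform(rng.uniform(size=n))
        print(f"n = {n:6d}   W_inf ~ {d:.4f}   sqrt(log n / n) = {np.sqrt(np.log(n) / n):.4f}")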


Lemma 2.4 ([17, Lemma 2.2]).

Let and be two probability measures defined on with density functions and respectively. Assume that there exists a positive constant such that

Then there exists such that


The following three probability inequalities on binomial random variables and the Dvoretzky-Kiefer-Wolfowitz inequality will be used in the proofs of main results.


Lemma 2.5.

Let be independent binomial random variables. For , Chebyshev's inequality [15] states that

Chernoff's inequality [8] states that

Bernstein’s inequality [3] states that


Lemma 2.6 (Dvoretzky-Kiefer-Wolfowitz inequality [dvoretzky1956asymptotic]).

Let X_1, …, X_n be i.i.d. random variables sampled from a probability measure μ. Let F be the cumulative distribution function of μ and F_n be the cumulative distribution function of the empirical measure μ_n. Then for every ε > 0,

P( sup_{x ∈ ℝ} |F_n(x) − F(x)| > ε ) ≤ 2 e^{−2nε²}.
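The inequality is easy to probe numerically. The following short simulation (our own illustration, with Uniform(0, 1) samples and an arbitrary choice of n and ε) estimates the left-hand side by Monte Carlo and compares it with the bound 2e^{−2nε²}.

import numpy as np

def dkw_check(n=200, eps=0.1, trials=20000, seed=0):
    """Monte Carlo estimate of P(sup_x |F_n(x) - F(x)| > eps) for Uniform(0, 1) samples."""
    rng = np.random.default_rng(seed)
    i_over_n = np.arange(1, n + 1) / n            # values of F_n at the sample points
    exceed = 0
    for _ in range(trials):
        x = np.sort(rng.uniform(size=n))
        # sup_x |F_n(x) - x| is attained at a sample point or just to its left.
        sup_dev = max(np.abs(i_over_n - x).max(), np.abs(i_over_n - 1.0 / n - x).max())
        exceed += sup_dev > eps
    return exceed / trials, 2.0 * np.exp(-2.0 * n * eps ** 2)

if __name__ == "__main__":
    empirical_tail, dkw_bound = dkw_check()
    print(f"empirical tail ~ {empirical_tail:.4f}   DKW bound = {dkw_bound:.4f}")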


3. Convergence of empirical measure

3.1. Proof of Theorem 1.1

Proof.

Denote the cumulative distribution function of μ by F and that of μ_n by F_n. Thanks to the Dvoretzky-Kiefer-Wolfowitz inequality [dvoretzky1956asymptotic],

From this, we claim that

(4)

which implies that

To prove (4), it suffices to show that implies . To this end, fix . Let and . Then, from the fact that the density function ρ has the lower bound m, we know that

It follows that

where the last inequality is obtained from . Therefore, for any ,

which completes the proof of (4). It follows from (4) and Lemma 2.3 that

By taking we get that except on a set with probability ,


3.2. Proof of Theorem 1.2

Lemma 3.1.

If , then .

Proof.

By induction, we only need to prove that implies . From we know . Therefore,

(5)

In Theorem 1.2, we give the rate of convergence of W_∞(μ_n, μ) when the density function ρ is not strictly bounded away from zero. The proof is a refinement of that of [17, Theorem 1.1], which deals with the case where ρ is bounded from above and below. We sketch the rough idea of our proof below before giving the details.

To prove the theorem, we would like to use Lemma 2.1-(4) to reduce the estimate of W_∞(μ_n, μ) to estimates on sub-domains, including a small neighborhood of each zero point of ρ. To do so, we need to modify the measure locally (we denote the modified measure by μ̃) so that it has the same mass as the empirical measure on each such neighborhood. Then, we divide the remaining domain into a family of sub-domains according to the value of ρ, so that ρ is bounded from above and below on each sub-domain. Thus we can adapt arguments from [17] to obtain bounds there. However, μ̃ may not have the same mass as the empirical measure on each sub-domain. In order to remove this mass discrepancy, we introduce another measure whose mass matches on each sub-domain. Finally, thanks to Lemma 2.1, we can establish an upper bound on W_∞(μ_n, μ) from the estimates of the individual pieces.

Proof.

Let . Then is a partition of . Let for and be a probability measure defined on

(6)

Then it’s clear that

Combining this with Lemma 2.1, we obtain that

Choose . To estimate , we divide into a family of sub-domains and use the scaling property to bound the distance on each sub-domain.

Define by , (If is empty, just neglect it). Then,

Let and define a measure on by

Then it’s easy to see that

Again, with this and Lemma 2.1, we can bound as follows

Therefore, to estimate , it suffices to estimate , , , and respectively.


Step 1: We first estimate . It is easy to deduce, via Lemma 2.1, that

To ease the notations, we write and

Clearly, is the restriction of to and is the empirical measure of . Furthermore, we note that is bounded from below in due to the fact that is a small neighborhood of zero point and . Therefore, we can use Theorem 1.1 to give an estimate on (We remark that Theorem 1.1 holds true for any domain by replacing with in the proof.)

Let . Then we have . It follows from Theorem 1.1 that there exists a constant such that


Step 2:  We then estimate . To achieve this, set

and consider the following two cases: 1) and 2) .

We claim that, when , . To show the claim, we first recall the definition that and the assumption that in . To simplify the notations, we denote by and by . Let and be positive constants satisfying ,

From we know that Moreover, when is large enough,

where .

Therefore, when ,