# Wasserstein Identity Testing

Uniformity testing and the more general identity testing are well-studied problems in distributional property testing. Most previous work focuses on testing under the $L_1$-distance. However, when the support is very large or even continuous, testing under the $L_1$-distance may require a huge (even infinite) number of samples. Motivated by this issue, we consider identity testing in Wasserstein distance (a.k.a. transportation distance or earthmover distance) on a metric space (discrete or continuous). In this paper, we propose the Wasserstein identity testing problem (identity testing in Wasserstein distance) and obtain nearly optimal worst-case sample complexity for it. Moreover, for a large class of probability distributions satisfying the so-called "Doubling Condition", we provide nearly instance-optimal sample complexity.

## 1 Introduction

Property testing was introduced in the seminal work of Goldreich et al. (1998). Broadly, it studies the design and analysis of randomized decision algorithms that efficiently decide whether a given instance has a certain property or is far from having it. Notably, the query complexity of an efficient property testing algorithm is often sublinear in the size of the instance it accesses.

In recent years, distribution property testing has received much attention in theoretical computer science. In most problems in distribution property testing, the input is a set of independent samples from an unknown distribution, and the task is to decide whether the distribution has a certain property or not. Researchers have investigated the sample complexity of testing various distribution properties, such as uniformity, identity to a given distribution, closeness, having small entropy, having small support, being uniform on a small subset, and so on (Goldreich and Ron (2011), Paninski (2008), Chan et al. (2014), Valiant and Valiant (2014), Diakonikolas et al. (2014), Valiant and Valiant (2010b), Valiant and Valiant (2010a), Batu and Canonne (2017), Diakonikolas et al. (2017), Diakonikolas and Kane (2016)).

In this paper, we focus on the problem of identity testing. Arguably, identity testing, together with its special case uniformity testing, is the best-studied problem in distribution property testing. In identity testing, we are given sample access to an unknown distribution $q$, the explicit description of a known distribution $p$ and a proximity parameter $\varepsilon$. We are then required to distinguish the following two cases: 1) $q$ is identical to $p$; 2) a certain distance (e.g. the $L_1$-distance, the Hellinger distance, or a Wasserstein distance of a given order) between $p$ and $q$ is larger than $\varepsilon$.

The sample complexity of identity testing in $L_1$-distance (equivalently, statistical distance or total variation distance) is now fully understood through a series of works (Goldreich and Ron (2011), Paninski (2008), Chan et al. (2014)). Specifically, testing whether a distribution supported on $[n]$ is uniform with proximity parameter $\varepsilon$ in $L_1$-distance requires $\Theta(\sqrt{n}/\varepsilon^2)$ samples (Paninski (2008), Valiant and Valiant (2014)). However, when the support is continuous, this bound becomes meaningless. For example, the natural problem of testing whether a distribution supported on $[0,1]$ is uniform in $L_1$-distance would require an infinite number of samples.

Motivated by these issues, we would like to study the testing problem under a probability distance that metrizes weak convergence (in contrast, convergence in $L_1$-distance is a strong form of convergence). A popular choice is the Wasserstein distance (a.k.a. transportation distance or earthmover distance, see Definition 1). Using Wasserstein distance, identity testing is well defined in an arbitrary, even continuous, metric space (with Borel point sets of positive finite measure). We also note that using Wasserstein distance as the defining metric has recently gained significant attention in the machine learning community (e.g., in generative models, Arjovsky et al. (2017), and in mixture models, Li et al. (2015)).

[Wasserstein Distance] Let $p, q$ be two distributions supported on a metric space $(X, d)$. The Wasserstein distance (or transportation distance) between $p$ and $q$ with respect to $d$ is defined to be:

$$W_d(p,q)=\inf_{M\in \mathrm{coup}(p,q)}\int d(x,y)\,\mathrm{d}M(x,y)$$

where $\mathrm{coup}(p,q)$ is the set of all coupling distributions of $p$ and $q$, i.e. all distributions on $X \times X$ that have marginal distributions $p$ and $q$.
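As a concrete instance of this definition, consider distributions on finitely many points of the real line with $d(x,y)=|x-y|$. In that special case the infimum over couplings has a well-known closed form: the integral of the absolute difference of the two cumulative distribution functions. The sketch below is illustrative only; the function name `wasserstein_1d` and its argument layout are assumptions made for this example and do not come from the paper.

```python
from itertools import accumulate


def wasserstein_1d(points, p, q):
    """W_d(p, q) for two distributions on sorted real points, d(x, y) = |x - y|.

    points: strictly increasing list of support points (assumed sorted).
    p, q:   probability masses aligned with `points`.
    """
    if not (len(points) == len(p) == len(q)):
        raise ValueError("points, p and q must have equal length")
    # Cumulative distribution functions of p and q at each support point.
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # |F_p - F_q| units of mass must cross each gap between neighbours.
    return sum(
        abs(cdf_p[k] - cdf_q[k]) * (points[k + 1] - points[k])
        for k in range(len(points) - 1)
    )
```

For example, moving a unit point mass from 0 to 2 costs 2, while two identical distributions are at distance 0.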

We define the problem of Wasserstein identity testing as follows.

[Wasserstein Identity Testing, ] Let $(X,d)$ be a metric space and $p$ a distribution on $X$. For a proximity parameter $\varepsilon > 0$, denote the problem of designing an algorithm which, given sample access to an unknown distribution $q$ on $X$,

• accepts with probability at least if ;

• rejects with probability at least if .

The problem also includes establishing lower bounds on the sample complexity of such algorithms.

For notational convenience, we use for short when there is no risk of ambiguity. Denote by

the uniform distribution on

, then is the Wasserstein uniformity testing problem.

When $X$ is not discrete, the meaning of "the explicit description of $p$" is unclear. However, whenever $X$ is separable, we can approximate $p$ by a distribution supported on a countable net of $X$ at an arbitrarily fine scale. This transformation makes the problem discrete. To simplify matters, in what follows we assume that all the time and that the explicit description of $p$ is well defined.

#### Testing versus Learning

A direct approach to identity testing is to learn the distribution. Specifically, to test whether an unknown distribution $q$ is uniform with proximity parameter $\varepsilon$, we can estimate $q$ by an empirical distribution $\hat{q}$ that is sufficiently close to $q$ relative to $\varepsilon$. We then accept if $\hat{q}$ is close enough to the uniform distribution and reject otherwise. A tester is efficient if it uses fewer samples than estimating $q$ by the empirical distribution, and for statistical efficiency we seek such testers. For example, if the support is $[n]$ and the distance is the $L_1$-distance, the sample complexity of learning is $\Theta(n/\varepsilon^2)$ (see e.g. Devroye and Lugosi (2001)) while the sample complexity of testing is $\Theta(\sqrt{n}/\varepsilon^2)$ (Paninski (2008), Valiant and Valiant (2014)).
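To illustrate why testing can be cheaper than learning, here is a sketch of the classical collision-based uniformity tester for the $L_1$ setting. It is not the tester used in this paper; it is shown only as a familiar example. It relies on the fact that the collision probability of the uniform distribution on $[n]$ is $1/n$, while any distribution at $L_1$-distance at least $\varepsilon$ from uniform has collision probability at least $(1+\varepsilon^2)/n$. The function names and the midpoint threshold are assumptions made for this sketch, and no claim is made about how many samples suffice.

```python
from collections import Counter


def collision_rate(samples):
    """Fraction of unordered sample pairs that take the same value."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (m * (m - 1) // 2)


def collision_uniformity_test(samples, n, eps):
    """Accept (return True) when the collision rate looks uniform on [n].

    The threshold sits halfway between 1/n (uniform) and (1 + eps^2)/n
    (the smallest collision probability of an eps-far distribution).
    """
    return collision_rate(samples) <= (1 + eps ** 2 / 2) / n
```

The statistic only needs enough samples to see a few collisions, which is the intuition behind the square-root dependence on $n$.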

In our case for Wasserstein identity testing, in the natural metric space -dimension hypercube for

equipped with the Euclidean metric, the Wasserstein Law of Large Number (see e.g.

van Handel (2014)) shows that many samples are sufficient and necessary to estimate the distribution up to in Wasserstein distance. Hence we automatically obtain a tester with sample complexity for the problem . On the other hand, in Corollary 2, a tester with sample complexity for the problem is given.

#### The Chaining Method

The primary technique in this paper is to choose a sequence of nets at different scales and then decompose the original testing problem into multiple easier sub-problems according to the nets. This technique is closely related to Talagrand's "Chaining Method", which plays a central role in proving upper and lower bounds for stochastic processes (Talagrand (2014)).

### 1.1 Main Contributions

Our first contribution is a characterization of the worst-case sample complexity of Wasserstein identity testing in an arbitrary metric space, via a nearly optimal upper bound and a matching lower bound.

Let $(X,d)$ be a metric space endowed with a distribution $p$. Let $D$ be its diameter. Let $N_i$ be a sequence of well-separated $2^i$-nets of $X$ (see Definition 2). There is an algorithm which, given sample access to an unknown distribution $q$ over $X$ and a proximity parameter $\varepsilon$,

• accepts with probability at least if ;

• rejects with probability at least if .

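The sample complexity of this algorithm is

$$\tilde{O}\left(\max\left\{\frac{2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right).$$

Moreover, any algorithm which distinguishes the two cases for any fixed $p$ and unknown $q$ with probability at least takes

$$\Omega\left(\max\left\{\frac{2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right)$$

many samples in the worst case.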
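The quantity inside the $\tilde{O}$ and $\Omega$ above can be evaluated numerically once the net sizes are known. The helper below is a minimal sketch under stated assumptions: the caller supplies a hypothetical mapping `net_sizes` from each level $i$ in the admissible range to $|N_i|$, and constants and polylogarithmic factors are ignored.

```python
import math


def worst_case_bound(net_sizes, eps):
    """Evaluate max_i 2^(2i) * sqrt(|N_i|) / eps^2 over the supplied levels.

    net_sizes: dict mapping level i (an int, possibly negative) to |N_i|.
               The caller is responsible for restricting i to the range
               log(eps/8) <= i <= log(D) used in the theorem.
    eps:       proximity parameter, must be positive.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not net_sizes:
        raise ValueError("net_sizes must contain at least one level")
    return max(
        (4.0 ** i) * math.sqrt(size) / eps ** 2
        for i, size in net_sizes.items()
    )
```

Which level attains the maximum depends on how fast $|N_i|$ grows as $i$ decreases: slowly growing nets make the coarsest level dominate, while quickly growing nets shift the maximum to the finest level.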

Theorem 1.1 is a worst-case result for the problems in Definition 1: the sample complexity bound is oblivious to the target distribution $p$. One may wonder whether we can obtain an instance bound that is nearly optimal for every $p$, like the one in Valiant and Valiant (2014). We show that if the distribution $p$ is not too singular (e.g. highly concentrated on one point), in the sense of satisfying the following "Doubling Condition" (see Definition 1.1), then we can obtain nearly instance-optimal sample complexity bounds (see Theorem 3.3).

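[Doubling Condition] Let $(X,d)$ be a metric space and $p$ be a distribution on $X$. For $x \in X$ and $r > 0$, let $B(x,r)$ denote the ball of radius $r$ centered at $x$. $p$ is said to satisfy the "doubling condition" if there exists a constant $C$ such that $p(B(x,2r)) \le C \cdot p(B(x,r))$ for every and , where .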
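For a finitely supported distribution, the smallest constant $C$ that works for a given set of radii can be computed directly. The sketch below is an illustration under explicit assumptions: balls are taken to be closed, centers range over the support points only, and the list of radii is chosen by the caller, since the exact quantifier range of the definition is not reproduced here. All names are hypothetical.

```python
import math


def ball_mass(points, masses, dist, center, radius):
    """Mass of the closed ball B(center, radius) under a discrete distribution."""
    return sum(m for x, m in zip(points, masses) if dist(center, x) <= radius)


def doubling_constant(points, masses, dist, radii):
    """Smallest C with p(B(x, 2r)) <= C * p(B(x, r)) for all x in points, r in radii."""
    worst = 1.0
    for center in points:
        for r in radii:
            small = ball_mass(points, masses, dist, center, r)
            big = ball_mass(points, masses, dist, center, 2 * r)
            if small == 0:
                # A ball of zero mass whose doubled ball has mass admits no C.
                return math.inf if big > 0 else worst
            worst = max(worst, big / small)
    return worst
```

For the uniform distribution on four equally spaced points the constant stays small, whereas a distribution with an isolated heavy atom next to a nearly empty region yields a large value.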

#### Why Doubling Condition?

The doubling dimension was introduced by Assouad (1983) and Larman (1967) and has become a popular complexity measure for metric spaces. Definition 1.1 gives a counterpart of this notion for a metric space endowed with a distribution. Roughly, regarding $p$ as a measure on $X$, the "doubling condition" says that the volume of every ball is upper bounded by a universal constant times the volume of the ball with the same center and half the radius. It measures the complexity of the distribution $p$. A distribution satisfying the "doubling condition" behaves somewhat like the uniform distribution on a compact subset of Euclidean space.

Since uniform distribution on a compact set (e.g. hypercube or unit ball ) of Euclidean space satisfies the doubling condition, as an interesting and important corollary (see Corollary 2), we show the sample complexity of problem and is .

### 1.2 Other Related Work

There are also recent papers on identity or uniformity testing beyond the classical problem of $L_1$-testing. Batu and Canonne (2017) presented the generalized uniformity testing problem, which asks whether a discrete distribution we are sampling from is uniform on its support. Diakonikolas et al. (2017) then investigated the exact sample complexity of this problem. On testing in other distribution distances, Daskalakis et al. (2017) characterized the sample complexity of identity testing in a variety of distances besides the $L_1$-distance.

The study of metric spaces has a long history; we refer to Deza and Laurent (2009) for a complete and in-depth treatment. The doubling dimension was introduced by Assouad (1983) and Larman (1967), and in the theoretical computer science community it was first used by Clarkson (1997) in the context of nearest neighbor search.

Chaining is an efficient way of proving union bounds for a variety of possibly dependent variables. The study of chaining dates back to Kolmogorov's study of Brownian motion. Talagrand (2014) is a highly recommended book on the application of chaining methods in modern probability theory. In recent years, the chaining method has found many applications in theoretical computer science; we refer to at Harvard (2016) for an introduction to chaining methods in theoretical computer science.

## 2 Preliminary

Some notations go first. The

norm of a vector in

is defined to be and .

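We next fix some notation for metric spaces. Let $(X,d)$ be a metric space, where $X$ is a ground set and $d$ is a metric on $X$ which satisfies:

• $d(x,y) = 0$ if and only if $x = y$, for $x, y \in X$.

• $d(x,y) = d(y,x)$ for $x, y \in X$.

• Triangle Inequality: $d(x,z) \le d(x,y) + d(y,z)$ for $x, y, z \in X$.

The diameter of $X$ is defined as $D = \sup_{x,y \in X} d(x,y)$. For a distribution $p$ supported on $X$, a sample from $p$ is a point in $X$. For a subset $S \subseteq X$, $p(S)$ denotes the total probability mass of $S$ under $p$.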

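The following classical definitions of $\varepsilon$-net and $\varepsilon$-packing are essential in this paper.

[$\varepsilon$-net, $\varepsilon$-packing and well separated $\varepsilon$-net] Let $(X,d)$ be a metric space and $M \subseteq X$. A subset $N \subseteq M$ is called an $\varepsilon$-net of $M$ if

$$\forall x\in M,\ \exists y\in N,\ d(x,y)\le\varepsilon.$$

A subset $P \subseteq M$ is called an $\varepsilon$-packing of $M$ if

$$\forall x,y\in P,\ x\neq y,\ d(x,y)>\varepsilon.$$

A subset is called a well separated $\varepsilon$-net of $M$ if it is an $\varepsilon$-net as well as an $\varepsilon$-packing of $M$.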

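The following lemma shows the duality between $\varepsilon$-nets and packings.

(see e.g. van Handel (2014)) Let $(X,d)$ be a metric space. Let $N(X,d,\varepsilon)$ denote the minimum size of an $\varepsilon$-net of $X$ and $P(X,d,\varepsilon)$ denote the maximum size of an $\varepsilon$-packing of $X$. Then we have

$$N(X,d,\varepsilon)\le P(X,d,\varepsilon)\le N(X,d,\varepsilon/2).$$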
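A well separated $\varepsilon$-net of a finite point set can be built greedily: scan the points and keep a point only if it is farther than $\varepsilon$ from everything kept so far. The kept points form an $\varepsilon$-packing by construction, and every discarded point is within $\varepsilon$ of a kept one, so they also form an $\varepsilon$-net. The sketch below is illustrative; the function name is an assumption, and the paper does not prescribe a particular construction.

```python
def greedy_well_separated_net(points, dist, eps):
    """Return a subset of `points` that is both an eps-net and an eps-packing.

    points: iterable of points in a finite metric space.
    dist:   function dist(x, y) returning the metric distance.
    eps:    scale of the net.
    """
    net = []
    for x in points:
        # Keep x only if it is more than eps away from every chosen point;
        # otherwise x is already covered by the net.
        if all(dist(x, y) > eps for y in net):
            net.append(x)
    return net
```

The output depends on the scan order, but any order yields a valid well separated net.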

To acknowledge the great importance of the work by Valiant and Valiant (2014), we restate their core theorem here, and show how it implies other worst-case bounds.

[Valiant and Valiant (2014)] There exists an algorithm such that, when given sample access to an unknown distribution and full description of , both supported on , it uses samples from to distinguish from with success probability at least . Moreover, any such algorithm requires samples from .

The worst-case upper and lower bounds of Theorem 2 is given by , the uniform distribution, where .

There exists an algorithm such that, when given sample access to an unknown distribution and full description of , both supported on , it uses samples from to distinguish from with success probability at least . Moreover, any such algorithm requires in the worst case, in the choices of .

## 3 Wasserstein Identity Testing

First, we restate Theorem 1.1 and give its proof in two parts.

Let $(X,d)$ be a metric space endowed with a distribution $p$. Let $D$ be its diameter. Let $N_i$ be a sequence of well-separated $2^i$-nets of $X$. There is an algorithm which, given sample access to an unknown distribution $q$ over $X$ and a proximity parameter $\varepsilon$,

• accepts with probability at least if ;

• rejects with probability at least if .

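The sample complexity of this algorithm is

$$\tilde{O}\left(\max\left\{\frac{2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right).$$

Moreover, any algorithm which distinguishes the two cases for any fixed $p$ and unknown $q$ with probability at least takes

$$\Omega\left(\max\left\{\frac{2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right)$$

many samples in the worst case of $p$.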

### 3.1 The Upper Bound

The high-level idea of our testing algorithm is to convert $(X,d)$ into a tree metric space $(T,d_T)$ whose metric $d_T$ dominates $d$ when restricted to $X$, hence $W_{d_T}(p,q) \ge W_d(p,q)$. This means that any pair of distributions that is far apart under $W_d$ is also far apart under $W_{d_T}$, so a tester which works on the tree metric also works on the original space. More specifically, we use the nets of the metric space $(X,d)$ to construct $T$.

Recall that is a sequence of -net of . For each , we define . Denote and .

We convert the metric space $(X,d)$ to a tree metric $(T,d_T)$ in the following way. Let (with replacement, see the figure below) where every corresponds to a leaf of $T$. There are many levels of internal nodes, and every node in the $i$-th level of the tree represents a point in $N_i$. For every leaf, add an edge to its parent with weight $2^l$. For each internal node in the $i$-th level, add an edge to its parent with weight $2^{i+1}$. Since the diameter of $X$ is $D$, the top level contains only one point, which is the root of $T$. Define the tree metric $d_T(x,y)$ to be the sum of the weights of the edges on the unique tree path from $x$ to $y$. We convert $p$ ($q$) into a distribution on $T$ that is supported on the leaves of $T$ with the same probability mass. With a slight abuse of notation, we also use $p$ ($q$) to denote the transformed distribution on the leaves of $T$.

For $x \in N_i$, let $\tilde{p}_i(x)$ (resp. $\tilde{q}_i(x)$) denote the sum of the probability mass of all leaves in the subtree rooted at $x$. Then $\tilde{p}_i$ and $\tilde{q}_i$ can be regarded as distributions over $N_i$.

Having defined the distributions induced by the well-separated -nets, we are ready to give the algorithm that solves the problem below.

If then . Moreover,

$$W_{d_T}(p,q)\ge W_d(p,q) \tag{1}$$

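[Proof of Lemma 3.1] The construction is deterministic, so equality of $p$ and $q$ on $X$ implies equality of the transformed distributions on $T$. To prove (1), we only need to show $d_T(x,y) \ge d(x,y)$ for every $x, y \in X$. Assume the lowest common ancestor of $x, y$ in $T$ is in the $j$-th level and the internal nodes along the unique tree path from $x$ to $y$ are $z_l, \dots, z_j = w_j, \dots, w_l$, where $z_i, w_i \in N_i$. Then by the triangle inequality and the construction of $T$,

$$\begin{aligned}
d(x,y) &\le d(x,z_l)+\sum_{i=0}^{j-l-1}d(z_{l+i},z_{l+i+1})+\sum_{i=0}^{j-l-1}d(w_{j-i},w_{j-i-1})+d(w_l,y)\\
&\le 2^{l}+2^{l+1}+\dots+2^{j}+2^{j}+2^{j-1}+\dots+2^{l}\\
&= d_T(x,z_l)+\sum_{i=0}^{j-l-1}d_T(z_{l+i},z_{l+i+1})+\sum_{i=0}^{j-l-1}d_T(w_{j-i},w_{j-i-1})+d_T(w_l,y)\\
&= d_T(x,y).
\end{aligned}$$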

We have the following simple characterization of the Wasserstein distance w.r.t. $d_T$. This lemma shows that we can in fact convert the problem into several sub-problems in $L_1$-distance.

$$W_{d_T}(p,q)=2^{l}L_1(p,q)+\sum_{i=l}^{r-1}2^{i+1}L_1(\tilde{p}_i,\tilde{q}_i) \tag{2}$$

where $L_1(\cdot,\cdot)$ is the $L_1$-distance between two probability distributions with the same support.

[Proof of Lemma 3.1] Consider an edge $e$ which connects a node $v$ in the $i$-th level and its father; it has weight $2^{i+1}$. Since the probability masses of $p$ and $q$ on the leaves inside the subtree rooted at $v$ differ by $|\tilde{p}_i(v)-\tilde{q}_i(v)|$, exactly this much probability mass is transported along $e$, which contributes $2^{i+1}|\tilde{p}_i(v)-\tilde{q}_i(v)|$ to the Wasserstein distance. Summing over all edges, we have

$$W_{d_T}(p,q)=2^{l}L_1(p,q)+\sum_{i=l}^{r-1}2^{i+1}L_1(\tilde{p}_i,\tilde{q}_i),$$

where we note that every leaf has an edge incident on it with weight $2^l$.
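The edge-by-edge argument in this proof is the standard way to compute the Wasserstein distance on any tree metric: each edge contributes its weight times the absolute difference of the two subtree masses below it. The sketch below implements that computation for a generic rooted tree. The dictionary-based tree encoding and the function name are assumptions made for this example, and matching identity (2) term by term presumes that $L_1$ there is the plain sum of absolute differences.

```python
def tree_wasserstein(parent, weight, p, q):
    """Wasserstein distance between p and q on a rooted weighted tree.

    parent: dict node -> parent node (None for the root); lists every node.
    weight: dict node -> weight of the edge (node, parent[node]).
    p, q:   dicts node -> probability mass (missing nodes have mass 0).
    """
    def depth(v):
        steps = 0
        while parent[v] is not None:
            v = parent[v]
            steps += 1
        return steps

    # Net surplus of p over q inside each subtree, filled bottom-up.
    surplus = {v: p.get(v, 0.0) - q.get(v, 0.0) for v in parent}
    cost = 0.0
    for v in sorted(parent, key=depth, reverse=True):
        if parent[v] is not None:
            # |surplus| units of mass must cross the edge above v.
            cost += weight[v] * abs(surplus[v])
            surplus[parent[v]] += surplus[v]
    return cost
```

For two point masses sitting on different leaves, the result equals the tree distance between those leaves, as expected.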

Now we can prove the correctness of Algorithm 1.

[Proof of Upper Bound in Theorem 3] By Corollary 2 and the median trick, many samples suffice to test versus with probability at least . Choose such that , then by union bound, with probability at least , all testers succeed, and when all sub-testers succeed, we are guaranteed to report a correct answer.

When , we have for each and , thus with probability at least , every sub-tester accepts.

When , by and , one has

therefore, there is some such that , so the corresponding tester rejects, and the algorithm rejects, with probability at least . To satisfy the sample complexity of all sub-testers, the overall upper bound is finally given by,

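$$O\left(\max\left\{\frac{(r-l)^{3}2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : l\le i\le r-1\right\}\right)=\tilde{O}\left(\max\left\{\frac{2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right).$$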
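The structure of the argument above can be summarized as: run one $L_1$ sub-tester per term of identity (2), and accept only if all of them accept. The sketch below shows that control flow only. It assumes a hypothetical callable `l1_tester(samples, target, proximity)` supplied by the caller, and it sets each proximity parameter by a pigeonhole reading of (2): if the weighted terms sum to at least $\varepsilon$, one of the $k$ terms is at least $\varepsilon/k$. The exact parameters used in the paper are not reproduced here, and probability amplification is left to the sub-tester.

```python
def chained_identity_test(terms, eps, l1_tester):
    """Accept iff every per-level L1 sub-tester accepts.

    terms:     list of (edge_weight, samples, target) tuples, one per term
               of the tree-metric identity (leaf level plus each net level).
    eps:       Wasserstein proximity parameter.
    l1_tester: hypothetical callable (samples, target, proximity) -> bool,
               returning True to accept.
    """
    if not terms:
        raise ValueError("need at least one level")
    share = eps / len(terms)  # pigeonhole share of eps per term
    for edge_weight, samples, target in terms:
        # A term edge_weight * L1 >= share means L1 >= share / edge_weight.
        if not l1_tester(samples, target, share / edge_weight):
            return False
    return True
```

Any concrete $L_1$ identity tester with the assumed signature can be plugged in; the wrapper itself adds no samples beyond what the sub-testers draw.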

### 3.2 Lower Bound over General Metric Space

We prove the worst-case lower bound of sample complexity for the problem , which completes the proof of Theorem 3.

[Proof of Lower Bound in Theorem 3] Let . Denote the number of points in and . We then show how to convert the identity testing problem on in distance to the Wasserstein identity testing problem on .

On testing if an unknown distribution is identical to in -distance where and are supported on . We make the following transformation: let be a distribution supported on such that , and construct a distribution supported on by using such that a sample from is mapped into a sample . Hence by construction.

So if then while if then (recall is an -packing of ). So if we can test versus by using samples, we can distinguish from by using samples, which contradicts the existing worst-case lower bound in Corollary 2.

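Hence an algorithm which solves the problem for every distribution $p$ uses at least

$$\Omega\left(\max\left\{\frac{2^{2i}|N_i|^{1/2}}{\varepsilon^{2}} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right)$$

many samples in the worst case (over the choices of the distribution $p$).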

### 3.3 Nearly Optimal Instance Sample Complexity Provided the ”Doubling Condition”

In this section, we characterize nearly optimal instance sample complexity of Problem 1, by additionally assuming that satisfies the ”Doubling Condition” (Definition 1.1). For convenience, we define some new notations.

Let $(X,d)$ be a metric space endowed with a distribution $p$. Assume $D$ is the diameter of $X$, and for every $i$, $N_i$ is a well separated $2^i$-net of $X$.

For every , define . For every , define the clustering of to be . Let . Let . we can regard as a discrete distribution on . With a little abuse of notation, for every , let .

Assume is a metric space endowed with a probability distribution , is a well separated -net and are constructed as in Definition 3.3. Assume , then for every ,

$$p(B(x_j,2^{i-1}))\le p_i(j)\le p(B(x_j,2^{i})). \tag{3}$$

[Proof of Lemma 3.3] We only need to prove that . Recall that $N_i$ is a $2^i$-net as well as a $2^i$-packing of $X$.

For every , if is clustered to some , then by definition . Hence we have which contradicts the fact that is a -packing. So we have, .

For every , note that is a -net, so there is some such that which means . So we have .
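The proof above treats the clustering as assigning each point to a nearest net point, so that $p_i(j)$ is the mass of the cluster of $x_j$. Under that reading, the clustered distribution of a finitely supported $p$ can be computed as in the following sketch. The function name, the list-based representation and the tie-breaking rule (lowest net index wins) are assumptions made for illustration.

```python
def clustered_distribution(points, masses, net, dist):
    """Mass of each nearest-neighbour cluster around the net points.

    points, masses: support points of p and their probability masses.
    net:            list of net points x_1, ..., x_n.
    dist:           metric on the underlying space.
    Returns a list whose j-th entry is the mass of the cluster of net[j].
    """
    cluster_mass = [0.0] * len(net)
    for x, m in zip(points, masses):
        # Nearest net point; min() keeps the lowest index on ties.
        j = min(range(len(net)), key=lambda k: dist(x, net[k]))
        cluster_mass[j] += m
    return cluster_mass
```

With five equally weighted points 0, 1, 2, 3, 4 on the line and the net 0, 2, 4 at scale 1, the first cluster has mass 0.4, which lies between the mass of the ball of radius 1/2 around 0 (0.2) and the mass of the ball of radius 1 around 0 (0.4), in line with (3).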

The reader may naturally ask why we use $p_i$ instead of the $\tilde{p}_i$ defined in the proof of Theorem 3. From a technical perspective, the answer is that we cannot obtain upper and lower bounds for $\tilde{p}_i$ as good as (3), and these bounds are essential in the proof of the instance lower bound.

Let $(X,d)$ be a metric space endowed with a distribution $p$, and suppose $p$ satisfies the "doubling condition" (Definition 1.1). Let $N_i$ be a sequence of well separated $2^i$-nets of $X$. There is an algorithm which, given sample access to an unknown distribution $q$ over $X$,

• accepts with probability at least if ;

• rejects with probability at least if .

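Let $p_i$ be as defined in Definition 3.3. Then the sample complexity of this algorithm is

$$\tilde{O}\left(\max\left\{\max\left\{2^{2i}\varepsilon^{-2}\left\|p_i(\cdot)\right\|_{2/3},\ 2^{i}\varepsilon^{-1}\right\} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right).$$

Moreover, any algorithm for this task requires

$$\Omega\left(\max\left\{\max\left\{2^{2i}\varepsilon^{-2}\left\|p_i(\cdot)^{-\max}_{-2^{-i}\varepsilon}\right\|_{2/3},\ 2^{i}\varepsilon^{-1}\right\} : \log\frac{\varepsilon}{8}\le i\le \log D\right\}\right)$$

many samples. Here $p_i(\cdot)^{-\max}_{-2^{-i}\varepsilon}$ represents the probability vector obtained by removing the element with the largest probability mass and then repeatedly removing the element with the smallest probability mass until $2^{-i}\varepsilon$ mass is removed.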

[Proof of Theorem 3.3] Firstly, we prove the upper bound, which is relatively simpler. We proceed as in the proof of Theorem 3 to construct distributions for every . Then we use an instance version of Algorithm 1 by using instance optimal version subtester from Theorem 2 instead of the worst case version. By Theorem 2 and the union bound, we know that samples can guarantee each subtester succeed with probability at least , then Algorithm 2 works by the same reason.

The only remaining work is to convert $\tilde{p}_i$ into $p_i$ in the sample complexity. Recall that for $x \in N_i$, $\tilde{p}_i(x)$ is the sum of the probability mass on the leaves inside the subtree rooted at $x$. Now note that any such leaf is within distance $2^{i+1}$ of $x$ by construction, which means every leaf inside the subtree rooted at $x$ is contained in the ball $B(x,2^{i+1})$. So by the doubling condition and Lemma 3.3,

$$\tilde{p}_i(x)\le p(B(x,2^{i+1}))\le C\,p(B(x,2^{i}))\le C^{2}p(B(x,2^{i-1}))\le C^{2}p_i(x). \tag{4}$$

Since $C$ is a universal constant, we can bound $\tilde{p}_i$ by $p_i$ in big-O notation.

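Now we turn to the proof of the lower bound. By Theorem 2, we know any algorithm which tests identity to $p_i$ in $L_1$-distance with proximity parameter requires

$$\Omega\left(\max\left\{2^{-i}\varepsilon^{-1},\ \varepsilon^{-2}2^{2i}\left\|p_i{}^{-\max}_{-\varepsilon}\right\|_{2/3} : i\in\left[\log\frac{\varepsilon}{8},\log D\right]\right\}\right)$$

many samples.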

For an unknown discrete distribution on , we show how to convert to a distribution on so as we can reduce the problem of identity testing to in distance to the problem of identity testing to in Wasserstein distance. We assume in what follows.

Being precise, fix a , every time we need to take a sample from , we do the following: first take a sample from . With probability , is regarded as the sample of . With probability , is regarded as the sample of .

Obviously we have that .

$$W_d(p,q^*)\ge\sum_{j\in[n]}2^{i-1}\left|p(B(x_j,2^{i-1}))-q^*(B(x_j,2^{i-1}))\right| \tag{5}$$

[Proof of Lemma 3.3] The Wasserstein distance is the cost of transporting probability mass from $p$ to $q^*$. Recall that $N_i$ is a $2^i$-packing, hence the balls $B(x_j,2^{i-1})$ do not intersect each other. That means for every $j \in [n]$, $|p(B(x_j,2^{i-1}))-q^*(B(x_j,2^{i-1}))|$ probability mass needs to be transported into or out of the ball $B(x_j,2^{i-1})$. By construction, the probability mass of $q^*$ is concentrated on $N_i$, hence the cost of transporting a unit of probability mass is at least $2^{i-1}$. Summing over $j$, we have

$$W_d(p,q^*)\ge\sum_{j\in[n]}2^{i-1}\left|p(B(x_j,2^{i-1}))-q^*(B(x_j,2^{i-1}))\right|. \tag{6}$$

So by Lemma 3.3, the construction, the doubling condition and Lemma 3.3 respectively,

 Wd(p,q∗) ≥ ∑j∈[n]2i−1|p(B(xj,2i−1))−q∗(B(xj,2i−1))| = ∑j∈[n]2i−1p(B(xj