1 Introduction
Property testing was proposed in the seminal work of Goldreich et al. (1998); broadly, it is the study of designing and analyzing randomized decision algorithms that efficiently decide whether a given instance has a certain property or is far from having it. Significantly, the query complexity of an efficient property testing algorithm is often sublinear in the size of the instance it accesses.
In recent years, distribution property testing has received much attention in theoretical computer science research. In most problems of distribution property testing, the input is a set of independent samples from an unknown distribution, and the task is to decide whether the distribution has a certain property or not. Researchers have investigated the sample complexity of testing various distribution properties, such as uniformity, identity to a given distribution, closeness, having small entropy, having small support, being uniform on a small subset, and so on (Goldreich and Ron (2011), Paninski (2008), Chan et al. (2014), Valiant and Valiant (2014), Diakonikolas et al. (2014), Valiant and Valiant (2010b), Valiant and Valiant (2010a), Batu and Canonne (2017), Diakonikolas et al. (2017), Diakonikolas and Kane (2016)).
In this paper, we focus on the problem of identity testing. Arguably, identity testing, together with its special case of uniformity testing, is the best-studied problem in distribution property testing. In identity testing, we are given sample access to an unknown distribution $p$, the explicit description of a known distribution $q$, and a proximity parameter $\epsilon$. We are then required to distinguish the following two cases: 1) $p$ is identical to $q$; 2) a certain distance (e.g., the $\ell_1$ distance, the Hellinger distance, or the Wasserstein distance) between $p$ and $q$ is larger than $\epsilon$.
The sample complexity of identity testing in $\ell_1$ distance (equivalently, statistical distance or total variation distance) is now fully understood thanks to a series of works (Goldreich and Ron (2011), Paninski (2008), Chan et al. (2014)). Specifically, testing whether a distribution supported on $[n]$ is uniform with proximity parameter $\epsilon$ in $\ell_1$ distance requires $\Theta(\sqrt{n}/\epsilon^2)$ many samples (Paninski (2008), Valiant and Valiant (2014)). However, when the support is continuous, the bound above becomes meaningless. For example, the natural problem of testing whether a distribution supported on $[0, 1]$ is uniform in $\ell_1$ distance would require an infinite number of samples.
Motivated by these issues, we study the testing problem under a probability distance that metrizes weak convergence (convergence in $\ell_1$ distance, by contrast, is a strong notion of convergence). A popular choice is the Wasserstein distance (a.k.a. transportation distance or earth mover's distance, see Definition 1). Using the Wasserstein distance, identity testing is well defined in an arbitrary, even continuous, metric space (with Borel point sets of positive finite measure). We also note that using the Wasserstein distance as the defining metric has gained significant attention in the machine learning community recently (e.g., in generative models Arjovsky et al. (2017) and mixture models Li et al. (2015)).

[Wasserstein Distance] Let $p, q$ be two distributions supported on a metric space $(X, d)$. The Wasserstein distance (or transportation distance) between $p$ and $q$ with respect to $d$ is defined to be
$$W(p, q) \;=\; \inf_{\gamma \in \Gamma(p, q)} \mathbb{E}_{(x, y) \sim \gamma}\,[d(x, y)],$$
where $\Gamma(p, q)$ is the set of all coupling distributions of $p$ and $q$, i.e., all distributions on $X \times X$ that have marginal distributions $p$ and $q$.
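To make the definition concrete, here is a minimal numerical illustration (not part of the paper's algorithms) using SciPy's exact 1-D solver; on the real line the optimal coupling has a closed form, namely the area between the two cumulative distribution functions.

```python
# Wasserstein (earth mover's) distance between two discrete distributions on
# the metric space ([0, 1], |x - y|), computed exactly by SciPy in the 1-D case.
import numpy as np
from scipy.stats import wasserstein_distance

support = np.linspace(0.0, 1.0, 11)            # 11 equally spaced atoms in [0, 1]
p = np.full(11, 1.0 / 11)                      # the uniform distribution
q = np.zeros(11); q[0], q[-1] = 0.5, 0.5       # half the mass on each endpoint

# Prints the cost of transporting uniform mass to the two endpoints.
print(wasserstein_distance(support, support, u_weights=p, v_weights=q))
```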
We define the problem of Wasserstein identity testing as follows.
[Wasserstein Identity Testing, $\mathrm{WIT}(X, d, q, \epsilon)$] Let $(X, d)$ be a metric space and $q$ a distribution on $X$. For a proximity parameter $\epsilon > 0$, $\mathrm{WIT}(X, d, q, \epsilon)$ denotes the problem of designing an algorithm which, given sample access to an unknown distribution $p$ on $X$,

accepts with probability at least $2/3$ if $p = q$;

rejects with probability at least $2/3$ if $W(p, q) \geq \epsilon$.

Moreover, we investigate sample complexity lower bounds for such algorithms.
For notational convenience, we write $\mathrm{WIT}(q, \epsilon)$ for short when there is no risk of ambiguity. Denote by $U_X$ the uniform distribution on $X$; then $\mathrm{WIT}(X, d, U_X, \epsilon)$ is the Wasserstein uniformity testing problem. When $X$ is not discrete, the meaning of "the explicit description of $q$" is unclear. However, whenever $(X, d)$ is separable, we can approximate $q$, for any $\delta > 0$, by a distribution supported on a countable $\delta$-net of $X$. This transformation makes the problem discrete. To make life even easier, in what follows we assume throughout that $q$ is discrete and that the explicit description of $q$ is well defined.

Testing versus Learning
A direct approach to identity testing is to learn the distribution. Specifically, to test whether an unknown distribution $p$ is uniform with proximity parameter $\epsilon$, we can estimate $p$ by the empirical distribution $\hat{p}$ so that the distance between $p$ and $\hat{p}$ is less than $\epsilon/2$; we then accept if the distance between $\hat{p}$ and the uniform distribution is less than $\epsilon/2$, and reject otherwise. A tester is efficient if it uses fewer samples than estimating $p$ by the empirical distribution, and for statistical efficiency we seek such efficient testers. For example, if the support is $[n]$ and the distance is the $\ell_1$ distance, the sample complexity of learning is $\Theta(n/\epsilon^2)$ (see e.g. Devroye and Lugosi (2001)), while the sample complexity of testing is $\Theta(\sqrt{n}/\epsilon^2)$ (Paninski (2008), Valiant and Valiant (2014)).
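For illustration, here is a minimal sketch of this learn-then-test baseline on $[n]$; the constant in the sample size is illustrative, not optimized.

```python
# Naive "learn then test" uniformity tester on [n] in l1 distance: learn the
# empirical distribution to accuracy eps/2, then compare it to uniform.
import numpy as np

def learn_then_test_uniform(samples: np.ndarray, n: int, eps: float) -> bool:
    empirical = np.bincount(samples, minlength=n) / len(samples)
    return np.abs(empirical - 1.0 / n).sum() < eps / 2   # accept iff l1-close

rng = np.random.default_rng(0)
n, eps = 1000, 0.25
m = int(16 * n / eps**2)            # Theta(n / eps^2) learning-rate sample size
print(learn_then_test_uniform(rng.integers(0, n, size=m), n, eps))  # -> True
```

An efficient tester must beat this $\Theta(n/\epsilon^2)$ baseline, which the $\Theta(\sqrt{n}/\epsilon^2)$ testers above do quadratically.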
In our case of Wasserstein identity testing, consider the natural metric space: the $d$-dimensional hypercube $[0, 1]^d$ equipped with the Euclidean metric. The Wasserstein Law of Large Numbers (see e.g. van Handel (2014)) shows that, for $d \geq 3$, $\Theta(\epsilon^{-d})$ many samples are sufficient and necessary to estimate the distribution up to $\epsilon$ in Wasserstein distance. Hence we automatically obtain a tester with sample complexity $O(\epsilon^{-d})$ for the Wasserstein uniformity testing problem on $[0, 1]^d$. On the other hand, Corollary 2 gives a tester for this problem with substantially smaller sample complexity.

The Chaining Method
The primary technique in this paper is to choose a sequence of nets and then decompose the original testing problem into multiple easier subproblems according to those nets. This technique is closely related to Talagrand's "chaining method", which plays a central role in proving upper and lower bounds on stochastic processes (M. Talagrand (2014)).
1.1 Main Contributions
Our first contribution is a characterization of the worst-case sample complexity of $\mathrm{WIT}(X, d, q, \epsilon)$ over arbitrary metric spaces, via a nearly optimal upper bound and a matching lower bound.
Let $(X, d)$ be a metric space endowed with a distribution $q$, and let $\Delta$ be its diameter. Let $\{N_i\}$ be a sequence of well-separated $2^{-i}\Delta$-nets of $X$ (see Definition 2). There is an algorithm which, given sample access to an unknown distribution $p$ over $X$ and a proximity parameter $\epsilon$,

accepts with probability at least $2/3$ if $p = q$;

rejects with probability at least $2/3$ if $W(p, q) \geq \epsilon$.
The sample complexity of this algorithm is
Moreover, any algorithm which distinguishes the two cases for any fixed $q$ and unknown $p$ with probability at least $2/3$ takes
many samples in the worst case.
In fact, Theorem 1.1 is a worst-case result for the problems in Definition 1: the sample complexity bound is oblivious to the target distribution $q$. One may wonder whether we can obtain an instance bound which is nearly optimal for every $q$, like the one appearing in Valiant and Valiant (2014). We show that if the distribution $q$ is not too singular (e.g., not highly concentrated on one point), which we formalize by the following "doubling condition" (see Definition 1.1), then we can obtain nearly instance-optimal sample complexity bounds (see Theorem 3.3).
[Doubling Condition] Let $(X, d)$ be a metric space and $q$ be a distribution on $X$. For $x \in X$ and $r > 0$, denote the ball $B(x, r) = \{y \in X : d(x, y) \leq r\}$. We say $q$ satisfies the "doubling condition" if there exists a constant $C$ such that $q(B(x, r)) \leq C \cdot q(B(x, r/2))$ for every $x \in \mathrm{supp}(q)$ and $r > 0$, where $\mathrm{supp}(q)$ denotes the support of $q$.
Why the Doubling Condition?
The doubling dimension was introduced in Assouad (1983) and D.G. Larman (1967) and has become a popular complexity measure for metric spaces. Definition 1.1 gives a counterpart of this notion for a metric space endowed with a distribution, tailored to our use. Roughly speaking, regarding $q$ as a measure on $X$, the "doubling condition" says that every ball's volume is upper bounded by a universal constant times the volume of the ball with the same center but half the radius. It measures the complexity of the distribution $q$: a distribution satisfying the "doubling condition" behaves somewhat like the uniform distribution on a compact subset of Euclidean space.
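As a quick numerical sanity check (hypothetical, not part of the paper), one can estimate the doubling ratio $q(B(x, r))/q(B(x, r/2))$ from samples; for the uniform distribution on $[0, 1]$ the ratio never exceeds $2$.

```python
# Empirical doubling ratios q(B(x, r)) / q(B(x, r/2)) for q = Uniform[0, 1],
# estimated from samples over a grid of centers x and radii r.
import numpy as np

points = np.random.default_rng(1).uniform(0.0, 1.0, size=100_000)

def ball_mass(x: float, r: float) -> float:
    return float(np.mean(np.abs(points - x) <= r))

ratios = [ball_mass(x, r) / ball_mass(x, r / 2)
          for x in np.linspace(0.1, 0.9, 9)
          for r in (0.05, 0.1, 0.2)]
print(max(ratios))   # close to 2, so the doubling constant C is about 2 here
```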
Since the uniform distribution on a compact set (e.g., the hypercube $[0, 1]^d$ or the unit ball) of Euclidean space satisfies the doubling condition, as an interesting and important corollary (see Corollary 2), we obtain the sample complexity of Wasserstein uniformity testing on the hypercube and on the unit ball.
1.2 Other Related Work
There are also recent papers on identity or uniformity testing beyond the classical problem of $\ell_1$ testing. Batu and Canonne (2017) presented the generalized uniformity testing problem, which asks whether a discrete distribution we are sampling from is uniform on its support. Diakonikolas et al. (2017) then investigated the exact sample complexity of this problem. On testing in other distribution distances, Daskalakis et al. (2017) gave characterizations of the sample complexity of identity testing in a variety of distances besides the $\ell_1$ distance.
The study of metric spaces has a long history; we refer to Deza and Laurent (2009) for a complete and in-depth treatment. The doubling dimension was introduced in Assouad (1983) and D.G. Larman (1967); in the theoretical computer science community, it was first used in Clarkson (1997) in the context of nearest neighbor search.
Chaining is an efficient way of proving union bounds for collections of possibly dependent random variables. The study of chaining dates back to Kolmogorov's study of Brownian motion. M. Talagrand (2014) is a highly recommended book on the application of chaining methods in modern probability theory. In recent years, the chaining method has found many applications in theoretical computer science; we refer to the course notes at Harvard (2016) as an introduction to chaining methods in theoretical computer science.

2 Preliminaries
We first define some notation for metric spaces. Let $(X, d)$ be a metric space, where $X$ is a ground set and $d$ is a metric on $X$ which satisfies:

$d(x, y) \geq 0$ for all $x, y \in X$, with equality if and only if $x = y$;

$d(x, y) = d(y, x)$ for all $x, y \in X$;

Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$ for all $x, y, z \in X$.

The diameter of $(X, d)$ is defined as $\Delta = \sup_{x, y \in X} d(x, y)$. For a distribution $p$ supported on $X$, by a sample from $p$ we mean a point in $X$ drawn according to $p$. For a subset $S \subseteq X$, $p(S) = \Pr_{x \sim p}[x \in S]$.
The following classical definitions of nets and packings are essential in this paper.
[$\epsilon$-net, $\epsilon$-packing, and well-separated $\epsilon$-net] Let $(X, d)$ be a metric space and $\epsilon > 0$. A subset $N \subseteq X$ is called an $\epsilon$-net of $X$ if for every $x \in X$ there is some $y \in N$ with $d(x, y) \leq \epsilon$.
A subset $P \subseteq X$ is called an $\epsilon$-packing of $X$ if $d(x, y) > \epsilon$ for all distinct $x, y \in P$.
A subset of $X$ is called a well-separated $\epsilon$-net of $X$ if it is an $\epsilon$-net as well as an $\epsilon$-packing of $X$.
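A maximal $\epsilon$-packing is automatically an $\epsilon$-net, so a well-separated $\epsilon$-net can be constructed greedily; the following sketch (a hypothetical helper for finite point sets in Euclidean space) does exactly that.

```python
# Greedy construction of a well-separated eps-net: scan the points and keep any
# point farther than eps from everything kept so far. The result is an
# eps-packing by construction and an eps-net by maximality.
import numpy as np

def well_separated_net(points: np.ndarray, eps: float) -> list:
    chosen = []
    for i in range(len(points)):
        if all(np.linalg.norm(points[i] - points[j]) > eps for j in chosen):
            chosen.append(i)
    return chosen                      # indices of the net points

pts = np.random.default_rng(0).uniform(0, 1, size=(500, 2))  # points in [0,1]^2
print(len(well_separated_net(pts, eps=0.2)))
```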
The following lemma shows the duality between nets and packings (see e.g. van Handel (2014)). Let $(X, d)$ be a metric space. Let $N(\epsilon)$ denote the minimum size of an $\epsilon$-net of $X$ and let $P(\epsilon)$ denote the maximum size of an $\epsilon$-packing of $X$. Then we have
$$P(2\epsilon) \;\leq\; N(\epsilon) \;\leq\; P(\epsilon).$$
To acknowledge the great importance of the work of Valiant and Valiant (2014), we restate their core theorem here and show how it implies other worst-case bounds.
[Valiant and Valiant (2014)] There exists an algorithm such that, when given sample access to an unknown distribution $p$ and the full description of $q$, both supported on $[n]$, it uses $O\big(\max\{\tfrac{1}{\epsilon}, \tfrac{\|q^{-\max}_{-\epsilon}\|_{2/3}}{\epsilon^2}\}\big)$ samples from $p$ to distinguish $p = q$ from $\|p - q\|_1 \geq \epsilon$ with success probability at least $2/3$. Moreover, any such algorithm requires $\Omega\big(\max\{\tfrac{1}{\epsilon}, \tfrac{\|q^{-\max}_{-\epsilon}\|_{2/3}}{\epsilon^2}\}\big)$ samples from $p$ (up to constant-factor changes in $\epsilon$).
The worst-case upper and lower bounds of Theorem 2 are attained at $q = U_{[n]}$, the uniform distribution on $[n]$, for which $\|U_{[n]}\|_{2/3} = \sqrt{n}$.
There exists an algorithm such that, when given sample access to an unknown distribution $p$ and the full description of $q$, both supported on $[n]$, it uses $O(\sqrt{n}/\epsilon^2)$ samples from $p$ to distinguish $p = q$ from $\|p - q\|_1 \geq \epsilon$ with success probability at least $2/3$. Moreover, any such algorithm requires $\Omega(\sqrt{n}/\epsilon^2)$ samples in the worst case over the choices of $q$.
3 Wasserstein Identity Testing
First, we restate Theorem 1.1 and give its proof in two parts.
Let $(X, d)$ be a metric space endowed with a distribution $q$, and let $\Delta$ be its diameter. Let $\{N_i\}$ be a sequence of well-separated $2^{-i}\Delta$-nets of $X$. There is an algorithm which, given sample access to an unknown distribution $p$ over $X$ and a proximity parameter $\epsilon$,

accepts with probability at least $2/3$ if $p = q$;

rejects with probability at least $2/3$ if $W(p, q) \geq \epsilon$.
The sample complexity of this algorithm is
Moreover, any algorithm which distinguishes the two cases for any fixed $q$ and unknown $p$ with probability at least $2/3$ takes
many samples in the worst case over the choices of $q$.
3.1 The Upper Bound
The high-level idea of our testing algorithm is to convert $(X, d)$ into a tree metric space $(T, d_T)$, with $X$ contained in the leaves of $T$ and $d_T \geq d$ when the latter is restricted to $X$; hence $W_{d_T}(p, q) \geq W_d(p, q)$. This means identity testing to $q$ in $(T, d_T)$ is at least as hard as in $(X, d)$, so a tester which works on $(T, d_T)$ also works on $(X, d)$. More specifically, we use the nets of the metric space $(X, d)$ to construct $(T, d_T)$.
Recall that $\{N_i\}_{i=0}^{m}$ is a sequence of well-separated $2^{-i}\Delta$-nets of $X$. For each $i$, we write $n_i = |N_i|$.
We convert the metric space $(X, d)$ to a tree metric $(T, d_T)$ in the following way. Let the node set of $T$ consist of $N_0, N_1, \ldots, N_m$ together with $X$ (with replacement, see the figure below), where every $x \in X$ corresponds to a leaf of $T$. There are $m + 1$ levels of internal nodes; every node in the $i$-th level of the tree represents a point in $N_i$. For every leaf $x$, add an edge to the nearest node of $N_m$, with weight $2^{-m}\Delta$. For each internal node $v$ in the $i$-th level ($i \geq 1$), add an edge to the nearest node of $N_{i-1}$, with weight $2^{-(i-1)}\Delta$. Since the diameter of $X$ is $\Delta$, $N_0$ contains only one point, which is the root of $T$. Define the tree metric $d_T(x, y)$ to be the sum of the weights of the edges on the unique tree path from $x$ to $y$. Convert $p$ (resp. $q$) into a distribution supported on the leaves of $T$ carrying the same probability mass. With a slight abuse of notation, we also use $p$ (resp. $q$) to denote the transformed distributions on the leaves of $T$.
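The construction can be sketched in code as follows (a hypothetical sketch for a finite point set in Euclidean space, reusing the `well_separated_net` helper from Section 2); node keys carry the level, which implements the "with replacement" convention.

```python
# Sketch of the tree metric T: level i holds a well-separated (2^-i * Delta)-net,
# each node is attached to its nearest node one level up with edge weight
# 2^-(i-1) * Delta, and every original point hangs below the bottom level.
import numpy as np

def build_tree(points: np.ndarray, m: int):
    delta = max(np.linalg.norm(a - b) for a in points for b in points)
    nets = [well_separated_net(points, (2.0 ** -i) * delta) for i in range(m + 1)]
    parent, weight = {}, {}            # keyed by (level, point index)
    for i in range(1, m + 1):
        for v in nets[i]:
            u = min(nets[i - 1], key=lambda w: np.linalg.norm(points[v] - points[w]))
            parent[(i, v)], weight[(i, v)] = (i - 1, u), (2.0 ** -(i - 1)) * delta
    for x in range(len(points)):       # leaves: the points of X themselves
        u = min(nets[m], key=lambda w: np.linalg.norm(points[x] - points[w]))
        parent[("leaf", x)], weight[("leaf", x)] = (m, u), (2.0 ** -m) * delta
    return nets, parent, weight        # nets[0] is a single root since diam = Delta
```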
For $v \in N_i$, let $p_i(v)$ (resp. $q_i(v)$) denote the sum of the probability mass of all leaves in the subtree rooted at $v$. Then $p_i$ (resp. $q_i$) can be regarded as a distribution over $N_i$.
Having defined the distributions induced by the well-separated nets, we are ready to give the algorithm that solves the problem below.
If $p = q$ then the transformed distributions on the leaves of $T$ are identical. Moreover,
$$d_T(x, y) \;\geq\; d(x, y) \quad \text{for every } x, y \in X. \tag{1}$$
[Proof of Lemma 3.1] The construction is deterministic, so we know $p = q$ implies that the transformed distributions are identical. To prove (1), we only need to show $d_T(x, y) \geq d(x, y)$ for every pair of leaves $x, y$. Assume the lowest common ancestor of $x, y$ in $T$ is in the $j$-th level, and the internal nodes along the unique tree path from $x$ to $y$ are $v_m, v_{m-1}, \ldots, v_j = u_j, \ldots, u_{m-1}, u_m$ where $v_i, u_i \in N_i$. By the construction of $T$, every edge $(a, b)$ on this path has weight at least $d(a, b)$ (each node is attached to a net point within the corresponding net radius), so by the triangle inequality,
$$d(x, y) \;\leq\; \sum_{(a, b) \in \mathrm{path}(x, y)} d(a, b) \;\leq\; \sum_{(a, b) \in \mathrm{path}(x, y)} w(a, b) \;=\; d_T(x, y).$$
We have the following simple characterization of the Wasserstein distance with respect to $d_T$. This lemma shows that we can actually convert the problem into subproblems in $\ell_1$ distance.
$$W_{d_T}(p, q) \;=\; 2^{-m}\Delta\,\|p - q\|_1 \;+\; \sum_{i=1}^{m} 2^{-(i-1)}\Delta\,\|p_i - q_i\|_1, \tag{2}$$
where $\|\cdot\|_1$ denotes the $\ell_1$ distance between two probability distributions with the same support.
[Proof of Lemma 3.1] Consider an edge which connects a node $v$ in the $i$-th level to its father; this edge has weight $2^{-(i-1)}\Delta$. Since the probability masses of $p$ and $q$ on the leaves inside the subtree rooted at $v$ differ by $|p_i(v) - q_i(v)|$, exactly $|p_i(v) - q_i(v)|$ probability mass must be transported along this edge, which produces cost $2^{-(i-1)}\Delta\,|p_i(v) - q_i(v)|$ in the Wasserstein distance. Summing over all edges, we obtain (2),
where we note that every leaf has an edge incident on it with weight $2^{-m}\Delta$, contributing the term $2^{-m}\Delta\,\|p - q\|_1$.
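Formula (2) makes the tree Wasserstein distance trivial to evaluate by one bottom-up pass. The sketch below does this on the dyadic tree over $[0, 1]$ (a simple stand-in for the net tree, with $\Delta = 1$ and edge weight $2^{-i}$ at level $i$; the weight convention differs from the construction above only by constant factors).

```python
# W_{d_T} on the dyadic tree over [0, 1]: leaves are the 2^m dyadic intervals
# and a level-i node joins its parent by an edge of weight 2^-i, so formula (2)
# becomes a sum of weighted l1 distances between merged mass vectors.
import numpy as np

def tree_wasserstein(p: np.ndarray, q: np.ndarray, m: int) -> float:
    total, pi, qi = 0.0, p.copy(), q.copy()
    for i in range(m, 0, -1):
        total += (2.0 ** -i) * np.abs(pi - qi).sum()  # edges from level i upward
        pi = pi.reshape(-1, 2).sum(axis=1)            # aggregate masses to level i-1
        qi = qi.reshape(-1, 2).sum(axis=1)
    return total

m = 8
p = np.full(2 ** m, 2.0 ** -m)        # uniform on [0, 1]
q = np.zeros(2 ** m); q[0] = 1.0      # point mass near 0
print(tree_wasserstein(p, q, m))      # ~1.33, upper-bounding W(p, q) = 0.5 on the line
```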
Now we can prove the correctness of Algorithm 1.
[Proof of Upper Bound in Theorem 3] By Corollary 2 and the median trick, $O\big(\frac{\sqrt{n_i}}{\epsilon_i^2}\log\frac{1}{\delta}\big)$ many samples suffice to test $p_i = q_i$ versus $\|p_i - q_i\|_1 \geq \epsilon_i$ with probability at least $1 - \delta$. Choose $m$ such that $2^{-m}\Delta \leq \epsilon/4$ and $\delta = \frac{1}{3(m+1)}$; then by the union bound, with probability at least $2/3$, all sub-testers succeed, and when all sub-testers succeed, we are guaranteed to report a correct answer.
When $p = q$, we have $p_i = q_i$ for each $i$; thus, with probability at least $2/3$, every sub-tester accepts.
When $W_d(p, q) \geq \epsilon$, by (1) and (2), one has
$$\sum_{i=1}^{m} 2^{-(i-1)}\Delta\,\|p_i - q_i\|_1 \;\geq\; W_{d_T}(p, q) - 2^{-m}\Delta\,\|p - q\|_1 \;\geq\; \epsilon - \epsilon/2 \;=\; \epsilon/2$$
(using $\|p - q\|_1 \leq 2$); therefore, since the proximity parameters $\epsilon_i$ are chosen so that $\sum_{i=1}^{m} 2^{-(i-1)}\Delta\,\epsilon_i < \epsilon/2$, there is some $i$ such that $\|p_i - q_i\|_1 \geq \epsilon_i$, so the corresponding sub-tester rejects, and hence the algorithm rejects, with probability at least $2/3$. The overall upper bound follows by summing the sample complexities of all sub-testers.
3.2 Lower Bound over General Metric Space
We now prove the worst-case lower bound on the sample complexity of the problem $\mathrm{WIT}(X, d, q, \epsilon)$, which completes the proof of Theorem 3.
[Proof of Lower Bound in Theorem 3] Fix a level $i$. Denote by $n_i$ the number of points in $N_i$ and write $N_i = \{x_1, \ldots, x_{n_i}\}$. We then show how to convert the identity testing problem on $[n_i]$ in $\ell_1$ distance into the Wasserstein identity testing problem on $(X, d)$.
Consider testing whether an unknown distribution $p'$ is identical to $q'$ in $\ell_1$ distance, where $p'$ and $q'$ are supported on $[n_i]$. We make the following transformation: let $q$ be the distribution supported on $N_i$ such that $q(x_j) = q'(j)$, and construct a distribution $p$ supported on $N_i$ from $p'$ by mapping a sample $j$ from $p'$ to a sample $x_j$ of $p$. Hence $p(x_j) = p'(j)$ by construction.
So if $p' = q'$ then $p = q$, while if $\|p' - q'\|_1 \geq \epsilon'$ then $W_d(p, q) \geq \frac{2^{-i}\Delta}{2}\,\epsilon'$ (recall that $N_i$ is a $2^{-i}\Delta$-packing of $X$, so every unit of probability mass transported between distinct points of $N_i$ travels a distance greater than $2^{-i}\Delta$, and at least $\epsilon'/2$ mass must be moved). So if we could solve the Wasserstein identity testing problem with too few samples, we could distinguish $p' = q'$ from $\|p' - q'\|_1 \geq \epsilon'$ with the same number of samples, which contradicts the existing worst-case lower bound in Corollary 2.
Hence an algorithm which solves the problem $\mathrm{WIT}(X, d, q, \epsilon)$ for every distribution $q$ uses at least
many samples in the worst case (over the choices of the distribution $q$).
3.3 Nearly Optimal Instance Sample Complexity Provided the "Doubling Condition"
In this section, we characterize the nearly optimal instance sample complexity of Problem 1, under the additional assumption that $q$ satisfies the "doubling condition" (Definition 1.1). For convenience, we introduce some new notation.
Let $(X, d)$ be a metric space endowed with a distribution $q$. Assume $\Delta$ is the diameter of $X$, and for every $i \in \{0, 1, \ldots, m\}$, $N_i = \{x_1, \ldots, x_{n_i}\}$ is a well-separated $2^{-i}\Delta$-net of $X$.
For every $x \in X$, define the clustering of $x$ to be the nearest point of $N_i$ (breaking ties arbitrarily). For every $x_j \in N_i$, define the cluster $S_{x_j} = \{x \in X : x \text{ is clustered to } x_j\}$. Let $\tilde{q}_i(j) = q(S_{x_j})$; we can regard $\tilde{q}_i$ as a discrete distribution on $[n_i]$. With a slight abuse of notation, for every $x_j \in N_i$, we also write $\tilde{q}_i(x_j) = \tilde{q}_i(j)$.
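In code, $\tilde{q}_i$ is just a nearest-net-point histogram; the following is a hypothetical sketch for a finite discretization of $X$.

```python
# The clustered distribution q~_i: every point of (discretized) X sends its
# q-mass to the nearest net point, which yields a distribution on N_i.
import numpy as np

def clustered_distribution(points: np.ndarray, q_mass: np.ndarray, net: list) -> np.ndarray:
    out = np.zeros(len(net))
    for x, mass in zip(points, q_mass):
        j = int(np.argmin([np.linalg.norm(x - points[v]) for v in net]))
        out[j] += mass                 # x is clustered to its nearest net point
    return out                         # q~_i as a vector indexed by N_i
```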
Assume $(X, d)$ is a metric space endowed with a probability distribution $q$, $N_i$ is a well-separated $2^{-i}\Delta$-net, and the clusters $S_{x_j}$ and the distribution $\tilde{q}_i$ are constructed as in Definition 3.3. Then for every $x_j \in N_i$,
$$q\big(B(x_j, 2^{-i-1}\Delta)\big) \;\leq\; \tilde{q}_i(x_j) \;\leq\; q\big(B(x_j, 2^{-i}\Delta)\big). \tag{3}$$
[Proof of Lemma 3.3] We only need to prove that $B(x_j, 2^{-i-1}\Delta) \subseteq S_{x_j} \subseteq B(x_j, 2^{-i}\Delta)$. Recall that $N_i$ is a $2^{-i}\Delta$-net as well as a $2^{-i}\Delta$-packing of $X$.
For every $x \in B(x_j, 2^{-i-1}\Delta)$, if $x$ is clustered to some $x_k \neq x_j$, then by definition $d(x, x_k) \leq d(x, x_j) \leq 2^{-i-1}\Delta$. Hence we have $d(x_j, x_k) \leq d(x_j, x) + d(x, x_k) \leq 2^{-i}\Delta$, which contradicts the fact that $N_i$ is a $2^{-i}\Delta$-packing. So we have $B(x_j, 2^{-i-1}\Delta) \subseteq S_{x_j}$.
For every $x \in S_{x_j}$, note that $N_i$ is a $2^{-i}\Delta$-net, so there is some $x_k \in N_i$ such that $d(x, x_k) \leq 2^{-i}\Delta$, which means $d(x, x_j) \leq d(x, x_k) \leq 2^{-i}\Delta$ since $x_j$ is the nearest net point to $x$. So we have $S_{x_j} \subseteq B(x_j, 2^{-i}\Delta)$.
The reader may have a natural question: why do we use $\tilde{q}_i$ instead of the $q_i$ defined in the proof of Theorem 3? From a technical perspective, our answer is that we cannot obtain upper and lower bounds for $q_i$ as tight as (3), and (3) will be essential in the proof of the instance lower bound.
Let $(X, d)$ be a metric space endowed with a distribution $q$, and suppose $q$ satisfies the "doubling condition" of Definition 1.1. Let $\{N_i\}$ be a sequence of well-separated $2^{-i}\Delta$-nets of $X$. There is an algorithm which, given sample access to an unknown distribution $p$ over $X$,

accepts with probability at least $2/3$ if $p = q$;

rejects with probability at least $2/3$ if $W(p, q) \geq \epsilon$.

Let $\tilde{q}_i$ be as defined in Definition 3.3; then the sample complexity of this algorithm is
Moreover, the following is a sample complexity lower bound for this task:
many samples. Here $\tilde{q}^{-\max}_{-\epsilon}$ represents the probability vector obtained by removing the element with the largest probability mass and then repeatedly removing the elements with the smallest probability mass until $\epsilon$ total mass has been removed.
[Proof of Theorem 3.3] First, we prove the upper bound, which is relatively simpler. We proceed as in the proof of Theorem 3 to construct the distributions $p_i, q_i$ for every $i$. Then we use an instance version of Algorithm 1, replacing each worst-case sub-tester with the instance-optimal sub-tester from Theorem 2. By Theorem 2 and the union bound, the stated number of samples guarantees that every sub-tester succeeds with probability at least $1 - \frac{1}{3(m+1)}$, and then Algorithm 2 works for the same reason as before.
The only remaining work is to replace $q_i$ with $\tilde{q}_i$ in the sample complexity. Recall that for $v \in N_i$, $q_i(v)$ is the sum of the probability mass on the leaves inside the subtree rooted at $v$. Now note that for any such leaf $x$, by construction, $d(x, v) \leq \sum_{k \geq i} 2^{-k}\Delta \leq 2^{-i+1}\Delta$, which means every leaf inside the subtree rooted at $v$ is contained in the ball $B(v, 2^{-i+1}\Delta)$. So by the doubling condition and Lemma 3.3,
$$q_i(v) \;\leq\; q\big(B(v, 2^{-i+1}\Delta)\big) \;\leq\; C^2\, q\big(B(v, 2^{-i-1}\Delta)\big) \;\leq\; C^2\, \tilde{q}_i(v). \tag{4}$$
Since $C$ is a universal constant, we can bound $\|q_i\|_{2/3}$ by $\|\tilde{q}_i\|_{2/3}$ in big-O notation.
Now we turn to the proof of the lower bound. By Theorem 2, we know that any algorithm which tests identity to $\tilde{q}_i$ in $\ell_1$ distance with proximity parameter $\epsilon'$ requires
many samples.
For an unknown discrete distribution $p'$ on $[n_i]$, we show how to convert $p'$ into a distribution $p$ on $X$, so that the problem of identity testing to $\tilde{q}_i$ in $\ell_1$ distance reduces to the problem of identity testing to $q$ in Wasserstein distance. We fix the level $i$ in what follows.
To be precise, every time we need to take a sample from $p$, we do the following: first take a sample $j$ from $p'$; then a point drawn from $q$ conditioned on the cluster $S_{x_j}$ is regarded as the sample of $p$.
Obviously we have that $p = q$ whenever $p' = \tilde{q}_i$.
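A hypothetical sketch of this sampling reduction follows (assuming, as in the reduction above, that we can sample from the known $q$ conditioned on a cluster; for a discrete $q$ this is a table lookup).

```python
# Reduction sampler: a sample j from p' becomes a sample of p by drawing from q
# conditioned on the cluster S_{x_j}. If p' = q~_i exactly, then p = q exactly.
import numpy as np

rng = np.random.default_rng(0)

def sample_p(p_prime: np.ndarray, clusters: list, q_mass: np.ndarray) -> int:
    # p_prime: probability vector over net indices (sums to 1);
    # clusters[j]: point indices of S_{x_j}; q_mass: mass of q on every point.
    j = rng.choice(len(p_prime), p=p_prime)          # j ~ p'
    cluster = np.asarray(clusters[j])
    cond = q_mass[cluster] / q_mass[cluster].sum()   # q conditioned on S_{x_j}
    return int(cluster[rng.choice(len(cluster), p=cond)])
```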
We have
$$W(p, q) \;\geq\; c \cdot 2^{-i}\Delta\;\|p' - \tilde{q}_i\|_1 \tag{5}$$
for a universal constant $c > 0$.
[Proof of Lemma 3.3] The Wasserstein distance is the cost of transporting probability mass from $p$ to $q$. Recall that $N_i$ is a $2^{-i}\Delta$-packing, hence the balls $B(x_j, 2^{-i-1}\Delta)$ do not intersect each other. That means, for every $j$, probability mass on the order of $|p'(j) - \tilde{q}_i(j)|$ needs to be transported into or out of the ball $B(x_j, 2^{-i-1}\Delta)$. Noting that, by construction and the doubling condition, the probability mass inside each such ball is concentrated (up to a constant factor) on the smaller ball $B(x_j, 2^{-i-2}\Delta)$, the cost of transporting each unit of this probability mass is at least $2^{-i-2}\Delta$. Summing over $j$, we have
$$W(p, q) \;\geq\; \sum_{j} \Omega\big(2^{-i}\Delta\big)\,\big|p'(j) - \tilde{q}_i(j)\big| \;=\; \Omega\big(2^{-i}\Delta\big)\,\|p' - \tilde{q}_i\|_1. \tag{6}$$