A fundamental challenge in statistics and computer science is to devise hypothesis tests that use a small number of samples. A classic problem of this type is identity testing (or goodness-of-fit testing): given samples from an unknown distribution $P$ over a domain $\Omega$, does $P$ equal a specific reference distribution $Q$? A sequence of works [Pan08, BatuFRSW13, Valiant:2014:AIP:2706700.2707449, ChanDVV14] in the property testing literature has pinned down the finite sample complexity of this problem. It is known that with $O(|\Omega|^{1/2}\varepsilon^{-2})$ samples from $P$, one can, with probability at least $2/3$, distinguish whether $P = Q$ or whether $d_{\mathrm{TV}}(P,Q) > \varepsilon$; also, $\Omega(|\Omega|^{1/2}\varepsilon^{-2})$ samples are necessary for this task. A related problem is closeness testing (or two-sample testing): given samples from two unknown distributions $P$ and $Q$ over $\Omega$, does $P = Q$? Here, it is known that $\Theta(|\Omega|^{2/3}\varepsilon^{-4/3} + |\Omega|^{1/2}\varepsilon^{-2})$ samples are necessary and sufficient to distinguish $P = Q$ from $d_{\mathrm{TV}}(P,Q) > \varepsilon$ with probability at least $2/3$. The corresponding algorithms for both identity and closeness testing run in time polynomial in $|\Omega|$ and $\varepsilon^{-1}$.
However, in order to solve these testing problems in many real-life settings, there are two issues that need to be surmounted.
High dimensions: In typical applications, the data is described using a huge number of (possibly redundant) features; thus, each item in the dataset is represented as a point in a high-dimensional space. If the domain consists of $n$-dimensional points, for instance $\Omega = \{0,1\}^n$, then from the results quoted above, identity testing or closeness testing for arbitrary probability distributions over $\Omega$ requires $2^{\Omega(n)}$ samples, which is clearly unrealistic. Hence, we need to restrict the class of input distributions.
Approximation: A high-dimensional distribution requires a large number of parameters to be specified. So, for identity testing, it is unlikely that we can ever hypothesize a reference distribution $Q$ that exactly equals the data distribution $P$. Similarly, for closeness testing, two data distributions $P$ and $Q$ are most likely not exactly equal. Hence, we would like to design tolerant testers for identity and closeness that distinguish between $d_{\mathrm{TV}}(P,Q) \le \varepsilon_1$ and $d_{\mathrm{TV}}(P,Q) > \varepsilon_2$, where $\varepsilon_1 < \varepsilon_2$ are user-supplied parameters.
In this work, we design sample- and time-efficient tolerant identity and closeness testers for natural classes of distributions over . More precisely, we focus on distance approximation algorithms:
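Let $\mathcal{D}_1, \mathcal{D}_2$ be two families of distributions over $\Omega$. A distance approximation algorithm for $(\mathcal{D}_1, \mathcal{D}_2)$ is a randomized algorithm which takes as input $\varepsilon \in (0,1)$, and sample access to two unknown distributions $P \in \mathcal{D}_1$ and $Q \in \mathcal{D}_2$. The algorithm returns as output a value $\hat{d} \in [0,1]$ such that, with probability at least $2/3$:
$$|\hat{d} - d_{\mathrm{TV}}(P,Q)| \le \varepsilon.$$
If $\mathcal{D}_1 = \mathcal{D}_2 = \mathcal{D}$, then we refer to such an algorithm as a distance approximation algorithm for $\mathcal{D}$.
The success probability can be amplified to $1-\delta$ by taking the median of $O(\log \delta^{-1})$ independent repetitions of the algorithm with success probability $2/3$.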
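The median amplification in the remark above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the constant 18 comes from applying Hoeffding's inequality under the assumption that each run succeeds independently with probability at least 2/3, and `make_noisy_estimator` is a hypothetical stand-in for a real distance approximation algorithm.

```python
import math
import random
import statistics


def repetitions_for(delta):
    """Number of runs after which the median fails with probability <= delta.

    Assumes each run succeeds independently with probability >= 2/3.
    The median is wrong only if at least half the runs fail, which by
    Hoeffding's inequality happens with probability <= exp(-k / 18).
    """
    return math.ceil(18 * math.log(1 / delta))


def amplify(estimator, delta):
    """Return the median of independent runs of a zero-argument estimator."""
    k = repetitions_for(delta)
    return statistics.median(estimator() for _ in range(k))


def make_noisy_estimator(true_value, eps, seed):
    """Hypothetical base estimator: within eps of the truth w.p. 3/4."""
    rng = random.Random(seed)

    def estimator():
        if rng.random() < 0.75:
            return true_value + rng.uniform(-eps, eps)
        return rng.random()  # an arbitrary wrong answer in [0, 1]

    return estimator
```

The median is used rather than the mean because a minority of arbitrarily wrong runs cannot move it outside the interval that contains the majority of correct runs.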
The distance approximation problem and the tolerant testing problem are equivalent in the setting we consider. A distance approximation algorithm for immediately gives a tolerant closeness testing algorithm for two input distributions and with the same asymptotic sample and time complexity bounds. Also a tolerant closeness testing algorithm for distributions in and gives a distance approximation algorithm for , although with slightly worse sample and time complexity bounds (resulting from a binary search approach). Indeed this connection was explored in the property testing setting in [DBLP:journals/jcss/ParnasRR06] which established a general translation result. Thus, in the rest of this paper we will focus on the distance approximation problem and the results translate to appropriate tolerant testing problems. The bounds on the sample and time complexity will be phrased in terms of the description lengths of and .
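The easy direction of this equivalence (distance approximation implies tolerant testing) amounts to thresholding the estimate at the midpoint of the two tolerance parameters. A minimal sketch, assuming a hypothetical callable `estimate_distance(accuracy)` that returns the distance up to the requested additive accuracy:

```python
def tolerant_closeness_test(estimate_distance, eps1, eps2):
    """Return True ("close") or False ("far") for thresholds eps1 < eps2.

    `estimate_distance(accuracy)` is a hypothetical distance approximation
    routine returning d_TV(P, Q) up to additive error `accuracy`.
    """
    if not 0 <= eps1 < eps2 <= 1:
        raise ValueError("need 0 <= eps1 < eps2 <= 1")
    # A quarter of the gap keeps the estimate strictly on the correct
    # side of the midpoint in both the close and the far case.
    accuracy = (eps2 - eps1) / 4
    estimate = estimate_distance(accuracy)
    return estimate <= (eps1 + eps2) / 2
```

If the true distance is at most `eps1`, the estimate is below the midpoint; if it exceeds `eps2`, the estimate is above it. The tester inherits the success probability of the estimator.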
2 New Results
We design new sample- and time-efficient distance approximation algorithms for several well-studied families of high-dimensional distributions given sample access. We accomplish this by prescribing a general strategy for designing distance approximation algorithms. In particular, we first design an algorithm to approximate the distance between a pair of distributions. However, this algorithm needs both sample access and an approximate evaluation oracle. We crucially observe that a learning algorithm that outputs a representation of the unknown distribution given sample access can often efficiently simulate the approximate evaluation oracle. Thus the final algorithm only needs sample access. This general strategy, coupled with appropriate learning algorithms, leads to a number of new distance approximation algorithms (and hence new tolerant testers) for well-studied families of high-dimensional probability distributions.
2.1 Distance Approximation from Approximators
Given a family of distributions , a learning algorithm for is an algorithm that on input and sample access to a distribution promised to be in , returns the description of a distribution such that with probability at least , . It turns out that for many natural distribution families over , one can easily modify known learning algorithms for to efficiently output not just a description of but the value of for any . More precisely, they yield what we call approximators:
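Let $P$ be a distribution over a finite set $\Omega$. A function $E_P : \Omega \to [0,1]$ is a $(\beta,\gamma)$-approximator for $P$ if there exists a distribution $\hat{P}$ over $\Omega$ such that $\hat{P}$ is $\beta$-close to $P$, i.e., $d_{\mathrm{TV}}(P,\hat{P}) \le \beta$, and $E_P$ is a $(1\pm\gamma)$-multiplicative approximation of $\hat{P}$, i.e., $(1-\gamma)\,\hat{P}(x) \le E_P(x) \le (1+\gamma)\,\hat{P}(x)$ for all $x \in \Omega$.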
Typically, the learning algorithm outputs parameters that describe , and then can be computed (or approximated) efficiently in terms of these parameters.
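Suppose $\mathcal{D}$ is the family of product distributions on $\{0,1\}^n$. That is, any $P \in \mathcal{D}$ can be described in terms of $n$ parameters $p_1, \dots, p_n$, where each $p_i$ is the probability of the $i$'th coordinate being $1$. It is folklore that there is a learning algorithm which gets $O(n\varepsilon^{-2})$ samples from $P$ and returns the parameters $\hat{p}_1, \dots, \hat{p}_n$ of a product distribution $\hat{P}$ satisfying $d_{\mathrm{TV}}(P,\hat{P}) \le \varepsilon$ with probability at least $2/3$. It is clear that given $\hat{p}_1, \dots, \hat{p}_n$, we can compute $\hat{P}(x)$ for any $x \in \{0,1\}^n$ in linear time as:
$$\hat{P}(x) = \prod_{i \,:\, x_i = 1} \hat{p}_i \prod_{i \,:\, x_i = 0} (1 - \hat{p}_i).$$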
Thus, there is an algorithm that takes as input sample access to any product distribution , has sample and time complexity , and returns a circuit implementing an - approximator for . Moreover, any call to the circuit returns in time.
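The product-distribution example can be sketched in a few lines. This is an illustration under stated assumptions: the learner below uses plain empirical frequencies over samples in {0,1}^n, and the folklore guarantee may need more care than this (for example, for parameters very close to 0 or 1). The function names are hypothetical.

```python
def learn_product(samples):
    """Estimate each coordinate's probability of being 1 by its frequency.

    `samples` is a non-empty list of equal-length 0/1 tuples.
    """
    n = len(samples[0])
    m = len(samples)
    return [sum(x[i] for x in samples) / m for i in range(n)]


def product_prob(q, x):
    """Probability of the point x under the product distribution with
    parameters q, computed in time linear in the dimension."""
    prob = 1.0
    for qi, xi in zip(q, x):
        prob *= qi if xi == 1 else 1.0 - qi
    return prob
```

For instance, `learn_product([(1, 0), (1, 1), (0, 0), (1, 0)])` gives the parameters `[0.75, 0.25]`, and `product_prob` then plays the role of the evaluation circuit returned by the learner.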
We establish the following link between approximators and distance approximation.
Suppose we have sample access to distributions and over a finite set. Also, suppose we have access to - approximators for and . Then, with probability at least 2/3, can be approximated to within additive error using samples from and calls to the two approximators.
Thus, in the context of Example 2.2, the above theorem immediately implies a distance approximation algorithm for product distributions using samples and time. Theorem 2.3 extends the work of Canonne and Rubinfeld [DBLP:conf/icalp/CanonneR14] who considered the setting . We discuss the relation to prior work in Section 2.7.
2.2 Bayesian Networks
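A standard way to model structured high-dimensional distributions is through Bayesian networks. A Bayesian network describes how a collection of random variables can be generated one at a time in a directed fashion, and they have been used to model beliefs in a wide variety of domains (see [JN07, KF09] for many pointers to the literature). Formally, a probability distribution $P$ over $n$ variables $X_1, \dots, X_n$ is said to be a Bayesian network on a directed acyclic graph $G$ with $n$ nodes if, for every $i \in [n]$, $X_i$ is conditionally independent of $X_{\text{non-descendants}(i)}$ given $X_{\text{parents}(i)}$. (We use the notation $X_S$ to denote $\{X_i : i \in S\}$ for a set $S \subseteq [n]$.) Equivalently, $P$ admits the factorization:
$$P(x) = \prod_{i=1}^{n} \Pr_{X \sim P}\left[X_i = x_i \mid X_{\text{parents}(i)} = x_{\text{parents}(i)}\right] \quad \text{for all } x. \qquad (1)$$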
For example, product distributions are Bayesian networks on the empty graph.
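Evaluating a Bayesian network at a point multiplies one conditional probability per node, following the factorization over parents. A minimal sketch, assuming a hypothetical representation in which `parents[i]` is a tuple of parent indices and `cpt[i]` maps each parent assignment to a list of probabilities indexed by the value of node `i`:

```python
def bayes_net_prob(parents, cpt, x):
    """Probability of the assignment x under a Bayesian network.

    parents[i]: tuple of indices of the parents of node i.
    cpt[i][pa_values][v]: probability that node i takes value v when its
    parents take the values pa_values (hypothetical table layout).
    """
    prob = 1.0
    for i, pa in enumerate(parents):
        pa_values = tuple(x[j] for j in pa)
        prob *= cpt[i][pa_values][x[i]]
    return prob


# Two binary nodes with a single edge 0 -> 1.
chain_parents = [(), (0,)]
chain_cpt = [
    {(): [0.6, 0.4]},
    {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
]
```

With every `parents[i]` empty, this reduces to evaluating a product distribution, matching the empty-graph example.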
Invoking our framework of distance approximation via approximators on Bayesian networks, we obtain the following:
Suppose and are two DAGs on vertices with in-degree at most . Let and be the family of Bayesian networks on and respectively. Then, there is a distance approximation algorithm for that gets samples and runs in time.
We design a learning algorithm for Bayesian networks on a known DAG whose sample complexity depends on the maximum in-degree of the DAG. It returns another Bayesian network on the same DAG, described in terms of the conditional probability distributions of each node given its parents, for all nodes and all settings of the parents. Given these conditional probability distributions, we can easily obtain the probability that the learnt network assigns to any point of the sample space, and hence an approximator for the unknown network, by using (1). Theorem 2.4 then follows from Theorem 2.3.
Theorem 2.4 extends the works of Daskalakis et al. [DBLP:conf/colt/DaskalakisP17] and Canonne et al. [DBLP:conf/colt/CanonneDKS17], who designed efficient non-tolerant identity and closeness testers for Bayesian networks. Their arguments appear to be inadequate to design tolerant testers. In addition, their results for general Bayesian networks were restricted to the case when the two DAGs are the same. Theorem 2.4 immediately gives efficient tolerant identity and closeness testers for Bayesian networks even when the two DAGs differ. Canonne et al. [DBLP:conf/colt/CanonneDKS17] obtain better sample complexity, but they make certain balancedness assumptions on each conditional probability distribution. Without such assumptions, the sample complexity of our algorithm is optimal.
2.3 Ising Models
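Another widely studied model of high-dimensional distributions is the Ising model. It was originally introduced in statistical physics as a way to study spin systems ([Isi25]) but has since emerged as a versatile framework to study other systems with pairwise interactions, e.g., social networks ([MS10]), learning in coordination games ([Ell93]), phylogeny trees in evolution ([Ney71, Far73, Cav78]) and image models for computer vision ([GG86]). Formally, a distribution $P$ over variables $X_1, \dots, X_n$ is an Ising model if for all $x \in \{-1,1\}^n$:
$$P(x) = \frac{\exp\left(\sum_{i \neq j} A_{ij} x_i x_j + \sum_{i} \theta_i x_i\right)}{\sum_{y \in \{-1,1\}^n} \exp\left(\sum_{i \neq j} A_{ij} y_i y_j + \sum_{i} \theta_i y_i\right)} \qquad (2)$$
where $\theta \in \mathbb{R}^n$ is called the external field and the $A_{ij}$ are called the interaction terms. An Ising model is called ferromagnetic if all $A_{ij} \ge 0$. The width of an Ising model as in (2) is $\max_i \left(\sum_{j} |A_{ij}| + |\theta_i|\right)$.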
Invoking our framework on Ising models, we obtain:
Let be the family of ferromagnetic Ising models having width at most . Then, there is a distance approximation algorithm for with sample complexity and runtime .
We use the parameter learning algorithm by Klivans and Meka [KM17] that learns the parameters of another Ising model $\hat{P}$ such that $\hat{P}(x)$ is a multiplicative approximation of $P(x)$ for every $x$. This result holds for any Ising model, ferromagnetic or not. But in order to get an approximator, we need to compute $\hat{P}(x)$ from the learnt parameters. In general, the partition function (i.e., the sum in the denominator of Equation 2) may be #P-hard to compute, but for ferromagnetic Ising models, Jerrum and Sinclair [JS93] gave a fully polynomial randomized approximation scheme (FPRAS) for this problem. Thus, we obtain an approximator for ferromagnetic Ising models that runs in polynomial time, and then Theorem 2.5 follows from Theorem 2.3.
Daskalakis et al. [DBLP:journals/tit/DaskalakisD019] studied independence testing and identity testing for Ising models and designed non-tolerant testers. Their sample and time complexities have polynomial dependence on the width instead of exponential (as in our case), but their algorithms seem to be inherently non-tolerant. In contrast, our distance approximation algorithm leads to a tolerant closeness-testing algorithm for ferromagnetic Ising models. Also, Theorem 2.5 offers a template for distance approximation algorithms whenever the partition function can be approximated efficiently. In particular, Sinclair et al. [SST14] showed a PTAS for computing the partition function of anti-ferromagnetic Ising models in certain parameter regimes.
We also show that we can efficiently approximate the distance to uniformity for any Ising model, whether ferromagnetic or not. Below, $U$ is the uniform distribution over $\{-1,1\}^n$.
There is an algorithm which, given independent samples from an unknown Ising model over with width at most , takes samples, time and returns a value such that with probability at least 7/12, where is the uniform distribution over .
The proof of Theorem 2.6 again proceeds by learning the parameters of an Ising model that is a multiplicative approximation of the unknown one. As we mentioned earlier, computing the partition function is in general hard, but now we can efficiently estimate the ratio between the probabilities of any two points of the sample space. At this point, we invoke the uniformity tester shown by Canonne et al. [DBLP:journals/siamcomp/CanonneRS15] that uses samples from the input distribution as well as pairwise conditional samples (the so-called PCOND oracle model).
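The reason probability ratios are easy even when the partition function is hard is that the normalizing sum cancels in the ratio. A minimal sketch, assuming the common parametrization in which the unnormalized weight of a point x in {-1,1}^n is the exponential of the sum over ordered pairs i != j of A[i][j]*x[i]*x[j] plus the sum of theta[i]*x[i]; the function names are hypothetical.

```python
import math


def ising_exponent(A, theta, x):
    """Exponent of the unnormalized Ising weight of x (assumed form)."""
    n = len(x)
    pair = sum(A[i][j] * x[i] * x[j]
               for i in range(n) for j in range(n) if i != j)
    field = sum(theta[i] * x[i] for i in range(n))
    return pair + field


def ising_ratio(A, theta, x, y):
    """P(x) / P(y): the partition function cancels, so only the
    difference of the two exponents is needed."""
    return math.exp(ising_exponent(A, theta, x) - ising_exponent(A, theta, y))
```

Each ratio costs time quadratic in the number of variables with this dense representation, which is what a simulated pairwise conditional oracle would need per query.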
2.4 Multivariate Gaussians
Theorem 2.3 applies also when is not finite, e.g., the reals. Then, in the definition of the - approximator for a distribution , we require that there is a distribution such that and is a -approximation of the probability density function of at any .
The most prominent instance in which we can apply our framework in this setting is the class of multivariate gaussians, another widely used model for high-dimensional distributions throughout the natural and social sciences (see, e.g., [MDLW18]). There are two main reasons for their ubiquity. Firstly, because of the central limit theorem, any physical quantity that is a population average is approximately distributed as a gaussian. Secondly, the gaussian distribution has maximum entropy among all real-valued distributions with a particular mean and covariance; therefore, a gaussian model places the least restrictions beyond the first and second moments of the distribution.
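For $\mu \in \mathbb{R}^n$ and positive definite $\Sigma \in \mathbb{R}^{n \times n}$, the distribution $N(\mu, \Sigma)$ has the density function:
$$N(\mu, \Sigma; x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \exp\left(-\frac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right). \qquad (3)$$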
Invoking our framework on multivariate gaussians, we obtain:
Let be the family of multivariate gaussian distributions, . Then, there is a distance approximation algorithm for with sample complexity and runtime (where is the matrix multiplication constant).
It is folklore that for any , the empirical mean and empirical covariance obtained from samples from determines a gaussian satisfying with probability at least 3/4. To get an approximator, we need evaluations of for any as in (3). Since is computable in time , Theorem 2.7 follows from Theorem 2.3.
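Evaluating the gaussian density at a point is the only oracle call the framework needs here. The sketch below is an illustration only: it works in log-space to avoid underflow in high dimensions and uses a plain cubic-time Cholesky factorization rather than fast matrix multiplication, so it does not achieve the running time quoted above. All function names are hypothetical.

```python
import math


def cholesky(S):
    """Lower-triangular L with L L^T = S, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def gaussian_logpdf(mu, S, x):
    """Log-density of N(mu, S) at x."""
    n = len(mu)
    L = cholesky(S)
    # Forward substitution: solve L z = x - mu, so that the quadratic
    # form (x - mu)^T S^{-1} (x - mu) equals the squared norm of z.
    z = []
    for i in range(n):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((x[i] - mu[i] - s) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in z)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

A ratio of two densities, which is what the distance estimator consumes, is then the exponential of a difference of two log-densities.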
This result is interesting because there is no closed-form expression known for the total variation distance between two gaussians of specified mean and covariance. Devroye et al. [DMR18] give expressions for lower- and upper-bounding the total variation distance that are a constant multiplicative factor away from each other. On the other hand, our approach (see Corollary 6.3) yields a polynomial time randomized algorithm that, given the means and covariances of two gaussians, approximates the total variation distance between them up to a prescribed additive error.
2.5 Interventional Distributions in Causal Models
A causal model for a system of random variables describes not only how the variables are correlated but also how they would change if they were to be externally set to prescribed values. To be more formal, we can use the language of causal Bayesian networks due to Pearl [Pearl00]. A causal Bayesian network is a Bayesian network with an extra modularity assumption: for each node in the network, the dependence of on is an autonomous mechanism that does not change even if other parts of the network are changed.
Suppose is a causal Bayesian network over variables on a directed acyclic graph with nodes labeled . The nodes in are partitioned into two sets: observable and hidden . A sample from the observational distribution yields the values of variables . The modularity assumption allows us to define the result of interventions on causal Bayesian networks. An intervention is specified by a subset and an assignment . In the resulting interventional distribution, the variables in are fixed to , while the variables for are sampled in topological order as it would have been in the original Bayesian network, according to the conditional probability distribution , where consist of either variables previously sampled in the topological order or variables in set by the intervention. Finally, the variables in are marginalized out. The resulting distribution on is denoted .
The question of inferring the interventional distribution from samples is a fundamental one. We focus on atomic interventions, i.e., where the intervention is on a single node. In this case, Tian and Pearl [TP02b, tian-thesis] exactly characterized the graphs such that, for any causal Bayesian network on the graph and for any assignment to the intervened node, the interventional distribution is identifiable from the observational distribution on the observable variables (that is, there exists a well-defined function mapping the observational distribution to the interventional distribution, but one which may not be computationally effective). For identification to be computationally effective, it is also natural to require a strong positivity condition on the observational distribution, defined in Section 7. We show that we can efficiently estimate the distances between interventional distributions of causal Bayesian networks whenever the identifiability and strong positivity conditions are met:
Theorem 2.8 (Informal).
Suppose are two unknown causal Bayesian networks on two known graphs and on a common observable set containing a special node and having bounded in-degree and c-component size. Suppose and both satisfy the identifiability condition, and the observational distributions and satisfy the strong positivity condition.
Then there is an algorithm which for any and parameter returns a value such that with probability at least 2/3 using samples from the observational distributions and and running in time .
We again use the framework of approximators to prove the theorem, but there is a complication: we do not get samples from the distributions and , but only from and . We build on a recent work ([BGKMV20]) that shows how to efficiently learn and sample from interventional distributions of atomic interventions using observational samples, assuming the identifiability and strong positivity conditions.
Theorem 2.8 solves a very natural problem. To concoct a somewhat realistic example, suppose a biologist wants to compare how a particular point mutation affects the activity of other genes for Africans and for Europeans. Because of ethical reasons, she cannot conduct randomized controlled trials by actively inducing the mutation, but she can draw random samples from the two populations. It is reasonable to assume that the graph structure of the regulatory network is the same for all individuals, and we further assume that the causal graph over the genes of interest is known (or can be learned through other methods). Also, suppose that the gene expression levels can be discretized. She can then, in principle, use the algorithm proposed in Theorem 2.8 to test whether the effect of the mutation is approximately the same for Africans and Europeans.
2.6 Improving Success of Learning Algorithms Using Distance Estimation
Finally we give a link between efficient distance approximation algorithms and boosting the success probability of learning algorithms. Specifically, let be a family of distributions for which we have a learning algorithm in distance that succeeds with probability 3/4. Suppose there is also a distance approximation algorithm for . We prescribe a method to combine the two algorithms and to learn an unknown distribution from with probability at least . To the best of our knowledge, this connection has not been stated explicitly in the literature. The proof of the following theorem is given in Section 8.
Let be a family of distributions. Suppose there is a learning algorithm which for any takes samples from and in time outputs a distribution such that with probability at least 3/4. Suppose there is a distance approximation algorithm for that given any two completely specified distributions and estimates up to an additive error in time with probability at least . Then there is an algorithm that uses and as subroutines, takes samples from , runs in time and returns a distribution such that with probability at least .
To achieve the above result we repeat independently times which guarantees at least successful repetitions from Chernoff’s bound except with probability , which we condition on. Successful repetitions must produce distributions which are pairwise close by the triangle inequality. We approximate the pairwise distances between all pairs of repetitions up to an additive and then find a repetition whose learnt distribution has the largest number of other repetitions within distance. The latter number must be at least , guaranteeing must have a successful repetition within distance. Thus must be at most close to by the triangle inequality.
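The selection step of this boosting argument can be sketched as follows. This is a minimal illustration under assumptions: `distance` stands for any distance approximation routine (here an exact total variation distance over small finite distributions is used for the demonstration), and `threshold` is a free parameter rather than the specific constant used in the proof.

```python
def tv_distance(p, q):
    """Exact total variation distance between two finite distributions
    given as dicts mapping outcomes to probabilities."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def select_by_agreement(candidates, distance, threshold):
    """Return the candidate with the most other candidates within
    `threshold` under `distance` (ties go to the earliest candidate)."""
    best_index, best_count = 0, -1
    for i, ci in enumerate(candidates):
        count = sum(1 for j, cj in enumerate(candidates)
                    if j != i and distance(ci, cj) <= threshold)
        if count > best_count:
            best_index, best_count = i, count
    return candidates[best_index]
```

When a majority of the learnt distributions are close to the target, they are also close to one another, so a candidate with many neighbours must have at least one successful run among them; an outlier from a failed run has few neighbours and is not selected.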
2.7 Previous work
Prior work most related to our work is in the area of distribution testing. The topic of distribution testing is rooted in statistical hypothesis testing and goes back to Pearson’s chi-squared test in 1900. In theoretical computer science, distribution testing research is relatively new and focuses on designing hypothesis testers with optimal sample complexity. Goldreich and Ron [GoldreichR11] investigated uniformity testing (distinguishing whether an input distribution is uniform over its support or $\varepsilon$-far from uniform in total variation distance) and designed a tester with sample complexity $O(m^{1/2}/\varepsilon^{4})$ (where $m$ is the size of the sample space). Paninski [Pan08] showed that $\Omega(m^{1/2}/\varepsilon^{2})$ samples are necessary for uniformity testing, and gave an optimal tester when $\varepsilon = \Omega(m^{-1/4})$. Batu et al. [BatuFRSW13] initiated the investigation of identity (goodness-of-fit) testing and closeness (two-sample) testing and gave testers with sample complexity $\tilde{O}(m^{1/2}\,\mathrm{poly}(1/\varepsilon))$ and $\tilde{O}(m^{2/3}\,\mathrm{poly}(1/\varepsilon))$ respectively. Optimal bounds for these testing problems were obtained in Valiant and Valiant [Valiant:2014:AIP:2706700.2707449] ($\Theta(m^{1/2}/\varepsilon^{2})$) and Chan et al. [ChanDVV14] ($\Theta(\max(m^{2/3}/\varepsilon^{4/3}, m^{1/2}/\varepsilon^{2}))$) respectively. Tolerant versions of these testing problems have very different sample complexity. In particular, Valiant and Valiant [ValiantV11, ValiantV10] showed that tolerant uniformity, identity, and closeness testing with respect to the total variation distance have a sample complexity of $\Theta(m/\log m)$. Since the seminal papers of Goldreich and Ron and Batu et al., distribution testing grew into a very active research topic and a wide range of properties of distributions have been studied under this paradigm. This research led to sample-optimal testers for many distribution properties. We refer the reader to the surveys [DBLP:journals/eccc/Canonne15, DBLP:journals/crossroads/Rubinfeld12] and references therein for more details and results on the topic.
When the sample space is a high-dimensional space (such as $\{0,1\}^n$), the testers designed for general distributions require an exponential number of samples ($2^{\Omega(n)}$ if the sample space is $\Sigma^n$ for a constant-size alphabet $\Sigma$). Thus structural assumptions are to be made to design efficient (polynomial in $n$ and $1/\varepsilon$) and practical testers for many of the testing problems. The study of testing high-dimensional distributions with structural restrictions was initiated only very recently. The work that is most closely related to our work appears in [DBLP:journals/tit/DaskalakisD019, DBLP:conf/colt/CanonneDKS17, DBLP:conf/colt/DaskalakisP17, 10.5555/3327546.3327616] (these works also give good expositions of other prior work on this topic). These papers consider distributions coming from graphical models including Ising models and Bayes nets. In Daskalakis et al. [DBLP:journals/tit/DaskalakisD019], the authors consider distributions that are drawn from an Ising model and show that identity testing and independence testing (testing whether an unknown distribution is close to a product distribution) can be done with $\mathrm{poly}(n, 1/\varepsilon)$ samples, where $n$ is the number of nodes in the graph associated with the Ising model. In Canonne et al. [DBLP:conf/colt/CanonneDKS17] and Daskalakis et al. [DBLP:conf/colt/DaskalakisP17], the authors consider identity testing and closeness testing for distributions given by Bayes networks of bounded in-degree. Specifically, they design sample-efficient algorithms that test closeness of distributions over the same Bayes net with $n$ nodes and in-degree $d$. They also show that $\Theta(\sqrt{n}/\varepsilon^{2})$ and $\Theta(\max(\sqrt{n}/\varepsilon^{2}, n^{3/4}/\varepsilon))$ samples are necessary and sufficient for identity testing and closeness testing respectively of pairs of product distributions (Bayes net with empty graph). Finally, in Acharya et al. [10.5555/3327546.3327616], the authors investigate testing problems on causal Bayesian networks as defined by Pearl [Pearl00] and design efficient testing algorithms for certain identity and closeness testing problems for them.
All these papers consider designing non-tolerant testers and leave open the problem of designing efficient testers that are tolerant for high-dimensional distributions which is the main focus in this paper.
Our main technical result builds on the work of Canonne and Rubinfeld [DBLP:conf/icalp/CanonneR14]. They consider a dual access model for testing distributions. In this model, in addition to independent samples, the testing algorithm also has access to an evaluation oracle that gives the probability of any item in the sample space. They establish that having access to an evaluation oracle leads to testing algorithms with sample complexity independent of the size of the sample space. Indeed, in order to design testing algorithms, they give an algorithm to additively estimate the total variation distance between two unknown distributions in the dual access model. Our distance estimation algorithm is a direct extension of this algorithm.
Another access model considered in the literature for which such domain independent results are obtained is the conditional sampling model introduced independently in Chakraborty et al. [DBLP:journals/siamcomp/ChakrabortyFGM16] and Canonne et al. [DBLP:conf/soda/CanonneRS14]. In this model, the tester has access to a conditional sampling oracle that given a subset of the sample space outputs a random sample from the unknown distribution conditioned on . The conditional sampling model lends itself to algorithms for testing uniformity and testing identity to a known distribution with sample complexity . Building on Chakraborty et al. [DBLP:journals/siamcomp/ChakrabortyFGM16], Chakraborty and Meel [CM19] proposed a tolerant testing algorithm with sample complexity independent of domain size for testing uniformity of a sampler that takes in a Boolean formula as input and the sampler’s output generates a distribution over the witnesses of .
3 Distance Approximation Algorithm
In this section, we prove Theorem 2.3 which underlies all the other results in this work. In fact, we show the following theorem that is more detailed.
Suppose we have sample access to distributions and over a finite set. Also, suppose we can make calls to two circuits and which implement - approximators for and respectively. Let be the maximum running time for any call to or .
Then for any , can be approximated up to an additive error with probability at least , using samples from and runtime.
Note that the approximators in Theorem 3.1 must return rational numbers with bounded denominators as they are implemented by circuits with bounded running time. The exact model of computation for the circuits does not matter so much, so we omit its discussion.
We now turn to the proof of Theorem 3.1. As mentioned in the Introduction, if and were - approximators, the result already appears in [DBLP:conf/icalp/CanonneR14]. The proof below analyzes how having nonzero and affects the error bound.
We invoke Algorithm 1. Notice that the algorithm only requires sample access to one of the two distributions but to both of the approximators. Let be the distribution -close to which is approximated by the output of ; similarly define .
We have from the triangle inequality. Hence, it is sufficient to approximate additively up to .
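The triangle-inequality step can be spelled out as follows. The notation is assumed for illustration: $\hat{P}$ and $\hat{Q}$ denote the distributions approximated by the two circuits, and $\beta$ denotes the bound on their total variation distance from $P$ and $Q$ respectively.

```latex
\begin{align*}
d_{\mathrm{TV}}(P,Q)
  &\le d_{\mathrm{TV}}(P,\hat{P}) + d_{\mathrm{TV}}(\hat{P},\hat{Q}) + d_{\mathrm{TV}}(\hat{Q},Q)
   \le d_{\mathrm{TV}}(\hat{P},\hat{Q}) + 2\beta, \\
d_{\mathrm{TV}}(\hat{P},\hat{Q})
  &\le d_{\mathrm{TV}}(\hat{P},P) + d_{\mathrm{TV}}(P,Q) + d_{\mathrm{TV}}(Q,\hat{Q})
   \le d_{\mathrm{TV}}(P,Q) + 2\beta, \\
\text{hence}\quad
\bigl| d_{\mathrm{TV}}(P,Q) - d_{\mathrm{TV}}(\hat{P},\hat{Q}) \bigr| &\le 2\beta .
\end{align*}
```

So any additive estimate of the distance between the two approximated distributions is also an estimate of the distance between the original ones, at the cost of an extra $2\beta$ in the error.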
From the above, if we have complete access (both evaluation and sample) to and , then we can estimate the distance with samples and evaluations. However as we have only approximate evaluations of and and samples from the original distribution , we need some additional arguments. Let and be the functions implemented by the circuits and respectively.
We start with an upper bound for the absolute value of the error term . We consider the partition of sample space into and , where , and .
For in with , so that For in , implies and hence, so that . For in , implies , and hence, . Therefore:
Now consider the term :
Note that: . So, . We can rewrite as . Since lies in , by the Chernoff bound, we can estimate the expectation up to additive error with probability at least by averaging samples from . ∎
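The estimator analysed in this proof can be sketched as a sample average. This is a minimal illustration, not a transcription of Algorithm 1: it relies on the standard identity that the total variation distance equals the expectation, over x drawn from the first distribution, of max(0, 1 - Q(x)/P(x)), and it takes the sample size from Hoeffding's inequality. The names `sample_p`, `approx_p`, `approx_q` and `make_sampler` are hypothetical.

```python
import math
import random


def estimate_tv(sample_p, approx_p, approx_q, eps, delta=1 / 3):
    """Estimate the total variation distance from samples of one
    distribution and (approximate) evaluations of both.

    sample_p(): draws one sample from the first distribution.
    approx_p(x), approx_q(x): approximate probabilities of x.
    """
    # Each term lies in [0, 1], so Hoeffding's inequality gives additive
    # error eps with probability >= 1 - delta for this many samples.
    m = math.ceil(math.log(2 / delta) / (2 * eps * eps))
    total = 0.0
    for _ in range(m):
        x = sample_p()
        px, qx = approx_p(x), approx_q(x)
        if px > qx:  # implies px > 0, so the division is safe
            total += 1.0 - qx / px
    return total / m


def make_sampler(dist, seed):
    """Hypothetical sampler for a finite distribution given as a dict."""
    rng = random.Random(seed)
    keys = list(dist)
    weights = [dist[k] for k in keys]
    return lambda: rng.choices(keys, weights)[0]
```

Only one of the two distributions is sampled, while both are evaluated, which matches the observation at the start of the proof. With exact evaluations the estimate concentrates around the true distance; with approximators, the additional error terms are the ones bounded in the proof.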
Theorem 3.1 can be extended to the case that and are distributions over with infinite support. We change Definition 2.1 so that is a -approximation of where is the probability density function for . Then, Theorem 3.1 and Algorithm 1 continue to hold as stated. In the proof, we merely have to replace the summations with the appropriate integrals.
4 Bayesian networks
First we apply our distance estimation algorithm for tolerant testing of high dimensional distributions coming from bounded in-degree Bayesian networks. Bayesian networks defined below are popular probabilistic graphical models for describing high-dimensional distributions succinctly.
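A Bayesian network $P$ on a directed acyclic graph $G$ over the vertex set $[n]$ is a joint distribution of the $n$ random variables $(X_1, \dots, X_n)$ over the sample space $\Sigma^n$ such that for every $i \in [n]$, $X_i$ is conditionally independent of $X_{\text{non-descendants}(i)}$ given $X_{\text{parents}(i)}$, where for $S \subseteq [n]$, $X_S$ is the joint distribution of $(X_i : i \in S)$, and parents and non-descendants are defined from $G$.
$P$ factorizes as follows:
$$P(x) = \prod_{i=1}^{n} \Pr_{X \sim P}\left[X_i = x_i \mid X_{\text{parents}(i)} = x_{\text{parents}(i)}\right] \quad \text{for all } x \in \Sigma^n. \qquad (4)$$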
Hence a Bayesian network can be completely described by a set of conditional distributions for every variable , for every fixing of its parents .
To construct an approximator for a Bayesian network, we first learn it using an efficient algorithm. Such a learning algorithm was claimed in the appendix of [DBLP:conf/colt/CanonneDKS17] but the analysis there appears to be incomplete [CanComm]. We show the following proper learning algorithm for Bayesian networks that uses the optimal sample complexity.
There is an algorithm that given a parameter and sample access to an unknown Bayesian network distribution on a known directed acyclic graph of in-degree at most , returns a Bayesian network on such that with probability . Letting denote the range of each variable , the algorithm takes samples and runs in time.
This directly gives us a distance estimation algorithm for Bayesian networks.
Given samples from and we first learn them as and using Theorem 4.2 in distance . This step costs samples and time and succeeds with probability 4/5. and gives efficient - approximators from Equation 4. It follows from Theorem 3.1 that we can estimate up to an additive error using additional samples from except for 1/5 probability. ∎
Our distance estimation algorithm has optimal dependence on and from the following non-tolerant identity testing lower bound of Daskalakis et al.
Theorem 4.3 ([DBLP:journals/tit/DaskalakisD019]).
Given sample access to two unknown Bayesian network distributions and over on a common known graph, testing versus with probability requires samples.
It remains to prove Theorem 4.2.
4.1 Learning Bayesian networks
In this section, we prove a strengthened version of Theorem 4.2 that holds for any desired error probability .
There is an algorithm that given parameters and sample access to an unknown Bayesian network distribution on a known directed acyclic graph of in-degree at most , returns a Bayesian network on such that with probability . Letting denote the alphabet for each variable , the algorithm takes