# Inference under Information Constraints I: Lower Bounds from Chi-Square Contraction

We consider a distributed inference problem where only limited information is allowed from each sample. We present a general framework where multiple players are given one independent sample each about which they can only provide limited information to a central referee. Motivated by important instances of communication and privacy constraints, in our abstraction we allow each player to describe its observed sample to the referee using a channel from a prespecified family of channels W. This family W can be instantiated to capture both the communication- and privacy-constrained settings and beyond. The central referee uses the messages from players to solve an inference problem for the unknown underlying distribution that generated samples of the players. We derive lower bounds for sample complexity of learning and testing discrete distributions in this information-constrained setting. Underlying our lower bounds is a quantitative characterization of the contraction in chi-square distances between the observed distributions of the samples when an information constraint is placed. This contraction is captured in a local neighborhood in terms of chi-square and decoupled chi-square fluctuations of a given channel, two quantities we introduce in this work. The former captures the average distance between two product distributions and the latter the distance of a product distribution to a mixture of product distributions. These quantities may be of independent interest. As a corollary, we quantify the sample complexity blow-up in the learning and testing settings when enforcing the corresponding local information constraints. In addition, we systematically study the role of randomness and consider both private- and public-coin protocols.


## I Introduction

Large-scale distributed inference has taken a center stage in many machine learning tasks. In these settings, it is becoming increasingly critical to operate under limited resources at each player, where the players may be limited in their computational capabilities, communication capabilities, or may restrict the information about their data to maintain privacy. Our focus in this work will be on the last two constraints of communication and privacy, and, in general, on local information constraints on each player’s data.

We propose a general framework for distributed statistical inference under local information constraints. Consider a distributed model where n players observe independent samples from an unknown distribution p, with player i getting the sample Xi. The players are constrained in the amount of information they can reveal about their observations in the following way: Player i must choose a channel Wi from a prespecified family of channels W to report its observed sample to a central referee R. In particular, player i passes its observation Xi as input to its chosen channel Wi and receives the corresponding channel output Yi. The central referee uses the messages Y1,…,Yn from the players to complete an inference task, such as estimation or testing, on the underlying distribution p. See Fig. 1 for an illustration of the setup.

The family W of allowed channels serves as an abstraction of the information constraints placed on each player's data. Before moving ahead, we instantiate our abstract setup with two important examples, local communication constraints and local privacy constraints, by specifying the corresponding W's.

• Communication-Limited Inference. Each player can only send ℓ bits about their sample. This limitation can be captured by restricting W to Wℓ, the family of channels with output alphabet {0,1}ℓ, i.e., channels W:[k]→{0,1}ℓ.

• Locally Differentially Private Inference. Each player seeks to maintain privacy of their own data. We adopt the notion of local differential privacy which, loosely speaking, requires that no output message from a player reveals too much about its input data. This is captured by restricting W to Wρ, the family of ρ-locally differentially private (ρ-LDP) channels that satisfy the following (cf. [21, 35, 9, 19]): For W∈Wρ,

 W(y∣x1)/W(y∣x2)≤eρ, ∀x1,x2∈[k], ∀y∈{0,1}∗.
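To make the constraint concrete, here is a small Python sketch (our illustration, not from the paper) of k-ary randomized response, a canonical ρ-LDP channel, together with a numerical check of the likelihood-ratio condition above.

```python
import math

def randomized_response(k, rho):
    """k-ary randomized response: report the true symbol with probability
    e^rho/(e^rho + k - 1), otherwise any fixed other symbol uniformly."""
    e = math.exp(rho)
    # W[y][x] = W(y | x): columns are conditional distributions over outputs.
    return [[e / (e + k - 1) if y == x else 1.0 / (e + k - 1)
             for x in range(k)] for y in range(k)]

def ldp_ratio(W):
    """max over y, x1, x2 of W(y|x1)/W(y|x2); rho-LDP means this is <= e^rho."""
    k = len(W[0])
    return max(W[y][x1] / W[y][x2]
               for y in range(len(W)) for x1 in range(k) for x2 in range(k))

k, rho = 6, 1.0
W = randomized_response(k, rho)
# Each column sums to one, and the LDP ratio condition holds with equality.
assert all(abs(sum(W[y][x] for y in range(k)) - 1.0) < 1e-9 for x in range(k))
assert ldp_ratio(W) <= math.exp(rho) + 1e-9
```

For randomized response, the ratio is exactly e^ρ, so the channel sits on the boundary of Wρ.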

These specific cases of communication and privacy constraints have received a lot of attention in the literature, and we emphasize these cases separately in our results. Nonetheless, our results are valid for arbitrary families W and can handle other examples from the literature, such as the families of Markov transition matrices considered in [10].

Our proposed framework can be applied to inference for p belonging to any family of distributions. For simplicity, however, in this work we restrict ourselves to a finite alphabet [k] and consider the canonical inference problems of estimating p and testing goodness-of-fit. Motivated by applications in distributed inference in a resource-constrained setting, we seek algorithms that enable the desired inference using the least number of samples, or equivalently, the least number of players. Our main results present a general approach for establishing lower bounds on the sample complexity of performing a given inference task under the aforementioned information-constrained setting. Underlying our lower bounds is a new quantitative characterization of the contraction in chi-square distance between distributions of observations due to the imposed information constraints.

We allow randomized selection of the channels Wi from W at each player and distinguish between private-coin protocols, where this randomized selection is done independently for each player, and public-coin protocols, where the players can use shared randomness. Interestingly, our chi-square contraction bounds provide a quantitative separation of sample complexity for private-coin and public-coin protocols, an aspect hitherto ignored in the distributed inference literature and which is perhaps the main contribution of our work.

We summarize our results below, after a formal description of our problem setting.

### I-a Information-constrained inference framework

We begin by recalling standard formulations for learning and testing discrete distributions. Denote by Δk the set of all distributions over [k]. In this work, we consider the observation alphabet [k] and the set of unknown distributions Δk. Let X1,…,Xn be independent samples from an unknown distribution p∈Δk. We focus on the following two inference tasks for p.

Distribution Learning. In the (k,ε)-distribution learning problem, we seek to estimate a distribution in Δk to within ε in total variation distance. Formally, a (randomized) mapping ^p:[k]n→Δk constitutes an (n,ε)-estimator if

 PrXn∼pn[dTV(^p(Xn),p)≤ε]>2/3,

where dTV denotes the total variation distance (see Section II for the definition of total variation distance). Namely, ^p estimates the input distribution to within distance ε with probability at least 2/3. Note that this choice of probability is arbitrary and can be replaced with any positive constant.

The sample complexity of (k,ε)-distribution learning is the minimum n such that we can find an (n,ε)-estimator for every p∈Δk. It is well-known that the sample complexity of distribution learning is Θ(k/ε²) and that the empirical distribution estimate attains it.
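The classical Θ(k/ε²) guarantee for the empirical estimator is easy to sanity-check numerically; the self-contained Python sketch below (our illustration, with arbitrary toy parameters) runs the empirical estimator and watches the total variation error shrink as n grows.

```python
import random
from collections import Counter

def empirical(samples, k):
    """The empirical distribution over a 0-indexed domain {0, ..., k-1}."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[x] / n for x in range(k)]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

random.seed(0)
k = 4
p = [0.4, 0.3, 0.2, 0.1]
errs = {}
for n in (100, 10000):
    samples = random.choices(range(k), weights=p, k=n)
    errs[n] = total_variation(empirical(samples, k), p)

# With n = 10000 the TV error is far below 0.05 with overwhelming probability.
assert errs[10000] < 0.05
```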

Identity Testing. Given a known reference distribution q∈Δk, in the (k,ε)-identity testing problem, we seek to use samples from p to test if p equals q or if it is ε-far from q in total variation distance. Specifically, an (n,ε)-test is given by a (randomized) mapping T:[k]n→{0,1} such that

 PrXn∼qn[T(Xn)=1]>2/3 if p=q, PrXn∼pn[T(Xn)=0]>2/3 if dTV(p,q)>ε.

Namely, upon observing independent samples Xn=(X1,…,Xn), the algorithm should “accept” with high constant probability if the samples come from the reference distribution q and “reject” with high constant probability if they come from a distribution significantly far from q. Note again that this choice of probability is arbitrary and can be replaced with any constant greater than 1/2.

The sample complexity of (k,ε)-identity testing is the minimum n such that we can find an (n,ε)-test for q. Clearly, this quantity will depend on the reference distribution q. However, it is customary to consider the sample complexity over the worst-case q.1 In this worst-case setting, while it has been known for some time that the most stringent sample requirement arises for q set to the uniform distribution, a recent result of [27] provides a formal reduction of arbitrary q to the uniform distribution case. It is therefore enough to restrict to q being the uniform distribution; identity testing for the uniform reference distribution is termed the (k,ε)-uniformity testing problem. The sample complexity of (k,ε)-uniformity testing was shown to be Θ(√k/ε²) in [38].

1The sample complexity for a fixed q has been studied under the “instance-optimal” setting (see [46, 11]): while the question is not fully resolved, nearly tight upper and lower bounds are known.
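Sublinear uniformity testing can be illustrated with the collision statistic, one of the standard approaches (this sketch and its toy parameters are ours, not the paper's): under the uniform distribution the collision probability is exactly 1/k, while any p that is ε-far from uniform has collision probability at least (1+4ε²)/k.

```python
import random
from collections import Counter

def collisions(samples):
    """Number of colliding pairs among the samples."""
    c = Counter(samples)
    return sum(v * (v - 1) // 2 for v in c.values())

random.seed(2)
k, n = 100, 2000
uniform = [random.randrange(k) for _ in range(n)]

# A distribution at TV distance 1/4 from uniform: mass 2/k on the first k/4
# symbols, 0 on the next k/4, and 1/k on the rest.
biased_w = [2 if x < k // 4 else (0 if x < k // 2 else 1) for x in range(k)]
biased = random.choices(range(k), weights=biased_w, k=n)

# The far distribution produces noticeably more collisions (w.h.p.).
assert collisions(uniform) < collisions(biased)
```

Here the expected collision counts are roughly n²/(2k) = 20000 versus 30000, so the two cases separate clearly at this sample size.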

Moving to our distributed setting, the estimator and the test must now be implemented by the central referee R using Yn=(Y1,…,Yn), where Yi denotes the output of the channel Wi (chosen by player i from W) when the input is Xi. The message Yi constitutes the communication from player i to R. Formally, we restrict to simultaneous message passing (SMP) protocols for communication, wherein the messages from all players are transmitted simultaneously to the central referee, and no other communication is allowed. This restriction is motivated by applications in distributed inference where the players are users or nodes connected to a central server and there is no direct link for communication between these users. Although one could consider natural extensions to a more general adaptive communication setting, we restrict ourselves to the practically relevant SMP setting in this work.

Note that the SMP setting forbids communication between the players, but does allow them to a priori agree on a strategy to select different channels from W. In this context, the role of shared randomness available to the players is important, and motivates us to distinguish the settings of private-coin and public-coin protocols. In fact, as pointed out earlier, a central theme of this work is to demonstrate the role of shared randomness, available as public coins, in enabling distributed inference. We show that it is indeed a resource that can greatly reduce the sample complexity of distributed inference.2

2The distinction between public-coin and private-coin protocols is not so pronounced when multiple rounds of interaction between players are allowed. For instance, the first player may share the value of its private coins in the first round of communication, providing shared randomness. Thus, our results also imply a strict improvement in sample complexity of distributed inference by allowing multiple rounds of interaction.

Formally, the private- and public-coin SMP protocols are described as follows.

###### Definition.

Let U1,…,Un denote independent random variables, jointly independent of (X1,…,Xn).3 In a private-coin SMP protocol, player i is given access to Ui, and the channel Wi is chosen as a function of Ui. The central referee R is given access to the random variables U1,…,Un as well, and its estimator and test can depend on them.

3In this work, we are not concerned with the amount of private or public randomness used. Thus, we can assume the Ui's to be discrete random variables, distributed uniformly over a domain of sufficiently large cardinality.

###### Definition.

Let U be a random variable independent of (X1,…,Xn). In a public-coin SMP protocol, all players are given access to U, and they select their respective channels Wi as a function of U. The central referee R is given access to U as well, and its estimator and test can depend on it.

Hence, in a private-coin SMP protocol, the communication Yi from player i is a (randomized) function of (Xi,Ui). Note that since both (X1,…,Xn) and (U1,…,Un) have product distributions, so does (Y1,…,Yn). In contrast, in a public-coin SMP protocol, the communication Yi from player i is a (randomized) function of (Xi,U), and the Yi's are not independent. They are, however, independent conditioned on the shared randomness U.
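The following Python skeleton (ours; the toy channel family and all names are hypothetical) makes the two protocol classes concrete: the only structural difference is whether the channel-selection coins are independent per player or a single shared value.

```python
import random

def run_smp(samples, choose_channel, coins):
    """One round of an SMP protocol: player i selects a channel using its
    coin(s) only, feeds X_i through it, and ships the output to the referee."""
    return [choose_channel(i, coins[i])(x) for i, x in enumerate(samples)]

# A toy family of four deterministic 1-bit channels, indexed via the coin.
def choose_channel(i, coin):
    t = int(coin * 4) % 4
    return lambda x, t=t: 1 if x % 4 == t else 0

rng = random.Random(7)
samples = [rng.randrange(8) for _ in range(5)]

private_coins = [rng.random() for _ in samples]   # independent, one per player
shared = rng.random()
public_coins = [shared] * len(samples)            # one shared coin for all

priv_msgs = run_smp(samples, choose_channel, private_coins)
pub_msgs = run_smp(samples, choose_channel, public_coins)
assert all(m in (0, 1) for m in priv_msgs + pub_msgs)
# Under public coins, every player ends up using the same channel.
assert len({int(c * 4) % 4 for c in public_coins}) == 1
```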

###### Remark I.1.

Throughout, we assume that some randomness is available to generate the output Yi of the channel Wi given its input Xi. This randomness is assumed to be private as well and, in fact, is assumed to be available only to each player – not even to R. This distinction is important in the context of privacy, where the information available to R is seen as “leaked” and private randomness available only to the players is critical for enabling LDP channels.4 Note that this assumption stands even for public-coin SMP protocols, implying the conditional independence of the Yi's given U mentioned above.

4Perhaps a more apt name for private-coin protocols in our setting would be pairwise-coin protocols; however, we have avoided this additional terminology for ease of presentation.

The sample complexities of (k,ε)-distribution learning and (k,ε)-uniformity testing using W in the distributed setting can now be defined analogously to the centralized setting by replacing Xn with (Yn,U) and (Yn,U1,…,Un), respectively, for public-coin and private-coin protocols. Note that since we are restricting to one sample per player, the sample complexity of these problems corresponds to the minimum number of players required to solve them. Our main objective in this line of work is the following:

Characterize the sample complexity of inference tasks using W for public- and private-coin protocols.

### I-B Summary of our results and contributions

We are initiating a systematic study of the distributed inference problem described in the previous section. In this paper, the first in our series, we focus on lower bounds. As is well known from the data-processing inequalities of information theory, the distributions induced at the output of a channel W by two input distributions p and q are “closer” than p and q themselves. At a high level, we derive lower bounds for distributed inference by providing a quantitative characterization of this reduction in distance for the chi-square distance, which we term chi-square contraction.

More technically, we consider probability distributions obtained by perturbing a nominal distribution. These perturbations are chosen so that in order to accomplish the given inference task, an algorithm must roughly distinguish the perturbed elements. In particular, we relate the difficulty of inference problems to the average chi-square distance of the perturbed distributions to the nominal distribution, and to the chi-square distance of the average perturbed distribution to the nominal distribution. For our distributed inference setting, we need to bound these two quantities for the distributions induced at the outputs of the chosen channels from W.

We provide bounds for these two quantities for channel output distributions in terms of two new measures of average distance in a neighborhood: the chi-square fluctuation for the average distance and the decoupled chi-square fluctuation for the distance to the average. The former notion has appeared earlier in the literature, albeit in different forms, and recovers known bounds for distributed distribution learning problems. The second quantity, the decoupled chi-square fluctuation, is the main technical tool introduced in this work, and leads to new lower bounds for distributed identity testing.

Observe that the general approach sketched above can be applied to any perturbations. We obtain lower bounds for public-coin protocols by a minmax evaluation of these bounds where the minimum is over perturbations and the maximum is over the choice of channels from . In contrast, we show that the performance of private-coin protocols is determined by a maxmin evaluation of these bounds. Remarkably, we establish that the maxmin evaluation is significantly smaller than minmax evaluation, leading to a quantitative separation in performance of private-coin and public-coin protocols for testing problems.

This separation has a heuristic appeal: On the one hand, in public-coin protocols players can use shared randomness to sample channels that best separate the current point in the alternative hypothesis class from the null. On the other hand, for a fixed private-coin protocol, one can identify a perturbation in a “direction” where the current choice of channels will face difficulty in distinguishing the perturbed distributions. Further, we remark that this separation only holds for testing problems. This, too, makes sense in light of the previous heuristic since learning problems require us to distinguish a neighborhood around the current hypothesis, without any preference for a particular “direction” of perturbation.

We develop these techniques systematically in Section III and Section IV. We begin by recasting the lower bounds for standard, centralized setting in our chi-square fluctuation language in Section III before extending these notions to the distributed setting in Section IV. Finally, we evaluate our general lower bounds for distribution learning and identity testing problems.

The resulting lower bounds involve matrices given by

 H(W)i1,i2 := ∑y∈Y [ (W(y∣2i1−1)−W(y∣2i1))(W(y∣2i2−1)−W(y∣2i2)) / ∑x∈[k]W(y∣x) ],  i1,i2∈[k/2]. (1)

Our bounds rely on the Frobenius norm and the nuclear norm of the matrix H(W); see Section II for definitions. These norms roughly measure how informative a particular channel is for distributed inference, and our bounds involve the maximum of these norms over W in W.
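The matrix in (1) can be tabulated directly. The sketch below (our code, with a toy 1-bit "parity" channel as a hypothetical example, and inputs 0-indexed so the input pair (2i−1, 2i) becomes (2i, 2i+1)) also checks the relation ∥H(W)∥F ≤ ∥H(W)∥∗, using that H(W), being a weighted Gram matrix, is positive semi-definite, so its nuclear norm equals its trace.

```python
import math

def H_matrix(W, k):
    """H(W) from Eq. (1); W[y][x] = W(y|x), rows indexed by outputs y."""
    half = k // 2
    H = [[0.0] * half for _ in range(half)]
    for y in range(len(W)):
        s = sum(W[y][x] for x in range(k))
        if s == 0:
            continue
        # Per-pair differences; 0-indexed pair i is inputs (2i, 2i+1).
        a = [W[y][2 * i] - W[y][2 * i + 1] for i in range(half)]
        for i1 in range(half):
            for i2 in range(half):
                H[i1][i2] += a[i1] * a[i2] / s
    return H

# Toy 1-bit channel on k=4 inputs: output indicates the parity of the input.
k = 4
W = [[1.0 if x % 2 == 0 else 0.0 for x in range(k)],
     [0.0 if x % 2 == 0 else 1.0 for x in range(k)]]
H = H_matrix(W, k)
frob = math.sqrt(sum(h * h for row in H for h in row))
trace = sum(H[i][i] for i in range(k // 2))   # = nuclear norm (H is PSD)
assert frob <= trace + 1e-12
```

For this rank-one example H(W) = [[1, 1], [1, 1]], so both norms equal 2 and the inequality is tight.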

We summarize in Table I our sample complexity lower bounds for the (k,ε)-distribution learning and (k,ε)-identity testing problems using W for public- and private-coin protocols. The form here is only indicative; formal statements for general channel families appear in Section IV-B, and implications for specific W are given in Section V, with results for the communication-limited setting in Theorems V.3, V.2 and V.1 and for the LDP setting in Theorems V.6, V.5 and V.4. The term in each cell denotes the lower bound obtained by our approach. The first row contains our lower bounds for a general family W. For comparison, we recall in the second row the results for the standard, centralized setting, and highlight the change in sample complexity lower bound as a multiplicative factor. Note that we have the same factor increase in the sample complexity of distribution learning for both private- and public-coin protocols. We obtain the same factor increase for identity testing using private-coin protocols. On the other hand, the factor increase for identity testing using public-coin protocols is, in general, much smaller.

As a corollary of these general bounds, we obtain Ω(k²/(2^ℓ ε²)) and Ω(k²/(ρ² ε²)) lower bounds for distribution learning using Wℓ (the communication-limited setting) and Wρ (the LDP setting), respectively. These bounds have been previously obtained in other works as well and are known to be tight.

For identity testing, we obtain Ω(k/(2^{ℓ/2} ε²)) and Ω(k/(ρ² ε²)) lower bounds using Wℓ and Wρ, respectively, for public-coin protocols. In the subsequent papers in this series ([3, 1]), we will present public-coin protocols that match these bounds, thereby establishing their optimality.

Finally, for identity testing using private-coin protocols, we obtain Ω(k^{3/2}/(2^ℓ ε²)) and Ω(k^{3/2}/(ρ² ε²)) lower bounds using Wℓ and Wρ, respectively, which too will be seen to be optimal among private-coin protocols in subsequent papers.

### I-C Prior work

The statistical tasks of distribution learning and identity testing considered in this work have a rich history. The former requires no special techniques other than those used in parametric estimation problems with finite-dimensional parameter spaces, which are standard textbook material. The identity testing problem is the same as the classic goodness-of-fit problem. The latter goes beyond the discrete setting considered here, but often starts with a quantization to a uniform reference distribution (see [33, 36]). The focus in this line of research has always been on the relation of the performance to the support size k (cf. [36]), with particular interest in the large-support and small-sample case where the usual normal approximations of test statistics do not apply (cf. [37, 8]). Closer to our setting, Paninski [38] (see also [46]) established the sample complexity of uniformity testing, showing that it is sublinear in k and equal to Θ(√k/ε²). As mentioned earlier, in this work we follow this sample complexity framework that has received attention in recent years. We refer the reader to the surveys [17, 40, 14, 7] for a comprehensive review of recent results on discrete distribution learning and testing.

Distributed inference problems, too, have been studied extensively, although for the asymptotic, large-sample case and for simpler hypothesis classes. There are several threads here. Starting from Tsitsiklis [45], decentralized detection has received attention in the control and signal processing literature, with main focus on information structure, likelihood ratio tests and combining local decisions for global inference. In a parallel thread, distributed statistical inference under communication constraints was initially studied in the information theory community [5, 28, 29], with the objective to characterize the asymptotic error exponents as a function of the communication rate. Recent results in this area have focused on more complicated communication models [50, 49] and, more recently, on the minimum communication requirements for large sample sizes [41, 6].

Our focus is different from that of the works above. In our setting, independent samples are not available in one place; instead, information constraints are placed on individual samples. This is along the lines of recent work on distributed mean estimation under communication constraints [54, 26, 42, 13, 51], although some of these works consider more general communication models than what we allow. The distribution learning problem under communication constraints has been studied in [18]. However, in that paper the authors consider a blackboard model of communication and strive to minimize the total number of bits communicated, without placing any restriction on the number of bits per sample. A variant of the distribution testing problem is considered in [24], where players observe multiple samples and communicate their local test results to the central referee, who is required to use simple aggregation rules such as AND. Interestingly, such setups have received a lot of attention in the sensor network literature, where a fusion center combines local decisions using simple rules such as majority; see [47] for an early review.

Closest to our work, and independent of it, is [30], which studies the (k,ε)-distribution learning problem using ℓ bits of communication per sample. It was shown that the sample complexity for this problem is Θ(k²/(2^ℓ ε²)). That paper in turn uses a general lower bound from [31, 32], which yields lower bounds for distributed parametric estimation under suitable smoothness conditions. For this special case, our general approach reduces to a procedure similar to that of [32], which was obtained independently of our work.

Distribution learning under LDP constraints has been studied in [19, 34, 52, 4, 48], all providing sample-optimal schemes with different merits. Our lower bound when specialized for this setting coincides with the one derived in [19].

In spite of this large body of literature closely related to our work, there are two distinguishing features of our approach. First, the methods for deriving lower bounds under local information constraints in all these works, while leading to tight bounds for distribution learning, do not extend to identity testing. In fact, our decoupled chi-square fluctuation bound fills this gap in the literature. We remark that distributed uniformity testing under LDP constraints has been studied recently in [43], but the lower bounds derived there are significantly weaker than what we obtain. Second, our approach allows us to prove a separation between the performances of public-coin and private-coin protocols. This qualitative lesson – namely, that shared public randomness reduces the sample complexity – is in contrast to the prescription of [45], which showed that shared randomness does not help for simple hypothesis testing problems.5

5Identity testing is a composite hypothesis testing problem, with null hypothesis p=q and alternative comprising all distributions that are ε-far from q in total variation distance.

We observe that the unifying treatment based on chi-square distance is reminiscent of the lower bounds for learning under statistical queries (SQ) derived in [23, 22, 44]. On the one hand, the connection between these two problems is to be expected given the relation between LDP and SQ learning established in [35]. On the other hand, this line of work only characterizes sample complexity up to polynomial factors. In particular, it does not lead to the lower bounds we obtain using our decoupled chi-square fluctuation bounds.

We close with a pointer to an interesting connection to the capacity of an arbitrary varying channel (AVC). At a high level, our minmax lower bound considers the worst perturbation for the best channel. This is semantically dual to the expression for capacity of an AVC with shared randomness, where the capacity is determined by the maxmin mutual information, with maximum over input distributions and minimum over channels (cf. [16]).

### I-D Organization

We specify our notation in Section II and recall some basic inequalities needed for our analysis. This is followed by a review of the existing lower bounds for sample complexity of distribution learning and identity testing in Section III. In doing so, we introduce the notions of chi-square fluctuations which will be central to our work, and cast existing lower bounds under our general formulation. In Section IV, we generalize these notions to capture the information-constrained setting. Further, we apply our general approach to distribution learning and identity testing in the information-constrained setting. Then, in Section V, we instantiate these results to the settings of communication-limited and LDP inference and obtain our order-optimal bounds for testing and learning under these constraints. We conclude with pointers to schemes matching our lower bounds which will be reported in the subsequent papers in this series.

## Ii Notation and preliminaries

Throughout this paper, we denote by log the logarithm to base 2 and by ln the natural logarithm. We use standard asymptotic notation O(·), Ω(·), Θ(·), and o(·) for complexity orders.

Denote by [k] the set of integers {1,…,k}. Given a fixed (and known) discrete domain of cardinality k, we write Δk for the set of probability distributions over [k], i.e.,

 Δk={p:[k]→[0,1]:∥p∥1=1}.

For a discrete set S, we denote by uS the uniform distribution on S, and will omit the subscript when the domain is clear from context.

The total variation distance between two probability distributions p and q over X is defined as

 dTV(p,q):=supS⊆X(p(S)−q(S))=12∑x∈X|p(x)−q(x)|,

namely, dTV(p,q) is equal to half of the ℓ1 distance between p and q. In addition to the total variation distance, we will extensively rely on the chi-square distance and the Kullback–Leibler (KL) divergence between distributions p and q, defined by

 dχ2(p,q) :=∑x∈X(p(x)−q(x))2q(x), and D(p∥q) :=∑x∈Xp(x)logp(x)q(x).

Using concavity of logarithms and Pinsker’s inequality, we can relate these two quantities to total variation distance as follows:

 dTV(p,q)² ≤ (1/2)D(p∥q) ≤ (1/2)dχ2(p,q).
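These inequalities are easy to spot-check numerically; the Python sketch below (ours) draws random pairs of distributions and verifies the chain dTV² ≤ D/2 ≤ dχ²/2 on each pair.

```python
import math
import random

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def chi2(p, q):
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

rng = random.Random(1)
for _ in range(200):
    p = [rng.random() + 0.1 for _ in range(5)]
    q = [rng.random() + 0.1 for _ in range(5)]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Pinsker's inequality, then KL <= chi-square (from log x <= x - 1).
    assert tv(p, q) ** 2 <= kl(p, q) / 2 + 1e-12
    assert kl(p, q) <= chi2(p, q) + 1e-12
```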

In our results, we will rely on the following norms for matrices. Given a real-valued m×n matrix A with singular values σ1≥σ2≥…, the Frobenius norm (or Schatten 2-norm) of A is given by

 ∥A∥F=(∑i=1m∑j=1n aij²)1/2=(∑k=1min(m,n) σk²)1/2=√Tr(ATA).

Similarly, the nuclear norm (also known as the trace norm or Schatten 1-norm) of A is defined as

 ∥A∥∗=Tr(√(ATA)),

where √(ATA) is the (principal) square root of the positive semi-definite matrix ATA. For any A, the Frobenius and nuclear norms satisfy the following inequality

 ∥A∥F≤∥A∥∗≤√rankA⋅∥A∥F, (2)

which can be seen to follow, for instance, from an ℓ1/ℓ2 inequality and the Cauchy–Schwarz inequality. Finally, the spectral radius of a complex square matrix A with eigenvalues λ1,…,λn is defined as ρ(A):=maxi|λi|.
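Since the singular values of a diagonal matrix are the absolute values of its entries, inequality (2) can be checked by hand; the short sketch below (ours) does exactly that for a rank-2 diagonal example.

```python
import math

# For a diagonal matrix, the singular values are |diagonal entries|,
# so both norms are elementary to evaluate.
diag = [3.0, 4.0, 0.0]                           # a rank-2 example
frob = math.sqrt(sum(d * d for d in diag))       # = 5.0
nuclear = sum(abs(d) for d in diag)              # = 7.0
rank = sum(1 for d in diag if d != 0)            # = 2
assert frob <= nuclear <= math.sqrt(rank) * frob + 1e-12
```

Here 5 ≤ 7 ≤ √2·5 ≈ 7.07, so both sides of (2) are nearly tight for this example.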

## Iii Perturbed families and chi-square fluctuations

To build basic heuristics, we first revisit the derivation of lower bounds for the sample complexity of (k,ε)-distribution learning and (k,ε)-identity testing in the centralized setting. As mentioned previously, it suffices to derive a lower bound for (k,ε)-uniformity testing. For brevity, we will sometimes refer to distribution learning as learning and identity testing as testing. We present both proofs in a unifying framework which, in addition to its generality, will extend to our information-constrained setting.6

6Although we restrict ourselves to the discrete setting here, the framework extends in a straightforward manner to more general parametric families.

Lower bounds for both learning and testing can be derived from a local view of the geometry of product distributions around the uniform distribution. Denote by un the n-fold product distribution with each marginal given by u, the uniform distribution on [k]. A typical lower bound proof entails finding an appropriate family of distributions close to u for which it is information-theoretically difficult to solve the underlying problem. We call such a family a perturbed family and define it next.

###### Definition.

For ε∈(0,1] and a given k-ary distribution p, an ε-perturbed family around p is a finite collection P of distributions q satisfying dTV(q,p)≥ε.

When ε is clear from context, we simply use the phrase perturbed family around p.

As we shall see below, the bottleneck for learning distributions, which is a parametric estimation problem, arises from the difficulty of solving a multiple hypothesis testing problem with hypotheses given by the elements of a perturbed family around p. Using Fano's inequality, we can show that this difficulty is captured by the average KL divergence between p and the elements of the perturbed family. In fact, for a unified treatment, we shall simply bound KL divergences by chi-square distances. This motivates the following definition.

###### Definition.

Given a k-ary distribution p and a perturbed family P around p, the chi-square fluctuation of P is given by

 χ2(P) := (1/|P|)∑q∈P dχ2(q,p).

The aforementioned average KL divergence is bounded above by n times the chi-square fluctuation of P, which will be used to obtain a lower bound for the sample complexity of learning in the next section.

On the other hand, the bottleneck for testing, which is a composite hypothesis testing problem, arises from the difficulty of solving the binary hypothesis testing problem with pn as one hypothesis and a uniform mixture of the n-fold products of elements of the perturbed family as the other. This difficulty is captured by the total variation distance between these two distributions on [k]n, for which a simple upper bound of √(n·χ2(P)/2) follows from convexity and Pinsker's inequality. However, this bound turns out to be far from optimal.

Instead, an alternative bound derived using a recipe from Pollard [39] was shown to be tight in Paninski [38]. To understand this bound, we parameterize the elements of the perturbed family as $p_z$, for $z \in \mathcal{Z}$. Denoting by $\delta_z$ the normalized perturbation with entries given by

$$\delta_z(x) = \frac{p_z(x) - p(x)}{p(x)}, \qquad x \in [k],$$

we can re-express $\chi^2(\mathcal{P})$ as

$$\chi^2(\mathcal{P}) = \mathbb{E}\big[d_{\chi^2}(p_Z, p)\big] = \mathbb{E}_Z\big[\|\delta_Z\|_p^2\big],$$

where $\|\delta_z\|_p^2 := \mathbb{E}_{X \sim p}[\delta_z(X)^2]$ is the second moment of the random variable $\delta_z(X)$ (for $X$ drawn from $p$) and the outer expectation is over $Z$, which is uniformly distributed over $\mathcal{Z}$ (independently of $X$). Using a technique of Pollard (cf. [39]), we can essentially replace $\chi^2(\mathcal{P})$ in the previously mentioned upper bound by a quantity we term the decoupled chi-square fluctuation of $\mathcal{P}$, defined next.

###### Definition .

Given a $k$-ary distribution $p$ and a perturbed family $\mathcal{P}$ around $p$, the $n$-fold decoupled chi-square fluctuation of $\mathcal{P}$ is given by

$$\chi^{(2)}(\mathcal{P}^n) := \log \mathbb{E}_{ZZ'}\big[\exp\big(n \cdot \langle \delta_Z, \delta_{Z'} \rangle\big)\big],$$

where $\langle \delta_z, \delta_{z'} \rangle := \mathbb{E}_{X \sim p}[\delta_z(X)\,\delta_{z'}(X)]$ is the correlation inner product for $X$ drawn from $p$, and the outer expectation is over $Z$ distributed uniformly over $\mathcal{Z}$ and $Z'$ an independent copy of $Z$.

While the quantities $\chi^2(\mathcal{P})$ and $\chi^{(2)}(\mathcal{P}^n)$ are new, they are implicit in previous work. The abstraction here allows us to have a clear geometric view and lends itself to the more general local information-constrained setting. For completeness, we review the proofs of existing lower bounds using our chi-square fluctuations terminology.

In the sections below, we will present the proofs of lower bounds for sample complexity of learning and testing using a specific perturbed family and bring out the role of $\chi^2(\mathcal{P}_\varepsilon)$ and $\chi^{(2)}(\mathcal{P}_\varepsilon^n)$ in these bounds. In particular, both bounds will be derived using the $\varepsilon$-perturbed family around $u$ due to Paninski [38], consisting of $2^{k/2}$ distributions given by

$$p_z = \frac{1}{k}\big(1+2\varepsilon z_1,\; 1-2\varepsilon z_1,\; \dots,\; 1+2\varepsilon z_i,\; 1-2\varepsilon z_i,\; \dots,\; 1+2\varepsilon z_{k/2},\; 1-2\varepsilon z_{k/2}\big), \qquad z \in \{-1,+1\}^{k/2}. \qquad (3)$$

The normalized perturbations $\delta_z$ for this perturbed family are given by

$$\delta_z(x) = \begin{cases} 2\varepsilon z_i, & x = 2i-1, \\ -2\varepsilon z_i, & x = 2i, \end{cases} \qquad i \in [k/2].$$

Note that this perturbed family is closely related to the standard one used in statistics, where $\delta_z(x)$ is proportional to $z_x$ for each $x \in [k]$; the variant above ensures additionally that the total probabilities of the pairs $\{2i-1, 2i\}$ are preserved, whereby the perturbed family consists of elements of the $(k-1)$-dimensional probability simplex.
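As a concrete sanity check (a Python sketch, not part of the paper), the following builds a member of Paninski's family from Eq. 3 and verifies that it is a valid distribution at total variation distance exactly $\varepsilon$ from uniform, and that pairwise distances scale with the Hamming distance of the sign patterns; the parameter values are illustrative.

```python
import random

def paninski(k, eps, z):
    """Paninski perturbed distribution p_z on [k] from Eq. 3; z in {-1,+1}^{k/2}."""
    p = []
    for zi in z:
        p.append((1 + 2 * eps * zi) / k)
        p.append((1 - 2 * eps * zi) / k)
    return p

def tv(p, q):
    """Total variation distance between two distributions on the same alphabet."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

k, eps = 10, 0.1
u = [1 / k] * k
rng = random.Random(0)
z = [rng.choice((-1, 1)) for _ in range(k // 2)]
pz = paninski(k, eps, z)

assert abs(sum(pz) - 1) < 1e-12         # p_z is a valid distribution
assert abs(tv(pz, u) - eps) < 1e-12     # d_TV(p_z, u) = eps exactly

# Pairwise distances are proportional to the Hamming distance of z, z'.
zp = [rng.choice((-1, 1)) for _ in range(k // 2)]
dist = sum(1 for a, b in zip(z, zp) if a != b)
assert abs(tv(pz, paninski(k, eps, zp)) - dist * 4 * eps / k) < 1e-12
```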

### Iii-a Chi-square fluctuation and the learning lower bound

For learning, we consider the multiple hypothesis testing problem where the hypotheses are $p_z$, $z \in \{-1,+1\}^{k/2}$, given in Eq. 3. Specifically, denote by $Z$ the random variable distributed uniformly on $\{-1,+1\}^{k/2}$ and by $Y^n$ the random variable with distribution $p_z^n$ given $Z = z$. We can relate the accuracy of a probability estimate to the probability of error for the multiple hypothesis testing problem with hypotheses given by $\{p_z\}$ using the standard Fano’s method (cf. [53]). In particular, we can use a probability estimate $\hat{p}$ to solve the hypothesis testing problem by returning as answer a $z$ that minimizes $d_{\mathrm{TV}}(\hat{p}, p_z)$. The difficulty here is that the pairwise total variation distance $d_{\mathrm{TV}}(p_z, p_{z'})$ may not be $\Omega(\varepsilon)$, and therefore, an $\varepsilon$-estimator may not return the correct hypothesis.

One way of circumventing this difficulty is to restrict to a sub-family of the perturbed family where pairwise distances are $\Omega(\varepsilon)$. Note that for the perturbed family in (3)

$$d_{\mathrm{TV}}(p_z, p_{z'}) = \mathrm{dist}(z, z') \cdot \frac{4\varepsilon}{k}, \qquad (4)$$

where $\mathrm{dist}(z, z')$ denotes the Hamming distance between $z$ and $z'$. This simple observation allows us to convert the problem of constructing a “packing” in total variation distance to that of constructing a packing in Hamming space. Indeed, a standard Gilbert–Varshamov construction of a packing in Hamming space yields a subset $\mathcal{Z}_0 \subseteq \{-1,+1\}^{k/2}$ with $\log |\mathcal{Z}_0| = \Omega(k)$ such that $\mathrm{dist}(z, z') = \Omega(k)$ for every $z \neq z'$ in $\mathcal{Z}_0$. Using Fano’s inequality to bound the probability of error for this new perturbed family, we can relate the sample complexity of learning to $\chi^2(\mathcal{P}_\varepsilon)$. However, when later extending our bounds to the information-constrained setting, this construction would create difficulties in bounding the mutual information for public-coin protocols. We avoid this complication by relying instead on a slightly modified form of the classic Fano’s argument from [20]; this form of Fano’s argument was used in [32] as well to obtain a lower bound for the sample complexity of learning under communication constraints.

Specifically, in view of (4), it is easy to see that for an estimate $\hat{p}$ such that $d_{\mathrm{TV}}(\hat{p}, p_Z) \le \varepsilon/3$ with probability at least $2/3$, setting $\hat{Z}$ to be a minimizer of $d_{\mathrm{TV}}(\hat{p}, p_z)$ over $z$ (so that $d_{\mathrm{TV}}(p_{\hat{Z}}, p_Z) \le 2\varepsilon/3$ on this event), we must have

$$\Pr\Big[\mathrm{dist}(Z, \hat{Z}) > \frac{k}{6}\Big] \le \frac{1}{3}.$$

On the other hand, the proof of Fano’s inequality in [15] can be extended easily to obtain (see, also, [20])

$$\Pr\Big[\mathrm{dist}(Z, \hat{Z}) > \frac{k}{6}\Big] \ge 1 - \frac{I(Z \wedge Y^n) + 1}{\log_2 |\mathcal{Z}| - \log_2 B_{k/6}},$$

where $B_{k/6}$ denotes the cardinality of a Hamming ball of radius $k/6$ in $\{-1,+1\}^{k/2}$. Noting that

$$\log_2 B_{k/6} \le \frac{k}{2} \cdot h\Big(\frac{1}{3}\Big), \qquad (5)$$

where $h$ denotes the binary entropy function, and combining the bounds above (with $|\mathcal{Z}| = 2^{k/2}$), we obtain

$$I(Z \wedge Y^n) + 1 \ge \frac{k}{40}. \qquad (6)$$

Therefore, to obtain a lower bound for sample complexity it suffices to bound $I(Z \wedge Y^n)$ from above. It is in this part that we bring in the role of chi-square fluctuations.

Indeed, we have

$$I(Z \wedge Y^n) = \min_{Q \in \Delta_{k^n}} \mathbb{E}\big[D(p_Z^n \,\|\, Q)\big] \le \mathbb{E}\big[D(p_Z^n \,\|\, u^n)\big] = n\,\mathbb{E}\big[D(p_Z \,\|\, u)\big] \le n\,\mathbb{E}\big[d_{\chi^2}(p_Z, u)\big] = n \cdot \chi^2(\mathcal{P}_\varepsilon), \qquad (7)$$

where the last inequality uses $D(p \,\|\, q) \le d_{\chi^2}(p, q)$. Combining Eq. 6 and Eq. 7, we obtain that $n \cdot \chi^2(\mathcal{P}_\varepsilon) + 1 \ge k/40$, yielding the desired lower bound for sample complexity.
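The constant in Eq. 6 can be double-checked numerically: the Fano bound combined with Eq. 5 and $|\mathcal{Z}| = 2^{k/2}$ requires $(1/3)\,(1 - h(1/3)) \ge 1/40$, which indeed holds. A quick Python sketch (not from the paper):

```python
import math

def h(x):
    """Binary entropy function, in bits."""
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# Requiring the Fano lower bound 1 - (I+1)/((k/2)(1 - h(1/3))) to be at most 1/3
# forces I + 1 >= (k/3)(1 - h(1/3)); the coefficient exceeds 1/40, giving Eq. 6.
c = (1 - h(1 / 3)) / 3
assert c > 1 / 40
```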

In fact, the argument above is valid for any perturbed family with the desired pairwise minimum total variation distance, namely any perturbed family satisfying an appropriate replacement for Eq. 5. In particular, it suffices to impose the following condition:

$$\max_{z \in \mathcal{Z}} \Big|\Big\{ z' \in \mathcal{Z} : d_{\mathrm{TV}}(p_z, p_{z'}) \le \frac{\varepsilon}{3} \Big\}\Big| \le C_\varepsilon. \qquad (8)$$

The foregoing arguments lead to the next result.

###### Lemma .

For $\varepsilon \in (0,1]$ and a $k$-ary distribution $p$, let $\mathcal{P}_\varepsilon$ be an $\varepsilon$-perturbed family around $p$ satisfying Eq. 8. Then, the sample complexity of $\varepsilon$-distribution learning must be at least

$$\Omega\bigg(\frac{\log |\mathcal{P}_\varepsilon| - \log C_\varepsilon}{\chi^2(\mathcal{P}_\varepsilon)}\bigg).$$

For Paninski’s perturbed family in Eq. 3, $|\mathcal{P}_\varepsilon| = 2^{k/2}$ and, by Eq. 4 together with the Hamming ball volume bound, $\log_2 C_\varepsilon \le (k/2) \cdot h(1/6)$, so that $\log |\mathcal{P}_\varepsilon| - \log C_\varepsilon = \Omega(k)$; further, an easy calculation yields

$$\chi^2(\mathcal{P}_\varepsilon) = 4\varepsilon^2, \qquad (9)$$

which with the previous result recovers the $\Omega(k/\varepsilon^2)$ lower bound for the sample complexity of learning.
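Eq. 9 is easy to verify numerically: every member of the family is at chi-square distance exactly $4\varepsilon^2$ from uniform, hence so is the average. A small Python sketch with illustrative $k$ and $\varepsilon$ (not from the paper):

```python
import itertools

k, eps = 6, 0.1
u = [1 / k] * k

def paninski(z):
    """Paninski perturbed distribution from Eq. 3 for sign pattern z."""
    out = []
    for zi in z:
        out.extend([(1 + 2 * eps * zi) / k, (1 - 2 * eps * zi) / k])
    return out

def chi2_dist(q, p):
    """d_chi2(q, p) = sum_x (q(x) - p(x))^2 / p(x)."""
    return sum((qx - px) ** 2 / px for qx, px in zip(q, p))

# Average chi-square distance over the whole family {-1,+1}^{k/2}.
fam = [paninski(z) for z in itertools.product((-1, 1), repeat=k // 2)]
fluct = sum(chi2_dist(q, u) for q in fam) / len(fam)
assert abs(fluct - 4 * eps ** 2) < 1e-12
```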

### III-B Decoupled chi-square fluctuation and the testing lower bound

As is the case with distribution learning, the pairwise hypothesis testing problems emerging from the perturbed family do not yield the desired dependence of the sample complexity on $k$. The bottleneck is identified by realizing that the actual problem we end up solving is a composite binary hypothesis testing problem where the null hypothesis is given by $u^n$ and the alternative can be any of the $p_z^n$, $z \in \{-1,+1\}^{k/2}$. In particular, any test for uniformity using $n$ samples will also constitute a test for $u^n$ versus $\mathbb{E}[p_Z^n]$ for every random variable $Z$ supported on $\{-1,+1\}^{k/2}$. Thus, another aspect of the geometry around $u$ that enters our consideration is the distance between $u^n$ and $\mathbb{E}[p_Z^n]$.

A simple approach to bound this quantity is to observe that, by the convexity of the total variation norm followed by Pinsker’s inequality and the bound $D(p \,\|\, q) \le d_{\chi^2}(p, q)$, we have

$$d_{\mathrm{TV}}\big(\mathbb{E}_Z[p_Z^n], u^n\big) \le \mathbb{E}_Z\big[d_{\mathrm{TV}}(p_Z^n, u^n)\big] \le \mathbb{E}_Z\Big[\sqrt{\tfrac{n}{2}\, d_{\chi^2}(p_Z, u)}\Big] = \varepsilon\sqrt{2n}, \qquad (10)$$

where the last identity is by (9). Thus, this upper bound on the distance between $\mathbb{E}_Z[p_Z^n]$ and $u^n$ in terms of the chi-square fluctuation only yields a sample complexity lower bound of $\Omega(1/\varepsilon^2)$, much weaker than the $\Omega(\sqrt{k}/\varepsilon^2)$ bound that we strive for.

Instead, we bound this distance in terms of the decoupled chi-square fluctuation using a recipe from [39]. To that end, we rely on the following result, which is a slight extension of a similar result in [39] (see, also, [38]). This crucial extension will allow us to handle local information constraints later; we include a proof for completeness.

###### Lemma .

Consider a random vector $\theta = (\theta_1, \dots, \theta_n)$ such that for each realization $\theta$ the distribution $Q_\theta^n$ is defined as $Q_\theta^n = \prod_{i=1}^n Q_{\theta_i}$. Further, let $P^n = \prod_{i=1}^n P_i$ be a fixed product distribution. Then,

$$\chi^2\big(\mathbb{E}_\theta[Q_\theta^n], P^n\big) = \mathbb{E}_{\theta\theta'}\bigg[\prod_{i=1}^n \big(1 + H_i(\theta, \theta')\big)\bigg] - 1,$$

where $\theta'$ is an independent copy of $\theta$, and with $\delta_{\vartheta_i}(x) := (Q_{\vartheta_i}(x) - P_i(x))/P_i(x)$,

$$H_i(\vartheta, \vartheta') := \langle \delta_{\vartheta_i}, \delta_{\vartheta'_i} \rangle = \mathbb{E}\big[\delta_{\vartheta_i}(X_i)\,\delta_{\vartheta'_i}(X_i)\big],$$

where the expectation is over $X_i$ distributed according to $P_i$.

###### Proof.

Using the definition of chi-square distance, we have

$$\chi^2\big(\mathbb{E}_\theta[Q_\theta^n], P^n\big) = \mathbb{E}_{P^n}\Bigg[\bigg(\mathbb{E}_\theta\bigg[\frac{Q_\theta^n(X^n)}{P^n(X^n)}\bigg]\bigg)^2\Bigg] - 1 = \mathbb{E}_{P^n}\Bigg[\bigg(\mathbb{E}_\theta\bigg[\prod_{i=1}^n (1 + \Delta_{\theta_i})\bigg]\bigg)^2\Bigg] - 1,$$

where the outer expectation is for $X^n$ using the distribution $P^n$. For brevity, denote by $\Delta_{\theta_i}$ the random variable $\delta_{\theta_i}(X_i)$. The product in the expression above can be expanded as

$$\prod_{i=1}^n (1 + \Delta_{\theta_i}) = 1 + \sum_{i \in [n]} \Delta_{\theta_i} + \sum_{i_1 > i_2} \Delta_{\theta_{i_1}} \Delta_{\theta_{i_2}} + \dots,$$

whereby we get

$$\chi^2\big(\mathbb{E}_\theta[Q_\theta^n], P^n\big) = \mathbb{E}_{P^n}\bigg[\sum_i \mathbb{E}_\theta[\Delta_{\theta_i}] + \sum_j \mathbb{E}_{\theta'}[\Delta_{\theta'_j}] + \sum_{i,j} \mathbb{E}_{\theta,\theta'}[\Delta_{\theta_i} \Delta_{\theta'_j}] + \dots\bigg].$$

Observe now that $\mathbb{E}_{P^n}[\Delta_{\vartheta_i}] = 0$ for every $\vartheta_i$. Furthermore, $\theta'$ is an independent copy of $\theta$, and $X_i$ and $X_j$ are independent for $i \neq j$. Therefore, the expectation on the right-side above equals

$$\mathbb{E}\bigg[\sum_i H_i(\theta, \theta') + \sum_{i_1 > i_2} H_{i_1}(\theta, \theta')\, H_{i_2}(\theta, \theta') + \dots\bigg] = \mathbb{E}\bigg[\prod_{i=1}^n \big(1 + H_i(\theta, \theta')\big)\bigg] - 1,$$

which completes the proof. ∎
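The identity in the lemma can be verified numerically on a small example. Below is a Python sketch (binary alphabet, fully correlated $\theta_i = Z$ for all $i$, illustrative parameters, not from the paper) that compares the chi-square distance of the mixture, computed by brute-force enumeration, with the right-side formula:

```python
import itertools

# Binary alphabet, P uniform; Q_+ and Q_- tilt the uniform distribution by +/- a.
a, n = 0.1, 3
Q = {+1: (0.5 + a, 0.5 - a), -1: (0.5 - a, 0.5 + a)}

def prod_prob(zsign, xs):
    """Probability of the string xs under the product distribution Q_z^n."""
    prob = 1.0
    for x in xs:
        prob *= Q[zsign][x]
    return prob

# Left side: chi-square distance of the mixture E_Z[Q_Z^n] to the uniform P^n.
lhs = -1.0
for xs in itertools.product((0, 1), repeat=n):
    m = 0.5 * prod_prob(+1, xs) + 0.5 * prod_prob(-1, xs)
    lhs += m * m / (0.5 ** n)

# Right side: E_{ZZ'}[prod_i (1 + H_i(Z, Z'))] - 1; here each normalized
# perturbation is delta_z(x) = +/- 2az, so H_i(Z, Z') = <delta_Z, delta_Z'> = 4 a^2 Z Z'.
rhs = -1.0
for z in (-1, 1):
    for zp in (-1, 1):
        rhs += 0.25 * (1 + 4 * a * a * z * zp) ** n

assert abs(lhs - rhs) < 1e-12
```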

Proceeding as in [39], we obtain the following result which will be seen to yield the desired lower bound for sample complexity.

###### Lemma .

For $\varepsilon \in (0,1]$ and a $k$-ary distribution $p$, let $\mathcal{P}_\varepsilon$ be an $\varepsilon$-perturbed family around $p$. Then, the sample complexity $n$ for $\varepsilon$-identity testing with reference distribution $p$ must satisfy

$$\chi^{(2)}(\mathcal{P}_\varepsilon^n) \ge c,$$

for some constant $c > 0$ depending only on the probability of error.

###### Proof.

The proof uses Le Cam’s two-point method. We note first that

$$d_{\mathrm{TV}}\big(\mathbb{E}[p_Z^n], p^n\big)^2 \le d_{\chi^2}\big(\mathbb{E}[p_Z^n], p^n\big),$$

and bound the right-side further using Section III-B with $\theta$ replaced by $(Z, \dots, Z)$, $P^n = p^n$, and $Q_{\theta_i} = p_Z$, to get

$$d_{\mathrm{TV}}\big(\mathbb{E}[p_Z^n], p^n\big)^2 \le \mathbb{E}_{ZZ'}\big[(1 + H_1(Z, Z'))^n\big] - 1 \le \exp\big(\chi^{(2)}(\mathcal{P}_\varepsilon^n)\big) - 1, \qquad (11)$$

since $1 + x \le e^x$. Now, to complete the proof, consider an $\varepsilon$-test $T$. By definition, we have $\Pr_{X^n \sim p^n}[T(X^n) \neq 1] \le 1/3$ and $\Pr_{X^n \sim p_z^n}[T(X^n) \neq 0] \le 1/3$ for every $z \in \mathcal{Z}$, whereby

$$\frac{1}{2}\Pr_{X^n \sim p^n}\big[T(X^n) \neq 1\big] + \frac{1}{2}\Pr_{X^n \sim \mathbb{E}[p_Z^n]}\big[T(X^n) \neq 0\big] \le \frac{1}{3}. \qquad (12)$$

The left-hand-side above coincides with the Bayes error of the test $T$ for the simple binary hypothesis testing problem of $p^n$ versus $\mathbb{E}[p_Z^n]$, which must be at least

$$\frac{1}{2}\Big(1 - d_{\mathrm{TV}}\big(\mathbb{E}[p_Z^n], p^n\big)\Big).$$

Thus, we obtain $d_{\mathrm{TV}}(\mathbb{E}[p_Z^n], p^n) \ge 1/3$, which together with (11) completes the proof. ∎
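As a sanity check on Eq. 11 (a Python sketch on an illustrative $k = 2$ instance of the Paninski family, not from the paper), the following computes the total variation distance between the mixture and the uniform product distribution by enumeration and verifies the decoupled chi-square bound:

```python
import itertools
import math

# k = 2 instance: p_z = (0.5 + eps*z, 0.5 - eps*z), so d_TV(p_z, u) = eps.
eps, n = 0.2, 3
p = {+1: (0.5 + eps, 0.5 - eps), -1: (0.5 - eps, 0.5 + eps)}

# Exact d_TV(E_Z[p_Z^n], u^n) by enumeration over {0,1}^n.
tv = 0.0
for xs in itertools.product((0, 1), repeat=n):
    mix = 0.5 * math.prod(p[+1][x] for x in xs) + 0.5 * math.prod(p[-1][x] for x in xs)
    tv += 0.5 * abs(mix - 0.5 ** n)

# Decoupled fluctuation: here <delta_Z, delta_Z'> = 4 eps^2 Z Z' with Z, Z' uniform signs.
chi2_dec = math.log(0.25 * sum(math.exp(n * 4 * eps * eps * z * zp)
                               for z in (-1, 1) for zp in (-1, 1)))

# Eq. 11: squared TV distance is at most exp(chi^(2)) - 1.
assert tv ** 2 <= math.exp(chi2_dec) - 1
```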

In particular, going back to Paninski’s perturbed family of Eq. 3, observe that

$$\langle \delta_Z, \delta_{Z'} \rangle = \frac{8\varepsilon^2}{k} \sum_{i=1}^{k/2} Z_i Z'_i = \frac{8\varepsilon^2}{k} \sum_{i=1}^{k/2} V_i,$$

where $V_i := Z_i Z'_i$ are independent and distributed uniformly over $\{-1,+1\}$. We can thus bound the decoupled chi-square fluctuation using Hoeffding’s Lemma (cf. [12]) as

$$\chi^{(2)}(\mathcal{P}_\varepsilon^n) = \log \mathbb{E}\Big[e^{\frac{8n\varepsilon^2}{k} \sum_{i=1}^{k/2} V_i}\Big] \le \frac{16 n^2 \varepsilon^4}{k}. \qquad (13)$$

Thus, Section III-B implies that $n = \Omega(\sqrt{k}/\varepsilon^2)$ samples are needed for testing (in particular, for uniformity testing).
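The bound in Eq. 13 can be checked numerically: since the $V_i$ are i.i.d. uniform on $\{-1,+1\}$, the expectation factorizes and $\chi^{(2)}(\mathcal{P}_\varepsilon^n) = (k/2)\log\cosh(8n\varepsilon^2/k)$ exactly. A small Python sketch comparing this exact value to the Hoeffding bound (parameter values are illustrative, not from the paper):

```python
import math

# Exact decoupled chi-square fluctuation for Paninski's family versus Eq. 13.
k, eps, n = 100, 0.1, 50
lam = 8 * n * eps ** 2 / k                    # coefficient of each V_i in the exponent
exact = (k // 2) * math.log(0.5 * (math.exp(lam) + math.exp(-lam)))  # (k/2) log cosh
bound = 16 * n ** 2 * eps ** 4 / k            # Hoeffding: log cosh(lam) <= lam^2 / 2
assert 0 < exact <= bound
```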

We summarize the geometry captured by the bounds derived in this section in Fig. 2. This geometry is a local view of the neighborhood of the uniform distribution obtained using the perturbed family in Eq. 3. Each $p_z$ is at a total variation distance $\varepsilon$ from $u$. The mixture distribution we use is obtained by choosing the perturbation $z$ uniformly over $\{-1,+1\}^{k/2}$.

The chi-square fluctuation of $\mathcal{P}_\varepsilon$ is $4\varepsilon^2$, whereby the average total variation distance of $p_Z^n$ to $u^n$ is $O(\varepsilon\sqrt{n})$. On the other hand, the decoupled chi-square fluctuation of $\mathcal{P}_\varepsilon^n$ is $O(n^2\varepsilon^4/k)$, and thus the total variation distance of the mixture of the $p_Z^n$ to $u^n$ is $O(n\varepsilon^2/\sqrt{k})$. Note that for $n = o(k/\varepsilon^2)$, the total variation distance between the mixture and $u^n$ is much smaller than the average total variation distance.

## IV Results: The chi-square contraction bounds

We now extend our notions of chi-square fluctuation and decoupled chi-square fluctuation to the information-constrained setting. We follow the same notation as the previous section. Recall that in the information-constrained setting each player sends information about its sample by choosing a channel from a family $\mathcal{W}$ to communicate to the central referee. The perturbed family will now induce a distribution on the outputs of the chosen channels $W_1, \dots, W_n$. The difficulty of learning and testing problems will thus be determined by chi-square fluctuations for this induced perturbed family, extending the results of the previous section to the information-constrained setting. The difficulty of inference gets amplified by information constraints, since the induced distributions are closer to each other than the original ones and the chi-square fluctuation decreases.

As one of our main results in this section, we provide a bound for chi-square fluctuations of the induced perturbed family corresponding to Paninski’s perturbed family of Eq. 3, for a given channel family $\mathcal{W}$. Underlying these bounds is a precise characterization of the contraction in chi-square fluctuation owing to information constraints. One can view this as a bound for the minmax chi-square fluctuation for an induced perturbed family, where the minimum is taken over perturbed families and the maximum over all channels in $\mathcal{W}$. We will see that for public-coin protocols, the bottleneck is indeed captured by this minmax chi-square fluctuation.

On the other hand, for private-coin protocols the bottleneck can be tightened further by designing a perturbation specifically for each choice of channels from $\mathcal{W}$. In other terms, we can in this case use a bound for the maxmin chi-square fluctuation. Another main result of this section, perhaps our most striking one, is a tight bound for this maxmin chi-square fluctuation for the aforementioned induced perturbed family. This bound turns out to be more restrictive than the minmax chi-square fluctuation bound and leads to the separation between private- and public-coin protocols for the communication- and privacy-constrained settings considered in the next section.

We begin by noting that Section III-A and Section III-B extend to the information-constrained setting. Our extension involves the notions of an induced perturbed family and its chi-square fluctuations, defined next. Throughout we assume that the family $\mathcal{W}$ consists of channels with input alphabet $[k]$ and a finite output alphabet $\mathcal{Y}$, and that the perturbed family can be parameterized as $\mathcal{P} = \{p_z : z \in \mathcal{Z}\}$.

###### Definition .

For a perturbed family $\mathcal{P} = \{p_z : z \in \mathcal{Z}\}$ and channels $W^n = (W_1, \dots, W_n) \in \mathcal{W}^n$, the induced perturbed family comprises the distributions on $\mathcal{Y}^n$ given by

$$p_z^{W^n}(y^n) := \prod_{i=1}^n p_z^{W_i}(y_i), \qquad y^n \in \mathcal{Y}^n, \quad z \in \mathcal{Z}.$$

To extend the notion of chi-square fluctuations to induced perturbed families, we need to capture the corresponding notion of normalized perturbation. Let $p^W$ and $q^W$, respectively, be the output distributions for a channel $W$ with input distributions $p$ and $q$. Then, for $y \in \mathcal{Y}$, we have

$$\frac{q^W(y) - p^W(y)}{p^W(y)} = \frac{\sum_{x \in \mathcal{X}} \big(q(x) - p(x)\big)\, W(y \mid x)}{p^W(y)} = \frac{\sum_{x} p(x)\, W(y \mid x)\, \delta(x)}{\sum_{x'} p(x')\, W(y \mid x')},$$
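The displayed identity is immediate since $\sum_x p(x) W(y \mid x)\,\delta(x) = q^W(y) - p^W(y)$; a short Python sketch with a randomly generated channel (an illustrative example, not from the paper) confirms it:

```python
import random

# Verify: (q^W(y) - p^W(y)) / p^W(y) equals sum_x p(x) W(y|x) delta(x) / p^W(y).
rng = random.Random(1)
k, m = 4, 3                                  # input alphabet [k], output alphabet size m
W = []
for _ in range(k):                           # random row-stochastic channel matrix
    row = [rng.random() for _ in range(m)]
    s = sum(row)
    W.append([r / s for r in row])

p = [1 / k] * k                              # reference distribution (uniform)
q = [0.1, 0.4, 0.2, 0.3]                     # an arbitrary perturbed distribution
delta = [(qx - px) / px for qx, px in zip(q, p)]

pW = [sum(p[x] * W[x][y] for x in range(k)) for y in range(m)]
qW = [sum(q[x] * W[x][y] for x in range(k)) for y in range(m)]

for y in range(m):
    lhs = (qW[y] - pW[y]) / pW[y]
    rhs = sum(p[x] * W[x][y] * delta[x] for x in range(k)) / pW[y]
    assert abs(lhs - rhs) < 1e-12
```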