I Introduction
Estimating a distribution from samples is a fundamental unsupervised learning problem that has been studied in statistics since the late nineteenth century [19]. Consider the following distribution estimation model, where we would like to estimate the unknown distribution under loss. Unlike the traditional statistical setting where samples are available to the estimator as they are, in this paper we consider a distributed setting where each observation is available at a different node and has to be communicated to a central processor by using bits. We consider three different communication protocols:

Independent communication protocol : each node sends a bit string simultaneously (independent of the other nodes) to the central processor and the final transcript is ;

Sequential communication protocol : the nodes sequentially send a bit string , where the quantization of the sample can depend on the previous messages ;

Blackboard communication protocol [17]: all nodes communicate via a publicly shown blackboard while the total number of bits each node can write in the final transcript is limited by . When one node writes a message (bit) on the blackboard, all other nodes can see the content of the message and depending on the written bit another node can take the turn to write a message on the blackboard.
Upon receiving the transcript , the central processor produces an estimate of the distribution based on the transcript and known protocol which can be of type , , or . Our goal is to jointly design the protocol and the estimator so as to minimize the worst-case squared risk, i.e., to characterize
where denotes the class of distributions which belongs to. We study three different instances of this estimation problem:

High-dimensional discrete distributions: in this case we assume that is a discrete distribution with known support size and denotes the probability simplex over elements. By “high-dimensional” we mean that the support size of the underlying distribution may be comparable to the sample size .
Nonparametric distributions: in this case , where is a density that possesses some standard Hölder continuity property [18].

Structured distributions: in this case, we assume that we have some additional information regarding the structure of the underlying distribution or density. In particular, we assume that the underlying distribution or density can be parametrized such that
where . In this case, estimating the underlying distribution amounts to estimating the parameters of this distribution and we are interested in the following parameter estimation problem under squared risk
where is an estimator of .
Statistical estimation in distributed settings has gained increasing popularity in recent years, motivated by the fact that modern data sets are often distributed across multiple machines and processors, and that bandwidth and energy limitations in networks and within multiprocessor systems often impose significant bottlenecks on the performance of algorithms. There is also an increasing number of applications in which data is generated in a distributed manner and the data (or features of it) are communicated over bandwidth-limited links to central processors. In particular, recent works [6, 12, 13] focus on a special case of the distributed parameter estimation problem described above, when the underlying distribution is known to have Gaussian structure, i.e. with known and , often called the Gaussian location model. On the other hand, [4] focuses on the first two problems described above, distributed estimation of high-dimensional and nonparametric distributions, under loss.
In this paper, we approach all of these problems in a unified way by developing a framework that characterizes the Fisher information for estimating an underlying unknown parameter from a bit quantization of a sample . Equivalently, we ask the question: how can we best represent with bits so as to maximize the Fisher information it provides about ? This framework was first introduced by the authors in [1, 2]. There has been some previous work on analyzing Fisher information from a quantized scalar random variable, such as [7, 8, 9, 10]. Unlike these works, here we consider the arbitrary quantization of a random vector and are able to study the impact of the quantization rate along with the dimension of the underlying statistical model on the Fisher information.
As an application of our framework, we use upper bounds on Fisher information to derive lower bounds on the minimax risk of the distributed estimation problems we discuss above and recover many of the existing results in the literature [6, 12, 13, 4], which are known to be (order-wise) tight. Our technique is significantly simpler and more transparent than those in [6, 12, 13, 4]. In particular, [6, 12, 13] focus on the Gaussian location model and its variants, and rely on strong/distributed data processing inequalities to develop their converse results. Proving strong/distributed data processing inequalities is typically quite technical and, as discussed in [4], this approach seems to be applicable only to models where the fundamental dependence of the minimax risk on the quantization rate is linear, such as the Gaussian location model. On the other hand, the approach in [3, 4] is to convert the estimation problem to a hypothesis testing problem, and it involves a highly technical analysis of the divergence between distributions of the transcript under different hypotheses. The current paper recovers the results in [3] (with the exception of their result on the sparse Gaussian location model, which closes a logarithmic gap left open in [12]) via the simpler approach that builds on the analysis of Fisher information from quantized samples. We also extend the results of [3] to find minimax lower bounds for statistical models with sub-exponential score functions, which is the case, for example, when we are interested in estimating not the mean but the variance of a Gaussian distribution.
I-A Organization of the Paper
In the next section, we introduce the problem of characterizing Fisher information from a quantized sample. We present a geometric characterization for this problem and use it to derive two upper bounds on Fisher information as a function of the quantization rate. We evaluate these upper bounds for common statistical models. In Section III, we formulate the problem of distributed learning of distributions under communication constraints with independent, sequential and blackboard communication protocols. We use the upper bounds on Fisher information from Section II to derive lower bounds on the minimax risk of distributed estimation of discrete and structured distributions. There we also provide a more detailed comparison of our results with those in the literature. Finally, in Section IV we discuss extending these results to nonparametric density estimation.
II Fisher information from a quantized sample
Let be a family of probability measures on parameterized by . Suppose each is dominated by some base measure and that each has density with respect to . Let be a single sample drawn from . Any (potentially randomized) bit quantization strategy for can be expressed in terms of the conditional probabilities
We assume that there is a well-defined joint probability distribution with density
and that is a regular conditional probability. For a given and quantization strategy, denote the likelihood that the quantization takes a specific value by . Let
be the score of this likelihood. In an abuse of notation, we will also denote the score of the likelihood by
The Fisher information matrix for estimating from is
and likewise the Fisher information matrix for estimating from an is
We will assume throughout that satisfies the following regularity conditions:

For each and fixed , the function , thought of as a function of , is continuously differentiable with respect to at almost every .

For each and all , the expected value exists and is continuous in .
These two conditions justify interchanging differentiation and integration as in
for each , and they also ensure that is continuously differentiable with respect to each (see Lemma 1, Section 26 in [14]). We will also assume, without loss of generality, that . For each fixed , this can be done by restricting the domain to only include those values such that when taking an expectation, or equivalently, by defining whenever . In the same way we will assume that .
Our first two lemmas establish a geometric interpretation of the trace . The first lemma is a slight variant of Theorems 1 and 2 from [5], and our debt to that work is clear.
Lemma 1.
The th entry of the Fisher information matrix is
The inner conditional expectation is with respect to the distribution , while the outer expectation is over the conditioning random variable .
Proof.
Squaring both sides and taking the expectation over gives
where the left-hand side is by definition . ∎
Lemma 2.
The trace of the Fisher information matrix can be written as
(1) 
Proof.
By Lemma 1,
∎
In order to get some geometric intuition for the quantity (1), consider a special case where the quantization is deterministic and the score function is a bijection. In this case, the quantization map partitions the space into disjoint quantization bins, and this induces a corresponding partitioning of the score function values . Each vector is then the centroid of the set of values corresponding to quantization bin (with respect to the induced probability distribution on ). Lemma 2 shows that is equal to the average squared magnitude of these centroid vectors.
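As a rough numerical illustration of this centroid picture, the sketch below works through one concrete case that is not taken from the text: a scalar Gaussian location model with unit variance, whose score is the sample minus the mean, and a deterministic one-bit quantizer that reports whether the sample exceeds a threshold `t`. All function names are hypothetical. The Fisher information of the bit is computed once from its Bernoulli likelihood and once as the probability-weighted squared centroids of the score over the two bins.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fisher_from_likelihood(theta, t):
    """Fisher information about theta carried by the bit M = 1{X > t},
    X ~ N(theta, 1), computed from the Bernoulli likelihood of M."""
    q = 1.0 - Phi(t - theta)      # P(M = 1)
    dq = phi(t - theta)           # derivative of P(M = 1) in theta
    return dq * dq / (q * (1.0 - q))

def bin_mass_and_centroid(theta, lo, hi, steps=20000):
    """Midpoint-rule estimate of P(lo < X <= hi) and of the centroid
    E[X - theta | lo < X <= hi] of the score over that bin."""
    h = (hi - lo) / steps
    mass = 0.0
    first_moment = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        w = phi(x - theta) * h
        mass += w
        first_moment += (x - theta) * w
    return mass, first_moment / mass

def fisher_from_centroids(theta, t, width=10.0):
    """Average squared centroid of the score over the two quantization bins
    (the Gaussian tails beyond `width` standard deviations are ignored)."""
    total = 0.0
    for lo, hi in ((theta - width, t), (t, theta + width)):
        mass, centroid = bin_mass_and_centroid(theta, lo, hi)
        total += mass * centroid ** 2
    return total
```

With the threshold placed at the true mean, both computations give a value close to 2/π, which is below the Fisher information of 1 carried by the unquantized sample in this assumed model.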
II-A Upper Bounds on
In this section, we give two upper bounds on . The proofs appear in Appendix A. The first theorem upper bounds in terms of the variance of when projected onto any unit vector.
Theorem 1.
If for any and any unit vector ,
then
The upper bound follows easily from the data processing inequality for Fisher information [11]. The theorem shows that when is finite, can increase at most exponentially in .
Recall that for $p \ge 1$, the $\Psi_p$ Orlicz norm of a random variable $X$ is defined as
$$\|X\|_{\Psi_p} = \inf\{K \in (0,\infty) : \mathbb{E}[\Psi_p(|X|/K)] \le 1\},$$
where
$$\Psi_p(x) = \exp(x^p) - 1.$$
A random variable with finite $\Psi_1$ Orlicz norm is sub-exponential, while a random variable with finite $\Psi_2$ Orlicz norm is sub-Gaussian [15]. Our second theorem shows that when the $\Psi_p$ Orlicz norm of the projection of onto any unit vector is bounded by some finite constant, can increase at most like .
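To make the Orlicz norm concrete, here is a minimal sketch that evaluates it by bisection for a finitely supported random variable. It assumes the standard textbook definition, namely the smallest K > 0 such that E[exp((|X|/K)^p)] - 1 is at most 1. The function name and the finite-support restriction are illustrative choices, not part of the paper.

```python
import math

def orlicz_norm(values, probs, p, iterations=200):
    """Orlicz norm of order p for a random variable taking values[i] with
    probability probs[i], under the assumed definition
    inf{K > 0 : E[exp((|X|/K)**p) - 1] <= 1}."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0.0

    def expected_psi(K):
        return sum(pr * math.expm1((abs(v) / K) ** p)
                   for v, pr in zip(values, probs))

    # At this K every term is at most exp(log 2) - 1 = 1, so the constraint holds.
    hi = largest / math.log(2.0) ** (1.0 / p)
    lo = 0.0
    # expected_psi is decreasing in K, so bisection finds the smallest feasible K.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_psi(mid) <= 1.0:
            hi = mid
        else:
            lo = mid
    return hi
```

For a random sign (values -1 and +1 with equal probability) the constraint reduces to exp(K^(-p)) ≤ 2, so the norm is (log 2)^(-1/p), which gives a closed form to compare the bisection against.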
Theorem 2.
If for any , some , and any unit vector ,
then
where
II-B Applications to Common Statistical Models
We next apply the above two results to common statistical models. We will see that neither of these bounds is strictly stronger than the other and depending on the statistical model, one may yield a tighter bound than the other. The proofs of Corollaries 1 through 4 appear in Appendix B. In the next section, we show that Corollaries 1, 3, 4 yield tight results for the minimax risk of the corresponding distributed estimation problems.
Corollary 1 (Gaussian location model).
Consider the Gaussian location model where we are trying to estimate the mean of a dimensional Gaussian random vector with fixed covariance . In this case,
(2) 
where
The above corollary follows by showing that the score function associated with this model has finite Orlicz norm and applying Theorem 2.
Corollary 2 (variance of a Gaussian).
Suppose and with . In this case,
where
Corollary 2 similarly follows by showing that the score function associated with this model has finite Orlicz norm and applying Theorem 2.
Corollary 3 (distribution estimation).
Suppose that and that
Let be the free parameters of interest and suppose they can vary from . In this case,
(3) 
Corollary 3 is a consequence of Theorem 1 along with a characterization of the variance of the score function associated with this model.
Corollary 4 (product Bernoulli model).
Consider . If for some , i.e. the model is relatively dense, then
for some constant that depends only on . If , i.e. the model is sparse, then
With the product Bernoulli model,
so that when , and are both . In this case, Theorem 1 gives
while Theorem 2 gives
In this situation Theorem 2 gives the better bound. On the other hand, if , then and . In this case Theorem 1 gives
while Theorem 2 gives
In the sparse case , so only the bound from Theorem 1 is nontrivial. It is interesting that Theorem 2 is able to use the sub-Gaussian structure in the first case to yield a better bound, whereas in the second case, when the tail of the score function is essentially not sub-Gaussian, Theorem 1 yields the better bound.
III Distributed Parameter Estimation
In this section, we focus on distributed estimation of parameters of an underlying distribution under communication constraints. Estimation of discrete distributions occurs as a special case of this problem. In the next section, we extend the discussion to distributed estimation of nonparametric distributions. The main technical exercise involves the application of Theorems 1 and 2 to statistical estimation with multiple quantized samples where the quantization of different samples can be independent or dependent as dictated by the communication protocol.
III-A Problem Formulation
Let
where . We consider three different models for communicating each of these samples with bits to a central processor that is interested in estimating the underlying parameter :

Independent communication protocol : under this model, each sample is independently quantized to bits and communicated. Formally, each sample , for , is encoded to a bit string by a possibly randomized quantization strategy, denoted by , which can be expressed in terms of the conditional probabilities

Sequential communication protocol : under this model, we assume that samples are communicated sequentially by broadcasting the communication to all nodes in the system including the central processor. Therefore, the quantization of the sample can depend on the previously transmitted quantized samples corresponding to samples respectively. Formally, each sample , for , is encoded to a bit string by a set of possibly randomized quantization strategies , where each strategy can be expressed in terms of the conditional probabilities
While these two models can be motivated by a distributed estimation scenario where the topology of the underlying network can dictate the type of the protocol (see Figure 0(b)) to be used, they can also model the quantization and storage of samples arriving sequentially at a single node. For example, consider a scenario where a continuous stream of samples is captured sequentially and each sample is stored in digital memory by using bits/sample. In the independent model, each sample would be quantized independently of the other samples (even though the quantization strategies for different samples can be different and jointly optimized ahead of time), while under the sequential model the quantization of each sample would depend on the information stored in the memory of the system at time . This is illustrated in Figure 0(a).
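The difference between the two models can be sketched in a few lines for the simplest case of one bit per sample. This is only an illustration: the threshold quantizers, the update rule in the sequential version, and the function names are assumptions made for the example and are not strategies proposed in the paper.

```python
def independent_protocol(samples, thresholds):
    """Independent model: the bit for sample i depends only on that sample
    and on a threshold fixed ahead of time (thresholds may differ per sample)."""
    return [1 if x > t else 0 for x, t in zip(samples, thresholds)]

def sequential_protocol(samples, start=0.0, step=1.0):
    """Sequential model: the quantizer for sample i may depend on the bits
    already broadcast, here through a running threshold that they update."""
    messages = []
    threshold = start
    for i, x in enumerate(samples, start=1):
        bit = 1 if x > threshold else 0
        messages.append(bit)
        # Hypothetical update rule: move the threshold toward the side
        # indicated by the bit, with a shrinking step size.
        threshold += (step / i) * (1 if bit else -1)
    return messages
```

For three samples all equal to 0.5, the independent version with thresholds fixed at 0 returns `[1, 1, 1]`, while the sequential version started at 0 returns `[1, 0, 0]` because the threshold moves to 1.0 and then to 0.5 as bits are broadcast.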
We finally introduce a third communication protocol that allows nodes to communicate their samples to the central processor in a fully interactive manner while still limiting the number of bits used per sample to bits. Under this model, each node can see the previously written bits on a public blackboard, and can use that information to determine its quantization strategy for subsequently transmitted bits. This is formally defined below.

Blackboard communication protocol : all nodes communicate via a publicly shown blackboard while the total number of bits each node can write in the final transcript is limited by bits. When one node writes a message (bit) on the blackboard, all other nodes can see the content of the message. Formally, a blackboard communication protocol can be viewed as a binary tree [17], where each internal node of the tree is assigned a deterministic label indicating the identity of the node to write the next bit on the blackboard if the protocol reaches tree node ; the left and right edges departing from correspond to the two possible values of this bit and are labeled by and respectively. Because all bits written on the blackboard up to the current time are observed by all nodes, the nodes can keep track of the progress of the protocol in the binary tree. The value of the bit written by node (when the protocol is at node of the binary tree) can depend on the sample observed by this node (and implicitly on all bits previously written on the blackboard encoded in the position of the node in the binary tree). Therefore, this bit can be represented by a function , which we associate with the tree node ; node transmits with probability and with probability . Note that a proper labeling of the binary tree together with the collection of functions (where ranges over all internal tree nodes) completely characterizes all possible (possibly probabilistic) communication strategies for the nodes.
The bit communication constraint for each node can be viewed as a labeling constraint for the binary tree; for each , each possible path from the root node to a leaf node can visit exactly internal nodes with label . In particular, the depth of the binary tree is and there is a one-to-one correspondence between all possible transcripts and paths in the tree. Note that there is also a one-to-one correspondence between and the bit messages transmitted by the nodes. In particular, the transcript contains the same amount of information as , since given the transcript (and the protocol) one can infer and vice versa (for this direction note that the protocol specifies the node to transmit first, so given one can deduce the path followed in the protocol tree).
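The binary-tree view of a blackboard protocol maps naturally onto a small recursive data structure. The sketch below is an illustration under assumptions: the class and function names are hypothetical, leaves are represented by `None`, participants are indexed from 0, and each sample is a single float. It runs a protocol on given samples and checks the labeling constraint that every root-to-leaf path visits each label the same number of times.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class TreeNode:
    label: int                          # participant who writes the next bit
    prob_one: Callable[[float], float]  # probability of writing 1 given its sample
    zero: Optional["TreeNode"] = None   # subtree reached after a 0 (None = leaf)
    one: Optional["TreeNode"] = None    # subtree reached after a 1 (None = leaf)

def run_protocol(root: Optional[TreeNode], samples: List[float],
                 rng: random.Random) -> List[int]:
    """Walk the tree from the root, letting the labeled participant write each bit."""
    transcript = []
    node = root
    while node is not None:
        bit = 1 if rng.random() < node.prob_one(samples[node.label]) else 0
        transcript.append(bit)
        node = node.one if bit else node.zero
    return transcript

def satisfies_budget(node: Optional[TreeNode], n: int, k: int,
                     counts: Optional[List[int]] = None) -> bool:
    """True if every root-to-leaf path visits each of the n labels exactly k times."""
    counts = [0] * n if counts is None else counts
    if node is None:
        return all(c == k for c in counts)
    updated = counts.copy()
    updated[node.label] += 1
    return (satisfies_budget(node.zero, n, k, updated)
            and satisfies_budget(node.one, n, k, updated))

def threshold(t: float) -> Callable[[float], float]:
    """Deterministic quantizer: write 1 exactly when the sample exceeds t."""
    return lambda x: 1.0 if x > t else 0.0

# Two participants, one bit each: participant 1 picks its threshold
# according to the bit that participant 0 wrote on the blackboard.
example_root = TreeNode(0, threshold(0.0),
                        zero=TreeNode(1, threshold(-1.0)),
                        one=TreeNode(1, threshold(1.0)))
example_transcript = run_protocol(example_root, [0.5, 0.2], random.Random(0))
```

In this example the first participant writes 1, so the second participant uses the threshold 1.0 and writes 0; the transcript `[1, 0]` identifies a single path in the tree, in line with the correspondence between transcripts and paths described above.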
Under all three communication protocols above, the end goal is to produce an estimate of the underlying parameter from the bit transcript or equivalently the collection of bit messages observed by the estimator. Note that the encoding strategies/protocols used in each case can be jointly optimized and agreed upon by all parties ahead of time. Formally, we are interested in the following parameter estimation problem under squared risk
where is an estimator of based on the quantized observations. Note that with an independent communication protocol, the messages are independent, while this is no longer the case with the sequential and blackboard protocols.
III-B Main Results for Distributed Parameter Estimation
We next state our main results for distributed parameter estimation. We will show in the next subsection that the next two theorems can be applied to obtain tight lower bounds for distributed estimation under many common statistical models, including discrete distribution estimation and the Gaussian location model.
Theorem 3.
With the blackboard communication protocol we have the following slightly more restrictive theorem where we can prove the second bound only for the case .
Theorem 4.
We next show how Theorem 3 follows by a rather straightforward application of the van Trees inequality combined with the conclusions of Theorems 1 and 2. The proof of Theorem 4 is given in Appendix C.
Proof of Theorem 3:
We are interested in the quantity
under each model. We have
(4) 
due to the chain rule for Fisher information. Under the independent model,
Under the sequential model, conditioning on specific only affects the distribution by fixing the quantization strategy for . Formally, for the sequential model,
where the last step follows since is independent of and therefore conditioning on does not change the distribution of . Since the bounds from Theorems 1 and 2 apply for any quantization strategy, they apply to each of the terms in (4), and the following statements hold under both quantization models:
Consider the squared error risk in estimating :
In order to lower bound this risk, we will use the van Trees inequality from [16]. Suppose we have a prior for the parameter . For convenience denote . The van Trees inequality for the component gives
(5) 
where is the Fisher information from the prior. Note that the required regularity condition that follows trivially since the expectation over is just a finite sum:
The prior can be chosen to minimize this Fisher information and achieve [14]. Let By summing over each component,
(6)  
(7)  
Therefore,
(8) 
The inequality (7) follows from Jensen’s inequality via the convexity of for , and the inequality (6) follows both from this convexity and (5). We could have equivalently used the multivariate version of the van Trees inequality [16] to arrive at the same result, but we have used the single-variable version in each coordinate instead in order to simplify the required regularity conditions.
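The role of the prior in the van Trees step can be checked numerically in one standard case. The sketch assumes, purely for illustration, that a parameter component ranges over the interval [-1, 1] and uses the prior density cos²(πx/2), a classical choice that vanishes at the endpoints and has Fisher information π². The interval, the function names, and the scalar form of the bound used in the last function (one over the sum of the average Fisher information and the prior's Fisher information) are assumptions of this example, not statements recovered from the text.

```python
import math

def prior_density(x):
    """Assumed prior on [-1, 1]: cos^2(pi x / 2), which integrates to 1."""
    return math.cos(0.5 * math.pi * x) ** 2

def prior_density_derivative(x):
    """Derivative of the assumed prior density."""
    return -math.pi * math.cos(0.5 * math.pi * x) * math.sin(0.5 * math.pi * x)

def prior_fisher_information(steps=20000):
    """Midpoint-rule estimate of the integral of (p')^2 / p over [-1, 1]."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        x = -1.0 + (i + 0.5) * h   # midpoints avoid the endpoints, where p = 0
        total += prior_density_derivative(x) ** 2 / prior_density(x) * h
    return total

def van_trees_lower_bound(average_fisher_information, prior_information):
    """Scalar van Trees bound on the Bayes squared error, in its usual form."""
    return 1.0 / (average_fisher_information + prior_information)
```

The numerical integral agrees with π² to high accuracy. Once the Fisher information from the data is much larger than this constant, the prior term contributes little to the denominator of the bound.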
Combining (8) with (i) and (ii) proves the theorem.
III-C Applications to Common Statistical Models
Using the bounds we developed in Section II-B, Theorems 3 and 4 give a lower bound on the minimax risk for the distributed estimation of under common statistical models. We summarize these results in the following corollaries:
Corollary 5 (Gaussian location model).
Let with . For , we have
for any communication protocol of type , , or and estimator , where is a universal constant independent of .
Note that the condition in the above corollary is a weak condition that ensures that we can ignore the second term in the denominator of (8). For fixed this condition is weaker than just assuming that is at least order , which is required for consistent estimation anyway. We will make similar assumptions in the subsequent corollaries.
The corollary recovers the results in [6, 13] (without logarithmic factors in the risk) and the corresponding result from [3] without the condition . A blackboard communication protocol that matches this result is given in [13].
Corollary 6 (variance of a Gaussian).
Let with . For , we have
for any communication protocol of type or and estimator , where is a universal constant independent of .
The bound in Corollary 6 is new, and it is unknown whether it is order-optimal.
Corollary 7 (distribution estimation).
Suppose that and that
Let be the probability simplex with variables. For , we have
for any communication protocol of type , , or and estimator , where is a universal constant independent of .
This result recovers the corresponding result in [3] and matches the upper bound from the achievable scheme developed in [4] (when the performance of the scheme is evaluated under loss rather than ).
Corollary 8 (product Bernoulli model).
Suppose . If then for we have
for any communication protocol of type , , or and estimator , where is a universal constant independent of .
If , then for , we get instead
The corollary recovers the corresponding result from [3] and matches the upper bound from the achievable scheme presented in the same paper.
IV Distributed Estimation of Non-Parametric Densities
Finally, we turn to the case where the are drawn independently from some probability distribution on with density with respect to the uniform measure. We will assume that is (uniformly) Hölder continuous with exponent and constant , i.e. that
for all . Let be the space of all such densities. We are interested in characterizing the minimax risk
where the estimators are functions of the transcript . We have the following theorem.
Theorem 5.
For any blackboard protocol and estimator ,
Moreover, this rate is achieved by an independent protocol so that