Learning Distributions from their Samples under Communication Constraints

02/07/2019
by   Leighton Pate Barnes, et al.
Stanford University

We consider the problem of learning high-dimensional, nonparametric and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an independent sample from the underlying distribution and can use k bits to communicate its sample to a central processor. We consider three different models for communication. Under the independent model, each node communicates its sample to a central processor by independently encoding it into k bits. Under the more general sequential or blackboard communication models, nodes can share information interactively but each node is restricted to write at most k bits on the final transcript. We characterize the impact of the communication constraint k on the minimax risk of estimating the underlying distribution under ℓ^2 loss. We develop minimax lower bounds that apply in a unified way to many common statistical models and reveal that the impact of the communication constraint can be qualitatively different depending on the tail behavior of the score function associated with each model. A key ingredient in our proof is a geometric characterization of Fisher information from quantized samples.


I Introduction

Estimating a distribution from samples is a fundamental unsupervised learning problem that has been studied in statistics since the late nineteenth century [19]. Consider the following distribution estimation model:

X_1, X_2, …, X_n i.i.d. ∼ p,

where we would like to estimate the unknown distribution p under ℓ^2 loss. Unlike the traditional statistical setting, where the samples are available to the estimator as they are, in this paper we consider a distributed setting where each observation X_i is available at a different node and has to be communicated to a central processor by using k bits. We consider three different communication protocols:

  1. Independent communication protocol: each node encodes its sample into a k-bit string independently of the other nodes and sends it simultaneously to the central processor; the final transcript is the collection of these n messages (a minimal structural sketch of the first two protocols is given after this list);

  2. Sequential communication protocol: the nodes sequentially send k-bit strings, where the quantization of the i-th sample can depend on the previously transmitted messages;

  3. Blackboard communication protocol [17]: all nodes communicate via a publicly shown blackboard, while the total number of bits each node can write in the final transcript is limited to k. When one node writes a message (bit) on the blackboard, all other nodes can see the content of the message, and depending on the written bit another node can take the turn to write a message on the blackboard.
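For intuition only, the following toy sketch (our own illustration, not a construction from the paper) contrasts the message structure of the independent and sequential protocols for scalar Gaussian samples with k = 1; the thresholds are arbitrary choices, made only to show that a sequential quantizer may depend on earlier messages.

```python
import random

def independent_quantizer(x):
    """Independent protocol: the k-bit message depends only on the node's own sample."""
    return (1,) if x > 0 else (0,)  # a 1-bit sign quantizer

def sequential_quantizer(x, previous_messages):
    """Sequential protocol: node i may adapt its quantizer to messages 1..i-1.

    Here the threshold is a crude running average built from the earlier sign bits;
    the particular choice is arbitrary and only illustrates the dependence."""
    if previous_messages:
        threshold = sum(2 * m[0] - 1 for m in previous_messages) / len(previous_messages)
    else:
        threshold = 0.0
    return (1,) if x > threshold else (0,)

def run_protocol(samples, sequential=False):
    transcript = []
    for x in samples:
        msg = sequential_quantizer(x, transcript) if sequential else independent_quantizer(x)
        transcript.append(msg)
    return transcript  # n messages of k = 1 bit each

if __name__ == "__main__":
    random.seed(0)
    samples = [random.gauss(0.3, 1.0) for _ in range(10)]
    print(run_protocol(samples))                   # independent transcript
    print(run_protocol(samples, sequential=True))  # sequential transcript
```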

Upon receiving the transcript, the central processor produces an estimate p̂ of the distribution based on the transcript and the known protocol Π, which can be of independent, sequential, or blackboard type. Our goal is to jointly design the protocol and the estimator so as to minimize the worst-case squared risk, i.e., to characterize

inf_{Π, p̂} sup_{p ∈ P} E‖p̂ − p‖_2^2,

where P denotes the class of distributions to which p belongs. We study three different instances of this estimation problem:

  1. High-dimensional discrete distributions: in this case we assume that p is a discrete distribution with known support size, and P denotes the probability simplex over that many elements. By “high-dimensional” we mean that the support size of the underlying distribution may be comparable to the sample size n.

  2. Non-parametric distributions: in this case P is a class of densities that possess some standard Hölder continuity property [18].

  3. Structured distributions: in this case, we assume that we have some additional information regarding the structure of the underlying distribution or density. In particular, we assume that the underlying distribution or density can be parametrized as p = p_θ, where θ ∈ Θ ⊆ R^d. In this case, estimating the underlying distribution amounts to estimating the parameters of this distribution, and we are interested in the following parameter estimation problem under squared risk:

     inf_{Π, θ̂} sup_{θ ∈ Θ} E‖θ̂ − θ‖_2^2,

     where θ̂ is an estimator of θ based on the transcript.

Statistical estimation in distributed settings has gained increasing popularity over recent years, motivated by the fact that modern data sets are often distributed across multiple machines and processors, and that bandwidth and energy limitations in networks and within multiprocessor systems often impose significant bottlenecks on the performance of algorithms. There are also an increasing number of applications in which data is generated in a distributed manner and the data (or features of it) are communicated over bandwidth-limited links to central processors. In particular, recent works [6, 12, 13] focus on a special case of the distributed parameter estimation problem described above in which the underlying distribution is known to be Gaussian with known covariance and unknown mean, often called the Gaussian location model. On the other hand, [4] focuses on the first two problems described above, distributed estimation of high-dimensional and non-parametric distributions, under ℓ^2 loss.

In this paper, we approach all of these problems in a unified way by developing a framework that characterizes the Fisher information for estimating an underlying unknown parameter θ from a k-bit quantization of a sample X. Equivalently, we ask the question: how can we best represent X with k bits so as to maximize the Fisher information it provides about θ? This framework was first introduced by the authors in [1], [2]. There has been some previous work analyzing the Fisher information from a quantized scalar random variable, such as [7, 8, 9, 10]. Different from these works, here we consider arbitrary quantization of a random vector and are able to study the impact of the quantization rate k along with the dimension of the underlying statistical model on the Fisher information. As an application of our framework, we use upper bounds on Fisher information to derive lower bounds on the minimax risk of the distributed estimation problems discussed above and recover many of the existing results in the literature [6, 12, 13, 4], which are known to be (order-wise) tight. Our technique is significantly simpler and more transparent than those in [6, 12, 13, 4]. In particular, [6, 12, 13] focus on the Gaussian location model and its variants and rely on strong/distributed data processing inequalities to develop their converse results. Proving strong/distributed data processing inequalities is typically quite technical, and as discussed in [4], this approach seems to be applicable only to models where the fundamental dependence of the minimax risk on the quantization rate is linear, such as the Gaussian location model. On the other hand, the approach in [3, 4] is to convert the estimation problem to a hypothesis testing problem and involves a highly technical analysis of the divergence between the distributions of the transcript under different hypotheses. The current paper recovers the results in [3] (with the exception of their result on the sparse Gaussian location model, which closes a logarithmic gap left open in [12]) via a simpler approach that builds on the analysis of Fisher information from quantized samples. We also extend the results of [3] to find minimax lower bounds for statistical models with sub-exponential score functions, which is the case, for example, when we are interested in estimating not the mean but the variance of a Gaussian distribution.
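As a concrete instance of this question (a minimal illustration of ours, not an example taken from the paper), consider a single sample X ∼ N(θ, 1) quantized to one bit via Y = 1{X > 0}. Since P(Y = 1) = Φ(θ), the Fisher information about θ carried by Y is φ(θ)^2 / (Φ(θ)(1 − Φ(θ))), to be compared with the full-sample Fisher information I_X(θ) = 1:

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def fisher_info_one_bit(theta: float) -> float:
    """Fisher information about theta carried by Y = 1{X > 0}, where X ~ N(theta, 1).

    P(Y = 1) = Phi(theta), so I_Y(theta) = phi(theta)^2 / (Phi(theta) * (1 - Phi(theta))).
    """
    phi = std_normal.pdf(theta)
    Phi = std_normal.cdf(theta)
    return phi ** 2 / (Phi * (1.0 - Phi))

if __name__ == "__main__":
    for theta in (0.0, 0.5, 1.0, 2.0):
        # The unquantized sample has I_X(theta) = 1 for every theta.
        print(f"theta={theta:4.1f}  I_Y={fisher_info_one_bit(theta):.4f}  I_X=1.0")
    # At theta = 0 the single bit retains I_Y = 2/pi (about 0.6366) of the information.
```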

I-A Organization of the Paper

In the next section, we introduce the problem of characterizing Fisher information from a quantized sample. We present a geometric characterization for this problem and use it to derive two upper bounds on Fisher information as a function of the quantization rate. We evaluate these upper bounds for common statistical models. In Section III, we formulate the problem of distributed learning of distributions under communication constraints with independent, sequential and blackboard communication protocols. We use the upper bounds on Fisher information from Section II to derive lower bounds on the minimax risk of distributed estimation of discrete and structured distributions. There we also provide a more detailed comparison of our results with those in the literature. Finally, in Section IV we discuss extending these results to non-parametric density estimation.

II Fisher Information from a Quantized Sample

Let {P_θ : θ ∈ Θ ⊆ R^d} be a family of probability measures on a sample space X, parameterized by θ. Suppose each P_θ is dominated by some base measure μ and has density f(x|θ) with respect to μ. Let X be a single sample drawn from f(x|θ). Any (potentially randomized) k-bit quantization strategy for X can be expressed in terms of the conditional probabilities of the quantized output Y, which takes one of 2^k values, given X (see the display below).

We assume that there is a well-defined joint probability distribution for (X, Y) with a corresponding density, and that the conditional distribution of X given Y is a regular conditional probability. For a given θ and quantization strategy, denote the likelihood that the quantization Y takes the value y by b(y|θ). Let

S_Y(y, θ) = ∇_θ log b(y|θ)

be the score of this likelihood. In an abuse of notation, we will also denote the score of the likelihood f(x|θ) by

S(x, θ) = ∇_θ log f(x|θ).

The Fisher information matrix for estimating θ from X is I_X(θ) = E[S(X, θ) S(X, θ)^T], and likewise the Fisher information matrix for estimating θ from the quantized sample Y is I_Y(θ) = E[S_Y(Y, θ) S_Y(Y, θ)^T].
We will assume throughout that the model and quantization strategy satisfy the following regularity conditions:

  • For each y and fixed quantization strategy, the integrand defining b(y|θ), thought of as a function of θ, is continuously differentiable with respect to θ at μ-almost every x.

  • For each y and all θ, the conditional expectation E[S(X, θ) | Y = y] exists and is continuous in θ.

These two conditions justify interchanging differentiation and integration, as in

∇_θ b(y|θ) = ∫ q(y|x) ∇_θ f(x|θ) dμ(x),

for each y, and they also ensure that b(y|θ) is continuously differentiable with respect to each component of θ (see Lemma 1, Section 26 in [14]). We will also assume, without loss of generality, that f(x|θ) > 0. For each fixed θ, this can be done by restricting the domain to only include those values x such that f(x|θ) > 0 when taking an expectation, or equivalently, by defining the score to be zero whenever f(x|θ) = 0. In the same way we will assume that b(y|θ) > 0.

Our first two lemmas establish a geometric interpretation of the trace tr I_Y(θ). The first lemma is a slight variant of Theorems 1 and 2 from [5], and our debt to that work is clear.

Lemma 1.

The jj-th diagonal entry of the Fisher information matrix I_Y(θ) is

[I_Y(θ)]_{jj} = E_Y[ ( E[ S_j(X, θ) | Y ] )^2 ].

The inner conditional expectation is with respect to the conditional distribution of X given Y (and θ), while the outer expectation is over the conditioning random variable Y.

Proof.

By the regularity conditions above, for each y,

∂/∂θ_j log b(y|θ) = (1/b(y|θ)) ∫ q(y|x) ∂/∂θ_j f(x|θ) dμ(x) = E[ S_j(X, θ) | Y = y ].

Squaring both sides and taking the expectation over Y gives

E_Y[ ( ∂/∂θ_j log b(Y|θ) )^2 ] = E_Y[ ( E[ S_j(X, θ) | Y ] )^2 ],

where the left-hand side is by definition the jj-th entry of I_Y(θ). ∎

Lemma 2.

The trace of the Fisher information matrix I_Y(θ) can be written as

tr I_Y(θ) = E_Y[ ‖ E[ S(X, θ) | Y ] ‖_2^2 ].   (1)
Proof.

By Lemma 1,

tr I_Y(θ) = Σ_{j=1}^{d} E_Y[ ( E[ S_j(X, θ) | Y ] )^2 ] = E_Y[ ‖ E[ S(X, θ) | Y ] ‖_2^2 ]. ∎
In order to get some geometric intuition for the quantity (1), consider the special case where the quantization is deterministic and the score function x ↦ S(x, θ) is a bijection. In this case, the quantization map partitions the space X into 2^k disjoint quantization bins, and this induces a corresponding partitioning of the score function values S(x, θ). Each vector E[S(X, θ) | Y = y] is then the centroid of the set of score values corresponding to quantization bin y (with respect to the induced probability distribution on that bin). Lemma 2 shows that tr I_Y(θ) is equal to the average squared magnitude of these centroid vectors.
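As a quick numerical check of this identity (our own sketch; the quantizer thresholds are arbitrary), take X ∼ N(θ, 1), whose score is S(x, θ) = x − θ, and a deterministic 2-bit quantizer whose bins are split at −1, 0, and 1. The Monte Carlo average of the squared bin centroids matches tr I_Y(θ) computed directly from the quantized likelihood b(y|θ):

```python
import random
from statistics import NormalDist

std = NormalDist()
theta = 0.3
edges = [-1.0, 0.0, 1.0]  # a deterministic 2-bit quantizer: 4 bins

def bin_index(x):
    return sum(x > e for e in edges)  # bin label in {0, 1, 2, 3}

def exact_trace_IY(theta):
    """tr I_Y(theta) computed directly from the quantized likelihood b(y|theta)."""
    cuts = [float("-inf")] + edges + [float("inf")]
    total = 0.0
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        prob = std.cdf(hi - theta) - std.cdf(lo - theta)              # b(y|theta)
        dprob = (std.pdf(lo - theta) if lo != float("-inf") else 0.0) \
              - (std.pdf(hi - theta) if hi != float("inf") else 0.0)  # d/dtheta b(y|theta)
        total += dprob ** 2 / prob
    return total

def centroid_trace_IY(theta, n=200_000, seed=0):
    """Monte Carlo estimate of E_Y[ ||E[S(X,theta)|Y]||^2 ] with S(x, theta) = x - theta."""
    rng = random.Random(seed)
    sums, counts = [0.0] * 4, [0] * 4
    for _ in range(n):
        x = rng.gauss(theta, 1.0)
        y = bin_index(x)
        sums[y] += x - theta  # accumulate the score within each bin
        counts[y] += 1
    return sum((s / c) ** 2 * (c / n) for s, c in zip(sums, counts) if c > 0)

print("exact tr I_Y       :", round(exact_trace_IY(theta), 4))
print("centroid estimate  :", round(centroid_trace_IY(theta), 4))
print("unquantized I_X    : 1.0")
```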

II-A Upper Bounds on tr I_Y(θ)

In this section, we give two upper bounds on tr I_Y(θ). The proofs appear in Appendix A. The first theorem upper bounds tr I_Y(θ) in terms of the variance of the score S(X, θ) when projected onto any unit vector.

Theorem 1.

If for all θ and any unit vector u ∈ R^d,

Var( S(X, θ)^T u ) ≤ I_0,

then

tr I_Y(θ) ≤ min( 2^k, d ) · I_0.

The upper bound tr I_Y(θ) ≤ d · I_0 follows easily from the data processing inequality for Fisher information [11]. The theorem shows that when I_0 is finite, tr I_Y(θ) can increase at most exponentially in k.

Recall that for p ≥ 1, the ψ_p Orlicz norm of a random variable Z is defined as

‖Z‖_{ψ_p} = inf{ t > 0 : E[ ψ_p( |Z| / t ) ] ≤ 1 },

where

ψ_p(x) = e^{x^p} − 1.

A random variable with finite ψ_1 Orlicz norm is sub-exponential, while a random variable with finite ψ_2 Orlicz norm is sub-Gaussian [15]. Our second theorem shows that when the ψ_p Orlicz norm of the projection of S(X, θ) onto any unit vector is bounded by some finite constant, tr I_Y(θ) can increase at most polynomially in k.

Theorem 2.

If for all θ, some finite constant B, and any unit vector u ∈ R^d,

‖ S(X, θ)^T u ‖_{ψ_p} ≤ B,

then

where

II-B Applications to Common Statistical Models

We next apply the above two results to common statistical models. We will see that neither of these bounds is strictly stronger than the other; depending on the statistical model, one may yield a tighter bound than the other. The proofs of Corollaries 1 through 4 appear in Appendix B. In the next section, we show that Corollaries 1, 3, and 4 yield tight results for the minimax risk of the corresponding distributed estimation problems.

Corollary 1 (Gaussian location model).

Consider the Gaussian location model, where we are trying to estimate the mean θ of a d-dimensional Gaussian random vector with fixed covariance σ^2 I_d. In this case,

(2)

where

The above corollary follows by showing that the score function associated with this model has finite ψ_2 Orlicz norm and applying Theorem 2.
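For concreteness, a standard computation in the isotropic case assumed above (X ∼ N(θ, σ^2 I_d)) shows why the hypothesis of Theorem 2 holds with p = 2:

```latex
% Score of the Gaussian location model X ~ N(theta, sigma^2 I_d).
\[
  S(x, \theta) \;=\; \nabla_\theta \log f(x \mid \theta) \;=\; \frac{x - \theta}{\sigma^{2}}.
\]
% For any unit vector u the projection is itself Gaussian,
\[
  S(X, \theta)^{T} u \;\sim\; \mathcal{N}\!\bigl(0, \sigma^{-2}\bigr),
\]
% so its psi_2 (sub-Gaussian) Orlicz norm is finite and of order 1/sigma,
% which is exactly the condition needed to apply Theorem 2.
```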

Corollary 2 (variance of a Gaussian).

Suppose X is Gaussian with known mean and unknown variance, and the variance is the parameter we wish to estimate. In this case,

where

Corollary 2 similarly follows by showing that the score function associated with this model has finite ψ_1 Orlicz norm and applying Theorem 2.
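For concreteness, a standard computation in the scalar case (X ∼ N(μ, σ^2) with μ known and the parameter of interest taken to be σ^2) shows why the score is sub-exponential but not sub-Gaussian:

```latex
% Score with respect to theta = sigma^2 for X ~ N(mu, sigma^2), mu known.
\[
  S(x, \sigma^{2})
  \;=\; \frac{\partial}{\partial \sigma^{2}} \log f(x \mid \sigma^{2})
  \;=\; \frac{(x - \mu)^{2} - \sigma^{2}}{2\sigma^{4}}.
\]
% Since (X - mu)^2 / sigma^2 is chi-square with one degree of freedom, the score has
% an exponential (not Gaussian) right tail: its psi_1 Orlicz norm is finite while its
% psi_2 norm is not, which is why this model falls under the sub-exponential case.
```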

Corollary 3 (distribution estimation).

Suppose that X is drawn from a discrete distribution p with finite support, so that the probabilities of the individual support points parametrize the model.

Let the free probabilities (all but one, which is determined by the others) be the parameters of interest, and suppose they can vary over a prescribed interval. In this case,

(3)

Corollary 3 is a consequence of Theorem 1 along with a characterization of the variance of the score function associated with this model.

Corollary 4 (product Bernoulli model).

Consider the product Bernoulli model X = (X_1, …, X_d) with independent coordinates X_j ∼ Bern(p_j). If the p_j are bounded away from 0 and 1 by a constant c > 0, i.e. the model is relatively dense, then

for some constant that depends only on c. If instead the p_j may be as small as order 1/d, i.e. the model is sparse, then

With the product Bernoulli model, the j-th component of the score is

S_j(x) = x_j / p_j − (1 − x_j) / (1 − p_j),    with    Var( S_j(X) ) = 1 / ( p_j (1 − p_j) ),

so that when the p_j are of constant order, both the variance bound I_0 and the Orlicz norm bound B are O(1). In this case, Theorem 1 gives

while Theorem 2 gives

In this situation Theorem 2 gives the better bound. On the other hand, if the p_j are of order 1/d, then I_0 is of order d and the score components are no longer sub-Gaussian with an O(1) constant. In this case Theorem 1 gives

while Theorem 2 gives

In the sparse case the bound from Theorem 2 becomes vacuous, so only the bound from Theorem 1 is nontrivial. It is interesting that Theorem 2 is able to use the sub-Gaussian structure in the first case to yield a better bound, but in the second case, when the tail of the score function is essentially not sub-Gaussian, Theorem 1 yields the better bound.

III Distributed Parameter Estimation

In this section, we focus on distributed estimation of parameters of an underlying distribution under communication constraints. Estimation of discrete distributions occurs as a special case of this problem. In the next section, we extend the discussion to distributed estimation of non-parametric distributions. The main technical exercise involves the application of Theorems 1 and 2 to statistical estimation with multiple quantized samples where the quantization of different samples can be independent or dependent as dictated by the communication protocol.

III-A Problem Formulation

Let

X_1, X_2, …, X_n i.i.d. ∼ f(x|θ),

where θ ∈ Θ ⊆ R^d. We consider three different models for communicating each of these samples with k bits to a central processor that is interested in estimating the underlying parameter θ:

  • Independent communication protocol: under this model, each sample is independently quantized to k bits and communicated. Formally, each sample X_i, for i = 1, …, n, is encoded into a k-bit string Y_i by a possibly randomized quantization strategy, which can be expressed in terms of the conditional probabilities of Y_i given X_i.

  • Sequential communication protocol: under this model, we assume that the samples are communicated sequentially by broadcasting the communication to all nodes in the system, including the central processor. Therefore, the quantization of the i-th sample can depend on the previously transmitted quantized samples Y_1, …, Y_{i−1} corresponding to samples X_1, …, X_{i−1}, respectively. Formally, each sample X_i, for i = 1, …, n, is encoded into a k-bit string Y_i by a set of possibly randomized quantization strategies, one for each realization of the earlier messages, each of which can be expressed in terms of the corresponding conditional probabilities.

While these two models can be motivated by a distributed estimation scenario where the topology of the underlying network dictates the type of protocol to be used (see Figure 1(b)), they can also model the quantization and storage of samples arriving sequentially at a single node. For example, consider a scenario where a continuous stream of samples is captured sequentially and each sample is stored in digital memory by using k bits per sample. In the independent model, each sample would be quantized independently of the other samples (even though the quantization strategies for different samples can be different and jointly optimized ahead of time), while under the sequential model the quantization of each sample would depend on the information already stored in the memory of the system at the time of its arrival. This is illustrated in Figure 1(a).

We finally introduce a third communication protocol that allows nodes to communicate their samples to the central processor in a fully interactive manner while still limiting the number of bits used per sample to k. Under this model, each node can see the bits previously written on a public blackboard and can use that information to determine its quantization strategy for the bits it subsequently transmits. This is formally defined below.

  • Blackboard communication protocol: all nodes communicate via a publicly shown blackboard, while the total number of bits each node can write in the final transcript is limited to k. When one node writes a message (bit) on the blackboard, all other nodes can see the content of the message. Formally, a blackboard communication protocol can be viewed as a binary tree [17], where each internal node u of the tree is assigned a deterministic label indicating the identity of the node that writes the next bit on the blackboard if the protocol reaches tree node u; the left and right edges departing from u correspond to the two possible values of this bit and are labeled by 0 and 1, respectively. Because all bits written on the blackboard up to the current time are observed by all nodes, the nodes can keep track of the progress of the protocol in the binary tree. The value of the bit written (when the protocol is at node u of the binary tree) can depend on the sample observed by the designated node (and implicitly on all bits previously written on the blackboard, encoded in the position of u in the binary tree). Therefore, this bit can be represented by a function of the observed sample, which we associate with the tree node u; the designated node transmits 1 with probability equal to the value of this function at its sample, and 0 with the complementary probability. Note that a proper labeling of the binary tree together with the collection of these functions (where u ranges over all internal tree nodes) completely characterizes all possible (possibly probabilistic) communication strategies for the nodes. (A small structural sketch of such a protocol tree is given after this list.)

    The k-bit communication constraint for each node can be viewed as a labeling constraint for the binary tree: for each node, each possible path from the root to a leaf must visit exactly k internal tree nodes carrying that node's label. In particular, the depth of the binary tree is nk, and there is a one-to-one correspondence between all possible transcripts and paths in the tree. Note that there is also a one-to-one correspondence between the transcript and the k-bit messages transmitted by the nodes. In particular, the transcript contains the same amount of information as the collection of messages, since given the transcript (and the protocol) one can infer the individual messages and vice versa (for this direction, note that the protocol specifies which node transmits first, so given the messages one can deduce the path followed in the protocol tree).
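To make the tree description concrete, here is a toy sketch (our own, not a construction from the paper) of a blackboard protocol with n = 2 nodes and k = 1 bit each, in which the second node's threshold depends on the bit already written by the first node:

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# A toy rendering of the protocol-tree view of a blackboard protocol.
# Node labels and threshold functions below are arbitrary illustrative choices.

@dataclass
class ProtocolNode:
    writer: int                              # which node writes the next bit here
    prob_of_one: Callable[[float], float]    # P(bit = 1) given the writer's sample
    child0: Optional["ProtocolNode"] = None  # follow this edge when the bit is 0
    child1: Optional["ProtocolNode"] = None  # follow this edge when the bit is 1

def run_blackboard(root, samples, rng):
    """Walk the tree from the root; the transcript is the sequence of written bits."""
    transcript, node = [], root
    while node is not None:
        bit = 1 if rng.random() < node.prob_of_one(samples[node.writer]) else 0
        transcript.append(bit)
        node = node.child1 if bit else node.child0
    return transcript

if __name__ == "__main__":
    rng = random.Random(1)
    # n = 2 nodes, k = 1 bit each: node 0 writes first; node 1's quantizer differs
    # depending on node 0's bit because it sits at a different tree position.
    tree = ProtocolNode(
        writer=0, prob_of_one=lambda x: 1.0 if x > 0.0 else 0.0,
        child0=ProtocolNode(writer=1, prob_of_one=lambda x: 1.0 if x > -0.5 else 0.0),
        child1=ProtocolNode(writer=1, prob_of_one=lambda x: 1.0 if x > 0.5 else 0.0),
    )
    samples = [rng.gauss(0.3, 1.0) for _ in range(2)]
    print(run_blackboard(tree, samples, rng))  # e.g. [1, 0]
```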

(a) Storing a stream of samples

(b) Distributed communication of samples
Fig. 1: Two applications that require quantization of samples. The quantization strategy can be independent or sequential.

Under all three communication protocols above, the end goal is to produce an estimate of the underlying parameter θ from the nk-bit transcript, or equivalently the collection of k-bit messages observed by the estimator. Note that the encoding strategies/protocols used in each case can be jointly optimized and agreed upon by all parties ahead of time. Formally, we are interested in the following parameter estimation problem under squared risk:

inf_{Π, θ̂} sup_{θ ∈ Θ} E‖θ̂ − θ‖_2^2,

where θ̂ is an estimator of θ based on the quantized observations. Note that with an independent communication protocol the messages are independent, while this is no longer the case with the sequential and blackboard protocols.
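As a sanity check on this formulation (an illustration of ours, for the scalar Gaussian location model with k = 1 under the independent protocol), let each node send the sign of its sample and let the estimator invert the induced Bernoulli likelihood; the resulting squared error is close to 1/(n I_Y(θ)) with I_Y(θ) = φ(θ)^2/(Φ(θ)(1 − Φ(θ))), as the Fisher information framework predicts:

```python
import random
from statistics import NormalDist

std = NormalDist()

def simulate(theta=0.3, n=5000, trials=200, seed=0):
    """Independent protocol, k = 1: node i sends Y_i = 1{X_i > 0} with X_i ~ N(theta, 1)."""
    rng = random.Random(seed)
    mse = 0.0
    for _ in range(trials):
        ones = sum(rng.gauss(theta, 1.0) > 0 for _ in range(n))
        # Clamp away from 0 and 1 so the inverse CDF is defined, then invert P(Y=1) = Phi(theta).
        frac = min(max(ones / n, 1.0 / n), 1.0 - 1.0 / n)
        theta_hat = std.inv_cdf(frac)
        mse += (theta_hat - theta) ** 2 / trials
    return mse

theta, n = 0.3, 5000
info_one_bit = std.pdf(theta) ** 2 / (std.cdf(theta) * (1.0 - std.cdf(theta)))
print("empirical MSE         :", round(simulate(theta, n), 6))
print("1 / (n * I_Y(theta))  :", round(1.0 / (n * info_one_bit), 6))  # quantized benchmark
print("1 / (n * I_X(theta))  :", round(1.0 / n, 6))                   # unquantized benchmark
```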

III-B Main Results for Distributed Parameter Estimation

We next state our main results for distributed parameter estimation. We will show in the following subsection that these two theorems can be applied to obtain tight lower bounds for distributed estimation under many common statistical models, including discrete distribution estimation and the Gaussian location model.

Theorem 3.

Suppose θ ∈ Θ ⊆ R^d. For any estimator θ̂ and communication protocol of independent or sequential type, if the statistical model satisfies the hypotheses in Theorem 1, then

and if it satisfies the hypotheses in Theorem 2, then

where the constants are those appearing in Theorem 2.

With the blackboard communication protocol we have the following slightly more restrictive theorem, where we can prove the second bound only for the sub-Gaussian case.

Theorem 4.

Suppose θ ∈ Θ ⊆ R^d. For any estimator θ̂ and blackboard communication protocol, if the statistical model satisfies the hypotheses in Theorem 1, then

and if it satisfies the hypotheses in Theorem 2 with p = 2, then

We next show how Theorem 3 follows by a rather straightforward application of the van Trees inequality combined with the conclusions of Theorems 1 and 2. The proof of Theorem 4 is given in Appendix C.

Proof of Theorem 3:

We are interested in the Fisher information about θ provided by the full transcript, I_{Y_1, …, Y_n}(θ), under each model. We have

tr I_{Y_1, …, Y_n}(θ) = Σ_{i=1}^{n} tr I_{Y_i | Y_1, …, Y_{i−1}}(θ),   (4)

due to the chain rule for Fisher information. Under the independent model, each term in this sum is simply tr I_{Y_i}(θ), since Y_i is a quantization of X_i alone. Under the sequential model, conditioning on specific values of Y_1, …, Y_{i−1} only affects the distribution of Y_i by fixing the quantization strategy applied to X_i. Formally, for the sequential model, the i-th term equals the trace of the Fisher information of a fixed k-bit quantization of X_i, averaged over the realizations of the earlier messages; this follows since X_i is independent of Y_1, …, Y_{i−1} and therefore conditioning on them does not change the distribution of X_i. Since the bounds from Theorems 1 and 2 apply for any quantization strategy, they apply to each of the terms in (4), and the following statements hold under both quantization models:

  (i) Under the hypotheses in Theorem 1, tr I_{Y_1, …, Y_n}(θ) ≤ n · min( 2^k, d ) · I_0;

  (ii) Under the hypotheses in Theorem 2,

Consider the squared error risk in estimating θ, namely E‖θ̂ − θ‖_2^2.

In order to lower bound this risk, we will use the van Trees inequality from [16]. Suppose we have a prior π for the parameter. The van Trees inequality for the j-th component gives

(5)

where I(π) is the Fisher information from the prior. Note that the required regularity condition follows trivially, since the expectation over the transcript is just a finite sum. The prior can be chosen to minimize this Fisher information [14]. By summing over each component,

(6)
(7)

Therefore,

(8)

The inequality (7) follows from Jensen’s inequality via the convexity of x ↦ 1/x for x > 0, and the inequality (6) follows both from this convexity and (5). We could have equivalently used the multivariate version of the van Trees inequality [16] to arrive at the same result, but we have used the single-variable version in each coordinate instead in order to simplify the required regularity conditions.

Combining (8) with (i) and (ii) proves the theorem.
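In outline, and with the displays written in our own notation (a hedged reconstruction of the steps referenced as (4)-(8) above, with constants suppressed), the argument is:

```latex
% Chain rule for Fisher information over the transcript:
\[
  \operatorname{tr} I_{Y_1,\dots,Y_n}(\theta)
  \;=\; \sum_{i=1}^{n} \operatorname{tr} I_{Y_i \mid Y_1,\dots,Y_{i-1}}(\theta)
  \;\le\; n \sup_{k\text{-bit quantizers}} \operatorname{tr} I_{Y}(\theta).
\]
% Per-coordinate van Trees inequality, for a smooth prior \pi on \theta_j with
% prior Fisher information I(\pi), the expectation taken over the prior as well:
\[
  \mathbb{E}\bigl[(\hat{\theta}_j - \theta_j)^2\bigr]
  \;\ge\; \frac{1}{\mathbb{E}\bigl[I_{Y_1,\dots,Y_n}(\theta)_{jj}\bigr] + I(\pi)}.
\]
% Summing over j and using the convexity of x -> 1/x (Jensen):
\[
  \mathbb{E}\bigl[\lVert \hat{\theta} - \theta \rVert_2^2\bigr]
  \;\ge\; \frac{d^{2}}{\mathbb{E}\bigl[\operatorname{tr} I_{Y_1,\dots,Y_n}(\theta)\bigr] + d\, I(\pi)}.
\]
% Substituting the single-sample bounds of Theorems 1 and 2 into the chain-rule
% estimate then yields the two statements of Theorem 3.
```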

III-C Applications to Common Statistical Models

Using the bounds we developed in Section II-B, Theorems 3 and 4 give lower bounds on the minimax risk for the distributed estimation of θ under common statistical models. We summarize these results in the following corollaries:

Corollary 5 (Gaussian location model).

Let the samples be drawn from the Gaussian location model N(θ, σ^2 I_d). Under a mild condition on the sample size n (discussed below), we have

for any communication protocol of independent, sequential, or blackboard type and any estimator θ̂, where the leading constant is universal.

Note that the condition on n in the above corollary is a weak condition that ensures that we can ignore the second term in the denominator of (8). For fixed k, this condition is weaker than assuming that n is at least of the order required for consistent estimation anyway. We will make similar assumptions in the subsequent corollaries.

The corollary recovers the results in [6, 13] (without logarithmic factors in the risk) and the corresponding result from [3] without the extra condition imposed there. A blackboard communication protocol that matches this lower bound is given in [13].

Corollary 6 (variance of a Gaussian).

Let the samples be Gaussian with known mean and unknown variance, as in Corollary 2. Under a mild condition on the sample size n, we have

for any communication protocol of independent or sequential type and any estimator, where the leading constant is universal.

The bound in Corollary 6 is new, and it is unknown whether or not it is order-optimal.

Corollary 7 (distribution estimation).

Suppose that the samples are drawn from a discrete distribution p, as in Corollary 3.

Let the parameter space be the corresponding probability simplex. Under a mild condition on the sample size n, we have

for any communication protocol of independent, sequential, or blackboard type and any estimator, where the leading constant is universal.

This result recovers the corresponding result in [3] and matches the upper bound from the achievable scheme developed in [4] (when the performance of that scheme is evaluated under ℓ^2 loss rather than the loss originally considered in [4]).

Corollary 8 (product Bernoulli model).

Suppose the samples are drawn from the product Bernoulli model of Corollary 4. In the dense case, and under a mild condition on the sample size n, we have

for any communication protocol of independent, sequential, or blackboard type and any estimator, where the leading constant is universal.

In the sparse case, and under a similar condition on n, we get instead

The corollary recovers the corresponding result from [3] and matches the upper bound from the achievable scheme presented in the same paper.

IV Distributed Estimation of Non-Parametric Densities

Finally, we turn to the case where the samples X_1, …, X_n are drawn independently from some probability distribution on [0, 1] with density f with respect to the uniform measure. We will assume that f is (uniformly) Hölder continuous with exponent s and constant L, i.e. that

| f(x) − f(y) | ≤ L | x − y |^s

for all x, y ∈ [0, 1]. Let F be the space of all such densities. We are interested in characterizing the minimax risk

inf_{Π, f̂} sup_{f ∈ F} E‖ f̂ − f ‖_2^2,

where the estimators f̂ are functions of the transcript.

Theorem 5.

For any blackboard communication protocol and estimator f̂,

Moreover, this rate is achieved by an independent protocol so that