Rapid growth in the size of modern data sets has fueled considerable interest in solving statistical and machine learning tasks in a distributed environment using multiple machines. Communication between the machines has emerged as an important resource and sometimes the main bottleneck. Much recent work has been devoted to designing communication-efficient learning algorithms [DAW12, ZDW13, ZX15, KVW14, LBKW14, SSZ14, LSLT15].
In this paper we consider statistical estimation problems in the distributed setting, which can be formalized as follows. There is a family of distributions that is parameterized by . Each of the machines is given i.i.d. samples drawn from an unknown distribution . The machines communicate with each other by message passing, and do computation on their local samples and the messages that they receive from others. Finally, one of the machines needs to output an estimator , and the statistical error is usually measured by the mean-squared loss . We count the communication between the machines in bits.
This paper focuses on understanding the fundamental tradeoff between communication and the statistical error for high-dimensional statistical estimation problems. Modern large datasets are often equipped with a high-dimensional statistical model, while communication of high-dimensional vectors could potentially be expensive. It has been shown by Duchi et al. [DJWZ14] and Garg et al. [GMN14] that for the linear regression problem, the communication cost must scale with the dimensionality in order to achieve the optimal statistical minimax error. Not surprisingly, the machines have to communicate high-dimensional vectors in order to estimate high-dimensional parameters.
These negative results naturally lead to interest in high-dimensional estimation problems with additional sparse structure on the parameter . It is well understood that the statistical minimax error typically depends on the intrinsic dimension, that is, the sparsity of the parameters, instead of the ambient dimension (footnote 1: the dependency on the ambient dimension is typically logarithmic). Thus it is natural to expect that the same phenomenon also happens for communication.
However, this paper disproves this possibility in the interactive communication model by proving that for the sparse Gaussian mean estimation problem (where one estimates the mean of a Gaussian distribution which is promised to be sparse, see Section 2 for the formal definition), in order to achieve the statistical minimax error, the communication must scale with the ambient dimension. On the other end of the spectrum, if alternatively the communication only scales with the sparsity, then the statistical error must scale with the ambient dimension (see Theorem 4.5). Shamir [Sha14] establishes the same result for the 1-sparse case under a non-iterative communication model.
Our lower bounds for the Gaussian mean estimation problem imply lower bounds for the sparse linear regression problem (Corollary 4.8) via the reduction of [ZDJW13]: for a Gaussian design matrix, to achieve the statistical minimax error, the communication cost per machine needs to be where is the ambient dimension and is the dimension of the observation that each machine receives. This lower bound matches the upper bound in [LSLT15] when is larger than . When is less than , we note that it is not clear whether or should be the minimum communication cost per machine needed. In any case, our contribution here is in proving a lower bound that does not depend on the sparsity. Compared to previous work of Steinhardt and Duchi [SD15], which proves the same lower bounds for a memory-bounded model, our results work for a stronger communication model where multi-round iterative communication is allowed. Moreover, our techniques are possibly simpler and potentially easier to adapt to related problems. For example, we show that the result of Woodruff and Zhang [WZ12] on the information complexity of distributed gap majority can be reproduced by our technique with a cleaner proof (see Theorem C.1).
We complement our lower bounds for this problem in the dense case by providing a new simultaneous protocol, improving the number of rounds of the previous communication-optimal protocol from to (see Theorem 4.6). Our protocol is based on a certain combination of many bits from a few Gaussian samples, together with roundings (to a single bit) of the fractional parts of many Gaussian samples.
Our proof techniques are potentially useful for other questions along these lines. We first use a modification of the direct-sum result of [GMN14], which is tailored towards sparse problems, to reduce the estimation problem to a detection problem. Then we prove what we call a distributed data processing inequality for bounding from below the cost of the detection problem. The latter is the crux of our proofs. We elaborate more on it in the next subsection.
1.1 Distributed Data Processing Inequality
We consider the following distributed detection problem. As we will show in Section 4 (by a direct-sum theorem), it suffices to prove a tight lower bound in this setting, in order to prove a lower bound on the communication cost for the sparse linear regression problem.
Distributed detection problem: We have a family of distributions that consists of only two distributions , and the parameter space . To facilitate the use of tools from information theory, sometimes it is useful to introduce a prior over the parameter space. Let be a Bernoulli random variable with probability of being . Given , we draw i.i.d. samples from and the -th machine receives one sample , for . We use to denote the sequence of messages that are communicated by the machines. We will refer to as a “transcript”, and the distributed algorithm that the machines execute as a “protocol”.
The final goal of the machines is to output an estimator for the hidden parameter which is as accurate as possible. We formalize the estimator as a (random) function that takes the transcript as input. We require that given , the estimator is correct with probability at least , that is, . When , this is essentially equivalent to the statement that the transcript carries information about the random variable . Therefore, the mutual information is also used as a convenient measure for the quality of the protocol when .
Strong data processing inequality: The mutual information viewpoint of the accuracy naturally leads us to the following approach for studying the simple case when and . When , we note that the parameter , data , and transcript
form a simple Markov chain. The channel is defined as , conditioned on . The strong data processing inequality (SDPI) captures the relative ratio between and .
Definition 1 (Special case of SDPI).
Let and the channel be defined as above. Then there exists a constant that depends on and , such that for any that depends only on (that is, forms a Markov Chain), we have
An inequality of this type is typically referred to as a strong data processing inequality for mutual information when (footnote 2: inequality (1) is always true for a Markov chain with , and this is called the data processing inequality). Let be the infimum over all possible such that (1) is true, which we refer to as the SDPI constant.
Observe that the LHS of (1) measures how much information carries about , which is closely related to the accuracy of the protocol. The RHS of (1) is a lower bound on the expected length of , that is, the expected communication cost. Therefore the inequality relates the two quantities that we are interested in: the statistical quality of the protocol and the communication cost of the protocol. Concretely, when , in order to recover from , we need that , and therefore inequality (1) gives that . Then it follows from Shannon’s source coding theorem that the expected length of (denoted by ) is bounded from below by . We refer to [Rag14] for a thorough survey of SDPI. (Footnote 3: in information theory, SDPI is typically interpreted as characterizing how information decays when passed through the reverse channel . That is, when the channel is lossy, information about will decay by a factor of after passing through the channel. However, in this paper we take a different interpretation that is more convenient for our applications.)
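As a hedged numerical illustration of an inequality of the form (1), the sketch below uses a toy case that is not one of the Gaussian channels studied in this paper. In our own notation, a uniform bit V is passed through a binary symmetric channel with flip probability `delta` to produce X, and the transcript Pi is drawn from an arbitrary channel applied to X. For this toy pair the SDPI constant is known to be (1 - 2·delta)^2. All function names and the random test channels are assumptions made for illustration.

```python
import math
import random


def mutual_information(joint):
    """Mutual information (in bits) of a joint pmf given as joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(row[b] for row in joint) for b in range(len(joint[0]))]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                mi += p * math.log2(p / (pa[a] * pb[b]))
    return mi


def information_pair(delta, channel):
    """Return (I(V;Pi), I(X;Pi)) for the Markov chain V -> X -> Pi.

    V is a uniform bit, X equals V flipped with probability delta,
    and channel[x][t] = P(Pi = t | X = x).
    """
    k = len(channel[0])
    p_x_given_v = [[1 - delta, delta], [delta, 1 - delta]]
    # X is uniform because V is uniform and the flip channel is symmetric.
    joint_x_pi = [[0.5 * channel[x][t] for t in range(k)] for x in range(2)]
    joint_v_pi = [
        [0.5 * sum(p_x_given_v[v][x] * channel[x][t] for x in range(2))
         for t in range(k)]
        for v in range(2)
    ]
    return mutual_information(joint_v_pi), mutual_information(joint_x_pi)


def random_channel(rng, k=3):
    """Hypothetical helper: a random channel from one bit to k symbols."""
    rows = []
    for _ in range(2):
        weights = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows
```

Calling `information_pair(0.2, random_channel(random.Random(0)))` returns the two mutual informations, and their ratio should not exceed (1 - 2·0.2)^2 = 0.36 for any choice of channel.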
In the multiple machine setting, Duchi et al. [DJWZ14] link the distributed detection problem with SDPI by showing from scratch that for any , when , if is such that , then
This results in the bounds for the Gaussian mean estimation problem and the linear regression problem. The main limitation of this inequality is that it requires the prior to be unbiased (or close to unbiased). For our target application of high-dimensional problems with sparsity structures, like sparse linear regression, in order to apply this inequality we need to put a very biased prior on . The proof technique of [DJWZ14] also seems hard to extend to this case with a tight bound (footnote 4: we note, though, that it seems possible to extend the proof to the situation where there is only one round of communication). Moreover, the relation between , and may not be necessary (or optimal), and indeed for the Gaussian mean estimation problem, the inequality is only tight up to a logarithmic factor, while potentially in other situations the gap is even larger.
Our approach is essentially a prior-free multi-machine SDPI, which has the same SDPI constant as is required for the single machine one. We prove that, as long as the SDPI (1) for a single machine is true with parameter , and , then the following prior-free multi-machine SDPI is true with the same constant (up to a constant factor).
Theorem 1.1 (Distributed SDPI).
Suppose for some constant , and let be the SDPI constant defined in Definition 1. Then in the distributed detection problem, we have the following distributed strong data processing inequality,
where is a universal constant, and is the Hellinger distance between two distributions and denotes the distribution of conditioned on .
Moreover, for any and which satisfy the condition of the theorem, there exists a protocol that produces transcript such that (2) is tight up to a constant factor.
As an immediate consequence, we obtain a lower bound on the communication cost for the distributed detection problem.
Suppose the protocol and estimator are such that for any , given , the estimator (that takes as input) can recover with probability . Then
Our theorem suggests that to bound the communication cost of the multi-machine setting from below, one could simply work in the single machine setting and obtain the right SDPI constant . Then, a lower bound of for the multi-machine setting immediately follows. In other words, multiple machines need to communicate a lot to fully exploit the data points they receive ( on each single machine) regardless of how complicated their multi-round protocol is.
Note that our inequality differs from the typical data processing inequality on both the left and right hand sides. First of all, the RHS of (2) is always less than or equal to for any prior on . This allows us to have a tight bound on the expected communication for the case when is very small.
Second, the squared Hellinger distance (see Definition 4) on the LHS of (2) is not very far away from , especially for the situation that we consider. It can be viewed as an alternative (if not more convenient) measure of the quality of the protocol than mutual information – the further from , the easier it is to infer from . When a good estimator is possible (which is the case that we are going to apply the bound in), Hellinger distance, total variation distance between and , and are all . Therefore in this case, the Hellinger distance does not make the bound weaker.
The tightness of our inequality does not imply that there is a protocol that solves the distributed detection problem with communication cost (or information cost) . We only show that inequality (2) is tight for some protocol but solving the problem requires having a protocol such that (2) is tight and that . In fact, a protocol for which inequality (2) is tight is one in which only a single machine sends a message which maximizes .
Organization of the paper: Section 2 formally sets up our model and problems and introduces some preliminaries. Then we prove our main theorem in Section 3. In Section 4 we state the main applications of our theory to the sparse Gaussian mean estimation problem and to the sparse linear regression problem. The next three sections are devoted to the proofs of results in Section 4. In Section 5, we prove Theorem 4.4 and in Section A we prove Theorem 4.3 and Corollary 4.8. In Section 6 we provide tools for proving single machine strong data processing inequalities and prove Theorem 4.1. In Section B we present our matching upper bound in the simultaneous communication model. In Section C we give a simple proof for the distributed gap majority problem using our machinery.
2 Problem Setup, Notations and Preliminaries
2.1 Distributed Protocols and Parameter Estimation Problems
Let be a family of distributions over some space , and be the space of all possible parameters. There is an unknown distribution , and our goal is to estimate a parameter using machines. Machine receives i.i.d. samples from distribution . For simplicity we will use as a shorthand for all the samples machine receives, that is, . Therefore , where denotes the product of copies of . When it is clear from context, we will use as a shorthand for . We define the problem of estimating parameter in this distributed setting formally as task . When , we call this a detection problem and refer to it as .
The machines communicate via a publicly shown blackboard. That is, when a machine writes a message on the blackboard, all other machines can see the content. The messages that are written on the blackboard are counted as communication between the machines. Note that this model captures both point-to-point communication as well as broadcast communication. Therefore, our lower bounds in this model apply to both the message passing setting and the broadcast setting.
We denote the collection of all the messages written on the blackboard by . We will refer to as the transcript and note that is written in bits and the communication cost is defined as the length of , denoted by . We will call the algorithm that the machines follow to produce a protocol. With a slight abuse of notation, we use to denote both the protocol and the transcript produced by the protocol.
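The blackboard model described above can be pictured with a minimal sketch. The class and method names here are hypothetical and chosen only for illustration: every message is a bit string that all machines can read, the transcript is the concatenation of all messages, and the communication cost is its length in bits.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared blackboard: every written message is visible to all machines."""

    messages: list = field(default_factory=list)

    def write(self, machine, bits):
        """Record a bit-string message written by the given machine."""
        if any(b not in "01" for b in bits):
            raise ValueError("messages must be strings of 0s and 1s")
        self.messages.append((machine, bits))

    def transcript(self):
        """The transcript is the concatenation of all messages so far."""
        return "".join(bits for _, bits in self.messages)

    def cost(self):
        """Communication cost: the length of the transcript in bits."""
        return len(self.transcript())
```

A protocol in this picture is any rule that decides, from a machine's local samples and the current blackboard contents, what that machine writes next.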
One of the machines needs to estimate the value of using an estimator which takes as input. The accuracy of the estimator on is measured by the mean-squared loss:
where the expectation is taken over the randomness of the data , and the estimator . The error of the estimator is the supremum of the loss over all ,
The communication cost of a protocol is measured by the expected length of the transcript , that is, . The information cost IC of a protocol is defined as the mutual information between transcript and the data ,
where denotes the public coin used by the algorithm and denotes the mutual information between random variable and when the data is drawn from distribution . We will drop the subscript when it is clear from context.
For the detection problem, we need to define minimum information cost, a stronger version of information cost
We say that a protocol and estimator pair solves the distributed estimation problem with information cost , communication cost , and mean-squared loss if , and .
When , we have a detection problem, and we typically use to denote the parameter and as the (discrete) estimator for it. We define the communication and information cost the same as (2.1) and (4), while defining the error in a more meaningful and convenient way,
We say that a protocol and estimator pair solves the distributed detection problem with information cost , if , .
Now we formally define the concrete questions that we are concerned with.
Distributed Gaussian detection problem: We call the problem with and the Gaussian mean detection problem, denoted by .
Distributed (sparse) Gaussian mean estimation problem: The distributed statistical estimation problem defined by and is called the distributed Gaussian mean estimation problem, abbreviated . When , the corresponding problem is referred to as distributed sparse Gaussian mean estimation, abbreviated .
Distributed sparse linear regression: For simplicity and the purpose of lower bounds, we only consider sparse linear regression with a random design matrix. To fit into our framework, we can also regard the design matrix as part of the data. We have a parameter space . The -th data point consists of a row of design matrix and the observation where for , and each machine receives data points among them (footnote 5: here, for convenience, we use subscripts for samples, which differs from the notation convention used for previous problems). Formally, let
denote the joint distribution ofhere, and let . We use as shorthand for this problem.
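To make the data model concrete, here is a hedged sketch of how synthetic data for distributed sparse linear regression could be generated. It assumes a standard Gaussian design (rows drawn i.i.d. from a standard normal in each coordinate), additive Gaussian noise with standard deviation `sigma`, and nonzero parameter entries drawn uniformly from [-1, 1]. The last choice and all names are hypothetical and not taken from the paper.

```python
import random


def make_sparse_regression(m, n, d, k, sigma, rng):
    """Return (theta, shards) for m machines with n data points each.

    theta is a d-dimensional vector with at most k nonzero entries.
    shards[j] = (A_j, y_j), where A_j is an n-by-d design matrix and
    y_j = A_j theta + noise is the observation vector of machine j.
    """
    theta = [0.0] * d
    for i in rng.sample(range(d), k):
        # Hypothetical choice of nonzero values, for illustration only.
        theta[i] = rng.uniform(-1.0, 1.0)
    shards = []
    for _ in range(m):
        design = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
        obs = [
            sum(a * t for a, t in zip(row, theta)) + rng.gauss(0.0, sigma)
            for row in design
        ]
        shards.append((design, obs))
    return theta, shards
```

Each machine sees only its own shard, so any estimator of the parameter has to be assembled from what the machines write to the blackboard.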
2.2 Hellinger distance and cut-paste property
In this subsection, we introduce Hellinger distance, and the key property of protocols that we exploit here, the so-called “cut-paste” property developed by [BYJKS04] for proving lower bounds for set-disjointness and other problems. We also introduce some notation that will be used later in the proofs.
Definition 4 (Hellinger distance).
Consider two distributions with probability density functions. The square of the Hellinger distance between and is defined as
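For intuition, the following sketch evaluates the squared Hellinger distance numerically. It assumes the common normalization h² = ½ ∫ (√f − √g)² = 1 − ∫ √(fg) for densities f and g (normalizations vary between texts, so this is an assumption), and compares the result with the standard closed form 1 − exp(−δ²/(8σ²)) for two Gaussians with means 0 and δ and common variance σ². The function names are our own.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a one-dimensional Gaussian with the given mean and std."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def hellinger_sq_numeric(f, g, lo, hi, steps=20000):
    """Midpoint-rule approximation of 0.5 * integral of (sqrt f - sqrt g)^2."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        diff = math.sqrt(f(x)) - math.sqrt(g(x))
        total += diff * diff
    return 0.5 * total * dx


def hellinger_sq_gaussian(delta, sigma):
    """Closed form for N(0, sigma^2) versus N(delta, sigma^2)."""
    return 1.0 - math.exp(-delta * delta / (8.0 * sigma * sigma))
```

With this normalization the squared distance lies in [0, 1], equals 0 only for identical distributions, and grows as the two means separate.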
A key observation regarding the property of a protocol, due to [BYJKS04, Lemma 16], is the following: fixing , the distribution of can be factored in the following form,
where is a function that only depends on and the entire transcript . To see this, one could simply write the density of as a product of the densities of the individual messages of the machines and group the terms according to machines (and note that is allowed to depend on the entire transcript ).
We extend equation (6) to the situation where the inputs are from product distributions. For any vector , let be a distribution over . We denote by the distribution of when .
Therefore if , using the fact that is a product measure, we can marginalize over and obtain the marginal distribution of when ,
where is the marginalization of over , that is, .
Let denote the distribution of when . Then by the decomposition (7) of above, we have the following cut-paste property for which will be the key property of a protocol that we exploit.
Proposition 2.1 (Cut-paste property of a protocol).
For any and with (in a multi-set sense) for every ,
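The cut-paste property can be checked numerically in its simplest form: two machines with fixed one-bit inputs, a special case of the product-distribution statement. The sketch below builds a hypothetical three-message randomized protocol (machine 1 speaks, machine 2 replies, machine 1 speaks again), computes the exact transcript distribution for each input pair, and compares the squared Hellinger distance between inputs (0,0) and (1,1) with that between (0,1) and (1,0). The protocol probabilities are arbitrary choices of ours, and the normalization h² = 1 − Σ √(pq) is assumed.

```python
import itertools
import math


def bernoulli(p, bit):
    """Probability that a Bernoulli(p) variable equals bit."""
    return p if bit == 1 else 1.0 - p


def transcript_dist(x1, x2):
    """Exact transcript distribution of a toy 3-message protocol.

    Message 1 depends on x1; message 2 depends on x2 and message 1;
    message 3 depends on x1 and both earlier messages.
    """
    dist = {}
    for m1, m2, m3 in itertools.product((0, 1), repeat=3):
        p1 = bernoulli(0.2 + 0.6 * x1, m1)
        p2 = bernoulli((0.3 + 0.4 * x2) if m1 else (0.7 - 0.5 * x2), m2)
        p3 = bernoulli(0.1 + 0.5 * x1 * m2 + 0.3 * m1 * (1 - x1), m3)
        dist[(m1, m2, m3)] = p1 * p2 * p3
    return dist


def hellinger_sq(p, q):
    """Squared Hellinger distance between pmfs on the same support."""
    return 1.0 - sum(math.sqrt(p[t] * q[t]) for t in p)
```

The two distances agree because each transcript probability factors into a term depending only on machine 1's input and a term depending only on machine 2's input, so swapping inputs between the two pairs leaves every product √(pq) unchanged.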
3 Distributed Strong Data Processing Inequalities
In this section we prove our main Theorem 1.1. We state a slightly weaker looking version here but in fact it implies Theorem 1.1 by symmetry. The same proof also goes through for the case when the RHS is conditioned on .
Suppose , and , we have
where is an absolute constant.
Note that the RHS of (10) naturally tensorizes (by Lemma 1 that appears below) in the sense that
since conditioned on , the ’s are independent. Our main idea consists of the following two steps: a) we tensorize the LHS of (10) so that the target inequality (10) can be written as a sum of inequalities; b) we prove each of these inequalities using the single machine SDPI.
To this end, we do the following thought experiment: Suppose is a random variable that takes value from uniformly. Suppose data is generated as follows: , and for any , . We apply the protocol on the input , and view the resulting transcript as communication between the -th machine and the remaining machines. Then we are in the situation of the single machine case, that is, forms a Markov Chain. Applying the strong data processing inequality (1), we obtain that
Let be the unit vector that only takes 1 in the th entry, and the all zero vector. Using the notation defined in Section 2.2, we observe that has distribution while has distribution . Then we can rewrite the equation above as
Observe that the RHS of (13) is close to the first entry of the LHS of (11) since the joint distribution of is not very far from . (The only difference is that is drawn from a mixture of and , and note that is not too far from .) On the other hand, the sum of the LHS of (13) over is lower-bounded by the LHS of (10). Therefore, we can tensorize equation (10) into inequality (13), which can be proved by the single machine SDPI. We formalize the intuition above by the following two lemmas.
Suppose , and , then
Let be the -dimensional all 0’s vector, and the all 1’s vector, we have that
Proof of Lemma 1.
Let be a uniform Bernoulli random variable and define and as follows: Conditioned on , and conditioned on , . We run protocol on and get transcript .
It is known that mutual information can be expressed as the expectation of KL divergence, which in turn is lower-bounded by Hellinger distance. We invoke a technical variant of this argument, Lemma 6.2 of [BYJKS04], restated as Lemma 10, to lower bound the right hand side. Note that in Lemma 10 corresponds to here and corresponds to and . Therefore,
It remains to relate to . Note that the difference between joint distributions of and is that and . We claim (by Lemma 11) that since , we have
4 Applications to Parameter Estimation Problems
4.1 Warm-up: Distributed Gaussian mean detection
In this section we apply our main technical Theorem 3.1 to the situation when and . We are also interested in the case when each machine receives samples from either or . We will denote the product of i.i.d. copies of by , for .
Theorem 3.1 requires that a) can be calculated or estimated, and b) the densities of distributions and are within a constant factor of each other at every point.
Certainly b) is not true for any two Gaussian distributions. To this end, we consider , the truncation of and on some support , and argue that the probability mass outside is too small to make a difference.
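The effect of truncation on the density ratio can be seen in a small sketch. It assumes two one-dimensional Gaussians with means 0 and `delta` and common standard deviation `sigma`, both truncated to an interval `[lo, hi]`. The interval is a free parameter here, since the exact support used in the paper is not reproduced in this sketch. Because the log-likelihood ratio of the two densities is linear in x, the worst-case ratio over the interval is attained at an endpoint.

```python
from statistics import NormalDist


def max_density_ratio(delta, sigma, lo, hi):
    """Largest ratio (in either direction) between the densities of
    N(0, sigma^2) and N(delta, sigma^2) after truncation to [lo, hi]."""
    p0, p1 = NormalDist(0.0, sigma), NormalDist(delta, sigma)
    # Probability mass each Gaussian keeps after truncation.
    z0 = p0.cdf(hi) - p0.cdf(lo)
    z1 = p1.cdf(hi) - p1.cdf(lo)
    worst = 1.0
    for x in (lo, hi):  # the ratio is monotone in x, so check endpoints
        ratio = (p1.pdf(x) / z1) / (p0.pdf(x) / z0)
        worst = max(worst, ratio, 1.0 / ratio)
    return worst
```

For example, `max_density_ratio(0.1, 1.0, -3.0, 3.1)` stays close to 1, while widening the interval makes the ratio grow without bound, which is why condition b) fails for untruncated Gaussians.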
For a), we use tools provided by Raginsky [Rag14] to estimate the SDPI constant . [Rag14] proves that Gaussian distributions and have SDPI constant , and more generally it connects the SDPI constants to transportation inequalities. We use the framework established by [Rag14] and apply it to the truncated Gaussian distributions and . Our proof essentially uses the fact that is a log-concave distribution and therefore it satisfies the log-Sobolev inequality, and equivalently it also satisfies the transportation inequality. The details and connections to concentration of measure are provided in Section 6.3.
Let and be the distributions obtained by truncating and on support for some . If , we have
As a corollary, the SDPI constant between copies of and is bounded by .
Let and be the distributions over that are obtained by truncating and outside the ball . Then when , we have
Applying our distributed data processing inequality (Theorem 3.1) on and , we obtain directly that to distinguish and in the distributed setting, communication is required. By properly handling the truncation of the support, we can prove that it is also true with the true Gaussian distribution.
Any protocol estimator pair that solves the distributed Gaussian mean detection problem with requires communication cost and minimum information cost at least,
The condition captures the interesting regime. When , a single machine can even distinguish and by its local samples.
Proof of Theorem 4.3.
We pick a threshold , and let . Let denote the event that , and otherwise . Note that and therefore even if we condition on the event that , the protocol estimator pair should still be able to recover with good probability in the sense that
We run our whole argument conditioning on the event . First note that for any Markov chain , and any random variable that only depends on , the chain is also a Markov Chain. Second, the channel from to satisfies that random variable has the distribution as defined in the statement of Corollary 4.2. Note that by Corollary 4.2, we have that . Also note that by the choice of and the fact that , we have that for any , .
Therefore we are ready to apply Theorem 3.1 and conclude that
Note that is independent of conditioned on and . Therefore we have that
Note that by construction, it is also true that , and therefore if we switch the position of and run the argument above we will have
Hence the proof is complete.
4.2 Sparse Gaussian mean estimation
In this subsection, we prove our lower bound for the sparse Gaussian mean estimation problem via a variant of the direct-sum theorem of [GMN14] tailored towards sparse mean estimation.
Our general idea is to make the following reduction argument: Given a protocol for -dimensional -sparse estimation problem with information cost and loss , we can construct a protocol for the detection problem with information cost roughly and loss . The protocol embeds the detection problem into one random coordinate of the -dimensional problem, prepares fake data on the remaining coordinates, and then runs the protocol on the high dimensional problem. It then extracts information about the true data from the corresponding coordinate of the high-dimensional estimator.
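The embedding step of this reduction can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's Protocol 1: the fake coordinates are filled with zero-mean Gaussian noise, the high-dimensional protocol is replaced by a stand-in estimator passed in as a function, and the information-cost accounting that the actual argument depends on is ignored. All names, and the threshold `delta / 2`, are hypothetical.

```python
import random


def coordinate_mean(data):
    """Stand-in d-dimensional estimator: average each coordinate.

    It is centralized and not communication-limited; it only plays the
    role of the high-dimensional protocol in this sketch.
    """
    m, d = len(data), len(data[0])
    return [sum(row[k] for row in data) / m for k in range(d)]


def detect_by_embedding(samples, d, sigma, delta, estimator, rng):
    """Decide whether scalar samples have mean 0 or delta.

    samples holds one real number per machine. The true data is planted
    in one random coordinate and fake N(0, sigma^2) data fills the rest.
    """
    j = rng.randrange(d)  # random coordinate that carries the true data
    data = []
    for x in samples:
        row = [rng.gauss(0.0, sigma) for _ in range(d)]  # fake coordinates
        row[j] = x
        data.append(row)
    theta_hat = estimator(data)
    # Read the answer off the planted coordinate of the estimate.
    return 1 if theta_hat[j] > delta / 2.0 else 0
```

Because the planted coordinate is chosen at random, a high-dimensional protocol with small total error must, on average, be accurate on that coordinate, which is the intuition that the direct-sum argument makes rigorous.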
The key distinction from the construction of [GMN14] is that here we are not able to show that has small information cost, but only able to show that has a small minimum information cost (footnote 7: this might be inevitable because protocol might reveal a lot of information for the nonzero coordinate of , but since there are very few non-zeros, the total information revealed is still not too much). This is the reason why in Theorem 4.3 we needed to bound the minimum information cost instead of the information cost.
To formalize the intuition, let define the detection problem. Let and . Therefore is a special case of the general -sparse high-dimensional problem. We have that
Theorem 4.4 (Direct-sum for sparse parameters).
Suppose . Any protocol estimator pair that solves the -sparse Gaussian mean problem with mean-squared loss and information cost and communication cost satisfies
Intuitively, to parse equation (20), we remark that the term comes from the fact that any local machine can achieve this error using only its local samples, and the term is the minimax error that the machines can achieve with an infinite amount of communication. When the target error is between these two quantities, equation (20) predicts that the minimum communication should scale inversely with the error .
Our theorem gives a tight tradeoff between and up to a logarithmic factor, since it is known [GMN14] that for any communication budget , there exists a protocol which uses bits and has error .
Proof of Theorem 4.5.
If then we are done. Otherwise, let . Let and and . Let . Then is just a special case of sparse Gaussian mean estimation problem , and is the distributed Gaussian mean detection problem . Therefore, by Theorem 4.4, there exists that solves with minimum information cost . Since , by Theorem 4.3 we have that . It follows that . To derive (20), we observe that is the minimax lower bound for , which completes the proof. ∎
To complement our lower bounds, we also give a new protocol for the Gaussian mean estimation problem achieving communication optimal up to a constant factor in any number of dimensions in the dense case. Our protocol is a simultaneous protocol, whereas the only previous protocol achieving optimal communication requires rounds [GMN14]. This resolves an open question in Remark 2 of [GMN14], improving on the trivial protocol in which each player sends its truncated Gaussian to the coordinator by an factor.
For any , there exists a protocol that uses one round of communication for the Gaussian mean estimation problem with communication cost and mean-squared loss .
The protocol and proof of this theorem are deferred to Section B, though we mention a few aspects here. We first give a protocol under the assumption that . The protocol trivially generalizes to dimensions so we focus on dimension. The protocol coincides with the first round of the multi-round protocol in [GMN14], yet we can extract all necessary information in only one round, by having each machine send a single bit indicating if its input Gaussian is positive or negative. Since the mean is on the same order as the standard deviation, one can bound the variance and give an estimator based on the Gaussian density function. In Section B.1 the mean of the Gaussian is allowed to be much larger than the variance, and this no longer works. Instead, a few machines send their truncated inputs so the coordinator learns a crude approximation. To refine this approximation, in parallel the remaining machines each send a bit which is with probability , where is the machine’s input Gaussian. This can be viewed as rounding a sample of the “sawtooth wave function” applied to a Gaussian. For technical reasons each machine needs to send two bits, another which is with probability . We give an estimator based on an analysis using the Fourier series of .
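The first, one-bit stage of such a simultaneous protocol can be illustrated with a short simulation. It is a sketch under our own assumptions: one dimension, one sample per machine, known standard deviation `sigma`, and a mean of the same order as `sigma`. The estimator shown inverts the Gaussian CDF at the fraction of positive bits, using the fact that a sample is positive with probability Φ(mean/σ). This is one natural choice and is not claimed to be the exact estimator of Section B.

```python
import random
from statistics import NormalDist


def sign_bits(samples):
    """Each machine sends one bit: 1 if its sample is positive, else 0."""
    return [1 if x > 0 else 0 for x in samples]


def estimate_mean_from_bits(bits, sigma):
    """Coordinator's estimate from the sign bits alone.

    Inverts P(sample > 0) = Phi(mean / sigma) at the observed fraction.
    """
    m = len(bits)
    frac = sum(bits) / m
    # Clip away from 0 and 1 so that the inverse CDF stays finite.
    frac = min(max(frac, 0.5 / m), 1.0 - 0.5 / m)
    return sigma * NormalDist().inv_cdf(frac)
```

The total communication is one bit per machine. When the mean is much larger than the standard deviation almost all bits agree and the estimate degrades, which matches the remark above that a different mechanism is needed in that regime.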
Sparse Gaussian estimation with signal strength lower bound
Our techniques can also be used to study the optimal rate-communication tradeoffs in the presence of a strong signal in the non-zero coordinates, which is sometimes assumed for sparse signals. That is, suppose the machines are promised that the mean is -sparse and also if , then , where is a parameter called the signal strength. We get tight lower bounds for this case as well.
For and , any protocol estimator pair that solves the -sparse Gaussian mean problem with signal strength and mean-squared loss requires information cost (and hence expected communication cost) at least .
Note that there is a protocol for with signal strength and mean-squared loss that has communication cost . In the regime where , the first term dominates, and by Theorem 4.7 and the fact that is a lower bound even when the machines know the support [GMN14], we also get a matching lower bound. In the regime where , the second term dominates and it is a lower bound by Theorem 4.5.
Proof of Theorem 4.7.
The proof is very similar to the proof of Theorem 4.4. Given a protocol estimator pair that solves with signal strength , mean-squared loss and information cost (where ), we can find a protocol that solves the Gaussian mean detection problem with information cost (as usual the information cost is measured when the mean is ). would be exactly the same as Protocol 1 but with replaced by , replaced by and replaced by . We leave the details to the reader. ∎
4.3 Lower bound for Sparse Linear Regression
In this section we consider the sparse linear regression problem in the distributed setting as defined in Section 2. Suppose the -th machine receives a subset of the data points, and we use to denote the design matrix that the -th machine receives and to denote the observed vector. That is, where is Gaussian noise.
This problem can be reduced from the sparse Gaussian mean problem, and thus its communication can be lower-bounded. It follows straightforwardly from our Theorem 4.5 and the reduction in Corollary 2 of [DJWZ14]. To state our result, we assume that the design matrices have uniformly bounded spectral norm . That is,
Suppose machines receive data from the sparse linear regression model. Let be as defined above. If there exists a protocol under which the machines can output an estimator with mean squared loss with communication , then .
When is a Gaussian design matrix, that is, the rows of are drawn i.i.d. from distribution , we have and Corollary 4.8 implies that to achieve the statistical minimax rate , the algorithm has to communicate bits. The point is that we get a lower bound that does not depend on ; that is, with sparsity assumptions, it is impossible to improve both the loss and communication so that they depend on the intrinsic dimension instead of the ambient dimension . Moreover, in the regime when for a constant , our lower bound matches the upper bound of [LSLT15] up to a logarithmic factor. The proof follows straightforwardly from Theorem 4.5 and the reduction from Gaussian mean estimation to sparse linear regression of [ZDJW13], and is deferred to Section A.