1 Introduction
Classic statistical theory studies the difficulty of estimation under various models, and attempts to find the optimal estimation procedures. Such studies usually assume that all of the collected data are available to construct the estimators. In this paper, we study the problem of statistical estimation with data residing at multiple machines. Estimation in distributed settings is becoming common in modern data analysis tasks, as the data can be collected or stored at different locations. In order to obtain an estimate of some statistical functional, information needs to be gathered and aggregated from the multiple locations to form the final estimate. However, the communication between machines may be limited. For instance, there may be a communication budget that limits how much information can be transmitted. In this setting, it is important to understand how the statistical risk of estimation degrades as the communication budget becomes more limited.
A similar problem, called the CEO problem, was first studied in the electrical engineering community from a rate-distortion theory perspective (Berger et al., 1996; Viswanathan & Berger, 1997). More recently, several studies have focused on more specific statistical tasks and models; see, for example, Zhang et al. (2013a); Shamir (2014); Battey et al. (2015); Braverman et al. (2016); Diakonikolas et al. (2017); Fan et al. (2017); Lee et al. (2017), treating mean estimation, regression, principal eigenspace estimation, discrete density estimation and other problems. Most of this existing research focuses on parametric and discrete models, where the parameter of interest has a finite dimension. While there are also studies of nonparametric problems and models (Zhang et al., 2013b; Blanchard & Mücke, 2016; Chang et al., 2017; Shang & Cheng, 2017), the fundamental limits of distributed nonparametric estimation are still underexplored.
In this paper, we consider a fundamental nonparametric estimation task: estimating a smooth function in the white noise model. We assume observation of the random process
$$dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW(t), \qquad 0 \le t \le 1, \tag{1.1}$$
where $1/\sqrt{n}$ is the noise level, $W(t)$ is a standard Wiener process, and $f$ is the underlying function to be estimated. The white noise model is a centerpiece of nonparametric estimation, being asymptotically equivalent to nonparametric regression and density estimation (Brown & Low, 1996; Nussbaum, 1996). We intentionally express the noise level as $1/\sqrt{n}$ to reflect the connection between the white noise model and a nonparametric regression problem with $n$ evenly spaced observations. We focus on the important case where the regression function $f$ lies in the Sobolev space $\mathcal{F}(\alpha, c)$ of order $\alpha$ and radius $c$; the exact definition of this function space is given in the following section.
In a distributed setting, instead of observing a single sample path $Y(t)$, we assume there are $m$ machines, each of which observes an independent copy of the stochastic process. That is, the $i$th machine gets
$$dY_i(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW_i(t), \qquad 0 \le t \le 1,$$
for $i = 1, \dots, m$, where the $W_i(t)$'s are mutually independent standard Wiener processes. Furthermore, each machine has a budget of $b$ bits to communicate with a central machine, where a final estimate $\hat f$ is formed based on the messages received from the $m$ machines. Specifically, we denote by $\Pi_i$ the message that the $i$th machine sends to the central estimating machine; each $\Pi_i$ can be viewed as a (possibly random) functional of the stochastic process $Y_i(t)$. In this way, the tuple $(n, m, b)$ defines a problem instance for the function class $\mathcal{F}(\alpha, c)$. We use the minimax risk
$$R(n, m, b; \mathcal{F}(\alpha, c)) = \inf_{\hat f,\, \Pi_1, \dots, \Pi_m} \; \sup_{f \in \mathcal{F}(\alpha, c)} \mathbb{E}\,\|\hat f - f\|_2^2$$
to quantify the hardness of distributed estimation of $f$ in the Sobolev space $\mathcal{F}(\alpha, c)$.
The main contribution of the paper is to identify the following three asymptotic regimes.

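An insufficient regime where $mb \lesssim n^{\frac{1}{2\alpha+1}}$. Under this scaling, the total number of bits, $mb$, is insufficient to preserve the classical, nondistributed, minimax rate of convergence for the sample size $n$ on a single machine. Therefore, the communication budget becomes the main bottleneck, and we have
$$R(n, m, b; \mathcal{F}(\alpha, c)) \asymp (mb)^{-2\alpha}.$$

A sufficient regime where $b \gtrsim (mn)^{\frac{1}{2\alpha+1}}$. In this case, the number of bits allowed per machine is relatively large, and we have the minimax risk
$$R(n, m, b; \mathcal{F}(\alpha, c)) \asymp (mn)^{-\frac{2\alpha}{2\alpha+1}}.$$
Note that this is also the optimal convergence rate if all the data were available at the central machine.

An intermediate regime where $b \lesssim (mn)^{\frac{1}{2\alpha+1}}$ and $mb \gtrsim n^{\frac{1}{2\alpha+1}}$. In this regime, the minimax risk depends on all three parameters, and scales according to
$$R(n, m, b; \mathcal{F}(\alpha, c)) \asymp (mnb)^{-\frac{\alpha}{\alpha+1}}.$$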
Together, these three regimes give a sharp characterization of the statistical behavior of distributed nonparametric estimation for the Sobolev space under communication constraints, covering the full range of parameters and problem settings. The Bayesian framework adopted in this paper to establish the lower bounds is different from the techniques used in previous work, which typically rely on Fano’s lemma and the strong data processing inequality. Finally, we note that an essentially equivalent set of minimax convergence rates is obtained in a simultaneously and independently written paper by Szabo & van Zanten (2018).
The paper is organized as follows. In the next section, we explain our notation and give a brief introduction of nonparametric estimation over a Sobolev space for the usual nondistributed setting and a distributed setting. In Section 3, we state our main results on the risk of distributed nonparametric estimation with communication constraints. We outline the proof strategy for the lower bounds in Section 3.1, deferring some of the technical details and proofs to the supplementary material. In Section 4, we show achievability of the lower bounds by a particular distributed protocol and estimator. We conclude the paper with a discussion of possible directions for future work.
2 Problem formulation
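The Sobolev space of order $\alpha$ and radius $c$ is defined by
$$\mathcal{F}(\alpha, c) = \left\{ f : [0,1] \to \mathbb{R} \;:\; f^{(\alpha-1)} \text{ is absolutely continuous and } \int_0^1 \big(f^{(\alpha)}(t)\big)^2\,dt \le c^2 \right\}.$$
Intuitively, it is a space of functions having a certain degree of smoothness. The periodic Sobolev space is defined by
$$\widetilde{\mathcal{F}}(\alpha, c) = \left\{ f \in \mathcal{F}(\alpha, c) \;:\; f^{(j)}(0) = f^{(j)}(1),\; j = 0, 1, \dots, \alpha - 1 \right\}.$$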
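The white noise model (1.1) can be reformulated in terms of an infinite Gaussian sequence model. Let $(\varphi_j)_{j=1}^{\infty}$ be the trigonometric basis, and let
$$\theta_j = \int_0^1 \varphi_j(t) f(t)\,dt, \qquad j = 1, 2, \dots$$
be the Fourier coefficients. It is known that $f$ belongs to $\widetilde{\mathcal{F}}(\alpha, c)$ if and only if the sequence $\theta$ belongs to the Sobolev ellipsoid $\Theta(\alpha, c)$, defined as
$$\Theta(\alpha, c) = \left\{ \theta \;:\; \sum_{j=1}^{\infty} a_j^2 \theta_j^2 \le \frac{c^2}{\pi^{2\alpha}} \right\},$$
where
$$a_j = \begin{cases} j^{\alpha} & \text{for even } j, \\ (j-1)^{\alpha} & \text{for odd } j. \end{cases}$$
To ease the analysis, we will assume $a_j = j^{\alpha}$ and use $c^2$ in the place of $c^2/\pi^{2\alpha}$. Expanding the observed process $Y(t)$ in terms of the same basis we obtain the Gaussian sequence
$$Y_j = \int_0^1 \varphi_j(t)\,dY(t) \sim N\!\left(\theta_j, \tfrac{1}{n}\right), \qquad j = 1, 2, \dots$$
Given an estimator $\hat\theta$ for $\theta$, we can formulate a corresponding estimator for $f$ by
$$\hat f(t) = \sum_{j=1}^{\infty} \hat\theta_j \varphi_j(t),$$
and the squared errors satisfy $\|\hat\theta - \theta\|_2^2 = \|\hat f - f\|_2^2$. In this way, estimating the function $f$ in the white noise model is equivalent to estimating the means $\theta$ in the Gaussian sequence model.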
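The equivalence of the two squared errors is Parseval's identity for an orthonormal basis, and it can be checked numerically. The sketch below is only an illustration: it assumes the usual trigonometric basis on [0, 1] (constant, then cosine and sine pairs scaled by √2), and the coefficient vectors `theta` and `theta_hat` are arbitrary made-up values rather than anything taken from the paper.

```python
import math

def phi(j, t):
    """Assumed trigonometric basis on [0, 1], indexed from j = 1."""
    if j == 1:
        return 1.0
    k = j // 2  # frequency shared by the cosine/sine pair (2k, 2k + 1)
    if j % 2 == 0:
        return math.sqrt(2) * math.cos(2 * math.pi * k * t)
    return math.sqrt(2) * math.sin(2 * math.pi * k * t)

def evaluate(coeffs, t):
    """Evaluate the function whose first coefficients are `coeffs` at the point t."""
    return sum(c * phi(j, t) for j, c in enumerate(coeffs, start=1))

def l2_error(coeffs_a, coeffs_b, grid=512):
    """Approximate the integral over [0, 1] of the squared difference by a Riemann sum."""
    total = 0.0
    for i in range(grid):
        t = i / grid
        total += (evaluate(coeffs_a, t) - evaluate(coeffs_b, t)) ** 2
    return total / grid

# Hypothetical coefficients: a "true" sequence and an estimate of it.
theta = [0.5, -0.3, 0.2, 0.1, -0.05]
theta_hat = [0.45, -0.2, 0.25, 0.0, 0.0]

function_error = l2_error(theta_hat, theta)
sequence_error = sum((a - b) ** 2 for a, b in zip(theta_hat, theta))
print(function_error, sequence_error)
```

The two printed numbers should agree up to floating-point error, since the Riemann sum on a uniform grid integrates low-frequency trigonometric polynomials exactly.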
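The minimax risk of estimating $f$ over the periodic Sobolev space is defined as
$$R(n; \widetilde{\mathcal{F}}(\alpha, c)) = \inf_{\hat f} \; \sup_{f \in \widetilde{\mathcal{F}}(\alpha, c)} \mathbb{E}\,\|\hat f - f\|_2^2,$$
which, as just shown, is equal to the minimax risk of estimating $\theta$ over the Sobolev ellipsoid in the corresponding Gaussian sequence model,
$$R(n; \Theta(\alpha, c)) = \inf_{\hat\theta} \; \sup_{\theta \in \Theta(\alpha, c)} \mathbb{E}\,\|\hat\theta - \theta\|_2^2.$$
It is known (Tsybakov, 2008) that the asymptotic minimax risk scales according to
$$R(n; \Theta(\alpha, c)) \asymp n^{-\frac{2\alpha}{2\alpha+1}}$$
as $n \to \infty$.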
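In a distributed setting, we suppose there are $m$ machines, and the $i$th machine independently observes $Y_i(t)$ such that
$$dY_i(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dW_i(t), \qquad 0 \le t \le 1,$$
for $i = 1, \dots, m$. Equivalently, if we express this in terms of the Gaussian sequence model, the $i$th machine observes data
$$Y_{ij} \sim N\!\left(\theta_j, \tfrac{1}{n}\right), \qquad j = 1, 2, \dots$$
We further assume there is a central machine where a final estimator $\hat\theta$ needs to be calculated based on messages received from the $m$ local machines. Local machine $i$ sends a message of length $b_i$ bits to the central machine; we denote this message by $\Pi_i$. Then $\Pi_i$ can be viewed as a (possibly random) mapping from $\mathbb{R}^{\infty}$ to $\{1, 2, \dots, 2^{b_i}\}$. The final estimator $\hat\theta$ is then a functional of the collection of messages. The mechanism can be summarized by the following diagram:
$$(Y_{ij})_{j \ge 1} \;\longrightarrow\; \Pi_i \quad (i = 1, \dots, m), \qquad (\Pi_1, \dots, \Pi_m) \;\longrightarrow\; \hat\theta.$$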
Suppose that the communication is restricted by one of two types of constraints: an individual constraint, where $b_i \le b$ for each $i = 1, \dots, m$ and a given budget $b$, and a sum constraint, where $\sum_{i=1}^{m} b_i \le mb$. We call the set of mappings $\{\Pi_i\}_{i=1}^{m}$ and $\hat\theta$ a distributed protocol, and denote by $\Gamma_{\mathrm{ind}}(m, b)$ and $\Gamma_{\mathrm{sum}}(m, b)$ the collection of all such protocols, operating under the individual constraint and the sum constraint, respectively.
We note here that for simplicity we consider only one round of communication. A variant is to allow multiple rounds of communication, for which the local machines can get access to a “blackboard” where the central machine broadcasts information back to the distributed nodes.
The minimax risk of the distributed estimation problem under the communication constraint is defined by
$$R(n, m, b; \Theta(\alpha, c)) = \inf_{(\Pi_1, \dots, \Pi_m, \hat\theta) \in \Gamma(m, b)} \; \sup_{\theta \in \Theta(\alpha, c)} \mathbb{E}\,\|\hat\theta - \theta\|_2^2. \tag{2.1}$$
Here $\Gamma(m, b)$ represents either $\Gamma_{\mathrm{ind}}(m, b)$ or $\Gamma_{\mathrm{sum}}(m, b)$. In fact, it will be clear that the minimax risks under the two types of constraints are asymptotically equivalent.
3 Lower bounds for distributed estimation
In what follows, we will work in an asymptotic regime where the tuple $(n, m, b)$ goes to infinity while satisfying some relationships, and show how the minimax risk for the distributed estimation problem scales accordingly. The main result can be summarized in the following theorem.
Theorem 3.1.
Let be defined as in (2.1) with

If , then

If and , then

If , then
Remark 3.1.
The lower bounds are valid for both the sum constraint and the individual constraint. In fact, the individual constraint is more stringent than the sum constraint, so it suffices to prove the lower bounds under the sum constraint.
Remark 3.2.
To put the result more concisely, we can write
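There are multiple ways to interpret this main result and here we illustrate one of the many possibilities. Fixing $m$ and $b$, and viewing the minimax risk as a function of $n$, the sample size on each machine, we have
$$R(n, m, b; \Theta(\alpha, c)) \asymp \begin{cases} (mn)^{-\frac{2\alpha}{2\alpha+1}} & \text{if } n \lesssim b^{2\alpha+1}/m, \\ (mnb)^{-\frac{\alpha}{\alpha+1}} & \text{if } b^{2\alpha+1}/m \lesssim n \lesssim (mb)^{2\alpha+1}, \\ (mb)^{-2\alpha} & \text{if } n \gtrsim (mb)^{2\alpha+1}. \end{cases}$$
This indicates that when the configuration of machines and communication budget stay the same, as we increase the sample size $n$ at each machine, the risk starts to decay at the optimal rate with exponent $-\frac{2\alpha}{2\alpha+1}$. Once the sample size is large enough, the convergence rate slows down to an exponent $-\frac{\alpha}{\alpha+1}$. Eventually, the sample size exceeds a threshold, beyond which any further increase will not decrease the risk due to the communication constraint.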
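To make this reading concrete, the following sketch evaluates the rate in each regime for given values of the per-machine sample size n, the number of machines m, the per-machine budget b and the smoothness α. It is a rough illustration under stated assumptions: the thresholds (effective dimensions n^{1/(2α+1)} and (mn)^{1/(2α+1)}) and the three exponents are reconstructed from the regime descriptions in this paper, all constants are set to one, and the function name `minimax_rate` is hypothetical.

```python
from typing import Tuple

def minimax_rate(n: float, m: float, b: float, alpha: float) -> Tuple[str, float]:
    """Return (regime, rate up to constants) under the assumed scalings."""
    d_local = n ** (1.0 / (2 * alpha + 1))          # effective dimension for one machine
    d_global = (m * n) ** (1.0 / (2 * alpha + 1))   # effective dimension for pooled data
    if m * b <= d_local:
        # Total budget below the single-machine effective dimension.
        return "insufficient", (m * b) ** (-2 * alpha)
    if b >= d_global:
        # Each machine can describe every relevant coefficient.
        return "sufficient", (m * n) ** (-2 * alpha / (2 * alpha + 1))
    return "intermediate", (m * n * b) ** (-alpha / (alpha + 1))

# Fix m and b and let the per-machine sample size grow.
for n in (10, 1e4, 1e12):
    print(n, minimax_rate(n, m=100, b=20, alpha=1))
```

With m and b fixed, the regime reported by the loop moves from sufficient to intermediate to insufficient as n grows, and the rate stops improving once the last regime is reached.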
Remark 3.3.
This work can be viewed as a natural generalization of Zhu & Lafferty (2017), where the authors consider estimation over a Sobolev space with a single remote machine and communication constraints. Specifically, by setting $m = 1$ we recover the main results in Zhu & Lafferty (2017) up to some constant factor. However, with more than one machine, it is nontrivial to uncover the minimax convergence rate, especially in the intermediate regime.
3.1 Proof of the lower bounds
We now proceed to outline the proof of the lower bounds in Theorem 3.1. Most existing results rely on Fano’s lemma and the strong data processing inequality (Zhang et al., 2013a; Braverman et al., 2016). An extension of this information-theoretic approach is used by Szabo & van Zanten (2018) in the nonparametric setting to obtain essentially the same lower bounds as we establish here. However, we develop the Bayesian framework for deriving minimax lower bounds (Johnstone, 2017), circumventing the need for both Fano’s lemma and the strong data processing inequality, and associating the lower bounds with the solution of an optimization problem.
We consider a prior distribution asymptotically supported on the parameter space . For any estimator that follows the distributed protocol, we have
(3.1) 
That is, the worst-case risk associated with is bounded from below by the integrated risk. We specifically consider the Gaussian prior distribution for , and for , where the sequence satisfies . We make (3.1) clear in the following lemma, whose proof can be found in the supplementary material.
Lemma 3.1.
Suppose that a sequence of Gaussian prior distributions for and estimator satisfy
(3.2) 
for some as . Then
The next step is to lower bound the integrated risk . Lemma 3.2 is derived from a result that appears in Wang et al. (2010); for completeness we include the proof in the supplementary material.
Lemma 3.2.
Suppose and for and . Let be a (random) mapping, which takes up to different values. Let be an estimator based on the messages created by . Under the constraint that , can be lower bounded by the value of the following optimization problem
(3.3)  
Combining Lemmas 3.1 and 3.2, we have the following asymptotic lower bound
for sequences satisfying and as .
Next, based on the optimization problem formulated above, we work under three different regimes, and derive three forms of lower bounds of the minimax risk. The key is to choose appropriate sequences of prior variances
for different regimes, as we shall illustrate.
Suppose that is a feasible solution to the problem (3.3). Using the first constraint, we have
where we have used Jensen’s inequality. Therefore,
Consider an asymptotic regime where , and pick a sequence of corresponding prior distributions with for some constant and for . Note that this choice satisfies condition (3.2). With such a choice of the prior distribution, we have

Again suppose that is a feasible solution to the problem (3.3). This time we take another viewpoint of the first constraint
To minimize under the constraint that , we write the Lagrangian
and set
Solving this gives us that
This time, consider a regime where and . Pick a sequence of corresponding prior distributions with for some constant and for , which satisfies condition (3.2). With this choice and replacing by , we have
4 Achievability
In this section, we describe how the lower bound can be achieved through the use of a certain distributed protocol. Unlike for the lower bound, we shall work under the individual constraint on the communication budget, instead of the sum constraint. However, a protocol satisfying the individual constraint automatically satisfies the sum constraint.
4.1 High-level idea
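In nonparametric estimation theory, it is known that for the Gaussian sequence model $Y_j \sim N(\theta_j, \varepsilon^2)$ for $j = 1, 2, \dots$ with $\theta \in \Theta(\alpha, c)$, the optimal scaling of the risk is $\varepsilon^{\frac{4\alpha}{2\alpha+1}}$, and this can be achieved by truncating the sequence at $J_\varepsilon \asymp \varepsilon^{-\frac{2}{2\alpha+1}}$. That is, the estimator
$$\hat\theta_j = \begin{cases} Y_j & \text{if } j \le J_\varepsilon, \\ 0 & \text{otherwise} \end{cases}$$
has worst-case risk of order $\varepsilon^{\frac{4\alpha}{2\alpha+1}}$. We are going to build on this simple but rate-optimal estimator in our distributed protocol. But before carefully defining and analyzing the protocol, we first give a high-level idea of how it is designed.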
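The truncation estimator can be illustrated with a small Monte Carlo experiment. The sketch below assumes the sequence model with noise standard deviation `eps` and a cutoff equal to the ceiling of eps^{-2/(2α+1)}; the test sequence `theta`, the function names and the number of trials are all hypothetical choices made for illustration.

```python
import math
import random

def truncation_cutoff(eps, alpha):
    """Assumed cutoff: ceiling of eps^(-2 / (2 * alpha + 1)), at least 1."""
    return max(1, math.ceil(eps ** (-2.0 / (2 * alpha + 1))))

def truncation_estimate(y, cutoff):
    """Keep the first `cutoff` observations and set the remaining coordinates to zero."""
    return [v if j < cutoff else 0.0 for j, v in enumerate(y)]

def simulate_risk(theta, eps, alpha, trials=200, seed=0):
    """Monte Carlo estimate of the squared-error risk of the truncation estimator."""
    rng = random.Random(seed)
    cutoff = truncation_cutoff(eps, alpha)
    total = 0.0
    for _ in range(trials):
        y = [t + eps * rng.gauss(0.0, 1.0) for t in theta]
        est = truncation_estimate(y, cutoff)
        total += sum((e - t) ** 2 for e, t in zip(est, theta))
    return total / trials

# Hypothetical sequence with sum_j j^(2*alpha) * theta_j^2 <= 1 for alpha = 1.
alpha, eps = 1.0, 0.01
theta = [0.5 * j ** (-2.0) for j in range(1, 201)]
risk = simulate_risk(theta, eps, alpha)
print(risk, eps ** (4 * alpha / (2 * alpha + 1)))
```

The simulated risk is the sum of a variance term (cutoff times eps squared) and the squared tail of `theta`, and it should be of the same order as the second printed value.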
In our distributed setting, we have a total budget of $mb$ bits to communicate from the local machines to the central machine, which means that we can transmit on the order of $mb$ random variables to a certain degree of precision.
In the first regime where we have $mb \lesssim n^{\frac{1}{2\alpha+1}}$, the communication budget is so small that the total number of bits is smaller than the effective dimension for the noise level $1/\sqrt{n}$. In this case, we let each machine transmit information regarding a unique set of components of $\theta$. Thus, at the central machine, we can decode and obtain information about the first $mb$ components of $\theta$. This is equivalent to truncating a centralized Gaussian sequence at $mb$, and gives us a convergence rate of $(mb)^{-2\alpha}$.
In the second regime ($b \lesssim (mn)^{\frac{1}{2\alpha+1}}$ and $mb \gtrsim n^{\frac{1}{2\alpha+1}}$), we have a larger budget at our disposal, and can thus afford to transmit more than one random variable containing information about $\theta_j$. Suppose that for a specific $j$ we quantize and transmit $Y_{ij}$ for $k$ different values of $i$, namely at $k$ different machines. The budget of $mb$ random variables will allow us to acquire information about the first $mb/k$ components of $\theta$. When aggregating at the central machine, we effectively observe $N\!\left(\theta_j, \tfrac{1}{nk}\right)$ for $j \le mb/k$, and no information about $\theta_j$ for $j > mb/k$. Now consider the effect of choosing different values of $k$. In choosing a smaller $k$, we will be able to estimate more components of $\theta$, but each at a lower accuracy. On the other hand, a larger $k$ leads to fewer components being estimated, but with smaller error. We know from nonparametric estimation theory that the tradeoff is optimized when $mb/k \asymp (nk)^{\frac{1}{2\alpha+1}}$. This gives us the optimal choice $k \asymp (mb)^{\frac{2\alpha+1}{2\alpha+2}}\, n^{-\frac{1}{2\alpha+2}}$, with risk scaling as $(mnb)^{-\frac{\alpha}{\alpha+1}}$.
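The tradeoff in the replication factor can be explored numerically. The sketch below uses a simplified risk proxy that is an assumption of this illustration, not a formula from the paper: if each coefficient is sent by k machines, roughly mb/k coefficients are covered, each with variance 1/(nk), and the uncovered tail contributes at most c² (mb/k)^{-2α}. Quantization error is ignored, and the names `risk_proxy`, `best_replication` and `closed_form_replication` are hypothetical.

```python
def risk_proxy(k, n, m, b, alpha, c=1.0):
    """Variance plus a bound on the squared bias when each coefficient is sent by k machines."""
    num_components = m * b / k                 # coefficients that receive any information
    variance = num_components / (n * k)        # each covered coefficient has variance 1/(n k)
    bias_sq = c ** 2 * num_components ** (-2 * alpha)  # bound on the uncovered tail
    return variance + bias_sq

def best_replication(n, m, b, alpha, c=1.0):
    """Grid search over integer replication factors k = 1, ..., m."""
    return min(range(1, m + 1), key=lambda k: risk_proxy(k, n, m, b, alpha, c))

def closed_form_replication(n, m, b, alpha):
    """Assumed scaling k ~ (mb)^((2a+1)/(2a+2)) * n^(-1/(2a+2)), constants dropped."""
    return (m * b) ** ((2 * alpha + 1) / (2 * alpha + 2)) * n ** (-1 / (2 * alpha + 2))

n, m, b, alpha = 10_000, 100, 20, 1
k_grid = best_replication(n, m, b, alpha)
k_formula = closed_form_replication(n, m, b, alpha)
rate = (m * n * b) ** (-alpha / (alpha + 1))
print(k_grid, k_formula, risk_proxy(k_grid, n, m, b, alpha) / rate)
```

For other values of α the grid-search minimizer and the closed-form scaling differ by a constant factor, which is consistent with the scaling being stated only up to constants.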
In the last regime, we have $b \gtrsim (mn)^{\frac{1}{2\alpha+1}}$. In this case, the number of bits available at each machine is larger than the effective dimension associated with the global noise level $1/\sqrt{mn}$. We simply quantize and transmit the first $(mn)^{\frac{1}{2\alpha+1}}$ components of $Y_i$ from each machine to the central machine, where we decode and simply average the received random variables.
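A toy version of this quantize-and-average scheme is sketched below. It is not the protocol analyzed in the paper: it assumes a plain uniform scalar quantizer with a fixed number of bits per coefficient and a known bound on the observations, and every name (`quantize`, `dequantize`, `run_protocol`) and parameter value is hypothetical.

```python
import math
import random

def quantize(value, bits, bound):
    """Uniform scalar quantizer on [-bound, bound]; returns an index in [0, 2**bits)."""
    levels = 2 ** bits
    clipped = min(max(value, -bound), bound)
    step = 2 * bound / levels
    return min(int((clipped + bound) / step), levels - 1)

def dequantize(index, bits, bound):
    """Map a quantizer index back to the midpoint of its cell."""
    step = 2 * bound / (2 ** bits)
    return -bound + (index + 0.5) * step

def run_protocol(theta, n, m, bits, bound, seed=0):
    """Each of m machines quantizes its noisy copy; the center decodes and averages."""
    rng = random.Random(seed)
    noise_sd = 1.0 / math.sqrt(n)
    sums = [0.0] * len(theta)
    for _ in range(m):
        for j, t in enumerate(theta):
            y = t + noise_sd * rng.gauss(0.0, 1.0)   # local observation Y_ij
            message = quantize(y, bits, bound)       # what the machine transmits
            sums[j] += dequantize(message, bits, bound)
    return [s / m for s in sums]

theta = [0.5, -0.25, 0.1]
estimate = run_protocol(theta, n=100, m=400, bits=8, bound=2.0)
print(estimate)
```

In this toy setting the local noise is much larger than the quantizer step, so the averaged estimate behaves roughly like an average of m unquantized observations.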