Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms

January 24, 2020 · T. Tony Cai, et al. · University of Pennsylvania

We study distributed estimation of a Gaussian mean under communication constraints in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between the communication costs and statistical accuracy, are established in both the univariate and multivariate settings. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under the communication constraints, both in terms of the optimal procedure design and lower bound argument. The techniques developed in this paper can be of independent interest. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design.


1 Introduction

In the conventional statistical decision theoretical framework, the focus is on the centralized setting where all the data are collected together and directly available. The main goal is to develop optimal (estimation, testing, detection, …) procedures, where optimality is understood with respect to the sample size and parameter space. Communication/computational costs are not part of the consideration.

In the age of big data, the communication/computational costs associated with a statistical procedure are becoming increasingly important in contemporary applications. One of the difficulties in analyzing large datasets is that the data are distributed across multiple locations rather than stored in a single centralized location. This setting arises naturally in many areas of statistical practice.

  • Large datasets. When the datasets are too large to be stored on a single computer or data center, it is natural to partition the whole dataset across multiple computers or data centers, each assigned a smaller subset of the full dataset. Such is the case for a wide range of applications.

  • Privacy and security. Privacy and security concerns can also cause the decentralization of the datasets. For example, medical and financial institutions often collect datasets that contain sensitive and valuable information. For privacy and security reasons, the data cannot be released to a third party for a centralized analysis and need to be stored in different and secure places while performing data analysis.

Learning from distributed datasets, which is called distributed learning, has attracted much recent attention. For example, Google AI proposed a machine learning setting called “Federated Learning” (mcmahan2017federated), which develops a high-quality centralized model while the training data remain distributed over a large number of clients. Figure 1(a) provides a simple illustration of a distributed learning network. In addition to advances in architecture design for distributed learning in practice, there is also a growing literature on distributed learning theory, including jordan2018communication, battey2018distributed, dobriban2018distributed, and fan2019distributed in the statistics, computer science, and information theory communities. Several distributed learning procedures with theoretical guarantees have been developed in recent works. However, these works do not impose any communication constraints on the proposed procedures and thus fail to characterize the relationship between communication costs and statistical accuracy. Indeed, in a decision theoretical framework, if no communication constraints are imposed, one can always output the original data from the local machines to the central machine and treat the problem exactly as in the conventional centralized setting.

The study of how communication constraints compromise estimation accuracy in distributed settings has a long history. Dating back to the 1980s, zhang1988estimation proposed an asymptotically unbiased distributed estimator and calculated its variance. In recent years, there has been an emerging literature focusing on distributed Gaussian mean estimation under communication constraints.

garg2014communication provided a bound on the number of bits of communication needed to achieve the centralized minimax risk. zhang2013information; braverman2016communication introduced information-theoretic tools to prove lower bounds on the minimax rate for Gaussian mean estimation under communication constraints. han2018geometric developed a geometric lower bound for distributed Gaussian mean estimation. Other related settings and distribution families were also studied in luo2005universal; pmlr-v80-zhu18a; kipnis2019mean; hadar2019distributed; szabo2019asymptotic.

For large-scale data analysis, communication between machines can be slow and expensive, and limits on bandwidth and communication sometimes become the main bottleneck for statistical efficiency. It is therefore necessary to take communication constraints into consideration when constructing statistical procedures. When the communication budget is limited, the algorithm must carefully “compress” the information contained in the data as efficiently as possible, leading to a trade-off between communication costs and statistical accuracy. The precise quantification of this trade-off is an important and challenging problem.

Estimation of a Gaussian mean occupies a central position in parametric statistical inference. In the present paper we consider distributed Gaussian mean estimation under the communication constraints in both the univariate and multivariate settings. Although optimal estimation of a Gaussian mean is a relatively simple problem in the conventional setting, this problem is quite involved under the communication constraints, both in terms of the construction of the rate optimal distributed estimator and the lower bound argument. Optimal distributed estimation of a Gaussian mean also serves as a starting point for investigating other more complicated statistical problems in distributed learning including distributed high-dimensional linear regression and distributed large-scale multiple testing.

1.1 Problem formulation

We begin by giving a formal definition of the transcript, distributed estimator, and distributed protocol. Let be a parametric family of distributions supported on space , where is the parameter of interest. Suppose there are local machines and a central machine, where the local machines contain the observations and the central machine produces the final estimator of under the communication constraints between the local and central machines. More precisely, suppose we observe i.i.d. random samples drawn from a distribution :

where the -th local machine has access to only.

For , let be a positive integer and suppose the -th local machine can transmit only bits to the central machine. That is, the observation on the -th local machine needs to be processed into a binary string of length by a (possibly random) function . The resulting string , which is called the transcript from the -th machine, is then transmitted to the central machine. Finally, a distributed estimator is constructed on the central machine based on the transcripts ,

The above scheme to obtain a distributed estimator is called a distributed protocol. The class of distributed protocols with communication budgets is defined as

(a) Distributed learning network
(b) Distributed protocol
Figure 1: (a) Left panel: An illustration of a distributed learning network. Communication between the data servers and the central learner is necessary in order to learn from distributed datasets. (b) Right panel: An illustration of a distributed protocol. The -th machine can only transmit a transcript of bits to the central machine.

We use as a shorthand for and denote for . We shall always assume for all , i.e., each local machine can transmit at least one bit to the central machine. Otherwise, if no communication is allowed from some of the local machines, one can simply exclude those machines and treat the problem as if there were fewer local machines available. Figure 1(b) gives a simple illustration of the distributed protocols.

As usual, the estimation accuracy of a distributed estimator is measured by the mean squared error (MSE), , where the expectation is taken over the randomness in both the data and construction of the transcripts and estimator. As in the conventional decision theoretical framework, a quantity of particular interest in distributed learning is the minimax risk for the distributed protocols

which characterizes the difficulty of the distributed learning problem under the communication constraints . As mentioned earlier, in a rigorous decision theoretical formulation of distributed learning, the communication constraints are essential. Without the constraints, one can always output the original data from the local machines to the central machine and the problem is then reduced to the usual centralized setting.
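To make this formulation concrete, the following is a minimal Python sketch of a distributed protocol under illustrative assumptions: each local machine compresses its single observation with a naive b-bit uniform quantizer on a fixed truncation interval, and the central machine simply decodes and averages. This toy encoder is not the optimal procedure developed later; it only illustrates the roles of the transcripts, the per-machine budgets, and the central estimator.

```python
import numpy as np

def local_encoder(x, b, lo=-1.0, hi=1.0):
    """Compress one observation x into a b-bit transcript (a toy uniform quantizer)."""
    x = min(max(x, lo), hi)                                # truncate to [lo, hi]
    level = int(round((x - lo) / (hi - lo) * (2**b - 1)))  # nearest of 2^b quantization levels
    return format(level, f"0{b}b")                         # b-bit binary string

def central_estimator(transcripts, lo=-1.0, hi=1.0):
    """Decode each transcript back to a point value and average (a toy aggregation rule)."""
    decoded = [lo + int(t, 2) / (2**len(t) - 1) * (hi - lo) for t in transcripts]
    return float(np.mean(decoded))

# Toy run: m machines, one observation each, b_i bits per machine.
rng = np.random.default_rng(0)
theta, sigma, m = 0.3, 1.0, 100
budgets = [4] * m                                          # b_i = 4 bits for every machine
samples = rng.normal(theta, sigma, size=m)
transcripts = [local_encoder(x, b) for x, b in zip(samples, budgets)]
print(central_estimator(transcripts))                      # rough estimate of theta
```

Any rule of this form, a (possibly random) per-machine encoder followed by a central decoder, is a member of the class of distributed protocols defined above; only the encoder/decoder pair changes across procedures.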

1.2 Distributed estimation of a univariate Gaussian mean

We first consider distributed estimation of a univariate Gaussian mean under the communication constraints , where with and the variance known. Note that by a sufficiency argument, the case where each local machine has access to i.i.d. samples from is the same.

Our analysis in Section 2 establishes the following minimax rate of convergence for distributed univariate Gaussian mean estimation under the communication constraints ,

(1)

where is the total communication budget, and denotes for some constants .

The above minimax rate characterizes the trade-off between the communication costs and statistical accuracy for univariate Gaussian mean estimation. An illustration of the minimax rate is shown in Figure 2.

Figure 2: The minimax rate of univariate Gaussian mean estimation under communication constraints has three phases: localization, refinement, and optimal-rate.

The minimax rate (1) is interesting in several aspects. First, the optimal rate of convergence depends only on the total communication budget , not on the specific allocation of the communication budgets among the local machines, as long as each machine has at least one bit. Second, the rate of convergence has three different phases:

  1. Localization phase. When , as a function of , the minimax risk decreases quickly, at an exponential rate. In this phase, having more communication budget is very beneficial in terms of improving the estimation accuracy.

  2. Refinement phase. When , as a function of , the minimax risk decreases relatively slowly and is inversely proportional to the total communication budget .

  3. Optimal-rate phase. When , the minimax rate does not depend on , and is the same as in the centralized setting where all the data are combined (bickel1981minimax).

An essential technique for solving this problem is the decomposition of the minimax estimation problem into two steps, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and the optimal procedure design. In the lower bound analysis, the statistical error is decomposed into a “localization error” and a “refinement error”. It is shown that one of these two terms is inevitably large under the communication constraints. In our optimal procedure, called MODGAME, the bits of the transcripts are divided into three types: crude localization bits, finer localization bits, and refinement bits. They compress the local data in such a way that both the localization and refinement errors can be optimally reduced. Further technical details and discussion are presented in Section 2.

1.3 Distributed estimation of a multivariate Gaussian mean

We then consider the multivariate case under the communication constraints , where with and the noise level is known. Similar to the univariate case, the goal is to optimally estimate the mean vector under the squared error loss.

The construction and the analysis given in Section 3 show that the minimax rate of convergence in this case is given by

(2)

where is the total communication budget and is the “effective sample size”.

The minimax rate in the multivariate case (2) is an extension of its univariate counterpart (1), but it also has distinct features, both in terms of the estimation procedure and the lower bound argument. Intuitively, the total communication budget is evenly divided into parts so that roughly bits can be used to estimate each coordinate. Because there are coordinates, the risk is multiplied by . The effective sample size is a special and interesting quantity in multivariate Gaussian mean estimation. This quantity shows that even when the total communication budget is sufficient, the rate of convergence must be larger than the benchmark . There is a gap between the distributed optimal rate and the centralized optimal rate if . See Section 3 for further technical details and discussion.

Although the interplay between communication costs and statistical accuracy has drawn increasing attention recently, to the best of our knowledge, the present paper is the first to establish a sharp minimax rate for distributed Gaussian mean estimation. Compared to our results, none of the previous results is sharp in general. The techniques developed in this paper, both for the lower bound analysis and for the construction of the rate-optimal procedure, can be of independent interest. Our lower bound argument was inspired by earlier work on the strong data processing inequality in zhang2013information; braverman2016communication; raginsky2016strong.

1.4 Organization of the paper

We finish this section with notation and definitions that will be used in the rest of the paper. Section 2 studies distributed estimation of a univariate Gaussian mean under communication constraints and Section 3 considers the multivariate case. The numerical performance of the proposed distributed estimators is investigated in Section 4 and further research directions are discussed in Section 5. For reasons of space, we prove the main results for the univariate case in Section 6 and defer the proofs of the results for the multivariate case and technical lemmas to the Supplementary Material (CaiWei2019Supplement).

1.5 Notation and definitions

For any , let denote the floor function (the largest integer not larger than ). Unless otherwise stated, we write for the base-2 logarithm of . For any , let and . For any vector , we use to denote the -th coordinate of , and denote by its norm. For any set , let be the Cartesian product of copies of . Let denote the indicator function taking values in .

For any discrete random variables $X$ and $Y$ supported on , the entropy $H(X)$, conditional entropy $H(X\mid Y)$, and mutual information $I(X;Y)$ are defined as
$$H(X) = -\sum_{x} P(X=x)\log P(X=x), \qquad H(X\mid Y) = -\sum_{x,y} P(X=x,Y=y)\log P(X=x\mid Y=y),$$
$$I(X;Y) = H(X) - H(X\mid Y).$$
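For concreteness, the short sketch below computes these three quantities for two discrete random variables from a joint probability table, using the chain rule H(X|Y) = H(X,Y) - H(Y); the joint distribution shown is an arbitrary illustrative example.

```python
import numpy as np

def entropy(p):
    """H = -sum_x p(x) log2 p(x), ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Joint distribution of (X, Y): rows index x, columns index y (arbitrary example).
pxy = np.array([[0.30, 0.10],
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals of X and Y

H_X = entropy(px)
H_X_given_Y = entropy(pxy.ravel()) - entropy(py)   # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_XY = H_X - H_X_given_Y                           # mutual information I(X;Y) = H(X) - H(X|Y)
print(H_X, H_X_given_Y, I_XY)
```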

2 Distributed Univariate Gaussian Mean Estimation

In this section we consider distributed estimation of a univariate Gaussian mean, where one observes on local machines i.i.d. random samples:

under the constraints that the -th machine has access to only and can transmit bits only to the central machine. We denote by the Gaussian location family

where is the mean parameter of interest and the variance is known. For given communication budgets with for , the goal is to optimally estimate the mean under the squared error loss. A particularly interesting quantity is the minimax risk under the communication constraints, i.e., the minimax risk for the distributed protocol :

which characterizes the difficulty of the estimation problem under the communication constraints. We are also interested in constructing a computationally efficient algorithm that achieves the minimax optimal rate.

We first introduce an estimation procedure and provide an upper bound for its performance and then establish a matching lower bound on the minimax rate. The upper and lower bounds together establish the minimax rate of convergence and the optimality of the proposed estimator.

2.1 Estimation procedure - MODGAME

We begin with the construction of an estimation procedure under the communication constraints and provide a theoretical analysis of the proposed procedure. The procedure, called MODGAME (Minimax Optimal Distributed GAussian Mean Estimation), is deterministic and generates a distributed estimator under the distributed protocol . We divide the discussion into two cases: and .

2.1.1 MODGAME procedure when

When , MODGAME consists of two steps: localization and refinement. Roughly speaking, the first step utilizes bits, out of the total budget of bits, for localization, to roughly locate where is, up to an error of . Building on this location information, the remaining bits are used for refinement to further increase the accuracy of the estimator. A detailed theoretical analysis will establish the optimality of the final estimator.

Before describing the MODGAME procedure in detail, we define several useful functions that will be used to generate the transcripts. For any interval , let be the truncation function defined by

(3)

For any integer , denote by the -th Gray function, defined by

Similarly we denote by the -th conjugate Gray function defined by

To unify the notation we set if .

It is worth mentioning that these Gray functions mimic the behavior of Gray codes (for reference see savage1997survey). Fix ; if we treat as a code string for any source , then all values within the interval , where is an integer, share the same code. Moreover, the codes for adjacent intervals differ by only one bit, which is also a key feature of Gray codes. This key feature guarantees the robustness of the Gray codes, and such robustness makes the Gray functions very useful for distributed estimation. An example for is shown in Figure 3 to better illustrate the behavior of the Gray functions.
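To make the connection to Gray codes concrete, the sketch below shows one way such Gray-type bit functions can be realized for a source value in [0, 1), using the standard reflected-binary code gray(j) = j XOR (j >> 1). The function `gray_bits` is a hypothetical stand-in for the paper's Gray functions; it is meant only to exhibit the two key features noted above: values in the same dyadic interval share a code, and the codes of adjacent intervals differ in exactly one bit.

```python
def gray_bits(x, k):
    """Return the first k Gray-code bits of x in [0, 1): the reflected-binary code of
    the dyadic interval [j 2^-k, (j+1) 2^-k) containing x, where gray(j) = j ^ (j >> 1)."""
    j = min(int(x * 2**k), 2**k - 1)        # index of the dyadic interval containing x
    g = j ^ (j >> 1)                        # standard reflected-binary (Gray) code of j
    return format(g, f"0{k}b")

# Values in the same interval share a code; adjacent intervals differ in exactly one bit.
print(gray_bits(0.26, 3), gray_bits(0.30, 3))   # same interval [0.25, 0.375): identical codes
print(gray_bits(0.24, 3), gray_bits(0.26, 3))   # adjacent intervals: codes differ in one bit
```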

Figure 3: An illustration of the Gray functions and Gray codes.

Define the refinement function and the conjugate refinement function by

For any function , define the convolution function

For any , let be the decoding function defined by

Last, we define the distance between a point and a set as

We are now ready to introduce the MODGAME procedure in detail. We divide the construction into three cases.

Case 1: . In this case, the output consists of the values of the first localization bits from the local machines, where the -th localization bit is defined as the value of the function evaluated on the local sample. The procedure can be described as follows, with a toy illustration sketched after the list.

  1. Generate transcripts on local machines. Define and for . On the -th machine, the transcript is the concatenation of the values of the -th, -th, …, -th Gray functions evaluated at . That is,

    where for .

  2. Construct the distributed estimator . Now we collect the bits from the transcripts . Since is the -th Gray function evaluated at a random sample drawn from , one may reasonably “guess” that . Guided by this intuition, we set to be the minimum number in the interval , i.e.
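The following toy sketch illustrates the spirit of this crude localization step under simplifying assumptions: a single machine, a sample already truncated into [0, 1), and the hypothetical `gray_bits` encoder from the earlier sketch (redefined here so the snippet is self-contained). It is not the exact MODGAME specification, whose bit allocation across machines is described in the text.

```python
import numpy as np

def gray_bits(x, k):
    """First k Gray-code bits of x in [0, 1) (same toy encoder as in the earlier sketch)."""
    j = min(int(x * 2**k), 2**k - 1)
    return format(j ^ (j >> 1), f"0{k}b")

def decode_to_interval(code):
    """Invert the Gray code back to the dyadic interval [j 2^-k, (j+1) 2^-k)."""
    k, g, j = len(code), int(code, 2), 0
    while g:                                   # inverse of the map j -> j ^ (j >> 1)
        j ^= g
        g >>= 1
    return j / 2**k, (j + 1) / 2**k

rng = np.random.default_rng(1)
theta, sigma = 0.62, 0.05
x = float(np.clip(rng.normal(theta, sigma), 0.0, 1.0))    # truncated local sample
code = gray_bits(x, 4)                                     # 4 crude localization bits
lo, hi = decode_to_interval(code)
print(code, (lo, hi), lo)                                  # crude estimate: left endpoint of the interval
```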

Case 2: . Let

(4)

and define finer localization functions:

(5)

In this case the total communication budget is divided into three parts: crude localization bits (roughly bits), finer localization bits ( bits), and refinement bits ( bits). The crude localization bits are the values of the functions , each evaluated on a local sample; we denote the resulting binary bits by . The finer localization bits are the values of the functions , each evaluated on a different local sample; their values are denoted by . The refinement bits are the values of the function evaluated on local samples, and the values of the function evaluated on different local samples; the resulting binary bits are denoted by and , respectively.

These three types of bits are assigned to the local machines in the following way: (1) among all machines, local machines output a transcript consisting of one finer localization bit and crude localization bits; (2) among all machines, local machines output a transcript consisting of one refinement bit and crude localization bits; (3) the remaining machines output a transcript consisting of crude localization bits only. The above assignment is feasible because

It is worth mentioning that the finer localization bits are all assigned to different machines, as are the refinement bits. Intuitively, this is because we need these bits to be independent so that we can gain enough information for estimation. See Figure 4 for an overview of the MODGAME procedure. The procedure can be summarized as follows:

Figure 4: An illustration of MODGAME. The bits in the transcripts are transmitted to the central machine and divided into three types: crude localization bits, finer localization bits, and refinement bits. One then constructs on the central machine a crude interval , a finer interval , and the final estimate step by step.
  1. Generate transcripts on local machines. Define and . On the -th machine:

    • If for some integer , output

      (If , just output .)

    • If , output

      (If , just output .)

    • If , output

      (If , just output .)

    • If , output

    where the above binary bits are calculated by

  2. Construct distributed estimator . From transcripts , we can collect (a) crude localization bits ; (b) finer localization bits ; (c) refinement bits and .

    1. First, we use crude localization bits to roughly locate . The “crude interval” will be obtained in this step.

      (a) If , just set .

      (b) If , let

      (6)

      Then we further stretch to a larger interval so that will double the length of :

      (7)
    2. Then, we use finer localization bits to locate to a smaller interval of length roughly . A “finer interval” will be generated in this step. For any , let

      be the majority voting summary statistic for .

      (a) If , and , let

      (b) If , and , let

      (8)

      Then we further stretch to a larger interval so that will double the length of :

      (c) If , let

      (9)

      Lemma 6 shows is an interval. Then we further stretch to a larger interval so that will double the length of :

    3. Finally, we use refinement bits and to get an accurate estimate . Lemma 7 shows that one of the following two conditions must hold:

      or

      So we can divide the procedure into the following two cases.

      (a) Suppose . Then is a strictly monotone function on (proved in Lemma 7). Denote

      By monotonicity, is invertible on . Let be the inverse of ; the distributed estimator is then given by

      (10)

      where is the truncation function defined in (3).

      (b) Otherwise, we have . In this case is a strictly monotone function on (proved in Lemma 7). Denote

      By monotonicity, is invertible on . Let be the inverse of ; the distributed estimator is then given by

      (11)

      where is the truncation function defined in (3).

Case 3: . We only need to use part of the total communication budget, as if we were dealing with the case . To be precise, we can always find so that for and

Then we can implement the procedure introduced in Case 2, letting the -th machine output a transcript of length only.

2.1.2 MODGAME procedure when

When , each machine only needs to output a one-bit measurement to achieve the globally optimal rate, as if there were no communication constraints. Some related results are available in kipnis2019mean. The following procedure is based on the setting where for all . If for some , then one can simply discard all remaining bits so that only one bit is sent by each machine.

Here is the MODGAME procedure when :

  1. The -th machine outputs

  2. The central machine collects and estimates by

    where is the truncation function defined in (3) and is the cumulative distribution function of a standard normal, . Here is the inverse of , and we extend it by defining and . A toy numerical sketch of such a one-bit scheme follows.
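Below is a minimal sketch of a one-bit-per-machine scheme in this spirit, assuming each machine simply reports whether its observation is nonnegative and the central machine averages these bits and inverts the standard normal CDF (via `scipy.stats.norm`). The threshold, the clipping of the empirical frequency, and the truncation interval are illustrative choices, not the exact constants of the MODGAME procedure.

```python
import numpy as np
from scipy.stats import norm

def one_bit_protocol(samples, sigma, trunc=3.0):
    """Each machine sends Z_i = 1{X_i >= 0}; since P(X_i >= 0) = Phi(theta/sigma),
    the central machine averages the bits and inverts the standard normal CDF."""
    bits = (samples >= 0.0).astype(float)            # one bit per machine
    freq = np.clip(bits.mean(), 1e-6, 1 - 1e-6)      # avoid Phi^{-1}(0) or Phi^{-1}(1)
    theta_hat = sigma * norm.ppf(freq)               # invert the standard normal CDF
    return float(np.clip(theta_hat, -trunc, trunc))  # truncate, in the spirit of (3)

rng = np.random.default_rng(2)
theta, sigma, m = 0.4, 1.0, 10000
x = rng.normal(theta, sigma, size=m)
print(one_bit_protocol(x, sigma))                    # close to theta when m is large
```

The design point this toy scheme shares with the procedure above is that a single well-chosen bit per machine has mean equal to a known monotone function of the unknown mean, so averaging many such bits and inverting that function recovers the parameter.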

2.2 Theoretical properties of the MODGAME procedure

Section 2.1 gives a detailed construction of the MODGAME procedure, which satisfies the communication constraints by construction. The following result provides a theoretical guarantee for the statistical performance of MODGAME.

Theorem 1.

For given communication budgets with for , let and let be the MODGAME estimate. Then there exists a constant such that

(12)

An interesting and somewhat surprising feature of the upper bound is that it depends on the communication constraints only through the total budget , not the specific value of , so long as each machine can transmit at least one bit.

2.3 Lower bound analysis and discussions

Section 2.1 gives a detailed construction of the MODGAME procedure and Theorem 1 provides a theoretical guarantee for the estimator. We shall now prove that MODGAME is indeed rate optimal among all estimators satisfying the communication constraints by showing that the upper bound in Equation (12) cannot be improved. More specifically, the following lower bound provides a fundamental limit on the estimation accuracy under the communication constraints.

Theorem 2.

Suppose for all . Let . Then there exists a constant such that