Theoretical Limits of One-Shot Distributed Learning

05/12/2019
by   Saber Salehkaleybar, et al.

We consider a distributed system of m machines and a server. Each machine draws n i.i.d. samples from an unknown distribution and sends a message of bounded length b to the server. The server then collects the messages from all machines and estimates a parameter that minimizes an expected loss. We investigate the impact of the communication constraint, b, on the expected error, and derive lower bounds on the best error achievable by any algorithm. As our main result, for general values of b, we establish an Ω̃((mb)^(-1/max(d,2)) n^(-1/2)) lower bound on the expected error, where d is the dimension of the parameter space. Moreover, for constant values of b and under the extra assumption n = 1, we show that the expected error remains lower bounded by a constant, even when m tends to infinity.



1 Introduction

Consider a set of $m$ machines, each of which has access to $n$ i.i.d. sample functions drawn from an unknown distribution $P$ over a collection $\mathcal{F}$ of differentiable convex functions with Lipschitz continuous first-order derivatives. Each machine sends a message of a certain length, based on its own data, to a central server. The server tries to estimate the parameter $\theta^*$ that minimizes the expected loss $F(\theta) = \mathbb{E}_{f \sim P}[f(\theta)]$.

The problem becomes trivial if there is no limit on the number of bits transmitted between the machines and the server, since each machine can then encode and send all of its data to the server. It is commonly known that the centralized solution, which has all $mn$ samples at the server, achieves an expected error of order $(mn)^{-1/2}$ (Lehmann and Casella, 2006; Zhang et al., 2013). Thus, $(mn)^{-1/2}$ is a trivial lower bound on the expected error of any distributed solution with communication constraints.

Fundamental limits on the expected error of any distributed algorithm have been studied in several works. Shamir (2014) considered various communication constraints and showed that no distributed algorithm can match the performance of the centralized solution unless the communication budget per machine is sufficiently large. For the problem of sparse linear regression, Braverman et al. (2016) proved that any algorithm achieving the optimal minimax squared error must communicate a total number of bits, from the machines to the server, that grows with the problem size. Zhang et al. (2013) derived an information-theoretic lower bound on the minimax error of parameter estimation in the presence of communication constraints. They showed that, in order to achieve the same precision as the centralized solution for estimating the mean of a $d$-dimensional Gaussian distribution, the machines must transmit at least $\Omega(dm/\log m)$ bits in total. Garg et al. (2014) improved this bound to $\Omega(dm)$ bits using direct-sum theorems (Chakrabarti et al., 2001).

As mentioned before, the distributed setting is non-trivial only if the communication budget per machine is assumed to be bounded. We explore the impact of different communication constraints on the expected error, ranging from a constant number of bits per message to arbitrary message lengths. For the case in which the length of each message is bounded by a constant independent of $m$ and $n$, and under the assumption $n = 1$, we show that no algorithm has expected error better than a universal constant, even if $m$ goes to infinity. This shows that an abundance of machines does not improve the accuracy of estimation if the message lengths and the number of observations per machine are bounded by constants.

As the main result of this paper, for general message length $b$, we establish a lower bound of order $\tilde{\Omega}\big((mb)^{-1/\max(d,2)}\, n^{-1/2}\big)$ on the estimation error. In the special case that $b$ is bounded by a constant, this shows that for $d > 2$ no estimator can achieve the accuracy of a centralized solution for large values of $m$. This might appear to contradict the fact that the averaging method in (Zhang et al., 2012) has estimation error no larger than $O\big((mn)^{-1/2}\big)$ when $n \geq m$. To resolve this ambiguity, note that the analysis in (Zhang et al., 2012) relies on Lipschitz continuity of the second derivative, whereas we only assume first-order differentiability with Lipschitz continuous first derivatives.

In the rest of the paper, we first discuss the system model in Section 2. We then present our main results in Section 3. In Section 4 we prove the lower bound for constant number of bits per transmission, and in Section 5 we give the proof of our main lower bound for general communication constraints. We discuss our results in Section 6.

2 Problem Definition

Consider a collection $\mathcal{F}$ of convex functions defined over a bounded convex domain in $\mathbb{R}^d$. Suppose that $P$ is an unknown probability distribution over the functions in $\mathcal{F}$. The expected loss is defined as follows:

$$F(\theta) \;=\; \mathbb{E}_{f \sim P}\big[f(\theta)\big]. \qquad (1)$$

We want to estimate the parameter $\theta^*$ that minimizes $F$: $\theta^* = \arg\min_{\theta} F(\theta)$.

The expected loss is to be minimized in a distributed fashion, as follows. Consider $m$ identical machines, each of which can directly send messages to a central server. Each machine $i$ observes $n$ i.i.d. samples drawn according to the probability distribution $P$. It processes the observed data and transmits a signal $Y^i$ of length $b$ bits to the server. At the server side, an estimate of $\theta^*$, which is denoted by $\hat{\theta}$, is computed based on the received signals $Y^1, \dots, Y^m$.
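To make the communication pattern concrete, the following sketch (Python; the helper names and the quantized-mean scheme are illustrative choices, not the paper's method) shows the one-shot structure: each machine maps its $n$ local samples to a single $b$-bit message, and the server maps the $m$ received messages to an estimate.

```python
import numpy as np

def one_shot_round(samples_per_machine, b, local_message, server_estimate):
    """Generic one-shot protocol: each machine sends one b-bit message,
    then the server forms a single estimate from the m received messages."""
    messages = []
    for samples in samples_per_machine:
        msg = local_message(samples, b)
        assert 0 <= msg < 2 ** b, "each message must fit in b bits"
        messages.append(msg)
    return server_estimate(messages)

# Toy instantiation (not the paper's scheme): every machine sends a b-bit uniform
# quantization of its local sample mean, and the server averages the decoded values.
def quantized_mean_message(samples, b, lo=-1.0, hi=1.0):
    levels = 2 ** b
    x = float(np.clip(np.mean(samples), lo, hi))
    return int(round((x - lo) / (hi - lo) * (levels - 1)))

def averaging_server(messages, b, lo=-1.0, hi=1.0):
    levels = 2 ** b
    return float(np.mean([lo + msg * (hi - lo) / (levels - 1) for msg in messages]))

rng = np.random.default_rng(0)
m, n, b = 1000, 10, 8
data = [rng.normal(0.3, 1.0, size=n) for _ in range(m)]
theta_hat = one_shot_round(
    data, b,
    local_message=quantized_mean_message,
    server_estimate=lambda msgs: averaging_server(msgs, b),
)
print(theta_hat)  # close to 0.3, the minimizer of the quadratic expected loss in this toy
```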

Assumption 1

Throughout the paper, we assume that $\mathcal{F}$ and $P$ satisfy the following conditions:

  • Every $f \in \mathcal{F}$ is convex and once differentiable.

  • The distribution $P$ is such that $F$ (defined in (1)) is strongly convex. More concretely, there is a constant $\lambda > 0$ such that for any $\theta_1, \theta_2$ in the domain, $F(\theta_2) \geq F(\theta_1) + \nabla F(\theta_1)^\top (\theta_2 - \theta_1) + \frac{\lambda}{2} \lVert \theta_2 - \theta_1 \rVert^2$.

  • Each $f \in \mathcal{F}$ has bounded and Lipschitz continuous derivatives. In particular, there are universal constants bounding $\lvert f(\theta) \rvert$, $\lVert \nabla f(\theta) \rVert$, and the Lipschitz constant of $\nabla f$, uniformly over all $f \in \mathcal{F}$ and all $\theta$ in the domain; we denote the Lipschitz constant of the derivatives by $L$.

  • The minimizer of $F$ lies inside the unit cube. More specifically, there exists $\theta^* \in [-1, 1]^d$ such that $F(\theta^*) = \min_{\theta} F(\theta)$.
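For concreteness, one standard family that satisfies all of these conditions (an illustrative example, not taken from the paper) is the set of quadratics centered at points of the cube:

$$f_a(\theta) \;=\; \tfrac{1}{2}\,\lVert \theta - a \rVert^2, \qquad a \in [-1, 1]^d,$$

with $P$ any distribution over the centers $a$. Each $f_a$ is convex and differentiable, its gradient $\nabla f_a(\theta) = \theta - a$ is bounded on the cube and 1-Lipschitz, and the expected loss $F(\theta) = \mathbb{E}_a[f_a(\theta)] = \tfrac{1}{2}\lVert \theta - \mathbb{E}[a] \rVert^2 + \tfrac{1}{2}\operatorname{tr}\operatorname{Cov}(a)$ is 1-strongly convex with minimizer $\theta^* = \mathbb{E}[a] \in [-1, 1]^d$.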

3 Main Results

In this section, we present lower bounds on the expected error under various communication constraints. We have already argued that our distributed system reduces to a centralized system once communication constraints are removed and the signals are allowed to be of unbounded length: in this case, each machine can encode all of its observations into its signal and send them to the server. In this section, we quantify the impact of communication constraints on the expected error.

We first consider the case in which the number of bits per signal transmission is limited to a constant, independent of $m$ and $n$. Our first proposition shows that when $n = 1$, the expected error is lower bounded by a constant, even if $m$ goes to infinity. The proof is given in Section 4.

Proposition 3. Let $n = 1$ and suppose that the number of bits per signal, $b$, is limited to a constant. Then, there is a distribution $P$ over $\mathcal{F}$ such that the expected error $\mathbb{E}\big[\lVert \hat{\theta} - \theta^* \rVert\big]$ of any randomized estimator $\hat{\theta}$ is lower bounded by a constant, for all $m$.

In the proof, the distribution $P$ associates non-zero probabilities to a finite set of polynomials of bounded degree. The proposition shows that the expected error is bounded from below by a constant regardless of $m$, when $n = 1$ and $b$ is a constant. We conjecture that the result can be extended to all constant values of $n$.

We now turn our focus to the case of general message length $b$. As the main result of the paper, we show in the next theorem that in a system with $m$ machines, $n$ samples per machine, and $b$ bits per signal transmission, no estimator can achieve estimation error less than $\tilde{\Omega}\big((mb)^{-1/\max(d,2)}\, n^{-1/2}\big)$. Recall that $d$ is the dimension of the domain of the functions in $\mathcal{F}$.

Theorem 3. Suppose that Assumption 1 is in effect, together with a mild technical condition on the problem parameters (discussed below). Then, for any estimator $\hat{\theta}$, there exists a probability distribution $P$ over $\mathcal{F}$ such that

$$\mathbb{E}\big[\lVert \hat{\theta} - \theta^* \rVert\big] \;=\; \tilde{\Omega}\Big((mb)^{-1/\max(d,2)}\, n^{-1/2}\Big). \qquad (2)$$

The proof is given in Section 5. The extra technical condition in the statement of the theorem appears to be innocuous, and is merely aimed at facilitating the proofs: whenever it fails, the right-hand side of (2) is of constant order, while in light of Assumption 1 (which confines $\theta^*$ to the unit cube) the estimation error of even a trivial estimator is no larger than a constant.

As an immediate corollary, we obtain a lower bound on the moments of the estimation error: for any estimator $\hat{\theta}$, there exists a probability distribution $P$ such that for any $q \geq 1$,

$$\mathbb{E}\big[\lVert \hat{\theta} - \theta^* \rVert^q\big] \;=\; \tilde{\Omega}\Big(\big((mb)^{-1/\max(d,2)}\, n^{-1/2}\big)^{q}\Big).$$

Note that in a centralized system, the estimation error scales as $(mn)^{-1/2}$. Combined with Theorem 3, we obtain the following lower bound on the estimation error:

$$\mathbb{E}\big[\lVert \hat{\theta} - \theta^* \rVert\big] \;=\; \tilde{\Omega}\Big(\max\big\{\, (mb)^{-1/\max(d,2)}\, n^{-1/2},\; (mn)^{-1/2} \,\big\}\Big). \qquad (3)$$

In view of (3), for $d > 2$ no estimator can achieve the performance of a centralized solution unless $b$ grows polynomially with $m$. As discussed earlier in the Introduction, this is in contrast with the result in Zhang et al. (2012) that a simple averaging algorithm achieves accuracy $O\big((mn)^{-1/2}\big)$ (similar to a centralized solution) for $n \geq m$. This apparent contradiction is resolved by the difference in the sets of functions considered in the two works. The functions in Zhang et al. (2012) are twice differentiable with Lipschitz continuous second derivatives, while we do not assume Lipschitz continuity of the second derivatives.
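For contrast with the lower bound, here is a minimal sketch of the kind of one-shot averaging scheme analyzed in Zhang et al. (2012): each machine solves its local empirical problem and the server averages the local minimizers. The quadratic toy losses and all concrete values below are illustrative assumptions, not the setting of that paper.

```python
import numpy as np

def local_erm_minimizer(samples):
    """Local empirical risk minimizer for the toy quadratic losses
    f_a(theta) = 0.5 * ||theta - a||^2: simply the sample mean."""
    return np.mean(samples, axis=0)

def averaging_estimator(samples_per_machine):
    """One-shot averaging: the server averages the m local minimizers."""
    return np.mean([local_erm_minimizer(s) for s in samples_per_machine], axis=0)

rng = np.random.default_rng(1)
m, n, d = 100, 200, 3            # n >= m: the regime where averaging matches the centralized rate
theta_star = np.array([0.2, -0.4, 0.1])
data = [theta_star + rng.normal(0.0, 1.0, size=(n, d)) for _ in range(m)]
print(np.linalg.norm(averaging_estimator(data) - theta_star))   # roughly of order 1/sqrt(m*n)
```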

4 Proof of Proposition 3

Let $\mathcal{F}_\lambda$ be the sub-collection of functions in $\mathcal{F}$ that are $\lambda$-strongly convex. Consider $k$ convex polynomial functions $f_1, \dots, f_k$ in $\mathcal{F}_\lambda$ with distinct minimizers, and consider a probability distribution over these functions that, for each $i$, associates probability $p_i$ to function $f_i$. With an abuse of notation, we also use $p$ for the vector with entries $p_1, \dots, p_k$. Since $n = 1$, each machine observes only one of the $f_i$'s, and it can send a $b$-bit signal, i.e., one of $2^b$ possible messages of length $b$ bits. As a general randomized strategy, suppose that each machine sends the $j$-th message with probability $q_{ij}$ when it observes function $f_i$. Let $Q$ be the $k \times 2^b$ matrix with entries $q_{ij}$. Then, each machine sends the $j$-th message with probability $(Q^\top p)_j = \sum_{i=1}^{k} p_i\, q_{ij}$.

At the server side, we only observe the number (or frequency) of occurrences of each message. In view of the law of large numbers, as $m$ goes to infinity, the frequency of the $j$-th message tends to $(Q^\top p)_j$, for all $j$. Thus, in the case of infinitely many machines, the entire information contained in all transmitted signals is captured by the vector $Q^\top p$.
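A small sketch (Python; the values of $k$, $b$, $p$, and $Q$ are arbitrary illustrations) makes this collapse explicit: in the infinite-$m$ limit, the server's entire view reduces to the message-frequency vector $Q^\top p$.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 5          # number of candidate functions f_1, ..., f_k
b = 2          # bits per message, so 2**b possible messages
num_msgs = 2 ** b

# p[i]   : probability that a machine observes function f_i  (n = 1 sample each)
# Q[i, j]: probability that a machine sends message j when it observed f_i
p = rng.dirichlet(np.ones(k))
Q = rng.dirichlet(np.ones(num_msgs), size=k)

# Exact limiting frequencies seen by the server as m -> infinity.
limit_freq = Q.T @ p

# Empirical check with finitely many machines: the observed message frequencies
# concentrate around Q^T p, which is all the server ever learns about p.
m = 50_000
observed_fns = rng.choice(k, size=m, p=p)
messages = np.array([rng.choice(num_msgs, p=Q[i]) for i in observed_fns])
empirical_freq = np.bincount(messages, minlength=num_msgs) / m

print(limit_freq)
print(empirical_freq)   # close to limit_freq for large m
```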

Let $\hat{\theta}(\cdot)$ denote the estimator at the server, which takes the vector $Q^\top p$ and outputs an estimate of the minimizer of the expected loss $F_p(\theta) = \sum_{i=1}^{k} p_i f_i(\theta)$. We also let $\theta^*(p)$ denote the optimal solution (i.e., the minimizer of $F_p$). In the following, we show that the expected error is lower bounded by a universal constant, for all matrices $Q$ and all estimators $\hat{\theta}$.

We say that a probability vector $p$ is central if all of its entries are bounded away from zero; more precisely, if

$$p_i \;\geq\; \frac{1}{2k}, \qquad \text{for all } i \in \{1, \dots, k\}. \qquad (4)$$

Let $\mathcal{P}_{\mathrm{c}}$ be the collection of central vectors $p$. We define two constants $c_1 < c_2$ such that, for any central $p$, the minimizer of $F_p$ lies in the interval $[c_1, c_2]$; since the functions $f_i$ have distinct minimizers, this interval is non-degenerate. Let

$$\epsilon \;\triangleq\; \min_{\lVert v \rVert = 1} \; \max_{\theta \in [c_1, c_2]} \Big\lvert \sum_{i=1}^{k} v_i\, f_i'(\theta) \Big\rvert, \qquad (5)$$

where the minimum is over unit-norm vectors $v \in \mathbb{R}^k$. We now show that $\epsilon > 0$. In order to draw a contradiction, suppose that $\epsilon = 0$. In this case, there exists a nonzero vector $v$ such that the polynomial $\sum_i v_i f_i'(\theta)$ is equal to zero for all $\theta \in [c_1, c_2]$. On the other hand, it follows from the definition of the $f_i$'s that, for any nonzero vector $v$, $\sum_i v_i f_i'$ is a nonzero polynomial of bounded degree. As a result, the fundamental theorem of algebra (Krantz, 2012) implies that this polynomial has finitely many roots, and hence it cannot be zero over the entire interval $[c_1, c_2]$. This contradicts the earlier statement that the polynomial of interest equals zero throughout $[c_1, c_2]$. Therefore, $\epsilon > 0$.

Let $u$ be a vector of length $k$ such that $Q^\top u = 0$, $\mathbf{1}^\top u = 0$, and $\lVert u \rVert = 1$. Note that such a $u$ exists and lies in the null-space of the matrix $[\,Q \;\; \mathbf{1}\,]^\top$, where $\mathbf{1}$ is the vector of all ones; this null-space is non-trivial whenever $k \geq 2^b + 2$. Let $\theta_1$ be the solution of the following optimization problem,

$$\theta_1 \;\in\; \arg\max_{\theta \in [c_1, c_2]} \Big\lvert \sum_{i=1}^{k} u_i\, f_i'(\theta) \Big\rvert,$$

and assume that $p$ is a central vector such that $\theta^*(p) = \theta_1$. Then, it follows from (5) that

$$\Big\lvert \sum_{i=1}^{k} u_i\, f_i'(\theta_1) \Big\rvert \;\geq\; \epsilon. \qquad (6)$$

Let $p' = p + \gamma u$, for a sufficiently small constant $\gamma > 0$. Then, from the conditions in (4) and the fact that $\mathbf{1}^\top u = 0$, we can conclude that $p'$ is a probability vector. Furthermore, based on the definition of $u$,

$$Q^\top p' \;=\; Q^\top p. \qquad (7)$$

It then follows from (6) that

$$\big\lvert F_{p'}'(\theta_1) \big\rvert \;=\; \Big\lvert F_{p}'(\theta_1) + \gamma \sum_{i=1}^{k} u_i\, f_i'(\theta_1) \Big\rvert \;=\; \gamma\, \Big\lvert \sum_{i=1}^{k} u_i\, f_i'(\theta_1) \Big\rvert \;\geq\; \gamma\,\epsilon, \qquad (8)$$

where the last equality is due to the fact that $\theta_1$ minimizes $F_p$, so that $F_p'(\theta_1) = 0$.

Let $\theta_2$ be the minimizer of $F_{p'}$. Then,

$$F_{p'}'(\theta_2) \;=\; 0. \qquad (9)$$

Furthermore, by Assumption 1, for any $f \in \mathcal{F}$ and any $\theta, \theta'$, we have $\lvert f'(\theta) - f'(\theta') \rvert \leq L\, \lvert \theta - \theta' \rvert$. Consequently, $F_{p'}'$ is $L$-Lipschitz, for all probability vectors $p'$. It follows that

$$\lvert \theta_1 - \theta_2 \rvert \;\geq\; \frac{1}{L}\, \big\lvert F_{p'}'(\theta_1) - F_{p'}'(\theta_2) \big\rvert \;=\; \frac{1}{L}\, \big\lvert F_{p'}'(\theta_1) \big\rvert \;\geq\; \frac{\gamma\,\epsilon}{L},$$

where the last two relations are due to (9) and (8), respectively. On the other hand, the server receives exactly the same frequency vector under $p$ and $p'$, where this equality of inputs follows from (7); hence it produces the same (possibly randomized) estimate in both cases, even though the minimizers $\theta^*(p) = \theta_1$ and $\theta^*(p') = \theta_2$ are at least $\gamma\epsilon/L$ apart. Therefore, the estimation error exceeds $\gamma\epsilon/(2L)$ for at least one of the probability vectors $p$ or $p'$. This completes the proof of Proposition 3.

5 Proof of Theorem 3

In this section, we present the proof of Theorem 3. The high-level idea is that if there is an algorithm that finds the minimizer of $F$ to within a small error with high probability, then there is an algorithm that finds a fine approximation of $F$ over a small neighborhood of its minimizer. The key steps of the proof are as follows.

We first consider a sub-collection of $\mathcal{F}$ such that for any pair of distinct functions in the sub-collection, there is a point in the $\delta$-neighborhood of the minimizer at which the two functions are well separated. We develop a metric-entropy based framework to show that such a collection exists and can contain as many as $2^{\tilde{\Omega}(\delta^{-d})}$ functions. Consider a constant $\delta > 0$ and suppose that there exists an estimator that finds a $\delta$-approximation of $\theta^*$ with high probability, for all distributions. We generate a distribution that associates a constant probability to an arbitrary function $f$ in the sub-collection, while distributing the remaining probability unevenly over a set of linear functions. The a priori unknown probabilities of these linear functions can displace the minimizer of $F$ within a $\delta$-neighborhood. Capitalizing on this observation, we show that the server needs to obtain a fine approximation of $f$ over this $\delta$-neighborhood; otherwise, the server could mistake $f$ for another function $g$ in the sub-collection, which would lead to a $\delta$-order error in $\hat{\theta}$ for specific choices of the probability distribution over the linear functions. Therefore, the server needs to distinguish which function out of the functions in the sub-collection has positive probability in $P$. Using information-theoretic tools (Fano's inequality (Cover and Thomas, 2012)), we conclude that the total number of bits delivered to the server (i.e., roughly $mb$ bits) must exceed the logarithm of the size of the sub-collection (i.e., roughly $\delta^{-d}$). This implies that $mb = \tilde{\Omega}(\delta^{-d})$, and hence no estimator has error less than $\tilde{\Omega}\big((mb)^{-1/d}\big)$.
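The counting step at the end of this outline can be summarized as follows; this is a rough sketch with constants and logarithmic factors suppressed, writing $N_\delta$ for the size of the sub-collection (notation introduced only here). If the server can identify the planted function among the $N_\delta \approx 2^{\delta^{-d}}$ candidates from the roughly $mb$ received bits, then Fano's inequality forces

$$mb \;\gtrsim\; \log_2 N_\delta \;\approx\; \delta^{-d} \qquad \Longrightarrow \qquad \delta \;\gtrsim\; (mb)^{-1/d},$$

so no estimator can have error much smaller than $(mb)^{-1/d}$; the additional $n^{-1/2}$ factor in Theorem 3 reflects the sampling noise of the $n$ observations per machine.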

5.1 Preliminaries

Before going through the details of the proof, in this subsection we present some definitions and auxiliary lemmas. For ease of presentation, throughout this section we assume that the functions in $\mathcal{F}$ are defined over the unit cube. Moreover, we only consider the sub-collection of functions in $\mathcal{F}$ whose derivatives vanish at the origin, i.e., functions $f$ with $\nabla f(\mathbf{0}) = \mathbf{0}$, where $\mathbf{0}$ is the all-zeros vector. Throughout the proof, we fix a constant $\delta > 0$ whose value is given by

(10)

Recall that $\mathcal{F}_\lambda$ is the sub-collection of functions in $\mathcal{F}$ that are $\lambda$-strongly convex.

Definition ($\eta$-packing). Given $\eta > 0$ and a distance $\mathrm{dist}(\cdot, \cdot)$ on $\mathcal{F}_\lambda$, a subset $\mathcal{S} \subseteq \mathcal{F}_\lambda$ is said to be an $\eta$-packing if $\mathrm{dist}(f, g) > \eta$ for any distinct $f, g \in \mathcal{S}$; and for any $h \in \mathcal{F}_\lambda$, there exists $f \in \mathcal{S}$ such that $\mathrm{dist}(h, f) \leq \eta$. We denote an $\eta$-packing of maximum size by $\mathcal{S}_\eta$, and refer to $\log_2 \lvert \mathcal{S}_\eta \rvert$ as the $\eta$-metric entropy.

There exists a constant $c > 0$ such that $\log_2 \lvert \mathcal{S}_\eta \rvert \geq c\, \eta^{-d}$, for all sufficiently small $\eta > 0$. The proof is given in Appendix A. The proof is constructive, and proceeds by devising a set of functions in $\mathcal{F}_\lambda$ as convolutions of a collection of impulse trains with a suitable kernel.
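Metric-entropy arguments of this kind rest on packings. The sketch below shows a generic greedy construction of an $\eta$-packing for a finite set of candidate functions under a sup-norm distance over a neighborhood; the candidate family and the distance are placeholders, not the construction of Appendix A. Note that a maximal greedy packing is automatically a covering, matching the definition above.

```python
import numpy as np

def greedy_packing(candidates, dist, eta):
    """Greedily select an eta-packing: keep a candidate only if it is at
    distance > eta from every function kept so far.  The result is also an
    eta-covering of the candidate set, as in the definition above."""
    packing = []
    for f in candidates:
        if all(dist(f, g) > eta for g in packing):
            packing.append(f)
    return packing

# Placeholder candidates: quadratics theta -> 0.5*(theta - a)**2 indexed by their minimizer a,
# compared through the distance max_{|theta| <= 0.5} |f(theta) - g(theta)|.
grid = np.linspace(-0.5, 0.5, 201)          # points of the neighborhood used for the sup norm
def dist(a1, a2):
    f1 = 0.5 * (grid - a1) ** 2
    f2 = 0.5 * (grid - a2) ** 2
    return np.max(np.abs(f1 - f2))

candidates = np.linspace(-0.5, 0.5, 1000)
packing = greedy_packing(candidates, dist, eta=0.01)
print(len(packing))   # grows as eta shrinks, mirroring the metric-entropy bound
```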

We now define a collection of probability distributions over $\mathcal{F}$.

Definition (Collection of probability distributions). Let $e_j$ denote the vector whose $j$-th entry equals one and all other entries equal zero. Consider two linear functions of the form $\theta \mapsto e_j^\top \theta$ and $\theta \mapsto -e_j^\top \theta$. The collection consists of probability distributions that mix a function $f$ with these linear functions, of the following form:

where $\delta$ is the constant defined in (10). For each $f$, we refer to any such distribution as a corresponding distribution of $f$. Note that for any $f$, there exist infinitely many corresponding distributions. In order to simplify the presentation, in the remainder of the paper we use shorthand notations for the probabilities that a corresponding distribution assigns to $f$ and to the two linear functions.

The following lemma shows that the different distributions corresponding to the same $f$ are close to each other over randomly generated samples, in a certain sense. Consider a function $f$ and two of its corresponding distributions. We draw $n$ i.i.d. samples from the first of them, and let $n_f$, $n_+$, and $n_-$ be the number of samples generated from $f$ and from the two linear functions, respectively. Then, with high probability, the likelihood of the observed triple $(n_f, n_+, n_-)$ under the two distributions is comparable.

The proof is based on Hoeffding's inequality, and is given in Appendix B. For self-containedness, we state Hoeffding's concentration inequality in the form of a lemma.

Lemma (Hoeffding's inequality). Let $X_1, \dots, X_N$ be independent random variables ranging over the interval $[a, b]$, and let $X = \sum_{i=1}^{N} X_i$ and $\mu = \mathbb{E}[X]$. Then, for any $t > 0$,

$$\Pr\big( \lvert X - \mu \rvert \geq t \big) \;\leq\; 2 \exp\!\Big( -\frac{2 t^2}{N (b - a)^2} \Big).$$

In the rest of this subsection, we review a well-known inequality from information theory: Fano's inequality (Cover and Thomas, 2012). Consider a pair of random variables $Z$ and $Y$ with a certain joint probability distribution. Fano's inequality asserts that, given an observation of $Y$, no estimator can recover $Z$ with probability of error $P_e$ less than

$$P_e \;\geq\; \frac{H(Z \mid Y) - 1}{\log_2 \lvert \mathcal{Z} \rvert},$$

where $H(Z \mid Y)$ is the conditional entropy and $\lvert \mathcal{Z} \rvert$ is the size of the probability space of $Z$. In the special case that $Z$ has a uniform marginal distribution, the above inequality further simplifies as follows:

$$P_e \;\geq\; \frac{H(Z \mid Y) - 1}{\log_2 \lvert \mathcal{Z} \rvert} \;=\; \frac{H(Z) - I(Z;Y) - 1}{\log_2 \lvert \mathcal{Z} \rvert} \;=\; 1 - \frac{I(Z;Y) + 1}{\log_2 \lvert \mathcal{Z} \rvert}, \qquad (11)$$

where the first equality uses $H(Z \mid Y) = H(Z) - I(Z;Y)$, and the second uses $H(Z) = \log_2 \lvert \mathcal{Z} \rvert$, since $Z$ has a uniform distribution.
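For the way (11) is used later in the proof, the only additional ingredient is a bound on the mutual information: the observation consists of the $m$ transmitted pairs, each describable by $b' = b + O(\log n)$ bits (the $b$-bit signal plus the observed frequency vector), so that $I(Z;Y) \leq H(Y) \leq m b'$. A sketch of the resulting bound, stated here for convenience (the notation $b'$ is introduced only for this sketch):

$$P_e \;\geq\; 1 - \frac{I(Z;Y) + 1}{\log_2 \lvert \mathcal{Z} \rvert} \;\geq\; 1 - \frac{m b' + 1}{\log_2 \lvert \mathcal{Z} \rvert}.$$

Hence a small probability of error is impossible unless $m b'$, which equals $mb$ up to logarithmic factors, is comparable to $\log_2 \lvert \mathcal{Z} \rvert$; this is the counting step used at the end of the proof of Theorem 3.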

5.2 Proof of Theorem 3

Let $\delta$ be the constant fixed in (10). Suppose that there exists an estimator $\hat{\theta}$ such that, in a system of $m$ machines and $n$ samples per machine, $\hat{\theta}$ has estimation error less than $\delta$ with probability at least a fixed constant, for all distributions satisfying Assumption 1. Note that since $\hat{\theta}$ cannot beat the estimation error of the centralized solution, it follows that

$$\delta \;=\; \Omega\big( (mn)^{-1/2} \big). \qquad (12)$$

We will show that $\delta = \tilde{\Omega}\big( (mb)^{-1/\max(d,2)}\, n^{-1/2} \big)$.

We first improve the confidence of $\hat{\theta}$ via repetitions to obtain an estimator $\hat{\theta}'$, as in the following lemma: there exists an estimator $\hat{\theta}'$ such that, in a system of $m$ machines and $n$ samples per machine, $\hat{\theta}'$ has estimation error of order $\delta$ with probability close to one, for all distributions satisfying Assumption 1. The proof is fairly standard, and is given in Appendix C.
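Confidence boosting of this kind is typically done by running the base estimator on disjoint groups of machines and aggregating the resulting estimates, e.g., through a coordinate-wise median; the sketch below illustrates this standard construction (it is not necessarily the exact procedure of Appendix C).

```python
import numpy as np

def boosted_estimator(base_estimator, machine_data, num_groups):
    """Split the m machines into num_groups disjoint groups, run the base
    one-shot estimator on each group, and return the coordinate-wise median.
    If the base estimator is accurate with probability > 1/2 on each group,
    the median is accurate with probability 1 - exp(-Omega(num_groups))."""
    groups = np.array_split(np.arange(len(machine_data)), num_groups)
    estimates = [base_estimator([machine_data[i] for i in g]) for g in groups]
    return np.median(np.stack(estimates), axis=0)

# Toy usage with a deliberately heavy-tailed base estimator of a 2-D parameter.
rng = np.random.default_rng(3)

def noisy_base(data_subset):
    return np.mean(np.concatenate(data_subset), axis=0) + rng.standard_cauchy(size=2) * 0.05

data = [rng.normal([0.1, -0.2], 1.0, size=(20, 2)) for _ in range(300)]
print(boosted_estimator(noisy_base, data, num_groups=15))   # close to (0.1, -0.2)
```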

For any $f \in \mathcal{S}_\eta$, consider a corresponding probability distribution of $f$, which assigns probabilities to $f$ and to the two linear functions as in the definition above. Suppose that each machine observes $n$ samples from this distribution. Let $n_f$, $n_+$, and $n_-$ be the number of samples generated from $f$ and from the two linear functions, respectively. We refer to $(n_f, n_+, n_-)$ as the observed frequency vector of this particular machine. We denote by $Y^i$ the signal generated by estimator $\hat{\theta}$ (equivalently, by $\hat{\theta}'$) at machine $i$, corresponding to this distribution and the observed frequency vector. Note that $b + O(\log n)$ bits suffice to represent the pair consisting of the observed frequency vector and the signal.

Consider a system of $m$ machines. For any $f$, we define $\mathcal{T}_f$ as the collection of the $m$ pairs (observed frequency vector, transmitted signal) that are generated via the above procedure.

We now present the main technical lemma of this proof. It shows that, employing $\hat{\theta}'$ and given $\mathcal{T}_f$, we can uniquely recover $f$ out of all functions in $\mathcal{S}_\eta$, with high probability: there exists an algorithm that, for any $f \in \mathcal{S}_\eta$, given $\mathcal{T}_f$, outputs an $\hat{f} \in \mathcal{S}_\eta$ such that $\hat{f} = f$ with high probability. Consider the collection of probability distributions defined in Definition 5.1. The high-level idea is as follows. We first show that for any distribution corresponding to $f$, there is a sub-sampling of $\mathcal{T}_f$ such that the sub-sampled pairs are i.i.d. and distributed according to that corresponding distribution. As a result, employing estimator $\hat{\theta}'$, we can find the minimizer of the associated expected loss. We then conclude that, at every point of a suitable covering of the $\delta$-neighborhood, we obtain with high probability a decent approximation of $f$. This enables us to recover $f$ with high probability.

Consider a small cube around the origin, and suppose that $C$ is a minimum $r$-covering of this cube, where the covering radius $r$ is chosen in terms of $\lambda$, the lower bound on the curvature of $F$ (cf. Assumption 1). (By a covering, we mean a set $C$ such that for any point $\theta$ of the cube, there is a point $\theta' \in C$ with $\lVert \theta - \theta' \rVert \leq r$.) A regular grid yields a simple bound on the size of $C$:

$$\lvert C \rvert \;\leq\; \Big( \frac{c'}{r} \Big)^{d}, \qquad (13)$$

where $c'$ is a constant proportional to the side length of the cube.
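The grid construction behind a bound of the form (13) can be written out directly; in the sketch below, a cube of given half side length is covered by a regular grid so that every point is within Euclidean distance eps of a grid point (the particular radius and cube are illustrative choices).

```python
import itertools
import numpy as np

def grid_covering(d, half_side, eps):
    """Regular grid whose cells have side 2*eps/sqrt(d), so every point of the
    cube [-half_side, half_side]^d is within Euclidean distance eps of a grid point."""
    step = 2.0 * eps / np.sqrt(d)
    points_per_axis = int(np.ceil(2.0 * half_side / step)) + 1
    axis = np.linspace(-half_side, half_side, points_per_axis)
    return np.array(list(itertools.product(axis, repeat=d)))

d, half_side, eps = 2, 1.0, 0.05
cover = grid_covering(d, half_side, eps)
# Size scales like (c * half_side * sqrt(d) / eps)^d, matching a bound of the form (13).
print(len(cover), (np.ceil(2 * half_side * np.sqrt(d) / (2 * eps)) + 1) ** d)
```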

Let $P_0$ be the corresponding distribution of $f$ that splits the probability assigned to the two linear functions evenly. Moreover, for any point $\theta' \in C$, consider a corresponding distribution $P_{\theta'}$ of $f$ in which the probabilities of the two linear functions are tilted so that the minimizer of the resulting expected loss is displaced to $\theta'$ (such a choice is possible because $\theta'$ lies in the small cube around the origin and the probabilities involved remain valid).

It follows from Lemma 5.1 that, for any observed frequency vector in $\mathcal{T}_f$, the likelihood of that frequency vector under $P_{\theta'}$ is, with high probability, comparable to its likelihood under $P_0$; this is the content of condition

(14)

We sub-sample $\mathcal{T}_f$ as follows: we discard from $\mathcal{T}_f$ any pair whose observed frequency vector does not satisfy (14); otherwise, if the frequency vector satisfies (14), we keep the pair with a probability proportional to the likelihood ratio between $P_{\theta'}$ and $P_0$. We denote the set of surviving pairs by $\mathcal{T}'_f$.

Claim 1

With high probability, at least a constant fraction of the $m$ pairs survive the above sub-sampling procedure; moreover, the surviving pairs are i.i.d., and their observed frequency vectors are distributed as if the samples had been drawn from $P_{\theta'}$.

The proof is given in Appendix D.1.
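The sub-sampling step is essentially rejection sampling: pairs generated under one distribution are thinned with data-dependent acceptance probabilities so that the survivors look as if they had been generated under another, nearby distribution. The sketch below illustrates the generic mechanism on a toy discrete example; the acceptance rule and distributions are illustrative, not the exact ones used in the proof of Claim 1.

```python
import numpy as np

def rejection_subsample(samples, accept_prob, rng):
    """Keep each sample independently with probability accept_prob(sample).
    If samples are i.i.d. from q and accept_prob(x) is proportional to p(x)/q(x)
    (scaled to lie in [0, 1]), the survivors are i.i.d. from p."""
    return [x for x in samples if rng.random() < accept_prob(x)]

# Toy check: turn draws from a fair 3-sided die q into draws from p = (0.5, 0.3, 0.2).
rng = np.random.default_rng(4)
q = np.array([1 / 3, 1 / 3, 1 / 3])
p = np.array([0.5, 0.3, 0.2])
ratio = p / q
accept = lambda x: ratio[x] / ratio.max()      # scale so the largest ratio accepts w.p. 1

draws = rng.choice(3, size=100_000, p=q)
kept = rejection_subsample(draws, accept, rng)
print(np.bincount(kept, minlength=3) / len(kept))   # approximately (0.5, 0.3, 0.2)
```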

Let $\hat{\theta}'_{\theta'}$ be the output of the server of estimator $\hat{\theta}'$ when the surviving pairs $\mathcal{T}'_f$ are given as input. It follows from Lemma 5.2 and Claim 1 that, with high probability,

$$\big\lVert \hat{\theta}'_{\theta'} - \theta^*_{\theta'} \big\rVert \;\leq\; \delta, \qquad (15)$$

where $\theta^*_{\theta'}$ is the minimizer of the expected loss under $P_{\theta'}$. By repeating this process for different $\theta'$'s, we compute $\hat{\theta}'_{\theta'}$ for all $\theta'$ in $C$. We define the event $E$ as the event that (15) holds simultaneously for all $\theta' \in C$.

Then, the probability of $E$ can be bounded from below through a chain of relations in which the first four steps follow from the union bound, the bound (13) on $\lvert C \rvert$, the definition of $\delta$ in (10), and (12), respectively; in particular, $E$ occurs with high probability.

The algorithm then returns, as its final estimate of $f$, a function $\hat{f} \in \mathcal{S}_\eta$ that best matches the computed values $\hat{\theta}'_{\theta'}$, $\theta' \in C$, in the sense formalized in

(16)

We now bound the error probability of this algorithm, and show that $\hat{f} = f$ with probability close to one.

Claim 2

For any $g \in \mathcal{S}_\eta$ with $g \neq f$, there is a $\theta' \in C$ at which $g$ and $f$ are well separated (in the sense used in (16)).

The proof is given in Appendix D.2.

Suppose that the event $E$ has occurred, and consider a $g \in \mathcal{S}_\eta$ with $g \neq f$. Then, it follows from Claim 2 that there is a $\theta' \in C$ at which $g$ is far from the computed estimates; this is established through a chain of relations whose steps are justified (a) by Claim 2, (b) by the definition of the event $E$, and (c) by the definition of $\delta$ in (10). Therefore, with probability close to one, every $g \in \mathcal{S}_\eta$ with $g \neq f$ is a strictly worse match than $f$ in (16). It then follows from (16) that $\hat{f} = f$ with high probability. This shows that the error probability of the algorithm is small, and completes the proof of Lemma 5.2.

Going back to the proof of Theorem 3, we consider a random variable $Z$ that has a uniform distribution over $\mathcal{S}_\eta$, and a random variable $Y$ whose domain is the set of possible collections of transmitted pairs, distributed as follows: conditioned on $Z = f$, the variable $Y$ is the collection $\mathcal{T}_f$ generated via the above procedure. Based on Lemma 5.2, there exists an estimator which observes $Y$ and returns the correct $Z$ with high probability. Let $P_e$ be the probability of error of this estimator. Then, $P_e$ is smaller than a fixed constant,

(17)

for large enough $m$. On the other hand, it follows from Fano's inequality in (11) that