An Information-theoretic Method for Collaborative Distributed Learning with Limited Communication

05/13/2022
by Xinyi Tong, et al.
Tsinghua University

In this paper, we study the information transmission problem under the distributed learning framework, where each worker node is only permitted to transmit an m-dimensional statistic to improve the learning result of the target node. Specifically, we evaluate the corresponding expected population risk (EPR) in the regime of large sample sizes. We prove that the performance can be enhanced because the transmitted statistics contribute to estimating the underlying distribution, where the estimation error is a mean square error measured by the EPR norm matrix. Accordingly, the transmitted statistics correspond to the eigenvectors of this matrix, and the desired transmission allocates these eigenvectors among the statistics such that the EPR is minimized. Moreover, we provide analytical solutions of the desired statistics for single-node and two-node transmission, together with a geometric interpretation explaining the eigenvector selection. For the general case, an efficient algorithm that outputs the allocation solution is developed based on node partitions.


I Introduction

Distributed learning has been an active research area that focuses on solving the learning tasks of different workers through collaboration among them [17, 12, 13]. This learning scheme allows the distributed workers to share knowledge so that they can collaboratively learn better models than they would individually. In this context, the communication cost is of particular importance because worker nodes usually face power and bandwidth constraints in practice [10]. Moreover, the learning tasks are usually solved by iterative optimization, e.g., stochastic gradient descent (SGD) [6], which involves frequent transmissions of high-dimensional gradient messages and thus induces high communication overhead.

To alleviate this issue, recent works have studied gradient compression, which employs a low-precision and low-dimensional representation of the gradient vector [1, 18, 11, 3]. Other works perform multiple local learning steps before transmitting gradients to reduce the total number of communication rounds [15, 14, 9]. However, these methods must empirically balance model performance against communication cost, and they do not consider an explicit constraint on communication bits. Therefore, designing the transmission scheme under limited communication, i.e., identifying which statistics the other nodes should transmit to the target node, is vital for taking full advantage of distributed collaborative learning [2].

In this paper, we study the fundamental problem of information transmission in distributed learning under an information dimensionality constraint, where the setting is summarized as follows. Let  be the random variable of the data with domain . We consider a distributed learning problem with  worker nodes, namely node , node , …, node . For node , we assume that the training samples are i.i.d. generated from the underlying distribution , where  denotes the corresponding empirical distribution. The training process follows the empirical risk minimization (ERM) framework, where each node learns the parameter vector  with respect to the loss function .

Fig. 1: An illustration of the distributed learning setting, where each node transmits a statistic to the target node to help it achieve a better parameter.

Without loss of generality, we take node  as the target. As shown in Figure 1, the distributed learning framework can achieve a better parameter vector for node  through the following collaboration mechanism. Each of the remaining nodes, node  (), transmits a -dimensional statistic to node , which is the empirical mean of some statistic function of its samples, i.e., its form is . The restriction on the dimensionality  is essentially a communication constraint of fixed codeword length when each dimension requires a fixed number of transmission bits.

The purpose of our work is to provide the expressions of the statistic functions  such that the learning result of node  performs best, where  can be regarded as a function of its own samples and . The performance is evaluated by the expected population risk (EPR), i.e., the desired statistic functions can be derived from

(1)

where the expectation is taken over the sampling process. In this paper, we consider the asymptotic regime in which the sample size of each node is large, so that the empirical distribution is close to the underlying distribution with high probability. As a result, we show that the EPR can be interpreted as a mean square error between the empirical and underlying distributions, measured by the EPR norm matrix.

Note that the empirical distribution can be regarded as a Gaussian vector in the asymptotic regime, whose covariance is inversely proportional to the sample size. Accordingly, we prove that the statistic functions correspond to the eigenvectors of the EPR norm matrix. Therefore, designing the optimal information transmission mechanism is transformed into an integer programming problem, which assigns different eigenvectors to different statistic functions. In particular, we provide the analytical solutions for the cases  and , where the eigenvectors of the two largest eigenvalues are allocated and a geometric interpretation is given. Moreover, we demonstrate that the statistic functions of the nodes with more samples prefer the eigenvectors with larger eigenvalues. This conclusion leads to an algorithm based on partitions of the  nodes, whose computational complexity is smaller than that of trivial methods.

Our framework and results differ from previous works in two ways: (1) previous works transmit compressed gradient vectors in each round, while we transmit low-dimensional statistics; (2) previous works involve iterative gradient transmissions, while our goal is to maximize the utility of collaboration between workers with one-shot communication. The contributions of this paper can be summarized as follows. Section II formulates this problem as estimating the underlying distribution under the EPR. Section III presents the main theorems describing the properties of the solutions for scalar transmission and proposes the algorithm based on node partitions. Finally, Section IV extends the results to the case of vector transmission and improves the algorithm using higher-order partitions.

II Preliminaries

II-A Asymptotic Approximation

Before presenting the problem formulation, we briefly introduce some convergence results and notation with respect to the empirical distributions. First, for an arbitrary distribution , we define its associated information vector as

(2)

denoted as . Accordingly, information vector follows , and is associated with the corresponding empirical distribution.

In this paper, we concentrate on the local regime where

(3)

for all  and  is small. The empirical distributions are contained in this regime with high probability when the sample sizes are large. Under this regime, we have the following approximation of the Kullback-Leibler (K-L) divergence

(4)

where denotes the -norm of vectors.
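As a concrete instance of this approximation, a standard second-order expansion consistent with the description above (the exact normalization used in (2) and (4) is not reproduced here, so the following is one common convention rather than necessarily the authors' definition) reads

    D(P_1 \| P_2) = \sum_x P_1(x)\log\frac{P_1(x)}{P_2(x)}
                  \approx \frac{1}{2}\sum_x \frac{\big(P_1(x)-P_2(x)\big)^2}{P_2(x)}
                  \approx \frac{1}{2}\,\|\phi_1-\phi_2\|_2^2,

where \phi_i denotes the information vector of P_i, e.g., with entries \phi_i(x) = \big(P_i(x)-P_0(x)\big)/\sqrt{P_0(x)} for a reference distribution P_0 close to both P_1 and P_2.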

It is well known in information theory [7] that the probability function follows

By applying the local approximation (4), the probability function follows

(5)

which indicates that is approximately a Gaussian vector centered at with covariance matrix .
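The Gaussian picture can also be checked numerically. The following small Monte Carlo sketch (not part of the paper; the alphabet, distribution, and sample sizes are illustrative) verifies that the empirical distribution fluctuates around the underlying one like a Gaussian vector whose covariance shrinks as 1/n:

import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])                    # an illustrative underlying distribution
for n in (100, 1000, 10000):
    counts = rng.multinomial(n, P, size=5000)    # 5000 empirical distributions of n samples
    dev = counts / n - P                         # deviations from the underlying distribution
    # n * Cov(P_hat) stays roughly constant (close to diag(P) - P P^T),
    # i.e., the covariance of the empirical distribution scales as 1/n
    print(n, np.round(n * np.cov(dev.T), 3))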

II-B Computation of EPR

Note that the optimal parameter that all the nodes desire can be defined as

(6)

Consider learning an estimator for  with respect to , which is typically the empirical distribution of the corresponding node when no knowledge is transferred from other nodes, i.e., let the learned estimator be

(7)

The performance of this estimator is evaluated by the expected population risk (EPR), which is defined as [cf. (1)]

(8)

where the expectation is computed by the integral over the Gaussian density functions

(9)

and  (a constant) is the EPR achieved at the optimal parameter. Based on this formulation, we have the following characterization of the EPR (8).

Proposition 1.

Suppose that  is twice differentiable and Lipschitz continuous for , and that  is unbiased for . Then the testing loss defined in (8) can be computed as

(10)

where  is called the EPR norm matrix, whose entries are

and

(11)

The notations  and  denote the gradient vector and the Hessian matrix , respectively, and  denotes its Moore–Penrose inverse.

This characterization indicates that our problem amounts to finding the optimal estimate of  from its Gaussian observations under the error (10), which can be seen as a mean square error measured by .
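In generic form (the symbols below are our own stand-ins, since the original notation is not reproduced above), this says that, up to higher-order terms,

    \mathrm{EPR} \;\approx\; L(\theta^{*}) \;+\; \mathbb{E}\big[(\hat{\phi}-\phi)^{\mathsf T}\,\Lambda\,(\hat{\phi}-\phi)\big],

i.e., a constant plus a quadratic (Mahalanobis-type) error between the estimated and underlying information vectors, weighted by the EPR norm matrix \Lambda.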

When no knowledge is transferred from other nodes, node  takes , which has the EPR (higher-order terms omitted)

(12)

When  are transmitted from other nodes, a smaller EPR than (12) can be achieved. Let  be the statistic function matrix, so that the statistic from node  can be written as . Then, problem (1) reduces to an optimization problem with two steps:

  1. provide the optimal estimator with respect to the empirical vector and the statistics ;

  2. find the optimal ’s such that the EPR is minimal.

Thus, the following formulation is given

(13)

III Scalar Transmission

In this section, we provide the solution of problem (13) in the special case where each node transmits only a scalar to node , i.e., . In other words, the matrix  degenerates to a vector, denoted as . This special case can be readily extended to the case where , and the result is presented in Section IV.

First, we provide the solution for step (i). Let be the optimal estimator that minimizes the EPR

(14)

which is essentially a non-Bayesian minimum mean square error (MMSE) estimation problem. Note that , where  satisfies . Thus, problem (14) can be viewed as finding the MMSE estimator for the linearly transformed parameter . It is easy to verify that the corresponding observations  are still Gaussian vectors. The typical approach is to compute the maximum-likelihood estimator (MLE) and then prove its efficiency via the Cramér–Rao bound. The MLE can be computed as follows

(15)

where the density function is defined in (9).

Accordingly, the expression of is

(16)

Then, we have the following characterization of the optimal estimator .

Theorem 2.

The optimal estimator as defined in (14) takes the form of the MLE as defined in (16).
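Theorem 2 says that the fusion step is solved by the Gaussian MLE. Under the simplifying assumption that, in information-vector coordinates, each node's empirical vector is distributed as N(φ, I/n_k) (the text only states that the covariance is inversely proportional to the sample size), the MLE reduces to a generalized least-squares combination of the target node's own observation with the received scalars. The sketch below (function name, signature, and the identity-covariance assumption are ours) illustrates this computation:

import numpy as np

def fuse_mle(phi_hat_1, n_1, ys, fs, ns):
    """Gaussian MLE of phi from the target node's empirical information
    vector phi_hat_1 (sample size n_1) and scalars y_k = f_k^T phi_hat_k
    received from the other nodes (statistic vectors fs, sample sizes ns)."""
    d = phi_hat_1.shape[0]
    # stacked linear observation model: z = A phi + noise
    A = np.vstack([np.eye(d)] + [f[None, :] for f in fs])
    z = np.concatenate([phi_hat_1, np.asarray(ys, dtype=float)])
    # noise variances: 1/n_1 per local coordinate, (f^T f)/n_k per received scalar
    var = np.concatenate([np.full(d, 1.0 / n_1),
                          [f @ f / n for f, n in zip(fs, ns)]])
    W = np.diag(1.0 / var)                    # inverse noise covariance
    # generalized least squares = Gaussian MLE for this linear model
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ z)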

The next step is to compute the corresponding EPR . Without loss of generality, we assume that the statistic functions satisfy (), and step (ii) of problem (13) becomes

(17)

The following theorem characterizes the property of the solution of this problem.

Theorem 3.

Suppose that the eigenvalues and the corresponding eigenvectors of matrix as defined in Proposition 1 are and , where . Let be the optimal arguments of (17), and then

(18)

Theorem 3 indicates that the statistic design searches for suitable eigenvectors such that the EPR is minimized. Let  be the index set such that , and then problem (17) becomes

(19)

where denotes the indicator function [8].

Note that problem (19) is an integer programming problem, which is typically NP-hard [16], so an analytical solution is hard to obtain. However, we can still characterize some properties of its solution and provide an efficient algorithm. Before presenting these results, we first examine two simple cases of this problem, which help build a geometric understanding and interpretation. Specifically, when , the objective function of problem (19) gives , which is consistent with the result in (12).

Proposition 4 (Single-node transmission).

When there is only one node for knowledge transmission, i.e., , problem (19) becomes

(20)

Let  be the solution of problem (20); it is easy to verify that . Accordingly, the optimal statistic function is the eigenvector associated with the largest eigenvalue of the matrix .

A geometric explanation of this result is depicted in Figure 2. Note that the case  implies that the EPR (12) is the sum of the expected errors along all the eigenvectors of the matrix , which are proportional to the corresponding eigenvalues and inversely proportional to the sample size . With the information contained in the scalar , the expected error along  is reduced from  to . Problem (20) aims at finding the direction in which the maximum error reduction is achieved, and obviously the direction of  is the answer.

Fig. 2: A geometric illustration of Proposition 4, where the blue shadow indicates the EPR along different directions in the space of information vectors. The information transmission reduces the expected error along  from  to .
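Computationally, the direction in Proposition 4 is simply the leading eigenvector of the EPR norm matrix. A minimal sketch (the name Lambda is a placeholder for the matrix defined in Proposition 1):

import numpy as np

def top_direction(Lambda):
    # eigh returns eigenvalues in ascending order for a symmetric matrix,
    # so the last column is the eigenvector of the largest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(Lambda)
    return eigvecs[:, -1]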
Proposition 5 (Two-node transmission).

When there are two nodes for knowledge transmission, i.e., , there are two possible strategies:

  1. Statistic functions  and  select different directions;

  2. Statistic functions  and  select the same direction.

Then, problem (19) under these two strategies becomes

(21a)
(21b)

Without loss of generality, it can be assumed that . The solutions of problems (21a) and (21b) are easy to derive. For strategy (a), the directions of  and  should be along  and , i.e., the optimal arguments are  and ; for strategy (b), similar to Proposition 4, . The corresponding EPRs are presented in the following.

(22a)
(22b)

Depending on the relationship between the eigenvalues  and , the EPR of strategy (a) can be larger or smaller than that of strategy (b). Thus, the optimal statistic functions of the two nodes are determined by the following test.

(23)

A geometric explanation of this result is depicted in Figure 3. When the largest eigenvalue  is sufficiently large, the additional information from the samples of node 1 and node 2 should reduce the population error along the same direction ; otherwise, the two nodes are allocated to different directions.

Fig. 3: A geometric illustration of Proposition 5, where strategy (a) leads to error reductions along both  and , while strategy (b) reduces the error only along .

These two propositions imply that the information transmission corresponds to allocating different eigenvector directions to different worker nodes. For the general case , the allocation decision depends on the relationship among the eigenvalues of the matrix . The most trivial way to solve this problem is to try all possible  such that  (not the full set , since to achieve a larger EPR reduction the EPR is reduced along at most the directions of ), which contains  possible allocations.

However, the complexity can be reduced by considering all possible strategies. As shown in Proposition 5, when , there are  possible allocations but only two possible strategies. Moreover, each strategy corresponds to a partition of the index set . For instance, strategies (a) and (b) in Proposition 5 correspond to the partitions  and , respectively. In detail, let  be a partition of ; the corresponding strategy means that the statistic functions whose indices lie in the same element are the same eigenvector, i.e., for all elements , . Thus, given a partition , problem (19) becomes

(24)

where denotes the set of all possible permutations of . The solution of problem (24) is given in the following theorem. Without loss of generality, we rank the elements of such that .

Theorem 6.

Let  be the arguments that minimize the objective of problem (24); then .

1:  Input: , , and
2:  ,
3:  for do
4:   Sort s.t.
5:   
6:   if then
7:    ,
8:  end
9:  for do
10:   for do
11:    
12:   end
13:  end
14:  return
Algorithm 1 Partition Searching Algorithm

With Theorem 6, solving problem (19) reduces to comparing the minimal EPRs over all possible partitions. Let  be the collection of all possible partitions of . The procedure is summarized in Algorithm 1, whose outputs are the statistic functions desired in problem (17). Moreover, the complexity of Algorithm 1 equals the number of possible partitions of , known as the Bell number [4] and denoted as . It has been shown that  [5], which can be smaller than the complexity of trivial methods.
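The structure of Algorithm 1 can be sketched as follows. This is our reconstruction of the search skeleton, not the paper's exact pseudocode: partitions of the transmitting-node index set are enumerated, the blocks of each partition are ranked by total sample size and matched to the eigenvalues in decreasing order as suggested by Theorem 6, and the partition with the smallest EPR is kept; `epr_of_assignment` is a hypothetical callable standing in for the paper's EPR expression.

def partitions(items):
    """Yield all set partitions of `items` (Bell-number many)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # place `first` into each existing block in turn ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or open a new block for it
        yield [[first]] + smaller

def search(node_ids, sample_sizes, eigvals_desc, epr_of_assignment):
    best, best_epr = None, float("inf")
    for part in partitions(node_ids):
        # rank blocks by total sample size, largest first (Theorem 6)
        blocks = sorted(part, key=lambda b: -sum(sample_sizes[i] for i in b))
        # the j-th ranked block is assigned the eigenvector of the
        # j-th largest eigenvalue
        assignment = {i: j for j, block in enumerate(blocks) for i in block}
        epr = epr_of_assignment(assignment, eigvals_desc)
        if epr < best_epr:
            best, best_epr = assignment, epr
    return best, best_epr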

IV Vector Transmission

Similar to the procedures in Section III, we first provide the maximum likelihood estimator to solve step (i) of problem (13) as follows.

(25)

The matrix  is assumed to be non-singular here; otherwise the statistic would be equivalent to a lower-dimensional one. Without loss of generality, we assume that , since we can apply the linear transformation  for arbitrary . Let , and then step (ii) of problem (13) becomes

(26)

Similar to Theorem 3, we have the following characterization of the solution of problem (26).

Corollary 7.

Let  be the optimal solution of (26), and then

(27)

Corollary 7 implies that problem (26) still allocates different eigenvector directions to the entries of different statistic functions. Moreover, we can still find the optimal statistic functions via an algorithm similar to Algorithm 1. The only difference is that in the case of scalar transmission, we consider partitions of the index set , where each index appears once; in the case of vector transmission, we require each index to appear  times, where the -th partition is defined as follows.

Definition 8.

A -th partition of a set satisfies (1) , (2) , and (3) for all , .
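For instance, under the reading suggested by the text (each index appears in exactly m elements of the collection), one 2nd-order partition of the set {1, 2, 3} would be {{1, 2}, {1, 3}, {2, 3}}, and the ordinary partitions of Section III are recovered when m = 1; this is our interpretation, since the precise conditions are those stated in Definition 8.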

1:  Input: , , and
2:  , ,
3:  for do
4:   Sort s.t.
5:   
6:   if then
7:    ,
8:  end
9:  for do
10:   for do
11:    
12:   end
13:  end
14:  return
Algorithm 2 -th Partition Searching Algorithm

Note that the standard partition in Section III can be viewed as the -th partition of the set . With all these results, problem (13) can be solved by finding the optimal -th partition of the index set (). The procedure is summarized in Algorithm 2, where  denotes the collection of all possible -th partitions of . The outputs are the collections of required statistic function entries from the  nodes, whose arrangement in rows gives the solution of problem (13). Finally, the corresponding estimator of the information vector after knowledge transmission is as defined in (25).

V Conclusion

This paper studies the information transmission problem in distributed learning, where the design of the transmitted statistics reduces to an eigenvector allocation problem. Under the asymptotic regime, the desired method allocates the eigenvectors of the EPR norm matrix to different statistic functions, taking the sample sizes and eigenvalues into consideration. Note that this paper provides a general operational approach; designing corresponding concrete algorithms for model training is an interesting direction for future work.

Appendix

V-A Proof of Proposition 1

We first introduce some notation. We define the loss function vector as

(28)

Then, the training loss as defined in (6) can be written as

(29)

where we have

(30)

Similarly,

(31)

Note that

(32)

It leads to

(33)

We can also obtain the Taylor expansion of the loss function as

(34)

Finally, we can compute the testing loss (8)

(35)

which proves Proposition 1.

V-B Proof of Theorem 2

The MLE corresponds to the mean square error

(36)

Meanwhile, for any possible estimator , the Cramér–Rao bound on its mean square error holds:

(37)

Note that the error (36) attains the Cramér–Rao bound (37), and thus Theorem 2 is proved.

V-C Proof of Theorem 3

First, we consider the method of Lagrange multipliers and derive the following equations. For all

(38)

where  are the multipliers. Clearly,  satisfies this equation. Note that equation (38) is equivalent to

(39)

This implies that the vectors  are eigenvectors of .

Note that, without loss of generality,  can be chosen such that the eigenvalues of this matrix are distinct. As eigenvectors of the same matrix,  can only be parallel or orthogonal to each other. Let  be the index set such that , . Then , . With these properties, we have

(40)

which means that  is an eigenvector of the matrix . Note that the eigenvectors of  and  are the same, and thus Theorem 3 is proved.

References

  • [1] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NIPS).
  • [2] M. F. Balcan, A. Blum, S. Fine, and Y. Mansour (2012) Distributed learning, communication complexity and privacy. In Conference on Learning Theory, pp. 26–1.
  • [3] D. Basu, D. Data, C. Karakus, and S. N. Diggavi (2020) Qsparse-local-SGD: distributed SGD with quantization, sparsification, and local computations. IEEE J. Sel. Areas Inf. Theory.
  • [4] H. Becker and J. Riordan (1948) The arithmetic of Bell and Stirling numbers. American Journal of Mathematics 70 (2), pp. 385–394.
  • [5] D. Berend and T. Tassa (2010) Improved bounds on Bell numbers and on moments of sums of random variables. Probability and Mathematical Statistics 30 (2), pp. 185–205.
  • [6] L. Bottou (2010) Large-scale machine learning with stochastic gradient descent. In COMPSTAT.
  • [7] I. Csiszár (1998) The method of types [information theory]. IEEE Transactions on Information Theory 44 (6), pp. 2505–2523.
  • [8] W. Feller (2008) An Introduction to Probability Theory and Its Applications, Vol. 2. John Wiley & Sons.
  • [9] E. Gorbunov, F. Hanzely, and P. Richtárik (2021) Local SGD: unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • [10] A. Imteaj, U. Thakker, S. Wang, J. Li, and M. H. Amini (2022) A survey on federated learning for resource-constrained IoT devices. IEEE Internet Things J. 9 (1), pp. 1–24.
  • [11] P. Jiang and G. Agrawal (2018) A linear speedup analysis of distributed deep learning with sparse and quantized communication. In NeurIPS.
  • [12] H. Li, Q. Lü, Z. Wang, X. Liao, and T. Huang (2020) Distributed Optimization: Advances in Theories, Methods, and Applications. Springer.
  • [13] J. Liu, J. Huang, Y. Zhou, X. Li, S. Ji, H. Xiong, and D. Dou (2022) From distributed machine learning to federated learning: a survey. Knowl. Inf. Syst. 64 (4), pp. 885–917.
  • [14] A. Spiridonoff, A. Olshevsky, and Y. Paschalidis (2021) Communication-efficient SGD: from local SGD to one-shot averaging. In NeurIPS.
  • [15] S. U. Stich (2019) Local SGD converges fast and communicates little. In International Conference on Learning Representations (ICLR).
  • [16] J. van Leeuwen (1991) Handbook of Theoretical Computer Science (Vol. A): Algorithms and Complexity. MIT Press.
  • [17] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer (2020) A survey on distributed machine learning. ACM Comput. Surv. 53 (2), pp. 30:1–30:33.
  • [18] J. Wangni, J. Wang, J. Liu, and T. Zhang (2018) Gradient sparsification for communication-efficient distributed optimization. In NIPS.