Distributed learning has been an active research area that focuses on solving the learning tasks of different workers through collaboration among them [17, 12, 13]. This learning scheme allows the distributed workers to share knowledge, which enables them to collaboratively learn better models than they would individually. In this context, the communication cost is of significant importance, as the worker nodes usually have power and bandwidth constraints in realistic applications. On the other hand, the learning tasks are usually solved by iterative optimization steps, e.g., stochastic gradient descent (SGD), which can involve frequent transmissions of high-dimensional gradient messages and thus induce high communication overhead.
To alleviate this issue, recent works have studied gradient compression, which employs a low-precision and low-dimensional representation of the gradient vector [1, 18, 11, 3]. Other works perform multiple local learning steps before transmitting the gradient, in order to reduce the total number of communication rounds [15, 14, 9]. However, those methods must strike a balance between model performance and communication cost empirically, and they do not consider an explicit constraint on the communication bits. Therefore, designing the transmission approach under limited communication, i.e., revealing which statistics need to be transmitted from the other nodes to the target one, is vital for taking full advantage of distributed collaborative learning.
In this paper, we study the fundamental problem of information transmission in distributed learning under the information dimensionality constraint, where the setting is summarized as follows. Let
be the random variable of data with domain. We consider that the distributed learning problem has worker nodes, namely node , node , …, node . For node , we assume that training samples are i.i.d. generated from the universal distribution , where denotes the corresponding empirical distribution. In detail, the training process follows the empirical risk minimization (ERM) framework, where each node learns the parameter vector
with respect to the loss function.
Without loss of generality, we take node as the target. As shown in Figure 1, the distributed learning framework can achieve a better parameter vector for node by the following collaboration mechanism. For all the remaining nodes, node () transmits a -dimensional statistic to node , which takes the empirical mean of some statistic function of the samples, i.e., its form is . The restriction on the dimensionality is typically a communication constraint of fixed codeword length when each dimension needs fixed transmission bits.
The purpose of our work is to provide the expressions of the statistic functions such that the learning result of node performs best, where can be recognized as a function of its individual samples and . Moreover, the performance is evaluated by the expected population risk (EPR), i.e., the desired statistic functions can be derived by
where the expectation is taken over the sampling process. In this paper, we consider the asymptotic regime that the sample size of each node is large, and the empirical distribution can be close to the underlying distribution with high probability. As a result, we demonstrate that the EPR can be recognized as a mean square error measured by the EPR norm matrix between the empirical and underlying distribution.
Note that the empirical distribution can be regarded as a Gaussian vector under the asymptotic regime, whose covariance is inversely proportional to the sample size. Accordingly, we prove that the statistic functions correspond to the eigenvectors of the EPR norm matrix. Therefore, designing the optimal information transmission mechanism is transformed into an integer programming problem, which assigns different eigenvectors to the positions of different statistic functions. In particular, we provide the analytical solutions of the cases when and
, where the eigenvectors of the two largest eigenvalues are allocated and a geometric interpretation is given. Moreover, we demonstrate that the statistic functions of nodes with more samples prefer the eigenvectors with larger eigenvalues. This conclusion leads to an algorithm based on partitions of the nodes, which has a lower computational complexity than trivial methods.
Our framework and results differ from previous works in two ways: (1) previous works transmit compressed gradient vectors in each round, while we transmit low-dimensional statistics; (2) previous works involve iterative gradient transmissions, while our goal is to maximize the utility of collaboration between workers with one-shot communication. The contributions of this paper can be summarized as follows. Section II formulates the problem as estimating the underlying distribution under EPR. Section III presents the main theorems describing the properties of the solutions for scalar transmission, and proposes an algorithm based on node partitions. Finally, Section IV extends the results to the case of vector transmission and adapts the algorithm to use high-order partitions.
II-A Asymptotic Approximation
Before presenting the problem formulation, we briefly introduce some convergence results and notations with respect to the empirical distributions. First, for an arbitrary distribution , we define its associated information vector as,
denoted as . Accordingly, information vector follows , and is associated with the corresponding empirical distribution.
In this paper, we concentrate on the local regime where
for all and is small. The empirical distributions are contained in this regime with high probability when the sample sizes are large. Under this regime, we have the following approximation of the Kullback-Leibler (K-L) divergence
where denotes the -norm of vectors.
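The local approximation can be checked numerically. The following is a minimal sketch, assuming the information vector of a distribution is its deviation from a reference distribution divided entrywise by the square root of that reference (the standard local information geometry convention; the exact definition used here is not recoverable from the text above). Under that assumption the K-L divergence between two nearby distributions is approximately half the squared Euclidean distance between their information vectors. The names `kl_divergence`, `local_approximation`, `ref`, `dir_p` and `dir_q` are illustrative only.

```python
import numpy as np

def kl_divergence(p, q):
    """K-L divergence D(p || q) in nats for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def local_approximation(p, q, ref):
    """Half the squared Euclidean distance between the ASSUMED information
    vectors (deviation from `ref`, scaled entrywise by 1 / sqrt(ref))."""
    phi_p = (p - ref) / np.sqrt(ref)
    phi_q = (q - ref) / np.sqrt(ref)
    return 0.5 * float(np.sum((phi_p - phi_q) ** 2))

# Hypothetical reference distribution and two zero-sum perturbation directions.
ref = np.array([0.4, 0.3, 0.2, 0.1])
dir_p = np.array([1.0, -1.0, 0.5, -0.5])
dir_q = np.array([-0.5, 0.5, 1.0, -1.0])

rel_errors = []
for eps in (5e-2, 1e-2, 1e-3):
    p = ref + eps * dir_p
    q = ref + eps * dir_q
    exact = kl_divergence(p, q)
    approx = local_approximation(p, q, ref)
    rel_errors.append(abs(exact - approx) / exact)
    print(f"eps={eps:g}  exact={exact:.3e}  approx={approx:.3e}")
```

The relative gap between the two quantities shrinks as the perturbation size `eps` decreases, which is the sense in which the approximation is local.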
It is well known in information theory  that the probability function follows
By applying the local approximation (4), the probability function follows
which indicates that is approximately a Gaussian vector centered at with covariance matrix .
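A small simulation can illustrate why the covariance scales inversely with the sample size. This is a sketch under the same assumed normalization as before, namely that the empirical information vector is the deviation of the empirical distribution from the true one, divided entrywise by the square root of the true distribution. For multinomial sampling, the covariance of this vector is exactly (I - sqrt(p) sqrt(p)^T) / n, i.e., the identity scaled by 1/n except along the direction sqrt(p), which is fixed by normalization. The distribution `p` and the sizes `n` and `trials` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical underlying distribution
n = 2000                             # samples per empirical distribution
trials = 20000                       # number of independent empirical distributions

counts = rng.multinomial(n, p, size=trials)   # shape (trials, 4)
p_hat = counts / n                            # empirical distributions
phi_hat = (p_hat - p) / np.sqrt(p)            # ASSUMED information-vector normalization

# Empirical covariance rescaled by n, compared with the multinomial prediction.
emp_cov = n * np.cov(phi_hat, rowvar=False)
theory = np.eye(len(p)) - np.outer(np.sqrt(p), np.sqrt(p))
max_gap = float(np.max(np.abs(emp_cov - theory)))
print("largest entrywise gap:", max_gap)
```

Rescaling by `n` removes the dependence on the sample size, so the same comparison should hold for other values of `n`, up to Monte Carlo noise.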
II-B Computation of EPR
Note that the optimal parameter that all the nodes desire can be defined as
Consider that we learn the estimator for with respect to , which is typically the empirical distribution of the corresponding node when no knowledge is transferred from other nodes, i.e., let the learned estimator be
The performance of this estimator is evaluated by the expected population risk (EPR), which is defined as [cf. (1)]
where the expectation is computed by the integral over the Gaussian density functions
, and (constant) is the EPR where the optimal parameter is achieved. Based on this formulation, we have the following characterization of the EPR (8).
Suppose that is twice-differentiable and Lipschitz continuous for , and is unbiased for , the testing loss as defined in (8) can be computed by
where is called EPR norm matrix, whose entries are
The notation and denote the gradient vector and Hessian matrix , and denotes its Moore–Penrose inverse.
This characterization indicates that the purpose of our problem is to find the optimal estimation for with its Gaussian observations under error (10), which can be seen as a mean square error measured by .
When no knowledge is transferred from other nodes, node takes , which has the EPR (high order terms omitted)
When are transmitted from other nodes, we can construct an estimator with a smaller EPR than (12). Let
be the statistic function matrix, so that the statistic from node can be written as . Finally, problem (1) becomes an optimization problem with two steps:
(i) provide the optimal estimator with respect to the empirical vector and the statistics ;
(ii) find the optimal ’s such that the EPR is minimal.
Thus, the following formulation is given
III Scalar Transmission
In this section, we provide the solution of problem (13) in the special case where each node transmits only a scalar to node , i.e., . In other words, the matrix degenerates to a vector, which is denoted as . This special case can be easily extended to the case when , and the result is shown in Section IV.
First, we provide the solution for step (i). Let be the optimal estimator that minimizes the EPR
which is almost a non-Bayesian minimum mean square error (MMSE) estimation problem. Note that , where satisfies . Thus, problem (14) can be viewed as finding the MMSE estimator for the linearly transformed parameter. It is easy to verify that the corresponding observations ’s are still Gaussian vectors. The typical method is to compute the maximum-likelihood estimator (MLE) and then prove its efficiency by the Cramér-Rao bound. The MLE can be computed as follows
where the density function is defined in (9).
Accordingly, the expression of is
Then, we have the following characterization of the optimal estimator .
The next step is to compute the corresponding EPR . Without loss of generality, we assume that the statistic functions satisfy (), and the step (ii) of problem (13) becomes
The following theorem characterizes the property of the solution of this problem.
where denotes the indicator function .
Note that problem (19) is an integer programming problem, which is typically NP-hard, and an analytical solution is hard to provide. However, we can still characterize some properties of its solution and provide an efficient algorithm. Before presenting these results, we first show two simple cases of this problem, which help to build a geometric understanding and interpretation. Specifically, when , the objective function of problem (19) gives , which is consistent with the result in (12).
Proposition 4 (Single-node transmission).
When there is only one node for knowledge transmission, i.e., , problem (19) comes to
Let be the solution of problem (20) and it is easy to verify that . Accordingly, the optimal statistic function is the eigenvector of matrix associated with its largest eigenvalue.
A geometric explanation of this result is depicted in Figure 2. Note that the case when implies that the EPR (12) is the sum of the expected errors along all the eigenvectors of matrix , which are proportional to the corresponding eigenvalues and inversely proportional to the sample size . With the information contained in the scalar , the expected error along is reduced from to . Problem (20) aims at finding the direction along which the maximum error reduction is achieved, and the direction of is the answer.
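The fact that the best single direction is the top eigenvector can be illustrated with a quadratic-form check. This is a minimal sketch, assuming the single-node problem reduces to maximizing a quadratic form of a symmetric positive semidefinite matrix over unit vectors; the matrix `m` below is a random hypothetical stand-in, not the EPR norm matrix of any particular model.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((5, 5))
m = a @ a.T   # hypothetical symmetric PSD stand-in for the EPR norm matrix

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(m)
top_val = float(eigvals[-1])
top_vec = eigvecs[:, -1]

def quad_form(v):
    """Quadratic form v^T m v evaluated on the normalized direction v / ||v||."""
    v = v / np.linalg.norm(v)
    return float(v @ m @ v)

# No randomly drawn unit direction should exceed the top eigenvector.
candidates = rng.standard_normal((1000, 5))
best_random = max(quad_form(v) for v in candidates)
print(f"top eigenvalue={top_val:.4f}  best random direction={best_random:.4f}")
```

`numpy.linalg.eigh` is used rather than `numpy.linalg.eig` because the matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.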
Proposition 5 (Two-node transmission).
When there are two nodes for knowledge transmission, i.e., , there exist two possible strategies
Statistic function and select different directions;
Statistic function and select the same direction.
Then, problem (19) under these two strategies comes to
Without loss of generality, it could be assumed that . The solutions of problem (21a) and (21b) are easy to derive. For strategy (a), the direction of and shall be along and , i.e., the optimal arguments are and ; for strategy (b), similar to Proposition 4, . Additionally, the corresponding EPRs are presented in the following.
Depending on the relationship between the eigenvalues and , the EPR of strategy (a) could be larger or smaller than strategy (b). Thus, the optimal statistic functions of the two nodes are decided by the following test.
A geometric explanation associated with this result can be depicted in Figure 3. When the largest eigenvalue is sufficiently large, the additional information from samples of node 1 and samples of node 2 tends to reduce the population error along the same direction , and otherwise the two nodes are allocated to different directions.
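The comparison between the two strategies can be sketched numerically. The code below relies on an ASSUMED error model that is not taken from the equations of this paper: the expected error along an eigen-direction with eigenvalue lambda is taken to be lambda divided by the total number of samples informing that direction (the target node's own samples plus those of every node assigned to it). The function names, the symbols `n0`, `n1`, `n2`, and all numerical values are hypothetical.

```python
def assumed_epr(eigvals, n0, extra):
    """Sum of lambda_j / (n0 + extra_j).

    ASSUMED error model for illustration only: `n0` is the target node's
    sample size and `extra[j]` is the number of transmitted samples that
    inform direction j.
    """
    return sum(lam / (n0 + e) for lam, e in zip(eigvals, extra))

def two_node_choice(lam1, lam2, n0, n1, n2):
    """Compare strategy (a), different directions, with strategy (b), same direction."""
    epr_a = assumed_epr([lam1, lam2], n0, [n1, n2])
    epr_b = assumed_epr([lam1, lam2], n0, [n1 + n2, 0])
    return ("a" if epr_a <= epr_b else "b"), epr_a, epr_b

# Dominant largest eigenvalue: both nodes are better spent on the same direction.
print(two_node_choice(10.0, 1.0, n0=100, n1=100, n2=50))
# Comparable eigenvalues: splitting the nodes across two directions is better.
print(two_node_choice(2.0, 1.9, n0=100, n1=100, n2=50))
```

Under this assumed model the first call selects strategy (b) and the second selects strategy (a), which matches the qualitative behavior described above.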
These two propositions imply that the information transmission corresponds to allocating different eigenvector directions to different worker nodes. For a general case , the allocation decision depends on the relationship between the eigenvalues of matrix . As the most trivial way to solve this problem, we could try all possible such that (not the set since we at most reduce EPR along the direction of to achieve a larger EPR reduction), which contains possible allocations.
However, the complexity can be reduced by enumerating strategies rather than allocations. As shown in Proposition 5, when , there could be possible allocations but only 2 possible strategies. Moreover, each strategy corresponds to a possible partition of the index set . For instance, strategies (a) and (b) in Proposition 5 correspond to the partitions and . In detail, let be a partition of , and the corresponding strategy assigns the same eigenvector to all statistic functions whose indices lie in the same block, i.e., for all the elements , . Thus, given partition , problem (19) becomes
where denotes the set of all possible permutations of . The solution of problem (24) is given in the following theorem. Without loss of generality, we rank the elements of such that .
Let be the arguments that minimize the objective of problem (24), and then .
With Theorem 6, the solution of problem (19) is obtained by comparing the minimal EPRs over all possible partitions. Let be the collection of all possible partitions of . This result is summarized as Algorithm 1, whose outputs are the statistic functions as desired in problem (17). Moreover, the complexity of Algorithm 1 is the number of possible partitions of , which is called the Bell number , denoted as . It has been found that , which can be smaller than the complexity of trivial methods.
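The search space of a partition-based algorithm can be made concrete by enumerating set partitions. The following is a minimal sketch of that enumeration only, not of Algorithm 1 itself: `set_partitions` and `bell_number` are illustrative helper names, and the last printed column, `k ** k`, is an assumed baseline (each of `k` nodes independently assigned to one of `k` directions) included purely for comparison.

```python
from math import comb

def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or open a new block that contains only `first`.
        yield [[first]] + partial

def bell_number(k):
    """Bell number via the recurrence B(m + 1) = sum_j C(m, j) * B(j)."""
    bell = [1]
    for m in range(k):
        bell.append(sum(comb(m, j) * bell[j] for j in range(m + 1)))
    return bell[k]

for k in range(1, 7):
    n_partitions = sum(1 for _ in set_partitions(list(range(1, k + 1))))
    print(k, n_partitions, bell_number(k), k ** k)
```

The number of enumerated partitions equals the Bell number for each `k`, and it grows much more slowly than the assumed baseline count in the last column.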
IV Vector Transmission
The matrix is assumed to be non-singular here; otherwise, the statistic would be equivalent to a lower-dimensional one. Without loss of generality, we assume that , which follows from the fact that we can apply a linear transformation for arbitrary . Let , and then step (ii) of problem (13) becomes
Let be the optimal solution of (IV), and then
Corollary 7 implies that problem (IV) still amounts to allocating different eigenvector directions to the entries of different statistic functions. Additionally, we can still find the optimal statistic functions with an algorithm similar to Algorithm 1. The only difference is that, for the case of scalar transmission, we consider partitions of the index set , where each index appears once. For the case of vector transmission, we require that each index appear times, where the -th partition is defined as follows.
A -th partition of a set satisfies (1) , (2) , and (3) for all , .
Note that the standard partition in Section III can be viewed as the -th partition of set . With all these results, problem (13) can be solved by finding the optimal -th partition of the index set (). The procedure is summarized in Algorithm 2, where denotes the collection of all possible -th partitions of . The outputs are the collections of required statistic function entries from nodes, whose arrangement in rows gives the solution of problem (13). Finally, the corresponding estimator for the information vector after knowledge transmission is as defined in (IV).
This paper studies the information transmission problem in distributed learning, where the design of the transmitted statistics is related to a singular vector decomposition problem. Under the asymptotic regime, the desired method allocates eigenvectors of the EPR norm matrix to different statistic functions in consideration of the sample sizes and the eigenvalues. Note that this paper provides a general operation approach, and designing corresponding concrete algorithms for model training could be an interesting future direction.
V-A Proof of Proposition 1
We first provide some notations. We define the vector of loss function as
Then, the training loss as defined in (6) can be written as
where we have
It leads to
We can also get the Taylor series of the loss function with
V-B Proof of Theorem 2
The MLE corresponds to the mean square error
For every possible estimator , the Cramér-Rao bound on its mean square error holds:
V-C Proof of Theorem 3
First, we consider the method of Lagrange multipliers and derive the following equations. For all
where are the multipliers. Apparently, satisfies this equation. Note that equation (38) is equivalent to
It implies that vectors are the eigenvectors of .
Note that without loss of generality, can be chosen such that the eigenvalues of this matrix are different from each other. As the eigenvectors of the same matrix, can be parallel or perpendicular to each other. Let be the index set that , . Then, , . With these properties we have
which means is the eigenvector of matrix . Note that the eigenvectors of and are the same, and thus Theorem 3 is proved.
-  (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NIPS), Cited by: §I.
-  (2012) Distributed learning, communication complexity and privacy. In Conference on Learning Theory, pp. 26–1. Cited by: §I.
-  (2020) Qsparse-local-sgd: distributed SGD with quantization, sparsification, and local computations. IEEE J. Sel. Areas Inf. Theory. Cited by: §I.
-  (1948) The arithmetic of Bell and Stirling numbers. American Journal of Mathematics 70 (2), pp. 385–394. Cited by: §III.
-  Improved bounds on Bell numbers and on moments of sums of random variables. Probability and Mathematical Statistics 30 (2), pp. 185–205. Cited by: §III.
-  (2010) . In COMPSTAT, Cited by: §I.
-  (1998) The method of types [information theory]. IEEE Transactions on Information Theory 44 (6), pp. 2505–2523. Cited by: §II-A.
-  An introduction to probability theory and its applications, vol. 2. John Wiley & Sons. Cited by: §III.
-  Local SGD: unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, AISTATS, Cited by: §I.
-  (2022) A survey on federated learning for resource-constrained IoT devices. IEEE Internet Things J. 9 (1), pp. 1–24. Cited by: §I.
-  A linear speedup analysis of distributed deep learning with sparse and quantized communication. In NeurIPS, Cited by: §I.
-  (2020) Distributed optimization: advances in theories, methods, and applications. Springer. Cited by: §I.
-  (2022) From distributed machine learning to federated learning: a survey. Knowl. Inf. Syst. 64 (4), pp. 885–917. Cited by: §I.
-  (2021) Communication-efficient SGD: from local SGD to one-shot averaging. In NeurIPS, Cited by: §I.
-  (2019) Local SGD converges fast and communicates little. In International Conference on Learning Representations, ICLR, Cited by: §I.
-  (1991) Handbook of theoretical computer science (vol. A): algorithms and complexity. MIT Press. Cited by: §III.
-  (2020) A survey on distributed machine learning. ACM Comput. Surv. 53 (2), pp. 30:1–30:33. Cited by: §I.
-  (2018) Gradient sparsification for communication-efficient distributed optimization. In NIPS, Cited by: §I.