I Introduction
Interest in data-driven machine learning (ML) is on the rise, but difficulties in securing data still remain [17, 14]. Mission-critical applications aggravate this challenge, as they require a large volume of up-to-date data to cope with local environments in a timely manner, even under extreme events. Mobile devices prevailing at the network edge are a major source of such data, but their user-generated raw data are often privacy-sensitive (e.g., medical records, location history, etc.). In view of this, distributed ML has attracted significant attention, whereby the parameters of each model, such as the weights of a neural network (NN), are exchanged without revealing raw data [17, 14, 18, 12]. However, with deep NN architectures, the communication payload sizes may be too large and hinder the performance of distributed ML, spurring a quest for communication-efficient distributed ML solutions.
Federated Learning (FL). A notable communication-efficient distributed ML framework is FL [15, 19, 9]. Each device, or worker, in FL stores its own dataset and an NN, and locally trains the NN via stochastic gradient descent (SGD). As shown in Fig. 1a, the weight parameters of the local NN are uploaded to a parameter server at a regular interval. The server thereby produces the global average weight parameters, which are downloaded by each worker. These FL operations are summarized by a server-aided centralized architecture, random data sampling per SGD iteration, and periodic communication at an interval of multiple SGD iterations. However, the inherently centralized architecture of FL is ill-suited for mobile devices located far away from the server. Due to their limited transmission power and energy, these devices may easily lose connectivity, calling for decentralized ML methods.

Group ADMM (GADMM). To achieve fast convergence while minimizing the number of communication links under a decentralized architecture, GADMM was proposed in our prior works [6, 5, 7]. Under GADMM, each worker communicates only with its two neighboring workers. To this end, as illustrated in Fig. 1b, GADMM first divides the workers into head and tail groups. The workers in the same group update their model parameters in parallel, while the workers in different groups update their models in an alternating way, after communicating with their neighbors in the other group. Nonetheless, the effectiveness of GADMM was shown only for convex loss functions, without exploring deep NN architectures. What's more, GADMM assumes that every worker identically stores the
full batch data and runs gradient descent (GD) with immediate communication per GD iteration. These assumptions are ill-suited for the user-generated nature of data and for communication efficiency, motivating us to seek a federated and decentralized solution.

Layer-wise Federated GADMM (L-FGADMM). To bridge the gap between FL and GADMM, in this article we propose L-FGADMM, by integrating the periodic communication and random data sampling of FL into GADMM under a deep NN architecture. To further improve communication efficiency, as illustrated in Fig. 1c, L-FGADMM applies a different communication period to each layer. By exchanging the largest layer less frequently than the other layers, our results show that L-FGADMM achieves the same test accuracy while saving average communication cost, compared both to the case using the same communication period for all layers and to FL.
Related Works. Toward improving the communication efficiency of distributed ML under centralized architectures, the number of communication rounds can be reduced by collaboratively adjusting the training momentum [13, 23]. On the other hand, the number of communication links can be decreased by collecting model updates until a time deadline [21], upon values sufficiently changed from the preceding updates [4, 20], or based on channel conditions [16, 22, 3]. Furthermore, the communication payload can be compressed by 1-bit gradient quantization [2], multi-bit gradient quantization [20], or weight quantization with random rotation [11]. Alternatively, instead of model parameters, model outputs can be exchanged for large models via knowledge distillation [8, 1]. Similar principles are applicable to communication-efficient decentralized ML. Without any central entity, communication payload sizes can be reduced by a quantized weight gossiping algorithm [10], though this ignores communication link reduction. Alternatively, the numbers of communication links and rounds can be decreased using GADMM, proposed in our prior work [6]. Furthermore, by integrating stochastic quantization into GADMM, quantized GADMM (Q-GADMM) was proposed to reduce communication rounds, links, and payload sizes altogether [7]. To achieve the same goals, instead of quantization as in Q-GADMM, L-FGADMM applies layer-wise federation to GADMM under deep NN architectures. Combining both quantization and layer-wise federation is deferred to future work.
II Problem Formulation
We consider $N$ workers, each of which stores its own batch of input samples and runs a deep NN model comprising $L$ layers. Hereafter, the subscript $n \in \{1, \dots, N\}$ identifies the workers, and the superscript $(l)$, $l \in \{1, \dots, L\}$, indicates the layers. The $n$-th worker's model parameters (i.e., weights and biases) are denoted as $\theta_n = [\theta_n^{(1)}; \dots; \theta_n^{(L)}]$, where $\theta_n^{(l)}$ is a $d_l$-dimensional vector whose elements are the model parameters of the $l$-th layer.

Every worker has its local loss function $f_n(\theta_n)$, and optimizes its local model such that the global average loss can be minimized, at which all local models reach a consensus on a global model, i.e., $\theta_1 = \theta_2 = \dots = \theta_N$. To solve this problem in parallel, each worker runs a first-order iterative algorithm by selecting a mini-batch at the $k$-th iteration, while communicating with other workers to ensure the consensus across $\theta_1, \dots, \theta_N$. Unfortunately, this consensus requires exchanging entire local models, incurring a huge communication overhead for deep NNs. To reduce the communication payload sizes, we instead consider a per-layer consensus only across neighboring workers, leading to the following problem formulation:
$$\text{Minimize}_{\theta_1, \dots, \theta_N} \quad \sum_{n=1}^{N} f_n(\theta_n) \qquad (1)$$

$$\text{subject to} \quad \theta_n^{(l)} = \theta_{n+1}^{(l)}, \quad \forall l \in \{1, \dots, L\}, \ \forall n \in \{1, \dots, N-1\}. \qquad (2)$$
The constraint (2) implies a per-layer consensus between the $n$-th and $(n+1)$-th workers. This enables layer-wise federation and neighbor-based communication, as elaborated in the next section.
III Proposed Algorithm: L-FGADMM
To solve the problem defined in (1)-(2), in this section we propose L-FGADMM, by extending GADMM proposed in our prior work [6]. Following GADMM (see Fig. 1b), workers in L-FGADMM are divided into head and tail groups, and communicate only with their neighboring workers. Compared to GADMM, L-FGADMM further improves communication efficiency in the following two ways. First, workers in L-FGADMM communicate periodically, as done in FL [15, 19, 9], in contrast to the communication per iteration in GADMM. Second, the communication period of L-FGADMM is adjusted separately for each layer (see Fig. 1c), as opposed to GADMM and FL, which exchange entire models. L-FGADMM can thereby increase the communication periods of large-sized layers, reducing the communication payload size.
To be specific, a physical network topology is converted into a logical worker connectivity graph in L-FGADMM. Then, the workers are split into a head group and a tail group, such that each head worker is connected to its neighboring tail workers. The workers in the same group update their model parameters in parallel, by iterating the mini-batch stochastic gradient descent (SGD) algorithm. After $\tau_l$ iterations, the workers share the updated model parameters of the $l$-th layer with their neighbors. These operations of L-FGADMM are summarized in Algorithm 1 and detailed next.
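The per-layer exchange rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the base period, and the multiplier for the largest layer are assumptions.

```python
def layers_to_exchange(k, num_layers, largest, base_period=5, multiplier=2):
    """Return the layer indices a worker exchanges after SGD iteration k.

    Every layer is shared once per base period, except the largest layer,
    which is shared only every (base_period * multiplier) iterations.
    """
    if k % base_period != 0:
        return []                      # no communication this iteration
    shared = [l for l in range(num_layers) if l != largest]
    if k % (base_period * multiplier) == 0:
        shared.append(largest)         # largest layer, exchanged less often
    return shared

# Example: a 6-layer MLP whose first layer (index 0) is the largest.
print(layers_to_exchange(5, 6, largest=0))    # smaller layers only
print(layers_to_exchange(10, 6, largest=0))   # all layers, incl. layer 0
```

With multiplier 2, the largest layer rides along on every other communication round, which is the "2x" configuration evaluated in Section IV.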
The problem (1)-(2) can be solved by minimizing the following augmented Lagrangian:

$$\mathcal{L}_\rho = \sum_{n=1}^{N} f_n(\theta_n) + \sum_{l=1}^{L} \sum_{n=1}^{N-1} \Big[ \langle \lambda_n^{(l)}, \theta_n^{(l)} - \theta_{n+1}^{(l)} \rangle + \frac{\rho}{2} \big\| \theta_n^{(l)} - \theta_{n+1}^{(l)} \big\|^2 \Big], \qquad (3)$$

where $\lambda_n^{(l)}$ is the $n$-th worker's dual variable of the $l$-th layer, and $\rho > 0$ is a constant penalty term. For the sake of explanation, hereafter the $n$-th worker's model parameter vector $\theta_n^{(l)}$ of the $l$-th layer is called a primal variable. Head and tail workers' primal and dual variables are updated through the following three phases.
1) Head primal updates. Head workers receive the primal variables from their neighboring tail workers, and update the dual variables, at an interval of $\tau_l$ iterations for the $l$-th layer. Between exchanges, these variables are thus fixed at their most recently received values, denoted with hats below. Given these fixed primal and dual variables of its neighbors, at iteration $k$, each head worker runs mini-batch SGD to minimize $\mathcal{L}_\rho$ with respect to its own model, which amounts to approximately solving the $n$-th head worker's model update:

$$\theta_n^{k+1} = \arg\min_{\theta_n} \Big\{ f_n(\theta_n) + \sum_{l=1}^{L} \Big[ \langle \lambda_{n-1}^{(l)}, \hat\theta_{n-1}^{(l)} - \theta_n^{(l)} \rangle + \langle \lambda_n^{(l)}, \theta_n^{(l)} - \hat\theta_{n+1}^{(l)} \rangle + \frac{\rho}{2} \big\| \hat\theta_{n-1}^{(l)} - \theta_n^{(l)} \big\|^2 + \frac{\rho}{2} \big\| \theta_n^{(l)} - \hat\theta_{n+1}^{(l)} \big\|^2 \Big] \Big\}. \qquad (4)$$
After $\tau_l$ iterations, the head worker transmits the updated primal variable of the $l$-th layer to its two tail neighbors, workers $n-1$ and $n+1$.
2) Tail primal updates. Following the same principle as the head primal updates, at the $k$-th iteration, the $n$-th tail worker updates its model using the freshly received head primal variables:

$$\theta_n^{k+1} = \arg\min_{\theta_n} \Big\{ f_n(\theta_n) + \sum_{l=1}^{L} \Big[ \langle \lambda_{n-1}^{(l)}, \hat\theta_{n-1}^{(l)} - \theta_n^{(l)} \rangle + \langle \lambda_n^{(l)}, \theta_n^{(l)} - \hat\theta_{n+1}^{(l)} \rangle + \frac{\rho}{2} \big\| \hat\theta_{n-1}^{(l)} - \theta_n^{(l)} \big\|^2 + \frac{\rho}{2} \big\| \theta_n^{(l)} - \hat\theta_{n+1}^{(l)} \big\|^2 \Big] \Big\}. \qquad (5)$$
After $\tau_l$ iterations, the tail worker transmits the updated primal variable of the $l$-th layer to its two head neighbors, workers $n-1$ and $n+1$.
3) Dual updates. After the updated tail primal variables are exchanged, every worker locally updates its dual variables $\lambda_{n-1}^{(l)}$ and $\lambda_n^{(l)}$ as follows:

$$\lambda_n^{(l)} \leftarrow \lambda_n^{(l)} + \rho \big( \theta_n^{(l)} - \theta_{n+1}^{(l)} \big), \quad \forall l \in \{1, \dots, L\}. \qquad (6)$$
The convergence of the aforementioned primal and dual variable updates is theoretically proved for convex, proper, and smooth loss functions, only when the entire set of layers is exchanged at every iteration [6]. The convergence proof of L-FGADMM with different exchange periods under deep NN architectures is deferred to future work. Meanwhile, the effectiveness of L-FGADMM is empirically corroborated in the next section.
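Although the deep-NN case is validated only empirically, the alternating update pattern itself can be illustrated on a toy convex problem. The sketch below runs GADMM-style head/tail primal updates and local dual updates for scalar quadratic losses on a 4-worker chain, where each primal argmin has a closed form; all values (losses, penalty, iteration count) are illustrative, not from the paper.

```python
import numpy as np

# Toy decentralized consensus via alternating (GADMM-style) updates on a
# 4-worker chain. Worker n holds f_n(x) = 0.5 * (x - a[n])**2, so the
# consensus optimum is mean(a). Heads {0, 2} and tails {1, 3} alternate;
# duals lam[n] live on the chain links (n, n+1).
a = np.array([1.0, 2.0, 3.0, 4.0])
N = len(a)
rho = 1.0
theta = np.zeros(N)          # primal variables, one per worker
lam = np.zeros(N - 1)        # dual variables, one per link

def primal_update(n):
    """Closed-form argmin of the augmented Lagrangian in theta[n],
    with the neighbors' primal and dual variables held fixed."""
    num, den = a[n], 1.0
    if n > 0:                # left-neighbor linear and quadratic terms
        num += lam[n - 1] + rho * theta[n - 1]
        den += rho
    if n < N - 1:            # right-neighbor linear and quadratic terms
        num += -lam[n] + rho * theta[n + 1]
        den += rho
    theta[n] = num / den

for _ in range(300):
    for n in (0, 2):         # head group: updated in parallel
        primal_update(n)
    for n in (1, 3):         # tail group: uses fresh head values
        primal_update(n)
    lam += rho * (theta[:-1] - theta[1:])   # local dual updates

print(theta)                 # all entries approach mean(a) = 2.5
```

Since head workers 0 and 2 are not neighbors of each other (and likewise tails 1 and 3), the sequential loop is equivalent to the parallel in-group updates described above.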
IV Numerical Evaluations
(a) MLP

Layer | Output Shape | #Weights
fc1 | 256 | 200,960
fc2 | 128 | 32,896
fc3 | 64 | 8,256
fc4 | 32 | 2,080
fc5 | 16 | 528
fc6 | 10 | 170
Total | | 244,890

(b) CNN

Layer | Output Shape | #Weights
conv1 | 28x28x8 | 208
conv2 | 14x14x16 | 3,216
conv3 | 7x7x32 | 8,224
fc4 | 400 | 627,600
fc5 | 10 | 4,010
Total | | 643,258
This section validates the performance of L-FGADMM for a classification task, with 4 workers uniformly randomly distributed over a 50x50 m plane. These workers are assigned to the head and tail groups such that the length of the path starting from one worker and passing through all workers is minimized. The simulation settings are elaborated as follows.
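The grouping rule above can be sketched as follows: order the workers along the shortest open path (brute force is fine for 4 workers), then alternate head/tail along that path. This is an illustrative reading of the rule, with made-up coordinates; the paper's actual assignment procedure may differ.

```python
import itertools
import math
import random

# Illustrative worker placement on a 50x50 m plane.
random.seed(0)
pos = [(random.uniform(0, 50), random.uniform(0, 50)) for _ in range(4)]

def path_len(order):
    """Total length of the open path visiting workers in `order`."""
    return sum(math.dist(pos[a], pos[b]) for a, b in zip(order, order[1:]))

# Brute-force the shortest open path over all 4! orderings.
best = min(itertools.permutations(range(len(pos))), key=path_len)

# Alternate groups along the path: even positions head, odd positions tail.
heads, tails = best[0::2], best[1::2]
print(best, heads, tails)
```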
Datasets. We consider the MNIST dataset comprising 28x28 pixel images of handwritten 0-9 digits. The training samples are independent and identically distributed across workers, and each worker utilizes a fixed number of randomly selected samples per mini-batch SGD iteration.
Communication periods. Under L-FGADMM, the communication period of the largest layer is set to a multiple of that of the other layers, which are identically set to 5 iterations. Hereafter, our proposed scheme is referred to as L-FGADMM 1x, 2x, or 4x when the largest layer's period is 1, 2, or 4 times that of the other layers, respectively.
NN architectures. To examine the impact of NN architectures, two different NN models are considered: a multilayer perceptron (MLP) and a convolutional neural network (CNN). As Table I describes, the MLP consists of 6 layers, among which the 1st layer is the largest, holding 82% of the entire model's weight parameters. The CNN comprises 5 layers, among which the 4th layer is the largest, holding 97.6% of the entire model's weight parameters.

Baselines. We compare L-FGADMM with two benchmark schemes: (i) FL running mini-batch SGD while exchanging local gradients every 5 iterations [15]; and (ii) standalone mini-batch SGD with a single worker. In FL, the worker having the minimum sum distance to all other workers is set as the parameter server. In the standalone case there is no communication, but for the sake of comparison, 5 SGD iterations are counted as a single communication round.
Performance measures. The performance of each scheme is measured in terms of training loss, test accuracy, and total communication cost. Training loss is measured using the cross-entropy function. Test accuracy is calculated as the fraction of correct classification outcomes. Total communication cost is the sum of all links' communication costs. For a given link of distance $d$ between two workers, its communication cost is given as $d^2$, which corresponds to the transmission power under free-space channel inversion.
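The stated cost model can be made concrete with a small example comparing a decentralized chain against an FL-style star centered at a server worker. The coordinates are made up for illustration; only the squared-distance cost model comes from the text above.

```python
import math

def total_cost(positions, links):
    """Sum of squared link distances (free-space channel inversion)."""
    return sum(math.dist(positions[i], positions[j]) ** 2 for i, j in links)

# Four workers on the corners of a 10 m square (illustrative).
pos = [(0, 0), (10, 0), (10, 10), (0, 10)]
chain = [(0, 1), (1, 2), (2, 3)]   # decentralized nearest-neighbor links
star = [(0, 1), (0, 2), (0, 3)]    # FL-style star with worker 0 as server

print(total_cost(pos, chain))      # short links only
print(total_cost(pos, star))       # includes the long diagonal link
```

Even in this tiny symmetric layout, the star pays extra for the diagonal link, illustrating why nearest-neighbor connectivity tends to lower the total cost under a d^2 model.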
With respect to these three figures of merit, the effectiveness of L-FGADMM is described as follows.


Fast convergence of L-FGADMM: As shown in Fig. 2, under both the MLP and CNN architectures, L-FGADMM 1x, 2x, and 4x converge within 100 communication rounds. In all cases, the final loss values of L-FGADMM are close to each other, and are up to 13.8% smaller than that of FL, whose training speed is also slower. It is noted that the standalone baseline yields the fastest convergence. This is because of its overfitting to the local training samples, which results in poor accuracy, as explained next.

High accuracy of L-FGADMM: Fig. 3b shows that under the CNN, L-FGADMM 1x achieves the highest final test accuracy (92.25%), followed by FL (92%), L-FGADMM 2x (91.42%), L-FGADMM 4x (89.76%), and the standalone case (87.3%). It is noticeable that the accuracy of L-FGADMM 2x is comparable to that of FL, while converging faster and exchanging fewer layers. Fig. 3a demonstrates that under the MLP, surprisingly, L-FGADMM 2x achieves the highest accuracy (91.37%), followed by L-FGADMM 1x (90.87%), FL (90.72%), L-FGADMM 4x (86.94%), and the standalone case (83.61%). The excellence of L-FGADMM 2x can be explained by a regularization effect: skipping the largest-layer communications introduces additional errors compared to L-FGADMM 1x, which leads to better generalization.

Low communication cost of L-FGADMM: Fig. 4 illustrates the complementary cumulative distribution function (CCDF) of the total communication cost over 1,000 experiments. The results show that L-FGADMM achieves a lower mean and variance of the total communication cost, compared to FL and the standalone baseline. Specifically, the mean total communication cost of L-FGADMM is up to 3.05x and 4.6x lower than that of FL, under the MLP and CNN, respectively. Furthermore, the variance of L-FGADMM is up to 42.4x and 21.4x lower than that of FL, under the MLP and CNN, respectively. There are two rationales behind these results. First, the worker connectivity in L-FGADMM is based on nearest neighbors in a decentralized setting, leading to shorter link distances than in FL, whose connectivity is centered at the server. Second, the payload sizes are smaller thanks to partially skipping the largest-layer exchanges, yielding the higher communication efficiency of L-FGADMM.
V Conclusions
By leveraging and extending GADMM and FL, in this article we proposed L-FGADMM, a communication-efficient decentralized ML algorithm that exchanges the largest layers of deep NN models less frequently than the other layers. Numerical evaluations validated that L-FGADMM achieves fast convergence and high accuracy while significantly reducing the mean and variance of the communication cost. Generalizing this preliminary study by optimizing the layer exchange periods under different NN architectures and network topologies is an interesting topic for future research.
References
 [1] (201905) Wireless federated distillation for distributed edge learning with heterogeneous data. [Online]. Arxiv preprint: http://arxiv.org/abs/1907.02745. Cited by: §I.
 [2] (201807) SignSGD: compressed optimisation for nonconvex problems. In Proc. Intl. Conf. Machine Learn., Stockholm, Sweden. Cited by: §I.
 [3] (2019) A joint learning and communications framework for federated learning over wireless networks. arXiv preprint arXiv:1909.07972. Cited by: §I.
 [4] (2018) LAG: lazily aggregated gradient for communicationefficient distributed learning. In Advances in Neural Information Processing Systems, pp. 5055–5065. Cited by: §I.
 [5] (2019) Communicationefficient decentralized machine learning framework for 5g and beyond. In 2019 IEEE Global Communications Conference (GLOBECOM), Cited by: §I.
 [6] (2019) GADMM: fast and communication efficient framework for distributed machine learning. arXiv preprint arXiv:1909.00047. Cited by: §I, §I, §III, §III.
 [7] (2019) QGADMM: quantized group admm for communication efficient decentralized machine learning. arXiv preprint arXiv:1910.10453. Cited by: §I, §I.
 [8] (201812) Communicationefficient ondevice machine learning: federated distillation and augmentation under noniid private data. presented at Neural Information Processing Systems Workshop on Machine Learning on the Phone and other Consumer Devices (MLPCD), Montréal, Canada. External Links: Document, 1811.11479, Link Cited by: §I.
 [9] Blockchained ondevice federated learning. to appear in IEEE Communications Letters [Online]. Early access: https://ieeexplore.ieee.org/document/8733825. Cited by: §I, §III.
 [10] (201907) Decentralized stochastic optimization and gossip algorithms with compressed communication. Proc. International Conference on Machine Learning (ICML), Long Beach, CA, USA. Cited by: §I.
 [11] (201612) Federated learning: strategies for improving communication efficiency. In Proc. of NIPS Wksp. PMPML, Barcelona, Spain. External Links: Link Cited by: §I.
 [12] (201908) Federated learning: challenges, methods, and future directions. [Online]. ArXiv preprint: https://arxiv.org/abs/1908.07873. Cited by: §I.
 [13] (2019) Accelerating federated learning via momentum gradient descent. arXiv preprint arXiv: 1910.03197. Cited by: §I.
 [14] Robust coreset construction for distributed machine learning. [Online]. ArXiv preprint: https://arxiv.org/abs/1904.05961. Cited by: §I.

 [15] (201704) Communication-efficient learning of deep networks from decentralized data. In Proceedings of Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA. Cited by: §I, §III, §IV.
 [16] (201905) Client selection for federated learning with heterogeneous resources in mobile edge. In Proc. Int'l Conf. Commun. (ICC), Shanghai, China. Cited by: §I.
 [17] (201911) Wireless network intelligence at the edge. Proceedings of the IEEE 107 (11), pp. 2204–2239. Cited by: §I.
 [18] (201908) Distilling ondevice intelligence at the network edge. Arxiv preprint abs/1908.05895. Cited by: §I.
 [19] (2018) Federated learning for ultrareliable lowlatency v2v communications. In 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–7. Cited by: §I, §III.
 [20] (2019) Communicationefficient distributed learning via lazily aggregated quantized gradients. arXiv preprint arXiv:1909.07588. Cited by: §I.
 [21] (201906) Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1205–1221. Cited by: §I.
 [22] (2019) Scheduling policies for federated learning in wireless networks. arXiv preprint arXiv: 1908.06287. Cited by: §I.
 [23] (201906) On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization. In Proc. Intl. Conf. Machine Learn., Long Beach, CA, USA. Cited by: §I.