L-FGADMM: Layer-Wise Federated Group ADMM for Communication Efficient Decentralized Deep Learning

11/09/2019
by Anis Elgabli, et al.
University of Oulu

This article proposes a communication-efficient decentralized deep learning algorithm, coined layer-wise federated group ADMM (L-FGADMM). To minimize an empirical risk, every worker in L-FGADMM periodically communicates with two neighbors, and the communication periods are separately adjusted for different layers of its deep neural network. A constrained optimization problem for this setting is formulated and solved using the stochastic version of GADMM proposed in our prior work. Numerical evaluations show that by exchanging the largest layer less frequently, L-FGADMM can significantly reduce the communication cost without compromising the convergence speed. Surprisingly, despite less exchanged information and decentralized operations, intermittently skipping the largest layer's consensus in L-FGADMM creates a regularizing effect, thereby achieving a test accuracy as high as that of federated learning (FL), a baseline method that enforces consensus on all layers with the aid of a central entity.



I Introduction

Interest in data-driven machine learning (ML) is on the rise, but difficulties in securing data still remain [17, 14]. Mission-critical applications, which require a large volume of up-to-date data to cope with local environments in a timely manner even under extreme events, aggravate this challenge. Mobile devices prevailing at the network edge are a major source of such data, but their user-generated raw data are often privacy-sensitive (e.g., medical records, location history, etc.). In view of this, distributed ML has attracted significant attention, whereby the parameters of each model, such as the weights of a neural network (NN), are exchanged without revealing raw data [17, 14, 18, 12]. However, with deep NN architectures, the communication payload sizes may be too large and hinder the performance of distributed ML, spurring a quest for communication-efficient distributed ML solutions.

Fig. 1: Operational structures of (a) federated learning (FL), (b) group ADMM (GADMM), and (c) our proposed layer-wise federated GADMM (L-FGADMM) combining FL and GADMM, while less frequently exchanging the largest NN layer compared to the other layers.

Federated Learning (FL). A notable communication-efficient distributed ML framework is FL [15, 19, 9]. Each device, or worker, in FL stores its own dataset and an NN, and locally trains the NN via stochastic gradient descent (SGD). As shown in Fig. 1a, the weight parameters of the local NN are uploaded to a parameter server at a regular interval. The server thereby produces the global average weight parameters that are downloaded by each worker. These FL operations are summarized by its server-aided centralized architecture, random data sampling per SGD iteration, and periodic communication at an interval of multiple SGD iterations. However, the inherently centralized architecture of FL is ill-suited for mobile devices located far away from the server. Due to their limited transmission power and energy, these devices may easily lose connectivity, calling for decentralized ML methods.
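
For concreteness, the following is a minimal sketch of this periodic, server-aided averaging; the worker objects, the `local_sgd` routine, and the 5-iteration interval are illustrative assumptions rather than the exact implementation used in the paper.

```python
import copy
import torch


def federated_averaging(workers, server_model, rounds, local_iters=5):
    """Illustrative FL loop: local mini-batch SGD at every worker, followed by
    layer-wise averaging at the parameter server.

    `workers` is assumed to be a list of objects exposing `model` (an nn.Module
    with the same architecture as `server_model`) and `local_sgd(model, iters)`
    running `iters` mini-batch SGD steps on the worker's own dataset.
    """
    for _ in range(rounds):
        # Each worker downloads the current global model and trains locally.
        for w in workers:
            w.model = copy.deepcopy(server_model)
            w.local_sgd(w.model, local_iters)

        # The server averages the uploaded weights, parameter tensor by tensor.
        global_state = server_model.state_dict()
        for key in global_state:
            global_state[key] = torch.stack(
                [w.model.state_dict()[key] for w in workers]).mean(dim=0)
        server_model.load_state_dict(global_state)
    return server_model
```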

Group ADMM (GADMM). To achieve fast convergence while minimizing the number of communication links under a decentralized architecture, GADMM was proposed in our prior works [6, 5, 7]. Under GADMM, each worker communicates only with its two neighboring workers. To this end, as illustrated in Fig. 1b, GADMM first divides the workers into head and tail groups. The workers in the same group update their model parameters in parallel, while the workers in different groups update their models in an alternating way, after communicating with neighbors in the other group. Nonetheless, the effectiveness of GADMM was shown only for convex loss functions, without exploring deep NN architectures. What's more, GADMM assumes that every worker identically stores the full batch of data and runs gradient descent (GD) with immediate communication per GD iteration. These assumptions are ill-suited for the user-generated nature of data and for communication efficiency, motivating us to seek a federated and decentralized solution.
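
As a small illustration of this grouping, the sketch below splits a chain of workers into alternating head and tail groups so that each worker's two neighbors always belong to the opposite group; the even/odd indexing convention is an assumption for illustration.

```python
def split_head_tail(num_workers):
    """Even-indexed workers form the head group and odd-indexed workers the tail
    group, so that neighbors on the chain 0-1-...-(N-1) are in opposite groups."""
    heads = [n for n in range(num_workers) if n % 2 == 0]
    tails = [n for n in range(num_workers) if n % 2 == 1]
    return heads, tails


def chain_neighbors(n, num_workers):
    """The (at most two) chain neighbors of worker n."""
    return [m for m in (n - 1, n + 1) if 0 <= m < num_workers]
```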

Layer-wise Federated GADMM (L-FGADMM). To bridge the gap between FL and GADMM, in this article we propose L-FGADMM, which integrates the periodic communication and random data sampling of FL into GADMM under a deep NN architecture. To further improve communication efficiency, as illustrated in Fig. 1c, L-FGADMM applies a different communication period to each layer. By exchanging the largest layer less frequently than the other layers, our results show that L-FGADMM achieves the same test accuracy while reducing the average communication cost, compared both to the case using the same communication period for all layers and to FL.
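
The layer-wise scheduling itself boils down to a per-layer modulo test, as in the sketch below; the layer names and period values are hypothetical placeholders.

```python
# Hypothetical per-layer communication periods (in SGD iterations): the largest
# layer (here fc1) is exchanged 2x less frequently than the remaining layers.
PERIODS = {"fc1": 10, "fc2": 5, "fc3": 5, "fc4": 5, "fc5": 5, "fc6": 5}


def layers_to_exchange(k, periods=PERIODS):
    """Layers whose parameters are transmitted to the neighbors at iteration k."""
    return [name for name, period in periods.items() if k % period == 0]
```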

Related Works. Towards improving the communication efficiency of distributed ML under a centralized architecture, the number of communication rounds can be reduced by collaboratively adjusting the training momentum [13, 23]. On the other hand, the number of communication links can be decreased by collecting model updates until a time deadline [21], only when the values have changed sufficiently from the preceding updates [4, 20], or based on channel conditions [16, 22, 3]. Furthermore, the communication payload can be compressed by 1-bit gradient quantization [2], multi-bit gradient quantization [20], or weight quantization with random rotation [11]. Alternatively, instead of model parameters, model outputs can be exchanged for large models via knowledge distillation [8, 1]. Similar principles are applicable to communication-efficient decentralized ML. Without any central entity, communication payload sizes can be reduced by a quantized weight gossiping algorithm [10], although this ignores communication link reduction. Alternatively, the number of communication links and rounds can be decreased using GADMM, proposed in our prior work [6]. Furthermore, by integrating stochastic quantization into GADMM, quantized GADMM (Q-GADMM) was proposed to reduce communication rounds, links, and payload sizes altogether [7]. To achieve the same goals, instead of quantization as in Q-GADMM, L-FGADMM applies layer-wise federation to GADMM under deep NN architectures. Combining both quantization and layer-wise federation is deferred to future work.

II Problem Formulation

We consider $N$ workers, each of which stores its own batch of input samples and runs a deep NN model comprising $L$ layers. Hereafter, the subscript $n$ identifies the workers, and the superscript $l$ indicates the layers. The $n$-th worker's model parameters (i.e., weights and biases) are denoted as $\theta_n = \{\theta_n^1, \dots, \theta_n^L\}$, where $\theta_n^l$ is a $d_l$-dimensional vector whose elements are the model parameters of the $l$-th layer.

Every worker $n$ has its local loss function $f_n(\theta_n)$, and optimizes its local model such that the global average loss can be minimized, at which all local models reach a consensus on a global model. To solve this problem in parallel, each worker runs a first-order iterative algorithm by selecting a mini-batch $\xi_n^k$ at the $k$-th iteration, while communicating with other workers to ensure the consensus across all workers. Unfortunately, this consensus requires exchanging the entire local models, incurring huge communication overhead for deep NNs. To reduce the communication payload sizes, we instead consider a layer-wise consensus across neighboring workers, leading to the following problem formulation:

Minimize   $\sum_{n=1}^{N} f_n(\theta_n)$ over $\{\theta_n\}_{n=1}^{N}$   (1)
subject to   $\theta_n^l = \theta_{n+1}^l$, for all $n \in \{1, \dots, N-1\}$ and $l \in \{1, \dots, L\}$.   (2)

The constraint (2) enforces a per-layer consensus between the $n$-th and $(n+1)$-th workers. This enables layer-wise federation and neighbor-based communication, as elaborated in the next section.
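
Note that, because the workers form a chain, chaining the pairwise constraints in (2) over $n = 1, \dots, N-1$ still enforces a global per-layer consensus without any central entity:

$\theta_1^l = \theta_2^l = \cdots = \theta_N^l, \quad \text{for every layer } l \in \{1, \dots, L\}.$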

III Proposed Algorithm: L-FGADMM

1:  Input: $N$, $\rho$, maximum iteration number $K$, per-layer communication periods $\{T^l\}_{l=1}^{L}$
2:  Output: $\{\theta_n\}_{n=1}^{N}$
3:  Initialization: $k = 0$, $\theta_n^l$ and $\lambda_n^l$ for all $n$ and $l$
4:  while $k \le K$ do
5:     Head worker $n \in \mathcal{N}_h$: in Parallel
6:        Randomly selects a mini-batch $\xi_n^k$
7:        Updates its primal variable $\theta_n^l$ via (4)
8:     if $k$ mod $T^l = 0$ then
9:           Transmits $\theta_n^l$ to its two tail neighbors
10:     end if
11:     Tail worker $n \in \mathcal{N}_t$: in Parallel
12:        Randomly selects a mini-batch $\xi_n^k$
13:        Updates its primal variable $\theta_n^l$ via (5)
14:     if $k$ mod $T^l = 0$ then
15:           Transmits $\theta_n^l$ to its two head neighbors
16:     end if
17:     All workers: in Parallel
18:     if $k$ mod $T^l = 0$ then
19:           Updates the dual variables via (6)
20:     end if
21:     $k \leftarrow k + 1$
22:  end while
Algorithm 1 Layer-Wise Federated GADMM (L-FGADMM)

To solve the problem defined in (1)-(2), in this section we propose L-FGADMM by extending GADMM from our prior work [6]. Following GADMM (see Fig. 1b), the workers in L-FGADMM are divided into head and tail groups, and communicate only with their neighboring workers. Compared to GADMM, L-FGADMM further improves communication efficiency in two ways. First, workers in L-FGADMM communicate periodically, as done in FL [15, 19, 9], in contrast to the per-iteration communication in GADMM. Second, the communication period of L-FGADMM is adjusted separately for each layer (see Fig. 1c), as opposed to GADMM and FL, which exchange the entire model. L-FGADMM can thereby increase the communication periods of large layers, reducing the communication payload size.

To be specific, a physical network topology is converted into a logical worker connectivity graph in L-FGADMM. Then, the workers are split into a head group $\mathcal{N}_h$ and a tail group $\mathcal{N}_t$, such that each head worker is connected to neighboring tail workers. The workers in the same group update their model parameters in parallel by iterating mini-batch stochastic gradient descent (SGD). Every $T^l$ iterations, the workers share the updated model parameters $\theta_n^l$ of the $l$-th layer with their neighbors. These operations of L-FGADMM are summarized in Algorithm 1 and detailed next.

At first, the augmented Lagrangian of the problem in (1)-(2) is defined as:

$\mathcal{L}_\rho = \sum_{n=1}^{N} f_n(\theta_n) + \sum_{l=1}^{L} \sum_{n=1}^{N-1} \Big( \langle \lambda_n^l, \theta_n^l - \theta_{n+1}^l \rangle + \frac{\rho}{2} \| \theta_n^l - \theta_{n+1}^l \|^2 \Big),$   (3)

where $\lambda_n^l$ is the $n$-th worker's dual variable of the $l$-th layer, and $\rho > 0$ is a constant penalty term. For the sake of explanation, hereafter the $n$-th worker's model parameter vector $\theta_n^l$ of the $l$-th layer is called a primal variable. Head and tail workers' primal and dual variables are updated through the following three phases.

1) Head primal updates. Head workers receive the primal variables from their neighboring tail workers and update the dual variables at an interval of $T^l$ iterations. Between exchanges, these variables are kept fixed at their values from the most recent exchange iteration. Given these fixed primal and dual variables associated with its neighbors, at iteration $k$, each head worker runs mini-batch SGD to minimize $\mathcal{L}_\rho$. Applying the first-order condition to $\mathcal{L}_\rho$, with the loss linearized using the stochastic gradient, yields the $n$-th head worker's model update as follows:

$\theta_n^{l,k+1} = \frac{1}{2\rho}\Big(\rho\big(\theta_{n-1}^{l,k} + \theta_{n+1}^{l,k}\big) + \lambda_{n-1}^{l,k} - \lambda_n^{l,k} - \nabla_{\theta^l} f_n(\theta_n^{k}; \xi_n^k)\Big)$   (4)

After $T^l$ iterations, head worker $n$ transmits the updated primal variable $\theta_n^{l}$ to its two tail neighbors, workers $n-1$ and $n+1$.

2) Tail primal updates. Following the same principle as in the head primal updates, but using the freshly received head primal variables, at the $k$-th iteration the $n$-th tail worker updates its model as:

$\theta_n^{l,k+1} = \frac{1}{2\rho}\Big(\rho\big(\theta_{n-1}^{l,k+1} + \theta_{n+1}^{l,k+1}\big) + \lambda_{n-1}^{l,k} - \lambda_n^{l,k} - \nabla_{\theta^l} f_n(\theta_n^{k}; \xi_n^k)\Big)$   (5)

After $T^l$ iterations, tail worker $n$ transmits the updated primal variable $\theta_n^{l}$ to its two head neighbors, workers $n-1$ and $n+1$.

3) Dual updates. After the updated tail primal variables are exchanged, every worker locally updates its dual variables $\lambda_{n-1}^l$ and $\lambda_n^l$ as follows:

$\lambda_n^{l,k+1} = \lambda_n^{l,k} + \rho\big(\theta_n^{l,k+1} - \theta_{n+1}^{l,k+1}\big)$   (6)

The convergence of the aforementioned primal and dual variable updates is theoretically proved for convex, proper, and smooth loss functions only when all layers are exchanged at every iteration [6]. The convergence proof of L-FGADMM with different exchange periods under deep NN architectures is deferred to future work. Meanwhile, the effectiveness of L-FGADMM is empirically corroborated in the next section.
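
To make the three phases concrete, the following is a compact sketch of one L-FGADMM iteration. It assumes the linearized closed-form updates (4)-(6) as reconstructed above, represents each layer as a flat vector, and skips the two chain-endpoint workers for brevity; it is an illustrative sketch, not the authors' implementation.

```python
def l_fgadmm_iteration(theta, lam, grads, heads, tails, periods, rho, k):
    """One iteration of the (reconstructed) L-FGADMM updates.

    theta[n][l]: layer-l parameters of worker n (1-D numpy arrays).
    lam[n][l]:   dual variable tied to the constraint theta[n][l] = theta[n+1][l].
    grads[n][l]: stochastic gradient of worker n's loss w.r.t. layer l at iteration k.
    periods[l]:  communication period of layer l; between exchanges, the neighbor
                 and dual values from the most recent exchange are simply reused.
    """
    num_workers, num_layers = len(theta), len(theta[0])

    def primal_update(n, l):
        # First-order condition of the augmented Lagrangian (3) with the loss
        # linearized at the current iterate, cf. (4) and (5).
        return (rho * (theta[n - 1][l] + theta[n + 1][l])
                + lam[n - 1][l] - lam[n][l] - grads[n][l]) / (2.0 * rho)

    for group in (heads, tails):               # heads update first, then tails
        for n in group:
            if 0 < n < num_workers - 1:        # chain endpoints omitted in this sketch
                for l in range(num_layers):
                    theta[n][l] = primal_update(n, l)
        # here each worker would transmit layer l to its two neighbors
        # whenever k % periods[l] == 0

    for n in range(num_workers - 1):           # dual ascent, cf. (6)
        for l in range(num_layers):
            if k % periods[l] == 0:
                lam[n][l] = lam[n][l] + rho * (theta[n][l] - theta[n + 1][l])

    return theta, lam
```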

IV Numerical Evaluations

(a) MLP
Layer Output Shape #Weights
fc1 256 200,960
fc2 128 32,896
fc3 64 8,256
fc4 32 2,080
fc5 16 528
fc6 10 170
Total 244,890
(b) CNN
Layer Output Shape #Weights
conv1 28x28x8 208
conv2 14x14x16 3,216
conv3 7x7x32 8,224
fc4 400 627,600
fc5 10 4,010
Total 643,258
TABLE I: NN architectures: (a) MLP comprising 6 fully-connected layers (fc1-6) and (b) 5-layer CNN consisting of 3 convolutional layers (conv1-3) and 2 fully-connected layers (fc4-5).

This section validates the performance of L-FGADMM for a classification task, with 4 workers uniformly randomly distributed over a 50x50 m plane. These workers are assigned to head and tail groups such that the length of the path starting from one worker and passing through all workers is minimized. The simulation settings are elaborated as follows.

Datasets. We consider the MNIST dataset comprising 28x28 pixel images of hand-written digits 0-9. The training samples are independent and identically distributed across workers, and each worker utilizes a randomly selected mini-batch of samples per SGD iteration.

Communication periods. The communication periods of all layers except the largest one are identically set to 5 iterations. The largest layer's communication period under L-FGADMM is set to 1x, 2x, or 4x this value; hereafter, our proposed scheme is referred to as L-FGADMM 1x, 2x, or 4x, respectively.
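
As a rough back-of-the-envelope check, sending the largest layer only every few rounds shrinks the average per-round payload as sketched below; the 82% and 97.6% largest-layer fractions come from Table I, and ignoring any per-message overhead is an assumption.

```python
def avg_payload_fraction(largest_fraction, multiplier):
    """Average per-round payload relative to exchanging every layer every round,
    when the largest layer is sent only once every `multiplier` rounds."""
    return (1.0 - largest_fraction) + largest_fraction / multiplier


for model_name, fraction in [("MLP", 0.82), ("CNN", 0.976)]:
    for multiplier in (1, 2, 4):
        print(f"{model_name}, {multiplier}x: "
              f"{avg_payload_fraction(fraction, multiplier):.3f} of the full payload")
```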

NN architectures. To examine the impact of NN architectures, two different NN models are considered: a multi-layer perceptron (MLP) and a convolutional neural network (CNN). As Table I describes, the MLP consists of 6 layers, among which the 1st layer (fc1) is the largest, holding 82% of the entire model's weight parameters. The CNN comprises 5 layers, among which the 4th layer (fc4) is the largest, holding 97.6% of the entire model's weight parameters.
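
The MLP column of Table I(a) can be reproduced in a few lines of PyTorch, as sketched below; the flattened 784-dimensional MNIST input is implied by the dataset, while the ReLU activations are an assumption since the activation functions are not listed.

```python
import torch.nn as nn

# 6-layer MLP of Table I(a): 784 -> 256 -> 128 -> 64 -> 32 -> 16 -> 10.
mlp = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 10),
)

total = sum(p.numel() for p in mlp.parameters())     # 244,890, as in Table I(a)
fc1 = sum(p.numel() for p in mlp[0].parameters())    # 200,960
print(total, fc1 / total)                            # fc1 holds ~82% of all weights
```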

Baselines. We compare L-FGADMM with two benchmark schemes: (i) FL running mini-batch SGD while exchanging local gradients every 5 iterations [15]; and (ii) standalone mini-batch SGD with a single worker. In FL, the worker having the minimum sum of distances to all other workers is set as the parameter server. In the standalone case there is no communication, but for the sake of comparison 5 SGD iterations are counted as a single communication round.

Performance measures. The performance of each scheme is measured in terms of training loss, test accuracy, and total communication cost. Training loss is measured using the cross-entropy function. Test accuracy is calculated as the fraction of correct classification outcomes. Total communication cost is the sum of the communication costs over all links. For a given link of distance $d$ between two workers, its communication cost is given as $d^2$, which corresponds to the transmission power needed for free-space channel inversion.
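
Under the distance-squared link cost described above, a minimal sketch of the total-cost metric is the following; counting one transmission per link and per exchanging round is an illustrative assumption.

```python
def total_comm_cost(positions, link_rounds):
    """Total communication cost: sum of squared link distances, weighted by the
    number of communication rounds in which each link actually exchanges data.

    positions:   dict mapping worker id -> (x, y) coordinates in meters.
    link_rounds: dict mapping (worker_a, worker_b) -> number of exchanging rounds.
    """
    def squared_distance(a, b):
        (xa, ya), (xb, yb) = positions[a], positions[b]
        return (xa - xb) ** 2 + (ya - yb) ** 2

    return sum(squared_distance(a, b) * rounds
               for (a, b), rounds in link_rounds.items())
```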

Fig. 2: Training loss of L-FGADMM under (a) MLP and (b) CNN, when the largest layer's exchanging period is 2x or 4x longer.
Fig. 3: Test accuracy of L-FGADMM under (a) MLP and (b) CNN, when the largest layer's exchanging period is 2x or 4x longer.

With respect to these three figures of merit, the effectiveness of L-FGADMM is described as follows.


  • Fast convergence of L-FGADMM: As shown in Fig. 2, under both MLP and CNN architectures, L-FGADMM 1x, 2x, and 4x converge within 100 communication rounds. In all cases, the final loss values of L-FGADMM are close to each other and up to 13.8% smaller than that of FL, whose training is also slower. It is noted that the standalone baseline yields the fastest convergence. This is because it overfits its local training samples, which results in poor accuracy, as explained next.

  • High accuracy of L-FGADMM: Fig. 3b shows that under CNN, L-FGADMM 1x achieves the highest final test accuracy (92.25%), followed by FL (92%), L-FGADMM 2x (91.42%), L-FGADMM 4x (89.76%), and the standalone case (87.3%). It is noticeable that the accuracy of L-FGADMM 2x is comparable to that of FL, while converging faster and exchanging fewer layers than FL. Fig. 3a demonstrates that under MLP, surprisingly, L-FGADMM 2x achieves the highest accuracy (91.37%), followed by L-FGADMM 1x (90.87%), FL (90.72%), L-FGADMM 4x (86.94%), and the standalone case (83.61%). The excellence of L-FGADMM 2x can be explained by a regularization effect: skipping the largest layer's communication introduces additional perturbations compared to L-FGADMM 1x, which leads to better generalization.

  • Low communication cost of L-FGADMM: Fig. 4 illustrates the complementary cumulative distribution function (CCDF) of the total communication cost over 1,000 experiments. The result shows that L-FGADMM achieves a lower mean and variance of the total communication cost, compared to FL and the standalone baseline. Specifically, the mean total communication cost of L-FGADMM is up to 3.05x and 4.6x lower than that of FL under MLP and CNN, respectively. Furthermore, the variance of L-FGADMM is up to 42.4x and 21.4x lower than that of FL under MLP and CNN, respectively. There are two rationales behind these results. First, the worker connectivity in L-FGADMM is based on nearest neighbors in a decentralized setting, leading to shorter link distances than in FL, whose connectivity is centralized at the server. Second, the payload sizes are smaller thanks to partially skipping the largest layer exchanges, yielding the higher communication efficiency of L-FGADMM.

Fig. 4: Total communication cost of L-FGADMM under (a) MLP and (b) CNN, when the largest layer's exchanging period is 2x or 4x longer.

V Conclusions

By leveraging and extending GADMM and FL, in this article we proposed L-FGADMM, a communication-efficient decentralized ML algorithm that exchanges the largest layer of a deep NN model less frequently than the other layers. Numerical evaluations validated that L-FGADMM achieves fast convergence and high accuracy while significantly reducing the mean and variance of the communication cost. Generalizing this preliminary study by optimizing the layer exchange periods under different NN architectures and network topologies is an interesting topic for future research.

References

  • [1] J. H. Ahn, O. Simeone, and J. Kang (2019-05) Wireless federated distillation for distributed edge learning with heterogeneous data. [Online]. Arxiv preprint: http://arxiv.org/abs/1907.02745. Cited by: §I.
  • [2] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018-07) SignSGD: compressed optimisation for non-convex problems. In Proc. Intl. Conf. Machine Learn., Stockholm, Sweden. Cited by: §I.
  • [3] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui (2019) A joint learning and communications framework for federated learning over wireless networks. arXiv preprint arXiv:1909.07972. Cited by: §I.
  • [4] T. Chen, G. Giannakis, T. Sun, and W. Yin (2018) LAG: lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, pp. 5055–5065. Cited by: §I.
  • [5] A. Elgabli, M. Bennis, and V. Aggarwal (2019) Communication-efficient decentralized machine learning framework for 5g and beyond. In 2019 IEEE Global Communications Conference (GLOBECOM), Cited by: §I.
  • [6] A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal (2019) GADMM: fast and communication efficient framework for distributed machine learning. arXiv preprint arXiv:1909.00047. Cited by: §I, §I, §III, §III.
  • [7] A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal (2019) Q-GADMM: quantized group admm for communication efficient decentralized machine learning. arXiv preprint arXiv:1910.10453. Cited by: §I, §I.
  • [8] E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S. Kim (2018-12) Communication-efficient on-device machine learning: federated distillation and augmentation under non-iid private data. presented at Neural Information Processing Systems Workshop on Machine Learning on the Phone and other Consumer Devices (MLPCD), Montréal, Canada. External Links: Document, 1811.11479, Link Cited by: §I.
  • [9] H. Kim, J. Park, M. Bennis, and S.-L. Kim Blockchained on-device federated learning. to appear in IEEE Communications Letters [Online]. Early access: https://ieeexplore.ieee.org/document/8733825. Cited by: §I, §III.
  • [10] A. Koloskova, S. U. Stich, and M. Jaggi (2019-07) Decentralized stochastic optimization and gossip algorithms with compressed communication. Proc. International Conference on Machine Learning (ICML), Long Beach, CA, USA. Cited by: §I.
  • [11] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and D. Bacon (2016-12) Federated learning: strategies for improving communication efficiency. In Proc. of NIPS Wksp. PMPML, Barcelona, Spain. External Links: Link Cited by: §I.
  • [12] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2019-08) Federated learning: challenges, methods, and future directions. [Online]. ArXiv preprint: https://arxiv.org/abs/1908.07873. Cited by: §I.
  • [13] W. Liu, L. Chen, Y. Chen, and W. Zhang (2019) Accelerating federated learning via momentum gradient descent. arXiv preprint arXiv: 1910.03197. Cited by: §I.
  • [14] H. Lu, M. Li, T. He, S. Wang, V. Narayanan, and K. S. Chan Robust coreset construction for distributed machine learning. [Online]. ArXiv preprint: https://arxiv.org/abs/1904.05961. Cited by: §I.
  • [15] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017-04) Communication-efficient learning of deep networks from decentralized data. In Proceedings of Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA. Cited by: §I, §III, §IV.
  • [16] T. Nishio and R. Yonetani (2019-05) Client selection for federated learning with heterogeneous resources in mobile edge. In Proc. Int’l Conf. Commun. (ICC), Shanghai, China. Cited by: §I.
  • [17] J. Park, S. Samarakoon, M. Bennis, and M. Debbah (2019-11) Wireless network intelligence at the edge. Proceedings of the IEEE 107 (11), pp. 2204–2239. Cited by: §I.
  • [18] J. Park, S. Wang, A. Elgabli, S. Oh, E. Jeong, H. Cha, H. Kim, S. Kim, and M. Bennis (2019-08) Distilling on-device intelligence at the network edge. Arxiv preprint abs/1908.05895. Cited by: §I.
  • [19] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah (2018) Federated learning for ultra-reliable low-latency v2v communications. In 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–7. Cited by: §I, §III.
  • [20] J. Sun, T. Chen, G. B. Giannakis, and Z. Yang (2019) Communication-efficient distributed learning via lazily aggregated quantized gradients. arXiv preprint arXiv:1909.07588. Cited by: §I.
  • [21] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan (2019-06) Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1205–1221. Cited by: §I.
  • [22] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor (2019) Scheduling policies for federated learning in wireless networks. arXiv preprint arXiv: 1908.06287. Cited by: §I.
  • [23] H. Yu, R. Jin, and S. Yang (2019-06) On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proc. Intl. Conf. Machine Learn., Long Beach, CA, USA. Cited by: §I.