Efficient and Convergent Federated Learning

05/03/2022
by Shenglong Zhou, et al.

Federated learning has shown its advances over the last few years but is facing many challenges, such as how algorithms save communication resources, how they reduce computational costs, and whether they converge. To address these issues, this paper proposes a new federated learning algorithm (FedGiA) that combines gradient descent and the inexact alternating direction method of multipliers. It is shown that FedGiA is computation- and communication-efficient and converges linearly under mild conditions.



I Introduction

Federated learning (FL), as an effective machine learning technique, has gained popularity in recent years due to its ability to deal with various issues like data privacy, data security, and access to heterogeneous data. Typical applications include vehicular communications

[samarakoon2019distributed, pokhrel2020federated, elbir2020federated, posner2021federated], digital health [rieke2020future], and smart manufacturing [fazel2013hankel], to name just a few. The earliest work on FL can be traced back to [konevcny2015federated] in 2015 and [konevcny2016federated] in 2016. The field is still undergoing development and faces many challenges [kairouz2019advances, li2020federated, qin2021federated].

I-A Related work

Gradient descent-based learning.

In recent years, there has been an impressive body of work on developing FL algorithms. One of the most popular approaches builds on stochastic gradient descent (SGD). The general framework is to run a certain number of SGD steps in parallel on clients/devices and then have a central server average the resulting parameters once in a while. Representatives of the SGD family include the famous federated averaging (FedAvg [mcmahan2017communication]) and local SGD (LocalSGD [stich2018local, Lin2020Don]). Other state-of-the-art methods can be found in [DeepLearning2015, AsynchronousStochastic2017, yu2019parallel, wang2021cooperative]. These algorithms execute global averaging/aggregation periodically and thus can reduce the number of communication rounds (CR), thereby saving resources (e.g., transmission power and bandwidth in wireless communication) in real-world applications.
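To make this periodic-averaging pattern concrete, here is a minimal sketch of one communication round (our illustration, not code from any cited paper; the toy least-squares gradient and the name `fedavg_style_round` are assumptions):

```python
import numpy as np

def fedavg_style_round(x_global, client_data, lr=0.1, local_steps=5, rng=None):
    """One round: each client runs `local_steps` SGD steps from the broadcast
    parameter, then the server averages the resulting parameters."""
    rng = rng or np.random.default_rng()
    updated = []
    for A, b in client_data:                     # (A, b): one client's local dataset
        x = x_global.copy()
        for _ in range(local_steps):
            j = rng.integers(A.shape[0])         # sample one local data point
            x -= lr * (A[j] @ x - b[j]) * A[j]   # SGD step on a least-squares loss
        updated.append(x)
    return np.mean(updated, axis=0)              # periodic global averaging
```

Running several local steps between averages is exactly what reduces CR: communication happens once per call, not once per gradient step.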

However, to establish a convergence theory, most SGD-based algorithms assume that the local data is independently and identically distributed (i.i.d.), which is unrealistic for FL applications where data is usually heterogeneous. More details can be found in LocalSGD [stich2018local], K-step averaging SGD [zhou2017convergence], and Cooperative SGD [wang2021cooperative].

A parallel line of research investigates gradient descent (GD)-based FL algorithms. Since the full local data is used to construct the gradient, these algorithms do not impose assumptions on the distribution of the involved data [smith2018cocoa, LAG2018, wang2019adaptive, liu2021decentralized, tong2020federated]. Nevertheless, strong conditions on the objective functions of the learning optimization problems are still required to guarantee convergence. Typical assumptions are gradient Lipschitz continuity (also known as L-smoothness), strong smoothness, convexity, or strong convexity.

ADMM-based learning. The alternating direction method of multipliers (ADMM) has shown its strengths in both theoretical and numerical aspects over the last few decades, with extensive applications across various disciplines. In particular, ADMM has been successfully implemented in distributed learning [boyd2011distributed, song2016fast, zhang2018improving, zheng2018stackelberg, elgabli2020fgadmm]. Fairly recently, ADMM-based FL has drawn much attention due to its simple structure and easy implementation. These methods can be categorized into two classes: exact and inexact ADMM. The former updates local parameters by solving sub-problems exactly, which brings a heavy computational burden to local clients [zhang2016dynamic, Li2017RobustFL, guo2018practical, huang2019dp, zhang2018recycled].

Inexact ADMM is therefore an alternative that reduces the computational complexity for clients [ding2019stochastic, Inexact-ADMM2021, ryu2022differentially, zhang2020fedpd]: clients update their parameters by solving sub-problems approximately, thereby alleviating the computational burden and accelerating the learning speed. Again, we emphasize that the algorithms with established convergence properties still impose some restrictive assumptions. Fairly recently, an algorithm developed from the primal-dual optimization perspective in [zhang2020fedpd] turned out to be a member of the inexact ADMM-based FL family; it is shown to converge under weaker assumptions. Finally, it is worth mentioning that ADMM is very useful in FL for the purpose of data privacy [zhang2016dynamic, guo2018practical, zhang2018recycled, ding2019stochastic, huang2019dp, ryu2022differentially].

I-B Our contributions

The main contribution of this paper is to develop a new FL algorithm that is capable of saving communication resources, reducing computational burdens, and converging under relatively weak assumptions.

I) The proposed algorithm, FedGiA in Algorithm 1, has a novel framework. After each round of communication (i.e., whenever the iteration count is a multiple of a given integer $k_0$), all clients are randomly split into two groups. One group adopts the inexact ADMM scheme to update their parameters $k_0$ times, while the second group exploits the GD approach to update their parameters just once. Therefore, FedGiA possesses the following three advantages.

  • It is communication-efficient since CR can be controlled by setting $k_0$. Our numerical experiments show that CR declines as $k_0$ increases; see Figure 2.

  • It is computation-efficient due to the nature of the inexact updates for all local clients. The computational efficiency has been demonstrated by our numerical comparisons with two state-of-the-art algorithms; see Table III.

  • It can cope with scenarios where a portion of clients are in poor condition. The server could select them for the second group, where less effort is required to update their parameters.

II) The assumptions guaranteeing convergence are mild. We prove that FedGiA converges to a stationary point (see Definition II.1) of the learning optimization problem (4) at a linear rate under only two conditions: gradient Lipschitz continuity (also known as L-smoothness in many publications) and the boundedness of a level set, as shown in Theorem IV.2. These conditions do not involve convexity or strong convexity and are hence weaker than those used to establish convergence for most current distributed learning and FL algorithms. If we further assume convexity, then FedGiA achieves the optimal solution, as shown in Corollary IV.1.

I-C Organization and notation

This paper is organized as follows. In the next section, we introduce FL and the framework of ADMM. In Section III, we present FedGiA and highlight its advantages, followed by the establishment of its global convergence and convergence rate in Section IV. We then conduct some numerical experiments and comparisons with two popular algorithms to demonstrate the performance of FedGiA in Section V. Concluding remarks are given in the last section.

We end this section by summarizing the notation employed throughout this paper. We use plain, bold, and capital letters to denote scalars, vectors, and matrices, respectively. Let $\lfloor t \rfloor$ represent the largest integer strictly smaller than $t$, and let '$:=$' mean 'define'. Throughout, $\mathbb{R}^n$ denotes the $n$-dimensional Euclidean space equipped with the inner product $\langle x, y \rangle := \sum_i x_i y_i$. Let $\|\cdot\|$ be the Euclidean norm for vectors (i.e., $\|x\|^2 = \langle x, x \rangle$) and the spectral norm for matrices, and let $\|x\|_H^2 := \langle Hx, x \rangle$ be the weighted norm induced by a matrix $H$. Write the identity matrix as $I$ and a positive semidefinite matrix $H$ as $H \succeq 0$; in particular, $H \succeq G$ represents $H - G \succeq 0$. A function $f$ is said to be gradient Lipschitz continuous with a constant $L > 0$ if

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|    (1)

for any $x$ and $y$, where $\nabla f$ represents the gradient of $f$ with respect to $x$.

II GD and inexact ADMM-based FL

Suppose we have $m$ local clients/edge nodes with datasets $\mathcal{D}_1, \ldots, \mathcal{D}_m$. Each client $i$ has the total loss $f_i(x) := \frac{1}{d_i} \sum_{j \in \mathcal{D}_i} \ell(x; j)$, where $\ell$ is a continuous loss function bounded from below, $d_i := |\mathcal{D}_i|$ is the cardinality of $\mathcal{D}_i$, and $x \in \mathbb{R}^n$ is the parameter to be learned. Below are two examples used in our numerical experiments.

Example II.1 (Least square loss).

Suppose the $i$th client has data $\{(a_{ij}, b_{ij}) : j \in \mathcal{D}_i\}$, where $a_{ij} \in \mathbb{R}^n$ and $b_{ij} \in \mathbb{R}$. Then the least square loss is

f_i(x) = \frac{1}{2 d_i} \sum_{j \in \mathcal{D}_i} (\langle a_{ij}, x \rangle - b_{ij})^2    (2)
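A direct transcription of (2) as reconstructed above (a sketch we add for concreteness; `A_i` stacks the vectors $a_{ij}$ as rows and `b_i` the scalars $b_{ij}$):

```python
import numpy as np

def least_square_loss(x, A_i, b_i):
    """Least square loss (2) of client i and its gradient."""
    d_i = A_i.shape[0]            # d_i = |D_i|, the local sample size
    r = A_i @ x - b_i             # residuals <a_ij, x> - b_ij
    return (r @ r) / (2 * d_i), A_i.T @ r / d_i
```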
Example II.2 ($\ell_2$ norm regularized logistic loss).

Similarly, the $i$th client has data $\{(a_{ij}, b_{ij}) : j \in \mathcal{D}_i\}$ but with $b_{ij} \in \{0, 1\}$. The $\ell_2$ norm regularized logistic loss is given by

f_i(x) = \frac{1}{d_i} \sum_{j \in \mathcal{D}_i} \left[ \ln(1 + e^{\langle a_{ij}, x \rangle}) - b_{ij} \langle a_{ij}, x \rangle \right] + \frac{\mu}{2} \|x\|^2    (3)

where $\mu > 0$ is a penalty parameter.
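Likewise, a short sketch of (3) as reconstructed above (our code; the numerically stable `logaddexp` form computes $\ln(1 + e^{t})$):

```python
import numpy as np

def logistic_loss_l2(x, A_i, b_i, mu=0.01):
    """l2-regularized logistic loss (3) of client i and its gradient; labels in {0, 1}."""
    d_i = A_i.shape[0]
    t = A_i @ x                                        # <a_ij, x> for all j
    loss = np.mean(np.logaddexp(0.0, t) - b_i * t) + 0.5 * mu * (x @ x)
    grad = A_i.T @ (1.0 / (1.0 + np.exp(-t)) - b_i) / d_i + mu * x
    return loss, grad
```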

The overall loss function can be defined by $f(x) := \sum_{i=1}^m w_i f_i(x)$, where $w_i := d_i / d$ and $d := d_1 + \cdots + d_m$. Federated learning aims to learn a best parameter that attains the minimal overall loss, namely,

\min_{x \in \mathbb{R}^n} f(x)    (4)

Since every $f_i$ is bounded from below, we have

f^* := \min_{x \in \mathbb{R}^n} f(x) > -\infty    (5)

By introducing auxiliary variables $x_i$, $i = 1, \ldots, m$, problem (4) can be equivalently rewritten as

\min_{z, x_1, \ldots, x_m} \sum_{i=1}^m w_i f_i(x_i), \quad \text{s.t. } x_i = z, \; i = 1, \ldots, m    (6)

Throughout the paper, we place our interest on the above optimization problem. For simplicity, we also denote $X := (x_1, \ldots, x_m)$ and

F(X) := \sum_{i=1}^m w_i f_i(x_i)    (7)

It is easy to see that $F(X) = f(z)$ if $x_1 = \cdots = x_m = z$.

II-A ADMM

The background of ADMM can be found in the earliest work [gabay1976dual] and the nice book [boyd2011distributed]. To apply ADMM to problem (6), letting $\Pi := (\pi_1, \ldots, \pi_m)$, we introduce the augmented Lagrange function defined by

L(z, X, \Pi) := \sum_{i=1}^m L_i(z, x_i, \pi_i), \quad L_i(z, x_i, \pi_i) := w_i f_i(x_i) + \langle \pi_i, x_i - z \rangle + \frac{\sigma_i}{2} \|x_i - z\|^2    (8)

Here, $\pi_i \in \mathbb{R}^n$ are the Lagrange multipliers and $\sigma_i > 0$ are penalty parameters. The framework of ADMM for problem (6) is given as follows: for an initialized point $(z^0, X^0, \Pi^0)$, perform the following updates iteratively for every $k \ge 0$:

z^{k+1} = \operatorname{argmin}_z L(z, X^k, \Pi^k), \quad x_i^{k+1} = \operatorname{argmin}_{x_i} L_i(z^{k+1}, x_i, \pi_i^k), \quad \pi_i^{k+1} = \pi_i^k + \sigma_i (x_i^{k+1} - z^{k+1})    (9)
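For intuition, a minimal serial sketch of one iteration of (9) (our illustration; in FL the $x_i$- and $\pi_i$-updates would run on the clients in parallel, and the generic inner solver stands in for the exact sub-problem solve):

```python
import numpy as np
from scipy.optimize import minimize

def admm_iteration(z, X, Pi, f_vals, w, sigma):
    """One exact ADMM iteration (9) for the consensus problem (6).
    f_vals[i](x) returns f_i(x); w and sigma hold the weights and penalties."""
    m = len(X)
    # z-update: closed-form minimizer of the augmented Lagrangian over z
    z = sum(sigma[i] * X[i] + Pi[i] for i in range(m)) / sum(sigma)
    for i in range(m):
        # x_i-update: solve the local sub-problem exactly (expensive in general)
        L_i = lambda x, i=i: (w[i] * f_vals[i](x) + Pi[i] @ (x - z)
                              + 0.5 * sigma[i] * np.sum((x - z) ** 2))
        X[i] = minimize(L_i, X[i]).x
        # dual ascent on the multiplier
        Pi[i] = Pi[i] + sigma[i] * (X[i] - z)
    return z, X, Pi
```

The inner `minimize` call is precisely the per-client cost that the inexact updates of Section III are designed to avoid.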

II-B Stationary points

To end this section, we present the optimality conditions of problems (6) and (4).

Definition II.1.

A point $(z^*, X^*)$ is a stationary point of problem (6) if there exists $\Pi^* = (\pi_1^*, \ldots, \pi_m^*)$ such that

w_i \nabla f_i(x_i^*) + \pi_i^* = 0, \quad \sum_{i=1}^m \pi_i^* = 0, \quad x_i^* = z^*, \quad i = 1, \ldots, m    (10)

A point $z^*$ is a stationary point of problem (4) if it satisfies

\nabla f(z^*) = 0    (11)

Note that any locally optimal solution to problem (6) (resp. (4)) must satisfy (10) (resp. (11)). If $f_i$ is convex for every $i$, then a point is a globally optimal solution to problem (6) (resp. (4)) if and only if it satisfies condition (10) (resp. (11)). Moreover, it is easy to see that a stationary point of problem (6) indicates

\nabla f(z^*) = \sum_{i=1}^m w_i \nabla f_i(x_i^*) = -\sum_{i=1}^m \pi_i^* = 0.

That is, $z^*$ is also a stationary point of problem (4).

III Algorithmic Design

The framework of ADMM in (9) encounters three drawbacks in practice. (i) It repeats the three updates at every step, leading to communication inefficiency: in FL, the framework requires local clients and the central server to communicate at every step, and frequent communication comes at a huge price, such as long learning times and large amounts of resources. (ii) Solving the second sub-problem in (9) can incur expensive computational cost, as it generally does not admit a closed-form solution. (iii) In real applications, some clients may suffer from poor conditions, which leads to computational difficulties; it is necessary to leave them more time to update their parameters. To overcome the above-mentioned drawbacks, we cast a new algorithm in Algorithm 1.

The merits of Algorithm 1 are highlighted as follows.

Given an integer $k_0 > 0$ and constants $\sigma_i > 0$, every client $i$ initializes $x_i^0$ and $\pi_i^0$, and the server initializes $z^0$. for $k = 0, 1, 2, \ldots$ do
       if $k$ is a multiple of $k_0$ then
              Weights upload (communication occurs): all clients upload their parameters to the server. Global aggregation: the server calculates the average parameter $z^{k+1}$ by
(12)
              Weights broadcast (communication occurs): the server broadcasts $z^{k+1}$ to all clients. Client selection: the server randomly selects a new set of clients for training in the next round.
       end if
       for every selected client $i$ do
              Local update: client $i$ updates its parameters by
(13)
(14)
(15)
       end for
       for every unselected client $i$ do
              Local invariance: client $i$ keeps its parameters by
(16)
(17)
(18)
       end for
end for
Algorithm 1 FL via GD and inexact ADMM (FedGiA)
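The control flow of Algorithm 1 can be sketched in code as follows (our reading, with plausible stand-ins for the update rules (12)-(18), namely an inexact ADMM step using a matrix $H_i$ and a single GD step for unselected clients; these are not the paper's verbatim formulas):

```python
import numpy as np

def fedgia_sketch(grads, w, sigma, n, k0=5, K=100, frac=0.5, seed=0):
    """FedGiA pattern: communicate every k0 iterations; selected clients take
    inexact-ADMM steps, the rest take one GD step per communication round.
    grads[i](x) should return the gradient of f_i at x."""
    rng = np.random.default_rng(seed)
    m = len(grads)
    z = np.zeros(n)
    X = [np.zeros(n) for _ in range(m)]
    Pi = [np.zeros(n) for _ in range(m)]
    H = [sigma[i] * np.eye(n) for i in range(m)]  # assumed simple choice of H_i
    selected = set(range(m))
    for k in range(K):
        if k % k0 == 0:
            # communication round: upload, aggregate (stand-in for (12)), broadcast
            z = sum(sigma[i] * X[i] + Pi[i] for i in range(m)) / sum(sigma)
            selected = set(rng.choice(m, max(1, int(frac * m)), replace=False))
            for i in range(m):
                if i not in selected:
                    # GD group: one gradient step per round (stand-in for (16)-(18))
                    X[i] = z - (w[i] / sigma[i]) * grads[i](z)
        for i in selected:
            # ADMM group: inexact update at every iteration (stand-in for (13)-(15))
            X[i] = z - np.linalg.solve(H[i], w[i] * grads[i](z) + Pi[i])
            Pi[i] = Pi[i] + sigma[i] * (X[i] - z)
    return z
```

Within a round $z$ stays fixed, so the selected clients' iterates move only through the multiplier updates; this mirrors the mixed-update description in point (iii) below.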

(i) Communication efficiency: Algorithm 1 shows that communications only occur when $k$ is a multiple of $k_0$, where $k_0$ is a predefined positive integer. Therefore, CR can be reduced by setting a big $k_0$, thereby vastly saving cost. In fact, such an idea has been used extensively in the literature [DeepLearning2015, AsynchronousStochastic2017, stich2018local, yu2019parallel, Lin2020Don, wang2021cooperative].

(ii) Fast computation using inexact updates: We update $x_i^{k+1}$ by (13) instead of solving the second sub-problem in (9) exactly. This accelerates the computation for local clients significantly, as the computation is relatively cheap if the matrix $H_i$ is chosen properly (e.g., as a diagonal matrix). We point out that (13) is a result of

(19)

where the objective in (19) is an approximation of $L_i$, namely,

(20)
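One standard choice of such an approximation, consistent with the surrounding description (a hedged reconstruction on our part, not necessarily the paper's exact (20)), is the quadratic surrogate

\tilde{L}_i(z^{k+1}, x_i, \pi_i^k) := w_i f_i(z^{k+1}) + \big\langle w_i \nabla f_i(z^{k+1}) + \pi_i^k, \; x_i - z^{k+1} \big\rangle + \tfrac{1}{2} \| x_i - z^{k+1} \|_{H_i}^2,

whose minimizer is the closed-form update $x_i^{k+1} = z^{k+1} - H_i^{-1}(w_i \nabla f_i(z^{k+1}) + \pi_i^k)$, which is indeed cheap to evaluate when $H_i$ is diagonal.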

(iii) Mixed updates: At every communication round of Algorithm 1, all clients are divided into two groups. Clients in the selected group update their parameters $k_0$ times based on inexact ADMM, while clients outside it update their parameters just once based on GD. This suggests that, in real applications, the server should try to select clients with good devices for the first group and put the rest in the second group. This leaves clients in the second group more time to update their parameters, since they only need to update them once.

We would like to point out that FedAvg [mcmahan2017communication, li2019convergence] and FedProx [li2020federatedprox] also select a portion of the devices to join the training in each communication round: they randomly select a subset of clients, only the clients in this subset update their parameters, and the rest remain unchanged. Differing from that, FedGiA asks clients outside the selected set to update their parameters once during every $k_0$ steps.

IV Convergence Analysis

To establish the convergence, we need one assumption.

Assumption IV.1.

Every $f_i$ is gradient Lipschitz continuous with a constant $r_i > 0$.

Assumption IV.1 implies that there is always a matrix $H_i$ satisfying $H_i \succeq w_i r_i I$ such that

(21)

for any $x$ and $z$. Apparently, many choices of $H_i$ satisfy the above condition (e.g., $H_i = w_i r_i I$). In the subsequent convergence analysis, we suppose that every client chooses such an $H_i$.
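For intuition, the type of bound (21) refers to is plausibly the weighted descent lemma (our reconstruction from Assumption IV.1, not the paper's exact display):

w_i f_i(x) \;\le\; w_i f_i(z) + \big\langle w_i \nabla f_i(z), \, x - z \big\rangle + \tfrac{1}{2} \| x - z \|_{H_i}^2 \qquad \text{for all } x, z \in \mathbb{R}^n,

which holds for any $H_i \succeq w_i r_i I$, because gradient Lipschitz continuity with constant $r_i$ gives $f_i(x) \le f_i(z) + \langle \nabla f_i(z), x - z \rangle + \frac{r_i}{2}\|x - z\|^2$.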

IV-A Global convergence

For notational convenience, hereafter we denote

(22)

With the help of Assumption IV.1, our first result shows that the whole sequences of $\{L(z^k, X^k, \Pi^k)\}$, $\{F(X^k)\}$, and $\{f(z^k)\}$ converge.

Theorem IV.1.

Let $\{(z^k, X^k, \Pi^k)\}$ be the sequence generated by Algorithm 1 with $k_0 \ge 1$ and properly chosen $\sigma_i$. The following results hold under Assumption IV.1.

  • The three sequences $\{L(z^k, X^k, \Pi^k)\}$, $\{F(X^k)\}$, and $\{f(z^k)\}$ converge to the same value, namely,

    (23)
  • The consensus residuals and the successive changes of the iterates eventually vanish, namely,

    (24)

Theorem IV.1 states that the objective function values converge, and its establishment does not rely on the random selection of the client groups. In the theorem below, we examine the convergence of the sequence itself. To proceed, we need an assumption on the boundedness of the following level set

(25)

for a given constant. We point out that the boundedness of a level set is frequently used in establishing the convergence properties of optimization algorithms. Many functions satisfy this condition, such as coercive functions (a continuous function $g$ is coercive if $g(x) \to +\infty$ when $\|x\| \to \infty$).

Theorem IV.2.

Let $\{(z^k, X^k, \Pi^k)\}$ be the sequence generated by Algorithm 1 with $k_0 \ge 1$ and properly chosen $\sigma_i$. The following results hold under Assumption IV.1 and the boundedness of the level set (25).

  • The sequence $\{(z^k, X^k, \Pi^k)\}$ is bounded, and any of its accumulating points, $(z^*, X^*, \Pi^*)$, is a stationary point of (6), where $z^*$ is a stationary point of (4).

  • If we further assume that $(z^*, X^*, \Pi^*)$ is isolated, then the whole sequence converges to it.

It is noted that if $f$ is locally strongly convex at $z^*$, then $z^*$ is unique and hence isolated; however, being isolated is a weaker assumption than local strong convexity. It is worth mentioning that the establishment of Theorem IV.2 does not require the convexity of $f$ or $f_i$; because of this, the sequence is only guaranteed to converge to a stationary point of problems (6) and (4). If we further assume the convexity of every $f_i$, then the sequence converges to an optimal solution to problems (6) and (4), as stated in the following corollary.

Corollary IV.1.

Let $\{(z^k, X^k, \Pi^k)\}$ be the sequence generated by Algorithm 1 with $k_0 \ge 1$ and properly chosen $\sigma_i$. The following results hold under Assumption IV.1, the boundedness of the level set (25), and the convexity of every $f_i$.

  • The three sequences $\{L(z^k, X^k, \Pi^k)\}$, $\{F(X^k)\}$, and $\{f(z^k)\}$ converge to the optimal function value of (4), namely,

    (26)
  • Any accumulating point $(z^*, X^*)$ of the sequence is an optimal solution to (6), where $z^*$ is an optimal solution to (4).

  • If we further assume that $f$ is strongly convex, then the whole sequence converges to the unique optimal solution $(z^*, X^*)$ to (6), where $z^*$ is the unique optimal solution to (4).

Remark IV.1.

Regarding the assumptions in Corollary IV.1, we note that $f$ being strongly convex does not require every $f_i$ to be strongly convex: if one of the $f_i$s is strongly convex and the remaining ones are convex, then $f$ is strongly convex. Moreover, strong convexity suffices for the boundedness of the level set (25) at any level. Therefore, under strong convexity, the assumption on the boundedness of the level set can be exempted.

IV-B Complexity analysis

Finally, we investigate the convergence speed of the proposed Algorithm 1. The following result states that the minimal optimality residual over the first $k$ iterations vanishes at a linear rate.

Theorem IV.3.

Let $\{(z^k, X^k, \Pi^k)\}$ be the sequence generated by Algorithm 1 with $k_0 \ge 1$ and properly chosen $\sigma_i$. If Assumption IV.1 holds, then the minimal optimality residual over the first $k$ iterations vanishes at a linear rate whose constant is given by (39).

We would like to point out that the establishment of such a convergence rate only requires the assumption of gradient Lipschitz continuity, namely, Assumption IV.1. Moreover, the rate deteriorates as $k_0$ grows. This is what we expected: the larger $k_0$ is, the more iterations are required to converge, as stated below.

Remark IV.2.

Theorem IV.3 hints that Algorithm 1 should be terminated if

(27)

where $\epsilon > 0$ is a given tolerance. Therefore, after

(28)

iterations, Algorithm 1 meets (27), and the total CR is

(29)

V Numerical Experiments

This section conducts numerical experiments to demonstrate the performance of FedGiA in Algorithm 1. All experiments are implemented in MATLAB (R2019a) on a laptop with 32GB of memory and a 2.3GHz CPU.

V-a Testing example

We use Example II.1 with synthetic data and Example II.2 with real data to conduct the numerical experiments.

Example V.1 (Linear regression with non-i.i.d. data).

For this problem, the local clients have objective functions as in (2). We randomly generate samples from three distributions: the standard normal distribution, the Student's $t$ distribution, and the uniform distribution. We then shuffle all samples and divide them into $m$ parts for $m$ clients. The data size of each part, $d_i$, is randomly chosen. For simplicity, we fix the dimension $n$ but choose several values of $m$. In this regard, each client has non-i.i.d. data.
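A sketch of this data-generation recipe (our code; the per-client size range, the $t$-distribution degrees of freedom, and the uniform range are assumptions, since the paper's exact values are not shown above):

```python
import numpy as np

def make_noniid_splits(n=100, m=64, seed=0):
    """Pool samples from three distributions, shuffle, and split unevenly
    across m clients, yielding non-i.i.d. local datasets."""
    rng = np.random.default_rng(seed)
    sizes = rng.integers(50, 140, m)                 # per-client sample counts (assumed range)
    total = int(sizes.sum())
    third = total // 3 + 1
    pools = [rng.standard_normal((third, n)),        # standard normal
             rng.standard_t(df=5, size=(third, n)),  # Student's t (df assumed)
             rng.uniform(0.0, 1.0, (third, n))]      # uniform (range assumed)
    A = rng.permutation(np.vstack(pools))[:total]    # shuffle, then trim
    return np.split(A, np.cumsum(sizes)[:-1])        # one non-i.i.d. block per client
```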

Example V.2 (Logistic regression).

For this problem, the local clients have objective functions as in (3), where the penalty $\mu$ is fixed in our numerical experiments. We use the two real datasets described in Table I to generate the data. We randomly split the samples into $m$ groups corresponding to $m$ clients.

Data  Dataset                         Source   n     d
qot   Qsar oral toxicity              UCI      1024  8992
sct   Santander customer transaction  Kaggle   200   200000
TABLE I: Descriptions of two real datasets ($n$: number of features, $d$: number of samples).

V-B Implementations

As mentioned in Remark IV.2, we terminate FedGiA when the maximum number of iterations is reached or the solution satisfies

(30)

and we initialize all variables at zero. At every communication round, we randomly select a fraction of the clients to form the training group, where choosing the fraction as 1 means all clients are selected. Parameters $\sigma_i$ and $H_i$ are set as follows: Theorem IV.1 suggests how $\sigma_i$ should be chosen, as given in Table II. Finally, $H_i$ is chosen as in Table II, where FedGiA-G and FedGiA-D represent FedGiA with $H_i$ opted as a Gram matrix and a diagonal matrix, respectively.

             FedGiA-G    FedGiA-D
Example V.1
Example V.2
TABLE II: Choices of $\sigma_i$ and $H_i$.
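The per-round client sampling mentioned above can be as simple as the following helper (our illustration; `frac` stands for the selection fraction, whose experimental value is an assumption here):

```python
import numpy as np

def select_clients(m, frac, rng=None):
    """Sample the training group for the next round; frac = 1.0 selects all m clients."""
    rng = rng or np.random.default_rng()
    size = max(1, int(round(frac * m)))
    return set(rng.choice(m, size=size, replace=False))
```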

V-C Numerical performance

In this part, we conduct simulations to demonstrate the performance of FedGiA, including its global convergence, convergence rate, and the effects of $k_0$ and $m$. To measure the performance, we report the following factors: the objective value, the error, CR, and the computational time (in seconds). We only report results of FedGiA solving Example V.1 and omit those for Example V.2, as they show similar observations.

V-C1 Global convergence with rate

We fix $m$ and the other parameters, and present the results in Figure 1. From the left sub-figure, as expected, all lines eventually tend to the same objective function value, corroborating Theorem IV.1. It is clear that the bigger the value of $k_0$ (i.e., the wider the gap between two global aggregations), the more iterations are required to reach the optimal function value. From the right sub-figure, all errors vanish gradually as the iterations increase, and the bigger the value of $k_0$, the more iterations are required to converge, which justifies the result in Theorem IV.3 that the convergence rate relies on $k_0$.

Fig. 1: Objective function values and errors vs. iterations. FedGiA-G (solid lines) and FedGiA-D (dashed lines) solve Example V.1.

V-C2 Effect of $k_0$

Next, we would like to see how the choice of $k_0$ impacts the performance of FedGiA. To proceed, for each dimension of the dataset, we generate 20 instances of Example V.1, solve them by FedGiA with the other parameters fixed, and report the average results in Figure 2. As $k_0$ increases, CR first decreases and then stabilizes at a certain level. Hence, setting a proper $k_0$ efficiently saves communication costs; however, it is unnecessary to set $k_0$ too big, as that results in longer computational time. In comparison with FedGiA-G, FedGiA-D needs more CR for small $k_0$ and fewer CR for large $k_0$, but it always runs faster.

Fig. 2: Effect of $k_0$ for FedGiA-G (solid lines) and FedGiA-D (dashed lines) solving Example V.1.

V-C3 Effect of $m$

Finally, we would like to see how the number of clients $m$ impacts the performance of FedGiA. We alter $m$ and report the average results in Figure 3. We observe that $m$ does not have a big influence on CR once it exceeds a moderate value. As expected, the larger $m$, the longer the computational time in most cases. Generally speaking, FedGiA-D needs fewer CR than FedGiA-G when $m$ is small and more CR when $m$ is large, but it always runs faster.

Fig. 3: Effect of $m$. FedGiA-G (solid lines) and FedGiA-D (dashed lines) solve Example V.1.

V-D Numerical comparison

In this part, we compare our proposed method with FedAvg [mcmahan2017communication] and LocalSGD [stich2018local, Lin2020Don]. For the former, we use its non-stochastic version; precisely, we select all clients for training in each communication round and use the full local dataset to calculate the gradient, with separate learning rates for Examples V.1 and V.2. For LocalSGD, as suggested by [Lin2020Don], we use a small mini-batch to approximate the gradient for every local client, again with separate learning rates for the two examples. For FedGiA, we fix the size of the selected client set. To ensure relatively fair comparisons, we terminate all methods when condition (30) is satisfied or when CR exceed 1000.

Fig. 4: Objective function values vs. CR.

V-D1 Solving Example V.1

For simplicity, we fix the problem dimensions. From the left sub-figure in Figure 4, we can see that (i) the objective function values for all methods eventually tend to be the same; (ii) basically, the larger $k_0$, the faster the decline of the objective function values; (iii) FedGiA-G and FedGiA-D behave better than FedAvg, which outperforms LocalSGD. We then run 20 independent trials and report the average results in Table III, from which we conclude that FedGiA-G and FedGiA-D use the fewest CR and run the fastest.

                    Ex. V.1               Ex. V.2 with qot      Ex. V.2 with sct
Alg.       $k_0$    Obj.   CR    Time     Obj.   CR    Time     Obj.   CR    Time
FedAvg     1        1.684  1000  1.17     0.260  1000  9.30     0.327  96.6  4.65
           5        1.684  1000  3.27     0.237  572   14.5     0.327  20.0  2.75
           10       1.684  1000  5.64     0.237  289   13.0     0.327  10.0  2.49
LocalSGD   1        1.686  1000  2.95     0.325  1000  13.2     0.333  1000  58.9
           5        1.685  1000  11.0     0.302  1000  44.1     0.331  1000  196
           10       1.684  1000  19.5     0.299  1000  81.5     0.330  1000  369
FedGiA-G   1        1.684  13.6  0.16     0.236  19.8  0.60     0.326  5.00  0.59
           5        1.684  7.40  0.20     0.236  19.9  1.35     0.325  5.00  0.97
           10       1.684  6.10  0.25     0.236  19.9  2.10     0.325  5.00  1.45
FedGiA-D   1        1.684  18.4  0.13     0.236  19.9  0.44     0.326  5.00  0.49
           5        1.684  10.1  0.11     0.236  19.9  0.63     0.325  4.90  0.76
           10       1.684  7.00  0.11     0.236  19.9  0.87     0.325  4.90  1.05
TABLE III: Comparison of the four algorithms (Obj.: objective value; CR: communication rounds; Time: in seconds).

V-D2 Solving Example V.2

Again, we fix the same settings for simplicity. From the two right sub-figures in Figure 4, we observe that the objective function values obtained by FedGiA decline the fastest. In addition, FedGiA uses the fewest CR to reach the optimal solutions, followed by FedAvg, while LocalSGD needs the most. We then run 20 independent trials and report the average results in Table III, where FedGiA outperforms the other two algorithms by consuming the fewest CR and running the fastest.

VI Conclusion

This paper developed a new FL algorithm that addresses three key issues in FL: saving communication resources, reducing computational complexity, and establishing convergence properties under mild assumptions. These advantages suggest that the proposed algorithm may be practical for many real applications, such as mobile edge computing [mao2017survey, mach2017mobile], over-the-air computation [zhu2018mimo, yang2020federated], vehicular communications [samarakoon2019distributed], and unmanned aerial vehicle online path control [shiri2020communication], among others. Moreover, we believe that the algorithmic schemes and the techniques used to build the convergence theory could also be valid for tackling decentralized FL [elgabli2020fgadmm, ye2021decentralized]. We leave these for future research.

References

Appendix A Some Useful Properties

For notational simplicity, we use the following facts. For any vectors $a$ and $b$, matrix $H \succeq 0$, and scalar $t > 0$, we have

(31)

By the Mean Value Theorem, the gradient Lipschitz continuity (1) indicates that for any $x$ and $y$,

|f(x) - f(y) - \langle \nabla f(y), x - y \rangle| \le \frac{L}{2} \|x - y\|^2    (32)

Appendix B Proofs of all theorems

B-A Key lemmas

Lemma B.1.

Let $\{(z^k, X^k, \Pi^k)\}$ be the sequence generated by Algorithm 1 with properly chosen $\sigma_i$. The following results hold under Assumption IV.1.

  • For any $i$ and $k$,

    (33)
  • For any $i$ and $k$,

    (34)
  • For any $i$ and $k$,

    (35)
Proof.

a) For any client $i$ in the selected group, we have from (15) that

(36)

For any client $i$ outside the group, it follows from (16)-(18) that the above relation is still valid. Hence, we have (36) for any $i$ and for any $k$. As a result, (33) holds for any $i$ and $k$.

b) For any client $i$ in the selected group, the solution in (13) satisfies (19), thereby contributing to

(37)

For any client $i$ outside the group, the second equation in (37) is still valid by (16)-(18). Hence, it is true for any $i$ and any $k$.

c) It follows from (34) that