Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning

01/27/2021 ∙ by Haibo Yang, et al. ∙ The Ohio State University

Federated learning (FL) is a distributed machine learning architecture that leverages a large number of workers to jointly learn a model with decentralized data. FL has received increasing attention in recent years thanks to its data privacy protection, communication efficiency and a linear speedup for convergence in training (i.e., convergence performance increases linearly with respect to the number of workers). However, existing studies on linear speedup for convergence are limited to the assumptions of i.i.d. datasets across workers and/or full worker participation, both of which rarely hold in practice. So far, it remains an open question whether the linear speedup for convergence is achievable under non-i.i.d. datasets with partial worker participation in FL. In this paper, we show that the answer is affirmative. Specifically, we show that the federated averaging (FedAvg) algorithm (with two-sided learning rates) on non-i.i.d. datasets in non-convex settings achieves a convergence rate 𝒪(1/√(mKT) + 1/T) for full worker participation and a convergence rate 𝒪(1/√(nKT) + 1/T) for partial worker participation, where K is the number of local steps, T is the total number of communication rounds, m is the total number of workers and n is the number of workers in one communication round under partial worker participation. Our results also reveal that the local steps in FL could help the convergence and show that the maximum number of local steps can be improved to T/m. We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.

1 Introduction

Federated Learning (FL) is a distributed machine learning paradigm that leverages a large number of workers to collaboratively learn a model with decentralized data under the coordination of a centralized server. Formally, the goal of FL is to solve an optimization problem, which can be decomposed as:

min_{x ∈ ℝ^d} f(x) := (1/m) Σ_{i=1}^{m} F_i(x),

where F_i(x) := 𝔼_{ξ_i ∼ D_i}[F_i(x, ξ_i)] is the local (non-convex) loss function associated with a local data distribution D_i and m is the number of workers. FL allows a large number of workers (such as edge devices) to participate flexibly without sharing data, which helps protect data privacy. However, it also introduces two unique challenges unseen in traditional distributed learning algorithms that are typically used in large data centers:

  • Non-independent-identically-distributed (non-i.i.d.) datasets across workers (data heterogeneity): In conventional distributed learning in data centers, the distribution for each worker’s local dataset can usually be assumed to be i.i.d., i.e., D_i = D for every worker i, where D_i denotes the local data distribution of worker i. Unfortunately, this assumption rarely holds for FL since data are generated locally at the workers based on their circumstances, i.e., D_i ≠ D_j for i ≠ j. It will be seen later that the non-i.i.d. assumption imposes significant challenges in algorithm design for FL and in the corresponding performance analysis.

  • Time-varying partial worker participation (systems non-stationarity): With the flexibility for workers’ participation in many scenarios (particularly in mobile edge computing), workers may randomly join or leave the FL system at will, thus rendering the active worker set stochastic and time-varying across communication rounds. Hence, it is often infeasible to wait for all workers’ responses as in traditional distributed learning, since inactive workers or stragglers will significantly slow down the whole training process. As a result, only a subset of the workers may be chosen by the server in each communication round, i.e., partial worker participation.

In recent years, the Federated Averaging method (FedAvg) and its variants (McMahan et al., 2016; Li et al., 2018; Hsu et al., 2019; Karimireddy et al., 2019; Wang et al., 2019a) have emerged as a prevailing approach for FL. Similar to traditional distributed learning, FedAvg leverages local computation at each worker and employs a centralized parameter server to aggregate and update the model parameters. The unique feature of FedAvg is that, between two consecutive communication rounds, each worker runs multiple local stochastic gradient descent (SGD) steps rather than just one step as in traditional distributed learning. For i.i.d. datasets and the full worker participation setting, Stich (2018) and Yu et al. (2019b) proposed two variants of FedAvg that achieve a convergence rate of with a bounded gradient assumption for both strongly convex and non-convex problems, where m is the number of workers, K is the number of local update steps and T is the total number of communication rounds. Wang and Joshi (2018) and Stich and Karimireddy (2019) further proposed improved FedAvg algorithms to achieve a rate without bounded gradient assumption. Notably, for a sufficiently large T, the above rates become 𝒪(1/√(mKT)) (this rate also matches the convergence rate order of parallel SGD in conventional distributed learning), which implies a linear speedup with respect to the number of workers. (To attain ε accuracy, an algorithm needs to take 𝒪(1/ε²) steps with a convergence rate 𝒪(1/√T), while needing 𝒪(1/(mε²)) steps if the convergence rate is 𝒪(1/√(mT)), where the hidden constant in Big-O is the same. In this sense, one achieves a linear speedup with respect to the number of workers.) This linear speedup is highly desirable for an FL algorithm because the algorithm is able to effectively leverage the massive parallelism in a large FL system. However, with non-i.i.d. datasets and partial worker participation in FL, a fundamental open question arises: Can we still achieve the same linear speedup for convergence, i.e., 𝒪(1/√(mKT)), with non-i.i.d. datasets and under either full or partial worker participation?
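The linear-speedup arithmetic can be checked numerically. The sketch below is only an illustration: it assumes a rate of the exact form c/√(mT) with a hypothetical constant c, and the helper name `rounds_needed` is ours, not the paper's.

```python
import math

def rounds_needed(eps, m=1, c=1.0):
    """Smallest T with c / sqrt(m * T) <= eps (c is a hypothetical constant)."""
    return math.ceil((c / eps) ** 2 / m)

eps = 0.01
single = rounds_needed(eps, m=1)     # rate of the form c / sqrt(T)
parallel = rounds_needed(eps, m=10)  # rate of the form c / sqrt(m * T)

# The ratio is (up to rounding) the number of workers m.
print(single, parallel, single / parallel)
```

With the same constant c, the number of steps to reach a given accuracy drops by a factor of roughly m, which is what "linear speedup" refers to.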

In this paper, we show the answer to the above question is affirmative. Specifically, we show that a generalized FedAvg with two-sided learning rates achieves linear convergence speedup with non-i.i.d. datasets and under full/partial worker participation. We highlight our contributions as follows:

  • For non-convex problems, we show that the convergence rates of the FedAvg algorithm on non-i.i.d. datasets are 𝒪(1/√(mKT) + 1/T) and 𝒪(1/√(nKT) + 1/T) for full and partial worker participation, respectively, where n is the size of the partially participating worker set. This indicates that our proposed algorithm achieves a linear speedup for convergence rate for a sufficiently large T. When reduced to the i.i.d. case, our convergence rate is , which is also better than previous works. We summarize the convergence rate comparisons for both i.i.d. and non-i.i.d. cases in Table 1. It is worth noting that our proof does not require the bounded gradient assumption. We note that the SCAFFOLD algorithm (Karimireddy et al., 2019) also achieves the same rate but extra variance reduction operations are required, which lead to high communication costs and implementation complexity. By contrast, we do not have such extra requirements in this paper.

  • In order to achieve a linear speedup, i.e., a convergence rate 𝒪(1/√(mKT)), we show that the number of local updates K can be as large as T/m, which improves the result previously shown in Yu et al. (2019a) and Karimireddy et al. (2019). As shown later in the communication complexity comparison in Table 1, a larger number of local steps implies relatively fewer communication rounds, thus less communication overhead. Interestingly, our results also indicate that the number of local updates K does not hurt but rather helps the convergence with a proper choice of learning rates. This overcomes the limitation as suggested in Li et al. (2019b) that local SGD steps might slow down the convergence ( for strongly convex case). This result also reveals new insights on the relationship between the number of local steps and learning rate.

Notation. In this paper, we let m be the total number of workers and S_t be the set of active workers for the t-th communication round with size |S_t| = n for some n ∈ (0, m]. (For simplicity and ease of presentation in this paper, we let |S_t| = n. We note that this is not a restrictive condition and our proofs and results still hold for |S_t| ≥ n, which can be easily satisfied in practice.) We use K to denote the number of local steps per communication round at each worker. We let T be the number of total communication rounds. In addition, we use boldface to denote matrices/vectors. We let x_{t,k}^i represent the parameter of the k-th local step in the i-th worker after the t-th communication. We use ‖·‖₂ to denote the ℓ₂-norm. For a natural number m, we use [m] to represent the set {1, …, m}.

The rest of the paper is organized as follows. In Section 2, we review the literature to put our work in comparative perspectives. Section 3 presents the convergence analysis for our proposed algorithm. Section 4 discusses the implication of the convergence rate analysis. Section 5 presents numerical results and Section 6 concludes this paper. Due to space limitation, the details of all proofs and some experiments are provided in the supplementary material.

2 Related work

Dataset | Algorithm⁶ | Convexity⁷ | Partial Worker | Convergence Rate | Communication Complexity
IID | Stich1 | SC | | |
IID | Yu1 | NC | | |
IID | Wang | NC | | |
IID | Stich2 | NC | | |
IID | This paper | NC | | |
Non-IID | Khaled¹ | C | | |
Non-IID | Yu2² | NC | | |
Non-IID | Li | SC | | |
Non-IID | Karimireddy³ | NC | | |
Non-IID | Karimireddy⁴ | NC | | |
Non-IID | This paper⁵ | NC | | |

  • ¹ Full gradients are used for each worker.

  • ² Local momentum is used at each worker.

  • ³ A FedAvg algorithm with two-sided learning rates. . () for full (partial) worker participation.

  • ⁴ The SCAFFOLD algorithm in Karimireddy et al. (2019) for non-convex case.

  • ⁵ The convergence rate becomes 𝒪(1/√(nKT) + 1/T) under partial worker participation.

  • ⁶ Shorthand notation for references: Stich1 := Stich (2018), Yu1 := Yu et al. (2019b), Wang := Wang and Joshi (2018), Stich2 := Stich and Karimireddy (2019), Khaled := Khaled et al. (2019b), Yu2 := Yu et al. (2019a), Li := Li et al. (2019b), and Karimireddy := Karimireddy et al. (2019).

  • ⁷ Shorthand notation for convexity: SC: Strongly Convex, C: Convex, and NC: Non-Convex.

Table 1: Convergence rates of optimization methods for FL.

The federated averaging (FedAvg) algorithm was first proposed by McMahan et al. (2016) for FL as a heuristic to improve communication efficiency and data privacy. Since then, this work has sparked many follow-ups that focus on FL with i.i.d. datasets and full worker participation (also known as LocalSGD (Stich, 2018; Yu et al., 2019b; Wang and Joshi, 2018; Stich and Karimireddy, 2019; Lin et al., 2018; Khaled et al., 2019a; Zhou and Cong, 2017)). Under these two assumptions, most of the theoretical works can achieve a linear speedup for convergence, i.e., 𝒪(1/√(mKT)) for a sufficiently large T, matching the rate of parallel SGD. In addition, LocalSGD is empirically shown to be communication-efficient and to enjoy better generalization performance (Lin et al., 2018). For a comprehensive introduction to FL, we refer readers to Li et al. (2019a) and Kairouz et al. (2019).

For non-i.i.d. datasets, many works (Sattler et al., 2019; Zhao et al., 2018; Li et al., 2018; Wang et al., 2019a; Karimireddy et al., 2019; Huang et al., 2018; Jeong et al., 2018) heuristically demonstrated the performance of FedAvg and its variants. On convergence rate with full worker participation, many works (Stich et al., 2018; Yu et al., 2019a; Wang and Joshi, 2018; Karimireddy et al., 2019; Reddi et al., 2020) can achieve linear speedup, but their convergence rate bounds could be improved as shown in this paper. On convergence rate with partial worker participation, Li et al. (2019b) showed that the original FedAvg can achieve for strongly convex functions, which suggests that local SGD steps slow down the convergence in the original FedAvg. Karimireddy et al. (2019) analyzed a generalized FedAvg with two-sided learning rates in strongly convex, convex and non-convex cases. However, as shown in Table 1, none of them indicates that linear speedup is achievable with non-i.i.d. datasets under partial worker participation. Note that the SCAFFOLD algorithm (Karimireddy et al., 2019) can achieve linear speedup but extra variance reduction operations are required, which lead to high communication costs and implementation complexity. In this paper, we show that this linear speedup can be achieved without any extra requirements. For more detailed comparisons and other algorithmic variants in FL and decentralized settings, we refer readers to Kairouz et al. (2019).

3 Linear Speedup of the Generalized FedAvg with Two-Sided Learning Rates for Non-IID Datasets

In this paper, we consider a FedAvg algorithm with two-sided learning rates as shown in Algorithm 1, which is generalized from previous works (Karimireddy et al., 2019; Reddi et al., 2020). Here, each worker performs multiple SGD steps using a worker optimizer to minimize the local loss on its own dataset, while the server aggregates and updates the global model using another gradient-based server optimizer based on the returned parameters. Specifically, between two consecutive communication rounds, each worker performs K SGD steps with the worker’s local learning rate η_L. We assume an unbiased estimator in each step, which is denoted by g_{t,k}^i = ∇F_i(x_{t,k}^i, ξ_{t,k}^i), where ξ_{t,k}^i is a random local data sample for the k-th step after the t-th communication round at worker i. Then, each worker sends the accumulative parameter difference Δ_t^i to the server. On the server side, the server aggregates all the available Δ_t^i and updates the model parameters with a global learning rate η. The FedAvg algorithm with two-sided learning rates provides a natural way to decouple the learning of worker and server, thus utilizing different learning rate schedules on the worker and server sides. The original FedAvg can be viewed as a special case of this framework with the learning rate on the server side being one.

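  Initialize x_0
  for t = 0, …, T−1 do
     The server samples a subset S_t of workers with |S_t| = n.
     for each worker i ∈ S_t in parallel do
        x_{t,0}^i = x_t
        for k = 0, …, K−1 do
           Compute an unbiased estimate g_{t,k}^i = ∇F_i(x_{t,k}^i, ξ_{t,k}^i) of ∇F_i(x_{t,k}^i).
           Local worker update (local learning rate η_L): x_{t,k+1}^i = x_{t,k}^i − η_L g_{t,k}^i.
        end for
        Let Δ_t^i = x_{t,K}^i − x_t. Send Δ_t^i to the server.
     end for
     At Server: Receive Δ_t^i, i ∈ S_t. Let Δ_t = (1/|S_t|) Σ_{i ∈ S_t} Δ_t^i. Server Update (global learning rate η): x_{t+1} = x_t + η Δ_t. Broadcast x_{t+1} to workers.
  end for
Algorithm 1 A Generalized FedAvg Algorithm with Two-Sided Learning Rates.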
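The structure of Algorithm 1 can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: it assumes scalar quadratic local losses F_i(x) = 0.5 (x − target_i)², Gaussian gradient noise, and uniform sampling without replacement. The names `fedavg_two_sided`, `targets`, `eta_l`, `eta` and `noise` are hypothetical.

```python
import random

def fedavg_two_sided(targets, n, K, T, eta_l, eta, noise=0.1, seed=0):
    """Toy scalar FedAvg with a local rate eta_l and a global (server) rate eta.

    Assumption: worker i has loss F_i(x) = 0.5 * (x - targets[i]) ** 2.
    """
    rng = random.Random(seed)
    m = len(targets)
    x = 0.0  # global model x_0
    for _ in range(T):
        active = rng.sample(range(m), n)  # n distinct workers, uniformly at random
        deltas = []
        for i in active:
            x_local = x  # each active worker starts from the global model
            for _ in range(K):
                # Unbiased stochastic gradient: exact gradient plus zero-mean noise.
                g = (x_local - targets[i]) + rng.gauss(0.0, noise)
                x_local -= eta_l * g
            deltas.append(x_local - x)  # accumulated parameter difference
        x += eta * sum(deltas) / len(deltas)  # server update
    return x

targets = [float(i) for i in range(10)]  # different local optima (non-i.i.d. flavour)
x_partial = fedavg_two_sided(targets, n=5, K=5, T=200, eta_l=0.05, eta=1.0)
x_full = fedavg_two_sided(targets, n=10, K=5, T=200, eta_l=0.05, eta=1.0)
print(x_partial, x_full, sum(targets) / len(targets))
```

In this toy setting the global minimizer is the mean of `targets`. Full participation settles close to it, while partial participation fluctuates around it because each round sees only a random subset of workers. Setting `eta=1.0` recovers plain averaging of the local models.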

In what follows, we show that a linear speedup for convergence is achievable by the generalized FedAvg for non-convex functions on non-i.i.d. datasets. We first state our assumptions as follows.

Assumption 1.

(L-Lipschitz Continuous Gradient) There exists a constant L > 0, such that ‖∇F_i(x) − ∇F_i(y)‖ ≤ L‖x − y‖ for all x, y ∈ ℝ^d and all workers i.

Assumption 2.

(Unbiased Local Gradient Estimator) Let ξ_t^i be a random local data sample in the t-th step at the i-th worker. The local gradient estimator is unbiased, i.e., 𝔼[∇F_i(x_t, ξ_t^i)] = ∇F_i(x_t) for every worker i, where the expectation is over the local dataset samples.

Assumption 3.

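(Bounded Local and Global Variance) There exist two constants σ_L and σ_G, such that the variance of each local gradient estimator is bounded by 𝔼[‖∇F_i(x_t, ξ_t^i) − ∇F_i(x_t)‖²] ≤ σ_L² for every worker i, and the global variability of the local gradient of the cost function is bounded by ‖∇F_i(x_t) − ∇f(x_t)‖² ≤ σ_G² for every worker i and every t.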

The first two assumptions are standard in non-convex optimization (Ghadimi and Lan, 2013; Bottou et al., 2018). For Assumption 3, the bounded local variance is also a standard assumption. We use a universal bound σ_G on the global variability to quantify the heterogeneity of the non-i.i.d. datasets among different workers. In particular, σ_G = 0 corresponds to i.i.d. datasets. This assumption is also used in other works for FL under non-i.i.d. datasets (Reddi et al., 2020; Yu et al., 2019b; Wang et al., 2019b) as well as in decentralized optimization (Kairouz et al., 2019). It is worth noting that we do not require a bounded gradient assumption, which is often used in FL optimization analysis.
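The global-variability quantity in Assumption 3 can be made concrete. The sketch below is illustrative only: it evaluates the largest squared deviation of a local gradient from the average gradient at a single point, using made-up gradient vectors. The helper name `global_variability` is hypothetical.

```python
def global_variability(local_grads):
    """Max over workers of ||grad F_i(x) - grad f(x)||^2 at one point x.

    local_grads: list of per-worker gradient vectors (lists of floats),
    all evaluated at the same x; grad f(x) is taken as their average.
    """
    m = len(local_grads)
    dim = len(local_grads[0])
    mean_grad = [sum(g[d] for g in local_grads) / m for d in range(dim)]
    return max(
        sum((g[d] - mean_grad[d]) ** 2 for d in range(dim))
        for g in local_grads
    )

iid_grads = [[1.0, -2.0]] * 4  # identical local gradients
non_iid_grads = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(global_variability(iid_grads))      # 0.0
print(global_variability(non_iid_grads))  # 1.0
```

Identical local gradients give zero variability, matching the i.i.d. case, while disagreeing gradients give a positive value that a universal bound would have to dominate at every iterate.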

3.1 Convergence analysis for full worker participation

In this subsection, we first analyze the convergence rate of the generalized FedAvg with two-sided learning rates under full worker participation, for which we have the following result:

Theorem 1.

Let constant local and global learning rates η_L and η be chosen as such that and . Under Assumptions 1–3 and with full worker participation, the sequence of outputs generated by Algorithm 1 satisfies:

where , is a constant, , and the expectation is over the local dataset samples among workers.

Remark 1.

The convergence bound contains two parts: a term that vanishes as T increases and a constant term whose size depends on the problem instance parameters and is independent of T. The vanishing term’s decay rate matches that of the typical SGD methods.

Remark 2.

The first part of the constant term is due to the local stochastic gradients at each worker, which shrinks as the number of workers m increases. The cumulative variance of the K local steps contributes to the second part of the constant term, which is independent of m and largely affected by the data heterogeneity. To make the second part small, an inverse relationship between the local learning rate η_L and the number of local steps K should be satisfied, i.e., η_L = 𝒪(1/K). Specifically, note that the global and local variances are linearly amplified by K. This requires a sufficiently small η_L to offset the variance between two successive communication rounds and keep the second part of the constant term small. This is consistent with the observation in strongly convex FL that a decaying learning rate is needed for FL to converge under non-i.i.d. datasets even if full gradients are used in each worker (Li et al., 2019b). However, we note that our explicit inverse relationship between η_L and K in the above is new. Intuitively, the K local steps with a sufficiently small η_L can be viewed as one SGD step with a large learning rate.

With Theorem 1, we immediately have the following convergence rate for the generalized FedAvg algorithm with a proper choice of two-sided learning rates:

Corollary 1.

Let and . The convergence rate of the generalized FedAvg algorithm under full worker participation is 𝒪(1/√(mKT) + 1/T).

Remark 3.

The generalized FedAvg algorithm with two-sided learning rates can achieve a linear speedup for non-i.i.d. datasets, i.e., a convergence rate 𝒪(1/√(mKT)) as long as T ≥ mK, since the 1/T term is then no larger than the 1/√(mKT) term. Although many works have achieved this convergence rate asymptotically, we improve the maximum number of local steps K to T/m, which is significantly better than the state-of-the-art bounds such as shown in (Karimireddy et al., 2019; Yu et al., 2019a; Kairouz et al., 2019). Note that a larger number of local steps implies relatively fewer communication rounds, thus less communication overhead. See also the communication complexity comparison in Table 1. For example, when and (as used in (Kairouz et al., 2019)), the local steps in our algorithm is . However, means that no extra local steps can be taken to reduce communication costs.

Remark 4.

When degenerated to the i.i.d. case (zero global variability, σ_G = 0), the convergence rate becomes , which has a better first term in the bound compared with previous work as shown in Table 1.

3.2 Convergence analysis for partial worker participation

Partial worker participation in each communication round may be more practical than full worker participation due to many physical limitations of FL in practice (e.g., excessive delays because of too many devices to poll, malfunctioning devices, etc.). Partial worker participation can also accelerate the training by neglecting stragglers. We consider two sampling strategies proposed by Li et al. (2018) and Li et al. (2019b). Let S_t be the participating worker index set at communication round t with |S_t| = n for all t, for some n ∈ (0, m]. S_t is randomly and independently selected either with replacement (Strategy 1) or without replacement (Strategy 2) sequentially according to the sampling probabilities . For each member in S_t, we pick a worker from the entire set of m workers uniformly at random, i.e., with probability 1/m. That is, selection likelihood for any one worker is . Then we have the following results:
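The two uniform sampling strategies can be sketched with the standard library. This is an illustration under the assumption of equal selection probabilities for all workers; the function names are hypothetical.

```python
import random

def sample_with_replacement(m, n, rng):
    """Strategy 1: n independent uniform draws from {0, ..., m-1}; repeats allowed."""
    return [rng.randrange(m) for _ in range(n)]

def sample_without_replacement(m, n, rng):
    """Strategy 2: n distinct workers drawn uniformly from {0, ..., m-1}."""
    return rng.sample(range(m), n)

rng = random.Random(0)
s1 = sample_with_replacement(100, 10, rng)
s2 = sample_without_replacement(100, 10, rng)
print(s1)
print(s2)
```

Under Strategy 1 the same worker may appear more than once in a round, so the number of distinct participants can be smaller than n; Strategy 2 always yields n distinct workers.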

Theorem 2.

Under Assumptions 1–3 with partial worker participation, the sequence of outputs generated by Algorithm 1 with constant local and global learning rates η_L and η satisfies:

where , and the expectation is over the local dataset samples among workers.

For sampling Strategy 1, let and be chosen as such that , and . It then holds that:

For sampling Strategy 2, let and be chosen as such that , and . It then holds that:

With Theorem 2, we immediately have the following convergence rate for the generalized FedAvg algorithm with a proper choice of two-sided learning rates:

Corollary 2.

Let and . The convergence rate of the generalized FedAvg algorithm under partial worker participation with either sampling strategy is 𝒪(1/√(nKT) + 1/T).

Remark 5.

The convergence rate bound for partial worker participation has the same structure but with a larger variance term. This implies that the partial worker participation by the uniform sampling does not result in fundamental changes in convergence (in order sense) except for an amplified variance due to fewer workers participating and random sampling. The intuition is that the uniform sampling (with/without replacement) for worker selection yields a good approximation of the entire worker distribution in expectation, which reduces the risk of distribution deviation due to the partial worker participation. As shown in Section  5, the distribution deviation due to fewer worker participation could render the training unstable, especially in highly non-i.i.d. cases.

Remark 6.

The generalized FedAvg with partial worker participation under non-i.i.d. datasets can still achieve a linear speedup with proper learning rate settings as shown in Corollary 2. In addition, when degenerated to the i.i.d. case (zero global variability, σ_G = 0), the convergence rate becomes .

Remark 7.

Here, we let |S_t| = n (with S_t the participating worker set in round t) only for ease of presentation and better readability. We note that this is not a restrictive condition. We can show that |S_t| = n can be relaxed to |S_t| ≥ n and the same convergence rate still holds. In fact, our full proof in Appendix A.2 is for |S_t| ≥ n.

4 Discussion

In light of the above results, we discuss several insights from the convergence analysis:

Convergence Rate. We show that the generalized FedAvg algorithm with two-sided learning rates can achieve a linear speedup, i.e., an 𝒪(1/√(mKT)) convergence rate with a proper choice of hyper-parameters. Thus, it works well in large FL systems, where massive parallelism can be leveraged to accelerate training. The key challenge in convergence analysis stems from the different local loss functions (also called “model drift” in the literature) among workers due to the non-i.i.d. datasets and local steps. As shown above, we obtain a convergence bound for the generalized FedAvg method containing a vanishing term and a constant term (the constant term is similar to that of SGD). In contrast, the constant term in SGD is only due to the local variance. Note that, similar to SGD, the iterations do not diminish the constant term. The local variance (randomness of stochastic gradients), global variability (non-i.i.d. datasets), and the number of local steps (amplification factor) all contribute to the constant term, but the total global variability in the K local steps dominates the term. When the local learning rate is set to an inverse relationship with respect to the number of local steps K, the constant term is controllable. An intuitive explanation is that the K local steps can be approximately viewed as one step in conventional SGD. Thus, both the speedup and the larger admissible number of local steps can be largely attributed to the two-sided learning rates setting.

Number of Local Steps. Besides the result that the maximum number of local steps is improved to T/m, we also show that the local steps could help the convergence with the proper hyper-parameter choices, which supports previous numerical results (McMahan et al., 2016; Stich, 2018; Lin et al., 2018) and is verified in different models with datasets of different non-i.i.d. degrees in Section 5. However, there are other results showing that the local steps slow down the convergence (Li et al., 2019b). We believe that whether local steps help or hurt the convergence in FL is worth further investigation.

Number of Workers. We show that the convergence rate improves substantially as the number of workers in each communication round increases. This is consistent with the results for i.i.d. cases in Stich (2018). For i.i.d. datasets, more workers means more data samples and thus less variance and better performance. For non-i.i.d. datasets, having more workers implies that the distribution of the sampled workers is a better approximation for the distribution of all workers. This is also empirically observed in Section 5. On the other hand, the sampling strategy plays an important role in the non-i.i.d. case as well. Here, we adopt uniform sampling (with/without replacement) to enlist workers to participate in FL. Intuitively, the distribution of the sampled workers’ collective datasets under uniform sampling yields a good approximation of the overall data distribution in expectation.

Note that, in this paper, we assume that every worker is available to participate once being enlisted. However, this may not always be feasible. In practice, the workers need to be in certain states in order to be able to participate in FL (e.g., in charging or idle states, etc. (Eichner et al., 2019)). Therefore, care must be taken in sampling and enlisting workers in practice. We believe that the joint design of sampling schemes and the generalized FedAvg algorithm will have significant impact on the convergence, which needs further investigations.

5 Numerical Results

We perform extensive experiments to verify our theoretical results. We use three models: logistic regression (LR), a fully-connected neural network with two hidden layers (2NN) and a convolutional neural network (CNN) with the non-i.i.d. version of MNIST (LeCun et al., 1998) and one ResNet model with CIFAR-10 (Krizhevsky et al., 2009). Due to space limitation, we relegate some experimental results to the supplementary material.

Figure 1: Training loss (top) and test accuracy (bottom) for the 2NN model with hyper-parameter settings: local learning rate 0.1, global learning rate 1.0. Panels: (a) impact of non-i.i.d. datasets (worker number 100, local steps 5 epochs); (b) impact of worker number (local steps 5 epochs); (c) impact of local steps (5 digits in each worker’s dataset).

In this section, we elaborate on the results under non-i.i.d. MNIST datasets for the 2NN. We distribute the MNIST dataset among workers randomly and evenly in a digit-based manner such that the local dataset for each worker contains only a certain class of digits. The number of digits in each worker’s dataset represents the non-i.i.d. degree. With ten digits per worker, each worker has training/testing samples with all ten digits from 0 to 9, which is essentially an i.i.d. case. With one digit per worker, each worker has samples only with one digit, which leads to highly non-i.i.d. datasets among workers. For partial worker participation, we set the number of workers in each communication round.
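A digit-based partition of this kind can be sketched as follows. This is one possible scheme written for illustration, not necessarily the procedure used in the experiments: each worker is assigned a random set of digit classes, and the samples of each digit are split evenly among the workers holding it. The function name and the synthetic labels are hypothetical.

```python
import random
from collections import defaultdict

def partition_by_digit(labels, num_workers, digits_per_worker, seed=0):
    """Split sample indices so each worker only sees `digits_per_worker` classes.

    Assumed scheme (illustrative): digits that no worker holds are left unused.
    """
    rng = random.Random(seed)
    by_digit = defaultdict(list)
    for idx, y in enumerate(labels):
        by_digit[y].append(idx)
    classes = sorted(by_digit)

    # Choose which digit classes each worker holds.
    worker_digits = [rng.sample(classes, digits_per_worker) for _ in range(num_workers)]
    holders = defaultdict(list)
    for w, digits in enumerate(worker_digits):
        for d in digits:
            holders[d].append(w)

    # Deal the shuffled samples of each digit round-robin to its holders.
    shards = [[] for _ in range(num_workers)]
    for d, ws in holders.items():
        idxs = by_digit[d][:]
        rng.shuffle(idxs)
        for j, idx in enumerate(idxs):
            shards[ws[j % len(ws)]].append(idx)
    return shards, worker_digits

labels = [i % 10 for i in range(1000)]  # synthetic stand-in for MNIST labels
shards, worker_digits = partition_by_digit(labels, num_workers=20, digits_per_worker=2)
print(sorted(worker_digits[0]), len(shards[0]))
```

Setting `digits_per_worker=10` gives every worker all classes (close to the i.i.d. case), while `digits_per_worker=1` gives the most heterogeneous split.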

Impact of non-i.i.d. datasets. As shown in Figure 1(a), for the 2NN model with full worker participation, the top-row figures are for training loss versus communication round and the bottom-row figures are for test accuracy versus communication round. We can see that the generalized FedAvg algorithm converges under non-i.i.d. datasets with a proper learning rate choice in both cases. For five digits in each worker’s dataset with full (partial) worker participation in Figure 1(a), the generalized FedAvg algorithm achieves a convergence speed comparable with that of the i.i.d. case. Another key observation is that non-i.i.d. datasets slow down the convergence under the same learning rate settings for both cases. The higher the non-i.i.d. degree, the slower the convergence speed. As the non-i.i.d. degree increases, the training loss increases and the test accuracy decreases. This trend is more obvious from the zigzagging curves for partial worker participation. These two observations can also be verified for other models as shown in the supplementary material, which confirms our theoretical analysis.

Impact of worker number. As shown in Figure 1(b), we compare the training loss and test accuracy between full worker participation and partial worker participation with the same hyper-parameters. Compared with full worker participation, partial worker participation introduces another source of randomness, which leads to zigzagging convergence curves and slower convergence. This problem is more prominent for highly non-i.i.d. datasets. Full worker participation can neutralize the system heterogeneity in each communication round. However, partial worker participation might not be able to neutralize the gaps among different workers. That is, the distribution of the active workers’ datasets does not approximate the overall distribution well. Specifically, it is not unlikely that the digits in the datasets of all active workers form only a proper subset of the 10 digits in the original MNIST dataset, especially with highly non-i.i.d. datasets. This trend is also obvious for complex models and complicated datasets as shown in the supplementary material. The sampling strategy here is random sampling with equal probability without replacement. In practice, however, the actual sampling of the workers in FL could be more complex, which is worth further investigation.

Impact of local steps. One open question in FL is whether the local steps help the convergence or not. In Figure 1(c), we show that the local steps could help the convergence for both full and partial worker participation. These results verify our theoretical analysis. However, Li et al. (2019b) showed that the local steps may hurt the convergence and confirmed it under unbalanced non-i.i.d. MNIST datasets. We believe this may be due to the combined effect of unbalanced datasets and local steps rather than local steps alone.

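Columns: Dataset | IID or Non-IID | Worker selected | Model | SCAFFOLD: # of Rounds | SCAFFOLD: Communication cost (MB) | SCAFFOLD: Wall-clock time (s) | This paper: # of Rounds | This paper: Communication cost (MB) | This paper: Wall-clock time (s)

MNIST | IID | | Logistic | 3 | 0.36 | 0.32 | 3 | 0.18 | 0.22
MNIST | IID | | 2NN | 3 | 9.12 | 0.88 | 3 | 4.56 | 0.56
MNIST | IID | | CNN | 3 | 26.64 | 2.23 | 3 | 13.32 | 1.57
MNIST | IID | | Logistic | 5 | 0.60 | 0.53 | 5 | 0.30 | 0.42
MNIST | IID | | 2NN | 5 | 15.20 | 1.51 | 8 | 12.16 | 1.49
MNIST | IID | | CNN | 1 | 8.88 | 0.79 | 1 | 4.44 | 0.50
MNIST | Non-IID | | Logistic | 14 | 1.68 | 1.48 | 14 | 0.84 | 1.16
MNIST | Non-IID | | 2NN | 14 | 42.55 | 4.23 | 14 | 21.28 | 2.46
MNIST | Non-IID | | CNN | 14 | 124.34 | 11.12 | 10 | 44.41 | 4.92
MNIST | Non-IID | | Logistic | 7 | 0.84 | 0.72 | 11 | 0.66 | 0.91
MNIST | Non-IID | | 2NN | 7 | 21.28 | 2.11 | 17 | 25.84 | 3.16
MNIST | Non-IID | | CNN | 17 | 150.98 | 13.50 | 7 | 31.08 | 3.51
CIFAR-10 | IID | | Resnet18 | 56 | 9548.07 | 583.24 | 44 | 3751.03 | 256.63
CIFAR-10 | Non-IID | | Resnet18 | 52 | 8866.06 | 539.50 | 61 | 5200.29 | 358.22

  • Bandwidth = 20 MB/s.

Table 2: Comparison with SCAFFOLD.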

Comparison with SCAFFOLD: Lastly, we compare with the SCAFFOLD algorithm (Karimireddy et al., 2019) since it also achieves the same linear speedup effect under non-i.i.d. datasets. We compare communication rounds, total communication load, and estimated wall-clock time under the same settings to achieve a certain test accuracy, and the results are reported in Table 2. The non-i.i.d. dataset is for and i.i.d. is for . The learning rates are , and number of local steps is epochs. We set the target accuracy for MNIST and for CIFAR-10. Note that the total training time contains two parts: i) the computation time for training the local model at each worker and ii) the communication time for information exchanges between the worker and server. We assume a bandwidth of 20 MB/s for both uplink and downlink connections. For MNIST datasets, we can see that our algorithm is similar to or outperforms SCAFFOLD. This is because the communication rounds of both algorithms are relatively small for such simple tasks. For non-i.i.d. CIFAR-10, SCAFFOLD takes slightly fewer communication rounds than our FedAvg to reach the target accuracy thanks to variance reduction. However, it takes more than 1.5 times the communication cost and wall-clock time of our FedAvg. Due to space limitation, we relegate the results of time proportions for computation and communication to Appendix B (see Fig. 7).
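The ratios quoted for non-i.i.d. CIFAR-10 can be recomputed from the Table 2 entries. The sketch below assumes that communication time is simply the communication cost divided by the stated 20 MB/s bandwidth; the helper name `comm_time_seconds` is hypothetical.

```python
BANDWIDTH_MB_PER_S = 20.0  # bandwidth stated under Table 2

def comm_time_seconds(cost_mb, bandwidth=BANDWIDTH_MB_PER_S):
    """Assumed model: communication time = communication cost / bandwidth."""
    return cost_mb / bandwidth

# Non-IID CIFAR-10 row of Table 2: (rounds, communication cost in MB, wall-clock s)
scaffold = (52, 8866.06, 539.50)
fedavg = (61, 5200.29, 358.22)

cost_ratio = scaffold[1] / fedavg[1]
time_ratio = scaffold[2] / fedavg[2]
per_round_ratio = (scaffold[1] / scaffold[0]) / (fedavg[1] / fedavg[0])

print(round(cost_ratio, 2), round(time_ratio, 2), round(per_round_ratio, 2))
print(comm_time_seconds(scaffold[1]), comm_time_seconds(fedavg[1]))
```

Both the total cost and the wall-clock ratios exceed 1.5 even though SCAFFOLD needs fewer rounds, and the per-round cost ratio is about 2, which is consistent with the extra variables SCAFFOLD exchanges in each round.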

6 Conclusions and future work

In this paper, we analyzed the convergence of a generalized FedAvg algorithm with two-sided learning rates on non-i.i.d. datasets for general non-convex optimization. We proved that the generalized FedAvg algorithm achieves a linear speedup for convergence under both full and partial worker participation. We also showed that local steps in FL can help convergence, and we improved the maximum number of local steps to T/m. While our work sheds light on the theoretical understanding of FL, it also opens the door to many interesting new questions, such as how to sample optimally under partial worker participation and how to deal with active participant sets that are both time-varying and size-varying across communication rounds. We hope that the insights and proof techniques in this paper can pave the way for new research in these directions.

Acknowledgements

This work is supported in part by NSF grants CAREER CNS-1943226, CIF-2110252, ECCS-1818791, CCF-1934884, ONR grant N00014-17-1-2417, and a Google Faculty Research Award.

References

  • L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. Cited by: §3.
  • H. Eichner, T. Koren, H. B. McMahan, N. Srebro, and K. Talwar (2019) Semi-cyclic stochastic gradient descent. arXiv preprint arXiv:1904.10120. Cited by: §B.3, §4.
  • S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §3.
  • T. H. Hsu, H. Qi, and M. Brown (2019) Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335. Cited by: §1.
  • L. Huang, Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu (2018) Loadaboost: loss-based adaboost federated machine learning on medical data. arXiv preprint arXiv:1811.12629. Cited by: §2.
  • E. Jeong, S. Oh, H. Kim, J. Park, M. Bennis, and S. Kim (2018) Communication-efficient on-device machine learning: federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479. Cited by: §2.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §2, §2, §3, Remark 3.
  • S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh (2019) SCAFFOLD: stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378. Cited by: §B.3, 1st item, 2nd item, §1, item 4, item 6, §2, §3, §5, Remark 3.
  • A. Khaled, K. Mishchenko, and P. Richtárik (2019a) Better communication complexity for local sgd. arXiv preprint arXiv:1909.04746. Cited by: §2.
  • A. Khaled, K. Mishchenko, and P. Richtárik (2019b) First analysis of local gd on heterogeneous data. arXiv preprint arXiv:1909.04715. Cited by: item 6.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §5.
  • T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2019a) Federated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873. Cited by: §2.
  • T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2018) Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127. Cited by: §1, §2, §3.2.
  • X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2019b) On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189. Cited by: §B.3, 2nd item, item 6, §2, §3.2, §4, §5, Remark 2.
  • T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi (2018) Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217. Cited by: §2, §4.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629. Cited by: §B.2, §1, §2, §4.
  • S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konecny, S. Kumar, and H. B. McMahan (2020) Adaptive federated optimization. arXiv preprint arXiv:2003.00295. Cited by: §A.3, §2, §3, §3, Lemma 2.
  • F. Sattler, S. Wiedemann, K. Müller, and W. Samek (2019) Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems. Cited by: §2.
  • S. U. Stich, J. Cordonnier, and M. Jaggi (2018) Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458. Cited by: §2.
  • S. U. Stich and S. P. Karimireddy (2019) The error-feedback framework: better rates for sgd with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350. Cited by: §1, item 6, §2.
  • S. U. Stich (2018) Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767. Cited by: §1, item 6, §2, §4, §4.
  • J. Wang and G. Joshi (2018) Cooperative sgd: a unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576. Cited by: §1, item 6, §2, §2.
  • J. Wang, V. Tantia, N. Ballas, and M. Rabbat (2019a) SlowMo: improving communication-efficient distributed sgd with slow momentum. arXiv preprint arXiv:1910.00643. Cited by: §1, §2.
  • S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan (2019b) Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications 37 (6), pp. 1205–1221. Cited by: §3.
  • H. Yu, R. Jin, and S. Yang (2019a) On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. arXiv preprint arXiv:1905.03817. Cited by: 2nd item, item 6, §2, Remark 3.
  • H. Yu, S. Yang, and S. Zhu (2019b) Parallel restarted sgd with faster convergence and less communication: demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5693–5700. Cited by: §1, item 6, §2, §3.
  • Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582. Cited by: §2.
  • F. Zhou and G. Cong (2017) On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012. Cited by: §2.

Appendix A Appendix I: Proofs

In this section, we give the proofs in detail for full and partial worker participation in Section A.1 and Section A.2, respectively.

A.1 Proof of Theorem 1

See Theorem 1.

Proof.

For convenience, we define . Under full device participation (i.e., ), it is clear that .

Due to the smoothness in Assumption 1, taking expectation of over the randomness at communication round , we have:

(1)
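For orientation, bounds of this kind typically start from the standard descent lemma for a function whose gradient is L-Lipschitz. The statement below is generic: f, L, x and y are placeholder symbols rather than the paper's notation, and whether (1) takes exactly this form is an assumption.

```latex
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \frac{L}{2}\,\| y - x \|^{2}
\qquad \text{for all } x,\, y .
```

Taking x to be the current global model and y the model after one communication round, and then taking the conditional expectation, yields an inequality with one inner-product term and one squared-norm term. The two bounds that follow have this shape.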

Note that the term in (1) can be bounded as follows:

(2)

where follows from that for and , is due to that , is due to Assumption 1 and follows from Lemma 2.

The term in (1) can be bounded as:

(3)

where follows from the fact that and is due to the bounded variance assumption in Assumption 3 and the fact that if s are independent with zero mean and .
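The identity elided at the end of the sentence above is presumably the standard one: for independent zero-mean random vectors, the expected squared norm of the sum equals the sum of the expected squared norms, because all cross terms vanish in expectation. The following is a minimal Monte Carlo sketch of that property, using standard Gaussian vectors; the function names, dimensions and sample sizes are chosen purely for illustration and do not come from the paper.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(c * c for c in v)


def estimate(n_vectors=5, dim=3, trials=20000, seed=0):
    """Estimate E||x_1 + ... + x_n||^2 and E[sum_i ||x_i||^2].

    Each x_i is an independent zero-mean Gaussian vector (an assumption
    made only for this illustration).
    """
    rng = random.Random(seed)
    lhs = 0.0  # running mean of ||x_1 + ... + x_n||^2
    rhs = 0.0  # running mean of ||x_1||^2 + ... + ||x_n||^2
    for _ in range(trials):
        xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_vectors)]
        total = [sum(x[d] for x in xs) for d in range(dim)]
        lhs += sq_norm(total) / trials
        rhs += sum(sq_norm(x) for x in xs) / trials
    return lhs, rhs


lhs, rhs = estimate()
print(f"E||sum||^2 ~ {lhs:.3f}, sum of E||x_i||^2 ~ {rhs:.3f}")
```

Both estimates should be close to n_vectors * dim, up to sampling noise. If the vectors were correlated or had nonzero mean, the cross terms would no longer cancel and the two quantities would differ.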

Substituting the inequalities in (2) of and (3) of into inequality (1), we have: