Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability

02/18/2020 ∙ by Yikai Yan, et al.

Federated learning is a new distributed machine learning framework, where a set of heterogeneous clients collaboratively train a model without sharing training data. In this work, we consider a practical and ubiquitous issue in federated learning: intermittent client availability, where the set of eligible clients may change during the training process. Such an intermittent client availability model would significantly deteriorate the performance of the classical Federated Averaging algorithm (FedAvg for short). We propose a simple distributed non-convex optimization algorithm, called Federated Latest Averaging (FedLaAvg for short), which leverages the latest gradients of all clients, even when the clients are not available, to jointly update the global model in each iteration. Our theoretical analysis shows that FedLaAvg attains the convergence rate of O(1/(N^1/4 T^1/2)), achieving a sublinear speedup with respect to the total number of clients. We implement and evaluate FedLaAvg with the CIFAR-10 dataset. The evaluation results demonstrate that FedLaAvg indeed reaches a sublinear speedup and achieves 4.23% higher test accuracy than FedAvg.


1 Introduction

Federated Learning (FL) is a new paradigm of distributed machine learning (McMahan et al., 2017; Li et al., 2019a; Kairouz et al., 2019). It allows multiple clients to collaboratively train a global model without needing to upload local data to a centralized cloud server. In the FL setting, data are massively distributed over clients, with non-IID distribution (Hsieh et al., 2019) and unbalance in quantity (Mohri et al., 2019); in these ways, FL is distinguished from traditional distributed optimization (Li et al., 2014; Lian et al., 2018; Tang et al., 2018a, b, 2019; Yu and Jin, 2019). Furthermore, the agents participating in FL are typically heterogeneous clients with limited computation resources and unreliable communication links, resulting in a varying set of eligible clients during the training process. These new features pose challenges in designing and analyzing learning algorithms for FL.

One of the leading challenges in deploying FL systems is client availability, where the clients may not be available throughout the entire training process. Consider the typical FL scenario where Google’s mobile keyboard Gboard polishes its language models among numerous mobile-device users (Bonawitz et al., 2019; Yang et al., 2018). To minimize the negative impact on user experience, only devices that meet certain requirements (e.g., charging, idle, and free Wi-Fi) are eligible for model training. These requirements are usually satisfied at night in local time, resulting in a diurnal pattern of client availability. Such intermittent client availability would introduce bias into training data. On one hand, as clients have diverse availability patterns, certain clients are more likely to be selected to participate in the training, and thus their data would be over-represented. On the other hand, if the criteria of client availability depend on latency, then clients with slower processors or delayed networks may be under-represented. Such bias gives rise to inconsistency between the training and test data distributions, thus degrading the generalization ability of FL algorithms. This inconsistency is also known as dataset shift (Quiñonero-Candela et al., 2009; Moreno-Torres et al., 2012), a notorious obstacle to the convergence of machine learning algorithms (Subbaswamy et al., 2019; Snoek et al., 2019), which also exists in FL.


Studies | Assumptions on Client Availability | Convergence Rate
Wang and Joshi (2018); Yu et al. (2019b); Khaled et al. (2019); Stich (2019); Stich and Karimireddy (2019) | All clients are available and participate in training. | O(1/(N^1/2 T^1/2))
Li et al. (2019b) | All clients are available, and a subset of clients participate in training. | O(1/T)
The current study | Each client is available at least once during any period of a certain length. | O(1/(N^1/4 T^1/2))
Table 1: Convergence results in FL under different client availability assumptions.

Existing work in the literature has not considered the issue of intermittent client availability, and the convergence analysis of FL algorithms always requires all clients to be available throughout the training process. As shown in Table 1, much effort (Wang and Joshi, 2018; Yu et al., 2019b; Khaled et al., 2019; Stich, 2019; Stich and Karimireddy, 2019; Li et al., 2019b) has been expended in proving the convergence of the classical FedAvg algorithm (McMahan et al., 2017). However, this line of work assumed that all clients participate in each iteration of the training to establish the O(1/(N^1/2 T^1/2)) convergence of FedAvg, where N is the total number of clients and T is the total number of iterations in the training. Such a full client participation requirement would significantly increase the synchronization latency of the collaborative training process, and is hard to satisfy in practical FL. One exception is Li et al. (2019b), who required only a subset of clients to participate in each iteration to obtain the O(1/T) convergence of FedAvg. However, to guarantee such a convergence result, they assumed that the participating clients are selected either uniformly at random or with probabilities proportional to the volume of local data, which is possible only if all clients are available. As these studies adopted the full client availability model, there is no bias in the training data, which is an essential condition for the positive convergence results of FedAvg in the literature.

In this study, we integrate the consideration of intermittent client availability into the design and analysis of the FL algorithm. We first formulate a practical model for intermittent client availability in FL; this model allows the set of available clients to follow any time-varying distribution, with the assumption that each client needs to be available at least once during any period of a certain length. Under such a client availability model, FedAvg would diverge even in a simple learning scenario (shown in Subsection 3.1), because the training data are biased towards the highly available clients. For general distributed non-convex optimization, we propose a simple Federated Latest Averaging algorithm, namely FedLaAvg, to approximately balance the influence of each client's data on the global model training. Specifically, instead of averaging only the gradients collected from participating clients, FedLaAvg averages the latest gradients of all clients, where the latest gradient of a given client is the gradient calculated in her latest participating iteration (please refer to Subsection 3.2 for the detailed definition). By setting appropriate parameters, we can prove an O(1/(N^1/4 T^1/2)) convergence for FedLaAvg, implying that FedLaAvg can achieve a sublinear speedup with respect to the total number of clients. We summarize the contributions of this work as follows.

  • To the best of our knowledge, we are the first to study the problem of intermittent client availability in FL, and present a formal formulation thereof. We also show the divergence of FedAvg in such a practical client availability model, and investigate the underlying reasons behind it.

  • Under the intermittent client availability model, we propose the fast convergent algorithm FedLaAvg, which aggregates the latest gradients of all clients in each training iteration. Our theoretical analysis shows the O(1/(N^1/4 T^1/2)) convergence of FedLaAvg for general distributed non-convex optimization.

  • Using the CIFAR-10 dataset, we evaluate FedLaAvg and compare its performance with that of FedAvg. Our evaluation results demonstrate the effectiveness and efficiency of FedLaAvg, as it achieves 4.23% higher test accuracy than FedAvg and a sublinear speedup.

2 Problem Formulation

We consider a general distributed non-convex optimization scenario in which N clients collaboratively solve the following consensus optimization problem:

min_x f(x) ≜ Σ_{i=1}^N p_i F_i(x).

Each client i holds training data ξ_i drawn from a local distribution D_i, and p_i is the weight of this client (typically the proportion of client i's local data volume in the total data volume of the FL system (McMahan et al., 2017)). Function l(x; ξ_i) is the training error of model parameters x over local data ξ_i, and F_i(x) ≜ E_{ξ_i ∼ D_i}[l(x; ξ_i)] is the local generalization error, taking expectation over the randomness of local data. In iteration t, participating client i observes the local stochastic gradient:

g_i^t = ∇l(x^{t−1}; ξ_i^t),

where x^{t−1} is the model parameters from the previous iteration and ξ_i^t is the local training data in this iteration. We note

E[g_i^t | Ξ^{t−1}] = ∇F_i(x^{t−1}),

where Ξ^{t−1} is the historical training data from all clients before iteration t:

Ξ^{t−1} ≜ {ξ_i^s | 1 ≤ i ≤ N, 1 ≤ s ≤ t−1}.

To simplify the analysis of unbalanced data volume among clients, we use a scaling technique to obtain a revised local objective function: F̃_i(x) ≜ N p_i F_i(x). Then, we can rewrite the global objective function as

f(x) = (1/N) Σ_{i=1}^N F̃_i(x).    (1)

In this study, we make the following three assumptions regarding the objective functions.

Assumption 1.

Local objective functions are all L-smooth: for any x and y,

‖∇F̃_i(x) − ∇F̃_i(y)‖ ≤ L ‖x − y‖, ∀i,

where F̃_i is the revised local objective function of client i. The corollary is

F̃_i(y) ≤ F̃_i(x) + ⟨∇F̃_i(x), y − x⟩ + (L/2) ‖y − x‖², ∀i.

Assumption 2.

Bounded variance: there exists σ > 0 with

E[‖g̃_i^t − ∇F̃_i(x^{t−1})‖²] ≤ σ², ∀i, t,

where g̃_i^t ≜ N p_i g_i^t is the scaled stochastic gradient of client i.

Assumption 3.

Bounded gradient: there exists G > 0 with

E[‖g̃_i^t‖²] ≤ G², ∀i, t.

To model intermittent client availability, we use C^t to denote the set of available clients in iteration t. We formally introduce the following assumption regarding the model of intermittent client availability in FL.

Assumption 4.

Minimal availability: each client is available at least once in any period with E successive iterations:

∀i, t: ∃s ∈ {t, t+1, …, t+E−1} such that i ∈ C^s.

Assumption 1 is standard, and Assumptions 2 and 3 have also been widely made in the literature (Zhang et al., 2012; Stich et al., 2018; Yu et al., 2019b, a; Stich, 2019; Li et al., 2019b). Specifically, Yu et al. (2019b) worked with non-convex functions under Assumptions 1–3, and required all clients to be available and to participate in each iteration. Meanwhile, Li et al. (2019b) focused on convex functions while imposing the same full client availability requirement. The full client availability model in existing work is equivalent to the special case of our intermittent client availability model with E = 1 in Assumption 4. Furthermore, Assumption 4 regarding the intermittent client availability model is reasonable in practical FL. For example, as discussed earlier, clients are typically available at night, and thus Assumption 4 with E equal to the number of iterations in one day can describe such a client availability scenario.
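As an illustration of Assumption 4, the following sketch encodes a hypothetical diurnal availability schedule and checks the minimal-availability property; the client count, period, and window sizes are invented for this example and are not values from the paper.

```python
# Sketch of the availability model in Assumption 4, with a hypothetical
# diurnal pattern: the first half of the clients is "awake" in the first
# `window` iterations of each period, the second half in the rest.

def available_clients(t, num_clients=4, period=8, window=4):
    """Return the set of clients available in iteration t (1-indexed)."""
    phase = (t - 1) % period
    first_half = set(range(num_clients // 2))
    second_half = set(range(num_clients // 2, num_clients))
    return first_half if phase < window else second_half

# Assumption 4 (minimal availability) holds with E = period: every client
# appears at least once in any window of `period` successive iterations.
for start in range(1, 40):
    seen = set()
    for t in range(start, start + 8):
        seen |= available_clients(t)
    assert seen == {0, 1, 2, 3}
```

Under this schedule, the full client availability model corresponds to the degenerate choice window = period with every client in every grid, i.e., E = 1.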

3 Algorithm Design

In this section, we first show that the classical FedAvg algorithm produces arbitrarily poor-quality results in the presence of intermittent client availability. We investigate the underlying reasons for the divergence of FedAvg, and then propose a new algorithm called FedLaAvg, which converges at a fast rate under the intermittent client availability model.

3.1 Divergence of FedAvg

Example 1.

We consider a distributed optimization problem with only two clients (denoted as client 1 and client 2) and a convex objective function. The goal is to learn the mean of one-dimensional data from these two clients. Following the problem formulation in Section 2, the local data distribution of client i is D_i with mean μ_i. For simplicity, we assume the amounts of data from the two clients are balanced. We can formulate this simple learning problem as minimizing the following mean square error (MSE):

f(x) = (1/2) E_{ξ_1 ∼ D_1}[(x − ξ_1)²] + (1/2) E_{ξ_2 ∼ D_2}[(x − ξ_2)²].    (2)

For this example, we consider a specific intermittent client availability model: clients are available periodically and alternately. That is, in each period of 2E iterations, client 1 is available in the first E iterations, and client 2 is available in the following E iterations. Let m index the period; we then have

C^t = {1} for t ∈ [2(m−1)E + 1, (2m−1)E], and C^t = {2} for t ∈ [(2m−1)E + 1, 2mE].

This model describes client availability with a regular diurnal pattern. For example, clients around the world participate in FL at night; clients 1 and 2 may correspond to clients from two different geographic regions, respectively.

Theorem 1.

Suppose each client computes the exact (not stochastic) gradient. In Example 1, even with a sufficiently low learning rate, the model parameters returned by FedAvg at the end of each period, i.e., x^{2mE}, would converge to μ_2, which can be arbitrarily far away from the optimal solution x* = (μ_1 + μ_2)/2.

Proof of Theorem 1.

In Example 1, the two clients alternately train the global model using their own local data. Hence, after a certain number of training iterations, the global model parameters would be “pulled” in opposite directions when different clients are available, and would finally oscillate periodically instead of converging to the optimal solution. For the detailed proof, please refer to Appendix A. ∎
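To make this failure mode concrete, the following self-contained simulation runs the alternating FedAvg updates of Example 1 with exact gradients; the values μ1 = 0, μ2 = 10, γ = 0.1, and E = 100 are hypothetical choices for illustration, not taken from the paper.

```python
# FedAvg on Example 1 with exact gradients: two clients available
# alternately for E iterations each. Values mu1, mu2, gamma, E are
# hypothetical; the optimum is the midpoint (mu1 + mu2) / 2.
mu1, mu2 = 0.0, 10.0
optimum = (mu1 + mu2) / 2            # x* = 5.0
gamma, E = 0.1, 100

x = 0.0
for period in range(50):
    for _ in range(E):               # only client 1 is available
        x -= gamma * (x - mu1)       # gradient step on (1/2)(x - mu1)^2
    for _ in range(E):               # only client 2 is available
        x -= gamma * (x - mu2)
# Each half-period drags x almost all the way to the mean of whichever
# client trained last, so the end-of-period iterate is biased towards mu2.
print(x)  # close to mu2 = 10, far from the optimum 5
```

Shrinking γ alone does not help here when γE stays large: each availability block still pulls the iterate essentially all the way to its own local minimizer.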

3.2 Federated Latest Averaging

As shown in Subsection 3.1, intermittent client availability seriously degrades the performance of FedAvg. In FL, the overall data distribution is an unbiased mixture of all clients' local data distributions. FedAvg can be proven to converge in the full client participation scenario (Yu et al., 2019b) because it uses the current gradients of all clients to update the global model, which keeps the training data distribution in each iteration consistent with the overall data distribution. However, due to intermittent client availability, some clients are selected to participate in the training process more frequently, introducing bias into the training data. To mitigate this bias problem, we imitate the full client participation scenario and attempt to leverage the gradient information of all clients for model training in each iteration. The difficulty in employing this idea is that some clients are absent from the training, being either unavailable or unselected, so we cannot obtain their current gradients. To resolve the lack of gradient information, we propose a natural and simple idea: using the latest gradient of a client when her current gradient is not available. By doing so, we can eliminate the bias in the training data and establish the convergence result.

1:  Input: initial model parameters x^0; number of clients N; number of total iterations T; learning rate γ; proportion β of selected clients (i.e., the number of participating clients in each iteration is βN)
2:  Do initialization: τ_i^0 = 0 and g_i^0 = 0 for each client i; ḡ^0 = 0
3:  for t = 1 to T do
4:     The cloud server performs client selection:
5:     C^t ← the set of available clients
6:     P^t ← the βN clients from C^t with the lowest τ_i^{t−1} values
7:     Update τ values:
(3)    τ_i^t = t for i ∈ P^t;  τ_i^t = τ_i^{t−1} for i ∉ P^t
8:     Each selected client i ∈ P^t calculates the local gradient g_i^t and uploads the gradient difference g_i^t − g_i^{τ_i^{t−1}} in parallel.
9:     Once receiving the gradient information from client i, the cloud server calculates the global gradient:
(4)    ḡ^t = ḡ^{t−1} + (1/N) Σ_{i ∈ P^t} (g_i^{τ_i^t} − g_i^{τ_i^{t−1}})
10:    The cloud server updates the global model parameters:
(5)    x^t = x^{t−1} − γ ḡ^t
11:  end for
Algorithm 1 Federated Latest Averaging Algorithm

We present in Algorithm 1 the detailed procedure of our approach FedLaAvg. In each iteration t, each selected client i locally calculates the gradient g_i^t, and the cloud server maintains the average latest gradient ḡ^t of all clients. The client selection principle in FedLaAvg is to choose the clients that have been absent from the training process for the longest time (Lines 5–7). Together with Assumption 4, we can guarantee that each client is selected at least once during any period with B successive iterations, where B is a function of the parameters N, E, and β (please refer to Lemma 1 in Subsection 4.2 for the details). Based on this condition, we can establish an upper bound for the difference between each client's latest gradient and her current gradient, which is critical for the convergence analysis of FedLaAvg in Subsection 4.2. To implement this principle, we use τ_i^t to record the latest iteration before or at iteration t in which client i participates in the training process. During the aggregation procedure (Lines 8–9), to reduce the aggregation overhead, each selected client uploads the gradient difference between the gradients computed in the current participating iteration and the previous participating iteration, i.e., g_i^t − g_i^{τ_i^{t−1}}, rather than the current gradient as in the traditional FedAvg algorithm. Once the gradient difference from each participating client is received, the cloud server updates the global gradient using (4). Following this aggregation method, the cloud server only needs to store the average latest gradient and perform βN update operations per iteration. It can be proved by induction that at the end of each iteration t, the resulting gradient is indeed the average latest gradient:

ḡ^t = (1/N) Σ_{i=1}^N g_i^{τ_i^t}.    (6)

Once the average latest gradient is obtained, the cloud server uses it to update the global model parameters in (5).
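To make the server-side bookkeeping concrete, here is a minimal Python sketch of the FedLaAvg loop, assuming a one-dimensional model with hypothetical quadratic client losses; the availability schedule, loss means, and all numeric values are illustrative assumptions, not the paper's configuration.

```python
# Server-side sketch of FedLaAvg: select the beta*N available clients
# absent for the longest time, aggregate uploaded gradient differences,
# and keep the running average of latest gradients (invariant (6)).
def fedlaavg(grad, available, x0, N, T, gamma, beta):
    tau = [0] * N                  # latest participating iteration per client
    latest = [0.0] * N             # latest gradient of each client (init 0)
    g_bar = 0.0                    # running average of latest gradients
    x = x0
    for t in range(1, T + 1):
        avail = available(t)
        selected = sorted(avail, key=lambda i: tau[i])[:max(1, int(beta * N))]
        for i in selected:
            g = grad(i, x)                 # client computes g_i^t
            g_bar += (g - latest[i]) / N   # server applies the difference
            latest[i], tau[i] = g, t       # invariant (6) is preserved
        x -= gamma * g_bar                 # update (5)
    return x

# Toy usage: four clients with hypothetical quadratic losses (1/2)(x - mu_i)^2;
# two of them are only intermittently available.
mus = [0.0, 2.0, 4.0, 6.0]
grad = lambda i, x: x - mus[i]
available = lambda t: list(range(4)) if t % 2 else [0, 1]
x_final = fedlaavg(grad, available, x0=0.0, N=4, T=2000, gamma=0.05, beta=0.5)
print(round(x_final, 2))  # close to the mean of mus, i.e., 3.0
```

Note that the server never stores per-client gradients beyond the scalar (or vector) `latest` cache; aggregating differences is what keeps the per-iteration update cost proportional to the number of participating clients.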

4 Convergence Analysis

In this section, under the practical model of intermittent client availability, we show that FedLaAvg achieves an O(1/(N^1/4 T^1/2)) convergence rate, a sublinear speedup in terms of the total number of clients.

4.1 Convergence on Example 1

We first demonstrate that FedLaAvg converges in Example 1, where FedAvg produces an arbitrarily poor-quality result. The convergence analysis of FedLaAvg for this simple example sheds light on the analysis for the case of general non-convex optimization in the next subsection.

Theorem 2.

Suppose each client computes the exact (not stochastic) gradient. In Example 1, after T iterations, FedLaAvg with a learning rate of order O(1/√T) produces a solution that is within an O(1/√T) range of the optimal solution x* = (μ_1 + μ_2)/2:

min_{1 ≤ t ≤ T} f(x^t) − f(x*) ≤ O(1/√T),    (7)

where we choose the iterate with the smallest loss as the output.

Proof of Theorem 2.

We recall that

f(x) = (x − x*)² + (μ_1 − μ_2)²/4 + (Var(ξ_1) + Var(ξ_2))/2,

where the latter two terms are not associated with the variable x. Hence, we only need to focus on the following part of the loss function:

h(x) ≜ (x − x*)²,    (8)

where x* = (μ_1 + μ_2)/2 is the optimal solution.

Note that

h(x^t) − h(x^{t−1}) = (x^t − x^{t−1})² + 2 (x^{t−1} − x*)(x^t − x^{t−1}).    (9)

We calculate the difference of x between two successive iterations:

x^t − x^{t−1} = −γ [(x^{τ_1^t − 1} − μ_1) + (x^{τ_2^t − 1} − μ_2)],    (10)

where τ_i^t is defined in Subsection 3.2. Hence, we have

x^t − x^{t−1} = −2γ (x^{t−1} − x*) + γ (x^{t−1} − x^{τ_1^t − 1}) + γ (x^{t−1} − x^{τ_2^t − 1}).    (11)

Substituting (10) and (11) into (9), we have

(12)

where step (a) follows from the elementary inequality 2ab ≤ a² + b².

The algorithm starts from model parameters x^0. When client 1 is available, x moves towards μ_1, and when client 2 is available, x moves towards μ_2. Hence, x always stays within a bounded range during the training, and consequently

|x^{t−1} − x^{τ_i^t − 1}| ≤ (t − τ_i^t) γ G, ∀i,    (13)

where G is the largest gradient norm during the training process. Substituting (13) into (12), we have

(14)

Referring to the specific client availability model in this example, we have

t − τ_i^t ≤ 2E − 1, ∀i.    (15)

Therefore, summing (14) over the iterations of one period, we have

(16)

Note that the above formula also holds in the boundary cases at the beginning and end of the training.

Substituting (16) into (12) and rearranging, we have

(17)

Summing (17) over iterations from 1 to T, we have

(18)

Substituting the chosen learning rate into (18), we have

(19)

Finally, we have

(20)
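The behavior derived above can be checked numerically. The sketch below runs FedLaAvg with exact gradients on Example 1, under the hypothetical values μ1 = 0, μ2 = 10, γ = 0.01, and E = 50; reusing the unavailable client's latest gradient keeps the iterate near the optimum (μ1 + μ2)/2, in contrast to FedAvg's bias.

```python
# FedLaAvg on Example 1: when a client is unavailable, its latest
# gradient is reused in the average. All numeric values are hypothetical.
mu1, mu2 = 0.0, 10.0
optimum = (mu1 + mu2) / 2            # x* = 5.0
gamma, E = 0.01, 50

x = 0.0
latest = [x - mu1, x - mu2]          # latest gradients of clients 1 and 2
for period in range(200):
    for _ in range(E):               # client 1 available: refresh g1
        latest[0] = x - mu1
        x -= gamma * 0.5 * (latest[0] + latest[1])
    for _ in range(E):               # client 2 available: refresh g2
        latest[1] = x - mu2
        x -= gamma * 0.5 * (latest[0] + latest[1])
print(x)  # approaches the optimum (mu1 + mu2) / 2 = 5.0
```

The stale gradient acts as a proxy for the absent client, so the averaged update always points towards the balanced objective; the staleness of at most 2E iterations only perturbs the trajectory, consistent with the bound in (13).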

4.2 Convergence on General Non-convex Functions

In this subsection, we show the convergence of FedLaAvg on general non-convex functions.

We first introduce Lemma 1 about client participation.

Lemma 1.

Under Assumption 4, the client selection policy in FedLaAvg guarantees that, for each client, the latest participating iteration is at most B iterations earlier than the current iteration, where B is a function of the parameters N, E, and β:

t − τ_i^t ≤ B, ∀i, t.    (21)
Proof of Lemma 1.

Due to space limitations, please refer to Appendix B for the detailed proof. ∎

With such a client participation condition, we can derive a key result for analyzing the convergence of FedLaAvg.

Theorem 3.

By setting the learning rate appropriately in FedLaAvg, we can derive the following bound on the average expected squared gradient norm under Assumptions 1–4:

where x* is the optimal solution of the general non-convex optimization problem.

Proof of Theorem 3.

The basic idea is similar to the simple case discussed in Subsection 4.1. With Lemma 1, we first show that the difference between the latest gradient and the corresponding current gradient is bounded. Then the theorem follows with this bound and the smoothness of . For the detailed proof, please refer to Appendix C. ∎

Before presenting our main result, we consider the full client participation setting discussed in Yu et al. (2019b), in which our FedLaAvg reduces to FedAvg. Since E = 1, β = 1, and τ_i^t = t in this setting, the result in Theorem 3 becomes

Choosing the learning rate accordingly, when T is sufficiently large, we can obtain the O(1/(N^1/2 T^1/2)) convergence, which is consistent with the linear speedup in terms of N proven in Yu et al. (2019b).

For the intermittent client availability setting considered in this work, FedLaAvg achieves a sublinear speedup by choosing appropriate hyperparameters. For ease of illustration, we define the loss difference between the initial solution x^0 and the optimal solution x* as F ≜ f(x^0) − f(x*). In addition, we recall that β is the proportion of selected clients in each iteration.

Corollary 1.

By choosing the learning rate appropriately and requiring T to be sufficiently large in FedLaAvg, we have the following convergence result:

When T is large enough relative to N, E, and β, we further obtain the O(1/(N^1/4 T^1/2)) convergence, i.e., a sublinear speedup with respect to the total number of clients.

Proof.

Please refer to Appendix D for the detailed proof. ∎

5 Experiments

5.1 Experiment Setting

In this section, we evaluate the performance of FedLaAvg in an image classification task over the CIFAR-10 dataset (Krizhevsky et al., 2009). The CIFAR-10 dataset consists of 60000 color images from 10 different classes, with 6000 images per class (5000 for training and 1000 for testing). We simulate the non-IID data distribution by setting each client to hold only images from one certain class, with an equal number of clients assigned to each class. To simulate the data unbalance, we let the number of samples on each client roughly follow a normal distribution. For image classification, we take the deep learning model architecture from the PyTorch tutorial, with two convolutional layers followed by two fully connected layers and then a linear transformation layer to produce logits; the total parameter size of such a model is 62,006.

For simple illustration, the previous discussion focuses on the case in which participating clients upload gradient information in each iteration. In practical FL deployments, for communication efficiency, each participating client is allowed to perform multiple local training iterations before uploading the accumulated local model update (McMahan et al., 2017). FedLaAvg can be easily extended to this setting with the same performance guarantee; the detailed design and convergence analysis are presented in Appendix E. To be consistent with practical FL deployments, we conduct experiments for FedLaAvg with multiple local iterations in each communication round.


Figure 1: The diurnal pattern of client availability. Half of the clients are available in white grids, while the remaining clients are available in black grids.

Figure 1 describes the intermittent client availability model with the diurnal pattern adopted in this experiment. In white grids, the clients holding the first five classes are available, and in black grids, the clients holding the other five classes are available. The ratio between the durations of the two kinds of grids describes the degree of heterogeneity between the availability patterns of these two groups of clients.

In the default experiment setting, we set the total number of clients N = 1000, the period length E = 10, and the proportion of selected clients in each round β = 0.1; the availability ratio and the number of local iterations per round are kept at fixed default values. For hyperparameters, we let the learning rate decay in each communication round, and tune the initial learning rate and its decay for each experiment. The batch size for the local iterations of each client is set to 5. For the detailed learning rates of each experiment, please refer to Appendix G.
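As a rough illustration of this setup, the sketch below builds the class-per-client partition and the diurnal availability grids; the Gaussian parameters (mean 50, standard deviation 10) and the schedule helper are hypothetical simplifications rather than the paper's exact configuration.

```python
import random

# Sketch of the experimental setup: one CIFAR-10 class per client, mildly
# unbalanced sample counts, and the diurnal availability grids of Figure 1.
def partition_clients(num_clients=1000, num_classes=10, seed=0):
    rng = random.Random(seed)
    return [{
        "class": i % num_classes,                       # non-IID: one class each
        "num_samples": max(1, int(rng.gauss(50, 10))),  # unbalanced data
    } for i in range(num_clients)]

def available_clients(round_idx, clients, period=10):
    """White grids: classes 0-4 available; black grids: classes 5-9."""
    in_white_grid = (round_idx // period) % 2 == 0
    wanted = range(0, 5) if in_white_grid else range(5, 10)
    return [i for i, c in enumerate(clients) if c["class"] in wanted]

clients = partition_clients()
print(len(available_clients(0, clients)))  # half of the clients per grid
```

With this schedule, the two halves of the client population never overlap within a grid, which is exactly the condition under which the FedAvg bias of Subsection 3.1 appears.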

5.2 Experimental Results


Figure 2: Performance of FedLaAvg, FedAvg, and sequential SGD under full client availability and intermittent client availability. We use ICA to abbreviate intermittent client availability and FCA to abbreviate full client availability.

Algorithms | FCA    | ICA
FedAvg     | 59.25% | 56.75%
FedLaAvg   | 64.70% | 60.98%
Table 2: Best test accuracy of FedAvg and FedLaAvg achieved within 70000 rounds.
(a) Test accuracy with different N.
(b) Test accuracy with different E.
(c) Test accuracy with different β.
Figure 3: The performance of FedLaAvg with the variation of the total number of clients N, the period length E, and the proportion of selected clients β.

We compare FedLaAvg with FedAvg and sequential SGD, and show the experiment results in Figure 2. We run the standard SGD algorithm to train the global model using the whole dataset, with the same number of iterations per round as the federated algorithms. The result of sequential SGD can be regarded as the optimal solution of the optimization problem. The test accuracy of sequential SGD decreases after reaching its peak and finally converges to 61.59%, due to overfitting. The test accuracy of FedLaAvg, under both intermittent and full client availability, suffers from large oscillation at first but finally converges. This is because in the early stage of training, the model parameters change drastically, leading to a large difference between each client's latest gradient and her current gradient. As the training progresses, the model parameters change smoothly and this difference vanishes, indicating that the latest gradient better approximates the current gradient. In contrast with FedLaAvg, FedAvg has a smaller oscillation in the early stage of training, as it always uses the current gradients to update the model parameters. Although FedAvg finally converges under full client availability, it suffers from periodic oscillation under intermittent client availability even after a huge number of rounds, as shown in the later communication rounds of Figure 2. This is consistent with the divergence of FedAvg analyzed in Subsection 3.1. A useful trick in practice is to combine the advantages of FedAvg and FedLaAvg: use FedAvg to train the model until it reaches its bottleneck, and then switch to FedLaAvg to further improve the performance.

We next show the performance of FedAvg and FedLaAvg in Table 2. We observe that FedLaAvg outperforms FedAvg under both client availability models and approaches the optimal solution. Under intermittent client availability, the best test accuracy of FedLaAvg is 4.23% higher than that of FedAvg. Under full client availability, FedLaAvg still achieves 5.45% higher test accuracy than FedAvg, because FedLaAvg leverages the gradient information of all clients for model training in each iteration, while FedAvg uses only the gradients of the selected clients. We also observe that FedLaAvg approaches the performance of sequential SGD: the best test accuracy of FedLaAvg under intermittent client availability is only 0.61% lower than the test accuracy that sequential SGD finally converges to. From Figure 2, we note that due to overfitting, the convergent test accuracy of sequential SGD is slightly lower than the best test accuracy of FedLaAvg under full client availability.

The evaluation results in Figure 3 and Table 3 further validate the convergence result of FedLaAvg in Corollary 1. From Table 3, we see that FedLaAvg generally needs fewer training rounds to reach a certain test accuracy when either the total number of clients N or the proportion of selected clients β increases, or when the period length E decreases. This result also validates the sublinear speedup of FedLaAvg with respect to N. In addition, as shown in Figure 3, FedLaAvg converges in diverse parameter settings, which is consistent with the convergence guarantee in Corollary 1. The different convergent test accuracies come from the non-convexity of the objective functions. From Subfigure 3(b), we further observe that with a larger E, the test accuracy of FedLaAvg oscillates more severely, because the latest gradient becomes an inaccurate approximation of the current gradient when clients are unavailable for a longer period.


N      | 200   | 400   | 600   | 800   | 1000
Rounds | 33550 | 14800 | 12850 | 12700 | 10100
E      | 1     | 5     | 10    | 15    |
Rounds | 10000 | 9950  | 10100 | 13400 |
β      | 0.02  | 0.04  | 0.06  | 0.08  | 0.10
Rounds | 27800 | 32200 | 13450 | 9400  | 10100
Table 3: The number of training rounds needed to reach 55% test accuracy with different parameters.

6 Conclusion

In this work, we investigate intermittent client availability in federated learning and its impact on the convergence of the classical federated averaging algorithm. We use a collection of time-varying sets to represent the available clients in each training iteration, which accurately models intermittent client availability. Furthermore, we design a simple FedLaAvg algorithm with an O(1/(N^1/4 T^1/2)) convergence guarantee for general distributed non-convex optimization problems. Empirical experiments with the CIFAR-10 dataset demonstrate the effectiveness and efficiency of FedLaAvg, with a remarkable performance improvement and a sublinear speedup.

References

  • K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander (2019) Towards federated learning at scale: system design. In Proceedings of MLSys, Cited by: §1.
  • K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons (2019) The non-IID data quagmire of decentralized machine learning. CoRR abs/1910.00189. Cited by: §1.
  • P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konecný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao (2019) Advances and open problems in federated learning. CoRR abs/1912.04977. Cited by: §1.
  • A. Khaled, K. Mishchenko, and P. Richtárik (2019) First analysis of local GD on heterogeneous data. CoRR abs/1909.04715. Cited by: Table 1, §1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report. Cited by: §5.1.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server. In Proceedings of OSDI, pp. 583–598. Cited by: §1.
  • T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2019a) Federated learning: challenges, methods, and future directions. CoRR abs/1908.07873. Cited by: §1.
  • X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang (2019b) On the convergence of FedAvg on non-IID data. CoRR abs/1907.02189. Cited by: Table 1, §1, §2.
  • X. Lian, W. Zhang, C. Zhang, and J. Liu (2018) Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of ICML, pp. 3043–3052. Cited by: §1.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS, pp. 1273–1282. Cited by: §1, §1, §2, §5.1.
  • M. Mohri, G. Sivek, and A. T. Suresh (2019) Agnostic federated learning. In Proceedings of ICML, pp. 4615–4625. Cited by: §1.
  • J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera (2012) A unifying view on dataset shift in classification. Pattern Recognition 45 (1), pp. 521–530. Cited by: §1.
  • J. Quiñonero-Candela, M. S. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2009) Dataset shift in machine learning. The MIT Press. External Links: ISBN 0262170051 Cited by: §1.
  • J. Snoek, Y. Ovadia, E. Fertig, B. Lakshminarayanan, S. Nowozin, D. Sculley, J. Dillon, J. Ren, and Z. Nado (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Proceedings of NeurIPS, pp. 13969–13980. Cited by: §1.
  • S. U. Stich, J. Cordonnier, and M. Jaggi (2018) Sparsified SGD with memory. In Proceedings of NeurIPS, pp. 4447–4458. Cited by: §2.
  • S. U. Stich and S. P. Karimireddy (2019) The error-feedback framework: better rates for SGD with delayed gradients and compressed communication. CoRR abs/1909.05350. Cited by: Table 1, §1.
  • S. U. Stich (2019) Local SGD converges fast and communicates little. In Proceedings of ICLR, Cited by: Table 1, §1, §2.
  • A. Subbaswamy, P. Schulam, and S. Saria (2019) Preventing failures due to dataset shift: learning predictive models that transport. In Proceedings of AISTATS, pp. 3118–3127. Cited by: §1.
  • H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu (2018a) Communication compression for decentralized training. In Proceedings of NeurIPS, pp. 7652–7662. Cited by: §1.
  • H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu (2018b) D²: Decentralized training over decentralized data. In Proceedings of ICML, pp. 4855–4863. Cited by: §1.
  • H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu (2019) DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Proceedings of ICML, pp. 6155–6165. Cited by: §1.
  • J. Wang and G. Joshi (2018) Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. CoRR abs/1808.07576. Cited by: Table 1, §1.
  • T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays (2018) Applied federated learning: improving google keyboard query suggestions. CoRR abs/1812.02903. Cited by: §1.
  • H. Yu, R. Jin, and S. Yang (2019a) On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of ICML, pp. 7184–7193. Cited by: §2.
  • H. Yu and R. Jin (2019) On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In Proceedings of ICML, pp. 7174–7183. Cited by: §1.
  • H. Yu, S. Yang, and S. Zhu (2019b) Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning. In Proceedings of AAAI, pp. 5693–5700. Cited by: Table 1, §1, §2, §3.2, §4.2.
  • Y. Zhang, M. J. Wainwright, and J. C. Duchi (2012) Communication-efficient algorithms for statistical optimization. In Proceedings of NeurIPS, pp. 1502–1510. Cited by: §2.

Appendix A Proof of Theorem 1

Proof of Theorem 1.

We first show that if , will converge to

(22)

Note that for iterations where client is available, we have

where is the learning rate. Rearranging this equation, we obtain

which implies that is a geometric progression. Hence, we have

(23)

Applying the same analysis to iterations where client is available, we have

(24)

Substituting (23) into (24), we have

(25)

Based on this recursion formula, we have

Since , we have

By L’Hôpital’s rule, we have

The global minimization objective is

which is obtained when . Note that only when (data distributions are IID) or . Hence, FedAvg will produce arbitrarily poor results without these impractical assumptions. ∎
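The geometric-progression argument above can be checked with a toy numerical sketch. Everything in it is an illustrative assumption rather than the paper's exact setting: two clients with quadratic local objectives F_i(x) = (x - m_i)^2 / 2, a strictly periodic availability pattern (client 1 available for E consecutive iterations, then client 2 for the next E), and arbitrary values for m_1, m_2, E, and the learning rate eta.

```python
# Toy illustration of the Theorem 1 argument (illustrative assumptions only):
# two clients with quadratic objectives F_i(x) = (x - m_i)^2 / 2, available in
# alternating blocks of E iterations. A gradient step on the available
# client's objective is x <- (1 - eta) * x + eta * m_i, so x - m_i forms a
# geometric progression within each block, mirroring the proof.

def simulate(m1, m2, E, eta, periods, x0=0.0):
    x = x0
    for _ in range(periods):
        for _ in range(E):   # block where only client 1 is available
            x = (1 - eta) * x + eta * m1
        for _ in range(E):   # block where only client 2 is available
            x = (1 - eta) * x + eta * m2
    return x

m1, m2, E, eta = 0.0, 10.0, 50, 0.1
x_inf = simulate(m1, m2, E, eta, periods=200)

# Fixed point of one full period, with a = (1 - eta)^E:
#   x* = (m2 + a * m1) / (1 + a),
# which differs from the minimizer (m1 + m2) / 2 of the average objective
# unless m1 = m2 (IID data) or a -> 1.
a = (1 - eta) ** E
x_star = (m2 + a * m1) / (1 + a)
print(x_inf, x_star, (m1 + m2) / 2)
```

The iterate settles near the most recently available client's minimizer rather than the global one, matching the conclusion that FedAvg can be arbitrarily poor under intermittent availability.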

Appendix B Proof of Lemma 1

Proof of Lemma 1.

We focus on the training process from iteration (exclusive). In iteration , under Assumption 4, client has been available at least times. Denote these iterations as . We prove the lemma by contradiction: suppose is not selected in any of these iterations. Then we have . In the iterations where client is available, clients have been selected. Each of these clients (denoted as ) satisfies for all iterations before it participates in the training process and for all iterations after participation. Hence, these clients are distinct. Including client , the system would then contain at least clients, whereas it has only clients, which is a contradiction. Therefore, for all , the next iteration in which client participates in the training process after iteration satisfies

(27)

For every client , by setting in (27) to the iterations where client is selected, we can derive
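The pigeonhole argument above can also be exercised in a small simulation. The sketch below makes hypothetical modeling choices that are not spelled out in this excerpt: each client i is guaranteed to be available whenever t mod E equals i mod E (so it is available at least once in every E consecutive iterations), extra availability occurs at random, one client is selected per iteration, and the selection rule picks the available client whose last participation is oldest. Under these assumptions, the distinct-clients argument bounds every participation gap by roughly N * E iterations.

```python
import random

def simulate_gaps(N=6, E=4, T=20000, p_extra=0.3, seed=0):
    """Run T iterations and return the largest gap between consecutive
    participations of any selected client."""
    rng = random.Random(seed)
    last = [0] * N  # last participation iteration of each client
    max_gap = 0
    for t in range(1, T + 1):
        # Client i is guaranteed available when t % E == i % E, so each
        # client is available at least once per E iterations (a stand-in
        # for Assumption 4); extra availability is random.
        avail = [i for i in range(N)
                 if t % E == i % E or rng.random() < p_extra]
        # Selection rule (sketch): pick the available client whose last
        # participation is oldest, breaking ties by index.
        chosen = min(avail, key=lambda i: (last[i], i))
        max_gap = max(max_gap, t - last[chosen])
        last[chosen] = t
    return max_gap

N, E = 6, 4
gap = simulate_gaps(N=N, E=E)
print("max participation gap:", gap, "  pigeonhole bound:", (N + 1) * E)
```

Each time a waiting client is passed over, the selected client's participation time jumps ahead of the waiting client's, so it cannot be preferred again; after at most N - 1 pass-overs the waiting client must be selected, which is exactly the distinctness argument above.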

Appendix C Proof of Theorem 3

Note that the local gradient is not calculated in every iteration. In this part of the appendix, for the purpose of analysis, we extend the definition of : for iterations where client does not participate, is a random variable that follows .

Lemma 2.

Under Assumptions 2 and 3, we have

and

Proof of Lemma 2.

Assumptions 2 and 3 take the expectation over the randomness of a single training iteration, whereas we care about the expectation taken over the randomness of the whole training process. This simple lemma bridges the gap.

For the gradient, we have

(28)

where (a) follows from the law of total expectation and (b) follows from Assumption 3.

For the variance, we have

(29)

where (a) follows from the law of total expectation and (b) follows from Assumption 2. ∎
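The tower-property step in Lemma 2 can be sanity-checked with a Monte Carlo sketch. The setup below is entirely hypothetical: a one-dimensional objective F(x) = x^2 / 2, an iterate that is itself random because it depends on earlier sampling, and a stochastic gradient equal to the true gradient plus zero-mean Gaussian noise with standard deviation sigma. Averaging over many trajectories confirms that unbiasedness and the variance bound survive taking the expectation over the whole training process.

```python
import random

rng = random.Random(42)

def true_grad(x):
    # Hypothetical objective F(x) = x^2 / 2, so the true gradient is x.
    return x

def run_trajectory(sigma, eta=0.1, steps=20):
    """One noisy-SGD run; collect (stochastic grad, true grad) pairs."""
    x, pairs = rng.gauss(0.0, 1.0), []
    for _ in range(steps):
        g = true_grad(x) + rng.gauss(0.0, sigma)  # E[g | x] = true_grad(x)
        pairs.append((g, true_grad(x)))
        x -= eta * g  # the iterate depends on all earlier noise
    return pairs

sigma, runs = 0.5, 20000
bias_sum, var_sum, n = 0.0, 0.0, 0
for _ in range(runs):
    for g, grad in run_trajectory(sigma):
        bias_sum += g - grad        # should average to ~0 (unbiasedness)
        var_sum += (g - grad) ** 2  # should average to ~sigma^2
        n += 1

bias, var = bias_sum / n, var_sum / n
print(f"mean(g - grad) = {bias:.4f}, mean((g - grad)^2) = {var:.4f}")
```

Even though the iterate is correlated with all earlier noise, conditioning on it and then averaging (the law of total expectation) leaves the per-iteration mean and variance guarantees intact, which is all the lemma asserts.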

Lemma 3.

, we have

Proof of Lemma 3.

This lemma follows because training data are independent across clients. Specifically, note that

(30)

where (a) follows from the law of total expectation. We then verify (b) case by case. Note that is equal to when . When , without loss of generality, suppose . Then it is equal to

(31)

because and are determined by . When