Federated learning (FL) is a training paradigm enabling different clients to jointly learn a global model without sharing their respective data. Federated learning is a generalization of distributed learning (DL), which was first introduced to optimize a given model in star-shaped networks composed of a server communicating with computing machines (Bertsekas and Tsitsiklis, 1989; Nedić et al., 2001; Zinkevich et al., 2009). In DL, the server owns the dataset and distributes it across machines. At every optimization round, the machines return the estimated gradients, and the server aggregates them to perform a gradient descent step. DL was later extended to account for stochastic gradient descent (SGD), and FL extends DL to enable optimization without sharing data between clients. Typical federated training schemes are based on the averaging of clients' model parameters optimized locally by each client, such as in FedAvg (McMahan et al., 2017), where at every optimization round clients perform a fixed number of SGD steps initialized with the current global model parameters, and subsequently return the optimized parameters to the server. The server computes the new global model as the average of the clients' updates weighted by their respective data ratio.
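To make the averaging step concrete, the following minimal sketch computes a FedAvg-style global model as the data-ratio-weighted average of client parameter vectors. The function name `fedavg_aggregate` and the toy parameter values are assumptions made for illustration; they are not taken from McMahan et al. (2017).

```python
from typing import List, Sequence


def fedavg_aggregate(client_params: Sequence[Sequence[float]],
                     num_samples: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighted by each client's data ratio."""
    total = sum(num_samples)
    dim = len(client_params[0])
    return [
        sum(n * params[j] for params, n in zip(client_params, num_samples)) / total
        for j in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 data points respectively.
new_global = fedavg_aggregate([[0.0, 4.0], [4.0, 0.0]], [1, 3])
```

The client with three times more data pulls the average three times harder, which is the "data ratio" weighting described above.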
A key methodological difference between the optimization problem solved in FL and the one of DL lies in the assumption that data instances are potentially not independent and identically distributed (non-iid) (Kairouz et al., 2019; Yang et al., 2019). Proving convergence in the non-iid setup is more challenging, and in some settings, FedAvg has been shown to converge to a sub-optimum, e.g. when each client performs a different amount of local work (Wang et al., 2020a), or when clients are not sampled in expectation according to their importance (Cho et al., 2020).
A major drawback of FedAvg concerns the time needed to complete an optimization round, as the server must wait for all the clients to perform their local work before synchronizing their updates and creating a new global model. As a consequence, due to the potential heterogeneity of the hardware across clients, the duration of an optimization round is determined by the slowest client, while the fastest clients stay idle once they have sent their updates to the server. To address these limitations, asynchronous FL has been proposed to take full advantage of the clients' computation capabilities (Xu et al., 2021; Nguyen et al., 2018; Koloskova et al., 2019; De Sa et al., 2015). In the asynchronous setting, whenever the server receives a client’s contribution, it creates a new global model and sends it back to the client. In this way, clients are never idle and always perform local work on a different version of the global model. While asynchronous FL has been investigated in the iid case (Stich and Karimireddy, 2020), a unified theoretical and practical investigation in the non-iid scenario is currently missing.
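The difference between the two waiting policies can be sketched with a small event simulation. It assumes, for illustration only, that each client has a fixed (deterministic) update time and restarts its local work immediately after being served; `sync_rounds` and `async_aggregations` are hypothetical helper names, and simultaneous arrivals are counted as separate aggregations.

```python
import heapq
from typing import List, Tuple


def sync_rounds(update_times: List[float], horizon: float) -> int:
    """Synchronous scheme: each round lasts as long as the slowest client."""
    return int(horizon // max(update_times))


def async_aggregations(update_times: List[float],
                       horizon: float) -> List[Tuple[float, int]]:
    """Asynchronous scheme: the server aggregates at every client arrival.

    Returns (arrival_time, client_index) for each aggregation within the horizon.
    """
    events = [(t, i) for i, t in enumerate(update_times)]
    heapq.heapify(events)
    log = []
    while events and events[0][0] <= horizon:
        t, i = heapq.heappop(events)
        log.append((t, i))
        # The client receives the new global model and starts working again.
        heapq.heappush(events, (t + update_times[i], i))
    return log


times = [1.0, 2.0, 4.0]  # hypothetical per-client update times
n_sync = sync_rounds(times, horizon=8.0)
log = async_aggregations(times, horizon=8.0)
```

In this toy setting the synchronous server completes far fewer aggregations over the same horizon, while the asynchronous log is dominated by the fastest client, which hints at the imbalance in client contributions discussed in the rest of this work.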
This work introduces a novel theoretical framework for asynchronous FL based on the generalization of the aggregation scheme of FedAvg, where asynchronicity is modeled as a stochastic process affecting clients’ contribution at a given federated aggregation step. More specifically, our framework is based on a stochastic formulation of FL, where clients are given stochastic aggregation weights dependent on their effectiveness in returning an update. Based on this formulation, we provide sufficient conditions for asynchronous FL to converge, and we subsequently give sufficient conditions for convergence to the FL optimum of the associated synchronous FL problem. Our conditions depend on the clients' computation time (which can, if needed, be estimated by the server), and are independent of the clients' data heterogeneity, which is usually unknown to the server.
With asynchronous FL, the server only waits for one client contribution to create the new global model. As a result, optimization rounds are potentially faster, even though the new global model improves only for the participating client, to the detriment of the other ones. This aspect may affect the stability of asynchronous FedAvg as compared to synchronous FedAvg and, as we demonstrate in this work, asynchronous FedAvg can even diverge in some cases. To tackle this issue, we propose FedFix, a robust asynchronous FL scheme, where new global models are created with all the clients' contributions received after a fixed amount of time. We prove the convergence of FedFix and verify experimentally that it outperforms standard asynchronous FedAvg in the considered experimental scenarios.
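The fixed-window waiting policy can be illustrated with the sketch below. It is a simplified reading of the description above rather than the authors' implementation: update times are assumed deterministic, and a client that finishes inside a window is assumed to wait for the end of that window before receiving the new global model. The name `fedfix_schedule` is hypothetical.

```python
import math
from typing import List


def fedfix_schedule(update_times: List[float], window: float,
                    num_rounds: int) -> List[List[int]]:
    """Return, for each aggregation round, the clients whose work is aggregated."""
    # Number of fixed-length windows each client needs to deliver one update.
    rounds_needed = [max(1, math.ceil(t / window)) for t in update_times]
    # First round (1-indexed) at which each client contributes.
    next_round = list(rounds_needed)
    schedule = []
    for n in range(1, num_rounds + 1):
        participants = [i for i, r in enumerate(next_round) if r == n]
        for i in participants:
            # The client restarts from the new global model created at round n.
            next_round[i] = n + rounds_needed[i]
        schedule.append(participants)
    return schedule


schedule = fedfix_schedule([1.0, 2.0, 4.0], window=2.0, num_rounds=4)
```

With a window of 2 time units, the two fastest clients contribute at every aggregation and the slowest one at every second aggregation, so each new global model is built from several contributions instead of a single one.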
The paper is structured as follows. In Section 2, we introduce our aggregation scheme and the closed form of its aggregation weights as a function of the clients' computation capabilities and the considered FL optimization routine. Based on our aggregation scheme, in Section 3, we provide convergence guarantees, and we give sufficient conditions for the learning procedure to converge to the optimum of the FL optimization problem. In Section 4, we apply our theoretical framework to synchronous and asynchronous FedAvg, and show that our work extends current state-of-the-art approaches to asynchronous optimization in FL. Finally, in Section 5, we experimentally demonstrate our theoretical results.
We define here the formalism required by the theory that will be introduced in the following sections. We first introduce in Section 2.1 the FL optimization problem, and we adapt it in Section 2.2 to account for delays in client contributions. We then generalize in Section 2.3 the FedAvg aggregation scheme to account for contribution delays. In Section 2.4, we introduce the notion of virtual global models as a direct generalization of gradient descent, and introduce in Section 2.5 the final asynchronous FL optimization problem. Finally, we introduce in Section 2.6 a formalization of the concept of data heterogeneity across clients.
2.1 Federated Optimization Problem
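We have $M$ participants owning $n_i$ data points $\{z_{k,i}\}_{k=1}^{n_i}$ independently sampled from a fixed unknown distribution over a sample space $\mathcal{Z}_i$. We have $z_{k,i} = (x_{k,i}, y_{k,i})$ for supervised learning, where $x_{k,i}$ is the input of the statistical model, and $y_{k,i}$ its desired target, while we denote $z_{k,i} = x_{k,i}$ for unsupervised learning. Each client optimizes the model’s parameters $\theta$ based on the estimated local loss $l(\theta, z_{k,i})$. The aim of FL is solving a distributed optimization problem associated with the averaged loss across clients

$$\mathcal{L}(\theta) := \mathbb{E}_{z \sim \hat{\mathcal{Z}}}\left[ l(\theta, z) \right] = \frac{1}{\sum_{i=1}^{M} n_i} \sum_{i=1}^{M} \sum_{k=1}^{n_i} l(\theta, z_{k,i}), \tag{1}$$

where the expectation is taken with respect to the sample distribution $\hat{\mathcal{Z}}$ across the $M$ participating clients. We consider a general form of the federated loss of equation (1) where clients' local losses are weighted by an associated parameter $p_i$ such that $\sum_{i=1}^{M} p_i = 1$, i.e.

$$\mathcal{L}(\theta) := \sum_{i=1}^{M} p_i \mathcal{L}_i(\theta) \quad \text{s.t.} \quad \mathcal{L}_i(\theta) := \frac{1}{n_i} \sum_{k=1}^{n_i} l(\theta, z_{k,i}). \tag{2}$$

The weight $p_i$ can be interpreted as the importance given by the server to client $i$ in the federated optimization problem. While any combination of $\{p_i\}$ is possible, we note that in typical FL formulations, either (a) every client has equal importance, i.e. $p_i = 1/M$, or (b) every data point is equally important, i.e. $p_i = n_i / \sum_{j=1}^{M} n_j$.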
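As a small illustration of the two weighting conventions (a) and (b), the sketch below computes client importance weights from hypothetical client dataset sizes. The function names and the sizes are assumptions made for the example.

```python
from typing import List


def uniform_importance(num_samples: List[int]) -> List[float]:
    """Convention (a): every client has equal importance."""
    m = len(num_samples)
    return [1.0 / m for _ in num_samples]


def data_ratio_importance(num_samples: List[int]) -> List[float]:
    """Convention (b): every data point is equally important."""
    total = sum(num_samples)
    return [n / total for n in num_samples]


sizes = [10, 30, 60]  # hypothetical client dataset sizes
p_uniform = uniform_importance(sizes)
p_ratio = data_ratio_importance(sizes)
```

Both conventions return weights that sum to one; they coincide only when all clients hold the same number of data points.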
2.2 Asynchronicity in Clients Updates
An optimization round starts at time $t^n$ with global model $\theta^n$, finishes at time $t^{n+1}$ with the new global model $\theta^{n+1}$, and takes $\Delta t^n = t^{n+1} - t^n$ time to complete. No assumptions are made on $\Delta t^n$, which can be a random variable, and we set for convenience $t^0 = 0$. In this section, we introduce the random variables needed to develop in Section 2.3 the server aggregation scheme connecting two consecutive global models $\theta^n$ and $\theta^{n+1}$.
We define the random variable $T_i$ representing the update time needed for client $i$ to perform its local work and send it to the server for aggregation. $T_i$ depends on the client computation and communication hardware, and is assumed to be independent from the current optimization round $n$. If the server sets the FL round time to $\Delta t^n = \max_i T_i$, the aggregation is performed by waiting for the contribution of every client, and we retrieve the standard client-server communication scheme of synchronous FedAvg.
With asynchronous FedAvg, we need to relate $T_i$ to the server aggregation time $t^n$. We introduce $\rho_i(n)$, the index of the most recent global model received by client $i$ at optimization round $n$ and, by construction, we have $0 \le \rho_i(n) \le n$. We define by

$$T_i^n := T_i - \left( t^n - t^{\rho_i(n)} \right) \tag{3}$$

the remaining time at optimization round $n$ needed by client $i$ to complete its local work.
Comparing $T_i^n$ with $\Delta t^n$ indicates whether a client is participating in the optimization round or not, through the stochastic event $\mathbb{I}(T_i^n \le \Delta t^n)$. When $\mathbb{I}(T_i^n \le \Delta t^n) = 1$, the local work of client $i$ is used to create the new global model $\theta^{n+1}$, while client $i$ does not contribute when $\mathbb{I}(T_i^n \le \Delta t^n) = 0$. With synchronous FedAvg, we retrieve $\mathbb{I}(T_i^n \le \Delta t^n) = 1$ for every client.
Figure 1 illustrates the notations described in this section in a FL process with clients.
2.3 Server Aggregation Scheme
We consider $\Delta_i(n)$, the contribution of client $i$ received by the server at optimization round $n$. In the rest of this work, we consider that clients perform $K$ steps of SGD on the model they receive from the server. By calling $\theta_i^{n+1}$ their trained model after $K$ SGD steps, we can rewrite clients' contribution for FedAvg as $\Delta_i(n) := \theta_i^{n+1} - \theta^n$, and the FedAvg aggregation scheme as

$$\theta^{n+1} := \theta^n + \sum_{i=1}^{M} p_i \Delta_i(n). \tag{4}$$

With FedAvg, the server waits for every client to send its contribution to create the new global model. To allow for partial computation within the server aggregation scheme, we introduce the aggregation weight $d_i(n)$ corresponding to the weight given by the server to client $i$ at optimization round $n$. We can then define the stochastic aggregation weight $\omega_i(n)$ given to client $i$ at optimization step $n$ as

$$\omega_i(n) := \mathbb{I}(T_i^n \le \Delta t^n) \, d_i(n), \tag{5}$$

with $\omega_i(n) = d_i(n)$ if client $i$ updated its work at optimization round $n$ and $\omega_i(n) = 0$ otherwise. In the general setting, client $i$ receives $\theta^{\rho_i(n)}$ and its contribution is $\Delta_i(\rho_i(n)) := \theta_i^{\rho_i(n)+1} - \theta^{\rho_i(n)}$. By weighting each delayed contribution $\Delta_i(\rho_i(n))$ with its stochastic aggregation weight $\omega_i(n)$, we propose the following aggregation scheme

$$\theta^{n+1} := \theta^n + \eta_g \sum_{i=1}^{M} \omega_i(n) \Delta_i(\rho_i(n)), \tag{6}$$

where $\eta_g$ is a global learning rate that the server can use to mitigate the disparity in clients' contributions (Reddi et al., 2021; Karimireddy et al., 2020; Wang et al., 2020b). Equation (6) generalizes the FedAvg aggregation scheme (4) ($\eta_g = 1$ and $\omega_i(n) = p_i$), and the one of Fraboni et al. (2022) based on client sampling.
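The aggregation rule of this section can be sketched in a few lines. The code below assumes the rule has the form "new global model = current global model + global learning rate × sum of stochastically weighted client contributions", where a client's stochastic weight is its aggregation weight if its update arrived during the round and zero otherwise. The function and argument names are hypothetical.

```python
from typing import List, Sequence


def aggregate(theta: Sequence[float],
              contributions: Sequence[Sequence[float]],
              agg_weights: Sequence[float],
              participated: Sequence[bool],
              global_lr: float = 1.0) -> List[float]:
    """One server step: theta + global_lr * sum_i omega_i * delta_i.

    omega_i equals agg_weights[i] when client i delivered its update during
    the round, and 0 otherwise, so absent clients are simply skipped.
    """
    new_theta = list(theta)
    for delta, d, present in zip(contributions, agg_weights, participated):
        omega = d if present else 0.0
        for j, delta_j in enumerate(delta):
            new_theta[j] += global_lr * omega * delta_j
    return new_theta


theta = [0.0]
deltas = [[1.0], [-1.0], [2.0]]  # hypothetical (possibly delayed) client contributions
partial = aggregate(theta, deltas, [0.5, 0.5, 0.5], [True, False, True])
full = aggregate(theta, deltas, [0.25, 0.25, 0.5], [True, True, True])
```

When every client participates, the global learning rate is 1, and the aggregation weights equal the client importances, the step reduces to a FedAvg-style weighted average of the contributions (the `full` case).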
2.4 Expressing FL as cumulative GD steps
To obtain the tightest possible convergence bound, we consider a convergence framework similar to the one of Li et al. (2020b) and Khaled et al. (2020). We introduced the aggregation rule for the server global models in Section 2.3, and we generalize it in this section by introducing the virtual sequence of global models . This sequence corresponds to the virtual global model that would be obtained with the clients contribution at optimization round computed on SGD, i.e.
We retrieve and . The server does not have access to when or . Hence the name virtual for the model .
The difference between two consecutive global models in our virtual sequence depends on the sum of the differences between local models , where is a random batch of data samples of client . Hence, we can rewrite the aggregation process as a GD step with
2.5 Asynchronous FL as a Sequence of Optimization Problems
For the rest of this work, we define , the expected aggregation weight of client at optimization round . No assumption is made on which can vary across optimization rounds. The expected clients' contributions help minimize the optimization problem defined as
We denote by the optimum of and by the optimum of the optimization problem defined in equation (2). Finally, we define by the expected importance given to client over the server aggregations during the FL process, and by the normalized expected importance . We define by the associated optimization problem
and we denote by the associated optimum.
Finally, we introduce the following expected convergence residual, which quantifies the variance at the optimum as a function of the relative clients' importance
The convergence guarantees provided in this work (Section 3) are proportional to the expected convergence residual.
is positive and null only when clients have the same loss function and perform GD steps for local optimization.
2.6 Formalizing Heterogeneity across Clients
We assume the existence of different clients' feature spaces and, without loss of generality, assume that the first clients' feature spaces are different. This formalism allows us to represent the heterogeneity of data distribution across clients. In DL problems, we have when the same dataset split is accessible to many clients. When clients share the same distribution, we assume that their optimization problem is equivalent. In this case, we call their loss function with optimum . The federated problem of equation (2) can thus be formalized with respect to the discrepancy between the clients' feature spaces . To this end, we define the set of clients with the same feature space as client , i.e. . Each feature space thus has importance , and expected importance such that
As for , we define .
In Table 1, we summarize the different weights used to adapt the federated optimization problem to account respectively for heterogeneity in clients importance and data distributions across rounds.
|Stochastic aggregation weight||-|
|Expected agg. weight|
|Normalized expected agg. weight|
|Expected agg. weight over rounds|
3 Convergence of Federated Problem (2)
In this section, we prove the convergence of the optimization based on the stochastic aggregation scheme defined in equation (6), with implementation given in Algorithm 1. We first introduce in Section 3.1 the necessary assumptions and then prove with Theorem 1 the convergence of the sequence of optimized models (Section 3.2). We show in Section 3.3 the implications of Theorem 1 on the convergence of the federated problem (2), and propose sufficient conditions for the learnt model to be the associated optimum. Finally, with two additional assumptions, we propose in Section 3.4 simpler and practical sufficient conditions for FL convergence to the optimum of the federated problem (2).
We make the following assumptions regarding the Lipschitz smoothness and convexity of the clients' local loss functions (Assumptions 1 and 2), unbiased gradient estimators (Assumption 3), finite answering time for the clients (Assumption 4), and the clients' aggregation weights (Assumption 5). Assumption 3 (Khaled et al., 2020) considers unbiased gradient estimators without assuming bounded variance, giving in turn more interpretable convergence bounds. Assumption 5 states that the covariance between two aggregation weights can be expressed as the product of their expected aggregation weight up to a positive multiplicative factor . We show in Section 4 that Assumption 5 is not limiting as it is satisfied by all the standard FL optimization schemes considered in this work.
Assumption 1 (Smoothness).
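Clients' local objective functions are $L$-Lipschitz smooth, that is, $\forall \theta, \theta'$, $\| \nabla \mathcal{L}_i(\theta) - \nabla \mathcal{L}_i(\theta') \| \le L \| \theta - \theta' \|$.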
Assumption 2 (Convexity).
Clients local objective functions are convex.
Assumption 3 (Unbiased Gradient).
Every client stochastic gradient $g_i(\theta, \xi)$ of a model with parameters $\theta$ evaluated on batch $\xi$ is an unbiased estimator of the local gradient, i.e. $\mathbb{E}_{\xi}\left[ g_i(\theta, \xi) \right] = \nabla \mathcal{L}_i(\theta)$.
Assumption 4 (Finite Answering Time).
The server receives a client local work in at most optimization steps, which satisfy .
There exists such that .
3.2 Convergence of Algorithm 1
Theorem 1 is proven in Appendix A. The convergence guarantee provided in Theorem 1 is composed of 5 terms: , , , , . In the following, we describe these terms and explain their origin in a given optimization scheme.
Optimized expected residual . The residual quantifies the sensitivity of between its optimum and the optimum of the overall expected minimized problem across optimization rounds . As such, the residual accounts for the heterogeneity in the history of optimized problems, and is minimized to 0 when the same optimization problem is minimized at every round , i.e. . This condition is always satisfied when clients have identical data distributions but, in the general case, requires the server to properly set every client aggregation weight as a function of the server waiting time policy and the clients' hardware capabilities (Sections 3.3 and 3.4).
Initialization quality . only depends on the quality of the initial model through its distance with respect to the optimum of the overall expected minimized problem across optimization rounds . This convergence term can only be minimized by performing as many serial SGD steps .
Clients' data heterogeneity . This term accounts for the disparity in the clients' updated models, and is proportional to the clients' amount of local work (quadratically) and to the heterogeneity of their data distributions through . When , every client performs its SGD on the same model, which reduces the server aggregation to a traditional centralized SGD. We retrieve .
Gradient delay through and . Decreasing the server time policy allows faster optimization rounds but decreases a client’s participation probability resulting in an increased maximum answering time . In turn, we note that and are quadratically proportional to the maximum amount of serial SGD . This latter term quantifies the maximum amount of SGD integrated in the global model .
3.3 Sufficient Conditions for Minimizing the Federated Problem (2)
Theorem 1 provides convergence guarantees for the history of optimized models . Under the same assumptions of Theorem 1, we can provide convergence guarantees for the original FL problem (proof in Appendix B).
Under the same conditions of Theorem 1, we have
, and .
Theorem 2 provides convergence guarantees for the optimization problem (2). We retrieve the components of the convergence bound of Theorem 1. The terms to can be mitigated by choosing an appropriate local learning rate , but the same cannot be said for , , . Behind these three quantities, Theorem 2 shows that proper expected representation of every dataset type is needed, i.e. . Indeed, if a client is poorly represented, i.e. , then and , while if a client is not represented at all, i.e. , then . Therefore, we propose, with Corollary 1, sufficient conditions for any FL optimization scheme satisfying Algorithm 1 to converge to the optimum of the federated problem (2).
Follows directly. implies , , , and . ∎
These theoretical results provide relevant insights for different FL scenarios.
iid data distributions, . Consistently with the extensive literature on synchronous and asynchronous distributed learning, when clients have data points sampled from the same data distribution, FL always converges to its optimum (Corollary 1). Indeed, regardless of which clients are participating, and what importance or aggregation weight a client is given.
non-iid data distributions. The convergence of FL to the optimum requires every data distribution type to be considered fairly at every optimization round, i.e. (Corollary 1). This condition is weaker than requiring every client to be treated fairly at every optimization round, i.e. . Ideally, only one client per data type needs to have a non-zero participation probability, i.e. , and an appropriate such that is satisfied. In practice, knowing the clients' data distributions is not possible. Therefore, ensuring FL convergence to its optimum requires at every optimization round (Wang et al., 2020a).
We provide in Example 1 an illustration of these results based on quadratic loss functions to show that considering data distributions fairly is sufficient for an optimization scheme satisfying Algorithm 1 to converge to the optimum of the optimization problem (2), since is satisfied at every optimization round, while may not be satisfied.
Let us consider four clients with data distributions such that their loss can be expressed as with (), (), and identical client importance, i.e. .
Therefore, each data type has identical importance, i.e. , and the optimum satisfies .
We consider that clients with odd index participate at odd optimization rounds while the ones with even index at even optimization rounds, i.e. and which gives and or but not . With , equation (6) can be rewritten as
Clients update can be rewritten as where . Equation (21) can thus be rewritten as
Solving equation (22) proves FL asymptotic convergence to the optimum .
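The behaviour described in Example 1 can be checked numerically with the sketch below. Since the exact loss functions are not restated here, the code makes explicit assumptions: each client minimizes the scalar quadratic loss 0.5 * (theta - optimum)^2, clients 1 and 2 share one optimum while clients 3 and 4 share another, every participating client starts from the current global model, and the two participating clients each receive aggregation weight 1/2 with a global learning rate of 1. The numerical values are hypothetical.

```python
def local_gd(theta: float, optimum: float, lr: float, steps: int) -> float:
    """Run `steps` GD steps on the quadratic loss 0.5 * (theta - optimum) ** 2."""
    for _ in range(steps):
        theta -= lr * (theta - optimum)
    return theta


def run(num_rounds: int, lr: float = 0.1, steps: int = 5,
        theta: float = 10.0) -> float:
    # Assumed setup: clients 1-2 share one optimum and clients 3-4 another.
    optima = {1: -1.0, 2: -1.0, 3: 3.0, 4: 3.0}
    for n in range(1, num_rounds + 1):
        # Odd-index clients participate at odd rounds, even-index ones at even rounds.
        active = [i for i in optima if i % 2 == n % 2]
        # Each of the two participating clients gets aggregation weight 1/2.
        update = sum(0.5 * (local_gd(theta, optima[i], lr, steps) - theta)
                     for i in active)
        theta += update
    return theta


final_theta = run(num_rounds=100)
```

Under these assumptions, every round sees one client of each data type, so the global model contracts toward the average of the two optima, which is the minimizer of the equally weighted federated loss, even though no single round involves all four clients.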
3.4 Relaxed Sufficient Conditions for Minimizing the Federated Problem (2)
Theorem 2 holds for any client’s update time and optimization scheme satisfying Algorithm 1, and provides finite convergence guarantees for the optimization problem (2). Corollary 1 shows that for the asymptotic convergence of FL, data distribution types should be treated fairly in expectation, i.e. . This sufficient condition is not necessarily realistic, since the server cannot know the clients data distributions and participation time, and thus needs to give to every client an aggregation weight such that without knowing .
In Example 1, we note that we have . Therefore, every client is given proper consideration every two optimization rounds. Based on Example 1, in Theorem 3 we provide weaker sufficient conditions than the ones of Corollary 1. To this end, we assume that clients are considered with identical importance across optimization rounds (Assumption 6) and that clients gradients are bounded (Assumption 7).
Assumption 6 (Window).
such that .
With Assumption 6, we assume that over a cycle of aggregations, the sum of the clients' expected aggregation weights is constant. By definition of , Assumption 6 is always satisfied with . Also, by construction, we have . We note that Assumption 6 is made on a series of windows of size and not for any window of size .
Assumption 7 (Bounded Gradients).
such that .
Gradient clipping is a typical operation performed during the optimization of deep learning models to prevent exploding gradients. A pre-determined gradient threshold is introduced, and gradients with norm exceeding this threshold are clipped to the norm . Therefore, using Assumption 7 and the subadditivity of the norm, the distance between two consecutive global models can be bounded by
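For readers unfamiliar with the operation, a minimal sketch of norm-based gradient clipping is given below. The helper name `clip_by_norm` and the example threshold are assumptions made for illustration; deep learning libraries provide their own equivalents.

```python
import math
from typing import List, Sequence


def clip_by_norm(grad: Sequence[float], threshold: float) -> List[float]:
    """Rescale `grad` so that its Euclidean norm does not exceed `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)  # small gradients are left untouched
    scale = threshold / norm
    return [g * scale for g in grad]


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)
```

After clipping, every gradient has norm at most the threshold, which is the kind of uniform bound that Assumption 7 postulates.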
which, thanks to the convexity of the clients' loss function and to the Cauchy-Schwarz inequality, gives
Theorem 3 shows that the condition is sufficient to minimize the optimization problem (2). In practice, for privacy concerns, clients may not want to share their data distribution with the server, and thus the relaxed sufficient condition becomes . This condition is weaker than the one obtained with Corollary 1, at the cost of a looser convergence bound including an additional asymptotic term linearly proportional to the window size . Therefore, for a given learning application, the maximum local work delay and the window size need to be considered when selecting an FL optimization scheme satisfying Algorithm 1. Also, the server needs to properly allocate clients' aggregation weights such that Assumption 6 is satisfied while keeping the window size at a minimum. We note that depends on the considered FL optimization scheme and clients' hardware capabilities. Based on the results of Theorem 3, in the following section, we introduce FedFix, a novel asynchronous FL setting based on a waiting policy over fixed time windows .
Finally, the following example illustrates a practical application of the condition .
The conditions of Corollary 1 and Theorem 3 are equivalent when , where we retrieve . They are also equivalent when clients have the same data distributions, and we retrieve at every optimization round, which also implies that .
The convergence guarantee proposed in Theorem 3 depends on the window size , and on the maximum number of optimization rounds needed for a client to update its work . We provide sufficient conditions in Corollary 2 for the parameters , and , such that an optimization scheme satisfying Algorithm 1 converges to the optimum of the optimization problem (2).
Let us assume there exists and such that , , and . The convergence bound of Theorem 3 asymptotically converges to 0 if
The bound of Theorem 3 converges to 0 if the following quantities also do: , , , . We get the following conditions on , , and : , , , , which completes the proof. ∎
By construction and definition of , Assumption 6 is always satisfied with . However, Corollary 2 shows that when , no learning rate can be chosen such that the learning process converges to . Also, Corollary 2 shows that Assumption 4 can be relaxed. Indeed, Assumption 4 implies that and Corollary 2 shows that is sufficient. We show in Section 4 that all the considered optimization schemes satisfy and , and also depend on the clients' hardware capabilities and the number of participating clients .
In this section, we show that the formalism of Section 2 can be applied to a wide range of optimization schemes, demonstrating the validity of the conclusions of Corollary 1 and Theorem 3 (Section 3). When clients have identical data distributions, the sufficient conditions of Corollary 1 are always satisfied (Section 3). In the heterogeneous case, these conditions can also (theoretically) be satisfied. It suffices that every client has a non-null participation probability, i.e. such that there exists an appropriate satisfying . Yet, in practice, clients generally may not even know their update time distribution , making the computation of intractable. In what follows, we thus focus on Theorem 3 to obtain the closed form of , which only requires the server to know the clients' time .
Theorem 3 provides a closed form for the convergence bound of an optimization scheme as a function of the number of server aggregation rounds . We first introduce in Section 4.1 our considerations for the clients' hardware and data to instead express as a function of the training time . The quantity also depends on the optimization scheme time policy through , and , and on the clients' data heterogeneity through and . We provide their closed form for synchronous FedAvg (Section 4.2), asynchronous FedAvg (Section 4.3), and FedFix (Section 4.4), a novel asynchronous optimization scheme motivated by Section 3.4. Finally, in Section 4.5, we show that the conclusions drawn for synchronous/asynchronous FedAvg and FedFix can also be extended to other distributed optimization schemes with delayed gradients. Similar bounds can be derived for centralized learning and client sampling, which we defer to Appendix C to focus on asynchronous FL in this section.
4.1 Heterogeneity of clients hardware and data distributions
Clients importance. We restrict our investigation to the case where clients have identical aggregation weights during the learning process, i.e. . We also consider identical client importance . We can therefore define the averaged optimum residual defined as the average of the clients SGD evaluated on the global optimum, i.e.
When clients have identical data distributions, can be simplified as , and when clients perform GD. We note that in the DL and FL literature is often simplified by assuming bounded variance of the stochastic gradients, i.e. , where is the bounded variance of any client SG.
Clients' computation time. In the rest of this work, we consider that clients guarantee reliable computation and communication, although with heterogeneous hardware capabilities, i.e. . Without loss of generality, we assume that clients are ordered by increasing , i.e. , where the unit of is such that is an integer. In what follows, we provide the closed form of for all the considered optimization schemes. This derivation still holds for applications where clients have unreliable hardware capabilities that can be modeled as an exponential distribution, i.e. which gives .
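To illustrate the unreliable-hardware case, the sketch below assumes that a client's update time follows an exponential distribution with a given mean and computes the probability that its update arrives within a server waiting window, both in closed form and by Monte Carlo sampling. The parameter values and function names are hypothetical, and the sketch is not meant to reproduce the closed forms derived in this section.

```python
import math
import random


def participation_probability(mean_time: float, window: float) -> float:
    """P(T <= window) for an exponential update time T with the given mean."""
    return 1.0 - math.exp(-window / mean_time)


def estimate_participation(mean_time: float, window: float,
                           num_samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(rng.expovariate(1.0 / mean_time) <= window
               for _ in range(num_samples))
    return hits / num_samples


exact = participation_probability(mean_time=2.0, window=1.0)
approx = estimate_participation(mean_time=2.0, window=1.0)
```

A shorter window or a slower client (larger mean update time) lowers the participation probability, which is the trade-off between round duration and client participation discussed for the gradient delay terms.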
Clients data distributions. Unless stated otherwise, we will consider the FL setting where each client has its unique data distribution. Therefore, clients have heterogeneous hardware and non-iid data distributions. The obtained results can be simplified for the DL setting where a dataset is made available to processors. In this special case, clients have iid data distributions () , and identical computation times (