On Convergence of FedProx: Local Dissimilarity Invariant Bounds, Non-smoothness and Beyond

06/10/2022
by   Xiao-Tong Yuan, et al.
0

The FedProx algorithm is a simple yet powerful distributed proximal point optimization method widely used for federated learning (FL) over heterogeneous data. Despite its popularity and remarkable success witnessed in practice, the theoretical understanding of FedProx is largely underinvestigated: the appealing convergence behavior of FedProx is so far characterized under certain non-standard and unrealistic dissimilarity assumptions of local functions, and the results are limited to smooth optimization problems. In order to remedy these deficiencies, we develop a novel local dissimilarity invariant convergence theory for FedProx and its minibatch stochastic extension through the lens of algorithmic stability. As a result, we contribute to derive several new and deeper insights into FedProx for non-convex federated optimization including: 1) convergence guarantees independent on local dissimilarity type conditions; 2) convergence guarantees for non-smooth FL problems; and 3) linear speedup with respect to size of minibatch and number of sampled devices. Our theory for the first time reveals that local dissimilarity and smoothness are not must-have for FedProx to get favorable complexity bounds. Preliminary experimental results on a series of benchmark FL datasets are reported to demonstrate the benefit of minibatching for improving the sample efficiency of FedProx.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

07/11/2020

Federated Learning's Blessing: FedAvg has Linear Speedup

Federated learning (FL) learns a model jointly from a set of participati...
01/27/2021

Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning

Federated learning (FL) is a distributed machine learning architecture t...
01/27/2022

On the Convergence of Heterogeneous Federated Learning with Arbitrary Adaptive Online Model Pruning

One of the biggest challenges in Federated Learning (FL) is that client ...
02/12/2021

Stragglers Are Not Disaster: A Hybrid Federated Learning Algorithm with Delayed Gradients

Federated learning (FL) is a new machine learning framework which trains...
01/19/2022

Communication-Efficient Device Scheduling for Federated Learning Using Stochastic Optimization

Federated learning (FL) is a useful tool in distributed machine learning...
07/21/2022

FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data

Federated learning (FL) provides an effective paradigm to train machine ...
07/17/2022

Fast Composite Optimization and Statistical Recovery in Federated Learning

As a prevalent distributed learning paradigm, Federated Learning (FL) tr...

1 Introduction

Federated Learning (FL) has recently emerged as a promising paradigm for communication-efficient distributed learning on remote devices, such as smartphones, internet of things, or agents (Konečnỳ et al., 2016; Yang et al., 2019). The goal of FL is to collaboratively train a shared model that works favorably for all the local data but without requiring the learners to transmit raw data across the network. The principle of optimizing a global model while keeping data localized can be beneficial for both computational efficiency and data privacy (Bhowmick et al., 2018). While resembling the classic distributed learning regimes, there are two most distinct features associated with FL: 1) large statistical heterogeneity of local data mainly due to the non-iid manner of data generalization and collection across the devices (Hard et al., 2020); and 2) partial participants of devices in the network mainly due to the massive number of devices. These fundamental challenges make FL highly demanding to tackle, both in terms of optimization algorithm design and in terms of theoretical understanding of convergence behavior (Li et al., 2020a).

FL is most conventionally formulated as the following problem of global population risk minimization averaged over a set of devices:

(1.1)

where is the local population risk on device ,

is a non-negative loss function whose value

measures the loss over a random data point with parameter , represents an underlying random data distribution over . Since the data distribution is typically unknown, the following empirical risk minimization (ERM) version of (1.1) is often considered alternatively:

(1.2)

where is the local empirical risk over the training sample on device . The sample size may vary significantly across devices, which can be regarded as another source of data heterogeneity. Federated optimization algorithms for solving (1.1) or (1.2) have attracted significant research interest from both academia and industry, with a rich body of efficient solutions developed that can flexibly adapt to the communication-computation tradeoffs and data/system heterogeneity. Several popularly used FL algorithms for this setting include FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020), and FedPD (Zhang et al., 2020)

, to name a few. A consensus among these methods on communication-efficient implementation is trying to extensively update the local models (e.g., with plenty epochs of local optimization) over subsets of devices so as to quickly find an optimal global model using a minimal number of inter-device communication rounds for model aggregation.

In this paper, we revisit the FedProx algorithm which is one of the most prominent frameworks for heterogeneous federated optimization. Reasons for the interests of FedProx include implementation simplicity, low communication cost, promise in dealing with data heterogeneity and tolerance to partial participants of devices (Li et al., 2020b). We analyze its convergence behavior, expose problems, and propose alternatives more suitable for scaling up and generalization. We contribute to derive some new and deeper theoretical insights into the algorithm from a novel perspective of algorithmic stability theory.

1.1 Review of FedProx

For solving FL problems in the presence of data heterogeneity, methods such as FedAvg

 based on local stochastic gradient descent (SGD) can fail to converge in practice when the selected devices perform too many local updates 

(Li et al., 2020b). To mitigate this issue, FedProx (Li et al., 2020b) was recently proposed for solving the empirical FL problem (1.2) using the (inexact) proximal point update for local optimization. The benefits of FedProx include: 1) it provides more stable local updates by explicitly enforcing the local optimization in the vicinity of the global model to date; 2) the method comes with convergence guarantees for both convex and non-convex functions, even under partial participation and very dissimilar amounts of local updates (Li et al., 2020a). More specifically, at each time instance , FedProx uniformly randomly selects a subset of devices and introduces for each device the following proximal point ERM sub-problem for local update around the previous global model :

(1.3)

where is the learning rate that controls the impact of the proximal term. Then the global model is updated by uniformly aggregating those local updates from as

In the extreme case of allowing in (1.3), FedProx reduces to the regime of FedAvg if using SGD for local optimization. Since its inception, FedProx and its variants have received significant interests in research (Li et al., 2019; Nguyen et al., 2020; Pathak and Wainwright, 2020) and become an algorithm of choice in application areas such as automatous driving (Donevski et al., 2021)

and computer vision 

(He et al., 2021). Theoretically, FedProx comes with convergence guarantees under the following bounded local gradient dissimilarity assumption that captures the statistical heterogeneity of local objectives across the network:

Definition 1 (-Lgd).

We say the local functions have -local gradient dissimilarity (LGD) if the following holds for all :

The definition naturally extends to the local empirical risks .

Specially in the homogenous setting where , , we have and . Under -LGD and some regularization condition on the modulus , it was shown that FedProx for non-convex problems requires rounds of inter-device communication to reach an -stationary solution, i.e.,  (Li et al., 2020b). Similar guarantees have also been established for a variant of FedProx with non-uniform model aggregation schemes (Nguyen et al., 2020).

Open issues and motivation. In spite of the remarkable success achieved by FedProx and its variants, there are still a number of important theoretical issues regarding the unrealistic assumptions, restrictive problem regimes and expensive local oracle cost that remain open for exploration, as specified below.

  • Local dissimilarity. The appealing convergence behavior of FedProx is so far characterized under a key but non-standard -LGD (cf. Definition 1) condition with and . Such a condition is obviously unrealistic in practice: it essentially requires the local objectives share the same stationary point as the global objective since implies for all . However, if the optima of are exactly (or even approximately) the same, there would be little point in distributing data across devices for federated learning. It is thus desirable to understand the convergence behavior of FedProx for heterogeneous FL without imposing stringent local dissimilarity conditions like -LGD with .

  • Non-smooth optimization. The existing convergence guarantees of FedProx

     are only available for FL with smooth losses. More often than not, however, FL applications involve non-smooth objectives due to the popularity of non-smooth losses (e.g., hinge loss and absolute loss) in machine learning, and training deep neural networks with non-smooth activation like ReLU.

    Therefore, it is desirable to understand the convergence behavior of FedProx in non-smooth problem regimes.

  • Local oracle complexity. Unlike the (stochastic) first-order oracles such as SGD used by FedAvg, the proximal point oracle (1.3) for local update is by itself a full-batch ERM problem which tends to be expensive to solve even approximately per-iteration. Plus, due to the potentially imbalanced data distribution over devices, the computational overload of the proximal point oracle could vary significantly across the network. Therefore, it is important to investigate whether using the stochastic approximation to the proximal point oracle (1.3) can provably improve the computational efficiency of FedProx.

Last but not least, existing convergence analysis of FedProx mainly focuses on the empirical FL problem (1.2). The optimality in terms of the population FL problem (1.1) is not yet clear for FedProx. The primary goal of this work is to remedy these theoretical issues simultaneously, so as to lay a more solid theoretical foundation for the popularly applied FedProx algorithm.

1.2 Our Contributions

In this paper, we make progress towards understanding the convergence behavior of FedProx for non-convex heterogenous FL under weaker and more realistic conditions. The main results are a set of local dissimilarity invariant bounds for smooth or non-smooth problems.

Main results for the vanilla FedProx. As a starting point to address the restrictiveness of local dissimilarity assumption, we provide a novel convergence analysis for the vanilla FedProx algorithm independent of local dissimilarity type conditions. For smooth and non-convex optimization problems, our result in Theorem 1 shows that the rate of convergence to a stationary point is upper bounded by

(1.4)

where is the number devices randomly selected for local update at each iteration. If all the devices participate in the local updates for every round, i.e. , the rate of convergence can be improved to . For , the rate in (1.4) is dominated by which gives the communication complexity to achieve an -stationary solution. On the other hand when , the rate is dominated by which gives the communication complexity . Compared to the already known complexity bound of FedProx under the unrealistic -LGD condition (Li et al., 2020b), our rate in (1.4) is slower but it holds without needing to impose stringent regularity conditions on the dissimilarity of local functions, and it reveals the effect of device sampling for accelerating convergence. Further for non-smooth and non-convex problems, we establish in Theorem 2 the following rate of convergence

(1.5)

which is invariant to the number of selected devices in each round. In the case of , the bounds in (1.4) and (1.5) are comparable, which indicates that smoothness is not must-have for FedProx to get sharper convergence bound especially with low participation ratio. On the other end when , the bound (1.5) for non-smooth problems is slower than the bound (1.4) for smooth functions in large-scale networks.

Main results for minibatch stochastic FedProx. Then as the chief contribution of the present work, we propose a minibatch stochastic extension of FedProx along with its population optimization performance analysis from a novel perspective of algorithmic stability theory. Inspired by the recent success of minibatch stochastic proximal point methods (MSPP) (Li et al., 2014; Wang et al., 2017; Asi et al., 2020; Deng and Gao, 2021), we propose to implement FedProx using MSPP as the local update oracle. The resulting method, which is referred to as FedMSPP, is expected to attain improved trade-off between computation, communication and memory efficiency for large-scale FL. In the case of imbalanced data distribution, minibatching is also beneficial for making the local computation more balanced across the devices. Based on some extended uniform stability arguments for gradients, we show in Theorem 3 the following local dissimilarity invariant rate of convergence for FedMSPP in terms of population optimality:

(1.6)

where is the minibatch size of local update. For empirical FL, identical bound holds under sampling according to empirical distribution. For , the rate in (1.6) is dominated by which gives the communication complexity , and it matches that of the vanilla FedProx. For sufficiently large , the rate is dominated by which gives the communication complexity . This shows that local minibatching and device sampling are both beneficial for linearly speeding up communication. Further, when applied to non-smooth problems, we can similarly show that FedMSPP converges at the rate of

which is comparable to that of (1.6) when and , but without showing the effect of linear speedup with respect to and .

Method Work Commun. Complex. LD Independ. NS PP
FedProx Li et al. (2020b)
Theorem 1 (ours)
Theorem 2 (ours)
FedMSPP Theorem 3 (ours)
Theorem 4 (ours)
FedAvg  Karimireddy et al. (2020)
Yu et al. (2019)
 Khanduri et al. (2021)
SCAFFOLD  Karimireddy et al. (2020)
FedPD  Zhang et al. (2020)
STEM  Khanduri et al. (2021)
FCO  Yuan et al. (2021)
(convex composite)
Table 1: Comparison of heterogeneous FL algorithms in terms of communication complexity bounds for reaching an -stationary solution, independence of local dissimilarity (LD), applicability to non-smooth (NS) functions and tolerance to partial participation (PP). Except for FCO, all the results listed are for non-convex functions. The involved of quantities are : total number of devices; : number of chosen devices for partial participation; : minibatch size for local stochastic optimization.

Comparison with prior results. In Table 1, we summarize our communication complexity bounds for FedProx (FedMSPP) and compare them with several related heterogeneous FL algorithms in terms of the dependency on local dissimilarity, applicability to non-smooth problems and tolerance to partial participation. A few observations are in order. First, regarding the requirement of local dissimilarity, all of our bounds are independent of local dissimilarity conditions, and they are comparable to those of SCAFFOLD and FCO (for convex problems) which are also invariant to local dissimilarity. Second, with regard to the applicability to non-smooth optimization, our convergence guarantees in Theorem 2 and Theorem 4 are established for non-smooth and weakly convex functions. While FCO is the only one in the other considered algorithms that can be applied to non-smooth problems, it is customized for federated convex composite optimization with potentially non-smooth regularizers (Yuan et al., 2021). Third, in terms of tolerance to partial participation, all of our results are robust to device sampling, and the bound in Theorem 3 for FedMSPP is comparable to the best known results under partial participation as achieved by FedAvg and SCAFFOLD. If assuming that all the devices participate in local update for each communication round and under certain local dissimilarity conditions, substantially faster bounds are possible for STEM and FedPD, while the bounds can be achieved by FedAvg (Khanduri et al., 2021). To summarize the comparison, our local dissimilarity invariant convergence bounds for FedProx (FedMSPP) are comparable to the best-known rates in the identical setting, while covering the generic non-smooth and non-convex cases which to our knowledge so far has not been possible for other FL algorithms.

Highlight of theoretical contributions:

  • From the perspective of algorithmic stability theory, we provide a set of novel local dissimilarity invariant convergence guarantees for the widely used FedProx algorithm for non-convex heterogeneous FL, with smooth or non-smooth local functions. Our theory for the first time reveals that local dissimilarity and smoothness are not necessary to guarantee the convergence of FedProx with reasonable rates.

  • We present FedMSPP as a minibatch stochastic extension of FedProx and analyze its convergence behavior in terms of population optimality, again without assuming any type of local dissimilarity conditions. The main result provably shows that FedMSPP converges favorably for both smooth and non-smooth objectives, while enjoying linear speedup in terms of minibatching size and partial participation ratio for smooth problems.

Finally, while the main contribution of this work is essentially theoretical, we have also carried out a preliminary numerical study on several benchmark FL datasets to corroborate our theoretical findings about the improved sample efficiency of FedMSPP.

Paper organization. In Section 2 we present our local dissimilarity invariant convergence analysis for the vanilla FedProx with smooth or non-smooth loss functions. In Section 3 we propose FedMSPP as a minibatch stochastic extension of FedProx and analyze its convergence behavior through the lens of algorithmic stability theory. In Section 4, we present some additional related work on the topics covered by this paper. In Section 5, we present a preliminary experimental study on the convergence behavior of FedMSPP. The concluding remarks are made in Section 6. The technical proofs are relegated to the appendix sections.

2 Convergence of FedProx

We begin by providing an improved analysis for the vanilla FedProx independent of the local dissimilarity type conditions. We first introduce notations that will be used in the analysis to follow.

Notations. Throughout the paper, we use to denote the set , to denote the Euclidean norm and to denote the Euclidean inner product. We say a function is -Lipschitz continuous if for all , and it is -smooth if for all . Moreover, we say is -weakly convex if for any ,

where represents a subgradient of evaluated at . We denote by

the -Moreau-envelope of , and by

the proximal mapping associated with . We also need to access the following definition of inexact local update oracle for FedProx.

Definition 2 (Local inexact oracle of FedProx).

Suppose that the local proximal point regularized objective (cf. (1.3)) admits a global minimizer. For each time instance , we say that the local update oracle of FedProx is -inexactly solved with sub-optimality if

We conventionally assume that the objective value gap is bounded.

2.1 Results for Smooth Problems

The following theorem is our main result on the convergence rate of FedProx for smooth and non-convex federated optimization problems.

Theorem 1.

Assume that for each , the loss function is -Lipschitz and -smooth with respect to its first argument. Set and . Suppose that the local update oracle of FedProx is -inexactly solved with . Let be an index uniformly randomly chosen in . Then,

Proof.

A proof of this result is deferred to Appendix B.1. ∎

A few remarks are in order.

Remark 1.

Compared to the bound from Li et al. (2020b), our rate established in Theorem 1 is slower but it is valid without assuming the unrealistic -LGD conditions and imposing strong regularization conditions on  (see, e.g., Li et al., 2020b, Remark 5). Moreover, the dominant term in our bound reveals the benefit of device sampling for linear speedup which is not clear in the original analysis of Li et al. (2020b).

Remark 2.

In the extreme case of full device participation, i.e., , the terms related to in Theorem 1 can be removed and thus the convergence rate becomes under . In this same setting, we comment that the rate can also be improved to using our proof augments if -LGD is additionally assumed.

Remark 3.

The -Lipschitz-loss assumption in Theorem 1 can be alternatively replaced by the bounded gradient condition as commonly used in the analysis of FL algorithms (Li et al., 2020b; Zhang et al., 2020). Despite that our analysis does not explicitly access to any local dissimilarity conditions, the assumed -Lipschitz (or bounded gradient) condition actually implies that the local objective gradients are not too dissimilar, which shares a close spirit to the typically assumed -LGD condition (Karimireddy et al., 2020)

and inter-client-variance condition 

(Khanduri et al., 2021). It is noteworthy that these mentioned client heterogeneity conditions are substantially milder than the -LGD condition as required in the original analysis of FedProx.

2.2 Results for Non-smooth Problems

Now we turn to study the convergence of FedProx for weakly convex but not necessarily smooth problems. For the sake of presentation clarity, we work on the exact FedProx in which the local update oracle is assumed to be exactly solved, i.e. . Extension to the inexact case is more or less straightforward, though with somewhat more involved perturbation treatment. We assume that the objective value gap associated with -Moreau-envelope of is bounded. The following is our main result on the convergence of FedProx for non-smooth and weakly convex problems.

Theorem 2.

Assume that for each , the loss function is -Lipschitz and -weakly convex with respect to its first argument. Set for arbitrary . Suppose that the local update oracle of FedProx is exactly solved with . Let be an index uniformly randomly chosen in . Then it holds that

Proof.

The proof technique is inspired by the arguments from Davis and Drusvyatskiy (2019) developed for analyzing stochastic model-based algorithms, with several new elements along developed for handling the challenges introduced by the model averaging and partial participation mechanisms associated with FedProx. A particular crux here is that due to the random subset model aggregation of , the local function values are no longer independent to each other though is uniformly random. As a consequence, is not

an unbiased estimation of

. To overcome this technical obstacle, we make use of a key observation that will be almost surely close enough to if the learning rate is small enough (which is the case in our choice of ), and thus we can replace the former with the latter whenever beneficial but without introducing too much approximation error. A full proof of this result can be found in Appendix B.2. ∎

A few comments are in order.

Remark 4.

To our best knowledge, Theorem 2 is the first convergence guarantee for FL algorithms applicable to generic non-smooth and weakly convex problems. This is in sharp contrast with FCO (Yuan et al., 2021) which focuses on composite convex and non-smooth problems such as

-estimation, or

Fed-HT (Tong et al., 2020) which is specially customized for cardinality-constrained sparse learning problems where the non-convexity essentially arises from the -constraint.

Remark 5.

Let us consider , the proximal mapping of associated with . In view of a feature of Moreau envelope to characterize stationarity (Davis and Drusvyatskiy, 2019), if has small gradient norm , then must be a near-stationary solution and stays in the proximity of due to the identity . Therefore, the bound in Theorem 2 suggests that in expectation converges to a stationary solution and converges to , both at the rate of .

Remark 6.

We comment that the bound in Theorem 2 is not dependent on , the number of selected devices. For and sufficiently large , the bounds Theorem 1 and Theorem 2 are comparable to each other, which demonstrates that the smoothness is not must-have for FedProx to get sharper convergence bound with small device sampling rate. However, in the near-full participation setting where , the bound in Theorem 2 for non-smooth problems will be slower when is large. Extremely when , the bound is substantially inferior to the smooth case which has improved rate of as discussed in Remark 2.

3 Convergence of FedProx with Stochastic Minibatching

When it comes to the implementation of FedProx, a notable challenge is that the local proximal point update oracle (1.3) is by itself a full-batch ERM problem which would be expensive to solve even approximately in large-scale settings. Moreover, in the settings where the data distribution over devices is highly imbalanced, the computational overload of local update could vary significantly across the network, which impairs communication efficiency. It is thus desirable to seek stochastic approximation schemes for hopefully improving the local oracle update efficiency and overload balance of FedProx. To this end, inspired by the recent success of minibatch stochastic proximal point methods (MSPP) (Asi et al., 2020; Deng and Gao, 2021), we propose to implement FedProx using MSPP as the local stochastic optimization oracle. More precisely, let be a minibatch of i.i.d. samples drawn from the distribution at device and time instance . We denote

(3.1)

as the local minibatch empirical risk function over . The only modification we propose to make here is to replace the empirical risk in the original update form (1.3) with its minibatch counterpart . The resultant FL framework, which we refer to as FedMSPP (Federated MSPP), is outlined in Algorithm 1. Clearly, the vanilla FedProx is a special case of FedMSPP when applied to the federated ERM form (1.2) with full data batch .

Input : Minibatch size ; learning rates .
Output : .
Initialization Set

, e.g., typically as a zero vector.

for  do
       /* Device selection and model broadcast on the server */
       Server uniformly randomly selects a subset of devices and sends to all the selected devices; /* Local model updates on the selected devices */
       for  in parallel do
             Device samples a minibatch . Device inexactly updates the its local model as
(3.2)
where is given by (3.1). Device sends back to server.
       end for
      /* Model aggregation on the server */
       Sever aggregates the local models received from to update the global model as .
end for
Algorithm 1 FedMSPP: Federated Minibatch Stochastic Proximal Point

3.1 Results for Smooth Problems

We first analyze the convergence rate of FedMSPP for smooth and non-convex problems using the tools borrowed from algorithmic stability theory. Analogous to the Definition 2, we introduce the following definition of inexact local update oracle for FedMSPP.

Definition 3 (Local inexact oracle of FedMSPP).

Suppose that the local proximal point regularized objective (cf. (3.2)) admits a global minimizer. For each time instance , we say that the local update oracle of FedMSPP is -inexactly solved with sub-optimality if

We also assume that the population value gap is bounded. The following theorem is our main result on FedMSPP for smooth and non-convex FL problems.

Theorem 3.

Assume that for each , the loss function is -Lipschitz and -smooth with respect to its first argument. Set and . Suppose that the local update oracle of FedMSPP is -inexactly solved with . Then we have

Proof.

Let us consider which is roughly the local update direction on device , in the sense that given that the local update oracle is solved to sufficient accuracy. As a key ingredient of our proof, we show via some extended uniform stability arguments in terms of gradients (see Lemma 3) that the averaged directions aligns well with the global gradient in expectation (see Lemma 11). Therefore, in average it roughly holds that , which suggests that is updated roughly along the direction of global gradient descent and thus guarantees quick convergence. Based on this novel analysis, we are free of imposing any kind of local dissimilarity conditions on local objectives. See Appendix C.1 for a full proof of this result. ∎

Remark 7.

For , the bound in Theorem 3 is dominated by which gives the communication complexity . This shows that FedMSPP enjoys linear speedup both in the size of local minibatching and in the size of device sampling.

Remark 8.

While the bound in Theorem 3 is derived for the population form of FL in (1.1), identical bound naturally holds for the empirical form (1.2) under minibatch sampling according to local data empirical distribution.

3.2 Results for Non-smooth Problems

Analogues to FedProx , we can further show that FedMSPP converges reasonably well when applied to weakly convex and non-smooth problems. We assume that the objective value gap associated with -Moreau-envelope of is bounded. The following is our main result in this line.

Theorem 4.

Assume that for each , the loss function is -Lipschitz and -weakly convex with respect to its first argument. Set for arbitrary . Suppose that the local update oracle of FedMSPP is exactly solved with . Let be an index uniformly randomly chosen in . Then it holds that

Proof.

The proof argument is a slight adaptation of that of Theorem 2 to the population FL setup (1.1) with FedMSPP. For the sake of completeness, a full proof is reproduced in Appendix C.2. ∎

We comment in passing that the discussions made in Remarks 4-6 immediately extend to Theorem 4.

4 Additional Related Work

Heterogenous federated learning. The presence of device heterogeneity features a key distinction between FL and classic distributed learning. The most commonly used FL method is FedAvg (McMahan et al., 2017), where the local update oracle is formed as multi-epoch SGD. FedAvg was early analyzed for identical functions (Stich, 2019; Stich and Karimireddy, 2020) under the name of local SGD. In heterogeneous setting, numerous recent studies have focused on the analysis of FedAvg and other variants under various notions of local dissimilarity (Chen et al., 2020; Khaled et al., 2020; Li et al., 2020c; Woodworth et al., 2020; Reddi et al., 2021; Li et al., 2022). As another representative FL method, FedProx (Li et al., 2020b) has recently been proposed to apply averaged proximal point updates to solve heterogeneous federated minimization problems. The theoretical guarantees of FedProx have been established for both convex and non-convex problems, but under a fairly stringent assumption of gradient similarity to measure data heterogeneity (Li et al., 2020b; Nguyen et al., 2020; Pathak and Wainwright, 2020). This assumption was relaxed by FedPD (Zhang et al., 2020) inside a meta-framework of primal-dual optimization. The SCAFFOLD (Karimireddy et al., 2020) and VRL-SGD (Liang et al., 2019) are two algorithms that utilize variance reduction techniques to correct the local update directions, achieving convergence guarantees independent of the data heterogeneity. For composite non-smooth FL problems, the FCO proposed in Yuan et al. (2021) employs a server dual averaging procedure to circumvent the curse of primal averaging suffered by FedAvg. In sharp contrast to these prior works which either require local dissimilarity conditions, or require full device participation, or only applicable to smooth problems, we show through a novel analysis based on algorithmic stability theory that the well-known FedProx can actually overcome these shortcomings simultaneously.

Minibatch stochastic proximal point methods. The proposed FedMSPP algorithm is a variant of FedProx that simply replaces the local proximal point oracle with MSPP, which in each iteration updates the local model via (approximately) solving a proximal point estimator over a stochastic minibatch. The MSPP-type methods have been shown to attain a substantially improved iteration stability and adaptivity for large-scale machine learning, especially in non-smooth optimization settings (Li et al., 2014; Wang et al., 2017; Asi and Duchi, 2019; Deng and Gao, 2021). However, it is not yet known if FedProx or FedMSPP can achieve similar strong guarantees for non-smooth heterogenous FL problems.

Algorithmic stability. Our analysis for FedMSPP builds largely upon the classic algorithmic stability theory. Since the fundamental work of Bousquet and Elisseeff (2002), algorithmic stability has been serving as a powerful proxy for establishing strong generalization bounds (Zhang, 2003; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010). Particularly, the state-of-the-art risk bounds of strongly convex ERM are offered by approaches based on the notion of uniform stability (Feldman and Vondrák, 2018, 2019; Bousquet et al., 2020; Klochkov and Zhivotovskiy, 2021). It was shown by Hardt et al. (2016) that the solution obtained via SGD is expected to be unformly stable for smooth convex or non-convex loss functions. For non-smooth convex losses, the stability induced generalization bounds have been established for SGD (Bassily et al., 2020; Lei and Ying, 2020). Through the lens of algorithmic stability theory, convergence rates of MSPP have been studied for non-smooth and convex (Wang et al., 2017) or weakly convex (Deng and Gao, 2021) losses.

5 Experimental Results

In this section, we carry out a preliminary experimental study to demonstrate the speed-up behavior of FedMSPP under varying minibatch sizes for achieving comparable test performances to FedProx. We also conventionally use FedAvg as a baseline algorithm for comparison.

5.1 Data and Models

Dataset Model # Devices # Samples (Training)
MNIST 2-layer CNN 100
FEMNIST 2-layer CNN 50
Sent140 2-layer LSTM 261
Table 2: Statistics of data and models used in the experiments.

We compare the considered algorithms over the following three benchmark datasets popularly used for evaluating heterogenous FL approaches:

  • The MNIST (LeCun et al., 1998)

    dataset of handwritten digits 0-9 is used for digit image classification with a two layer convolutional neural network (CNN). The model takes as input the images of size

    , and first performs a 2-layer ({1, 32, max-pooling}, {32, 64, max-pooling}) convolution followed by a fully connected (FC) layer. We use 63,000 images in which

    are for training and the rest for test. The data are distributed over 100 devices such that each device has samples of only 2 digits.

  • The FEMNIST (Li et al., 2020b) dataset is a subset of the 62-class EMNIST (Cohen et al., 2017) database constructed by sub-sampling 10 lower case characters (’a’-’j’). We study the performances of the considered algorithms for character image classification using the same two layer CNN as used for MNIST, which takes as input the images of size . We use 55,050 images in which are for training and the rest for test. The data are distributed over 50 devices, each of which has samples of 3 characters.

  • The Sent140 (Go et al., 2009)

    dataset of text sentiment analysis on tweets is used for evaluating the considered algorithms for sentiment classification. The model we use is a two layer LSTM binary classifier containing 256 hidden units followed by a densely-connected layer. The input is a sequence of 25 characters represented by a 300-dimensional GloVe embedding 

    (Pennington et al., 2014) and the output is one character per training sample. We use for our experiment a total number of tweets from twitter accounts, each of which corresponds to a device. The training/test sample split is versus .

The statistics of the data and models in use are summarized in Table 2.

5.2 Implementation Details and Performance Metrics

We generally follow the instructions of Li et al. (2020b) for implementing FedProx, FedMSPP and FedAvg. More specifically, we use SGD as the local solver for FedProx, FedMSPP and FedAvg. For FedMSPP, we implement with three varying minibatch sizes on each dataset as shortly reported in the next subsection about results. The hyper-parameters used in our implementation, such as number of communication rounds and number of local SGD epochs, are listed in Table 3.

Hyper-parameter MNIST FEMNIST Sent140
#Communication rounds 200 300 300
#Local SGD epochs 2 5 10
Local SGD minibatch size 567 512 100
Local SGD learning rate 0.25 0.06 0.1
Strength of regularization 0.1 0.1 0.001
Table 3: Hyper-parameter settings.

Since the chief goal of this empirical study is to illustrate the benefit of FedMSPP for speeding up the convergence of FedProx, we use the numbers of data points and communication rounds needed for reaching the desired solution accuracy as performance metrics. The desired test accuracies are on MNIST, on FEMNIST, and on Sent140.

(a) MNIST: Numbers of data points needed to reach , and test accuracies. For FedMSPP, we test with different minibatch sizes 81, 63, 10.
(b) FEMNIST: Numbers of data points needed to reach , and test accuracies. For FedMSPP, we test with different minibatch sizes 128, 64, 16.
(c) Sent140: Numbers of data points needed to reach , and test accuracies. For FedMSPP, we test with different minibatch sizes 70, 50, 20.
Figure 1: Comparison of numbers of data points accessed by the considered algorithms to reach varying desired test accuracies.
(a) MNIST: Rounds of communication needed to reach , and test accuracies. For FedMSPP, we test with different minibatch sizes 81, 63, 10.
(b) FEMNIST: Rounds of communication needed to reach , and test accuracies. For FedMSPP, we test with different minibatch sizes 128, 64, 16.
(c) Sent140: Rounds of communication needed to reach , and test accuracies. For FedMSPP, we test with different minibatch sizes 70, 50, 20.
Figure 2: Comparison of rounds of communication needed by the considered algorithms to reach varying desired test accuracies.

5.3 Results

In Figure 1, we show the numbers of data samples accessed by the considered algorithms to reach comparable test accuracies. For FedMSPP, we test with minibatch sizes on MNIST, on FEMNIST, and on Sent140. From this set of results we can observe that:

  • On all the three datasets in use, FedMSPP with varying minibatch sizes consistently needs significantly fewer samples than FedProx and FedAvg to reach the desired test accuracies.

  • FedMSPP with smaller minibatch size tends to have better sample efficiency.

Figure 2 shows the corresponding rounds of communication needed to reach comparable test accuracies. From this group results we can see that in most cases, FedMSPP just needs slightly increased rounds of communication than FedProx and FedAvg to reach comparable generalization accuracy.

Overall, our numerical results confirm that FedMSPP can be served as a safe and computationally more efficient replacement to FedProx on the considered heterogenous FL tasks.

6 Conclusion

In this paper, we have exposed three shortcomings of the prior convergence analysis for FedProx in unrealistic local dissimilarity assumptions, inapplicability to non-smooth problems and expensive (and potentially imbalanced) computational cost of local update. In order to tackle these issues, we developed a novel convergence theory for the vanilla FedProx and its minibatch stochastic variant, FedMSPP, through the lens of algorithmic stability theory. At nutshell, our results reveal that with minimal modifications, FedProx is able to kill three birds with one stone: it enjoys favorable rates of convergence which are simultaneously invariant to local dissimilarity, applicable to smooth or non-smooth problems, and scaling linearly with respect to local minibatch size and device sampling ratio for smooth problems. To the best of our knowledge, the present work is the first theoretical contribution that achieves all these appealing properties in a single FL framework.

Acknowledgement

Xiao-Tong Yuan was funded in part by the National Key Research and Development Program of China under Grant No. 2018AAA0100400 and in part by Natural Science Foundation of China (NSFC) under Grant No.61876090, No.61936005 and No.U21B2049.

References

  • Asi and Duchi (2019) Hilal Asi and John C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM J. Optim., 29(3):2257–2290, 2019.
  • Asi et al. (2020) Hilal Asi, Karan N. Chadha, Gary Cheng, and John C. Duchi. Minibatch stochastic approximate proximal point methods. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
  • Bassily et al. (2020) Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
  • Bhowmick et al. (2018) Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
  • Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.
  • Bousquet et al. (2020) Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Proceedings of the Conference on Learning Theory (COLT), pages 610–626, Virtual Event [Graz, Austria], 2020.
  • Chen et al. (2020) Xiangyi Chen, Xiaoyun Li, and Ping Li. Toward communication efficient adaptive gradient method. In

    Proceedings of the ACM-IMS Foundations of Data Science Conference (FODS)

    , pages 119–128, Virtual Event, 2020.
  • Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926, Anchorage, AK, 2017.
  • Davis and Drusvyatskiy (2019) Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM J. Optim., 29(1):207–239, 2019.
  • Deng and Gao (2021) Qi Deng and Wenzhi Gao. Minibatch and momentum model-based methods for stochastic weakly convex optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 23115–23127, virtual, 2021.
  • Donevski et al. (2021) Igor Donevski, Jimmy Jessen Nielsen, and Petar Popovski. On addressing heterogeneity in federated learning for autonomous vehicles connected to a drone orchestrator. arXiv preprint arXiv:2108.02712, 2021.
  • Elisseeff et al. (2005) André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. J. Mach. Learn. Res., 6:55–79, 2005.
  • Feldman and Vondrák (2018) Vitaly Feldman and Jan Vondrák. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 9770–9780, Montréal, Canada, 2018.
  • Feldman and Vondrák (2019) Vitaly Feldman and Jan Vondrák.

    High probability generalization bounds for uniformly stable algorithms with nearly optimal rate.

    In Proceedings of the Conference on Learning Theory (COLT), pages 1270–1279, Phoenix, AZ, 2019.
  • Go et al. (2009) Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009, 2009.
  • Hard et al. (2020) Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio Lopez-Moreno, and Rajiv Mathews. Training keyword spotting models on non-iid data with federated learning. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech), pages 4343–4347, Virtual Event, Shanghai, China, 2020.
  • Hardt et al. (2016) Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33nd International Conference on Machine Learning (ICML), pages 1225–1234, New York City, NY, 2016.
  • He et al. (2021) Chaoyang He, Alay Dilipbhai Shah, Zhenheng Tang, Di Fan1Adarshan Naiynar Sivashunmugam, Keerti Bhogaraju, Mita Shimpi, Li Shen, Xiaowen Chu, Mahdi Soltanolkotabi, and Salman Avestimehr. FedCV: A federated learning framework for diverse computer vision tasks. arXiv preprint arXiv:2111.11066, 2021.
  • Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 5132–5143, Virtual Event, 2020.
  • Khaled et al. (2020) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In

    Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS)

    , pages 4519–4529, Online [Palermo, Sicily, Italy], 2020.
  • Khanduri et al. (2021) Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, and Pramod K. Varshney. STEM: A stochastic two-sided momentum algorithm achieving near-optimal sample and communication complexities for federated learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 6050–6061, virtual, 2021.
  • Klochkov and Zhivotovskiy (2021) Yegor Klochkov and Nikita Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate . In Advances in Neural Information Processing Systems (NeurIPS), pages 5065–5076, virtual, 2021.
  • Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
  • Lei and Ying (2020) Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 5809–5819, Virtual Event, 2020.
  • Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 661–670, New York, NY, 2014.
  • Li et al. (2019) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. FedDANE: A federated newton-type method. In Proceedings of the 53rd Asilomar Conference on Signals, Systems, and Computers (Asilomar), pages 1227–1231, 2019.
  • Li et al. (2020a) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag., 37(3):50–60, 2020a.
  • Li et al. (2020b) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems (MLSys), Austin, TX, 2020b.
  • Li et al. (2020c) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-iid data. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020c.
  • Li et al. (2022) Xiaoyun Li, Belhal Karimi, and Ping Li. On distributed adaptive optimization with gradient compression. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.
  • Liang et al. (2019) Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng. Variance reduced local sgd with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, Fort Lauderdale, FL, 2017.
  • Mukherjee et al. (2006) Sayan Mukherjee, Partha Niyogi, Tomaso A. Poggio, and Ryan M. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161–193, 2006.
  • Nguyen et al. (2020) Hung T Nguyen, Vikash Sehwag, Seyyedali Hosseinalipour, Christopher G Brinton, Mung Chiang, and H Vincent Poor. Fast-convergent federated learning. IEEE Journal on Selected Areas in Communications, 39(1):201–218, 2020.
  • Pathak and Wainwright (2020) Reese Pathak and Martin J. Wainwright. FedSplit: an algorithmic framework for fast federated optimization. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In

    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    , pages 1532–1543, Doha, Qatar, 2014.
  • Reddi et al. (2021) Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 2021.
  • Rivasplata et al. (2018) Omar Rivasplata, Csaba Szepesvári, John Shawe-Taylor, Emilio Parrado-Hernández, and Shiliang Sun. Pac-bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems (NeurIPS), pages 9234–9244, Montréal, Canada, 2018.
  • Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, 2010.
  • Stich (2019) Sebastian U. Stich. Local SGD converges fast and communicates little. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, 2019.
  • Stich and Karimireddy (2020) Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for sgd with delayed gradients and compressed updates. J. Mach. Learn. Res., 21:237:1–36, 2020.
  • Tong et al. (2020) Qianqian Tong, Guannan Liang, Tan Zhu, and Jinbo Bi. Federated nonconvex sparse learning. arXiv preprint arXiv:2101.00052, 2020.
  • Wang et al. (2017) Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Proceedings of the 30th Conference on Learning Theory (COLT), pages 1882–1919, Amsterdam, The Netherlands, 2017.
  • Woodworth et al. (2020) Blake E. Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local SGD for heterogeneous distributed learning. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
  • Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol., 10(2):12:1–12:19, 2019.
  • Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu.

    Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning.

    In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 5693–5700, Honolulu, HI, 2019.
  • Yuan et al. (2021) Honglin Yuan, Manzil Zaheer, and Sashank J. Reddi. Federated composite optimization. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 12253–12266, Virtual Event, 2021.
  • Zhang (2003) Tong Zhang. Leave-one-out bounds for kernel methods. Neural Comput., 15(6):1397–1437, 2003.
  • Zhang et al. (2020) Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. FedPD: A federated learning framework with optimal rates and adaptivity to non-iid data. arXiv preprint arXiv:2005.11418, 2020.

Appendix A Preliminaries

We present in this section some preliminary results on the classic algorithmic stability theory to be used in our analysis. Let us consider an algorithm that maps a training dataset to a model in a closed subset such that the following population risk function (with a slight abuse of notation) evaluated at the model is as small as possible:

The corresponding empirical risk is defined by

We denote by if a pair of datasets and differ in a single data point. The following concept of stability that serves as a powerful tool for analyzing the generalization bounds of learning algorithms [Elisseeff et al., 2005, Hardt et al., 2016, Bassily et al., 2020].

Definition 4 (Uniform Argument Stability).

Let be a learning algorithm that maps a dataset to a model . Then is said to have -uniform stability if for every ,

The following basic lemma is about the uniform argument stability of an inexact regularized empirical risk minimization (ERM) estimator. See Appendix D.1 for its proof.

Lemma 1.

Assume that the loss function is -Lipschitz with respect to its first argument. Suppose that the regularized objective is -strongly convex for any . Consider the inexact estimator that satisfies the following for some :

Then has uniform argument stability with parameter .

We further need to use the following variant of Efron-Stein inequality to random vector-valued functions [see, e.g., Lemma 6, Rivasplata et al., 2018].

Lemma 2 (Efron-Stein inequality for vector-valued functions).

Let

be a set of i.i.d. random variables valued in

. Suppose that the function valued in a Hilbert space is measurable and satisfies the bounded differences property, i.e., the following inequality holds for any and any :

Then it holds that

Based on the Efron-Stein inequality in Lemma 2, we can establish the following lemma which states the generalization bounds of a uniformly stable learning algorithm in terms of gradient. A proof of this result can be found in Appendix D.2.

Lemma 3.

Suppose that a learning algorithm has -uniform stability. Assume that the loss function is -Lipschitz and -smooth with respect to its first argument. Then the following bounds hold:

Appendix B Proofs for Section 2

b.1 Proof of Theorem 1

Let . We define the following quantities

(B.1)

The following elementary lemma is useful in our analysis.

Lemma 4.

Assume that for each , the loss function is -Lipschitz. Then it holds that

Proof.

By uniform without-replacement sampling strategy we have

Then it follows that

where we have used the fact , the independence among the indices in and the -Lipschitzness of losses. The desired bounds are proved. ∎

We also need the following lemma which quantifies the impact of local update precision to the gradient norm at the inexact solution.

Lemma 5.

Assume that for each , the loss function is -smooth with respect to its first argument. Suppose that the local update oracle of FedProx is -inexactly solved and . Then it holds that

Proof.

Recall . Since the loss functions are -smooth and , is strongly convex and thus admits a global minimizer. Then we have

where in the last inequality is due to Definition 2. This implies the desired bound. ∎

Lemma 6.

Assume that for each , the loss function is -Lipschitz and