1 Introduction
Federated Learning (FL) has recently emerged as a promising paradigm for communication-efficient distributed learning on remote devices such as smartphones, Internet-of-Things devices, or agents (Konečnỳ et al., 2016; Yang et al., 2019). The goal of FL is to collaboratively train a shared model that works favorably for all the local data, without requiring the learners to transmit raw data across the network. The principle of optimizing a global model while keeping data localized can benefit both computational efficiency and data privacy (Bhowmick et al., 2018). While resembling classic distributed learning regimes, FL has two distinctive features: 1) large statistical heterogeneity of local data, mainly due to the non-IID manner of data generation and collection across devices (Hard et al., 2020); and 2) partial participation of devices, mainly due to the massive number of devices in the network. These fundamental challenges make FL highly demanding to tackle, both in terms of optimization algorithm design and in terms of theoretical understanding of convergence behavior (Li et al., 2020a).
FL is conventionally formulated as the following problem of global population risk minimization averaged over a set of $N$ devices:

$$\min_{w \in \mathbb{R}^d} \; F(w) := \frac{1}{N} \sum_{i=1}^{N} F_i(w), \qquad (1.1)$$

where $F_i(w) := \mathbb{E}_{z \sim \mathcal{D}_i}\left[f(w; z)\right]$ is the local population risk on device $i$, $f$ is a nonnegative loss function whose value $f(w; z)$ measures the loss over a random data point $z$ with parameter $w$, and $\mathcal{D}_i$ represents an underlying random data distribution over device $i$. Since the data distributions are typically unknown, the following empirical risk minimization (ERM) version of (1.1) is often considered instead:

$$\min_{w \in \mathbb{R}^d} \; F_S(w) := \frac{1}{N} \sum_{i=1}^{N} F_{S_i}(w), \qquad (1.2)$$

where $F_{S_i}(w) := \frac{1}{n_i} \sum_{j=1}^{n_i} f(w; z_{i,j})$ is the local empirical risk over the training sample $S_i = \{z_{i,j}\}_{j=1}^{n_i}$ on device $i$. The sample size $n_i$ may vary significantly across devices, which can be regarded as another source of data heterogeneity. Federated optimization algorithms for solving (1.1) or (1.2) have attracted significant research interest from both academia and industry, with a rich body of efficient solutions developed that can flexibly adapt to communication-computation tradeoffs and data/system heterogeneity. Popularly used FL algorithms for this setting include FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020b), SCAFFOLD (Karimireddy et al., 2020), and FedPD (Zhang et al., 2020), to name a few. A consensus among these methods on communication-efficient implementation is to update the local models extensively (e.g., with many epochs of local optimization) over subsets of devices, so as to quickly find an optimal global model using a minimal number of inter-device communication rounds for model aggregation.
In this paper, we revisit the FedProx algorithm, one of the most prominent frameworks for heterogeneous federated optimization. Reasons for the interest in FedProx include its implementation simplicity, low communication cost, promise in dealing with data heterogeneity, and tolerance to partial device participation (Li et al., 2020b). We analyze its convergence behavior, expose problems, and propose alternatives more suitable for scaling up and generalization. In doing so, we derive new and deeper theoretical insights into the algorithm from a novel perspective of algorithmic stability theory.
1.1 Review of FedProx
For solving FL problems in the presence of data heterogeneity, methods such as FedAvg based on local stochastic gradient descent (SGD) can fail to converge in practice when the selected devices perform too many local updates (Li et al., 2020b). To mitigate this issue, FedProx (Li et al., 2020b) was proposed for solving the empirical FL problem (1.2) using an (inexact) proximal point update for local optimization. The benefits of FedProx include: 1) it provides more stable local updates by explicitly restricting the local optimization to the vicinity of the global model to date; and 2) it comes with convergence guarantees for both convex and nonconvex functions, even under partial participation and very dissimilar amounts of local updates (Li et al., 2020a). More specifically, at each time instance $t$, FedProx uniformly randomly selects a subset $\mathcal{S}_t$ of devices and introduces for each selected device $i$ the following proximal point ERM subproblem for local update around the previous global model $w^{t-1}$:

$$w_i^t \approx \arg\min_{w} \left\{ F_{S_i}(w) + \frac{1}{2\gamma} \left\|w - w^{t-1}\right\|^2 \right\}, \qquad (1.3)$$

where $\gamma > 0$ is the learning rate that controls the impact of the proximal term. Then the global model is updated by uniformly aggregating the local updates from the selected devices as $w^t = \frac{1}{|\mathcal{S}_t|} \sum_{i \in \mathcal{S}_t} w_i^t$.
In the extreme case of letting the learning rate in (1.3) grow unbounded, FedProx reduces to the regime of FedAvg if SGD is used for local optimization. Since its inception, FedProx and its variants have received significant research interest (Li et al., 2019; Nguyen et al., 2020; Pathak and Wainwright, 2020) and have become an algorithm of choice in application areas such as autonomous driving (Donevski et al., 2021) and computer vision (He et al., 2021). Theoretically, FedProx comes with convergence guarantees under the following bounded local gradient dissimilarity assumption that captures the statistical heterogeneity of local objectives across the network:

Definition 1 (LGD).
We say the local functions $\{F_i\}_{i=1}^N$ have $(B, \sigma)$-local gradient dissimilarity (LGD) if the following holds for all $w$:

$$\frac{1}{N} \sum_{i=1}^{N} \left\|\nabla F_i(w)\right\|^2 \le B^2 \left\|\nabla F(w)\right\|^2 + \sigma^2.$$

The definition naturally extends to the local empirical risks $\{F_{S_i}\}_{i=1}^N$.

In the special homogeneous setting where $F_i \equiv F$ for all $i$, we have $B = 1$ and $\sigma = 0$. Under LGD and some regularity condition on the modulus $B$, it was shown that FedProx for nonconvex problems requires a number of inter-device communication rounds scaling polynomially in $1/\epsilon$ to reach an $\epsilon$-stationary solution (Li et al., 2020b). Similar guarantees have also been established for a variant of FedProx with nonuniform model aggregation schemes (Nguyen et al., 2020).
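To make the iteration concrete, the following minimal numpy sketch runs FedProx on toy quadratic local risks $F_i(w) = \frac{1}{2}\|w - c_i\|^2$, for which the proximal subproblem (1.3) has a closed-form solution. The quadratic model, the parameter values, and all names are our illustrative assumptions, not the paper's setup.

```python
import numpy as np

def fedprox_round(w_global, centers, selected, gamma):
    """One FedProx round on toy local risks F_i(w) = 0.5*||w - c_i||^2.

    For these quadratics the proximal subproblem
        argmin_w  F_i(w) + ||w - w_global||^2 / (2*gamma)
    has the closed form (gamma * c_i + w_global) / (gamma + 1).
    """
    locals_ = [(gamma * centers[i] + w_global) / (gamma + 1.0) for i in selected]
    return np.mean(locals_, axis=0)  # uniform aggregation over selected devices

rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 3))  # 10 heterogeneous devices
w = np.zeros(3)
for t in range(200):
    selected = rng.choice(10, size=4, replace=False)  # partial participation
    w = fedprox_round(w, centers, selected, gamma=0.5)
# w drifts toward the global minimizer, the mean of the centers, up to
# a fluctuation induced by the random device sampling
```

Under full participation the iteration contracts to the exact global minimizer; under partial participation the device-sampling noise leaves a residual fluctuation, which is exactly the effect the bounds in Section 1.2 track through the number of selected devices.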
Open issues and motivation. In spite of the remarkable success achieved by FedProx and its variants, a number of important theoretical issues remain open, concerning its unrealistic assumptions, restrictive problem regimes, and expensive local oracle cost, as specified below.

Local dissimilarity. The appealing convergence behavior of FedProx is so far characterized under the key but nonstandard LGD condition (cf. Definition 1) with the additive dissimilarity term set to zero. Such a condition is obviously unrealistic in practice: it essentially requires the local objectives to share the same stationary points as the global objective, since any $w$ with $\nabla F(w) = 0$ must then satisfy $\nabla F_i(w) = 0$ for all $i$. However, if the optima of the local objectives are exactly (or even approximately) the same, there would be little point in distributing data across devices for federated learning. It is thus desirable to understand the convergence behavior of FedProx for heterogeneous FL without imposing stringent local dissimilarity conditions of this kind.

Nonsmooth optimization. The existing convergence guarantees of FedProx are only available for FL with smooth losses. More often than not, however, FL applications involve nonsmooth objectives, due to the popularity of nonsmooth losses (e.g., the hinge loss and absolute loss) in machine learning and the training of deep neural networks with nonsmooth activations such as ReLU. It is therefore desirable to understand the convergence behavior of FedProx in nonsmooth problem regimes.
Local oracle complexity. Unlike the (stochastic) first-order oracles such as SGD used by FedAvg, the proximal point oracle (1.3) for local update is by itself a full-batch ERM problem, which tends to be expensive to solve even approximately per iteration. Moreover, due to the potentially imbalanced data distribution over devices, the computational load of the proximal point oracle could vary significantly across the network. It is therefore important to investigate whether a stochastic approximation to the proximal point oracle (1.3) can provably improve the computational efficiency of FedProx.
Last but not least, the existing convergence analysis of FedProx mainly focuses on the empirical FL problem (1.2); the optimality of FedProx in terms of the population FL problem (1.1) is not yet clear. The primary goal of this work is to remedy these theoretical issues simultaneously, so as to lay a more solid theoretical foundation for the popularly applied FedProx algorithm.
1.2 Our Contributions
In this paper, we make progress towards understanding the convergence behavior of FedProx for nonconvex heterogeneous FL under weaker and more realistic conditions. The main results are a set of local dissimilarity invariant bounds for smooth or nonsmooth problems.
Main results for the vanilla FedProx. As a starting point for addressing the restrictiveness of the local dissimilarity assumption, we provide a novel convergence analysis for the vanilla FedProx algorithm that is independent of local dissimilarity type conditions. For smooth and nonconvex optimization problems, our result in Theorem 1 shows that the rate of convergence to a stationary point is upper bounded by
(1.4) 
where $K$ denotes the number of devices randomly selected for local update at each iteration. If all the devices participate in the local updates in every round, i.e., $K = N$, the rate of convergence can be improved further. When the participation ratio is low, the rate in (1.4) is dominated by its device-sampling term, which determines the communication complexity to achieve an $\epsilon$-stationary solution; when the participation ratio is high, the other term dominates and determines the communication complexity. Compared to the known complexity bound of FedProx under the unrealistic LGD condition (Li et al., 2020b), our rate in (1.4) is slower, but it holds without imposing stringent regularity conditions on the dissimilarity of local functions, and it reveals the effect of device sampling in accelerating convergence. Further, for nonsmooth and nonconvex problems, we establish in Theorem 2 the following rate of convergence
(1.5) 
which is invariant to the number of devices selected in each round. In the low-participation case, the bounds in (1.4) and (1.5) are comparable, which indicates that smoothness is not a must-have for FedProx to attain a sharp convergence bound, especially with a low participation ratio. On the other end, in the near-full-participation case, the bound (1.5) for nonsmooth problems is slower than the bound (1.4) for smooth functions in large-scale networks.
Main results for minibatch stochastic FedProx. As the chief contribution of the present work, we then propose a minibatch stochastic extension of FedProx, along with an analysis of its population optimization performance from a novel perspective of algorithmic stability theory. Inspired by the recent success of minibatch stochastic proximal point methods (MSPP) (Li et al., 2014; Wang et al., 2017; Asi et al., 2020; Deng and Gao, 2021), we propose to implement FedProx using MSPP as the local update oracle. The resulting method, referred to as FedMSPP, is expected to attain an improved tradeoff between computation, communication, and memory efficiency for large-scale FL. In the case of imbalanced data distribution, minibatching is also beneficial for balancing the local computation across devices. Based on some extended uniform stability arguments for gradients, we show in Theorem 3 the following local dissimilarity invariant rate of convergence for FedMSPP in terms of population optimality:
(1.6) 
where $b$ is the minibatch size of the local update. For empirical FL, an identical bound holds under sampling according to the empirical distribution. For small $b$ and $K$, the rate in (1.6) is dominated by its first term, and the resulting communication complexity matches that of the vanilla FedProx. For sufficiently large $b$ and $K$, the other term dominates, yielding an improved communication complexity. This shows that local minibatching and device sampling are both beneficial for linearly speeding up communication. Further, when applied to nonsmooth problems, we can similarly show that FedMSPP converges at a rate comparable to that of (1.6) when $b$ and $K$ are small, but without exhibiting the effect of linear speedup with respect to $b$ and $K$.
Table 1: Comparison of communication complexity bounds, independence of local dissimilarity (LD Independ.), applicability to nonsmooth problems (NS), and tolerance to partial participation (PP).

Method   | Work                      | Commun. Complex. | LD Independ. | NS | PP
FedProx  | Li et al. (2020b)         |                  | ✗            | ✗  | ✓
FedProx  | Theorem 1 (ours)          |                  | ✓            | ✗  | ✓
FedProx  | Theorem 2 (ours)          |                  | ✓            | ✓  | ✓
FedMSPP  | Theorem 3 (ours)          |                  | ✓            | ✗  | ✓
FedMSPP  | Theorem 4 (ours)          |                  | ✓            | ✓  | ✓
FedAvg   | Karimireddy et al. (2020) |                  | ✗            | ✗  | ✓
FedAvg   | Yu et al. (2019)          |                  | ✗            | ✗  | ✗
FedAvg   | Khanduri et al. (2021)    |                  | ✗            | ✗  | ✗
SCAFFOLD | Karimireddy et al. (2020) |                  | ✓            | ✗  | ✓
FedPD    | Zhang et al. (2020)       |                  | ✗            | ✗  | ✗
STEM     | Khanduri et al. (2021)    |                  | ✗            | ✗  | ✗
FCO      | Yuan et al. (2021)        |                  | ✓            | ✓  | ✗
Comparison with prior results. In Table 1, we summarize our communication complexity bounds for FedProx (FedMSPP) and compare them with those of several related heterogeneous FL algorithms in terms of dependency on local dissimilarity, applicability to nonsmooth problems, and tolerance to partial participation. A few observations are in order. First, regarding the requirement of local dissimilarity, all of our bounds are independent of local dissimilarity conditions, and they are comparable to those of SCAFFOLD and FCO (for convex problems), which are also invariant to local dissimilarity. Second, with regard to applicability to nonsmooth optimization, our convergence guarantees in Theorem 2 and Theorem 4 are established for nonsmooth and weakly convex functions. Among the other considered algorithms, FCO is the only one applicable to nonsmooth problems, but it is customized for federated convex composite optimization with potentially nonsmooth regularizers (Yuan et al., 2021). Third, in terms of tolerance to partial participation, all of our results are robust to device sampling, and the bound in Theorem 3 for FedMSPP is comparable to the best known results under partial participation, as achieved by FedAvg and SCAFFOLD. Assuming that all devices participate in the local update in each communication round, and under certain local dissimilarity conditions, substantially faster bounds are possible for STEM and FedPD, and fast bounds can also be achieved by FedAvg (Khanduri et al., 2021). To summarize, our local dissimilarity invariant convergence bounds for FedProx (FedMSPP) are comparable to the best-known rates in the identical setting, while covering the generic nonsmooth and nonconvex cases, which, to our knowledge, has so far not been possible for other FL algorithms.
Highlight of theoretical contributions:

From the perspective of algorithmic stability theory, we provide a set of novel local dissimilarity invariant convergence guarantees for the widely used FedProx algorithm for nonconvex heterogeneous FL, with smooth or nonsmooth local functions. Our theory reveals for the first time that local dissimilarity and smoothness conditions are not necessary for guaranteeing the convergence of FedProx at reasonable rates.

We present FedMSPP as a minibatch stochastic extension of FedProx and analyze its convergence behavior in terms of population optimality, again without assuming any type of local dissimilarity condition. The main result provably shows that FedMSPP converges favorably for both smooth and nonsmooth objectives, while enjoying linear speedup in terms of minibatch size and partial participation ratio for smooth problems.
Finally, while the main contribution of this work is essentially theoretical, we have also carried out a preliminary numerical study on several benchmark FL datasets to corroborate our theoretical findings about the improved sample efficiency of FedMSPP.
Paper organization. In Section 2 we present our local dissimilarity invariant convergence analysis for the vanilla FedProx with smooth or nonsmooth loss functions. In Section 3 we propose FedMSPP as a minibatch stochastic extension of FedProx and analyze its convergence behavior through the lens of algorithmic stability theory. In Section 4, we present some additional related work on the topics covered by this paper. In Section 5, we present a preliminary experimental study on the convergence behavior of FedMSPP. The concluding remarks are made in Section 6. The technical proofs are relegated to the appendix sections.
2 Convergence of FedProx
We begin by providing an improved analysis for the vanilla FedProx that is independent of local dissimilarity type conditions. We first introduce the notation used in the analysis to follow.
Notations. Throughout the paper, we use $[N]$ to denote the set $\{1, \ldots, N\}$, $\|\cdot\|$ to denote the Euclidean norm, and $\langle \cdot, \cdot \rangle$ to denote the Euclidean inner product. We say a function $g$ is $G$-Lipschitz continuous if $|g(w) - g(w')| \le G\|w - w'\|$ for all $w, w'$, and it is $L$-smooth if $\|\nabla g(w) - \nabla g(w')\| \le L\|w - w'\|$ for all $w, w'$. Moreover, we say $g$ is $\rho$-weakly convex if for any $w, w'$,

$$g(w') \ge g(w) + \langle \tilde{\nabla} g(w), w' - w \rangle - \frac{\rho}{2}\|w' - w\|^2,$$

where $\tilde{\nabla} g(w)$ represents a subgradient of $g$ evaluated at $w$. We denote by

$$g_{\lambda}(w) := \min_{w'} \left\{ g(w') + \frac{1}{2\lambda}\|w' - w\|^2 \right\}$$

the Moreau envelope of $g$, and by

$$\mathrm{prox}_{\lambda g}(w) := \arg\min_{w'} \left\{ g(w') + \frac{1}{2\lambda}\|w' - w\|^2 \right\}$$

the proximal mapping associated with $g$. We also need the following definition of an inexact local update oracle for FedProx.
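As a concrete instance of these notions, consider the scalar function $g(w) = |w|$, which is Lipschitz, nonsmooth, and weakly convex (with $\rho = 0$): its proximal mapping is the soft-thresholding operator and its Moreau envelope is the Huber function. A small numpy sketch (the function names are ours):

```python
import numpy as np

def prox_abs(w, lam):
    """Proximal mapping of g(w) = |w|: the soft-thresholding operator."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def moreau_env_abs(w, lam):
    """Moreau envelope g_lam(w) = min_v { |v| + (v - w)^2 / (2*lam) },
    evaluated at the minimizer v = prox_abs(w, lam); this is the Huber function."""
    v = prox_abs(w, lam)
    return np.abs(v) + (v - w) ** 2 / (2.0 * lam)

# The envelope is differentiable even though g is not, with
#     grad g_lam(w) = (w - prox_abs(w, lam)) / lam,
# so a small envelope gradient certifies that the prox point is
# near-stationary for g.
```

This differentiability of the envelope is exactly what makes it a convenient stationarity measure for the nonsmooth analysis later in this paper.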
Definition 2 (Local inexact oracle of FedProx).
Suppose that the local proximal point regularized objective (cf. (1.3)) admits a global minimizer. For each time instance $t$, we say that the local update oracle of FedProx is inexactly solved with suboptimality $\varepsilon_t$ if the local objective value at the returned point exceeds the global minimum of (1.3) by at most $\varepsilon_t$.

We conventionally assume that the objective value gap is bounded.
2.1 Results for Smooth Problems
The following theorem is our main result on the convergence rate of FedProx for smooth and nonconvex federated optimization problems.
Theorem 1.
Assume that for each , the loss function is Lipschitz and smooth with respect to its first argument. Set and . Suppose that the local update oracle of FedProx is inexactly solved with . Let be an index uniformly randomly chosen in . Then,
Proof.
A proof of this result is deferred to Appendix B.1. ∎
A few remarks are in order.
Remark 1.
Compared to the bound from Li et al. (2020b), our rate established in Theorem 1 is slower, but it is valid without assuming the unrealistic LGD condition or imposing strong regularization conditions on the proximal strength (see, e.g., Li et al., 2020b, Remark 5). Moreover, the dominant term in our bound reveals the benefit of device sampling for linear speedup, which is not clear in the original analysis of Li et al. (2020b).
Remark 2.
In the extreme case of full device participation, i.e., when all devices are selected in every round, the terms related to device sampling in Theorem 1 vanish and the convergence rate improves accordingly. In this same setting, we note that the rate can be improved further using our proof arguments if LGD is additionally assumed.
Remark 3.
The Lipschitz-loss assumption in Theorem 1 can alternatively be replaced by the bounded-gradient condition commonly used in the analysis of FL algorithms (Li et al., 2020b; Zhang et al., 2020). Although our analysis does not explicitly invoke any local dissimilarity condition, the assumed Lipschitz (or bounded-gradient) condition actually implies that the local objective gradients are not too dissimilar, which shares a close spirit with the typically assumed LGD condition (Karimireddy et al., 2020) and inter-client variance condition (Khanduri et al., 2021). It is noteworthy that these client heterogeneity conditions are substantially milder than the LGD condition required in the original analysis of FedProx.

2.2 Results for Nonsmooth Problems
Now we turn to the convergence of FedProx for weakly convex but not necessarily smooth problems. For clarity of presentation, we work with the exact FedProx, in which the local update oracle is assumed to be solved exactly. Extension to the inexact case is more or less straightforward, though it requires a somewhat more involved perturbation treatment. We assume that the objective value gap associated with the Moreau envelope of the global objective is bounded. The following is our main result on the convergence of FedProx for nonsmooth and weakly convex problems.
Theorem 2.
Assume that for each , the loss function is Lipschitz and weakly convex with respect to its first argument. Set for arbitrary . Suppose that the local update oracle of FedProx is exactly solved with . Let be an index uniformly randomly chosen in . Then it holds that
Proof.
The proof technique is inspired by the arguments of Davis and Drusvyatskiy (2019) developed for analyzing stochastic model-based algorithms, with several new elements developed for handling the challenges introduced by the model averaging and partial participation mechanisms of FedProx. A particular crux here is that, due to the random-subset model aggregation, the local function values are no longer independent of each other even though the device selection is uniformly random. As a consequence, the subset average is not an unbiased estimate of the global quantity of interest. To overcome this technical obstacle, we make use of a key observation that the aggregated iterate will almost surely be close enough to its full-participation counterpart if the learning rate is small enough (which is the case for our choice of learning rate), and thus we can replace the former with the latter whenever beneficial without introducing too much approximation error. A full proof of this result can be found in Appendix B.2. ∎

A few comments are in order.
Remark 4.
To the best of our knowledge, Theorem 2 is the first convergence guarantee for FL algorithms applicable to generic nonsmooth and weakly convex problems. This is in sharp contrast with FCO (Yuan et al., 2021), which focuses on composite convex and nonsmooth problems such as regularized estimation, and with FedHT (Tong et al., 2020), which is specially customized for cardinality-constrained sparse learning problems where the nonconvexity essentially arises from the constraint.

Remark 5.
Let us consider $\widehat{w} := \mathrm{prox}_{\lambda F}(w)$, the proximal mapping of the global objective $F$ evaluated at $w$. In view of the capability of the Moreau envelope to characterize stationarity (Davis and Drusvyatskiy, 2019), if the envelope gradient norm $\|\nabla F_{\lambda}(w)\|$ is small, then $\widehat{w}$ must be a near-stationary solution and stays in the proximity of $w$, due to the identity $\nabla F_{\lambda}(w) = \frac{1}{\lambda}(w - \widehat{w})$. Therefore, the bound in Theorem 2 suggests that in expectation $\widehat{w}$ converges to a stationary solution and $w$ converges to $\widehat{w}$, both at the rate established in the theorem.
Remark 6.
We comment that the bound in Theorem 2 does not depend on the number of selected devices. For small participation ratios and sufficiently many rounds, the bounds in Theorem 1 and Theorem 2 are comparable to each other, which demonstrates that smoothness is not a must-have for FedProx to obtain a sharp convergence bound at small device sampling rates. However, in the near-full participation setting, the bound in Theorem 2 for nonsmooth problems becomes slower when the network is large. In the extreme case of full participation, this bound is substantially inferior to the smooth case, which enjoys the improved rate discussed in Remark 2.
3 Convergence of FedProx with Stochastic Minibatching
When it comes to the implementation of FedProx, a notable challenge is that the local proximal point update oracle (1.3) is by itself a full-batch ERM problem, which can be expensive to solve even approximately in large-scale settings. Moreover, when the data distribution over devices is highly imbalanced, the computational load of the local update could vary significantly across the network, which impairs communication efficiency. It is thus desirable to seek stochastic approximation schemes to improve the local oracle efficiency and load balance of FedProx. To this end, inspired by the recent success of minibatch stochastic proximal point methods (MSPP) (Asi et al., 2020; Deng and Gao, 2021), we propose to implement FedProx using MSPP as the local stochastic optimization oracle. More precisely, let $B_i^t$ be a minibatch of $b$ i.i.d. samples drawn from the distribution $\mathcal{D}_i$ at device $i$ and time instance $t$. We denote

$$F_{B_i^t}(w) := \frac{1}{b} \sum_{z \in B_i^t} f(w; z) \qquad (3.1)$$

as the local minibatch empirical risk function over $B_i^t$. The only modification we propose is to replace the empirical risk in the original update form (1.3) with its minibatch counterpart (3.1). The resultant FL framework, which we refer to as FedMSPP (Federated MSPP), is outlined in Algorithm 1. Clearly, the vanilla FedProx is a special case of FedMSPP when applied to the federated ERM form (1.2) with the full data batch on each device.
$$w_i^t \approx \arg\min_{w} \left\{ F_{B_i^t}(w) + \frac{1}{2\gamma} \left\|w - w^{t-1}\right\|^2 \right\} \qquad (3.2)$$
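A minimal sketch of such a minibatch local oracle for a toy squared loss $f(w; z) = \frac{1}{2}\|w - z\|^2$: sample a minibatch, then inexactly solve the proximal subproblem by plain gradient descent. The loss, the inner solver, and all names are our illustrative assumptions, not the paper's prescribed implementation.

```python
import numpy as np

def local_msp_update(w_global, data_i, batch_size, gamma,
                     inner_steps=50, lr=0.1, rng=None):
    """Inexact minibatch proximal-point oracle: draw a minibatch B from the
    device's data, then run gradient descent on
        F_B(w) + ||w - w_global||^2 / (2*gamma)
    for the toy loss f(w; z) = 0.5*||w - z||^2."""
    rng = rng or np.random.default_rng()
    batch = data_i[rng.choice(len(data_i), size=batch_size, replace=False)]
    w = w_global.copy()
    target = batch.mean(axis=0)
    for _ in range(inner_steps):
        grad = (w - target) + (w - w_global) / gamma  # grad of F_B plus prox term
        w = w - lr * grad
    return w
```

For this quadratic loss the exact minimizer of the subproblem is $(\gamma \bar{z}_B + w^{t-1})/(\gamma + 1)$, with $\bar{z}_B$ the minibatch mean, so the suboptimality of the inexact oracle can be checked directly.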
3.1 Results for Smooth Problems
We first analyze the convergence rate of FedMSPP for smooth and nonconvex problems using tools borrowed from algorithmic stability theory. Analogous to Definition 2, we introduce the following definition of an inexact local update oracle for FedMSPP.
Definition 3 (Local inexact oracle of FedMSPP).
Suppose that the local proximal point regularized objective (cf. (3.2)) admits a global minimizer. For each time instance , we say that the local update oracle of FedMSPP is inexactly solved with suboptimality if
We also assume that the population value gap is bounded. The following theorem is our main result on FedMSPP for smooth and nonconvex FL problems.
Theorem 3.
Assume that for each , the loss function is Lipschitz and smooth with respect to its first argument. Set and . Suppose that the local update oracle of FedMSPP is inexactly solved with . Then we have
Proof.
Let us consider the quantity that is roughly the local update direction on device $i$, given that the local update oracle is solved to sufficient accuracy. As a key ingredient of our proof, we show via some extended uniform stability arguments in terms of gradients (see Lemma 3) that the averaged local directions align well with the global gradient in expectation (see Lemma 11). Therefore, on average, the global model is updated roughly along the direction of global gradient descent, which guarantees quick convergence. Based on this novel analysis, we are free from imposing any kind of local dissimilarity condition on the local objectives. See Appendix C.1 for a full proof of this result. ∎
Remark 7.
For small minibatch and participation sizes, the bound in Theorem 3 is dominated by its leading term, which determines the communication complexity. This shows that FedMSPP enjoys linear speedup both in the size of the local minibatch and in the size of device sampling.
3.2 Results for Nonsmooth Problems
Analogous to FedProx, we can further show that FedMSPP converges reasonably well when applied to weakly convex and nonsmooth problems. We assume that the objective value gap associated with the Moreau envelope of the global objective is bounded. The following is our main result in this line.
Theorem 4.
Assume that for each , the loss function is Lipschitz and weakly convex with respect to its first argument. Set for arbitrary . Suppose that the local update oracle of FedMSPP is exactly solved with . Let be an index uniformly randomly chosen in . Then it holds that
Proof.
A proof of this result is deferred to the appendix. ∎
4 Additional Related Work
Heterogeneous federated learning. The presence of device heterogeneity marks a key distinction between FL and classic distributed learning. The most commonly used FL method is FedAvg (McMahan et al., 2017), in which the local update oracle is multi-epoch SGD. FedAvg was initially analyzed for identical local functions (Stich, 2019; Stich and Karimireddy, 2020) under the name of local SGD. In the heterogeneous setting, numerous recent studies have focused on the analysis of FedAvg and other variants under various notions of local dissimilarity (Chen et al., 2020; Khaled et al., 2020; Li et al., 2020c; Woodworth et al., 2020; Reddi et al., 2021; Li et al., 2022). As another representative FL method, FedProx (Li et al., 2020b) applies averaged proximal point updates to solve heterogeneous federated minimization problems. Theoretical guarantees for FedProx have been established for both convex and nonconvex problems, but under a fairly stringent gradient similarity assumption used to measure data heterogeneity (Li et al., 2020b; Nguyen et al., 2020; Pathak and Wainwright, 2020). This assumption was relaxed by FedPD (Zhang et al., 2020) inside a meta-framework of primal-dual optimization. SCAFFOLD (Karimireddy et al., 2020) and VRLSGD (Liang et al., 2019) are two algorithms that utilize variance reduction techniques to correct the local update directions, achieving convergence guarantees independent of data heterogeneity. For composite nonsmooth FL problems, the FCO method proposed by Yuan et al. (2021) employs a server-side dual averaging procedure to circumvent the curse of primal averaging suffered by FedAvg. In sharp contrast to these prior works, which either require local dissimilarity conditions, require full device participation, or are only applicable to smooth problems, we show through a novel analysis based on algorithmic stability theory that the well-known FedProx can actually overcome these shortcomings simultaneously.
Minibatch stochastic proximal point methods. The proposed FedMSPP algorithm is a variant of FedProx that simply replaces the local proximal point oracle with MSPP, which in each iteration updates the local model by (approximately) solving a proximal point estimator over a stochastic minibatch. MSPP-type methods have been shown to attain substantially improved iteration stability and adaptivity for large-scale machine learning, especially in nonsmooth optimization settings (Li et al., 2014; Wang et al., 2017; Asi and Duchi, 2019; Deng and Gao, 2021). However, it was not previously known whether FedProx or FedMSPP can achieve similarly strong guarantees for nonsmooth heterogeneous FL problems.
Algorithmic stability. Our analysis of FedMSPP builds largely upon classic algorithmic stability theory. Since the foundational work of Bousquet and Elisseeff (2002), algorithmic stability has served as a powerful proxy for establishing strong generalization bounds (Zhang, 2003; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010). In particular, the state-of-the-art risk bounds for strongly convex ERM are offered by approaches based on the notion of uniform stability (Feldman and Vondrák, 2018, 2019; Bousquet et al., 2020; Klochkov and Zhivotovskiy, 2021). It was shown by Hardt et al. (2016) that the solution obtained via SGD is expected to be uniformly stable for smooth convex or nonconvex loss functions. For nonsmooth convex losses, stability-induced generalization bounds have been established for SGD (Bassily et al., 2020; Lei and Ying, 2020). Through the lens of algorithmic stability theory, convergence rates of MSPP have been studied for nonsmooth and convex (Wang et al., 2017) or weakly convex (Deng and Gao, 2021) losses.
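The flavor of a uniform (argument) stability argument can be illustrated numerically: for a strongly convex ERM such as ridge regression, replacing a single training point moves the learned parameters by only $O(1/(\lambda n))$. This toy demonstration is our own illustration of the proof ingredient, not the paper's analysis.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/(2n))*sum_i (x_i^T w - y_i)^2 + (lam/2)*||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def argument_stability(n, lam, rng, d=5):
    """||w(S) - w(S')|| for neighboring samples S, S' differing in one point."""
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    X2, y2 = X.copy(), y.copy()
    X2[0], y2[0] = rng.normal(size=d), rng.normal()  # replace one training point
    return np.linalg.norm(ridge_fit(X, y, lam) - ridge_fit(X2, y2, lam))

rng = np.random.default_rng(1)
gap_small_n = argument_stability(100, 1.0, rng)
gap_large_n = argument_stability(10000, 1.0, rng)
# stability (and hence generalization) improves roughly as O(1/(lam * n))
```

It is this kind of insensitivity to single-sample replacement, extended from function values to gradients, that underlies the population bounds for FedMSPP.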
5 Experimental Results
In this section, we carry out a preliminary experimental study to demonstrate the speedup behavior of FedMSPP under varying minibatch sizes when achieving test performance comparable to FedProx. We also include FedAvg as a conventional baseline for comparison.
5.1 Data and Models
Table 2: Statistics of the datasets and models in use.

Dataset | Model        | # Devices | # Samples (Training)
MNIST   | 2-layer CNN  | 100       |
FEMNIST | 2-layer CNN  | 50        |
Sent140 | 2-layer LSTM | 261       |
We compare the considered algorithms over the following three benchmark datasets popularly used for evaluating heterogenous FL approaches:

The MNIST (LeCun et al., 1998) dataset of handwritten digits 0-9 is used for digit image classification with a two-layer convolutional neural network (CNN). The model takes as input images of size 28 × 28 and first performs two convolution blocks ({1→32 channels, max-pooling}, {32→64 channels, max-pooling}) followed by a fully connected (FC) layer. We use 63,000 images, split into training and test sets. The data are distributed over 100 devices such that each device has samples of only 2 digits.
The FEMNIST (Li et al., 2020b) dataset is a subset of the 62-class EMNIST (Cohen et al., 2017) database, constructed by subsampling the 10 lower-case characters 'a'-'j'. We study the performance of the considered algorithms for character image classification using the same two-layer CNN as for MNIST, which takes as input images of size 28 × 28. We use 55,050 images, split into training and test sets. The data are distributed over 50 devices, each of which has samples of 3 characters.

The Sent140 (Go et al., 2009) dataset of tweet sentiment is used for evaluating the considered algorithms on sentiment classification. The model is a two-layer LSTM binary classifier containing 256 hidden units followed by a densely connected layer. The input is a sequence of 25 characters represented by 300-dimensional GloVe embeddings (Pennington et al., 2014), and the output is a binary sentiment label per training sample. Our experiment uses tweets collected from a number of Twitter accounts, each of which corresponds to a device, with the samples split into training and test sets.
The statistics of the data and models in use are summarized in Table 2.
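The label-skewed splits described above (e.g., 2 digits per device on MNIST) can be produced with a simple shard-based partitioner. The following numpy sketch is our own illustration of such a scheme, not the exact script used in the experiments.

```python
import numpy as np

def partition_by_label(labels, num_devices, shards_per_device, rng):
    """Split sample indices into label-homogeneous shards, then deal
    `shards_per_device` shards to each device, so each device sees at
    most that many distinct classes (a standard non-IID split)."""
    classes = np.unique(labels)
    shards_per_class = num_devices * shards_per_device // len(classes)
    shards = []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        shards.extend(np.array_split(idx, shards_per_class))
    order = rng.permutation(len(shards))
    return [np.concatenate([shards[j] for j in
                            order[d * shards_per_device:(d + 1) * shards_per_device]])
            for d in range(num_devices)]
```

With 10 classes, 100 devices, and 2 shards per device, each device receives samples from at most 2 digits, matching the MNIST split described above.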
5.2 Implementation Details and Performance Metrics
We generally follow the instructions of Li et al. (2020b) in implementing FedProx, FedMSPP, and FedAvg. More specifically, we use SGD as the local solver for all three algorithms. For FedMSPP, we run with three different minibatch sizes on each dataset, as reported in the results subsection below. The hyperparameters used in our implementation, such as the number of communication rounds and the number of local SGD epochs, are listed in Table 3.
Hyperparameter             | MNIST | FEMNIST | Sent140
# Communication rounds     | 200   | 300     | 300
# Local SGD epochs         | 2     | 5       | 10
Local SGD minibatch size   | 567   | 512     | 100
Local SGD learning rate    | 0.25  | 0.06    | 0.1
Strength of regularization | 0.1   | 0.1     | 0.001
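The local solver described above, plain SGD run on the proximally regularized local objective, can be sketched as follows. This is a minimal illustration of a FedProx-style local update with a least-squares loss standing in for the local risk; the function name and hyperparameter defaults are our own, not the paper's implementation.

```python
import numpy as np

def fedprox_local_update(w_global, X, y, mu=0.1, lr=0.05, epochs=2,
                         batch_size=32, seed=0):
    """Approximately minimize the FedProx-style local objective
        F_i(w) + (mu / 2) * ||w - w_global||^2
    with minibatch SGD, using a least-squares loss F_i for illustration."""
    rng = np.random.default_rng(seed)
    w = w_global.copy()
    n = len(y)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # minibatch loss gradient
            grad += mu * (w - w_global)                         # proximal-term gradient
            w = w - lr * grad
    return w
```

The proximal term keeps the local iterate anchored to the last global model, which is what distinguishes this update from the plain local SGD step used by FedAvg.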
Since the chief goal of this empirical study is to illustrate the benefit of FedMSPP in speeding up the convergence of FedProx, we use the numbers of data points and communication rounds needed to reach the desired solution accuracy as performance metrics. The desired test accuracies are on MNIST, on FEMNIST, and on Sent140.
5.3 Results
In Figure 1, we show the numbers of data samples accessed by the considered algorithms to reach comparable test accuracies. For FedMSPP, we test with minibatch sizes on MNIST, on FEMNIST, and on Sent140. From this set of results we can observe that:

On all the three datasets in use, FedMSPP with varying minibatch sizes consistently needs significantly fewer samples than FedProx and FedAvg to reach the desired test accuracies.

FedMSPP with smaller minibatch size tends to have better sample efficiency.
Figure 2 shows the corresponding rounds of communication needed to reach comparable test accuracies. From this group of results we can see that, in most cases, FedMSPP needs only slightly more rounds of communication than FedProx and FedAvg to reach comparable generalization accuracy.
Overall, our numerical results confirm that FedMSPP can serve as a safe and computationally more efficient replacement for FedProx on the considered heterogeneous FL tasks.
6 Conclusion
In this paper, we have exposed three shortcomings of the prior convergence analysis for FedProx: unrealistic local dissimilarity assumptions, inapplicability to nonsmooth problems, and the expensive (and potentially imbalanced) computational cost of local updates. To tackle these issues, we developed a novel convergence theory for the vanilla FedProx and its minibatch stochastic variant, FedMSPP, through the lens of algorithmic stability theory. In a nutshell, our results reveal that, with minimal modifications, FedProx is able to kill three birds with one stone: it enjoys favorable rates of convergence that are simultaneously invariant to local dissimilarity, applicable to smooth or nonsmooth problems, and, for smooth problems, scale linearly with respect to the local minibatch size and device sampling ratio. To the best of our knowledge, the present work is the first theoretical contribution that achieves all these appealing properties in a single FL framework.
Acknowledgement
Xiao-Tong Yuan was funded in part by the National Key Research and Development Program of China under Grant No. 2018AAA0100400 and in part by the Natural Science Foundation of China (NSFC) under Grants No. 61876090, No. 61936005, and No. U21B2049.
References
 Asi and Duchi (2019) Hilal Asi and John C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM J. Optim., 29(3):2257–2290, 2019.
 Asi et al. (2020) Hilal Asi, Karan N. Chadha, Gary Cheng, and John C. Duchi. Minibatch stochastic approximate proximal point methods. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
 Bassily et al. (2020) Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
 Bhowmick et al. (2018) Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. Protection against reconstruction and its applications in private federated learning. arXiv preprint arXiv:1812.00984, 2018.
 Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.
 Bousquet et al. (2020) Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Proceedings of the Conference on Learning Theory (COLT), pages 610–626, Virtual Event [Graz, Austria], 2020.

 Chen et al. (2020) Xiangyi Chen, Xiaoyun Li, and Ping Li. Toward communication efficient adaptive gradient method. In Proceedings of the ACM-IMS Foundations of Data Science Conference (FODS), pages 119–128, Virtual Event, 2020.
 Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. EMNIST: extending MNIST to handwritten letters. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926, Anchorage, AK, 2017.
 Davis and Drusvyatskiy (2019) Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM J. Optim., 29(1):207–239, 2019.
 Deng and Gao (2021) Qi Deng and Wenzhi Gao. Minibatch and momentum model-based methods for stochastic weakly convex optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 23115–23127, virtual, 2021.
 Donevski et al. (2021) Igor Donevski, Jimmy Jessen Nielsen, and Petar Popovski. On addressing heterogeneity in federated learning for autonomous vehicles connected to a drone orchestrator. arXiv preprint arXiv:2108.02712, 2021.
 Elisseeff et al. (2005) André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. J. Mach. Learn. Res., 6:55–79, 2005.
 Feldman and Vondrák (2018) Vitaly Feldman and Jan Vondrák. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 9770–9780, Montréal, Canada, 2018.

 Feldman and Vondrák (2019) Vitaly Feldman and Jan Vondrák. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Proceedings of the Conference on Learning Theory (COLT), pages 1270–1279, Phoenix, AZ, 2019.
 Go et al. (2009) Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009, 2009.
 Hard et al. (2020) Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio Lopez-Moreno, and Rajiv Mathews. Training keyword spotting models on non-iid data with federated learning. In Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech), pages 4343–4347, Virtual Event, Shanghai, China, 2020.
 Hardt et al. (2016) Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1225–1234, New York City, NY, 2016.
 He et al. (2021) Chaoyang He, Alay Dilipbhai Shah, Zhenheng Tang, Di Fan, Adarshan Naiynar Sivashunmugam, Keerti Bhogaraju, Mita Shimpi, Li Shen, Xiaowen Chu, Mahdi Soltanolkotabi, and Salman Avestimehr. FedCV: A federated learning framework for diverse computer vision tasks. arXiv preprint arXiv:2111.11066, 2021.
 Karimireddy et al. (2020) Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 5132–5143, Virtual Event, 2020.

 Khaled et al. (2020) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), pages 4519–4529, Online [Palermo, Sicily, Italy], 2020.
 Khanduri et al. (2021) Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, and Pramod K. Varshney. STEM: A stochastic two-sided momentum algorithm achieving near-optimal sample and communication complexities for federated learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 6050–6061, virtual, 2021.
 Klochkov and Zhivotovskiy (2021) Yegor Klochkov and Nikita Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate . In Advances in Neural Information Processing Systems (NeurIPS), pages 5065–5076, virtual, 2021.
 Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
 Lei and Ying (2020) Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 5809–5819, Virtual Event, 2020.
 Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient minibatch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 661–670, New York, NY, 2014.
 Li et al. (2019) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. FedDANE: A federated Newton-type method. In Proceedings of the 53rd Asilomar Conference on Signals, Systems, and Computers (Asilomar), pages 1227–1231, 2019.
 Li et al. (2020a) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag., 37(3):50–60, 2020a.
 Li et al. (2020b) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems (MLSys), Austin, TX, 2020b.
 Li et al. (2020c) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of FedAvg on non-iid data. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 2020c.
 Li et al. (2022) Xiaoyun Li, Belhal Karimi, and Ping Li. On distributed adaptive optimization with gradient compression. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.
 Liang et al. (2019) Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng. Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
 McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, Fort Lauderdale, FL, 2017.
 Mukherjee et al. (2006) Sayan Mukherjee, Partha Niyogi, Tomaso A. Poggio, and Ryan M. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161–193, 2006.
 Nguyen et al. (2020) Hung T Nguyen, Vikash Sehwag, Seyyedali Hosseinalipour, Christopher G Brinton, Mung Chiang, and H Vincent Poor. Fast-convergent federated learning. IEEE Journal on Selected Areas in Communications, 39(1):201–218, 2020.
 Pathak and Wainwright (2020) Reese Pathak and Martin J. Wainwright. FedSplit: an algorithmic framework for fast federated optimization. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.

 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, 2014.
 Reddi et al. (2021) Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 2021.
 Rivasplata et al. (2018) Omar Rivasplata, Csaba Szepesvári, John Shawe-Taylor, Emilio Parrado-Hernández, and Shiliang Sun. PAC-Bayes bounds for stable algorithms with instance-dependent priors. In Advances in Neural Information Processing Systems (NeurIPS), pages 9234–9244, Montréal, Canada, 2018.
 ShalevShwartz et al. (2010) Shai ShalevShwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, 2010.
 Stich (2019) Sebastian U. Stich. Local SGD converges fast and communicates little. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, 2019.
 Stich and Karimireddy (2020) Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed updates. J. Mach. Learn. Res., 21:237:1–36, 2020.
 Tong et al. (2020) Qianqian Tong, Guannan Liang, Tan Zhu, and Jinbo Bi. Federated nonconvex sparse learning. arXiv preprint arXiv:2101.00052, 2020.
 Wang et al. (2017) Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Proceedings of the 30th Conference on Learning Theory (COLT), pages 1882–1919, Amsterdam, The Netherlands, 2017.
 Woodworth et al. (2020) Blake E. Woodworth, Kumar Kshitij Patel, and Nati Srebro. Minibatch vs local SGD for heterogeneous distributed learning. In Advances in Neural Information Processing Systems (NeurIPS), virtual, 2020.
 Yang et al. (2019) Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol., 10(2):12:1–12:19, 2019.

 Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 5693–5700, Honolulu, HI, 2019.
 Yuan et al. (2021) Honglin Yuan, Manzil Zaheer, and Sashank J. Reddi. Federated composite optimization. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 12253–12266, Virtual Event, 2021.
 Zhang (2003) Tong Zhang. Leaveoneout bounds for kernel methods. Neural Comput., 15(6):1397–1437, 2003.
 Zhang et al. (2020) Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. FedPD: A federated learning framework with optimal rates and adaptivity to non-iid data. arXiv preprint arXiv:2005.11418, 2020.
Appendix A Preliminaries
We present in this section some preliminary results on classic algorithmic stability theory to be used in our analysis. Let us consider an algorithm that maps a training dataset to a model in a closed subset such that the following population risk function (with a slight abuse of notation), evaluated at the model, is as small as possible:
The corresponding empirical risk is defined by
We denote by if a pair of datasets and differ in a single data point. The following concept of stability serves as a powerful tool for analyzing the generalization bounds of learning algorithms (Elisseeff et al., 2005; Hardt et al., 2016; Bassily et al., 2020).
Definition 4 (Uniform Argument Stability).
Let be a learning algorithm that maps a dataset to a model . Then is said to have uniform stability if for every ,
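To make the notion of uniform argument stability concrete, the toy sketch below measures how much an exact ridge-regression ERM solution moves when a single data point is replaced; the roughly 1/n decay it exhibits mirrors the scaling exploited in the analysis. The setup, function names, and parameter choices are purely our own illustration, not part of the paper's theory.

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Exact minimizer of the strongly convex regularized empirical risk
    (1/(2n)) * sum_i (x_i^T w - y_i)^2 + (lam/2) * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def argument_stability(n, d=5, lam=1.0, seed=0):
    """||A(S) - A(S')|| for neighboring datasets S, S' differing in one point."""
    rng = np.random.default_rng(seed)
    X, y = rng.normal(size=(n, d)), rng.normal(size=n)
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = rng.normal(size=d), rng.normal()  # replace a single data point
    return np.linalg.norm(ridge_erm(X, y, lam) - ridge_erm(Xp, yp, lam))
```

Comparing `argument_stability(100)` with `argument_stability(2000)` shows the argument change shrinking as the sample size grows, in line with the bound of Lemma 1 for (exact) regularized ERM.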
The following basic lemma is about the uniform argument stability of an inexact regularized empirical risk minimization (ERM) estimator. See Appendix D.1 for its proof.
Lemma 1.
Assume that the loss function is Lipschitz with respect to its first argument. Suppose that the regularized objective is strongly convex for any . Consider the inexact estimator that satisfies the following for some :
Then has uniform argument stability with parameter .
We further need the following variant of the Efron–Stein inequality for random vector-valued functions (see, e.g., Lemma 6 of Rivasplata et al., 2018).
Lemma 2 (Efron–Stein inequality for vector-valued functions).
Let be a set of i.i.d. random variables valued in . Suppose that the function , valued in a Hilbert space, is measurable and satisfies the bounded differences property, i.e., the following inequality holds for any and any :
Then it holds that
Based on the Efron–Stein inequality in Lemma 2, we can establish the following lemma, which states the generalization bounds of a uniformly stable learning algorithm in terms of gradients. A proof of this result can be found in Appendix D.2.
Lemma 3.
Suppose that a learning algorithm has uniform stability. Assume that the loss function is Lipschitz and smooth with respect to its first argument. Then the following bounds hold:
Appendix B Proofs for Section 2
b.1 Proof of Theorem 1
Let . We define the following quantities
(B.1) 
The following elementary lemma is useful in our analysis.
Lemma 4.
Assume that for each , the loss function is Lipschitz. Then it holds that
Proof.
By the uniform without-replacement sampling strategy we have
Then it follows that
where we have used the fact , the independence among the indices in , and the Lipschitzness of the losses. The desired bounds are proved. ∎
We also need the following lemma, which quantifies the impact of the local update precision on the gradient norm at the inexact solution.
Lemma 5.
Assume that for each , the loss function is smooth with respect to its first argument. Suppose that the local update oracle of FedProx is inexactly solved and . Then it holds that
Proof.
Recall . Since the loss functions are smooth and , is strongly convex and thus admits a global minimizer. Then we have
where the last inequality is due to Definition 2. This implies the desired bound. ∎
Lemma 6.
Assume that for each , the loss function is Lipschitz and