1 Introduction
Large networks of remote devices, such as phones, vehicles, and wearable sensors, generate a wealth of data each day. Due to user privacy concerns and systems constraints (e.g., high communication costs, device-level computational constraints, and low availability amongst devices), federated learning has emerged as an increasingly attractive paradigm to push the training of statistical models in such networks to the edge (McMahan et al., 2017).
Optimization methods that allow for local updating and low participation have become the de facto solvers for federated learning (McMahan et al., 2017; Smith et al., 2017). These methods perform a variable number of local updates on a subset of devices to enable flexible and efficient communication patterns, e.g., compared to traditional distributed gradient descent or stochastic gradient descent (SGD). Of current federated optimization methods, FedAvg (McMahan et al., 2017) has become state-of-the-art for non-convex federated learning. FedAvg works simply by running some number of epochs, $E$, of SGD on a subset of the total devices at each communication round, and then averaging the resulting model updates via a central server. However, FedAvg was not designed to tackle the statistical heterogeneity inherent in federated settings; namely, that data may be non-identically distributed across devices, with the number of data points per device varying significantly. In realistic statistically heterogeneous settings, FedAvg has been shown to diverge empirically (McMahan et al., 2017, Sec. 3), and it also lacks theoretical convergence guarantees. Indeed, recent works exploring guarantees are limited to unrealistic scenarios, e.g., where (1) the data is either shared across devices or distributed in an IID (identically and independently distributed) manner, or (2) all devices are involved in communication at each round (Zhou & Cong, 2017; Stich, 2018; Wang & Joshi, 2018; Woodworth et al., 2018; Yu et al., 2018; Wang et al., 2018). While these assumptions simplify the analyses, they also violate key properties of realistic federated networks.
Contributions. In this work, we ask the following two questions: (1) Can we gain a principled understanding of FedAvg in realistic, statistically heterogeneous federated settings? (2) Can we devise an improved federated optimization algorithm, both theoretically and empirically? To this end, we propose a novel federated optimization framework, FedProx, which encompasses FedAvg. In order to characterize the convergence behavior of FedProx as a function of statistical heterogeneity, we introduce a novel device dissimilarity assumption. Under this assumption, we provide the first convergence guarantees for FedProx in practical heterogeneous data settings. Furthermore, through a set of experiments on numerous real-world federated datasets, we demonstrate that our theoretical assumptions reflect empirical performance, and that FedProx can improve the robustness and stability of convergence over FedAvg when data is heterogeneous across devices.
2 Related Work
Large-scale distributed machine learning, particularly in data center settings, has motivated the development of numerous distributed optimization methods in the past decade (see, e.g., Boyd et al., 2010; Dekel et al., 2012; Dean et al., 2012; Zhang et al., 2013; Li et al., 2014; Shamir et al., 2014; Zhang et al., 2015; Richtárik & Takáč, 2016; Smith et al., 2018). However, as computing substrates such as phones, sensors, and wearable devices grow both in power and in popularity, it is increasingly attractive to learn statistical models directly over networks of distributed devices, as opposed to moving the data to the data center. This problem, known as federated learning, requires tackling novel challenges with privacy, heterogeneous data and devices, and massively distributed computational networks.
Recent optimization methods have been proposed that are tailored to the specific challenges of the federated setting. These methods have shown significant improvements over traditional distributed approaches such as ADMM (Boyd et al., 2010) or mini-batch methods (Dekel et al., 2012) by allowing for inexact local updating to balance communication vs. computation in large networks, and by allowing a small subset of devices to be active at any communication round (McMahan et al., 2017; Smith et al., 2017; Lin et al., 2018). For example, Smith et al. (2017) propose a communication-efficient primal-dual optimization method that learns separate but related models for each device through a multi-task learning framework. Despite the theoretical guarantees and practical efficiency of this method, such an approach does not generalize to non-convex problems, e.g., deep learning, where strong duality is no longer guaranteed. In the non-convex setting, Federated Averaging (FedAvg), a heuristic method based on averaging local stochastic gradient descent (SGD) updates in the primal, has instead been shown to work well empirically (McMahan et al., 2017).

Unfortunately, FedAvg is quite challenging to analyze due to its local updating scheme, the fact that few devices are active at each round, and the fact that data is frequently distributed heterogeneously across the network. In particular, as each device in the network generates its own local data, statistical heterogeneity is common, with data being non-identically distributed between devices. Recent works have made steps towards analyzing FedAvg in simpler, non-federated settings. For instance, parallel SGD and related variants (Zhang et al., 2015; Zhou & Cong, 2017; Stich, 2018; Wang & Joshi, 2018; Woodworth et al., 2018), which make local updates similar to FedAvg, have been studied in the IID setting. However, the results rely on the premise that each local SGD run is a copy of the same stochastic process (due to the IID assumption); this line of reasoning does not apply to the heterogeneous setting. Although some works (Yu et al., 2018; Wang et al., 2018) have recently explored convergence guarantees in heterogeneous settings, they make the limiting assumption that all devices participate in each round of communication, which is often infeasible in realistic federated networks (McMahan et al., 2017). Further, they rely on specific solvers on each device (either SGD or GD), in contrast to the solver-agnostic framework proposed herein, and they add assumptions of convexity (Wang et al., 2018) or uniformly bounded gradients (Yu et al., 2018) to derive convergence guarantees.
There are also several recent heuristic approaches that aim to tackle statistical heterogeneity, either by sharing local device data or some server-side proxy data (Jeong et al., 2018; Zhao et al., 2018; Huang et al., 2018). However, these methods may be unrealistic in practical federated settings: in addition to imposing burdens on network bandwidth, sending local data to the server (Jeong et al., 2018) violates the key privacy assumption of federated learning, and sending globally-shared proxy data to all devices (Zhao et al., 2018; Huang et al., 2018) requires effort to carefully generate or collect such a dataset.
In this work, inspired by FedAvg, we propose a broader framework, FedProx, that is capable of handling heterogeneous federated data while maintaining similar privacy and computational benefits. We analyze the convergence behavior of the framework under a novel dissimilarity assumption between the local functions. Our assumption is inspired by the Kaczmarz method for solving linear systems of equations (Kaczmarz, 1993); similar assumptions have been used to analyze variants of SGD for strongly convex problems in non-distributed settings (see, e.g., Schmidt & Roux, 2013). Our proposed framework allows for improved robustness and stability of convergence in heterogeneous federated networks.
3 Federated Optimization: Algorithms
In this section, we introduce the key ingredients behind recent methods for federated learning, including FedAvg, and then outline our proposed framework, FedProx. Federated learning methods (e.g., McMahan et al., 2017; Smith et al., 2017; Lin et al., 2018) are designed to handle multiple devices (we use the term 'device' throughout the paper to describe entities in the network, e.g., nodes, clients, phones, and sensors) collecting data and a central server coordinating the global learning objective across the network. In particular, the aim is to minimize the following global objective function:
$$\min_w f(w) := \sum_{k=1}^N p_k F_k(w), \qquad (1)$$

where $N$ is the number of devices, $p_k \ge 0$, and $\sum_k p_k = 1$. In general, the local objectives $F_k$'s are given by local empirical risks over possibly differing data distributions $\mathcal{D}_k$, i.e., $F_k(w) := \mathbb{E}_{x_k \sim \mathcal{D}_k}[f_k(w; x_k)]$, with $n_k$ samples available locally at device $k$. Hence, we can set $p_k = \frac{n_k}{n}$, where $n = \sum_k n_k$ is the total number of data points. In this work, we consider $f(w)$ to be possibly non-convex.
To reduce communication and handle systems constraints, a common theme in federated optimization methods is that on each device, a local objective function based on the device’s data is used as a surrogate for the global objective function. At each outer iteration, a subset of the devices are selected and local solvers are used to optimize the local objective functions on each of the selected devices. The devices then communicate their local model updates to the central server, which aggregates them and updates the global model accordingly. The key to allowing flexible performance in this scenario is that each of the local objectives can be solved inexactly. This allows the amount of local computation vs. communication to be tuned based on the number of local iterations that are performed (with additional local iterations corresponding to more exact local solutions). We introduce this notion formally below, as it will be utilized throughout the paper.
Definition 1 ($\gamma$-inexact solution).
For a smooth convex function $h(w; w_0) = F(w) + \frac{\mu}{2}\|w - w_0\|^2$, and $\gamma \in [0, 1]$, we say $w^*$ is a $\gamma$-inexact solution of $\min_w h(w; w_0)$ if $\|\nabla h(w^*; w_0)\| \le \gamma \|\nabla h(w_0; w_0)\|$, where $\nabla h(w; w_0) = \nabla F(w) + \mu(w - w_0)$. Note that a smaller $\gamma$ corresponds to higher accuracy.
For full generality, we use $\gamma$-inexactness in our analysis (Section 4) to measure the amount of local computation from each local solver. However, in our experiments (Section 5) we simply run an iterative local solver for some number of local epochs, $E$, as in FedAvg (Algorithm 1). The number of local epochs can be seen as a proxy for inexactness, and it is straightforward (albeit notationally burdensome) to extend our analysis to directly cover this case by allowing $\gamma$ to vary by iteration and device, similar to the analysis in (Smith et al., 2017).
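To make Definition 1 concrete, the following sketch checks $\gamma$-inexactness numerically. The quadratic $F$ and all names here are our own illustration, not part of the paper:

```python
import numpy as np

def grad_h(w, w0, grad_F, mu):
    # Gradient of h(w; w0) = F(w) + (mu/2)||w - w0||^2.
    return grad_F(w) + mu * (w - w0)

def is_gamma_inexact(w_star, w0, grad_F, mu, gamma):
    # Definition 1: ||grad h(w*; w0)|| <= gamma * ||grad h(w0; w0)||.
    lhs = np.linalg.norm(grad_h(w_star, w0, grad_F, mu))
    rhs = gamma * np.linalg.norm(grad_h(w0, w0, grad_F, mu))
    return lhs <= rhs

# Toy example: F(w) = 0.5||w||^2, so grad F(w) = w, and the exact
# minimizer of h(.; w0) is w0 * mu / (1 + mu), where grad h vanishes.
grad_F = lambda w: w
w0 = np.array([1.0, -2.0])
mu, gamma = 1.0, 0.5
w_exact = mu * w0 / (1.0 + mu)
```

The exact minimizer trivially satisfies the definition for any $\gamma \ge 0$, while the starting point $w_0$ does not (for $\gamma < 1$), matching the intuition that smaller $\gamma$ demands more local work.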
3.1 Federated Averaging (FedAvg)
In Federated Averaging (FedAvg) (McMahan et al., 2017), the local surrogate of the global objective function at device $k$ is taken to be $F_k(w)$, and the local solver is chosen to be stochastic gradient descent (SGD), which is homogeneous across devices in terms of the algorithm hyperparameters, i.e., the learning rate and the number of local epochs. At each round, a subset $K \ll N$ of the total devices are selected, SGD is run locally for some number of epochs, $E$, and then the resulting model updates are averaged. The details of FedAvg are summarized in Algorithm 1.

McMahan et al. (2017) show empirically that it is crucial to tune the optimization hyperparameters of FedAvg properly in heterogeneous settings. In particular, carefully tuning the number of local epochs is critical for FedAvg to converge, as a larger number of local epochs allows local models to move further away from the initial global model, which can potentially lead to divergence. Intuitively, with dissimilar (heterogeneous) local objectives $F_k$, a larger number of local epochs may lead each device towards the optimum of its local objective as opposed to the global objective. Therefore, in a heterogeneous setting, where the local objectives may be quite different from the global one, it is beneficial to restrict the amount of local deviation through a more principled tool than heuristically limiting the number of local epochs of some iterative solver. Indeed, a natural way to limit local model updates is to instead incorporate a penalty on large changes from the current model at the server. This observation serves as inspiration for FedProx, introduced below.
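One FedAvg communication round can be sketched as follows. This is a hedged, minimal illustration, not the authors' implementation; the least-squares local objective and all names are our own assumptions:

```python
import numpy as np

def local_sgd(w, X, y, epochs, lr):
    # E full-batch gradient steps on an illustrative local objective
    # F_k(w) = 1/(2 n_k) ||X w - y||^2.
    for _ in range(epochs):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def fedavg_round(w_global, devices, epochs, lr):
    # One round: run E local epochs on each selected device, then
    # average the resulting models weighted by local sample counts.
    updates = [local_sgd(w_global.copy(), X, y, epochs, lr) for X, y in devices]
    n = np.array([len(y) for _, y in devices], dtype=float)
    p = n / n.sum()
    return sum(pk * wk for pk, wk in zip(p, updates))

# Toy usage on synthetic data (two devices sharing one true model).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])
devices = []
for _ in range(2):
    X = rng.normal(size=(20, 2))
    devices.append((X, X @ w_true))
w_new = fedavg_round(np.zeros(2), devices, epochs=5, lr=0.1)
```

In this IID toy setting the averaged model makes progress on the global objective; the heterogeneous case, where local optima disagree, is exactly where this averaging can fail, as discussed above.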
3.2 Proposed Framework: FedProx
Our proposed framework, FedProx, is similar to FedAvg in that a subset of devices are selected at each round, local updates are performed, and these updates are then averaged to form a global update. However, instead of just minimizing the local function $F_k$, device $k$ uses its local solver of choice to approximately minimize the following surrogate objective $h_k$:

$$\min_w h_k(w; w^t) = F_k(w) + \frac{\mu}{2}\|w - w^t\|^2. \qquad (2)$$

The proximal term in the above expression effectively limits the impact of local updates (by restricting them to be close to the initial global model $w^t$) without any need to manually tune the number of local epochs as in FedAvg. We summarize the steps of FedProx in Algorithm 2.
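A local FedProx update only changes the gradient the local solver follows: the proximal term contributes $\mu(w - w^t)$. The sketch below (our own illustration; the solver choice and toy $\nabla F_k$ are assumptions) shows the effect:

```python
import numpy as np

def fedprox_local_solve(w_global, grad_F, mu, lr=0.1, epochs=20):
    # Approximately minimize h_k(w; w^t) = F_k(w) + (mu/2)||w - w^t||^2
    # by gradient steps; the proximal term adds mu * (w - w^t) to the gradient.
    w = w_global.copy()
    for _ in range(epochs):
        w = w - lr * (grad_F(w) + mu * (w - w_global))
    return w

# Toy local objective minimized far from the global model w^t = 0:
grad_F = lambda w: w - np.array([10.0, 10.0])
w_t = np.zeros(2)
w_free = fedprox_local_solve(w_t, grad_F, mu=0.0)   # FedAvg-style local run
w_prox = fedprox_local_solve(w_t, grad_F, mu=10.0)  # proximal restraint
```

With $\mu = 0$ the iterate drifts toward the local optimum; with a larger $\mu$ it stays near $w^t$, which is exactly the restraint on local deviation motivated above.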
In our experiments (Section 5.2), we see that the modified local subproblem in FedProx results in more robust and stable convergence compared to vanilla FedAvg for heterogeneous datasets. As we will see in Section 4, the usage of the proximal term in FedProx also makes it more amenable to theoretical analysis. Note that FedAvg is a special case of FedProx with $\mu = 0$ and with the local solver specifically chosen to be SGD. Our proposed framework is significantly more general in this regard, as any local (possibly non-iterative) solver can be used on each device.
Finally, we note here a connection to elastic averaging SGD (EASGD) (Zhang et al., 2015), which was proposed as a way to train deep networks in the data center setting and uses a similar proximal term in its objective. While the intuition is similar to EASGD (this term helps to prevent large deviations on each device/machine), EASGD employs a more complex moving average to update parameters, is limited to using SGD as a local solver, and has only been analyzed for simple quadratic problems.
4 FedProx: Convergence Analysis
FedAvg and FedProx are stochastic algorithms by nature; in each step of these algorithms, only a fraction of the devices are sampled to perform the update, and the updates performed on each device may be inexact. It is well known that in order for stochastic methods to converge to a stationary point, a decreasing step size is required. This is in contrast to non-stochastic methods, e.g., gradient descent, which can find a stationary point with a constant step size. In order to analyze the convergence behavior of methods with a constant step size (which is what is usually implemented in practice), we need to be able to quantify the degree of dissimilarity among the local objective functions. This could be achieved by assuming the data to be IID, i.e., homogeneous across devices. Unfortunately, in realistic federated networks, this assumption is impractical. Thus, we propose a metric that specifically measures the dissimilarity among local functions (Section 4.1) and analyze FedProx under this assumption (Section 4.2).
4.1 Local dissimilarity
Here we introduce a measure of dissimilarity between the devices in a federated network, which is sufficient to prove convergence. This measure can also be satisfied via a simpler bounded-variance assumption on the gradients (Corollary 4), which we explore in our experiments in Section 5.

Definition 2 ($B$-local dissimilarity).
The local functions $F_k$ are $B$-locally dissimilar at $w$ if $\mathbb{E}_k[\|\nabla F_k(w)\|^2] \le \|\nabla f(w)\|^2 B^2$. We further define $B(w) = \sqrt{\frac{\mathbb{E}_k[\|\nabla F_k(w)\|^2]}{\|\nabla f(w)\|^2}}$ when $\|\nabla f(w)\| \neq 0$. (As an exception, we define $B(w) = 1$ when $\mathbb{E}_k[\|\nabla F_k(w)\|^2] = \|\nabla f(w)\|^2 = 0$, i.e., when $w$ is a stationary solution that all the local functions agree on.)
Here $\mathbb{E}_k[\cdot]$ denotes the expectation over devices with masses $p_k = n_k/n$ and $\sum_k p_k = 1$ (as in Equation 1). Definition 2 can be seen as a generalization of the IID assumption with bounded dissimilarity, while allowing for heterogeneity. As a sanity check, when all the local functions are the same, we have $B(w) = 1$ for all $w$. However, in the federated setting, the data distributions are often heterogeneous and $B > 1$ due to sampling discrepancies even if the samples are assumed to be IID. Let us also consider the case where the $F_k$'s are associated with empirical risk objectives. If the samples on all the devices are homogeneous, i.e., they are sampled in an IID fashion, then as $\min_k n_k \to \infty$, it follows that $B(w) \to 1$ for every $w$, as all the local functions converge to the same expected risk function in the large-sample limit. Thus, $B \ge 1$, and the larger the value of $B$, the larger is the dissimilarity among the local functions. Using Definition 2, we now state our formal dissimilarity assumption, which we use in our convergence analysis. This assumption simply requires that the dissimilarity defined in Definition 2 is bounded.
Assumption 1 (Bounded dissimilarity).
For some $\epsilon > 0$, there exists a $B_\epsilon$ such that for all points $w \in \mathcal{S}_\epsilon^c = \{w \mid \|\nabla f(w)\|^2 > \epsilon\}$, $B(w) \le B_\epsilon$.
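The quantity $B(w)$ in Definition 2 can be estimated directly from per-device gradients. The following is a minimal sketch (names are our own; $p$ holds the device masses $p_k$):

```python
import numpy as np

def dissimilarity_B(local_grads, p):
    # B(w) = sqrt( E_k ||grad F_k(w)||^2 / ||grad f(w)||^2 ), where
    # grad f(w) = sum_k p_k grad F_k(w) and E_k is the p_k-weighted mean.
    G = np.asarray(local_grads, dtype=float)
    p = np.asarray(p, dtype=float)
    global_grad = p @ G                      # grad f(w)
    num = p @ (G ** 2).sum(axis=1)           # E_k ||grad F_k(w)||^2
    den = (global_grad ** 2).sum()           # ||grad f(w)||^2
    return np.sqrt(num / den)
```

As the sanity check in the text suggests, identical local gradients give $B = 1$, while conflicting (e.g., orthogonal) gradients give $B > 1$: two devices with gradients $(1,0)$ and $(0,1)$ and equal masses yield $B = \sqrt{2}$.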
For most practical machine learning problems, there is no need to solve the problem to arbitrarily accurate stationary solutions, i.e., $\epsilon$ is typically not very small. Indeed, it is well known that solving the problem beyond some threshold may even hurt generalization performance due to overfitting (Yao et al., 2007). Although in practical federated learning problems the samples are not IID, they are also not sampled from entirely unrelated distributions (if they were, fitting a single global model across devices would be ill-advised). Thus, it is reasonable to assume that the dissimilarity between local functions remains bounded throughout the training process.
4.2 FedProx Analysis
Using the bounded dissimilarity assumption (Assumption 1), we now analyze the amount of expected decrease in the objective when one step of FedProx is performed.
Theorem 3 (Non-convex FedProx convergence: $B$-local dissimilarity).
Let Assumption 1 hold. Assume the functions $F_k$ are non-convex, $L$-Lipschitz smooth, and there exists $L_- > 0$ such that $\nabla^2 F_k \succeq -L_- \mathbf{I}$, with $\bar{\mu} := \mu - L_- > 0$. Suppose that $w^t$ is not a stationary solution and the local functions $F_k$ are $B$-dissimilar, i.e., $B(w^t) \le B$. If $\mu$, $K$, and $\gamma$ in Algorithm 2 are chosen such that

$$\rho = \left(\frac{1}{\mu} - \frac{\gamma B}{\mu} - \frac{B(1+\gamma)\sqrt{2}}{\bar{\mu}\sqrt{K}} - \frac{LB(1+\gamma)}{\bar{\mu}\mu} - \frac{L(1+\gamma)^2 B^2}{2\bar{\mu}^2} - \frac{LB^2(1+\gamma)^2}{\bar{\mu}^2 K}\left(2\sqrt{2K} + 2\right)\right) > 0,$$

then at iteration $t$ of Algorithm 2, we have the following expected decrease in the global objective:

$$\mathbb{E}_{S_t}[f(w^{t+1})] \le f(w^t) - \rho \|\nabla f(w^t)\|^2,$$

where $S_t$ is the set of $K$ devices chosen at iteration $t$.
We direct the reader to Appendix A for a detailed proof. The key steps include applying our notions of $\gamma$-inexactness for each local subproblem and bounded dissimilarity across the network, assuming that only $K$ devices are active at each round. This last step in particular introduces $\mathbb{E}_{S_t}$, the expectation with respect to the choice of devices $S_t$ in round $t$.
Theorem 3 uses the dissimilarity in Definition 2 to identify sufficient decrease at each iteration for FedProx. Below, we provide a corollary characterizing the performance with a more common (though slightly more restrictive) bounded variance assumption. This assumption is commonly employed, e.g., when analyzing methods such as SGD.
Corollary 4 (Bounded variance equivalence).
Let Assumption 1 hold. Then, in the case of bounded variance, i.e., $\mathbb{E}_k[\|\nabla F_k(w) - \nabla f(w)\|^2] \le \sigma^2$, for any $\epsilon > 0$ it follows that $B_\epsilon \le \sqrt{1 + \frac{\sigma^2}{\epsilon}}$.
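Corollary 4 rests on the bias-variance identity $\mathbb{E}_k[\|\nabla F_k\|^2] = \|\nabla f\|^2 + \mathbb{E}_k[\|\nabla F_k - \nabla f\|^2]$, which holds because $\nabla f$ is the $p_k$-weighted mean of the $\nabla F_k$. A quick numeric check of this identity (random data; our own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.full(5, 0.2)                     # uniform device masses p_k
G = rng.normal(size=(5, 3))             # per-device gradients grad F_k(w)
gf = p @ G                              # grad f(w) = E_k[grad F_k(w)]
var = p @ ((G - gf) ** 2).sum(axis=1)   # E_k ||grad F_k - grad f||^2
B_sq = (p @ (G ** 2).sum(axis=1)) / (gf ** 2).sum()
# Identity behind Corollary 4: B(w)^2 = 1 + var / ||grad f||^2; combining
# var <= sigma^2 with ||grad f||^2 > eps gives B_eps <= sqrt(1 + sigma^2/eps).
identity = 1.0 + var / (gf ** 2).sum()
```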
With Corollary 4 in place, we can restate the main result in Theorem 3 in terms of the bounded variance assumption.
Theorem 5 (Non-convex FedProx convergence: bounded variance).
Let the assertions of Theorem 3 hold. In addition, let the iterate $w^t$ be such that $\|\nabla f(w^t)\|^2 > \epsilon$, and let $\mathbb{E}_k[\|\nabla F_k(w^t) - \nabla f(w^t)\|^2] \le \sigma^2$ hold instead of the dissimilarity condition. If $\mu$, $K$, and $\gamma$ in Algorithm 2 are chosen such that $\rho > 0$, where $\rho$ is obtained by replacing $B$ with $\sqrt{1 + \sigma^2/\epsilon}$ in Theorem 3, then at iteration $t$ of Algorithm 2, we have the following expected decrease in the global objective:

$$\mathbb{E}_{S_t}[f(w^{t+1})] \le f(w^t) - \rho \|\nabla f(w^t)\|^2,$$

where $S_t$ is the set of $K$ devices chosen at iteration $t$.
The proof of Theorem 5 follows from that of Theorem 3 by noting the relationship between the bounded-variance assumption and the dissimilarity assumption, as given by Corollary 4. While the results thus far hold for non-convex $F_k$'s, we can also characterize the convergence for the special case of convex loss functions with exact minimization of the local objectives.
Corollary 6 (Convergence: convex case).
Let the assertions of Theorem 3 hold. In addition, let the $F_k$'s be convex and $\gamma = 0$, i.e., all the local problems are solved exactly. If $1 \ll B \le 0.5\sqrt{K}$, then we can choose $\mu \approx 6LB^2$, from which it follows that $\rho \approx \frac{1}{24LB^2}$.
We next provide sufficient conditions that ensure $\rho > 0$ in Theorems 3-5, so that sufficient decrease is attainable after each round.
Remark 7.
In order for $\rho$ in Theorem 3 to be positive, we need $\gamma B < 1$. Moreover, we also need $B/\sqrt{K} < 1$. These conditions help to quantify the trade-off between the dissimilarity bound ($B$) and the algorithm parameters ($\gamma$, $K$).
Finally, we can use the above sufficient decrease to characterize the rate of convergence to the set of approximate stationary solutions $\mathcal{S}_\epsilon = \{w \mid \|\nabla f(w)\|^2 \le \epsilon\}$ under the bounded dissimilarity assumption, Assumption 1. Note that these results hold for general non-convex $F_k$'s.
Theorem 8 (Convergence rate: FedProx).
Given some $\epsilon > 0$, assume that for $B \ge B_\epsilon$, $\mu$, $\gamma$, and $K$, the assumptions of Theorem 3 hold at each iteration of FedProx. Moreover, define $\Delta := f(w^0) - f^*$. Then, after $T = O(\frac{\Delta}{\rho\epsilon})$ iterations of FedProx, we have $\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla f(w^t)\|^2] \le \epsilon$.
Remark 9 (Comparison with SGD).
Note that FedProx achieves the same asymptotic convergence guarantee as SGD. In other words, under the bounded-variance assumption, for small $\epsilon$, if we replace $B$ with its upper bound from Corollary 4 and choose $\mu$ large enough, then the iteration complexity of FedProx when the subproblems are solved exactly and the $F_k$'s are convex is $O(\frac{L\Delta}{\epsilon} + \frac{L\Delta\sigma^2}{\epsilon^2})$, which is the same as that of SGD (Ghadimi & Lan, 2013).
To help provide context for the rate in Theorem 8, we compare it with SGD in the convex case in Remark 9. Note that a small $\epsilon$ in Assumption 1 translates to a larger $B_\epsilon$. Corollary 6 suggests that, in order to solve the problem to increasingly higher accuracies using FedProx, one needs to increase $\mu$ appropriately. Moreover, if we plug the upper bound for $B_\epsilon$ under the bounded-variance assumption (see Corollary 4) into Corollary 6, the number of steps required to achieve accuracy $\epsilon$ is $O(\frac{L\Delta}{\epsilon} + \frac{L\Delta\sigma^2}{\epsilon^2})$. Our analysis helps to characterize the performance and potential limitations of FedProx and similar methods when the local functions are dissimilar. In Section 5, we explore these ideas empirically. As a future direction, it would be interesting to quantify lower bounds for the convergence of methods such as FedProx and FedAvg in heterogeneous settings.
5 Experiments
We now present empirical results for the FedProx framework. In Section 5.2, we study the effect of statistical heterogeneity on the convergence of FedAvg and FedProx. We explore properties of the FedProx framework, such as the effect of the proximal parameter $\mu$ and the number of local epochs $E$, in Section 5.3. Finally, in Section 5.4, we show how empirical convergence is related to the bounded dissimilarity assumption (Assumption 1, Corollary 4) presented in Section 4. We provide thorough details of the experimental setup in Section 5.1 and Appendix D. All code, data, and experiments are publicly available at github.com/litian96/FedProx.
5.1 Experimental Details
We evaluate FedProx on diverse tasks, models, and realworld federated datasets. In order to characterize statistical heterogeneity and study its effect on convergence, we also evaluate on a set of synthetic data, which allows for more precise manipulation of heterogeneity.
Synthetic data. To generate synthetic data, we follow a setup similar to that described in (Shamir et al., 2014), additionally imposing heterogeneity among devices. In particular, for each device $k$, we generate synthetic samples $(X_k, Y_k)$ according to the model $y = \arg\max(\mathrm{softmax}(W_k x + b_k))$. We model $W_k \sim \mathcal{N}(u_k, 1)$, $b_k \sim \mathcal{N}(u_k, 1)$, $u_k \sim \mathcal{N}(0, \alpha)$; $x_k \sim \mathcal{N}(v_k, \Sigma)$, where the covariance matrix $\Sigma$ is diagonal with $\Sigma_{j,j} = j^{-1.2}$. Each element in the mean vector $v_k$ is drawn from $\mathcal{N}(B_k, 1)$, $B_k \sim \mathcal{N}(0, \beta)$. Therefore, $\alpha$ controls how much local models differ from each other and $\beta$ controls how much the local data at each device differs from that of other devices. We vary $\alpha$ and $\beta$ to generate three heterogeneous distributed datasets, Synthetic$(\alpha, \beta)$, as shown in Figure 1. We also generate one IID dataset by setting the same $W, b$ on all devices and setting $x_k$ to follow the same distribution. Our goal is to learn a global $W$ and $b$. Full details are given in Appendix D.1.

Real data. We also explore five real datasets, summarized in Table 1. These datasets are curated from prior work in federated learning as well as recent federated-learning benchmarks (McMahan et al., 2017; Caldas et al., 2018). We begin with a simple convex problem: classification on MNIST (LeCun et al., 1998) using multinomial logistic regression. To impose heterogeneity, we distribute the data among 1,000 devices such that each device has samples of only two digits and the number of samples per device follows a power law. We then study the more complex Federated Extended MNIST (FEMNIST) dataset (Cohen et al., 2017; Caldas et al., 2018) using the same model. Each device corresponds to a writer of the digits/characters. We also modify FEMNIST to create a more heterogeneous dataset, FEMNIST*. For the non-convex setting, we consider a text sentiment analysis task on tweets from Sentiment140 (Go et al., 2009) (Sent140) with an LSTM classifier, where each Twitter account corresponds to a device. Finally, we consider data from The Complete Works of William Shakespeare (McMahan et al., 2017), using an LSTM to predict the next character and associating each speaking role with a different device. Full details are provided in Appendix D.1.

Table 1: Statistics of real federated datasets.

| Dataset     | Devices | Samples | Samples/device (mean) | Samples/device (stdev) |
|-------------|---------|---------|-----------------------|------------------------|
| MNIST       | 1,000   | 69,035  | 69                    | 106                    |
| FEMNIST     | 900     | 305,654 | 340                   | 107                    |
| Shakespeare | 143     | 517,706 | 3,620                 | 4,115                  |
| Sent140     | 5,726   | 215,829 | 38                    | 19                     |
| FEMNIST*    | 200     | 79,059  | 395                   | 873                    |
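The Synthetic$(\alpha, \beta)$ generative process described above can be sketched for a single device as follows. This is a hedged illustration of our reading of the setup; the dimensions and exact constants are our own assumptions, with the authoritative details in Appendix D.1:

```python
import numpy as np

def synthetic_device(alpha, beta, n_samples, dim=60, n_classes=10, rng=None):
    # Sketch of Synthetic(alpha, beta) for one device: alpha spreads the
    # device's local model parameters, beta spreads its feature distribution.
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(0.0, alpha)                      # u_k ~ N(0, alpha)
    W = rng.normal(u, 1.0, size=(n_classes, dim))   # W_k ~ N(u_k, 1)
    b = rng.normal(u, 1.0, size=n_classes)          # b_k ~ N(u_k, 1)
    B_k = rng.normal(0.0, beta)                     # B_k ~ N(0, beta)
    v = rng.normal(B_k, 1.0, size=dim)              # local feature mean v_k
    cov = np.diag(np.arange(1, dim + 1, dtype=float) ** -1.2)  # Sigma_jj = j^-1.2
    X = rng.multivariate_normal(v, cov, size=n_samples)
    y = (X @ W.T + b).argmax(axis=1)                # y = argmax softmax(W_k x + b_k)
    return X, y
```

Setting $\alpha = \beta = 0$ still leaves device-specific $(W_k, b_k)$ draws; the IID variant in the text instead shares one $(W, b)$ and one feature distribution across all devices.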
Implementation. We implement FedAvg (Algorithm 1) and FedProx (Algorithm 2) in TensorFlow (Abadi et al., 2015). In order to draw a fair comparison with FedAvg, we employ SGD as the local solver for FedProx, and adopt a slightly different device sampling scheme than that in Algorithms 1 and 2: sampling devices uniformly and then averaging the updates with weights proportional to the number of local data points (as originally proposed in McMahan et al. (2017)). While this sampling scheme is not supported by our analysis, we observe similar relative behavior of FedProx vs. FedAvg whether or not it is employed. Interestingly, we also observe that the sampling scheme proposed herein in fact results in more stable performance for both methods (see Appendix D.4, Figure 9). This suggests an additional benefit of the proposed framework.

Setup. For each experiment, we tune the learning rate and the ratio of active devices per round, and report results using the hyperparameters that perform best for FedAvg. We randomly split the data on each local device into an 80% training set and a 20% testing set. For each comparison, we fix random seeds to ensure that the devices selected and the data read at each round are the same across all runs. We report all metrics based on the global objective $f(w)$. Note that FedAvg ($\mu = 0$) and FedProx ($\mu > 0$) perform the same amount of work at each round when the number of local epochs, $E$, is the same; we therefore report results in terms of communication rounds rather than FLOPs or wall-clock time.
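The two device-sampling and aggregation schemes contrasted above can be sketched as follows. This is a hedged illustration, not the authors' implementation; the function name and interface are our own:

```python
import numpy as np

def aggregate(updates, n_samples, K, scheme, rng):
    # 'weighted': sample K devices uniformly, then average their updates with
    # weights proportional to local sample counts n_k (McMahan et al., 2017).
    # 'uniform': sample K devices with probability p_k = n_k / n, then average
    # the updates with equal weights (as in Algorithms 1 and 2).
    ids = np.arange(len(updates))
    n = np.asarray(n_samples, dtype=float)
    if scheme == "weighted":
        chosen = rng.choice(ids, size=K, replace=False)
        w = n[chosen] / n[chosen].sum()
    elif scheme == "uniform":
        chosen = rng.choice(ids, size=K, replace=False, p=n / n.sum())
        w = np.full(K, 1.0 / K)
    else:
        raise ValueError(scheme)
    return sum(wk * updates[k] for wk, k in zip(w, chosen))
```

Both schemes produce a convex combination of the selected updates; they differ only in where the $n_k$-proportional weighting enters (sampling vs. averaging), which is exactly the distinction the analysis is sensitive to.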
5.2 Effect of Statistical Heterogeneity
In Figure 1, we study how statistical heterogeneity affects convergence when the number of local epochs is large, using four synthetic datasets. We fix $E$ to be 20. From left to right, as the data become more heterogeneous, convergence becomes worse for FedProx with both $\mu = 0$ (FedAvg) and $\mu > 0$. However, while $\mu > 0$ may slow convergence for IID data, a larger $\mu$ is particularly useful in heterogeneous settings, as is evident from Figure 1. This indicates that the modified subproblem introduced in FedProx can benefit practical federated settings with varying statistical heterogeneity. In the sections to follow, we see similar results in our non-synthetic experiments.
5.3 Properties of FedProx Framework
The key parameters of FedProx that affect performance are the number of local epochs, $E$, and the proximal term scaled by $\mu$. Intuitively, a large $E$ may cause local models to drift too far away from the initial starting point, thus leading to potential divergence, while a large $\mu$ can restrict the trajectory of the iterates by constraining them to be closer to the global model, potentially slowing convergence. We study FedProx under different values of $E$ and $\mu$ using the federated datasets described in Table 1.
Dependence on $E$.
We explore the effect of $E$ in Figure 2. For each dataset, we set $E$ to be 1, 20, and 50 while keeping $\mu = 0$ (FedProx reduces to FedAvg in this case) and plot the convergence in terms of the training loss. Note that we also observe similar trends in terms of test accuracy (all plots are provided in Appendix D.3). We see that a large $E$ leads to divergence or instability on the MNIST and Shakespeare datasets. On FEMNIST and Sent140, nevertheless, a larger $E$ speeds up convergence. Based on the conclusions drawn from Figure 1, we hypothesize that this is because the data distributed across devices after partitioning FEMNIST and Sent140 lack significant heterogeneity. We validate this hypothesis by observing instability on FEMNIST*, a more skewed variant of the FEMNIST dataset. Moving forward, we therefore demonstrate the impact of $\mu$ using FEMNIST* instead of FEMNIST.
We note here that a potential approach to handle the observed divergence or instability of FedAvg with a large number of local epochs would be to simply keep $E$ small. However, this requires $E$ to be carefully and heuristically tuned across the network, and it precludes the possibility of exactly solving the local subproblems. A large number of local epochs, $E$, may be particularly useful in practice when communication is expensive (which is common in federated networks), to help balance communication and computation. Indeed, in such situations increasing $\mu$ can improve stability. In Figure 5, for example, we show that FedProx with a large $E$ and an appropriately chosen $\mu > 0$ leads to faster and more stable convergence in terms of communication rounds, compared with a small $E$ with $\mu = 0$ (slow convergence) and a large $E$ with $\mu = 0$ (unstable convergence).
Dependence on $\mu$.
We consider the effect of $\mu$ on convergence in Figure 3. For each experiment, we compare the results between $\mu = 0$ and the best $\mu > 0$. For three out of the four datasets (all but Sent140), we observe that an appropriate $\mu$ can increase the stability of otherwise unstable methods and can force divergent methods to converge. It also increases the accuracy in most cases (see the 'testing accuracy' panels of Figures 6 and 8 in Appendix D.3). In practice, $\mu$ can be chosen based on specific data characteristics and communication patterns. Future work includes developing methods to automatically set and tune this parameter for heterogeneous datasets, based, e.g., on the theoretical groundwork provided in this work.
5.4 Dissimilarity Measurement and Divergence
Finally, in Figure 4, we demonstrate that our $B$-local dissimilarity measure in Definition 2 captures the heterogeneity of datasets and is therefore an appropriate proxy for performance. In particular, we track the variance of the gradients on each device, $\mathbb{E}_k[\|\nabla F_k(w) - \nabla f(w)\|^2]$, which is closely related to our dissimilarity measure through the bounded-variance equivalence (Corollary 4). Empirically, either decreasing $E$ (Figure 7 in Appendix D.3) or increasing $\mu$ (Figure 4) leads to smaller dissimilarity among the local functions. We also observe that the dissimilarity metric is consistent with the training loss: smaller dissimilarity indicates better convergence, which can be enforced by setting $\mu$ appropriately. Full results tracking this metric (for all experiments performed) are provided in Appendix D.3.
6 Conclusion
In this work, we have proposed FedProx, a distributed optimization framework that tackles the statistical heterogeneity inherent in federated networks. We have formalized statistical heterogeneity through a novel device dissimilarity assumption which allowed us to characterize the convergence of FedProx. Our empirical evaluation across a suite of federated datasets has validated our theoretical analysis and demonstrated that the FedProx framework can improve convergence behavior in realistic heterogeneous federated networks.
Acknowledgement.
We thank Jakub Konečný, Brendan McMahan, Nathan Srebro, and Jianyu Wang for their helpful discussions. This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.
References
 Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Boyd et al. (2010) Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2010.
 Caldas et al. (2018) Caldas, S., Wu, P., Li, T., Konečnỳ, J., McMahan, H. B., Smith, V., and Talwalkar, A. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
 Cohen et al. (2017) Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
 Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2012.
 Dekel et al. (2012) Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
 Ghadimi & Lan (2013) Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Go et al. (2009) Go, A., Bhayani, R., and Huang, L. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12), 2009.
 Huang et al. (2018) Huang, L., Yin, Y., Fu, Z., Zhang, S., Deng, H., and Liu, D. LoAdaBoost: Loss-based AdaBoost federated machine learning on medical data. arXiv preprint arXiv:1811.12629, 2018.
 Jeong et al. (2018) Jeong, E., Oh, S., Kim, H., Park, J., Bennis, M., and Kim, S.-L. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.
 Kaczmarz (1993) Kaczmarz, S. Approximate solution of systems of linear equations. International Journal of Control, 57(6):1269–1271, 1993.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Li et al. (2014) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014.
 Lin et al. (2018) Lin, T., Stich, S. U., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.

 McMahan et al. (2017) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.
 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.
 Richtárik & Takáč (2016) Richtárik, P. and Takáč, M. Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research, 17:1–25, 2016.
 Schmidt & Roux (2013) Schmidt, M. and Roux, N. L. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
 Shamir et al. (2014) Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pp. 1000–1008, 2014.
 Smith et al. (2017) Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434, 2017.
 Smith et al. (2018) Smith, V., Forte, S., Ma, C., Takac, M., Jordan, M. I., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:1–47, 2018.
 Stich (2018) Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 Wang & Joshi (2018) Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
 Wang et al. (2018) Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. arXiv preprint arXiv:1804.05271, 2018.
 Woodworth et al. (2018) Woodworth, B., Wang, J., McMahan, B., and Srebro, N. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. arXiv preprint arXiv:1805.10222, 2018.
 Yao et al. (2007) Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
 Yu et al. (2018) Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD for nonconvex optimization with faster convergence and less communication. In AAAI Conference on Artificial Intelligence, 2018.
 Zhang et al. (2015) Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.
 Zhang et al. (2013) Zhang, Y., Duchi, J. C., and Wainwright, M. J. Communicationefficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
 Zhao et al. (2018) Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
 Zhou & Cong (2017) Zhou, F. and Cong, G. On the convergence properties of a step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.
Appendix A Proof of Theorem 3
Proof.
Using our notion of inexactness for each local solver (Definition 1), we can define an error term such that:
(3) 
Now let us define . Based on this definition, we know
(4) 
Let us define and . Then, due to the strong convexity of , we have
(5) 
Note that once again, due to the strong convexity of , we know that . Now we can use the triangle inequality to get
(6) 
Therefore,
(7) 
where the last inequality is due to the bounded dissimilarity assumption.
Now let us define such that , i.e. . We can bound :
(8) 
where the last inequality is also due to the bounded dissimilarity assumption. Based on the L-Lipschitz smoothness of the objective and a Taylor expansion, we have
(9) 
From the above inequality, it follows that if we set the penalty parameter large enough, we can obtain a decrease in the objective value that is proportional to the quantity above. However, this is not quite how the algorithm works: in each round, the algorithm uses only a randomly chosen subset of devices to approximate the update. So, to bound the corresponding expectation, we use the local Lipschitz continuity of the function.
(10) 
where is the local Lipschitz continuity constant of function and we have
(11) 
Therefore, taking the expectation with respect to the choice of devices in a given round, we need to bound
(12) 
where . Note that the expectation is taken over the random choice of devices to update.
(13) 
From (7), we have that . Moreover,
(14) 
and
(15) 
where the first inequality is a result of the devices being chosen at random, and the last inequality is due to the bounded dissimilarity assumption. Substituting these bounds into (13), we get
(16) 
Combining (9), (10), (12), and (16), and using the notation introduced above, we get
∎
Appendix B Proof of Corollary 4
We have,
Appendix C Proof of Corollary 6
In the convex case, when all subproblems are solved accurately, we obtain a proportional decrease in the objective provided enough devices participate in each round. In such a case, under the stated assumption, we can write
(17) 
In this case, if we choose we get
(18) 
Note that the expectation in (18) is a conditional expectation, conditioned on the previous iterate. Taking the expectation of both sides and telescoping, we see that the number of iterations required to generate at least one solution with sufficiently small squared gradient norm is as stated.
Appendix D Experimental Details
D.1 Datasets and Models
Here we provide full details on the datasets and models used in our experiments. We curate a diverse set of non-synthetic datasets, including those used in prior work on federated learning (McMahan et al., 2017), and some proposed in LEAF, a benchmark for federated settings (Caldas et al., 2018). We also create synthetic data to directly test the effect of heterogeneity on convergence, as in Section 5.1.

Synthetic: We set the heterogeneity parameters to (0,0), (0.5,0.5), and (1,1), respectively, to generate three non-identically distributed datasets (Figure 1). For the IID data, we use the same model on all devices and draw the local data from a common distribution, where each element of the mean vector is drawn from a standard Gaussian and the covariance matrix is diagonal. For all synthetic datasets, the number of samples on each device follows a power law.
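A rough sketch of this kind of synthetic generation follows; all constants, the function name, and the power-law exponent are chosen for illustration rather than taken from the paper:

```python
import random

# Sketch of statistically heterogeneous synthetic devices: each device draws
# its own Gaussian mean, and per-device sample counts follow a power law, so
# both the data distribution and the data volume differ across devices.

def synthetic_devices(num_devices, dim, hetero, seed=0):
    rng = random.Random(seed)
    devices = []
    for _ in range(num_devices):
        # Larger `hetero` spreads the per-device means further apart.
        mean = [rng.gauss(0.0, hetero) for _ in range(dim)]
        # Power-law sample counts: a few devices hold most of the data.
        n_samples = int(50 * rng.paretovariate(1.5))
        data = [[rng.gauss(m, 1.0) for m in mean] for _ in range(n_samples)]
        devices.append(data)
    return devices

devices = synthetic_devices(num_devices=10, dim=5, hetero=1.0)
```

Setting `hetero=0` recovers an IID-style configuration in which every device draws from the same zero-mean distribution, while larger values increase the disagreement between local distributions.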

MNIST: We study image classification of handwritten digits 0–9 in MNIST (LeCun et al., 1998) using multinomial logistic regression. To simulate a heterogeneous setting, we distribute the data among the devices such that each device has samples of only a small subset of the digits, and the number of samples per device follows a power law. The input of the model is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 9.
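A label-skewed, power-law partition of this kind might be sketched as follows; the helper name and all constants are illustrative assumptions, not the paper's preprocessing code:

```python
import random
from collections import defaultdict

# Sketch of a label-skewed federated partition: each device receives samples
# drawn from only a few classes, with power-law sample counts across devices.

def partition_by_label(labels, num_devices, classes_per_device=2, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    partition = []
    for _ in range(num_devices):
        chosen = rng.sample(classes, classes_per_device)  # this device's classes
        n = int(5 * rng.paretovariate(1.5))               # power-law device size
        pool = [i for c in chosen for i in by_class[c]]
        partition.append([rng.choice(pool) for _ in range(n)])
    return partition

# Toy labels: 100 samples over 10 classes.
labels = [i % 10 for i in range(100)]
shards = partition_by_label(labels, num_devices=20)
```

Each shard then touches at most `classes_per_device` classes, which is the source of the statistical heterogeneity that these experiments probe.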

FEMNIST: We study an image classification problem on the 62-class EMNIST dataset (Cohen et al., 2017) using multinomial logistic regression. Each device corresponds to a writer of the digits/characters in EMNIST. We call this federated version of EMNIST FEMNIST. The input of the model is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 61.

Shakespeare: This is a dataset built from The Complete Works of William Shakespeare (McMahan et al., 2017). Each speaking role in a play represents a different device. We use a two-layer LSTM classifier containing 100 hidden units with an 8D embedding layer. The task is next-character prediction, and there are 80 character classes in total. The model takes as input a sequence of 80 characters, embeds each character into a learned 8-dimensional space, and outputs one character per training sample after 2 LSTM layers and a densely-connected layer.

Sent140: In non-convex settings, we consider a text sentiment analysis task on tweets from Sentiment140 (Go et al., 2009) (Sent140), with a two-layer LSTM binary classifier containing 256 hidden units and a pretrained 300D GloVe embedding (Pennington et al., 2014). Each Twitter account corresponds to a device. The model takes as input a sequence of 25 characters, embeds each character into a 300-dimensional space via the GloVe lookup, and outputs a binary sentiment label after 2 LSTM layers and a densely-connected layer.

FEMNIST*: We generate FEMNIST* by subsampling the lowercase characters from FEMNIST and distributing only 20 classes to each device. There are 200 devices in total. The model is the same as the one used on FEMNIST.
D.2 Implementation Details
(Machines) We simulate the federated learning setup (1 server and multiple devices) on a commodity machine with 2 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA 1080 Ti GPUs.
(Hyperparameters) For each dataset, we tune the fraction of active clients per round from {0.01, 0.05, 0.1} using FedAvg. For the synthetic datasets, roughly 10% of the devices are active at each round. For MNIST, FEMNIST, Shakespeare, Sent140, and FEMNIST*, the fractions of active devices are 1%, 5%, 10%, 1%, and 5%, respectively. We also perform a grid search over the learning rate using FedAvg, and we do not decay the learning rate during training. For all synthetic data experiments, the learning rate is 0.01. For MNIST, FEMNIST, Shakespeare, Sent140, and FEMNIST*, we use learning rates of 0.03, 0.003, 0.8, 0.3, and 0.003, respectively. We use a batch size of 10 for all experiments.
(Libraries) All code is implemented in TensorFlow (Abadi et al., 2015), Version 1.10.1. Please see our anonymized code submission for full details.
D.3 Full Experiments
We present the testing accuracy, training loss, and dissimilarity measurements for all experiments in Figure 6, Figure 7, and Figure 8.
D.4 FedProx with Two Device Sampling Schemes
We show the training loss, testing accuracy, and dissimilarity measurements of FedProx under two different device sampling schemes in Figure 9.
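Two plausible device sampling schemes of this kind can be sketched as follows; the exact schemes compared in Figure 9 are not reproduced here, so both variants below are assumptions for illustration:

```python
import random

# Scheme (a): sample devices with probability proportional to their number of
# local samples (so large devices are picked more often).
# Scheme (b): sample devices uniformly at random without replacement.

def sample_proportional(num_samples, k, rng):
    """k draws weighted by local dataset size (with replacement)."""
    devices = list(range(len(num_samples)))
    return rng.choices(devices, weights=num_samples, k=k)

def sample_uniform(num_samples, k, rng):
    """k draws uniformly at random (without replacement)."""
    return rng.sample(range(len(num_samples)), k)

rng = random.Random(0)
sizes = [100, 10, 10, 10, 10]  # one device holds most of the data
prop = sample_proportional(sizes, k=3, rng=rng)
unif = sample_uniform(sizes, k=3, rng=rng)
```

The two schemes induce different expectations over the aggregated update, which is why the averaging weights on the server must be chosen to match the sampling distribution.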