On the Convergence of Federated Optimization in Heterogeneous Networks

12/14/2018 · by Anit Kumar Sahu, et al.

The burgeoning field of federated learning involves training machine learning models in massively distributed networks, and requires the development of novel distributed optimization techniques. Federated averaging (FedAvg) is the leading optimization method for training non-convex models in this setting, exhibiting impressive empirical performance. However, the behavior of FedAvg is not well understood, particularly when considering data heterogeneity across devices in terms of sample sizes and underlying data distributions. In this work, we ask the following two questions: (1) Can we gain a principled understanding of FedAvg in realistic federated settings? (2) Given our improved understanding, can we devise an improved federated optimization algorithm? To this end, we propose and introduce FedProx, which is similar in spirit to FedAvg, but more amenable to theoretical analysis. We characterize the convergence of FedProx under a novel device dissimilarity assumption.


1 Introduction

Large networks of remote devices, such as phones, vehicles, and wearable sensors, generate a wealth of data each day. Due to user privacy concerns and systems constraints (i.e., high communication costs, device-level computational constraints, and low availability amongst devices), federated learning has emerged as an increasingly attractive paradigm to push the training of statistical models in such networks to the edge (McMahan et al., 2017).

Optimization methods that allow for local updating and low participation have become the de facto solvers for federated learning (McMahan et al., 2017; Smith et al., 2017). These methods perform a variable number of local updates on a subset of devices to enable flexible and efficient communication patterns, e.g., compared to traditional distributed gradient descent or stochastic gradient descent (SGD). Of current federated optimization methods, FedAvg (McMahan et al., 2017) has become state-of-the-art for non-convex federated learning. FedAvg works simply by running some number of epochs, E, of SGD on a subset of the total devices at each communication round, and then averaging the resulting model updates via a central server.

However, FedAvg was not designed to tackle the statistical heterogeneity inherent in federated settings; namely, that data may be non-identically distributed across devices, with the number of data points per device varying significantly. In realistic statistically heterogeneous settings, FedAvg has been shown to diverge empirically (McMahan et al., 2017, Sec 3), and it also lacks theoretical convergence guarantees. Indeed, recent works exploring guarantees are limited to unrealistic scenarios, e.g., where (1) the data is either shared across devices or distributed in an IID (independent and identically distributed) manner, or (2) all devices are involved in communication at each round (Zhou & Cong, 2017; Stich, 2018; Wang & Joshi, 2018; Woodworth et al., 2018; Yu et al., 2018; Wang et al., 2018). While these assumptions simplify the analyses, they also violate key properties of realistic federated networks.

Contributions. In this work, we ask the following two questions: (1) Can we gain a principled understanding of FedAvg in realistic, statistically heterogeneous federated settings? (2) Can we devise an improved federated optimization algorithm, both theoretically and empirically? To this end, we propose a novel federated optimization framework, FedProx, which encompasses FedAvg. In order to characterize the convergence behavior of FedProx as a function of statistical heterogeneity, we introduce a novel device dissimilarity assumption. Under this assumption, we provide the first convergence guarantees for FedProx in practical heterogeneous data settings. Furthermore, through a set of experiments on numerous real-world federated datasets, we demonstrate that our theoretical assumptions reflect empirical performance, and that FedProx can improve the robustness and stability of convergence over FedAvg when data is heterogeneous across devices.

2 Related Work

Large-scale distributed machine learning, particularly in data center settings, has motivated the development of numerous distributed optimization methods in the past decade (see, e.g., Boyd et al., 2010; Dekel et al., 2012; Dean et al., 2012; Zhang et al., 2013; Li et al., 2014; Shamir et al., 2014; Zhang et al., 2015; Richtárik & Takáč, 2016; Smith et al., 2018). However, as computing substrates such as phones, sensors, and wearable devices grow both in power and in popularity, it is increasingly attractive to learn statistical models directly over networks of distributed devices, as opposed to moving the data to the data center. This problem, known as federated learning, requires tackling novel challenges with privacy, heterogeneous data and devices, and massively distributed computational networks.

Recent optimization methods have been proposed that are tailored to the specific challenges in the federated setting. These methods have shown significant improvements over traditional distributed approaches such as ADMM (Boyd et al., 2010) or mini-batch methods (Dekel et al., 2012) by allowing for inexact local updating in order to balance communication vs. computation in large networks, and by allowing a small subset of devices to be active at any communication round (McMahan et al., 2017; Smith et al., 2017; Lin et al., 2018). For example, Smith et al. (2017) propose a communication-efficient primal-dual optimization method that learns separate but related models for each device through a multi-task learning framework. Despite the theoretical guarantees and practical efficiency of the proposed method, such an approach is not generalizable to non-convex problems, e.g., deep learning, where strong duality is no longer guaranteed. In the non-convex setting, Federated Averaging (FedAvg), a heuristic method based on averaging local Stochastic Gradient Descent (SGD) updates in the primal, has instead been shown to work well empirically (McMahan et al., 2017).

Unfortunately, FedAvg is quite challenging to analyze due to its local updating scheme, the fact that few devices are active at each round, and the issue that data is frequently distributed heterogeneously across the network. In particular, as each device in the network generates its own local data, statistical heterogeneity is common, with data being non-identically distributed between devices. Recent works have made steps towards analyzing FedAvg in simpler, non-federated settings. For instance, parallel SGD and related variants (Zhang et al., 2015; Zhou & Cong, 2017; Stich, 2018; Wang & Joshi, 2018; Woodworth et al., 2018), which make local updates similar to FedAvg, have been studied in the IID setting. However, the results rely on the premise that each local SGD is a copy of the same stochastic process (due to the IID assumption); this line of reasoning does not apply to the heterogeneous setting. Although some works (Yu et al., 2018; Wang et al., 2018) have recently explored convergence guarantees in heterogeneous settings, they make the limiting assumption that all devices participate in each round of communication, which is often infeasible in realistic federated networks (McMahan et al., 2017). Further, they rely on specific solvers to be used on each device (either SGD or GD), as compared to the solver-agnostic framework proposed herein, and add additional assumptions of convexity (Wang et al., 2018) or uniformly bounded gradients (Yu et al., 2018) to derive convergence guarantees.

There are also several recent heuristic approaches that aim to tackle statistical heterogeneity, either by sharing the local device data or some server-side proxy data (Jeong et al., 2018; Zhao et al., 2018; Huang et al., 2018). However, these methods may be unrealistic in practical federated settings: in addition to imposing burdens on network bandwidth, sending local data to the server (Jeong et al., 2018) violates the key privacy assumption of federated learning, and sending globally-shared proxy data to all devices (Zhao et al., 2018; Huang et al., 2018) requires effort to carefully generate or collect such a dataset.

In this work, inspired by FedAvg, we propose a broader framework, FedProx, that is capable of handling heterogeneous federated data while maintaining similar privacy and computational benefits. We analyze the convergence behavior of the framework under a novel dissimilarity assumption between the local functions. Our assumption is inspired by the Kaczmarz method for solving linear systems of equations (Kaczmarz, 1993); similar assumptions have been used to analyze variants of SGD for strongly convex problems in non-distributed settings (see, e.g., Schmidt & Roux, 2013). Our proposed framework allows for improved robustness and stability of convergence in heterogeneous federated networks.

3 Federated Optimization: Algorithms

In this section, we introduce the key ingredients behind recent methods for federated learning, including FedAvg, and then outline our proposed framework, FedProx. Federated learning methods (e.g., McMahan et al., 2017; Smith et al., 2017; Lin et al., 2018) are designed to handle multiple devices (we use the term 'device' throughout the paper to describe entities in the network, e.g., nodes, clients, phones, or sensors) collecting data and a central server coordinating the global learning objective across the network. In particular, the aim is to minimize the following global objective function:

min_w f(w) := Σ_{k=1}^N p_k F_k(w).    (1)

where N is the number of devices, p_k ≥ 0, and Σ_k p_k = 1. In general, the local objectives F_k are given by local empirical risks over possibly differing data distributions D_k, i.e., F_k(w) := E_{x_k ~ D_k}[f_k(w; x_k)], with n_k samples available locally at device k. Hence, we can set p_k = n_k/n, where n = Σ_k n_k is the total number of data points. In this work, we consider f(w) to be possibly non-convex.
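
To make the weighting in Equation (1) concrete, here is a minimal sketch (our own illustration, not code from the paper; `local_risks` and `n_samples` are hypothetical names) of evaluating the global objective from per-device empirical risks:

```python
import numpy as np

def global_objective(w, local_risks, n_samples):
    """Evaluate f(w) = sum_k p_k F_k(w), with p_k = n_k / n.

    local_risks: list of callables F_k(w); n_samples: each device's n_k.
    """
    n = float(np.sum(n_samples))  # total number of data points across devices
    return sum((n_k / n) * F_k(w) for F_k, n_k in zip(local_risks, n_samples))
```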

To reduce communication and handle systems constraints, a common theme in federated optimization methods is that on each device, a local objective function based on the device’s data is used as a surrogate for the global objective function. At each outer iteration, a subset of the devices are selected and local solvers are used to optimize the local objective functions on each of the selected devices. The devices then communicate their local model updates to the central server, which aggregates them and updates the global model accordingly. The key to allowing flexible performance in this scenario is that each of the local objectives can be solved inexactly. This allows the amount of local computation vs. communication to be tuned based on the number of local iterations that are performed (with additional local iterations corresponding to more exact local solutions). We introduce this notion formally below, as it will be utilized throughout the paper.

Definition 1 (γ-inexact solution).

For a smooth convex function h(w; w_0) := F(w) + (μ/2)‖w − w_0‖², and γ ∈ [0, 1], we say w* is a γ-inexact solution of min_w h(w; w_0) if ‖∇h(w*; w_0)‖ ≤ γ‖∇h(w_0; w_0)‖, where ∇h(w; w_0) = ∇F(w) + μ(w − w_0). Note that a smaller γ corresponds to higher accuracy.

For full generality, we use γ-inexactness in our analysis (Section 4) to measure the amount of local computation from each local solver. However, in our experiments (Section 5) we simply run an iterative local solver for some number of local epochs, E, as in FedAvg (Algorithm 1). The number of local epochs can be seen as a proxy for γ-inexactness, and it is straightforward (albeit notationally burdensome) to extend our analysis to directly cover this case by allowing γ to vary by iteration and device, similar to the analysis in Smith et al. (2017).
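
As a concrete reading of Definition 1, the following sketch (again our illustration, with a toy quadratic F) checks whether a candidate point is a γ-inexact solution of h(w; w_0) = F(w) + (μ/2)‖w − w_0‖²:

```python
import numpy as np

def is_gamma_inexact(grad_F, w_star, w0, mu, gamma):
    """Check ||grad h(w*; w0)|| <= gamma * ||grad h(w0; w0)|| per Definition 1.

    grad_F: callable returning the gradient of F;
    grad h(w; w0) = grad F(w) + mu * (w - w0).
    """
    grad_h_star = grad_F(w_star) + mu * (w_star - w0)
    grad_h_w0 = grad_F(w0)  # the proximal term vanishes at w = w0
    return np.linalg.norm(grad_h_star) <= gamma * np.linalg.norm(grad_h_w0)

# Toy check with F(w) = 0.5*||w||^2 and mu = 1: the exact minimizer w0/2
# has zero surrogate gradient, so it is gamma-inexact for any gamma >= 0.
w0 = np.array([1.0, -2.0])
assert is_gamma_inexact(lambda w: w, w0 / 2, w0, mu=1.0, gamma=0.1)
```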

3.1 Federated Averaging (FedAvg)

In Federated Averaging (FedAvg) (McMahan et al., 2017), the local surrogate of the global objective function at device k is taken to be F_k(w), and the local solver is chosen to be stochastic gradient descent (SGD), which is homogeneous across devices in terms of the algorithm hyperparameters, i.e., the learning rate and the number of local epochs. At each round, a subset S_t of K ≤ N devices is selected, SGD is run locally for some number of epochs, E, and the resulting model updates are averaged. The details of FedAvg are summarized in Algorithm 1.

INPUT: K, T, η, E, w^0, N, p_k for k = 1, …, N;
for t = 0, …, T−1 do
        Server chooses a subset S_t of K devices at random (each device k is chosen with probability p_k);
        Server sends w^t to all chosen devices;
        Each chosen device k ∈ S_t updates w^t for E epochs of SGD on F_k with step-size η to obtain w_k^{t+1};
        Each chosen device k sends w_k^{t+1} back to the server;
        Server aggregates the w_k^{t+1}'s as w^{t+1} = (1/K) Σ_{k ∈ S_t} w_k^{t+1};

Algorithm 1 Federated Averaging (FedAvg)
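
For concreteness, a minimal NumPy sketch of Algorithm 1 follows. This is our illustration rather than the paper's implementation: it substitutes E full-batch gradient steps for E epochs of mini-batch SGD, and the device abstraction (`grad_fns`, `probs`) is hypothetical:

```python
import numpy as np

def fedavg(grad_fns, probs, w0, T, K, E, eta, seed=0):
    """grad_fns[k] returns grad F_k(w); probs[k] = p_k = n_k / n (sums to 1)."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(T):
        # Server samples K devices (device k chosen with probability p_k).
        chosen = rng.choice(len(grad_fns), size=K, replace=False, p=probs)
        local_models = []
        for k in chosen:
            w_k = w.copy()
            for _ in range(E):                 # E local passes over the data
                w_k -= eta * grad_fns[k](w_k)  # local step on F_k
            local_models.append(w_k)
        w = np.mean(local_models, axis=0)      # server averages the local models
    return w
```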

McMahan et al. show empirically that it is crucial to tune the optimization hyperparameters of FedAvg properly in heterogeneous settings. In particular, carefully tuning the number of local epochs, E, is critical for FedAvg to converge, as a larger number of local epochs allows local models to move further away from the initial global model, which can potentially lead to divergence. Intuitively, with dissimilar (heterogeneous) local objectives F_k, a larger number of local epochs may lead each device towards the optimum of its local objective as opposed to the global objective. Therefore, in a heterogeneous setting, where the local objectives may be quite different from the global one, it is beneficial to restrict the amount of local deviation through a more principled tool than heuristically limiting the number of local epochs of some iterative solver. Indeed, a natural way to limit local model updates is to instead incorporate a term that penalizes large changes from the current model at the server. This observation serves as inspiration for FedProx, introduced below.

3.2 Proposed Framework: FedProx

Our proposed framework, FedProx, is similar to FedAvg in that a subset of devices is selected at each round, local updates are performed, and these updates are then averaged to form a global update. However, instead of minimizing the local function F_k directly, device k uses its local solver of choice to approximately minimize the following surrogate objective h_k:

h_k(w; w^t) = F_k(w) + (μ/2)‖w − w^t‖².    (2)

The proximal term in the above expression effectively limits the impact of local updates (by restricting them to be close to the current global model w^t) without any need to manually tune the number of local epochs as in FedAvg. We summarize the steps of FedProx in Algorithm 2.

INPUT: K, T, μ, γ, w^0, N, p_k for k = 1, …, N;
for t = 0, …, T−1 do
        Server chooses a subset S_t of K devices at random (each device k is chosen with probability p_k);
        Server sends w^t to all chosen devices;
        Each chosen device k ∈ S_t finds a w_k^{t+1} which is a γ-inexact minimizer of: w_k^{t+1} ≈ argmin_w h_k(w; w^t) = F_k(w) + (μ/2)‖w − w^t‖²;
        Each chosen device k sends w_k^{t+1} back to the server;
        Server aggregates the w_k^{t+1}'s as w^{t+1} = (1/K) Σ_{k ∈ S_t} w_k^{t+1};

Algorithm 2 FedProx (Proposed Framework)

In our experiments (Section 5.2), we see that the modified local subproblem in FedProx results in more robust and stable convergence compared to vanilla FedAvg on heterogeneous datasets. As we will see in Section 4, the use of the proximal term in FedProx also makes it more amenable to theoretical analysis. Note that FedAvg is a special case of FedProx with μ = 0 and the local solver specifically chosen to be SGD. Our proposed framework is significantly more general in this regard, as any local (possibly non-iterative) solver can be used on each device.
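
To illustrate the subproblem in Equation (2), here is a sketch of a single device's FedProx update, assuming plain gradient steps as the local solver (our illustration; setting mu = 0 recovers the FedAvg-style local update):

```python
import numpy as np

def fedprox_local_update(grad_Fk, w_t, mu, eta, num_steps):
    """Approximately minimize h_k(w; w^t) = F_k(w) + (mu/2)*||w - w^t||^2."""
    w_t = np.asarray(w_t, dtype=float)
    w = w_t.copy()
    for _ in range(num_steps):
        g = grad_Fk(w) + mu * (w - w_t)  # gradient of the proximal surrogate h_k
        w -= eta * g
    return w
```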

Finally, we note here a connection to elastic averaging SGD (EASGD) (Zhang et al., 2015), which was proposed as a way to train deep networks in the data center setting and uses a similar proximal term in its objective. While the intuition is similar (the proximal term helps to prevent large deviations on each device/machine), EASGD employs a more complex moving average to update parameters, is limited to using SGD as a local solver, and has only been analyzed for simple quadratic problems.

4 FedProx: Convergence Analysis

FedAvg and FedProx are stochastic algorithms by nature; in each step of these algorithms, only a fraction of the devices are sampled to perform the update, and the updates performed on each device may be inexact. It is well known that in order for stochastic methods to converge to a stationary point, a decreasing step-size is required. This is in contrast to non-stochastic methods, e.g. gradient descent, that can find a stationary point by employing a constant step-size. In order to analyze the convergence behavior of methods with constant step-size (which is what is usually implemented in practice) we need to be able to quantify the degree of dissimilarity among the local objective functions. This could be achieved by assuming the data to be IID, i.e., homogeneous across devices. Unfortunately, in realistic federated networks, this assumption is impractical. Thus, we propose a metric that specifically measures the dissimilarity among local functions (Section 4.1) and analyze FedProx under this assumption (Section 4.2).

4.1 Local dissimilarity

Here we introduce a measure of dissimilarity between the devices in a federated network, which is sufficient to prove convergence. The same quantity can also be bounded via a simpler bounded-variance assumption on the gradients (Corollary 4), which we explore in our experiments in Section 5.

Definition 2 (B-local dissimilarity).

The local functions F_k are said to be B-locally dissimilar at w if E_k[‖∇F_k(w)‖²] ≤ ‖∇f(w)‖² B². We further define B(w) = √(E_k[‖∇F_k(w)‖²] / ‖∇f(w)‖²) when ‖∇f(w)‖ ≠ 0. (As an exception, we define B(w) = 1 when E_k[‖∇F_k(w)‖²] = ‖∇f(w)‖² = 0, i.e., when w is a stationary solution on which all the local functions agree.)

Here E_k[·] denotes the expectation over devices with masses p_k = n_k/n, Σ_k p_k = 1 (as in Equation 1). Definition 2 can be seen as a generalization of the IID assumption with bounded dissimilarity, while allowing for heterogeneity. As a sanity check, when all the local functions are the same, we have B(w) = 1 for all w. However, in the federated setting, the data distributions are often heterogeneous and B > 1 due to sampling discrepancies even if the samples are assumed to be IID. Let us also consider the case where the F_k's are associated with empirical risk objectives. If the samples on all the devices are homogeneous, i.e., they are sampled in an IID fashion, then as min_k n_k → ∞ it follows that B(w) → 1 for every w, as all the local functions converge to the same expected risk function in the large-sample limit. Thus, B(w) ≥ 1, and the larger the value of B(w), the larger the dissimilarity among the local functions. Using Definition 2, we now state our formal dissimilarity assumption, which we use in our convergence analysis. This assumption simply requires that the dissimilarity defined in Definition 2 is bounded.
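
The quantity B(w) in Definition 2 is straightforward to estimate when the local gradients are available. The sketch below (hypothetical inputs: `grads` holding the ∇F_k(w)'s and `p` the weights p_k) mirrors the definition, including its stationary-point convention:

```python
import numpy as np

def local_dissimilarity(grads, p):
    """Estimate B(w) = sqrt(E_k ||grad F_k(w)||^2 / ||grad f(w)||^2)."""
    grads = np.asarray(grads, dtype=float)   # shape (N, d): one gradient per device
    p = np.asarray(p, dtype=float)
    global_grad = np.average(grads, axis=0, weights=p)         # grad f(w)
    mean_sq = np.average(np.sum(grads**2, axis=1), weights=p)  # E_k ||grad F_k(w)||^2
    denom = np.sum(global_grad**2)                             # ||grad f(w)||^2
    if denom == 0.0:
        # Convention from Definition 2: B = 1 at a shared stationary point.
        return 1.0 if mean_sq == 0.0 else np.inf
    return np.sqrt(mean_sq / denom)
```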

Assumption 1 (Bounded dissimilarity).

For some ε > 0, there exists a B_ε such that for all points w ∈ S_ε^c := {w : ‖∇f(w)‖² > ε}, B(w) ≤ B_ε.

For most practical machine learning problems, there is no need to solve the problem to arbitrarily accurate stationary solutions, i.e., ε is typically not very small. Indeed, it is well known that solving the problem beyond some threshold may even hurt generalization performance due to overfitting (Yao et al., 2007). Although in practical federated learning problems the samples are not IID, they are still sampled from distributions that are not entirely unrelated (if they were, fitting a single global model across devices would be ill-advised). Thus, it is reasonable to assume that the dissimilarity between local functions remains bounded throughout the training process.

4.2 FedProx Analysis

Using the bounded dissimilarity assumption (Assumption 1), we now analyze the amount of expected decrease in the objective when one step of FedProx is performed.

Theorem 3 (Non-convex FedProx Convergence: B-local dissimilarity).

Let Assumption 1 hold. Assume the functions F_k are non-convex and L-Lipschitz smooth, and that there exists L₋ > 0 such that ∇²F_k ⪰ −L₋ I, with μ̄ := μ − L₋ > 0. Suppose that w^t is not a stationary solution and the local functions F_k are B-dissimilar, i.e., B(w^t) ≤ B. If μ, K, and γ in Algorithm 2 are chosen such that

ρ := (1/μ) − (γB/μ) − (B(1 + γ)√2)/(μ̄√K) − (LB(1 + γ))/(μ̄μ) − (L(1 + γ)²B²)/(2μ̄²) − (LB²(1 + γ)²)/(μ̄²K) · (2√(2K) + 2) > 0,

then at iteration t of Algorithm 2, we have the following expected decrease in the global objective:

E_{S_t}[f(w^{t+1})] ≤ f(w^t) − ρ‖∇f(w^t)‖²,

where S_t is the set of K devices chosen at iteration t.

We direct the reader to Appendix A for a detailed proof. The key steps include applying our notions of γ-inexactness of each subproblem and bounded dissimilarity in the network, assuming that only K devices are active at each round. This last step in particular introduces E_{S_t}, the expectation with respect to the choice of devices S_t in round t.

Theorem 3 uses the dissimilarity in Definition 2 to identify sufficient decrease at each iteration for FedProx. Below, we provide a corollary characterizing the performance with a more common (though slightly more restrictive) bounded variance assumption. This assumption is commonly employed, e.g., when analyzing methods such as SGD.

Corollary 4 (Bounded Variance Equivalence).

Let Assumption 1 hold. Then, in the case of bounded variance, i.e., E_k[‖∇F_k(w) − ∇f(w)‖²] ≤ σ², for any w ∈ S_ε^c it follows that B(w) ≤ √(1 + σ²/‖∇f(w)‖²) ≤ √(1 + σ²/ε).

With Corollary 4 in place, we can restate the main result in Theorem 3 in terms of the bounded variance assumption.

Theorem 5 (Non-Convex FedProx Convergence: Bounded Variance).

Let the assertions of Theorem 3 hold. In addition, let the iterate w^t be such that ‖∇f(w^t)‖² > ε, and let E_k[‖∇F_k(w^t) − ∇f(w^t)‖²] ≤ σ² hold instead of the dissimilarity condition. If μ, K, and γ in Algorithm 2 are chosen such that ρ > 0, where ρ is defined as in Theorem 3 with B replaced by its upper bound √(1 + σ²/ε), then at iteration t of Algorithm 2, we have the following expected decrease in the global objective:

E_{S_t}[f(w^{t+1})] ≤ f(w^t) − ρ‖∇f(w^t)‖²,

where S_t is the set of K devices chosen at iteration t.

The proof of Theorem 5 follows from that of Theorem 3 by noting the relationship between the bounded variance assumption and the dissimilarity assumption, as portrayed in Corollary 4. While the results thus far hold for non-convex F_k's, we can also characterize the convergence for the special case of convex loss functions with exact minimization of the local objectives.

Corollary 6 (Convergence: Convex Case).

Let the assertions of Theorem 3 hold. In addition, let the F_k's be convex and γ = 0, i.e., all the local problems are solved exactly. If 1 ≪ B ≤ 0.5√K, then we can choose μ ≈ 6LB², from which it follows that ρ ≈ 1/(24LB²).

We next provide sufficient conditions that ensure ρ > 0 in Theorems 3 and 5, so that a sufficient decrease is attainable after each round.

Remark 7.

For ρ in Theorem 3 to be positive, we need γB < 1. Moreover, we also need B/√K < 1. These conditions help to quantify the trade-off between the dissimilarity bound (B) and the algorithm parameters (γ, K).

Finally, we can use the above sufficient decrease to characterize the rate of convergence to the set of approximate stationary solutions under the bounded dissimilarity assumption, Assumption 1. Note that these results hold for general non-convex F_k's.

Theorem 8 (Convergence rate: FedProx).

Given some ε > 0, assume that for some B ≥ B_ε, the assumptions of Theorem 3 hold at each iteration of FedProx with ρ > 0. Moreover, let Δ := f(w^0) − f*. Then, after T = O(Δ/(ρε)) iterations of FedProx, we have (1/T) Σ_{t=0}^{T−1} E[‖∇f(w^t)‖²] ≤ ε.

Remark 9 (Comparison with SGD).

Note that FedProx achieves the same asymptotic convergence guarantee as SGD. In other words, under the bounded variance assumption, for small ε, if we replace B with its upper bound from Corollary 4 and choose μ large enough, then the iteration complexity of FedProx when the subproblems are solved exactly and the F_k's are convex is O(LΔσ²/ε²), which is the same as that of SGD (Ghadimi & Lan, 2013).

To help provide context for the rate in Theorem 8, we compare it with SGD in the convex case in Remark 9. Note that a small ε in Assumption 1 translates to a larger B_ε. Corollary 6 suggests that, in order to solve the problem to increasingly higher accuracies using FedProx, one needs to increase μ appropriately. Moreover, if we plug the upper bound for B_ε under the bounded variance assumption (see Corollary 4) into Corollary 6, the number of required steps to achieve accuracy ε is O(LΔ/ε + LΔσ²/ε²). Our analysis helps to characterize the performance and potential limitations of FedProx and similar methods when the local functions are dissimilar. In Section 5, we explore these ideas empirically. As a future direction, it would be interesting to quantify lower bounds for the convergence of methods such as FedProx and FedAvg in heterogeneous settings.
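
For the record, the substitution behind this count (a sketch using the approximate constants from Corollaries 4 and 6, not a separate result): plugging ρ ≈ 1/(24LB²) into T = O(Δ/(ρε)) gives T = O(LB²Δ/ε), and substituting the bound B² ≤ 1 + σ²/ε yields

T = O(LΔ/ε + LΔσ²/ε²),

whose dominant term for small ε is the O(LΔσ²/ε²) complexity quoted in Remark 9.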

5 Experiments

We now present empirical results for the FedProx framework. In Section 5.2, we study the effect of statistical heterogeneity on the convergence of FedAvg and FedProx. We explore properties of the FedProx framework, such as the effects of μ and the number of local epochs E, in Section 5.3. Finally, in Section 5.4, we show how empirical convergence is related to the bounded dissimilarity assumption (Assumption 1, Corollary 4) presented in Section 4. We provide thorough details of the experimental setup in Section 5.1 and Appendix D. All code, data, and experiments are publicly available at github.com/litian96/FedProx.

Figure 1: Effect of data heterogeneity on convergence. We show training loss (see testing accuracy and dissimilarity metrics in Appendix, Figure 6) on four synthetic datasets whose heterogeneity increases from left to right. Note that μ = 0 corresponds to FedAvg. Increasing heterogeneity leads to worse convergence when the number of local epochs remains fixed, but setting μ > 0 can help to combat this.
Figure 2: Effect of increasing E on real federated datasets, where μ = 0 (corresponding to FedAvg). Too many local updates can cause divergence or instability on heterogeneous datasets. Note that FEMNIST* is a more skewed version of FEMNIST.

Figure 3: Effect of μ on four real datasets. The setting μ = 0 corresponds to FedAvg. By setting μ appropriately, FedProx can increase the stability of convergence and can enable otherwise divergent methods to converge.
Figure 4: The dissimilarity measurement (variance of gradients) on four federated datasets. This metric captures statistical heterogeneity and is consistent with training loss (Figure 3). Smaller dissimilarity indicates better convergence.

5.1 Experimental Details

We evaluate FedProx on diverse tasks, models, and real-world federated datasets. In order to characterize statistical heterogeneity and study its effect on convergence, we also evaluate on a set of synthetic data, which allows for more precise manipulation of heterogeneity.

Synthetic data. To generate synthetic data, we follow a similar setup to that described in Shamir et al. (2014), additionally imposing heterogeneity among devices. In particular, for each device k, we generate synthetic samples (X_k, Y_k) according to the model y = argmax(softmax(Wx + b)), with x ∈ R^60, W ∈ R^{10×60}, and b ∈ R^10. We model W ~ N(u_k, 1), b ~ N(u_k, 1), u_k ~ N(0, α); x_k ~ N(v_k, Σ), where the covariance matrix Σ is diagonal with Σ_{j,j} = j^{−1.2}. Each element in the mean vector v_k is drawn from N(B_k, 1), with B_k ~ N(0, β). Therefore, α controls how much local models differ from each other, and β controls how much the local data at each device differs from that of other devices. We vary (α, β) to generate three heterogeneous distributed datasets, Synthetic (α, β), as shown in Figure 1. We also generate one IID dataset by setting the same W, b on all devices and setting x_k to follow the same distribution. Our goal is to learn a global W and b. Full details are given in Appendix D.1.

Real data. We also explore five real datasets, summarized in Table 1. These datasets are curated from prior work in federated learning as well as recent federated learning benchmarks (McMahan et al., 2017; Caldas et al., 2018). We begin with a simple convex classification problem on MNIST (LeCun et al., 1998) using multinomial logistic regression. To impose heterogeneity, we distribute the data among 1,000 devices such that each device has samples of only two digits and the number of samples per device follows a power law. We then study the more complex 62-class Federated Extended MNIST (Cohen et al., 2017; Caldas et al., 2018) (FEMNIST) dataset using the same model. Each device corresponds to a writer of the digits/characters. We also modify FEMNIST to create a more heterogeneous dataset, FEMNIST*. For the non-convex setting, we consider a text sentiment analysis task on tweets from Sentiment140 (Go et al., 2009) (Sent140) with an LSTM classifier, where each Twitter account corresponds to a device. Finally, we consider data from The Complete Works of William Shakespeare (McMahan et al., 2017), using an LSTM to predict the next character and associating each speaking role with a different device. Full details are provided in Appendix D.1.

Dataset       Devices   Samples   Samples/device
                                  mean     stdev
MNIST         1,000     69,035    69       106
FEMNIST       900       305,654   340      107
Shakespeare   143       517,706   3,620    4,115
Sent140       5,726     215,829   38       19
FEMNIST*      200       79,059    395      873

Table 1: Statistics of Federated Datasets

Implementation. We implement FedAvg (Algorithm 1) and FedProx (Algorithm 2) in TensorFlow (Abadi et al., 2015). In order to draw a fair comparison with FedAvg, we employ SGD as the local solver for FedProx, and adopt a slightly different device sampling scheme than that in Algorithms 1 and 2: sampling devices uniformly and then averaging the updates with weights proportional to the number of local data points (as originally proposed in McMahan et al. (2017)). While this sampling scheme is not supported by our analysis, we observe similar relative behavior of FedProx vs. FedAvg whether or not it is employed. Interestingly, we also observe that the sampling scheme proposed herein in fact results in more stable performance for both methods (see Appendix D.4, Figure 9). This suggests an additional benefit of the proposed framework.

Setup. For each experiment, we tune the learning rate and the ratio of active devices per round, and report results using the hyperparameters that perform best on FedAvg. We randomly split the data on each local device into an 80% training set and a 20% testing set. For each comparison, we fix random seeds so that the devices selected and the data read at each round are the same across all runs. We report all metrics based on the global objective f(w). Note that FedAvg (μ = 0) and FedProx (μ > 0) perform the same amount of work at each round when the number of local epochs, E, is the same; we therefore report results in terms of communication rounds rather than FLOPs or wall-clock time.

5.2 Effect of Statistical Heterogeneity

In Figure 1, we study how statistical heterogeneity affects convergence when the number of local epochs is large, using four synthetic datasets. We fix E to be 20. From left to right, as the data becomes more heterogeneous, convergence becomes worse for FedProx with both μ = 0 (FedAvg) and μ > 0. However, while μ > 0 may slow convergence on IID data, a larger μ is particularly useful in heterogeneous settings, as is evident from Figure 1. This indicates that the modified subproblem introduced in FedProx can benefit practical federated settings with varying statistical heterogeneity. In the sections to follow, we see similar results in our non-synthetic experiments.

5.3 Properties of FedProx Framework

The key parameters of FedProx that affect performance are the number of local epochs, E, and the proximal term scaled by μ. Intuitively, a large E may cause local models to drift too far away from the initial starting point, leading to potential divergence, while a large μ can restrict the trajectory of the iterates by constraining them to be closer to the global model, potentially slowing convergence. We study FedProx under different values of E and μ using the federated datasets described in Table 1.

Dependence on E.

We explore the effect of E in Figure 2. For each dataset, we set E to 1, 20, and 50 while keeping μ = 0 (FedProx reduces to FedAvg in this case) and plot convergence in terms of training loss. Note that we observe similar trends in terms of test accuracy (all plots are provided in Appendix D.3). We see that a large E leads to divergence or instability on the MNIST and Shakespeare datasets. On FEMNIST and Sent140, however, a larger E speeds up convergence. Based on the conclusions drawn from Figure 1, we hypothesize that this is because the data distributed across devices after partitioning FEMNIST and Sent140 lacks significant heterogeneity. We validate this hypothesis by observing instability on FEMNIST*, a more skewed variant of the FEMNIST dataset. Moving forward, we therefore demonstrate the impact of E using FEMNIST* instead of FEMNIST.

We note here that a potential approach to handle the observed divergence or instability of FedAvg with a large number of local epochs would be to simply keep E small. However, this requires E to be carefully and heuristically tuned across the network, and it precludes the possibility of exactly solving the local subproblems. A large number of local epochs, E, may be particularly useful in practice when communication is expensive (which is common in federated networks), to help balance communication and computation. Indeed, in such situations increasing μ can improve stability. In Figure 5, e.g., we show that FedProx with a large E (E = 50) and an appropriately chosen μ > 0 leads to faster and more stable convergence in terms of communication rounds, compared with E = 1, μ = 0 (slow convergence) and E = 50, μ = 0 (unstable convergence).

Dependence on μ.

We consider the effect of μ on convergence in Figure 3. For each experiment, with E = 20, we compare the results of μ = 0 with those of the best μ. For three out of the four datasets (all but Sent140), we observe that an appropriate μ can increase the stability of unstable methods and can force divergent methods to converge. It also increases the accuracy in most cases (see 'testing accuracy' in Figure 6 and Figure 8 in Appendix D.3). In practice, μ can be chosen based on specific data characteristics and communication patterns. Future work includes developing methods to automatically set and tune this parameter for heterogeneous datasets based, e.g., on the theoretical groundwork provided in this work.

Figure 5: FedProx can provide faster and more stable convergence in communication-constrained environments (those requiring a large E) with an appropriate μ.

5.4 Dissimilarity Measurement and Divergence

Finally, in Figure 4, we demonstrate that our B-local dissimilarity measure in Definition 2 captures the heterogeneity of datasets and is therefore an appropriate proxy of performance. In particular, we track the variance of gradients on each device, E_k[‖∇F_k(w) − ∇f(w)‖²], which bounds the dissimilarity B(w) (see the Bounded Variance Equivalence, Corollary 4). Empirically, either decreasing E (Figure 7 in Appendix D.3) or increasing μ (Figure 4) leads to smaller dissimilarity among the local functions. We also observe that the dissimilarity metric is consistent with the training loss: smaller dissimilarity indicates better convergence, which can be enforced by setting μ appropriately. Full results tracking this metric (for all experiments performed) are provided in Appendix D.3.

6 Conclusion

In this work, we have proposed FedProx, a distributed optimization framework that tackles the statistical heterogeneity inherent in federated networks. We have formalized statistical heterogeneity through a novel device dissimilarity assumption which allowed us to characterize the convergence of FedProx. Our empirical evaluation across a suite of federated datasets has validated our theoretical analysis and demonstrated that the FedProx framework can improve convergence behavior in realistic heterogeneous federated networks.

Acknowledgement.

We thank Jakub Konečný, Brendan McMahan, Nathan Srebro, and Jianyu Wang for their helpful discussions. This work was supported in part by DARPA FA875017C0141, the National Science Foundation grants IIS1705121 and IIS1838017, an Okawa Grant, a Google Faculty Award, an Amazon Web Services Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.


Appendix A Proof of Theorem 3

Proof.

Using our notion of γ-inexactness for each local solver (Definition 1), we can define e_k^{t+1} such that:

∇F_k(w_k^{t+1}) + μ(w_k^{t+1} − w^t) = e_k^{t+1},  with ‖e_k^{t+1}‖ ≤ γ‖∇F_k(w^t)‖.    (3)

Now let us define w̄^{t+1} := E_k[w_k^{t+1}]. Based on this definition, we know

w̄^{t+1} = w^t − (1/μ) E_k[∇F_k(w_k^{t+1}) − e_k^{t+1}].    (4)

Let us define μ̄ := μ − L₋ > 0 and ŵ_k^{t+1} := argmin_w h_k(w; w^t). Then, due to the μ̄-strong convexity of h_k, we have

‖ŵ_k^{t+1} − w_k^{t+1}‖ ≤ (γ/μ̄) ‖∇F_k(w^t)‖.    (5)

Note that once again, due to the μ̄-strong convexity of h_k, we know that ‖ŵ_k^{t+1} − w^t‖ ≤ (1/μ̄)‖∇F_k(w^t)‖. Now we can use the triangle inequality to get

‖w_k^{t+1} − w^t‖ ≤ ((1 + γ)/μ̄) ‖∇F_k(w^t)‖.    (6)

Therefore,

‖w̄^{t+1} − w^t‖ ≤ E_k[‖w_k^{t+1} − w^t‖] ≤ ((1 + γ)/μ̄) E_k[‖∇F_k(w^t)‖] ≤ ((1 + γ)B/μ̄) ‖∇f(w^t)‖,    (7)

where the last inequality is due to the bounded dissimilarity assumption.

Now let us define M_{t+1} such that w̄^{t+1} − w^t = −(1/μ)(∇f(w^t) + M_{t+1}), i.e., M_{t+1} = E_k[∇F_k(w_k^{t+1}) − ∇F_k(w^t)] − E_k[e_k^{t+1}]. We can bound ‖M_{t+1}‖:

‖M_{t+1}‖ ≤ (L(1 + γ)/μ̄ + γ) B ‖∇f(w^t)‖,    (8)

where the last inequality is also due to the bounded dissimilarity assumption. Based on the L-Lipschitz smoothness of f and a Taylor expansion, we have

f(w̄^{t+1}) ≤ f(w^t) + ⟨∇f(w^t), w̄^{t+1} − w^t⟩ + (L/2)‖w̄^{t+1} − w^t‖²
           ≤ f(w^t) − (1/μ)‖∇f(w^t)‖² + (1/μ)‖∇f(w^t)‖ ‖M_{t+1}‖ + (L(1 + γ)²B²)/(2μ̄²) ‖∇f(w^t)‖².    (9)

From the above inequality it follows that if we set the penalty parameter μ large enough, we obtain a decrease in the objective value of f(w̄^{t+1}) that is proportional to ‖∇f(w^t)‖². However, this is not the way the algorithm works. In the algorithm, we only use the K devices chosen in S_t to approximate w̄^{t+1}. So, in order to bound E_{S_t}[f(w^{t+1})], we use the local Lipschitz continuity of the function f.

(10)

where L₀ is the local Lipschitz continuity constant of the function f, and we have

(11)

Therefore, if we take expectation with respect to the choice of devices in round t, we need to bound

(12)

where w^{t+1} = (1/K) Σ_{k ∈ S_t} w_k^{t+1}. Note that the expectation is taken over the random choice of devices to update.

(13)

From (7), we have that ‖w̄^{t+1} − w^t‖ ≤ ((1 + γ)B/μ̄) ‖∇f(w^t)‖. Moreover,

(14)

and

(15)

where the first inequality results from the K devices being chosen randomly to obtain w^{t+1}, and the last inequality is due to the bounded dissimilarity assumption. If we replace these bounds in (13), we get

(16)

Combining (9), (12), (10), and (16), and collecting the resulting terms into the constant ρ defined in Theorem 3, we obtain the expected decrease E_{S_t}[f(w^{t+1})] ≤ f(w^t) − ρ‖∇f(w^t)‖², which completes the proof.

Appendix B Proof of Corollary 4

We have

E_k[‖∇F_k(w)‖²] = ‖∇f(w)‖² + E_k[‖∇F_k(w) − ∇f(w)‖²] ≤ ‖∇f(w)‖² + σ²,

since E_k[∇F_k(w)] = ∇f(w). Dividing both sides by ‖∇f(w)‖² and taking square roots gives B(w) ≤ √(1 + σ²/‖∇f(w)‖²), which is at most √(1 + σ²/ε) for any w ∈ S_ε^c.

Appendix C Proof of Corollary 6

In the convex case, where L₋ = 0 and μ̄ = μ, if γ = 0, i.e., all subproblems are solved accurately, we can get a decrease in the objective proportional to ‖∇f(w^t)‖² provided B ≤ √K. In such a case, if we assume 1 ≪ B ≤ 0.5√K, then we can write

(17)

In this case, if we choose μ ≈ 6LB², we get

E_{S_t}[f(w^{t+1})] ≤ f(w^t) − (1/(24LB²)) ‖∇f(w^t)‖².    (18)

Note that the expectation in (18) is a conditional expectation, conditioned on the previous iterate. Taking expectations of both sides and telescoping, we see that the number of iterations needed to generate at least one solution with squared gradient norm less than ε is O(LB²Δ/ε).

Appendix D Experimental Details

d.1 Datasets and Models

Here we provide full details on the datasets and models used in our experiments. We curate a diverse set of non-synthetic datasets, including those used in prior work on federated learning McMahan et al. (2017), and some proposed in LEAF, a benchmark for federated settings Caldas et al. (2018). We also create synthetic data to directly test the effect of heterogeneity on convergence, as in Section 5.1.

  • Synthetic: We set (α, β) = (0, 0), (0.5, 0.5), and (1, 1), respectively, to generate three non-identically distributed datasets (Figure 1). For the IID data, we set the same W, b on all devices and let x_k follow the same distribution, where each element of the mean vector is drawn from N(0, 1) and Σ is diagonal with Σ_{j,j} = j^{−1.2}. For all synthetic datasets, there are 30 devices in total, and the number of samples on each device follows a power law.

  • MNIST: We study image classification of handwritten digits 0-9 in MNIST (LeCun et al., 1998) using multinomial logistic regression. To simulate a heterogeneous setting, we distribute the data among 1,000 devices such that each device has samples of only two digits and the number of samples per device follows a power law. The input of the model is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 9.

  • FEMNIST: We study an image classification problem on the 62-class EMNIST dataset (Cohen et al., 2017) using multinomial logistic regression. Each device corresponds to a writer of the digits/characters in EMNIST; we call this federated version of EMNIST FEMNIST. The input of the model is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 61.

  • Shakespeare: This is a dataset built from The Complete Works of William Shakespeare (McMahan et al., 2017). Each speaking role in a play represents a different device. We use a two-layer LSTM classifier containing 100 hidden units with an 8-dimensional embedding layer. The task is next-character prediction, with 80 character classes in total. The model takes as input a sequence of 80 characters, embeds each character into a learned 8-dimensional space, and outputs one character per training sample after the two LSTM layers and a densely-connected layer.

  • Sent140: For the non-convex setting, we consider a text sentiment analysis task on tweets from Sentiment140 (Go et al., 2009) (Sent140) with a two-layer LSTM binary classifier containing 256 hidden units and pretrained 300-dimensional GloVe embeddings (Pennington et al., 2014). Each Twitter account corresponds to a device. The model takes as input a sequence of 25 words, embeds each word into a 300-dimensional space via a GloVe lookup, and outputs a binary sentiment label per training sample after the two LSTM layers and a densely-connected layer.

  • FEMNIST*: We generate FEMNIST* by subsampling lower case characters from FEMNIST and distributing only 20 classes to each device. There are 200 devices in total. The model is the same as the one used on FEMNIST.

d.2 Implementation Details

(Machines) We simulate the federated learning setup (1 server and N devices) on a commodity machine with 2 Intel Xeon E5-2650 v4 CPUs and 8 NVidia 1080Ti GPUs.

(Hyperparameters) For each dataset, we tune the ratio of active clients per round from {0.01, 0.05, 0.1} on FedAvg. For the synthetic datasets, roughly 10% of the devices are active at each round. For MNIST, FEMNIST, Shakespeare, Sent140, and FEMNIST*, the fractions of active devices are 1%, 5%, 10%, 1%, and 5%, respectively. We also perform a grid search on the learning rate based on FedAvg; we do not decay the learning rate through all rounds. For all synthetic data experiments, the learning rate is 0.01. For MNIST, FEMNIST, Shakespeare, Sent140, and FEMNIST*, we use learning rates of 0.03, 0.003, 0.8, 0.3, and 0.003. We use a batch size of 10 for all experiments.

(Libraries) All code is implemented in TensorFlow (Abadi et al., 2015), Version 1.10.1. Please see our anonymized code submission for full details.

d.3 Full Experiments

We present testing accuracy, training loss and dissimilarity measurements of all the experiments in Figure 6, Figure 7 and Figure 8.

Figure 6: Training loss, testing accuracy and dissimilarity measurement for experiments in Figure 1
Figure 7: Training loss, testing accuracy and dissimilarity measurement for experiments in Figure 2
Figure 8: Training loss, testing accuracy and dissimilarity measurement for experiments in Figure 3

d.4 FedProx with two device sampling schemes

We show the training loss, testing accuracy and dissimilarity measurement of FedProx using two different device sampling schemes in Figure 9.

Figure 9: Differences between the two sampling schemes in terms of training loss, testing accuracy, and dissimilarity measurement. Sampling devices with probability proportional to the number of local data points and then simply averaging local models performs slightly better than uniformly sampling devices and averaging the local models with weights proportional to the number of local data points. Under either sampling scheme, settings with μ > 0 demonstrate more stable performance than settings with μ = 0.