Fair Resource Allocation in Federated Learning

by Tian Li, et al.
Carnegie Mellon University

Federated learning involves training statistical models in massive, heterogeneous networks. Naively minimizing an aggregate loss function in such a network may disproportionately advantage or disadvantage some of the devices. In this work, we propose q-Fair Federated Learning (q-FFL), a novel optimization objective inspired by resource allocation in wireless networks that encourages a more fair (i.e., lower-variance) accuracy distribution across devices in federated networks. To solve q-FFL, we devise a communication-efficient method, q-FedAvg, that is suited to federated networks. We validate both the effectiveness of q-FFL and the efficiency of q-FedAvg on a suite of federated datasets, and show that q-FFL (along with q-FedAvg) outperforms existing baselines in terms of the resulting fairness, flexibility, and efficiency.


Code repository: Fair Resource Allocation in Federated Learning (ICLR '20)

1 Introduction

With the growing prevalence of IoT-type devices, data is frequently collected and processed outside of the data center and directly on distributed devices, such as sensors, wearable devices, or mobile phones. Federated learning is a promising learning paradigm for this setting that pushes machine learning model training to the edge [24]. Federated learning methods aim to address key challenges such as user privacy, expensive communication, and device variability.

In federated learning, the goal is typically to fit a model to data generated by a network of devices via some empirical risk minimization objective. The number of devices in such networks is generally large—ranging from hundreds to millions. Naively minimizing the average loss in such a massive network may disproportionately advantage or disadvantage the model performance on some of the devices. Indeed, although the accuracy may be high on average, there is no accuracy guarantee for individual devices in the network. This is exacerbated by the fact that the data are often heterogeneous across devices both in terms of size and distribution. In this work, we therefore ask: Can we devise an efficient federated optimization method to encourage a more fair distribution of the model performance across devices in federated networks?

There has been tremendous recent interest in developing fair methods for machine learning [see, e.g., 6, 9]. However, methods that could help improve fairness of the accuracy distribution in distributed settings are typically proposed for a much smaller number of devices, and may be impractical in federated networks due to the number of involved constraints [6]. Recent work proposed specifically for the federated setting has also only been applied at small scales (2-3 groups/devices), and lacks flexibility because it optimizes only the performance of the single worst device [26].

In this work, we propose q-FFL, a novel optimization objective that addresses fairness issues in federated learning. Inspired by work in fair resource allocation for wireless networks, q-FFL minimizes an aggregate reweighted loss parameterized by $q$ such that the devices with higher loss are given higher relative weight to encourage less variance (i.e., more fairness) in the accuracy distribution. Adaptively minimizing such a modified objective avoids the burden of hand-crafting fairness constraints, and results in a flexible framework in which the objective can be tuned depending on the desired amount of fairness. In addition, we propose a lightweight and scalable distributed method, q-FedAvg, to solve q-FFL, which carefully accounts for important characteristics of the federated setting such as communication-efficiency and low participation of devices [4, 24].

Contributions. We summarize our contributions as follows. First, we propose q-FFL, a novel objective that can improve the fairness of the accuracy distribution in federated learning. Second, we design a scalable method, q-FedAvg, that can efficiently solve the proposed objective in massive federated networks. Finally, through extensive experiments on federated datasets with both convex and non-convex models, we demonstrate the fairness and flexibility of q-FFL and the efficiency of q-FedAvg compared with existing baselines. Empirically, q-FFL is able to reduce the variance of accuracies across devices by 45% on average while maintaining the same overall average accuracy.

2 Related Work

Fairness in Machine Learning. Fairness is a broad topic that has received much recent attention in the machine learning community. There are several widespread approaches to address fairness, in which fairness is typically defined as the protection of some specific attribute(s) (e.g., [17]). Two common approaches are to preprocess the data to remove information about the protected attribute [13], or to post-process the model by adjusting the prediction threshold after classifiers are trained [12, 17]. Another set of works optimizes an objective subject to some fairness constraints during training time [3, 6, 9, 18, 43, 46, 47]. Our work also enforces fairness during training, though we define fairness as the variance of the accuracy distribution across devices in federated learning (Section 3), as opposed to the protection of a specific attribute. Although some work defines equal error rates among specific groups as a notion of fairness [6, 46], our goal is not to optimize for the same accuracy across all devices due to the heterogeneous nature of federated settings. Cotter et al. [6] use a notion of 'minimum accuracy' as one special case of 'rate constraints', which is conceptually similar to our goal. However, it requires one optimization constraint for each device/group, which would result in thousands to millions of constraints in the federated setting.

In federated settings, Mohri et al. [26] recently proposed a minimax optimization scheme, Agnostic Federated Learning (AFL), which optimizes for the performance of the single worst device (the notion of 'group' in [26] is the same as the notion of 'device' used here). This method has only been applied at small scales (for a handful of groups/devices). Compared to AFL, our proposed objective is more flexible as it can be tuned based on the desired amount of fairness; q-FFL in fact generalizes AFL, as q-FFL with a large enough $q$ is equivalent to AFL. We demonstrate the improved flexibility and scalability of q-FFL compared to AFL empirically in Section 4.

Fairness in Resource Allocation. Fair resource allocation has been extensively studied in fields such as network management [10, 16, 21, 28] and wireless communications [11, 27, 34, 37]. In these contexts, the problem is defined as allocating a scarce shared resource, e.g., communication time or power, among many users. In these cases, directly maximizing utilities such as total throughput usually leads to unfair allocations where some users receive poor service. As a service provider, it is important to improve the quality of service for all users while maintaining overall throughput. For this reason, several popular fairness measurements have been proposed to balance between fairness and total throughput, including Jain's index [19], entropy [33], max-min/min-max fairness [31], and proportional fairness [20]. A unified framework is captured through $\alpha$-fairness [22, 25], in which the network manager can tune the emphasis on fairness by changing a single parameter, $\alpha$.

To draw an analogy between federated learning and the problem of resource allocation, one can think of the global model as a resource that is meant to serve the users (or devices). In this sense, it is natural to ask similar questions about the fairness of the service that users receive and use similar tools to promote fairness. Despite this, we are unaware of any works that use $\alpha$-fairness from resource allocation to modify training objectives in machine learning. Inspired by the $\alpha$-fairness metric, we propose a similarly modified objective function, q-Fair Federated Learning (q-FFL), to encourage a more fair accuracy distribution across devices in the context of federated training. Similar to the $\alpha$-fairness metric, our q-FFL objective is flexible enough to enable trade-offs between fairness and other traditional metrics such as average accuracy by changing the parameter $q$. In Section 4, we demonstrate empirically that the use of q-FFL as an objective in federated learning enables a more fair test accuracy distribution among the devices.

Federated and Distributed Optimization. To devise a practical fairness solution for the federated setting, it is critical to design methods for efficiently solving the proposed objective. Federated learning faces challenges such as expensive communication, systems heterogeneity (e.g., variability in hardware or network connection) and statistical heterogeneity (i.e., differing local data distributions per device), making it distinct from classical distributed optimization [32, 35, 39]. In order to reduce communication, as well as to tolerate heterogeneity, methods that allow for local updating and low participation among devices have become de facto solvers for this setting [23, 24, 38]. We incorporate recent advancements in this field when designing methods to solve the q-FFL objective (Section 3.3).

3 Fair Federated Learning

In this section, we formally define the classical federated learning objective and methods, and introduce our proposed notion of fairness. We then introduce q-FFL, a novel objective that encourages a more fair accuracy distribution across all devices (Section 3.2). Finally, in Section 3.3, we describe q-FedAvg, an efficient distributed method to solve the q-FFL objective in federated settings.

3.1 Preliminaries: Classical Federated Learning, Fairness, and FedAvg

Federated learning algorithms involve hundreds to millions of remote devices learning locally on their device-generated data and communicating with a central server periodically to reach a global consensus. In particular, the goal is typically to minimize the following objective function:

$$\min_w f(w) = \sum_{k=1}^{m} p_k F_k(w), \tag{1}$$

where $m$ is the total number of devices, $p_k \ge 0$, and $\sum_k p_k = 1$. The local objectives $F_k$ can be defined by empirical risks over local data, i.e., $F_k(w) = \frac{1}{n_k} \sum_{j_k=1}^{n_k} l_{j_k}(w)$, where $n_k$ is the number of samples available locally. We can set $p_k$ to be $\frac{n_k}{n}$, where $n = \sum_k n_k$ is the total number of samples, to fit a traditional empirical risk minimization-type objective over the entire dataset.
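As a small illustration of objective (1), the sketch below evaluates the weighted sum of local empirical risks with weights proportional to the local sample counts. The per-sample losses and the helper names (`local_risk`, `global_objective`) are hypothetical and chosen only for this example; with this choice of weights the objective coincides with the average loss over the pooled data.

```python
from typing import List


def local_risk(sample_losses: List[float]) -> float:
    """Empirical risk F_k: mean of the per-sample losses on one device."""
    return sum(sample_losses) / len(sample_losses)


def global_objective(device_losses: List[List[float]]) -> float:
    """Objective (1) with p_k = n_k / n: weighted sum of the local risks."""
    n = sum(len(losses) for losses in device_losses)
    return sum(len(losses) / n * local_risk(losses) for losses in device_losses)


# Hypothetical per-sample losses on three devices of different sizes.
devices = [[0.2, 0.4], [1.0], [0.1, 0.1, 0.1, 0.3, 0.4]]
f_w = global_objective(devices)

# Average loss over all samples pooled together, for comparison.
pooled = sum(sum(d) for d in devices) / sum(len(d) for d in devices)
```

Note that the single-sample device with loss 1.0 contributes only one eighth of the objective here, which is the kind of imbalance the fairness discussion in this section is concerned with.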

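  Input: $K$, $T$, $\eta$, $E$, $w^0$, $m$, $p_k$, $k = 1, \dots, m$
  for $t = 0, \dots, T-1$ do
     Server randomly chooses a subset $S_t$ of $K$ devices (each device $k$ is chosen with probability $p_k$)
     Server sends $w^t$ to all chosen devices
     Each device $k$ updates $w^t$ for $E$ epochs of SGD on $F_k$ with step-size $\eta$ to obtain $w_k^{t+1}$
     Each chosen device $k$ sends $w_k^{t+1}$ back to the server
     Server aggregates the $w$'s as $w^{t+1} = \frac{1}{K} \sum_{k \in S_t} w_k^{t+1}$
  end for
Algorithm 1 Federated Averaging [24] (FedAvg)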
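The loop of Algorithm 1 can be sketched on a toy problem as follows. This is a minimal illustration rather than the authors' implementation: the model is a single scalar, each local objective is assumed to be a mean squared distance to the device's data points, devices are sampled with replacement with weights proportional to their sample counts, and the function names are hypothetical.

```python
import random
from typing import List


def local_sgd(w: float, data: List[float], epochs: int, lr: float,
              rng: random.Random) -> float:
    """Run `epochs` passes of SGD on F_k(w) = mean_j 0.5 * (w - x_j)^2."""
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            w -= lr * (w - x)  # gradient of 0.5 * (w - x)^2 with respect to w
    return w


def fedavg(datasets: List[List[float]], rounds: int, num_selected: int,
           epochs: int, lr: float, seed: int = 0) -> float:
    """Toy FedAvg: sample devices with weights n_k, then average local models."""
    rng = random.Random(seed)
    sizes = [len(d) for d in datasets]
    w = 0.0
    for _ in range(rounds):
        # Assumption: sampling with replacement, probability proportional to n_k.
        chosen = rng.choices(range(len(datasets)), weights=sizes, k=num_selected)
        local_models = [local_sgd(w, datasets[k], epochs, lr, rng) for k in chosen]
        w = sum(local_models) / len(local_models)  # simple model averaging
    return w


# Hypothetical one-dimensional data held by three devices.
datasets = [[1.0, 1.2, 0.8], [3.0, 3.1], [2.0]]
w_final = fedavg(datasets, rounds=50, num_selected=2, epochs=1, lr=0.1)
```

Because larger devices are both sampled more often and run more local steps, the final model in this toy setting tends to sit closer to their data, which motivates the fairness criterion discussed next.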

Most prior work solves (1) by sampling a subset of devices with probabilities proportional to $n_k$ at each round, and then applying an optimizer such as stochastic gradient descent (SGD) locally. These local updating methods enable flexible and efficient communication by running an optimizer for a variable number of iterations locally on each device, e.g., compared to traditional mini-batch methods, which would simply calculate a subset of the gradients [40, 41, 44, 45]. FedAvg [24], summarized in Algorithm 1, is one of the leading methods to solve (1). The method runs simply by having each selected device apply $E$ epochs of SGD locally and then averaging the resulting models.

Unfortunately, solving problem (1) in this manner can implicitly introduce unfairness between different devices. For instance, the learned model may be biased towards devices with larger numbers of data points, or (if weighting devices equally), to commonly occurring groups of devices. More formally, we define our desired fairness criteria for federated learning below.

Definition 1 (Fairness of performance distribution).

For trained models $w$ and $\tilde{w}$, we say that model $w$ provides a more fair solution to the federated learning objective (1) than model $\tilde{w}$ if the variance of the performance of model $w$ on the $m$ devices, $\{a_1, \dots, a_m\}$, is smaller than the variance of the performance of model $\tilde{w}$ on the $m$ devices, i.e., $\mathrm{Var}(a_1, \dots, a_m) < \mathrm{Var}(\tilde{a}_1, \dots, \tilde{a}_m)$.

In this work, we take 'performance', $a_k$, to be the testing accuracy of applying the trained model $w$ on the test data for device $k$. We note that a tension exists between the variance of the final testing accuracy distribution and the average testing accuracy across devices. In general, our goal is to reduce the variance while maintaining the same (or similar) average accuracy.
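To make Definition 1 concrete, the following sketch compares two models by the variance of their per-device accuracies. The accuracy values are hypothetical, and the helper name `is_fairer` is introduced only for this example.

```python
from statistics import mean, pvariance
from typing import Sequence


def is_fairer(acc_a: Sequence[float], acc_b: Sequence[float]) -> bool:
    """True if the per-device accuracies of model A have smaller variance than B."""
    return pvariance(acc_a) < pvariance(acc_b)


# Hypothetical test accuracies of two models on the same four devices.
acc_w = [0.70, 0.75, 0.80, 0.75]
acc_w_tilde = [0.50, 0.95, 0.60, 0.95]

same_average = abs(mean(acc_w) - mean(acc_w_tilde)) < 1e-12
w_is_fairer = is_fairer(acc_w, acc_w_tilde)
```

Both models have the same average accuracy in this example, yet the first is more fair under Definition 1 because its accuracies are spread over a much narrower range.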

3.2 The objective: q-Fair Federated Learning (q-FFL)

A natural idea to achieve fairness as defined in Definition 1 would be to reweight the objective, assigning higher weight to devices with poor performance, so that the distribution of accuracies in the network reduces in variance. Note that this re-weighting must be done dynamically, as the performance of the devices depends on the model being trained, which cannot be evaluated a priori. Drawing inspiration from $\alpha$-fairness, a utility function used in fair resource allocation in wireless networks, we propose the following objective. For given local non-negative cost functions $F_k$ and parameter $q > 0$, we define the q-Fair Federated Learning (q-FFL) objective as:

$$\min_w f_q(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w), \tag{2}$$

where $F_k^{q+1}(\cdot)$ denotes $F_k(\cdot)$ to the power of $(q+1)$. Here, $q$ is a parameter that tunes the amount of fairness we wish to impose. Setting $q = 0$ does not encourage fairness beyond the classical federated learning objective (1). A larger $q$ means that we emphasize devices with higher local empirical losses, $F_k(w)$, thus reducing the variance of the training accuracy distribution and potentially inducing fairness in accordance with Definition 1. $f_q(w)$ with a large enough $q$ reduces to classical max-min fairness [26], as the device with the worst performance (largest loss) will dominate the objective. We note that while the $(q+1)$ term in the denominator in (2) may be absorbed in $p_k$, we include it as it is standard in the $\alpha$-fairness literature and helps to ease notation in the following sections.
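The reweighting effect of the objective can be seen numerically. In the sketch below, which uses hypothetical loss values and helper names, `qffl_objective` evaluates the sum of (p_k / (q + 1)) * F_k^(q+1), and `effective_weights` normalizes the factors p_k * F_k^q that multiply each local gradient when this sum is differentiated.

```python
from typing import List, Sequence


def qffl_objective(losses: Sequence[float], weights: Sequence[float], q: float) -> float:
    """f_q = sum_k p_k / (q + 1) * F_k^(q + 1) for fixed local loss values."""
    return sum(p / (q + 1) * F ** (q + 1) for p, F in zip(weights, losses))


def effective_weights(losses: Sequence[float], weights: Sequence[float],
                      q: float) -> List[float]:
    """Normalized p_k * F_k^q, the factor in front of grad F_k in grad f_q."""
    raw = [p * F ** q for p, F in zip(weights, losses)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical local losses on three equally weighted devices.
losses = [0.2, 0.5, 2.0]
weights = [1 / 3, 1 / 3, 1 / 3]

f_0 = qffl_objective(losses, weights, q=0)      # classical weighted average loss
share_q0 = effective_weights(losses, weights, q=0)
share_q2 = effective_weights(losses, weights, q=2)
share_q5 = effective_weights(losses, weights, q=5)
```

With q = 0 every device keeps its original weight, while for q = 2 the device with loss 2.0 already receives more than 90% of the effective weight, and the share grows further for q = 5.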

In our experiments (Section 4.2), we show that under the q-FFL objective, we can obtain fairer results for federated datasets in terms of both the training and testing accuracy distributions. For completeness, we provide additional background on $\alpha$-fairness in Appendix A.

3.3 The solver: FedAvg-style q-Fair Federated Learning (q-FedAvg)

In this section, we provide methods to solve q-FFL. We start by giving a fair but less efficient method, q-FedSGD, to illustrate the main techniques we use in terms of solving the q-FFL problem (2). We then provide a more efficient counterpart, q-FedAvg, by considering key properties of federated algorithms such as local updating schemes. These proposed methods closely mirror traditional distributed optimization methods (mini-batch SGD and federated averaging, FedAvg), but with step-sizes and subproblems carefully chosen in accordance with the q-FFL problem (2).

Hyperparameter tuning: $q$ and step-sizes. In devising a method to solve q-FFL (2), we begin by noting that it is crucial to first determine how to set $q$. In practice, $q$ can be tuned based on the desired amount of fairness (with larger $q$ inducing more fairness). As we describe in our experiments (Section 4.2), it is therefore common to train a family of objectives for different $q$ values so that a practitioner can explore the trade-off between accuracy and fairness for the application at hand.

One concern with solving such a family of objectives is that the training costs can increase significantly. In particular, to optimize q-FFL in a scalable fashion, we rely on gradient-based methods, where the step-size inversely depends on the Lipschitz constant of the function's gradient, which is often unknown and selected via grid search [14, 29]. As we intend to optimize q-FFL for various values of $q$, the Lipschitz constant will change as we change $q$, requiring step-size tuning for all values of $q$. This can quickly cause the search space to explode. To overcome this issue, we propose estimating the local Lipschitz constant of the gradient for the family of q-FFL objectives by using the Lipschitz constant we infer via grid search on $q = 0$. This allows us to dynamically adjust the step-size of our gradient-based optimization method for the q-FFL objective, avoiding the manual tuning for each $q$. In Lemma 2 we formalize the relation between the Lipschitz constant, $L$, for $q = 0$ and $q > 0$.

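Lemma 2.

If the non-negative function $f(\cdot)$ has a Lipschitz gradient with constant $L$, then for any $q \ge 0$ and at any point $w$,

$$L_q(w) = L\, f(w)^q + q\, f(w)^{q-1}\, \|\nabla f(w)\|^2$$

is an upper-bound for the local Lipschitz constant of the gradient of $\frac{1}{q+1} f^{q+1}(\cdot)$ at point $w$.

Proof. At any point $w$, we can compute the Hessian $\nabla^2 \left( \frac{1}{q+1} f^{q+1}(w) \right)$ as:

$$\nabla^2 \left( \frac{1}{q+1} f^{q+1}(w) \right) = q\, f^{q-1}(w)\, \nabla f(w) \nabla^T f(w) + f^q(w)\, \nabla^2 f(w).$$

As a result, $\left\| \nabla^2 \left( \frac{1}{q+1} f^{q+1}(w) \right) \right\|_2 \le L_q(w) = L\, f(w)^q + q\, f(w)^{q-1}\, \|\nabla f(w)\|^2$. ∎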
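The bound of Lemma 2 can be checked numerically in one dimension. The sketch below is only an illustration under assumptions of our own choosing: it takes f to be the softplus function log(1 + e^w), which is non-negative and whose derivative is Lipschitz with constant 1/4, and compares a finite-difference estimate of the curvature of f^(q+1) / (q+1) against the quantity L f(w)^q + q f(w)^(q-1) f'(w)^2.

```python
import math

L = 0.25  # Lipschitz constant of the derivative of softplus (assumed test function)


def f(w: float) -> float:
    """Softplus: a non-negative function with a Lipschitz derivative."""
    return math.log1p(math.exp(w))


def grad_f(w: float) -> float:
    """Derivative of softplus, i.e. the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-w))


def grad_g(w: float, q: float) -> float:
    """Derivative of g(w) = f(w)^(q+1) / (q+1), which equals f(w)^q * f'(w)."""
    return f(w) ** q * grad_f(w)


def lipschitz_bound(w: float, q: float) -> float:
    """Upper bound from Lemma 2: L f^q + q f^(q-1) |f'|^2."""
    return L * f(w) ** q + q * f(w) ** (q - 1) * grad_f(w) ** 2


def curvature(w: float, q: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of the second derivative of g."""
    return (grad_g(w + h, q) - grad_g(w - h, q)) / (2 * h)


checks = [(w, q, curvature(w, q), lipschitz_bound(w, q))
          for w in (-2.0, 0.0, 1.5, 3.0) for q in (0.0, 1.0, 2.0, 5.0)]
# A small relative tolerance absorbs the finite-difference error.
all_hold = all(abs(c) <= b * (1 + 1e-6) for _, _, c, b in checks)
```

For this test function the bound is tight at w = 0, where the second derivative of softplus attains its maximum of 1/4, and loose elsewhere.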

A first approach: q-FedSGD. Our first fair federated learning method, q-FedSGD, is an extension of the well-known federated mini-batch SGD (FedSGD) method [24]. q-FedSGD uses a dynamic step-size based on Lemma 2 instead of the normal fixed step-size of FedSGD. In each step of q-FedSGD, a subset of the devices is selected, and for each device $k$ in this subset, $\nabla F_k$ and $F_k$ are computed at the current iterate and communicated to the central node. This information is used to adjust the weight for combining the updates from each device based on Lemma 2. The details of q-FedSGD are summarized in Algorithm 2. It is important to note that to run q-FedSGD with different values of $q$, we only need to estimate $L$ once (for $q = 0$) and can then re-use it for all values of $q > 0$.

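  Input: $K$, $T$, $q$, $1/L$, $w^0$, $p_k$, $k = 1, \dots, m$
  for $t = 0, \dots, T-1$ do
     Server selects a subset $S_t$ of $K$ devices at random (each device $k$ is chosen with prob. $p_k$)
     Server sends $w^t$ to all selected devices
     Each selected device $k$ computes:
        $\Delta_k^t = F_k^q(w^t)\, \nabla F_k(w^t)$
        $h_k^t = q\, F_k^{q-1}(w^t)\, \|\nabla F_k(w^t)\|^2 + L\, F_k^q(w^t)$
     Each selected device $k$ sends $\Delta_k^t$ and $h_k^t$ back to the server
     Server updates $w^{t+1}$ as: $w^{t+1} = w^t - \frac{\sum_{k \in S_t} \Delta_k^t}{\sum_{k \in S_t} h_k^t}$
  end for
Algorithm 2 q-FedSGD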
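One round of q-FedSGD can be sketched as below, under our reading of Algorithm 2: each selected device sends a loss-weighted gradient F_k^q * grad F_k together with a scalar q F_k^(q-1) ||grad F_k||^2 + L F_k^q derived from Lemma 2, and the server divides the summed gradients by the summed scalars. The function names and the numeric inputs are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def device_message(loss: float, grad: Sequence[float], q: float,
                   L: float) -> Tuple[Vector, float]:
    """Compute (Delta_k, h_k) from the local loss and gradient at the current model."""
    delta = [loss ** q * g for g in grad]
    sq_norm = sum(g * g for g in grad)
    h = q * loss ** (q - 1) * sq_norm + L * loss ** q
    return delta, h


def server_update(w: Sequence[float],
                  messages: Sequence[Tuple[Vector, float]]) -> Vector:
    """New model: w minus (sum of Delta_k) divided by (sum of h_k)."""
    h_total = sum(h for _, h in messages)
    delta_total = [sum(d[i] for d, _ in messages) for i in range(len(w))]
    return [wi - di / h_total for wi, di in zip(w, delta_total)]


# Hypothetical losses and gradients reported by two selected devices.
w = [0.0, 0.0]
reports = [(1.0, [1.0, 0.0]), (4.0, [0.0, 2.0])]
L = 2.0

w_q0 = server_update(w, [device_message(F, g, q=0.0, L=L) for F, g in reports])
w_q1 = server_update(w, [device_message(F, g, q=1.0, L=L) for F, g in reports])
```

With q = 0 the update reduces to an averaged gradient step with step-size 1/L, as in FedSGD; with q = 1 the device with the larger loss dominates the direction of the step.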
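  Input: $K$, $E$, $T$, $q$, $1/L$, $\eta$, $w^0$, $p_k$, $k = 1, \dots, m$
  for $t = 0, \dots, T-1$ do
     Server selects a subset $S_t$ of $K$ devices at random (each device $k$ is chosen with prob. $p_k$)
     Server sends $w^t$ to all selected devices
     Each selected device $k$ updates $w^t$ for $E$ epochs of SGD on $F_k$ with step-size $\eta$ to obtain $\bar{w}_k^{t+1}$
     Each selected device $k$ computes:
        $\Delta w_k^t = L\,(w^t - \bar{w}_k^{t+1})$
        $\Delta_k^t = F_k^q(w^t)\, \Delta w_k^t$
        $h_k^t = q\, F_k^{q-1}(w^t)\, \|\Delta w_k^t\|^2 + L\, F_k^q(w^t)$
     Each selected device $k$ sends $\Delta_k^t$ and $h_k^t$ back to the server
     Server updates $w^{t+1}$ as: $w^{t+1} = w^t - \frac{\sum_{k \in S_t} \Delta_k^t}{\sum_{k \in S_t} h_k^t}$
  end for
Algorithm 3 q-FedAvg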
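The aggregation step of q-FedAvg can be sketched in the same style. The code below follows our reading of Algorithm 3, in which the scaled model difference L * (w - w_local) plays the role of the local gradient; the local models, losses, and function names are hypothetical, and the local SGD that would produce `w_local` is omitted.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def qfedavg_message(w: Sequence[float], w_local: Sequence[float], loss: float,
                    q: float, L: float) -> Tuple[Vector, float]:
    """Compute (Delta_k, h_k) from the global model, local model, and local loss."""
    delta_w = [L * (a - b) for a, b in zip(w, w_local)]  # surrogate for the gradient
    delta = [loss ** q * d for d in delta_w]
    h = q * loss ** (q - 1) * sum(d * d for d in delta_w) + L * loss ** q
    return delta, h


def aggregate(w: Sequence[float],
              messages: Sequence[Tuple[Vector, float]]) -> Vector:
    """New model: w minus (sum of Delta_k) divided by (sum of h_k)."""
    h_total = sum(h for _, h in messages)
    delta_total = [sum(d[i] for d, _ in messages) for i in range(len(w))]
    return [wi - di / h_total for wi, di in zip(w, delta_total)]


# Hypothetical global model, local models after local SGD, and local losses at w.
w = [1.0, 1.0]
local_models = [[0.0, 2.0], [2.0, 2.0]]
local_losses = [0.5, 3.0]

w_q0 = aggregate(w, [qfedavg_message(w, wl, F, q=0.0, L=1.0)
                     for wl, F in zip(local_models, local_losses)])
w_q2 = aggregate(w, [qfedavg_message(w, wl, F, q=2.0, L=1.0)
                     for wl, F in zip(local_models, local_losses)])
```

For q = 0 the aggregation returns the plain average of the local models, matching FedAvg. For q = 2 the new model moves toward the local model of the higher-loss device, with a more conservative step because of the larger normalizer.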

Improving communication-efficiency: q-FedAvg. In federated settings, communication-efficient schemes using local stochastic solvers (such as FedAvg, described in Section 3.1) have been shown to significantly improve convergence speed [24]. Using stochastic (as opposed to batch) methods locally is important as it enables flexibility in terms of local computation vs. communication. Unfortunately, it is not straightforward to simply apply FedAvg to problem (2) when $q > 0$, as the $F_k^{q+1}$ term prevents the use of local SGD. To address this, we propose instead optimizing $F_k$ locally. This is reasonable because minimizing $F_k$ is equivalent to minimizing $F_k^{q+1}$ (when $F_k \ge 0$ and $q \ge 0$). However, if we combined these updates by simple averaging, as in FedAvg, we would optimize (1) and not (2). Instead, we combine the local updates using the weights inferred via Lemma 2, similar to q-FedSGD. In particular, we replace the gradient of the local functions, $\nabla F_k$, in the q-FedSGD steps with the local update vectors that are obtained by running SGD locally on device $k$. This allows us to extend the local updating technique of FedAvg to the q-FFL objective (2).

We provide additional details on q-FedAvg in Algorithm 3. As we will see empirically, due to the local updating, q-FedAvg can solve the q-FFL objective more efficiently than q-FedSGD in most cases. Similar to q-FedSGD, it also does not require re-tuning the step-size when $q$ changes.

4 Evaluation

We now present empirical results of the proposed objective, q-FFL, and proposed methods, q-FedAvg and q-FedSGD. We describe our experimental setup, including the datasets used, in Section 4.1. We then demonstrate the improved fairness of q-FFL in Section 4.2, and compare q-FFL with several baseline fairness methods in Section 4.3. Finally, we show the efficiency of q-FedAvg compared with q-FedSGD in Section 4.4. All code, data, and experiments are publicly available at github.com/litian96/fair_flearn.

4.1 Experimental setups

Federated Datasets. We explore one synthetic and three non-synthetic federated datasets, using both convex and non-convex models in our experiments. The datasets are curated from prior work in federated learning [23, 24, 38] as well as recent federated learning benchmarks [5]. In particular, we first study a synthetic dataset similar to that in [36] and impose additional heterogeneity amongst 1,000 devices. We then investigate a Vehicle dataset consisting of acoustic, seismic, and infrared sensor data collected from a distributed network of 23 sensors [8]. We model each sensor as a device and train a linear SVM to distinguish between AAV-type and DW-type vehicles. In non-convex settings, we study tweets from 1,101 accounts curated from Sentiment140 [15] (Sent140), where each Twitter account corresponds to a device. We use an LSTM classifier for text sentiment analysis. Finally, we explore text data built from The Complete Works of William Shakespeare [24, 42], where each speaking role is associated with a device. We randomly subsample 31 devices, and use an LSTM to predict the next character. Full details of the datasets are given in Appendix B.1.


We implement all code in TensorFlow [2], simulating a federated network with one server and $m$ devices. We provide full details in Appendix B.2, and all hyperparameter values are given in Appendix B.2.2.

Figure 1: q-FFL leads to fairer test accuracy distributions. With $q > 0$, the distributions shift towards the center as low accuracies increase at the cost of decreasing high accuracies on some devices. Setting $q = 0$ corresponds to the original objective (Equation (1)). The selected $q$ values for $q > 0$ on the four datasets, as well as distribution statistics, are shown in Table 1.
| Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
|---|---|---|---|---|---|
| Synthetic | q = 0 | 80.8% ± .9% | 18.8% ± 5.0% | 100.0% ± 0.0% | 724 ± 72 |
| | q > 0 | 79.0% ± 1.2% | 31.1% ± 1.8% | 100.0% ± 0.0% | 472 ± 14 |
| Vehicle | q = 0 | 87.3% ± .5% | 43.0% ± 1.0% | 95.7% ± 1.0% | 291 ± 18 |
| | q > 0 | 87.7% ± .7% | 69.9% ± .6% | 94.0% ± .9% | 48 ± 5 |
| Sent140 | q = 0 | 65.1% ± 4.8% | 15.9% ± 4.9% | 100.0% ± 0.0% | 697 ± 132 |
| | q > 0 | 66.5% ± .2% | 23.0% ± 1.4% | 100.0% ± 0.0% | 509 ± 30 |
| Shakespeare | q = 0 | 51.1% ± .3% | 39.7% ± 2.8% | 72.9% ± 6.7% | 82 ± 41 |
| | q > 0 | 52.1% ± .3% | 42.1% ± 2.1% | 69.0% ± 4.4% | 54 ± 27 |

Table 1: Statistics of the test accuracy distribution for q-FFL. By setting $q > 0$, the accuracy of the worst 10% of devices is increased at the cost of possibly decreasing the accuracy of the best 10% of devices. While the average accuracy remains similar, the variance of the final accuracy distribution decreases.

4.2 Fairness of q-FFL

In our first experiments, we verify that the proposed objective q-FFL leads to more fair solutions (according to Definition 1) for federated data. In Figure 1, we compare the final testing accuracy distributions of two objectives ($q = 0$ and a tuned value of $q > 0$) averaged across 5 random shuffles of each dataset. We observe that while the average testing accuracy remains fairly consistent, the objectives with $q > 0$ result in more centered (i.e., fair) testing accuracy distributions with lower variance. In particular, while maintaining roughly the same average accuracy, q-FFL reduces the variance of accuracies across all devices by 45% on average. We further report the worst and best 10% testing accuracies and the variance of the final accuracies in Table 1. Comparing $q = 0$ and $q > 0$, we see that the average testing accuracy remains almost unchanged with the proposed objective despite significant reductions in variance. We see similar results on training accuracy distributions in Figure 4 and Table 4, Appendix B.3. Here, the average accuracy is with respect to all data points, not all devices. We observe similar results with respect to devices, as shown in Table 5, Appendix B.3.
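The 45% figure can be reproduced from the variance column of Table 1. The short calculation below uses only the mean variances reported there (ignoring the reported spread across runs) and averages the relative reductions over the four datasets; the variable names are ours.

```python
# Mean variance of test accuracy from Table 1: (q = 0, q > 0) per dataset.
variance = {
    "Synthetic": (724, 472),
    "Vehicle": (291, 48),
    "Sent140": (697, 509),
    "Shakespeare": (82, 54),
}

# Relative reduction in variance for each dataset.
reductions = {name: (base - fair) / base for name, (base, fair) in variance.items()}

# Unweighted average over the four datasets.
average_reduction = sum(reductions.values()) / len(reductions)
```

The per-dataset reductions range from roughly 27% (Sent140) to roughly 84% (Vehicle), and their unweighted mean is about 44.9%, consistent with the reported 45%.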

Choosing $q$. As discussed in Section 3.3, a natural question is to determine how $q$ should be tuned in the q-FFL objective. The framework is flexible in that it allows one to choose $q$ to trade off between reduced variance of the accuracy distribution and a high average accuracy. The larger $q$ is, the more fairness could be imposed, though the average accuracy may potentially suffer. In general, this value can be tuned based on the data/application at hand and the desired amount of fairness. In particular, a reasonable approach in practice would be to run Algorithm 3 with multiple $q$'s in parallel to obtain multiple final global models, and then select amongst these based on performance (e.g., accuracy) on the validation data. Rather than selecting just one optimal $q$ from this procedure, each device could also pick a device-specific model based on its validation data. We show additional performance improvements with this device-specific strategy in Table 6 in Appendix B.3. Finally, we note that one potential issue is that increasing the value of $q$ may slow the speed of convergence. However, for values of $q$ that result in more fair results on our datasets, we do not observe significant convergence slowdown, as shown in Figure 5, Appendix B.3.
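The selection procedure described above can be sketched as follows. The validation accuracies are hypothetical, the helper names are ours, and mean validation accuracy is only one possible selection criterion for the single global model.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical validation accuracy of each device under models trained with different q.
val_acc: Dict[float, List[float]] = {
    0.0: [0.90, 0.40, 0.80],
    1.0: [0.85, 0.60, 0.82],
    5.0: [0.70, 0.65, 0.70],
}


def pick_global_q(val_acc: Dict[float, List[float]]) -> float:
    """Select one q for all devices by mean validation accuracy (assumed criterion)."""
    return max(val_acc, key=lambda q: mean(val_acc[q]))


def pick_device_q(val_acc: Dict[float, List[float]]) -> List[float]:
    """Let each device pick the q whose model does best on its own validation data."""
    num_devices = len(next(iter(val_acc.values())))
    return [max(val_acc, key=lambda q: val_acc[q][k]) for k in range(num_devices)]


global_q = pick_global_q(val_acc)
device_qs = pick_device_q(val_acc)
```

In this toy example the single best global choice is q = 1.0, while the device-specific strategy lets the three devices select different models.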

4.3 Comparison with other baselines

Next, we compare q-FFL with other baselines that are likely to impose fairness in federated networks. One heuristic is to weight each data point equally, which reduces to the original objective in (1) (i.e., q-FFL with $q = 0$) and has been investigated in Section 4.2. We additionally compare with two alternatives: weighting devices equally when sampling devices, and weighting devices adversarially, namely, optimizing for the performance of the device with the largest loss, as proposed in [26].

Weighting devices equally. We compare q-FFL with uniform sampling schemes and report testing accuracy in Figure 2. A table with the final accuracies and variances is given in the appendix in Table 8. While the 'weighting each device equally' heuristic tends to outperform our method in training accuracy distributions (Figure 6 and Table 7 in Appendix B.3), we see that our method produces more fair solutions in terms of testing accuracies. One explanation for this is that uniform sampling is a static method and can easily overfit to devices with very few data points, whereas q-FFL has better generalization properties due to its dynamic nature.

Figure 2: q-FFL ($q > 0$) compared with uniform sampling. In terms of testing accuracy, our objective produces more fair solutions than uniform sampling. Distribution statistics are provided in Table 8.
| Objectives | Adult: average | Adult: Dr. | Adult: non-Dr. | Fashion MNIST: average | Fashion MNIST: shirt | Fashion MNIST: pullover | Fashion MNIST: T-shirt |
|---|---|---|---|---|---|---|---|
| q-FFL, $q = 0$ | 83.2% ± .1% | 69.9% ± .4% | 83.3% ± .1% | 78.8% ± .2% | 66.0% ± .7% | 84.5% ± .8% | 85.9% ± .7% |
| AFL | 82.5% ± .5% | 73.0% ± 2.2% | 82.6% ± .5% | 77.8% ± 1.2% | 71.4% ± 4.2% | 81.0% ± 3.6% | 82.1% ± 3.9% |
| q-FFL, $q_1 > 0$ | 82.6% ± .1% | 74.1% ± .6% | 82.7% ± .1% | 77.8% ± .2% | 74.2% ± .3% | 78.9% ± .4% | 80.4% ± .6% |
| q-FFL, $q_2 > q_1$ | 82.3% ± .1% | 74.4% ± .9% | 82.4% ± .1% | 77.1% ± .4% | 74.7% ± .9% | 77.9% ± .4% | 78.7% ± .6% |

Table 2: Our objective compared with the baseline of weighting devices adversarially (AFL [26]) on two public datasets used in AFL. q-FFL outperforms AFL on the worst testing accuracy of both datasets. The tunable parameter $q$ controls how much fairness we would like to achieve (larger $q$ induces less variance). Each accuracy is averaged across 5 runs with different random initializations.
Figure 3: For a fixed objective (i.e., the same $q$), the convergence of q-FedAvg (Algorithm 3), q-FedSGD (Algorithm 2), and FedSGD. For q-FedAvg and q-FedSGD, we tune a best step-size on $q = 0$ and apply that step-size to solve q-FFL with any $q > 0$. For FedSGD, we tune the step-size directly. We observe that (1) in most cases, q-FedAvg converges faster in terms of communication rounds; (2) the proposed q-FedSGD solver achieves similar performance compared with a best tuned FedSGD.

Weighting devices adversarially. We further compare with AFL [26], which is the only work we are aware of that aims to address fairness issues in federated learning. We implement a non-stochastic version of AFL where all devices are selected and updated each round, and perform grid search on the AFL hyperparameters, $\gamma_w$ and $\gamma_\lambda$. Full details of the implementation and hyperparameters (e.g., values of $q_1$ and $q_2$) are provided in Appendix B.2.2. In order to draw a fair comparison, we modify Algorithm 3 by sampling all devices and letting each of them run gradient descent at each round, using the same public datasets (Adult and Fashion MNIST) as in [26]. We note that, as opposed to AFL, q-FFL is flexible depending on the amount of fairness desired, with larger $q$ leading to smaller accuracy variance. As discussed, q-FFL generalizes AFL in this regard, as AFL is equivalent to q-FFL with a large enough $q$, where the device with the largest local loss dominates the global objective. In Table 2, we observe that q-FFL can actually achieve higher testing accuracy on the device with the worst performance than AFL when $q$ is set appropriately. Interestingly, we also observe that q-FFL converges faster in terms of communication rounds compared with AFL to obtain similar performance (Appendix B.3), which we suspect is due to the non-smoothness of the AFL objective.
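The statement that a large enough q recovers the worst-device objective can be illustrated numerically. The sketch below evaluates the generalized mean (sum_k p_k F_k^(q+1))^(1/(q+1)), a monotone transformation of the q-FFL objective for fixed losses, on hypothetical loss values; the function name is ours, and this is an illustration of the limiting behavior rather than part of the experimental comparison.

```python
from typing import Sequence


def generalized_mean_loss(losses: Sequence[float], weights: Sequence[float],
                          q: float) -> float:
    """(sum_k p_k F_k^(q+1))^(1/(q+1)), i.e. ((q+1) * f_q)^(1/(q+1))."""
    total = sum(p * F ** (q + 1) for p, F in zip(weights, losses))
    return total ** (1.0 / (q + 1))


# Hypothetical local losses on three equally weighted devices.
losses = [0.2, 0.5, 2.0]
weights = [1 / 3, 1 / 3, 1 / 3]

values = {q: generalized_mean_loss(losses, weights, q) for q in (0, 1, 10, 200)}
worst_loss = max(losses)
```

At q = 0 the value is the weighted average loss (0.9 here); as q grows it increases monotonically and approaches the largest local loss, so minimizing it increasingly resembles minimizing the loss of the worst device.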

4.4 Efficiency of q-FedAvg

Finally, we show the efficiency of q-FedAvg by comparing Algorithm 3 with its non-local-updating baseline q-FedSGD (Algorithm 2) with the same objective (same $q > 0$ values as in Table 1). At each communication round, q-FedAvg runs one epoch of local updates on each selected device, while q-FedSGD runs gradient descent using the local training data. In Figure 3, q-FedAvg converges faster than q-FedSGD in terms of communication rounds in most cases due to its local updating scheme. The slower convergence of q-FedAvg compared with q-FedSGD on the synthetic dataset may be due to the fact that when local data distributions are highly heterogeneous, local updating schemes may allow local models to move too far away from the initial global model, potentially hurting convergence; see Figure 8 in Appendix B.3 for more details. We also compare our solver q-FedSGD with FedSGD with a best-tuned step-size. q-FedSGD performs similarly to FedSGD, which indicates that (the inverse of) our estimated Lipschitz constant on $q > 0$ is as good as a best tuned fixed step size. We can reuse this estimation for different $q$'s instead of manually re-tuning it when $q$ changes. We note here that the number of rounds is a reasonable metric for comparison between these methods as they process the same amount of data and perform an equivalent amount of communication at each round. Both proposed methods, q-FedAvg and q-FedSGD, can be easily integrated into existing implementations of federated learning algorithms such as TensorFlow Federated [1].

5 Conclusion

In this work, we propose q-FFL, a novel optimization objective inspired by fair resource allocation strategies in wireless networks that encourages more fair accuracy distributions in federated learning. We develop an efficient and scalable method, q-FedAvg, to solve this objective that is amenable to current federated optimization frameworks. Through our extensive experiments on federated datasets, we validate the resulting fairness, flexibility, and efficiency of our proposed approaches compared with existing baselines.


We thank Sebastian Caldas, Neel Guha, Anit Kumar Sahu, Eric Tan, and Samuel Yeom for their helpful comments. This work was supported in part by the National Science Foundation grant IIS1838017, a Google Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency.


  • [1] Tensorflow federated: Machine learning on decentralized data. URL https://www.tensorflow.org/federated.
  • Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. K. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In Operating Systems Design and Implementation, pages 265–283, 2016.
  • Agarwal et al. [2018] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60–69, 2018.
  • Bonawitz et al. [2019] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselande. Towards federated learning at scale: System design. In Conference on Systems and Machine Learning, 2019.
  • Caldas et al. [2018] S. Caldas, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018.
  • Cotter et al. [2019] A. Cotter, H. Jiang, S. Wang, T. Narayan, M. Gupta, S. You, and K. Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 2019.
  • Dashti et al. [2013] M. Dashti, P. Azmi, and K. Navaie. Harmonic mean rate fairness for cognitive radio networks with heterogeneous traffic. Transactions on Emerging Telecommunications Technologies, 24(2):185–195, 2013.
  • Duarte and Hu [2004] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826–838, 2004.
  • Dwork et al. [2012] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science, pages 214–226, 2012.
  • Ee and Bajcsy [2004] C. T. Ee and R. Bajcsy. Congestion control and fairness for many-to-one routing in sensor networks. In International Conference on Embedded Networked Sensor Systems, pages 148–161, 2004.
  • Eryilmaz and Srikant [2006] A. Eryilmaz and R. Srikant. Joint congestion control, routing, and mac for stability and fairness in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1514–1524, 2006.
  • Feldman [2015] M. Feldman. Computational fairness: Preventing machine-learned discrimination. 2015.
  • Feldman et al. [2015] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.
  • Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Go et al. [2009] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12), 2009.
  • Hahne [1991] E. L. Hahne. Round-robin scheduling for max-min fairness in data networks. IEEE Journal on Selected Areas in communications, 9(7):1024–1039, 1991.
  • Hardt et al. [2016] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
  • Hashimoto et al. [2018] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938, 2018.
  • Jain et al. [1984] R. K. Jain, D.-M. W. Chiu, and W. R. Hawe. A quantitative measure of fairness and discrimination. Eastern Research Laboratory, Digital Equipment Corporation, Hudson, MA, 1984.
  • Kelly [1997] F. Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8(1):33–37, 1997.
  • Kelly et al. [1998] F. P. Kelly, A. K. Maulloo, and D. K. Tan. Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research society, 49(3):237–252, 1998.
  • Lan et al. [2010] T. Lan, D. Kao, M. Chiang, and A. Sabharwal. An axiomatic theory of fairness in network resource allocation. In Conference on Information Communications, pages 1343–1351, 2010.
  • Li et al. [2018] T. Li, A. K. Sahu, M. Sanjabi, M. Zaheer, A. Talwalkar, and V. Smith. Federated optimization for heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
  • McMahan et al. [2017] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.
  • Mo and Walrand [2000] J. Mo and J. Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on networking, (5):556–567, 2000.
  • Mohri et al. [2019] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In International Conference on Machine Learning, 2019.
  • Nandagopal et al. [2000] T. Nandagopal, T.-E. Kim, X. Gao, and V. Bharghavan. Achieving mac layer fairness in wireless packet networks. In International Conference on Mobile Computing and Networking, pages 87–98, 2000.
  • Neely et al. [2008] M. J. Neely, E. Modiano, and C.-P. Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions On Networking, 16(2):396–409, 2008.
  • Nesterov [2013] Y. Nesterov. Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2013.
  • Pennington et al. [2014] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
  • Radunovic and Le Boudec [2007] B. Radunovic and J.-Y. Le Boudec. A unified framework for max-min and min-max fairness with applications. IEEE/ACM Transactions on Networking, 15(5):1073–1083, 2007.
  • Recht et al. [2011] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  • Rényi et al. [1961] A. Rényi et al. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961.
  • Sanjabi et al. [2014] M. Sanjabi, M. Razaviyayn, and Z.-Q. Luo. Optimal joint base station assignment and beamforming for heterogeneous networks. IEEE Transactions on Signal Processing, 62(8):1950–1961, 2014.
  • Shalev-Shwartz and Zhang [2013] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 378–385, 2013.
  • Shamir et al. [2014] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
  • Shi et al. [2014] H. Shi, R. V. Prasad, E. Onur, and I. Niemegeers. Fairness in wireless networks: Issues, measures and challenges. IEEE Communications Surveys and Tutorials, 16(1):5–24, 2014.
  • Smith et al. [2017] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434, 2017.
  • Smith et al. [2018] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
  • Stich [2019] S. U. Stich. Local sgd converges fast and communicates little. In International Conference on Learning Representations, 2019.
  • Wang and Joshi [2018] J. Wang and G. Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
  • [42] William Shakespeare. The Complete Works of William Shakespeare. Publicly available at //www.gutenberg.org/ebooks/100.
  • Woodworth et al. [2017] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920–1953, 2017.
  • Woodworth et al. [2018] B. E. Woodworth, J. Wang, A. Smith, B. McMahan, and N. Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems, pages 8496–8506, 2018.
  • Yu et al. [2019] H. Yu, S. Yang, and S. Zhu. Parallel restarted sgd for non-convex optimization with faster convergence and less communication. In AAAI Conference on Artificial Intelligence, 2019.
  • Zafar et al. [2017a] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, pages 1171–1180, 2017a.
  • Zafar et al. [2017b] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, pages 962–970, 2017b.

Appendix A α-fairness and q-FFL

As discussed in Section 2, while it is natural to consider the α-fairness framework for machine learning, we are unaware of any work that uses α-fairness to modify machine learning training objectives. We provide additional details on the framework below; for further background on α-fairness and fairness in resource allocation more generally, we refer the reader to [37, 25].

α-fairness [22, 25] is a popular fairness metric widely used in resource allocation problems. The framework defines a family of overall utility functions that can be derived by summing up the following function of the individual utilities of the users in the network:

$$
U_\alpha(x) =
\begin{cases}
\ln(x) & \text{if } \alpha = 1, \\
\frac{1}{1-\alpha}\, x^{1-\alpha} & \text{if } \alpha \geq 0,\ \alpha \neq 1.
\end{cases}
$$

Here U_α(x) represents the individual utility of some specific user given x allocated resources (e.g., bandwidth). The goal is to find a resource allocation strategy that maximizes the sum of the individual utilities. This family of functions includes a wide range of popular fair resource allocation strategies. In particular, the above function represents zero fairness with α = 0, proportional fairness [20] with α = 1, harmonic mean fairness [7] with α = 2, and max-min fairness [31] with α → +∞.
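As a small illustration of how the fairness parameter changes which allocation is preferred, the sketch below assumes the standard α-fair utility (log x for α = 1 and x^(1-α)/(1-α) otherwise) and compares an unequal and an equal split of the same total resource. The function names and the example allocations are hypothetical.

```python
import math

def alpha_fair_utility(x, alpha):
    """Standard alpha-fair utility of a single user with allocation x > 0."""
    if x <= 0:
        raise ValueError("allocation must be positive")
    if alpha == 1:
        return math.log(x)  # proportional fairness
    return x ** (1 - alpha) / (1 - alpha)

def total_utility(allocation, alpha):
    """Sum of individual utilities, the quantity being maximized."""
    return sum(alpha_fair_utility(x, alpha) for x in allocation)

unequal = [1.0, 9.0]  # same total resource (10 units), split unevenly
equal = [5.0, 5.0]    # same total resource, split evenly

for alpha in (0, 1, 2):
    print(alpha, total_utility(unequal, alpha), total_utility(equal, alpha))
```

With α = 0 the two allocations have the same total utility, so the objective is indifferent to how the resource is split; for α = 1 and α = 2 the equal split scores strictly higher.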

Note that in federated learning, we are dealing with costs and not utilities. Thus, max-min in resource allocation corresponds to min-max in our setting. With this analogy, it is clear that in our proposed objective q-FFL (2), the case where q → +∞ corresponds to min-max fairness, since it optimizes for the worst-performing device, similar to what was proposed in [26]. Also, q = 0 corresponds to zero fairness, which reduces to the original FedAvg objective (1). In resource allocation problems, α can be tuned to trade off fairness against system efficiency. In federated settings, q can be tuned based on the desired level of fairness (i.e., lower variance of accuracy distributions) and other performance metrics such as the overall accuracy. For instance, in Table 2 in Section 4.3, we demonstrate on two datasets that as q increases, the overall average accuracy decreases slightly while the worst accuracies increase significantly and the variance of the accuracies decreases.
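The effect of q on the objective can be sketched numerically. The example below assumes that objective (2) has the form f_q(w) = sum_k p_k F_k(w)^(q+1) / (q+1), so that the gradient contribution of device k is proportional to p_k F_k^q. The helper names and the loss values are hypothetical and are not taken from the experiments.

```python
import numpy as np

def qffl_objective(losses, weights, q):
    """Assumed q-FFL objective: sum_k p_k * F_k^(q+1) / (q+1)."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * losses ** (q + 1) / (q + 1)))

def qffl_gradient_weights(losses, weights, q):
    """Relative weight of each device's gradient, proportional to p_k * F_k^q."""
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    raw = weights * losses ** q
    return raw / raw.sum()

losses = [0.5, 2.0]   # hypothetical local losses F_k(w)
weights = [0.5, 0.5]  # hypothetical device weights p_k

for q in (0, 1, 5):
    print(q, qffl_objective(losses, weights, q),
          qffl_gradient_weights(losses, weights, q))
```

For q = 0 the gradient weights equal p_k, as in the FedAvg objective; as q grows, almost all of the weight moves to the device with the highest loss, which is the min-max behaviour described above.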

Appendix B Experimental Details

B.1 Datasets and Models

We provide full details on the datasets and models used in our experiments. The statistics of the four federated datasets are summarized in Table 3. We report the total number of devices, the total number of samples, and the mean and standard deviation of the number of data points on each device. Additional details on the datasets and models are described below.

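  • Synthetic: We follow a setup similar to that in [36] and impose additional heterogeneity. The model is $y = \mathrm{argmax}(\mathrm{softmax}(Wx + b))$, with $x \in \mathbb{R}^{60}$, $W \in \mathbb{R}^{10 \times 60}$, and $b \in \mathbb{R}^{10}$, and the goal is to learn a global $W$ and $b$. Samples $(X_k, Y_k)$ and local models on each device $k$ satisfy $W_k \sim \mathcal{N}(u_k, 1)$, $b_k \sim \mathcal{N}(u_k, 1)$, $u_k \sim \mathcal{N}(0, 1)$; $x_k \sim \mathcal{N}(v_k, \Sigma)$, where the covariance matrix $\Sigma$ is diagonal with $\Sigma_{j,j} = j^{-1.2}$. Each element of $v_k$ is drawn from $\mathcal{N}(B_k, 1)$, $B_k \sim \mathcal{N}(0, 1)$. There are 100 devices in total, and the number of samples on each device follows a power law.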

  • Vehicle (http://www.ecs.umass.edu/~mduarte/Software.html): We use the same Vehicle Sensor (Vehicle) dataset as [38], modelling each sensor as a device. Each sample has a 100-dimensional feature vector and a binary label indicating whether the sample comes from an AAV-type or DW-type vehicle. We train a linear SVM, tune its hyperparameters, and report the best configuration.

  • Sent140: This dataset is a collection of tweets from Sentiment140 [15] (Sent140). The task is text sentiment analysis, which we model as a binary classification problem. The model takes as input a 25-word sequence, embeds each word into a 300-dimensional space using pretrained GloVe [30], and outputs a binary label after two LSTM layers and one densely-connected layer.

  • Shakespeare: This dataset is built from The Complete Works of William Shakespeare [24, 42]. Each speaking role in the plays is associated with a device. We subsample 31 speaking roles to train a deep model for next character prediction. The model takes as input an 80-character sequence, embeds each character into a learnt 8-dimensional space, and outputs one character after two LSTM layers and one densely-connected layer.
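The generation procedure for the Synthetic dataset described above can be sketched as follows. This is an illustration under stated assumptions rather than the released generator: the dimensions (60 features, 10 classes), the distribution parameters, and in particular the Pareto draw used for the power-law device sizes (floor of 50 samples, cap of 1000) are assumptions, and `make_synthetic_devices` is a hypothetical name.

```python
import numpy as np

def make_synthetic_devices(num_devices=100, dim=60, num_classes=10, seed=0):
    """Generate heterogeneous per-device data from device-specific linear models."""
    rng = np.random.default_rng(seed)
    # Assumed power law: Pareto-distributed sizes, at least 50 and at most 1000.
    sizes = np.minimum(50 * (1 + rng.pareto(1.5, size=num_devices)), 1000).astype(int)
    # Diagonal covariance with entries j^(-1.2), j = 1..dim.
    cov_diag = np.arange(1, dim + 1, dtype=float) ** -1.2
    devices = []
    for n_k in sizes:
        u_k = rng.normal(0.0, 1.0)                          # device-level model mean
        W_k = rng.normal(u_k, 1.0, size=(dim, num_classes))  # local weights
        b_k = rng.normal(u_k, 1.0, size=num_classes)         # local bias
        B_k = rng.normal(0.0, 1.0)                          # device-level feature mean
        v_k = rng.normal(B_k, 1.0, size=dim)
        # x ~ N(v_k, Sigma) with diagonal Sigma.
        X_k = v_k + rng.normal(size=(n_k, dim)) * np.sqrt(cov_diag)
        # argmax(softmax(z)) equals argmax(z), so the softmax is skipped.
        y_k = np.argmax(X_k @ W_k + b_k, axis=1)
        devices.append((X_k, y_k))
    return devices
```

Because both the local model (W_k, b_k) and the feature distribution (v_k) are drawn per device, the devices differ in both their labelling function and their inputs, which is the source of the heterogeneity discussed in the text.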

| Dataset | Devices | Samples | Samples/device (mean) | Samples/device (stdev) |
|---|---|---|---|---|
| Synthetic | 100 | 12,697 | 127 | 73 |
| Vehicle | 23 | 43,695 | 1,899 | 349 |
| Sent140 | 1,101 | 58,170 | 53 | 32 |
| Shakespeare | 31 | 116,214 | 3,749 | 6,912 |

Table 3: Statistics of Federated Datasets

B.2 Implementation Details

B.2.1 Machines & Software

We simulate the federated setting (one server and devices) on a server with 2 Intel Xeon E5-2650 v4 CPUs and 8 NVidia 1080Ti GPUs. We implement all code in TensorFlow [2] Version 1.10.1.
Please see github.com/litian96/fair_flearn for full details.

B.2.2 Hyperparameters

We randomly split the data on each local device into an 80% training set, a 10% testing set, and a 10% validation set. We tune an optimal q over a set of candidate values on the validation set and report accuracy distributions on the testing set; by optimal we mean the setting where the variance of accuracy decreases the most while the overall average accuracy remains unchanged. For each dataset, we repeat this process for five randomly selected train/test/validation splits, and report the mean and standard deviation across these five runs where applicable. For Synthetic, Vehicle, Sent140, and Shakespeare, the optimal q values are 1, 5, 1, and 0.001, respectively. For all datasets, we randomly sample 10 devices each round. We tune the learning rate and batch size on FedAvg and use the same learning rate and batch size for all q-FedAvg experiments on that dataset. The learning rates for Synthetic, Vehicle, Sent140, and Shakespeare are 0.1, 0.01, 0.03, and 0.8, respectively. The batch sizes for Synthetic, Vehicle, Sent140, and Shakespeare are 10, 64, 32, and 10, respectively. In comparing q-FedAvg's efficiency with q-FedSGD, we also tune a best learning rate for q-FedSGD methods on q = 0. For each comparison, we fix the devices selected and the mini-batch orders across all runs. We stop training when the training loss does not decrease for 10 rounds. When running AFL methods, we search for the best step sizes such that AFL achieves the highest testing accuracy on the device with the highest loss within a fixed number of rounds. For Adult, we use and ; for Fashion MNIST, we use and . We use the same as step-sizes for q-FedAvg on Adult and Fashion MNIST. In Table 2, for q-FFL on Adult and for q-FFL on Fashion MNIST. The number of local epochs is fixed to 1 whenever we do local updates.
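Two of the procedures in this paragraph, the per-device 80/10/10 split and the stopping rule, can be sketched as below. This is a hedged illustration: the helper names are hypothetical, the rounding of split sizes is an assumption, and the stopping rule is one plausible reading of "the training loss does not decrease for 10 rounds".

```python
import numpy as np

def split_device_data(num_samples, seed, train_frac=0.8, test_frac=0.1):
    """Return disjoint index arrays (train, test, validation) for one device."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    n_train = int(round(train_frac * num_samples))
    n_test = int(round(test_frac * num_samples))
    train = perm[:n_train]
    test = perm[n_train:n_train + n_test]
    val = perm[n_train + n_test:]  # remaining samples, roughly 10%
    return train, test, val

def should_stop(train_losses, patience=10):
    """True if no loss in the last `patience` rounds improved on the earlier best."""
    if len(train_losses) <= patience:
        return False
    best_before = min(train_losses[:-patience])
    return min(train_losses[-patience:]) >= best_before
```

Repeating the split with five different seeds yields the five train/test/validation splits over which the mean and standard deviation are reported.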

B.3 Additional Experiments

Fairness of q-FFL with respect to training accuracy. The empirical results in Section 4 are with respect to testing accuracy. As a sanity check, we show in Figure 4 and Table 4 that q-FFL also results in more fair training accuracy distributions.

Figure 4: q-FFL results in more centered (i.e., fair) training accuracy distributions across devices.

| Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
|---|---|---|---|---|---|
| Synthetic | q = 0 | 81.7% ± 0.3% | 23.6% ± 1.1% | 100.0% ± 0.0% | 597 ± 10 |
| | q = 1 | 78.9% ± 0.2% | 41.8% ± 1.0% | 96.8% ± 0.5% | 292 ± 11 |
| Vehicle | q = 0 | 87.5% ± 0.2% | 49.5% ± 10.2% | 94.9% ± 0.7% | 237 ± 97 |
| | q = 5 | 87.8% ± 0.5% | 71.3% ± 2.2% | 93.1% ± 1.4% | 37 ± 12 |
| Sent140 | q = 0 | 69.8% ± 0.8% | 36.9% ± 3.1% | 94.4% ± 1.1% | 278 ± 44 |
| | q = 1 | 68.2% ± 0.6% | 46.0% ± 0.3% | 88.8% ± 0.8% | 143 ± 4 |
| Shakespeare | q = 0 | 72.7% ± 0.8% | 46.4% ± 1.4% | 79.7% ± 0.9% | 116 ± 8 |
| | q = 0.001 | 66.7% ± 1.2% | 48.0% ± 0.4% | 71.2% ± 1.9% | 56 ± 9 |

Table 4: More statistics showing that q-FFL results in more fair training accuracy distributions. We see that under the q-FFL (q > 0) objective, the worst 10% training accuracy increases and the variance of training accuracies is smaller.
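For readers who want to reproduce statistics of this kind from their own runs, the sketch below computes the four quantities from a vector of per-device accuracies. It rests on assumptions: accuracies are expressed in percentage points before taking the variance, the average is unweighted over devices, and the worst and best 10% are the means over the lowest and highest tenth of devices. The function name is hypothetical.

```python
import numpy as np

def accuracy_stats(device_accuracies):
    """Summarize per-device accuracies (fractions in [0, 1]) in percentage points."""
    acc = np.sort(np.asarray(device_accuracies, dtype=float) * 100.0)
    k = max(1, int(round(0.1 * len(acc))))  # number of devices in each 10% tail
    return {
        "average": float(acc.mean()),
        "worst_10": float(acc[:k].mean()),
        "best_10": float(acc[-k:].mean()),
        "variance": float(acc.var()),  # population variance, in squared points
    }

stats = accuracy_stats([i / 10 for i in range(10)])
print(stats)
```

A fairer solution in the sense used here shows a higher `worst_10` value and a lower `variance` at a similar `average`.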

Average testing accuracy with respect to devices. In Section 4.2, we show that q-FFL leads to more fair accuracy distributions while maintaining approximately the same testing accuracies. Note that we report average testing accuracy with respect to all data points in Table 1. However, we observe similar results on average accuracy with respect to all devices between the q = 0 and q > 0 objectives, as shown in Table 5.

| Dataset | Objective | Accuracy w.r.t. Data Points | Accuracy w.r.t. Devices |
|---|---|---|---|
| Synthetic | q = 0 | 80.8% ± 0.9% | 77.3% ± 0.6% |
| | q = 1 | 79.0% ± 1.2% | 76.3% ± 1.7% |
| Vehicle | q = 0 | 87.3% ± 0.5% | 85.6% ± 0.4% |
| | q = 5 | 87.7% ± 0.7% | 86.5% ± 0.7% |
| Sent140 | q = 0 | 65.1% ± 4.8% | 64.6% ± 4.5% |
| | q = 1 | 66.5% ± 0.2% | 66.2% ± 0.2% |
| Shakespeare | q = 0 | 51.1% ± 0.3% | 61.4% ± 2.7% |
| | q = 0.001 | 52.1% ± 0.3% | 60.0% ± 0.5% |

Table 5: Average testing accuracy under q-FFL objectives. We show that the resulting solutions of the q = 0 and q > 0 objectives have approximately the same accuracies, both with respect to all data points and with respect to all devices.
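The two averaging conventions in Table 5 differ only in how devices are weighted. A minimal sketch, with hypothetical counts and a hypothetical function name, is:

```python
import numpy as np

def average_accuracies(correct_counts, sample_counts):
    """Return (accuracy over all data points, accuracy averaged over devices)."""
    correct = np.asarray(correct_counts, dtype=float)
    total = np.asarray(sample_counts, dtype=float)
    per_data_point = correct.sum() / total.sum()  # devices weighted by size
    per_device = (correct / total).mean()         # every device counts equally
    return float(per_data_point), float(per_device)

# Hypothetical example: one large accurate device and one small inaccurate one.
print(average_accuracies([90, 5], [100, 10]))
```

When device sizes are very uneven, as for Shakespeare in Table 3, the two numbers can differ noticeably even for the same model.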

Device-specific q. In these experiments, we explore a device-specific strategy for selecting q in q-FFL. We solve q-FFL with several values of q in parallel. After training, each device selects the best resulting model based on its validation data and tests the performance of that model on its testing set. We report the results in terms of testing accuracy in Table 6. Interestingly, with this device-specific strategy the average accuracy in fact increases while the accuracy variance is reduced, in comparison with q = 0. We note that this strategy does induce more local computation and additional communication load at each round. However, it does not increase the number of communication rounds if run in parallel.

| Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
|---|---|---|---|---|---|
| Vehicle | q = 0 | 87.3% ± 0.5% | 43.0% ± 1.0% | 95.7% ± 1.0% | 291 ± 18 |
| | q = 5 | 87.7% ± 0.7% | 69.9% ± 0.6% | 94.0% ± 0.9% | 48 ± 5 |
| | multiple q's | 88.5% ± 0.3% | 70.0% ± 2.0% | 95.8% ± 0.6% | 52 ± 7 |
| Shakespeare | q = 0 | 51.1% ± 0.3% | 39.7% ± 2.8% | 72.9% ± 6.7% | 82 ± 41 |
| | q = 0.001 | 52.1% ± 0.3% | 42.1% ± 2.1% | 69.0% ± 4.4% | 54 ± 27 |
| | multiple q's | 52.0% ± 1.5% | 41.0% ± 4.3% | 72.0% ± 4.8% | 72 ± 32 |

Table 6: Effects of running q-FFL with several q's in parallel. Multiple global models (corresponding to different q's) are maintained independently during the training process. While this adds local computation and communication load per round, the device-specific strategy has the added benefit of simultaneously increasing the accuracies of the devices with the worst 10% accuracies and of the devices with the best 10% accuracies.
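The device-specific selection step can be sketched as follows, assuming the per-device validation and testing accuracies of every candidate model have already been collected into arrays of shape (number of q values, number of devices). The function name and the example numbers are hypothetical.

```python
import numpy as np

def select_model_per_device(val_acc, test_acc):
    """For each device, pick the model with the best validation accuracy.

    val_acc, test_acc : arrays of shape (num_q_values, num_devices)
    Returns the chosen model index per device and its testing accuracy.
    """
    val_acc = np.asarray(val_acc, dtype=float)
    test_acc = np.asarray(test_acc, dtype=float)
    best = val_acc.argmax(axis=0)  # index of the chosen q for each device
    chosen_test = test_acc[best, np.arange(val_acc.shape[1])]
    return best, chosen_test

# Hypothetical accuracies for two candidate q values and two devices.
val = [[0.90, 0.40], [0.80, 0.70]]
test = [[0.85, 0.30], [0.70, 0.65]]
print(select_model_per_device(val, test))
```

The selection uses validation data only, so the reported testing accuracies are not used to choose the model.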

Convergence speed of q-FFL. In Section 4.2, we show that our solver q-FedAvg, which uses local updating schemes, converges significantly faster than q-FedSGD. A natural question is whether the q-FFL objective with q > 0 slows convergence compared with FedAvg. We empirically investigate this on the four datasets. We use q-FedAvg to solve q-FFL, and compare it with FedAvg (i.e., solving q-FFL with q = 0). As demonstrated in Figure 5, the q values that result in more fair solutions do not significantly slow down convergence.

Figure 5: The convergence speed of q-FFL compared with FedAvg. We plot the distance to the highest accuracy achieved versus communication rounds. Although q-FFL with q > 0 is a more difficult optimization problem, for the q values we choose that could lead to more fair results, the convergence speed is comparable to that of q = 0.

Comparison with uniform sampling. In Figure 6 and Table 7, we show that in terms of training accuracies, the uniform sampling heuristic outperforms q-FFL (as opposed to the testing accuracy results in Section 4). We suspect that this is because the uniform sampling baseline is a static method and is likely to overfit to those devices with few samples. In addition to Figure 2 in Section 4.3, we also report the average testing accuracy with respect to data points, the best 10% and worst 10% accuracies, and the variance in Table 8.

Figure 6: q-FFL (q > 0) compared with uniform sampling in training accuracy. We see that in most cases uniform sampling has higher (and more fair) training accuracies because it overfits to devices with few samples.

| Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
|---|---|---|---|---|---|
| Synthetic | uniform | 83.5% ± 0.2% | 42.6% ± 1.4% | 100.0% ± 0.0% | 366 ± 17 |
| | q = 1 | 78.9% ± 0.2% | 41.8% ± 1.0% | 96.8% ± 0.5% | 292 ± 11 |
| Vehicle | uniform | 87.3% ± 0.3% | 46.6% ± 0.8% | 94.8% ± 0.5% | 261 ± 10 |
| | q = 5 | 87.8% ± 0.5% | 71.3% ± 2.2% | 93.1% ± 1.4% | 122 ± 12 |
| Sent140 | uniform | 69.1% ± 0.5% | 42.2% ± 1.1% | 91.0% ± 1.3% | 188 ± 19 |
| | q = 1 | 68.2% ± 0.6% | 46.0% ± 0.3% | 88.8% ± 0.8% | 143 ± 4 |
| Shakespeare | uniform | 57.7% ± 1.5% | 54.1% ± 1.7% | 72.4% ± 3.2% | 32 ± 7 |
| | q = 0.001 | 66.7% ± 1.2% | 48.0% ± 0.4% | 71.2% ± 1.9% | 56 ± 9 |

Table 7: More statistics showing that uniform sampling outperforms q-FFL in terms of training accuracies. We observe that uniform sampling could result in more fair training accuracy distributions with smaller variance in some cases.

| Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
|---|---|---|---|---|---|
| Synthetic | uniform | 82.2% ± 1.1% | 30.0% ± 0.4% | 100.0% ± 0.0% | 525 ± 47 |
| | q = 1 | 79.0% ± 1.2% | 31.1% ± 1.8% | 100.0% ± 0.0% | 472 ± 14 |
| Vehicle | uniform | 86.8% ± 0.3% | 45.4% ± 0.3% | 95.4% ± 0.7% | 267 ± 7 |
| | q = 5 | 87.7% ± 0.7% | 69.9% ± 0.6% | 94.0% ± 0.9% | 48 ± 5 |
| Sent140 | uniform | 66.6% ± 2.6% | 21.1% ± 1.9% | 100.0% ± 0.0% | 560 ± 19 |
| | q = 1 | 66.5% ± 0.2% | 23.0% ± 1.4% | 100.0% ± 0.0% | 509 ± 30 |
| Shakespeare | uniform | 50.9% ± 0.4% | 41.0% ± 3.7% | 70.6% ± 5.4% | 71 ± 38 |
| | q = 0.001 | 52.1% ± 0.3% | 42.1% ± 2.1% | 69.0% ± 4.4% | 54 ± 27 |

Table 8: More statistics indicating the resulting fairness of q-FFL compared with the uniform sampling baseline. Again, we observe that the testing accuracy of the worst 10% devices tends to increase, and the variance of the final testing accuracies is smaller.

Efficiency of q-FFL compared with AFL. One added benefit of q-FFL is that it leads to faster convergence than AFL, even when we use non-local-updating methods for both objectives. In Figure 7, we show that, with respect to the final testing accuracy for the single worst device (i.e., the objective that AFL is trying to optimize), q-FFL converges faster than AFL. As the number of devices increases (from Fashion MNIST to Vehicle), the performance gap between AFL and q-FFL becomes larger because AFL introduces larger variance.

Figure 7: q-FFL is more efficient than AFL. With the worst device achieving the same final testing accuracy, q-FFL converges faster than AFL. For Vehicle (with 23 devices) as opposed to Fashion MNIST (with 3 devices), we see that the performance gap is larger. We run full gradient descent at each round for both methods.

Efficiency of q-FedAvg under different data heterogeneity. As discussed in Section 4.4, one potential cause for the slower convergence of q-FedAvg on the synthetic dataset may be that local updating schemes could hurt convergence when local data distributions are highly heterogeneous. Although it has been shown that applying updates locally results in significantly faster convergence in terms of communication rounds [24, 39], which is consistent with our observation on most datasets, we note that when data is highly heterogeneous, local updating may hurt convergence. We validate this by creating an IID synthetic dataset (Synthetic-IID) where local data on each device follow the same global distribution. We call the synthetic dataset used in Section 4 Synthetic-Non-IID. We also create a hybrid dataset (Synthetic-Hybrid) where half of the devices are assigned IID data from the same distribution, and the other half are assigned data from different distributions. We observe that if data is perfectly IID, q-FedAvg is more efficient than q-FedSGD. As data become more heterogeneous, q-FedAvg converges more slowly than q-FedSGD in terms of communication rounds. For all three synthetic datasets, we repeat the process of tuning a best constant step size for FedSGD and observe similar results as before: our dynamic solver q-FedSGD behaves similarly to (or sometimes outperforms) a best hand-tuned FedSGD.

Figure 8: Convergence of q-FedAvg compared with q-FedSGD under different data heterogeneity. When data distributions are heterogeneous, it is possible that q-FedAvg converges more slowly than q-FedSGD. Again, the proposed dynamic solver q-FedSGD performs similarly to (or better than) a best-tuned fixed-step-size FedSGD.