Fair Resource Allocation in Federated Learning (ICLR '20)
Federated learning involves training statistical models in massive, heterogeneous networks. Naively minimizing an aggregate loss function in such a network may disproportionately advantage or disadvantage some of the devices. In this work, we propose q-Fair Federated Learning (q-FFL), a novel optimization objective inspired by resource allocation in wireless networks that encourages a more fair (i.e., lower-variance) accuracy distribution across devices in federated networks. To solve q-FFL, we devise a communication-efficient method, q-FedAvg, that is suited to federated networks. We validate both the effectiveness of q-FFL and the efficiency of q-FedAvg on a suite of federated datasets, and show that q-FFL (along with q-FedAvg) outperforms existing baselines in terms of the resulting fairness, flexibility, and efficiency.
With the growing prevalence of IoT-type devices, data is frequently collected and processed outside of the data center and directly on distributed devices, such as sensors, wearable devices, or mobile phones. Federated learning is a promising learning paradigm for this setting that pushes machine learning model training to the edge [24]. Federated learning methods aim to address key challenges such as user privacy, expensive communication, and device variability.
In federated learning, the goal is typically to fit a model to data generated by a network of devices via some empirical risk minimization objective. The number of devices in such networks is generally large, ranging from hundreds to millions. Naively minimizing the average loss in such a massive network may disproportionately advantage or disadvantage the model performance on some of the devices. Indeed, although the accuracy may be high on average, there is no accuracy guarantee for individual devices in the network. This is exacerbated by the fact that the data are often heterogeneous across devices both in terms of size and distribution. In this work, we therefore ask: Can we devise an efficient federated optimization method to encourage a more fair distribution of the model performance across devices in federated networks?
There has been tremendous recent interest in developing fair methods for machine learning [see, e.g., 6, 9]. However, methods that could help improve fairness of the accuracy distribution in distributed settings are typically proposed for a much smaller number of devices, and may be impractical in federated networks due to the number of involved constraints [6]. Recent work that has been proposed specifically for the federated setting has also only been applied at small scales (2-3 groups/devices), and lacks flexibility by optimizing only the performance of the single worst device [26].
In this work, we propose q-FFL, a novel optimization objective that addresses fairness issues in federated learning. Inspired by work in fair resource allocation for wireless networks, q-FFL minimizes an aggregate reweighted loss parameterized by $q$ such that the devices with higher loss are given higher relative weight to encourage less variance (i.e., more fairness) in the accuracy distribution. Adaptively minimizing such a modified objective avoids the burden of hand-crafting fairness constraints, and results in a flexible framework in which the objective can be tuned depending on the desired amount of fairness. In addition, we propose a lightweight and scalable distributed method, q-FedAvg, to solve q-FFL, which carefully accounts for important characteristics of the federated setting such as communication-efficiency and low participation of devices [4, 24].
Contributions. We summarize our contributions as follows. First, we propose q-FFL, a novel objective that can improve the fairness of the accuracy distribution in federated learning. Second, we design a scalable method, q-FedAvg, that can efficiently solve the proposed objective in massive federated networks. Finally, through extensive experiments on federated datasets with both convex and non-convex models, we demonstrate the fairness and flexibility of q-FFL and the efficiency of q-FedAvg compared with existing baselines. Empirically, q-FFL is able to reduce the variance of accuracies across devices by 45% on average while maintaining the same overall average accuracy.
Fairness in Machine Learning. Fairness is a broad topic that has received much recent attention in the machine learning community. There are several widespread approaches to address fairness, in which fairness is typically defined as the protection of some specific attribute(s) (e.g., [17]). Two common approaches are to preprocess the data to remove information about the protected attribute [13], or to post-process the model by adjusting the prediction threshold after classifiers are trained [12, 17]. Another set of works optimizes an objective subject to some fairness constraints during training time [3, 6, 9, 18, 43, 46, 47]. Our work also enforces fairness during training, though we define fairness in terms of the variance of the accuracy distribution across devices in federated learning (Section 3), as opposed to the protection of a specific attribute. Although some work defines equal error rates among specific groups as a notion of fairness [6, 46], our goal is not to optimize for the same accuracy across all devices due to the heterogeneous nature of federated settings. Cotter et al. [6] use a notion of ‘minimum accuracy’ as one special case of ‘rate constraints’, which is conceptually similar to our goal. However, it requires one optimization constraint for each device/group, which would result in thousands to millions of constraints in the federated setting.
In federated settings, Mohri et al. [26] recently proposed a minimax optimization scheme, Agnostic Federated Learning (AFL), which optimizes for the performance of the single worst device (the notion of ‘group’ in [26] is the same as the notion of ‘device’ used here). This method has only been applied at small scales (for a handful of groups/devices). Compared to AFL, our proposed objective is more flexible as it can be tuned based on the desired amount of fairness; q-FFL in fact generalizes AFL, as q-FFL with a large enough $q$ is equivalent to AFL. We demonstrate the improved flexibility and scalability of q-FFL compared to AFL empirically in Section 4.
Fairness in Resource Allocation. Fair resource allocation has been extensively studied in fields such as network management [10, 16, 21, 28] and wireless communications [11, 27, 34, 37]. In these contexts, the problem is defined as allocating a scarce shared resource, e.g., communication time or power, among many users. In these cases, directly maximizing utilities such as total throughput usually leads to unfair allocations where some users receive poor service. As a service provider, it is important to improve the quality of service for all users while maintaining overall throughput. For this reason, several popular fairness measurements have been proposed to balance between fairness and total throughput, including Jain’s index [19], entropy [33], max-min/min-max fairness [31], and proportional fairness [20]. A unified framework is captured through α-fairness [22, 25], in which the network manager can tune the emphasis on fairness by changing a single parameter, $\alpha$.
To draw an analogy between federated learning and the problem of resource allocation, one can think of the global model as a resource that is meant to serve the users (or devices). In this sense, it is natural to ask similar questions about the fairness of the service that users receive and use similar tools to promote fairness. Despite this, we are unaware of any works that use α-fairness from resource allocation to modify training objectives in machine learning. Inspired by the α-fairness metric, we propose a similarly modified objective function, q-Fair Federated Learning (q-FFL), to encourage a more fair accuracy distribution across devices in the context of federated training. Similar to the α-fairness metric, our q-FFL objective is flexible enough to enable trade-offs between fairness and other traditional metrics such as average accuracy by changing the parameter $q$. In Section 4, we demonstrate empirically that the use of q-FFL as an objective in federated learning enables a more fair test accuracy distribution among the devices.
Federated and Distributed Optimization. To devise a practical fairness solution for the federated setting, it is critical to design methods for efficiently solving the proposed objective. Federated learning faces challenges such as expensive communication, systems heterogeneity (e.g., variability in hardware or network connection) and statistical heterogeneity (i.e., differing local data distributions per device), making it distinct from classical distributed optimization [32, 35, 39]. In order to reduce communication, as well as to tolerate heterogeneity, methods that allow for local updating and low participation among devices have become de facto solvers for this setting [23, 24, 38]. We incorporate recent advancements in this field when designing methods to solve the q-FFL objective (Section 3.3).
In this section, we formally define the classical federated learning objective and methods, and introduce our proposed notion of fairness. We then introduce q-FFL, a novel objective that encourages a more fair accuracy distribution across all devices (Section 3.2). Finally, in Section 3.3, we describe q-FedAvg, an efficient distributed method to solve the q-FFL objective in federated settings.
Federated learning algorithms involve hundreds to millions of remote devices learning locally on their device-generated data and communicating with a central server periodically to reach a global consensus. In particular, the goal is typically to minimize the following objective function:
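$$\min_{w} \; f(w) = \sum_{k=1}^{m} p_k F_k(w), \tag{1}$$
where $m$ is the total number of devices, $p_k \ge 0$, and $\sum_k p_k = 1$. The local objectives $F_k$ can be defined by empirical risks over local data, i.e., $F_k(w) = \frac{1}{n_k} \sum_{j_k=1}^{n_k} l_{j_k}(w)$, where $n_k$ is the number of samples available locally. We can set $p_k$ to be $n_k / n$, where $n = \sum_k n_k$ is the total number of samples, to fit a traditional empirical risk minimization-type objective over the entire dataset.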
Most prior work solves (1) by sampling a subset of devices with probabilities proportional to $p_k$ at each round, and then applying an optimizer such as stochastic gradient descent (SGD) locally. These local updating methods enable flexible and efficient communication by running an optimizer for a variable number of iterations locally on each device, e.g., compared to traditional mini-batch methods, which would simply calculate a subset of the gradients [40, 41, 44, 45]. FedAvg [24], summarized in Algorithm 1, is one of the leading methods to solve (1). The method runs simply by having each selected device apply $E$ epochs of SGD locally and then averaging the resulting models.
Unfortunately, solving problem (1) in this manner can implicitly introduce unfairness between different devices. For instance, the learned model may be biased towards devices with larger numbers of data points, or (if weighting devices equally) towards commonly occurring groups of devices. More formally, we define our desired fairness criterion for federated learning below.
Definition 1. For trained models $w$ and $\tilde{w}$, we say that model $w$ provides a more fair solution to the federated learning objective (1) than model $\tilde{w}$ if the variance of the performance of model $w$ on the $m$ devices, $\{a_1, \dots, a_m\}$, is smaller than the variance of the performance of model $\tilde{w}$ on the $m$ devices, i.e., $\mathrm{Var}(a_1(w), \dots, a_m(w)) < \mathrm{Var}(a_1(\tilde{w}), \dots, a_m(\tilde{w}))$.
In this work, we take ‘performance’, $a_k(\cdot)$, to be the testing accuracy of applying the trained model $w$ on the test data for device $k$. We note that a tension exists between the variance of the final testing accuracy distribution and the average testing accuracy across devices. In general, our goal is to reduce the variance while maintaining the same (or similar) average accuracy.
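The comparison in the fairness definition above can be sketched in a few lines of Python. This is only an illustration: the function name `is_more_fair` and the per-device accuracies are hypothetical, and the population variance is used as an assumed choice of variance estimator.

```python
from statistics import mean, pvariance

def is_more_fair(acc_w, acc_w_tilde):
    """True if model w has a lower variance of per-device accuracy than model w~."""
    return pvariance(acc_w) < pvariance(acc_w_tilde)

# Hypothetical per-device test accuracies (one entry per device, not from the paper).
acc_baseline = [0.95, 0.90, 0.40, 0.85]
acc_fair = [0.80, 0.78, 0.72, 0.80]

# Both lists are built to have the same mean, so only the spread differs.
print("baseline:", mean(acc_baseline), pvariance(acc_baseline))
print("fair    :", mean(acc_fair), pvariance(acc_fair))
print("fair model is more fair:", is_more_fair(acc_fair, acc_baseline))
```

The two hypothetical models share the same average accuracy, which mirrors the stated goal of reducing variance without lowering the average.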
A natural idea to achieve fairness as defined in Definition 1 would be to reweight the objective, assigning higher weight to devices with poor performance, so that the distribution of accuracies in the network reduces in variance. Note that this re-weighting must be done dynamically, as the performance of the devices depends on the model being trained, which cannot be evaluated a priori. Drawing inspiration from α-fairness, a utility function used in fair resource allocation in wireless networks, we propose the following objective. For given local non-negative cost functions $F_k$ and parameter $q > 0$, we define the q-Fair Federated Learning (q-FFL) objective as:
$$\min_{w} \; f_q(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w), \tag{2}$$
where $F_k^{q+1}(\cdot)$ denotes $F_k(\cdot)$ to the power of $(q+1)$. Here, $q$ is a parameter that tunes the amount of fairness we wish to impose. Setting $q = 0$ does not encourage fairness beyond the classical federated learning objective (1). A larger $q$ means that we emphasize devices with higher local empirical losses, $F_k(w)$, thus reducing the variance of the training accuracy distribution and potentially inducing fairness in accordance with Definition 1. $f_q(w)$ with a large enough $q$ reduces to classical max-min fairness [26], as the device with the worst performance (largest loss) will dominate the objective. We note that while the $\frac{1}{q+1}$ term in the denominator in (2) may be absorbed in $p_k$, we include it as it is standard in the α-fairness literature and helps to ease notation in the following sections.
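To make the reweighting concrete, the sketch below evaluates the objective in (2) for fixed local losses and shows the relative weight $p_k F_k^q$ that each device's gradient receives in $\nabla f_q$. The loss values, the uniform $p_k$, and the function names are hypothetical and only serve as an illustration.

```python
def q_ffl_objective(losses, weights, q):
    """f_q = sum_k p_k / (q + 1) * F_k^(q + 1), for non-negative local losses F_k."""
    return sum(p / (q + 1) * F ** (q + 1) for p, F in zip(weights, losses))

def gradient_weights(losses, weights, q):
    """Normalized weights p_k * F_k^q multiplying each grad F_k in grad f_q."""
    raw = [p * F ** q for p, F in zip(weights, losses)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical local losses for three devices with equal p_k.
losses = [0.2, 0.5, 2.0]
weights = [1 / 3, 1 / 3, 1 / 3]

for q in (0, 1, 2, 5):
    rel = [round(x, 3) for x in gradient_weights(losses, weights, q)]
    print(f"q={q}: f_q={q_ffl_objective(losses, weights, q):.4f}, relative weights={rel}")
```

With $q = 0$ every device keeps its original weight, while for larger $q$ the device with the largest loss takes up most of the total weight.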
In this section, we provide methods to solve q-FFL. We start by giving a fair but less efficient method, q-FedSGD, to illustrate the main techniques we use in terms of solving the q-FFL problem (2). We then provide a more efficient counterpart, q-FedAvg, by considering key properties of federated algorithms such as local updating schemes. These proposed methods closely mirror traditional distributed optimization methods (mini-batch SGD and federated averaging, FedAvg), but with step-sizes and subproblems carefully chosen in accordance with the q-FFL problem (2).
Hyperparameter tuning: $q$ and step-sizes. In devising a method to solve q-FFL (2), we begin by noting that it is crucial to first determine how to set $q$. In practice, $q$ can be tuned based on the desired amount of fairness (with larger $q$ inducing more fairness). As we describe in our experiments (Section 4.2), it is therefore common to train a family of objectives for different $q$ values so that a practitioner can explore the trade-off between accuracy and fairness for the application at hand.
One concern with solving such a family of objectives is that the training costs can increase significantly. In particular, to optimize q-FFL in a scalable fashion, we rely on gradient-based methods, where the step-size inversely depends on the Lipschitz constant of the function’s gradient, which is often unknown and selected via grid search [14, 29]. As we intend to optimize q-FFL for various values of $q$, the Lipschitz constant will change as we change $q$, requiring step-size tuning for all values of $q$. This can quickly cause the search space to explode. To overcome this issue, we propose estimating the local Lipschitz constant of the gradient for the family of q-FFL objectives by using the Lipschitz constant we infer via grid search on $q = 0$. This allows us to dynamically adjust the step-size of our gradient-based optimization method for the q-FFL objective, avoiding the manual tuning for each $q$. In Lemma 2 we formalize the relation between the Lipschitz constant, $L$, for $q = 0$ and $q > 0$.
Lemma 2. If the non-negative function $f(\cdot)$ has a Lipschitz gradient with constant $L$, then for any $q \ge 0$ and at any point $w$,
$$L_q(w) = L f(w)^q + q f(w)^{q-1} \|\nabla f(w)\|^2 \tag{3}$$
is an upper-bound for the local Lipschitz constant of the gradient of $\frac{1}{q+1} f^{q+1}(\cdot)$ at point $w$.
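Proof. At any point $w$, we can compute the Hessian $\nabla^2 \left( \frac{1}{q+1} f^{q+1}(w) \right)$ as:
$$\nabla^2 \left( \frac{1}{q+1} f^{q+1}(w) \right) = q f^{q-1}(w) \, \nabla f(w) \nabla f(w)^{\top} + f^{q}(w) \, \nabla^2 f(w). \tag{4}$$
As a result, $\left\| \nabla^2 \left( \frac{1}{q+1} f^{q+1}(w) \right) \right\|_2 \le L_q(w) = L f(w)^q + q f(w)^{q-1} \|\nabla f(w)\|^2$. ∎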
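As a quick numerical sanity check of the lemma, the sketch below compares a finite-difference second derivative of $\frac{1}{q+1} f^{q+1}$ with the bound $L f(w)^q + q f(w)^{q-1} \|\nabla f(w)\|^2$ in one dimension. The choice of $f$ (softplus, which is positive and has a gradient with Lipschitz constant $1/4$), the grid of points, and all function names are assumptions made for this illustration.

```python
import math

L = 0.25  # Lipschitz constant of the softplus gradient (its second derivative is at most 1/4)

def f(w):
    """Softplus: a positive function whose gradient is the logistic sigmoid."""
    return math.log1p(math.exp(w))

def grad_f(w):
    return 1.0 / (1.0 + math.exp(-w))

def g(w, q):
    """The transformed function f^(q+1) / (q+1) from the lemma."""
    return f(w) ** (q + 1) / (q + 1)

def lipschitz_bound(w, q):
    """Upper bound L * f^q + q * f^(q-1) * |grad f|^2 on the local curvature of g."""
    return L * f(w) ** q + q * f(w) ** (q - 1) * grad_f(w) ** 2

def second_derivative(w, q, h=1e-4):
    """Central finite-difference estimate of g''(w)."""
    return (g(w + h, q) - 2.0 * g(w, q) + g(w - h, q)) / (h * h)

def max_violation(points, qs):
    """Largest value of g'' minus the bound; it should not exceed numerical noise."""
    return max(second_derivative(w, q) - lipschitz_bound(w, q) for w in points for q in qs)

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
qs = [0, 0.5, 1, 2, 5]
print("max violation:", max_violation(points, qs))
```

In one dimension the Hessian norm is simply $|g''(w)|$, so a violation no larger than finite-difference noise is consistent with the bound.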
A first approach: q-FedSGD. Our first fair federated learning method, q-FedSGD, is an extension of the well-known federated mini-batch SGD (FedSGD) method [24]. q-FedSGD uses a dynamic step-size based on Lemma 2 instead of the normal fixed step-size of FedSGD. In each step of q-FedSGD, a subset of the devices is selected, and for each device $k$ in this subset, $\nabla F_k(w^t)$ and $F_k(w^t)$ are computed at the current iterate and communicated to the central node. This information is used to adjust the weight for combining the updates from each device based on Lemma 2. The details of q-FedSGD are summarized in Algorithm 2. It is important to note that to run q-FedSGD with different values of $q$, we only need to estimate $L$ once (for $q = 0$) and can then re-use it for all values of $q$.
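The server step described above can be sketched as follows. Since Algorithm 2 is not reproduced in this text, the exact update is an assumption: each device reports $\Delta_k = F_k^q \nabla F_k$ and $h_k = q F_k^{q-1} \|\nabla F_k\|^2 + L F_k^q$ (the bound from Lemma 2), and the server moves by $\sum_k \Delta_k / \sum_k h_k$. The scalar model, the quadratic device losses, and all names are hypothetical.

```python
def q_fedsgd_step(w, selected, q, L):
    """One server update from the (loss, grad) callables of the selected devices."""
    deltas, hs = [], []
    for loss, grad in selected:
        F, g = loss(w), grad(w)
        deltas.append(F ** q * g)                          # reweighted gradient
        hs.append(q * F ** (q - 1) * g * g + L * F ** q)   # Lemma 2 style bound
    return w - sum(deltas) / sum(hs)

def make_device(center, offset=0.1):
    """Hypothetical device with loss 0.5 * (w - center)^2 + offset, so L = 1."""
    def loss(w):
        return 0.5 * (w - center) ** 2 + offset
    def grad(w):
        return w - center
    return loss, grad

# Two devices prefer w = 0 and one device prefers w = 3.
devices = [make_device(c) for c in (0.0, 0.0, 3.0)]

def run(q, rounds=50, w=0.0):
    for _ in range(rounds):
        w = q_fedsgd_step(w, devices, q, L=1.0)
    return w

print("q=0:", run(q=0))  # plain average objective
print("q=1:", run(q=1))  # shifted towards the high-loss device
```

With $q = 0$ the update reduces to gradient descent with step-size $1/L$ on the average loss, and the same $L$ is reused unchanged for $q = 1$.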
Improving communication-efficiency: q-FedAvg. In federated settings, communication-efficient schemes using local stochastic solvers (such as FedAvg, described in Section 3.1) have been shown to significantly improve convergence speed [24]. Using stochastic (as opposed to batch) methods locally is important as it enables flexibility in terms of local computation vs. communication. Unfortunately, it is not straightforward to simply apply FedAvg to problem (2) when $q > 0$, as the $F_k^{q+1}$ term prevents the use of local SGD. To address this, we propose instead optimizing $F_k$ locally. This is reasonable due to the fact that minimizing $F_k^{q+1}$ is equivalent to minimizing $F_k$ (when $F_k \ge 0$ and $q \ge 0$). However, if we combine these updates by simple averaging, similar to FedAvg, it would optimize (1) and not (2). Instead, we combine the local updates using the weights inferred via Lemma 2, similar to q-FedSGD. In particular, we replace the gradient of the local functions, $\nabla F_k$, in the q-FedSGD steps with the local update vectors that are obtained by running SGD locally on device $k$. This allows us to extend the local updating technique of FedAvg to the q-FFL objective (2).
We provide additional details on q-FedAvg in Algorithm 3. As we will see empirically, due to the local updating, q-FedAvg can solve the q-FFL objective more efficiently than q-FedSGD in most cases. Similar to q-FedSGD, it also does not require re-tuning the step-size when $q$ changes.
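A minimal sketch of one q-FedAvg round is given below. Algorithm 3 is not reproduced in this text, so several details are assumptions: the local update vector is taken as $\Delta w_k = L (w - \bar{w}_k)$, where $\bar{w}_k$ is the locally updated model, deterministic gradient steps stand in for local SGD, and the scalar model, device losses, and function names are hypothetical.

```python
def local_updates(w, grad, lr, steps):
    """Stand-in for local SGD: plain gradient steps on the local loss F_k."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def q_fedavg_round(w, selected, q, L, lr, local_steps):
    """One server update that combines local update vectors with Lemma 2 weights."""
    deltas, hs = [], []
    for loss, grad in selected:
        F = loss(w)                                   # local loss at the global model
        w_local = local_updates(w, grad, lr, local_steps)
        dw = L * (w - w_local)                        # replaces grad F_k (assumed scaling)
        deltas.append(F ** q * dw)
        hs.append(q * F ** (q - 1) * dw * dw + L * F ** q)
    return w - sum(deltas) / sum(hs)

def make_device(center, offset=0.1):
    """Hypothetical device with loss 0.5 * (w - center)^2 + offset, so L = 1."""
    def loss(w):
        return 0.5 * (w - center) ** 2 + offset
    def grad(w):
        return w - center
    return loss, grad

devices = [make_device(c) for c in (0.0, 0.0, 3.0)]

# One local step of size 1/L makes dw equal to the local gradient.
print(q_fedavg_round(0.0, devices, q=1, L=1.0, lr=1.0, local_steps=1))
# Two local steps with a smaller step-size.
print(q_fedavg_round(0.0, devices, q=0, L=1.0, lr=0.5, local_steps=2))
```

Under the assumed scaling, a single local step of size $1/L$ gives $\Delta w_k = \nabla F_k(w)$, so the round coincides with a gradient-based update that uses the same weights; more local steps trade extra local computation for fewer communication rounds.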
We now present empirical results of the proposed objective, q-FFL, and proposed methods, q-FedAvg and q-FedSGD. We describe our experimental setup, including the datasets used, in Section 4.1. We then demonstrate the improved fairness of q-FFL in Section 4.2, and compare q-FFL with several baseline fairness methods in Section 4.3. Finally, we show the efficiency of q-FedAvg compared with q-FedSGD in Section 4.4. All code, data, and experiments are publicly available at github.com/litian96/fair_flearn.
Federated Datasets. We explore one synthetic and three non-synthetic federated datasets, using both convex and non-convex models in our experiments. The datasets are curated from prior work in federated learning [23, 24, 38] as well as recent federated learning benchmarks [5]. In particular, we first study a synthetic dataset similar to that in [36] and impose additional heterogeneity amongst 100 devices. We then investigate a Vehicle dataset consisting of acoustic, seismic, and infrared sensor data collected from a distributed network of 23 sensors [8]. We model each sensor as a device and train a linear SVM to predict between AAV-type and DW-type vehicles. In non-convex settings, we study tweets from 1,101 accounts curated from Sentiment140 [15] (Sent140), where each Twitter account corresponds to a device. We use an LSTM classifier for text sentiment analysis. Finally, we explore text data built from The Complete Works of William Shakespeare [24, 42], where each speaking role is associated with a device. We randomly subsample 31 devices, and use an LSTM to predict the next character. Full details of the datasets are given in Appendix B.1.
Implementation. We implement all code in TensorFlow [2], simulating a federated network with one server and $m$ devices. We provide full details in Appendix B.2, and all hyperparameter values are given in Appendix B.2.2.
Table 1: Testing accuracy statistics for $q = 0$ and a tuned $q > 0$.
Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
Synthetic | q = 0 | 80.8% ± 0.9% | 18.8% ± 5.0% | 100.0% ± 0.0% | 724 ± 72 |
 | q > 0 | 79.0% ± 1.2% | 31.1% ± 1.8% | 100.0% ± 0.0% | 472 ± 14 |
Vehicle | q = 0 | 87.3% ± 0.5% | 43.0% ± 1.0% | 95.7% ± 1.0% | 291 ± 18 |
 | q > 0 | 87.7% ± 0.7% | 69.9% ± 0.6% | 94.0% ± 0.9% | 48 ± 5 |
Sent140 | q = 0 | 65.1% ± 4.8% | 15.9% ± 4.9% | 100.0% ± 0.0% | 697 ± 132 |
 | q > 0 | 66.5% ± 0.2% | 23.0% ± 1.4% | 100.0% ± 0.0% | 509 ± 30 |
Shakespeare | q = 0 | 51.1% ± 0.3% | 39.7% ± 2.8% | 72.9% ± 6.7% | 82 ± 41 |
 | q > 0 | 52.1% ± 0.3% | 42.1% ± 2.1% | 69.0% ± 4.4% | 54 ± 27 |
In our first experiments, we verify that the proposed objective q-FFL leads to more fair solutions (according to Definition 1) for federated data. In Figure 1, we compare the final testing accuracy distributions of two objectives ($q = 0$ and a tuned value of $q > 0$) averaged across 5 random shuffles of each dataset. We observe that while the average testing accuracy remains fairly consistent, the objectives with $q > 0$ result in more centered (i.e., fair) testing accuracy distributions with lower variance. In particular, while maintaining roughly the same average accuracy, q-FFL reduces the variance of accuracies across all devices by 45% on average. We further report the worst and best 10% testing accuracies and the variance of the final accuracies in Table 1. Comparing $q = 0$ and $q > 0$, we see that the average testing accuracy remains almost unchanged with the proposed objective despite significant reductions in variance. We see similar results on training accuracy distributions in Figure 4 and Table 4, Appendix B.3. Here, the average accuracy is with respect to all data points, not all devices. We observe similar results with respect to devices, as shown in Table 5, Appendix B.3.
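The 45% figure can be reproduced from the variance column of Table 1, under the assumption that it is the unweighted mean of the per-dataset relative reductions and that the first row of each dataset is the $q = 0$ baseline. The variable names below are illustrative only.

```python
# Variance column of Table 1 as (baseline with q = 0, tuned q > 0); the row order is assumed.
table1_variance = {
    "Synthetic": (724, 472),
    "Vehicle": (291, 48),
    "Sent140": (697, 509),
    "Shakespeare": (82, 54),
}

# Relative variance reduction per dataset.
reductions = {
    name: (base - fair) / base for name, (base, fair) in table1_variance.items()
}
average_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {100 * value:.1f}%")
print(f"average: {100 * average_reduction:.1f}%")
```

The per-dataset reductions vary widely, with Vehicle contributing by far the largest one, so the average should be read together with the individual rows.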
Choosing $q$. As discussed in Section 3.3, a natural question is to determine how $q$ should be tuned in the q-FFL objective. The framework is flexible in that it allows one to choose $q$ to trade off between reduced variance of the accuracy distribution and a high average accuracy. The larger $q$ is, the more fairness could be imposed, though the average accuracy may potentially suffer. In general, this value can be tuned based on the data/application at hand and the desired amount of fairness. In particular, a reasonable approach in practice would be to run Algorithm 3 with multiple $q$’s in parallel to obtain multiple final global models, and then select amongst these based on performance (e.g., accuracy) on the validation data. Rather than selecting just one optimal $q$ from this procedure, each device could also pick a device-specific model based on their validation data. We show additional performance improvements with this device-specific strategy in Table 6 in Appendix B.3. Finally, we note that one potential issue is that increasing the value of $q$ may slow the speed of convergence. However, for values of $q$ that result in more fair results on our datasets, we do not observe significant convergence slowdown, as shown in Figure 5, Appendix B.3.
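The two selection strategies described above could look roughly as follows. This is a sketch under stated assumptions: the selection rule (lowest validation variance among models whose mean accuracy stays within a tolerance of the $q = 0$ baseline), the tolerance value, the function names, and the validation accuracies are all hypothetical.

```python
from statistics import mean, pvariance

def select_q(val_acc_by_q, tolerance=0.01):
    """Pick the q with the lowest accuracy variance whose mean stays near the q = 0 mean."""
    baseline = mean(val_acc_by_q[0])
    eligible = {
        q: accs for q, accs in val_acc_by_q.items() if mean(accs) >= baseline - tolerance
    }
    return min(eligible, key=lambda q: pvariance(eligible[q]))

def select_q_per_device(val_acc_by_q):
    """Let each device pick the model (indexed by q) with its best validation accuracy."""
    num_devices = len(next(iter(val_acc_by_q.values())))
    return [
        max(val_acc_by_q, key=lambda q: val_acc_by_q[q][k]) for k in range(num_devices)
    ]

# Hypothetical per-device validation accuracies of the final global model for each q.
val_acc_by_q = {
    0: [0.95, 0.90, 0.40, 0.85],
    1: [0.85, 0.82, 0.62, 0.80],
    5: [0.70, 0.70, 0.66, 0.70],
}

print("global choice:", select_q(val_acc_by_q))
print("per-device choice:", select_q_per_device(val_acc_by_q))
```

In this toy example the largest $q$ has the lowest variance but loses too much average accuracy to be eligible, which is the trade-off the paragraph describes.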
Next, we compare q-FFL with other baselines that are likely to impose fairness in federated networks. One heuristic is to weight each data point equally, which reduces to the original objective in (1) (i.e., q-FFL with $q = 0$) and has been investigated in Section 4.2. We additionally compare with two alternatives: weighting devices equally when sampling devices, and weighting devices adversarially, namely, optimizing for the performance of the device with the largest loss, as proposed in [26].
Weighting devices equally. We compare q-FFL with uniform sampling schemes and report testing accuracy in Figure 2. A table with the final accuracies and variances is given in the appendix in Table 8. While the ‘weighting each device equally’ heuristic tends to outperform our method in training accuracy distributions (Figure 6 and Table 7 in Appendix B.3), we see that our method produces more fair solutions in terms of testing accuracies. One explanation for this is that uniform sampling is a static method and can easily overfit to devices with very few data points, whereas q-FFL has better generalization properties due to its dynamic nature.
Objectives | Adult: average | Adult: Dr. | Adult: non-Dr. | Fashion MNIST: average | Fashion MNIST: shirt | Fashion MNIST: pullover | Fashion MNIST: T-shirt |
q-FFL, $q = 0$ | 83.2% ± 0.1% | 69.9% ± 0.4% | 83.3% ± 0.1% | 78.8% ± 0.2% | 66.0% ± 0.7% | 84.5% ± 0.8% | 85.9% ± 0.7% |
AFL | 82.5% ± 0.5% | 73.0% ± 2.2% | 82.6% ± 0.5% | 77.8% ± 1.2% | 71.4% ± 4.2% | 81.0% ± 3.6% | 82.1% ± 3.9% |
q-FFL, $q_1 > 0$ | 82.6% ± 0.1% | 74.1% ± 0.6% | 82.7% ± 0.1% | 77.8% ± 0.2% | 74.2% ± 0.3% | 78.9% ± 0.4% | 80.4% ± 0.6% |
q-FFL, $q_2 > q_1$ | 82.3% ± 0.1% | 74.4% ± 0.9% | 82.4% ± 0.1% | 77.1% ± 0.4% | 74.7% ± 0.9% | 77.9% ± 0.4% | 78.7% ± 0.6% |
Weighting devices adversarially. We further compare with AFL [26], which is the only work we are aware of that aims to address fairness issues in federated learning. We implement a non-stochastic version of AFL where all devices are selected and updated each round, and perform grid search on the AFL hyperparameters. Full details of the implementation and hyperparameters are provided in Appendix B.2.2. In order to draw a fair comparison, we modify Algorithm 3 by sampling all devices and letting each of them run gradient descent at each round, using the same public datasets (Adult and Fashion MNIST) as in [26]. We note that, as opposed to AFL, q-FFL is flexible depending on the amount of fairness desired, with larger $q$ leading to smaller accuracy variance. As discussed, q-FFL generalizes AFL in this regard, as AFL is equivalent to q-FFL with a large enough $q$, where the device with the largest local loss dominates the global objective. In Table 2, we observe that q-FFL can actually achieve higher testing accuracy on the device with the worst performance than AFL when $q$ is set appropriately. Interestingly, we also observe that q-FFL converges faster in terms of communication rounds compared with AFL to obtain similar performance (Appendix B.3), which we suspect is due to the non-smoothness of the AFL objective.
Finally, we show the efficiency of q-FedAvg by comparing Algorithm 3 with its non-local-updating baseline q-FedSGD (Algorithm 2) with the same objective (same $q$ values as in Table 1). At each communication round, q-FedAvg runs one epoch of local updates on each selected device, while q-FedSGD runs gradient descent using the local training data. In Figure 3, q-FedAvg converges faster than q-FedSGD in terms of communication rounds in most cases due to its local updating scheme. The slower convergence of q-FedAvg compared with q-FedSGD on the synthetic dataset may be due to the fact that when local data distributions are highly heterogeneous, local updating schemes may allow local models to move too far away from the initial global model, potentially hurting convergence; see Figure 8 in Appendix B.3 for more details. We also compare our solver q-FedSGD with FedSGD with a best-tuned step-size. q-FedSGD performs similarly to FedSGD, which indicates that (the inverse of) our estimated Lipschitz constant on $q = 0$ is as good as a best-tuned fixed step-size. We can reuse this estimation for different $q$’s instead of manually re-tuning it when $q$ changes. We note here that the number of rounds is a reasonable metric for comparison between these methods as they process the same amount of data and perform an equivalent amount of communication at each round. Both proposed methods, q-FedAvg and q-FedSGD, can be easily integrated into existing implementations of federated learning algorithms such as TensorFlow Federated [1].
In this work, we propose q-FFL, a novel optimization objective inspired by fair resource allocation strategies in wireless networks that encourages more fair accuracy distributions in federated learning. We develop an efficient and scalable method, q-FedAvg, to solve this objective that is amenable to current federated optimization frameworks. Through our extensive experiments on federated datasets, we validate the resulting fairness, flexibility, and efficiency of our proposed approaches compared with existing baselines.
We thank Sebastian Caldas, Neel Guha, Anit Kumar Sahu, Eric Tan, and Samuel Yeom for their helpful comments. This work was supported in part by the National Science Foundation grant IIS1838017, a Google Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency.
Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315–3323, 2016.
International Conference on Artificial Intelligence and Statistics, 2017.
Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
As discussed in Section 2, while it is natural to consider the α-fairness framework for machine learning, we are unaware of any work that uses α-fairness to modify machine learning training objectives. We provide additional details on the framework below; for further background on α-fairness and fairness in resource allocation more generally, we refer the reader to [25, 37].
α-fairness [22, 25] is a popular fairness metric widely used in resource allocation problems. The framework defines a family of overall utility functions that can be derived by summing up the following function of the individual utilities of the users in the network:
$$U_\alpha(x) = \begin{cases} \ln(x) & \text{if } \alpha = 1, \\ \frac{1}{1-\alpha} x^{1-\alpha} & \text{if } \alpha \ge 0, \ \alpha \ne 1. \end{cases}$$
Here $U_\alpha(x)$ represents the individual utility of some specific user given $x$ allocated resources (e.g., bandwidth). The goal is to find a resource allocation strategy to maximize the sum of the individual utilities. This family of functions includes a wide range of popular fair resource allocation strategies. In particular, the above function represents zero fairness with $\alpha = 0$, proportional fairness [20] with $\alpha = 1$, harmonic mean fairness [7] with $\alpha = 2$, and max-min fairness [31] with $\alpha = +\infty$.
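The effect of the fairness parameter can be illustrated with a small sketch of the standard α-fair utility, $\ln(x)$ for $\alpha = 1$ and $x^{1-\alpha}/(1-\alpha)$ otherwise. The two allocations and the function names are hypothetical and chosen only to contrast an even split with a skewed one.

```python
import math

def alpha_fair_utility(x, alpha):
    """Standard alpha-fair utility of a positive allocation x."""
    if x <= 0:
        raise ValueError("allocation must be positive")
    if alpha == 1:
        return math.log(x)
    return x ** (1 - alpha) / (1 - alpha)

def total_utility(allocation, alpha):
    """Sum of the individual utilities, the quantity a network manager would maximize."""
    return sum(alpha_fair_utility(x, alpha) for x in allocation)

# Two hypothetical ways of splitting 10 units of a resource between two users.
even = [5.0, 5.0]
skewed = [9.0, 1.0]

for alpha in (0, 1, 2):
    print(f"alpha={alpha}: even={total_utility(even, alpha):.3f}, "
          f"skewed={total_utility(skewed, alpha):.3f}")
```

With $\alpha = 0$ both allocations have the same total utility (only the total amount matters), while for $\alpha = 1$ and $\alpha = 2$ the even split scores strictly higher.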
Note that in federated learning, we are dealing with costs and not utilities. Thus, max-min in resource allocation corresponds to min-max in our setting. With this analogy, it is clear that in our proposed objective q-FFL (2), the case where $q \to +\infty$ corresponds to min-max fairness since it is optimizing for the worst performing device, similar to what was proposed in [26]. Also, $q = 0$ corresponds to zero fairness, which reduces to the original FedAvg objective (1). In resource allocation problems, $\alpha$ can be tuned for trade-offs between fairness and system efficiency. In federated settings, $q$ can be tuned based on the desired level of fairness (i.e., lower variance of accuracy distributions) and other performance metrics such as the overall accuracy. For instance, in Table 2 in Section 4.3, we demonstrate on two datasets that as $q$ increases, the overall average accuracy decreases slightly while the worst accuracies are increased significantly and the variance of the accuracies decreases.
We provide full details on the datasets and models used in our experiments. The statistics of the four federated datasets are summarized in Table 3. We report the total number of devices, the total number of samples, and the mean and standard deviation of the number of data points on each device. Additional details on the datasets and models are described below.
Synthetic: We follow a setup similar to that in [36] and impose additional heterogeneity. The goal is to learn global model parameters shared by all devices. The samples and the local model on each device are drawn from device-specific distributions, where the covariance matrix of the features is diagonal. There are 100 devices in total and the number of samples on each device follows a power law.
Vehicle (http://www.ecs.umass.edu/~mduarte/Software.html): We use the same Vehicle Sensor (Vehicle) dataset as [38], modelling each sensor as a device. Each sample has a 100-dimensional feature vector and a binary label indicating whether this sample is on an AAV-type or DW-type vehicle. We train a linear SVM. We tune the hyperparameters of the SVM and report the best configuration.
Sent140: This dataset is a collection of tweets from Sentiment140 [15] (Sent140). The task is text sentiment analysis, which we model as a binary classification problem. The model takes as input a 25-word sequence, embeds each word into a 300-dimensional space using pretrained GloVe [30], and outputs a binary label after two LSTM layers and one densely-connected layer.
Shakespeare: This dataset is built from The Complete Works of William Shakespeare [24, 42]. Each speaking role in the plays is associated with a device. We subsample 31 speaking roles to train a deep model for next character prediction. The model takes as input an 80-character sequence, embeds each character into a learnt 8-dimensional space, and outputs one character after two LSTM layers and one densely-connected layer.
Dataset | Devices | Samples | Samples/device (mean) | Samples/device (stdev) |
Synthetic | 100 | 12,697 | 127 | 73 |
Vehicle | 23 | 43,695 | 1,899 | 349 |
Sent140 | 1,101 | 58,170 | 53 | 32 |
Shakespeare | 31 | 116,214 | 3,749 | 6,912 |
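The per-dataset statistics reported above can be computed from a list of per-device sample counts. The sketch below is illustrative only: the helper name `dataset_summary` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def dataset_summary(samples_per_device):
    """Summarize a federated dataset from its per-device sample counts."""
    total = sum(samples_per_device)
    return {
        "devices": len(samples_per_device),
        "samples": total,
        "mean": total / len(samples_per_device),
        # Assumption: population standard deviation (pstdev), not sample stdev.
        "stdev": statistics.pstdev(samples_per_device),
    }


summary = dataset_summary([10, 20, 30])  # hypothetical counts for three devices
```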
We simulate the federated setting (one server and devices) on a server with 2 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA 1080Ti GPUs.
We implement all code in TensorFlow [2] version 1.10.1.
Please see github.com/litian96/fair_flearn for full details.
We randomly split the data on each local device into an 80% training set, a 10% testing set, and a 10% validation set. We tune an optimal q (by optimal we mean the setting where the variance of accuracy decreases the most, while keeping the overall average accuracy unchanged) from a set of candidate values on the validation set and report accuracy distributions on the testing set. For each dataset, we repeat this process for five randomly selected train/test/validation splits, and report the mean and standard deviation across these five runs where applicable. For Synthetic, Vehicle, Sent140, and Shakespeare, the optimal q values are 1, 5, 1, and 0.001, respectively. For all datasets, we randomly sample 10 devices each round. We tune the learning rate and batch size on FedAvg and use the same learning rate and batch size for all q-FedAvg experiments on that dataset. The learning rates for Synthetic, Vehicle, Sent140, and Shakespeare are 0.1, 0.01, 0.03, and 0.8, respectively. The batch sizes for Synthetic, Vehicle, Sent140, and Shakespeare are 10, 64, 32, and 10, respectively. In comparing q-FedAvg’s efficiency with q-FedSGD, we also tune the best learning rate for q-FedSGD methods on q = 0. For each comparison, we fix the selected devices and mini-batch orders across all runs. We stop training when the training loss does not decrease for 10 rounds.
When running AFL methods, we search for a best and such that AFL achieves the highest testing accuracy on the device with the highest loss within a fixed number of rounds. For Adult, we use and ; for Fashion MNIST, we use and . We use the same as step-sizes for q-FedAvg on Adult and Fashion MNIST. In Table 2, for q-FFL on Adult and for q-FFL on Fashion MNIST. The number of local epochs is fixed to 1 whenever we do local updates.
Fairness of q-FFL with respect to training accuracy. The empirical results in Section 4 are with respect to testing accuracy. As a sanity check, we show that q-FFL also results in more fair training accuracy distributions in Figure 4 and Table 4.
Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
Synthetic | q=0 | 81.7% ± 0.3% | 23.6% ± 1.1% | 100.0% ± 0.0% | 597 ± 10 |
Synthetic | q=1 | 78.9% ± 0.2% | 41.8% ± 1.0% | 96.8% ± 0.5% | 292 ± 11 |
Vehicle | q=0 | 87.5% ± 0.2% | 49.5% ± 10.2% | 94.9% ± 0.7% | 237 ± 97 |
Vehicle | q=5 | 87.8% ± 0.5% | 71.3% ± 2.2% | 93.1% ± 1.4% | 37 ± 12 |
Sent140 | q=0 | 69.8% ± 0.8% | 36.9% ± 3.1% | 94.4% ± 1.1% | 278 ± 44 |
Sent140 | q=1 | 68.2% ± 0.6% | 46.0% ± 0.3% | 88.8% ± 0.8% | 143 ± 4 |
Shakespeare | q=0 | 72.7% ± 0.8% | 46.4% ± 1.4% | 79.7% ± 0.9% | 116 ± 8 |
Shakespeare | q=0.001 | 66.7% ± 1.2% | 48.0% ± 0.4% | 71.2% ± 1.9% | 56 ± 9 |
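The columns of this table can be derived from per-device accuracies. The following sketch is one plausible reading, not the authors' evaluation code: it assumes that "Worst 10%" and "Best 10%" are the mean accuracies of the bottom and top tenth of devices, that the variance is the population variance of per-device accuracy in percentage points, and that the average is weighted by sample counts. The name `fairness_report` is hypothetical.

```python
import math
import statistics


def fairness_report(acc_percent, num_samples):
    """Summarize a per-device accuracy distribution (accuracies in percent)."""
    n = len(acc_percent)
    k = max(1, math.ceil(n / 10))  # number of devices in a 10% tail
    ranked = sorted(acc_percent)
    weighted = sum(a * m for a, m in zip(acc_percent, num_samples))
    return {
        "average": weighted / sum(num_samples),  # weighted by data points
        "worst_10": sum(ranked[:k]) / k,
        "best_10": sum(ranked[-k:]) / k,
        # Assumption: population variance across devices.
        "variance": statistics.pvariance(acc_percent),
    }
```

A lower variance together with a higher "worst_10" value is what the text refers to as a more fair accuracy distribution.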
Average testing accuracy with respect to devices. In Section 4.2, we show that q-FFL leads to more fair accuracy distributions while maintaining approximately the same testing accuracies. Note that we report average testing accuracy with respect to all data points in Table 1. However, we observe similar results on average accuracy with respect to all devices between the q = 0 and q > 0 objectives, as shown in Table 5.
Dataset | Objective | Accuracy w.r.t. Data Points | Accuracy w.r.t. Devices |
Synthetic | q=0 | 80.8% ± 0.9% | 77.3% ± 0.6% |
Synthetic | q=1 | 79.0% ± 1.2% | 76.3% ± 1.7% |
Vehicle | q=0 | 87.3% ± 0.5% | 85.6% ± 0.4% |
Vehicle | q=5 | 87.7% ± 0.7% | 86.5% ± 0.7% |
Sent140 | q=0 | 65.1% ± 4.8% | 64.6% ± 4.5% |
Sent140 | q=1 | 66.5% ± 0.2% | 66.2% ± 0.2% |
Shakespeare | q=0 | 51.1% ± 0.3% | 61.4% ± 2.7% |
Shakespeare | q=0.001 | 52.1% ± 0.3% | 60.0% ± 0.5% |
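The two averages differ only in how devices are weighted: averaging over data points weights each device by its number of test samples, while averaging over devices weights every device equally. A minimal sketch, with the hypothetical helper `average_accuracies` and made-up counts:

```python
def average_accuracies(correct, total):
    """Return (accuracy over all data points, accuracy averaged over devices)."""
    per_point = sum(correct) / sum(total)
    per_device = sum(c / t for c, t in zip(correct, total)) / len(total)
    return per_point, per_device


# Hypothetical example: a large accurate device and a small inaccurate one.
per_point, per_device = average_accuracies(correct=[90, 5], total=[100, 10])
```

In this example the large device dominates the data-point average, whereas the device average exposes the weaker small device.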
Device-specific q. In these experiments, we explore a device-specific strategy for selecting q in q-FFL. We solve q-FFL with multiple values of q in parallel. After training, each device selects the best resulting model based on its validation data and tests the performance of that model using its testing set. We report the results in terms of testing accuracy in Table 6. Interestingly, using this device-specific strategy, the average accuracy in fact increases while the accuracy variance is reduced, in comparison with q = 0. We note that this strategy does induce more local computation and additional communication load at each round. However, it does not increase the number of communication rounds if run in parallel.
Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
Vehicle | q=0 | 87.3% ± 0.5% | 43.0% ± 1.0% | 95.7% ± 1.0% | 291 ± 18 |
Vehicle | q=5 | 87.7% ± 0.7% | 69.9% ± 0.6% | 94.0% ± 0.9% | 48 ± 5 |
Vehicle | multiple q’s | 88.5% ± 0.3% | 70.0% ± 2.0% | 95.8% ± 0.6% | 52 ± 7 |
Shakespeare | q=0 | 51.1% ± 0.3% | 39.7% ± 2.8% | 72.9% ± 6.7% | 82 ± 41 |
Shakespeare | q=0.001 | 52.1% ± 0.3% | 42.1% ± 2.1% | 69.0% ± 4.4% | 54 ± 27 |
Shakespeare | multiple q’s | 52.0% ± 1.5% | 41.0% ± 4.3% | 72.0% ± 4.8% | 72 ± 32 |
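The device-specific selection step can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, the accuracies are assumed to be available as a mapping from each q to a list of per-device values, and ties are broken in favor of the first q listed.

```python
def select_q_per_device(val_acc):
    """For each device, pick the q whose model has the best validation accuracy.

    val_acc maps q -> list of per-device validation accuracies.
    """
    qs = list(val_acc)
    num_devices = len(val_acc[qs[0]])
    return [max(qs, key=lambda q: val_acc[q][k]) for k in range(num_devices)]


def device_specific_test_acc(val_acc, test_acc):
    """Testing accuracy of the model each device selected on validation data."""
    chosen = select_q_per_device(val_acc)
    return [test_acc[q][k] for k, q in enumerate(chosen)]


# Hypothetical accuracies for two devices and two values of q.
val = {0: [0.90, 0.40], 5: [0.80, 0.60]}
test = {0: [0.85, 0.35], 5: [0.75, 0.65]}
chosen = select_q_per_device(val)            # [0, 5]
result = device_specific_test_acc(val, test)  # [0.85, 0.65]
```

Because the models for different q values are trained in parallel, the selection itself adds no communication rounds; it only requires each device to evaluate every candidate model on its own validation set.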
Convergence speed of q-FFL. In Section 4.2, we show that our solver q-FedAvg, which uses local updating schemes, converges significantly faster than q-FedSGD. A natural question one might ask is: will the q-FFL objective with q > 0 slow convergence compared with FedAvg? We empirically investigate this on the four datasets. We use q-FedAvg to solve q-FFL, and compare it with FedAvg (i.e., solving q-FFL with q = 0). As demonstrated in Figure 5, the q values that result in more fair solutions do not significantly slow down convergence.
Comparison with uniform sampling. In Figure 6 and Table 7, we show that in terms of training accuracies, the uniform sampling heuristic outperforms q-FFL (as opposed to the testing accuracy results in Section 4). We suspect that this is because the uniform sampling baseline is a static method and is likely to overfit to those devices with few samples. In addition to Figure 2 in Section 4.3, we also report the average testing accuracy with respect to data points, the best 10% and worst 10% accuracies, and the variance in Table 8.
Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
Synthetic | uniform | 83.5% ± 0.2% | 42.6% ± 1.4% | 100.0% ± 0.0% | 366 ± 17 |
Synthetic | q=1 | 78.9% ± 0.2% | 41.8% ± 1.0% | 96.8% ± 0.5% | 292 ± 11 |
Vehicle | uniform | 87.3% ± 0.3% | 46.6% ± 0.8% | 94.8% ± 0.5% | 261 ± 10 |
Vehicle | q=5 | 87.8% ± 0.5% | 71.3% ± 2.2% | 93.1% ± 1.4% | 122 ± 12 |
Sent140 | uniform | 69.1% ± 0.5% | 42.2% ± 1.1% | 91.0% ± 1.3% | 188 ± 19 |
Sent140 | q=1 | 68.2% ± 0.6% | 46.0% ± 0.3% | 88.8% ± 0.8% | 143 ± 4 |
Shakespeare | uniform | 57.7% ± 1.5% | 54.1% ± 1.7% | 72.4% ± 3.2% | 32 ± 7 |
Shakespeare | q=0.001 | 66.7% ± 1.2% | 48.0% ± 0.4% | 71.2% ± 1.9% | 56 ± 9 |
Dataset | Objective | Average | Worst 10% | Best 10% | Variance |
Synthetic | uniform | 82.2% ± 1.1% | 30.0% ± 0.4% | 100.0% ± 0.0% | 525 ± 47 |
Synthetic | q=1 | 79.0% ± 1.2% | 31.1% ± 1.8% | 100.0% ± 0.0% | 472 ± 14 |
Vehicle | uniform | 86.8% ± 0.3% | 45.4% ± 0.3% | 95.4% ± 0.7% | 267 ± 7 |
Vehicle | q=5 | 87.7% ± 0.7% | 69.9% ± 0.6% | 94.0% ± 0.9% | 48 ± 5 |
Sent140 | uniform | 66.6% ± 2.6% | 21.1% ± 1.9% | 100.0% ± 0.0% | 560 ± 19 |
Sent140 | q=1 | 66.5% ± 0.2% | 23.0% ± 1.4% | 100.0% ± 0.0% | 509 ± 30 |
Shakespeare | uniform | 50.9% ± 0.4% | 41.0% ± 3.7% | 70.6% ± 5.4% | 71 ± 38 |
Shakespeare | q=0.001 | 52.1% ± 0.3% | 42.1% ± 2.1% | 69.0% ± 4.4% | 54 ± 27 |
Efficiency of q-FFL compared with AFL. One added benefit of q-FFL is that it leads to faster convergence than AFL, even when we use non-local-updating methods for both objectives. In Figure 7, we show that, with respect to the final testing accuracy for the single worst device (i.e., the objective that AFL is trying to optimize), q-FFL converges faster than AFL. As the number of devices increases (from Fashion MNIST to Vehicle), the performance gap between AFL and q-FFL becomes larger because AFL introduces larger variance.
Efficiency of q-FedAvg under different data heterogeneity. As discussed in Section 4.4, one potential cause for the slower convergence of q-FedAvg on the synthetic dataset may be that local updating schemes could hurt convergence when local data distributions are highly heterogeneous. Although it has been shown that applying updates locally results in significantly faster convergence in terms of communication rounds [24, 39], which is consistent with our observation on most datasets, we note that when data are highly heterogeneous, local updating may hurt convergence. We validate this by creating an IID synthetic dataset (Synthetic-IID) where local data on each device follow the same global distribution. We call the synthetic dataset used in Section 4 Synthetic-Non-IID. We also create a hybrid dataset (Synthetic-Hybrid) where half of the total devices are assigned IID data from the same distribution, and half of the total devices are assigned data from different distributions. We observe that if data are perfectly IID, q-FedAvg is more efficient than q-FedSGD. As data become more heterogeneous, q-FedAvg converges more slowly than q-FedSGD in terms of communication rounds. For all three synthetic datasets, we repeat the process of tuning a best constant step-size for FedSGD and observe similar results as before: our dynamic solver q-FedSGD behaves similarly to (or sometimes outperforms) a best hand-tuned FedSGD.