Leaf: A Benchmark for Federated Settings
Modern federated networks, such as those comprised of wearable devices, mobile phones, or autonomous vehicles, generate massive amounts of data each day. This wealth of data can help to learn models that can improve the user experience on each device. However, learning in federated settings presents new challenges at all stages of the machine learning pipeline. As the machine learning community begins to tackle these challenges, we are at a critical time to ensure that developments made in this area are grounded in real-world assumptions. To this end, we propose LEAF, a modular benchmarking framework for learning in federated settings. LEAF includes a suite of open-source federated datasets, a rigorous evaluation framework, and a set of reference implementations, all geared toward capturing the obstacles and intricacies of practical federated environments.
With data increasingly being generated on federated networks of remote devices, there is growing interest in empowering on-device applications with models that make use of such data [16, 17, 23]. Learning on data generated in federated networks, however, introduces several new obstacles:
Statistical: Data is generated on each device in a heterogeneous manner, with each device associated with a different (though perhaps related) underlying data generating distribution. Moreover, the number of data points typically varies significantly across devices.
Systems: The number of devices in federated scenarios is typically orders of magnitude larger than the number of nodes in a typical distributed setting, such as datacenter computing. Each device may have significant constraints in terms of storage, computational, and communication capacities, and these capacities may also differ across devices due to variability in hardware, network connection, and power. Thus, federated settings may suffer from communication bottlenecks that dwarf those encountered in traditional distributed datacenter settings, and may require faster on-device inference.
Privacy and Security: Finally, the sensitive nature of personally-generated data requires methods that operate on federated data to balance privacy and security concerns with more traditional considerations such as statistical accuracy, scalability, and efficiency.
Recent works have proposed diverse ways of dealing with these challenges, but many of these efforts fall short when it comes to their experimental evaluation. As an example, consider the federated learning paradigm, which focuses on training models directly on federated networks [16, 23, 21]. Experimental works focused on federated learning broadly utilize three types of datasets: (1) datasets that do not provide a realistic model of a federated scenario and yet are commonly used, e.g., artificial partitions of MNIST, MNIST-fashion or CIFAR-10 [16, 10, 7, 3, 9, 25, 27]; (2) realistic but proprietary federated datasets, e.g., data from an unnamed social network, crowdsourced voice commands, and proprietary data by Huawei; and (3) realistic federated datasets that are derived from publicly available data, but which are not straightforward to reproduce, e.g., FaceScrub, Shakespeare, and Reddit (the latter in [10, 18, 3]).
In this work, we aim to bridge the gap between datasets that are popular and accessible for benchmarking, and those that realistically capture the characteristics of a federated scenario but are either proprietary or difficult to process. Moreover, beyond just establishing a suite of federated datasets, we propose a clear methodology for evaluating methods and reproducing results. To this end, we present LEAF, a modular benchmarking framework geared towards learning in massively distributed federated networks of remote devices. We note that while federated learning is a canonical application for LEAF, the framework in fact encompasses a wide range of potential learning settings, such as on-device learning or model inference, multi-task learning, meta-learning, transfer learning, life-long learning, and the development of personalized learning models with implications for fairness. For example, the multi-task learning (MTL) community can benefit from LEAF by considering each device as a different task. LEAF's datasets would then allow researchers and practitioners to test MTL methods in regimes with large numbers of tasks and samples, contrary to traditional MTL datasets (e.g., the popular Landmine Detection [30, 20, 29, 23], Computer Survey [2, 1, 11] and Inner London Education Authority School [20, 14, 1, 2, 11] datasets, each with only a limited number of tasks). Similarly, these devices could also be naturally interpreted as tasks in meta-learning settings, in contrast to the artificially generated tasks considered in popular benchmarks such as Omniglot [12, 6, 26, 24] and miniImageNet [22, 6, 26, 24].
LEAF is an open-source benchmarking framework for federated settings (all code and documentation can be found at https://talwalkarlab.github.io/leaf/). It consists of (1) a suite of open-source datasets, (2) an array of statistical and systems metrics, and (3) a set of reference implementations. As shown in Figure 1, LEAF's modular design allows these three components to be easily incorporated into diverse experimental pipelines. We now detail LEAF's core components.
Datasets: We have curated a suite of realistic federated datasets. We focus on datasets where (1) the data has a natural keyed generation process (where each key refers to a particular device); (2) the data is generated from networks of thousands to millions of devices; and (3) the number of data points is skewed across devices. Currently, LEAF consists of three datasets: FEMNIST, Sentiment140, and Shakespeare.
We provide statistics on these datasets in Table 1. In LEAF, we provide all necessary pre-processing scripts for each dataset, as well as small/full versions for prototyping and final testing. Moving forward, we plan to add datasets from different domains (e.g., audio, video) and to increase the range of machine learning tasks (e.g., text to speech, translation, compression).
| Dataset | Number of devices | Total samples | Samples per device |
Metrics: Rigorous evaluation metrics are required to appropriately assess how a learning solution behaves in federated scenarios. Currently, LEAF establishes an initial set of metrics chosen specifically for this purpose. For example, we introduce metrics that better capture the entire distribution of performance across devices: performance at the 10th and 90th percentiles and performance stratified by natural hierarchies in the data (e.g., “play” in the case of the Shakespeare dataset). We also introduce metrics that account for the amount of computing resources needed from the edge devices in terms of number of FLOPS and number of bytes downloaded/uploaded. Finally, LEAF also recognizes the importance of specifying how the accuracy is weighted across devices, e.g., whether every device is equally important, or every data point is equally important (implying that power users/devices get preferential treatment). Notably, considering stratified systems and accuracy metrics is particularly important in order to evaluate whether a method will systematically exclude groups of users (e.g., because they have lower-end devices) and/or will underperform for segments of the population (e.g., because they produce less data).
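The statistical metrics described above can be sketched in a few lines of Python. This is only an illustrative sketch, not LEAF's own code: the function names (`percentile`, `accuracy_summary`, `stratified_accuracy`), the group labels, and all numbers are hypothetical, and the percentile uses simple linear interpolation as an assumed convention.

```python
from collections import defaultdict


def percentile(values, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def accuracy_summary(correct, total):
    """correct[i] / total[i]: correct predictions / samples on device i."""
    per_device = [c / n for c, n in zip(correct, total)]
    return {
        # Every device counts equally.
        "device_weighted": sum(per_device) / len(per_device),
        # Every data point counts equally (favors devices with more data).
        "sample_weighted": sum(correct) / sum(total),
        "p10": percentile(per_device, 10),
        "p90": percentile(per_device, 90),
    }


def stratified_accuracy(correct, total, groups):
    """Sample-weighted accuracy within each group (e.g., one group per play)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for c, n, g in zip(correct, total, groups):
        hits[g] += c
        counts[g] += n
    return {g: hits[g] / counts[g] for g in counts}


# Made-up numbers for three devices, purely for illustration.
correct = [9, 40, 1]
total = [10, 50, 5]
summary = accuracy_summary(correct, total)
by_group = stratified_accuracy(correct, total, ["play_a", "play_a", "play_b"])
print(summary)
print(by_group)
```

In this toy example the device-weighted and sample-weighted accuracies differ because the device with the fewest samples also has the lowest accuracy, which is the kind of gap the two weighting schemes are meant to expose.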
Reference implementations: In order to facilitate reproducibility, LEAF also contains a set of reference implementations of algorithms geared towards federated scenarios. Currently, this set is limited to the federated learning paradigm, and in particular includes reference implementations of minibatch SGD and FedAvg. Moving forward, we aim to equip LEAF with implementations for additional methods and paradigms with the help of the broader research community.
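For readers unfamiliar with FedAvg, the following is a minimal sketch of the general idea (local SGD on sampled devices followed by a sample-weighted average of the resulting models). It is not LEAF's reference implementation: the one-parameter model y ≈ w·x, the function names `local_sgd` and `fedavg`, the hyperparameters, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random


def local_sgd(w, data, epochs, lr):
    """Run `epochs` passes of SGD on one device for the model y ~ w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= lr * grad
    return w


def fedavg(devices, rounds, clients_per_round, epochs, lr, seed=0):
    """Toy FedAvg: each round, sampled devices train locally from the current
    global weight, and the server averages results weighted by sample count."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        selected = rng.sample(devices, clients_per_round)
        updates = [(local_sgd(w, data, epochs, lr), len(data)) for data in selected]
        n_total = sum(n for _, n in updates)
        w = sum(w_k * n for w_k, n in updates) / n_total
    return w


# Hypothetical devices with unequal amounts of data, all drawn from y = 3x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5)],
    [(1.5, 4.5), (1.0, 3.0), (0.5, 1.5)],
]
w_final = fedavg(devices, rounds=50, clients_per_round=2, epochs=1, lr=0.05)
print(round(w_final, 4))
```

Because every device here shares the same underlying relationship, this toy problem converges easily; it does not exhibit the heterogeneity that makes real federated data challenging.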
We now show a glimpse of LEAF in action. In particular, we highlight two of LEAF's characteristics:
LEAF enables reproducible science: To demonstrate the reproducibility enabled via LEAF, we focus on qualitatively reproducing previously reported results on the Shakespeare dataset for a next-character prediction task. In particular, it was noted that for this particular dataset, the FedAvg method surprisingly diverges as the number of local epochs increases. This is therefore a critical setting to understand before deploying methods such as FedAvg. To show how LEAF allows for rapid prototyping of this scenario, we use the reference FedAvg implementation and subsample a fraction of the devices in our Shakespeare data (which can be easily done through our framework). Results are shown in Figure 2, where we indeed see similar divergence behavior in terms of the training loss as we increase the number of epochs.
LEAF provides granular systems and statistical metrics: As illustrated in Figure 3, our proposed systems and statistical metrics are important to consider when serving multiple clients simultaneously. For statistical metrics, we show the effect of varying the minimum number of samples per user in Sentiment140. We see that, while median performance degrades only slightly when data-deficient users are included, the 25th percentile (bottom of box) degrades dramatically. Meanwhile, for systems metrics, we run minibatch SGD and FedAvg for FEMNIST and calculate the systems budget needed to reach a fixed accuracy threshold. We characterize the budget in terms of total number of FLOPS across all devices and total number of bytes uploaded to the network. Our results demonstrate the improved systems profile of FedAvg when it comes to the communication vs. local computation trade-off, though we note that in general methods may vary across these two dimensions, and it is thus important to consider both aspects depending on the problem at hand.
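One way to compute such a systems budget from per-round logs is sketched below. The log format (dictionaries with `accuracy`, `flops`, and `bytes_uploaded` keys), the function name `budget_to_threshold`, the threshold value, and all numbers are hypothetical and serve only to illustrate the bookkeeping; they are not taken from LEAF or from the experiments reported here.

```python
def budget_to_threshold(round_logs, threshold):
    """Accumulate FLOPS and uploaded bytes (summed over all devices) until the
    accuracy first reaches `threshold`. Returns None if it never does."""
    total_flops = 0
    total_bytes = 0
    for round_idx, log in enumerate(round_logs, start=1):
        total_flops += log["flops"]
        total_bytes += log["bytes_uploaded"]
        if log["accuracy"] >= threshold:
            return {
                "rounds": round_idx,
                "flops": total_flops,
                "bytes_uploaded": total_bytes,
            }
    return None


# Made-up logs for two methods: one does more local computation per round,
# the other needs more communication rounds.
more_local_work = [
    {"accuracy": 0.40, "flops": 8_000, "bytes_uploaded": 100},
    {"accuracy": 0.62, "flops": 8_000, "bytes_uploaded": 100},
    {"accuracy": 0.71, "flops": 8_000, "bytes_uploaded": 100},
]
more_rounds = [
    {"accuracy": 0.20, "flops": 1_000, "bytes_uploaded": 100},
    {"accuracy": 0.35, "flops": 1_000, "bytes_uploaded": 100},
    {"accuracy": 0.50, "flops": 1_000, "bytes_uploaded": 100},
    {"accuracy": 0.61, "flops": 1_000, "bytes_uploaded": 100},
    {"accuracy": 0.66, "flops": 1_000, "bytes_uploaded": 100},
    {"accuracy": 0.70, "flops": 1_000, "bytes_uploaded": 100},
]
THRESHOLD = 0.70  # assumed value for illustration only
budget_a = budget_to_threshold(more_local_work, THRESHOLD)
budget_b = budget_to_threshold(more_rounds, THRESHOLD)
print(budget_a)
print(budget_b)
```

Reporting both totals side by side makes the communication vs. local computation trade-off explicit, since a method can be cheaper on one axis while being more expensive on the other.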
We present LEAF, a modular benchmarking framework for learning in federated settings, or ecosystems marked by massively distributed networks of devices. Learning paradigms applicable in such settings include federated learning, multi-task learning, meta-learning, and on-device learning. LEAF allows researchers and practitioners in these domains to reason about newly proposed solutions under more realistic assumptions than previous benchmarks. We intend to keep LEAF up to date with new datasets, metrics and open-source solutions in order to foster informed and grounded progress in this field.
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.