LEAF: A Benchmark for Federated Settings

by Sebastian Caldas, et al.
Carnegie Mellon University

Modern federated networks, such as those comprised of wearable devices, mobile phones, or autonomous vehicles, generate massive amounts of data each day. This wealth of data can help to learn models that can improve the user experience on each device. However, learning in federated settings presents new challenges at all stages of the machine learning pipeline. As the machine learning community begins to tackle these challenges, we are at a critical time to ensure that developments made in this area are grounded in real-world assumptions. To this end, we propose LEAF, a modular benchmarking framework for learning in federated settings. LEAF includes a suite of open-source federated datasets, a rigorous evaluation framework, and a set of reference implementations, all geared toward capturing the obstacles and intricacies of practical federated environments.






1 Introduction

With data increasingly being generated on federated networks of remote devices, there is growing interest in empowering on-device applications with models that make use of such data [16, 17, 23]. Learning on data generated in federated networks, however, introduces several new obstacles:

Statistical:   Data is generated on each device in a heterogeneous manner, with each device associated with a different (though perhaps related) underlying data generating distribution. Moreover, the number of data points typically varies significantly across devices.

Systems:   The number of devices in federated scenarios is typically orders of magnitude larger than the number of nodes in a typical distributed setting, such as datacenter computing. Each device may have significant constraints in terms of storage, computation, and communication capacity, and these capacities may also differ across devices due to variability in hardware, network connection, and power. Thus, federated settings may suffer from communication bottlenecks that dwarf those encountered in traditional distributed datacenter settings, and may require faster on-device inference.

Privacy and Security:   Finally, the sensitive nature of personally-generated data requires methods that operate on federated data to balance privacy and security concerns with more traditional considerations such as statistical accuracy, scalability, and efficiency.

Recent works have proposed diverse ways of dealing with these challenges, but many of these efforts fall short when it comes to their experimental evaluation. As an example, consider the federated learning paradigm, which focuses on training models directly on federated networks [16, 23, 21]. Experimental works focused on federated learning broadly utilize three types of datasets: (1) datasets that do not provide a realistic model of a federated scenario and yet are commonly used, e.g., artificial partitions of MNIST, Fashion-MNIST or CIFAR-10 [16, 10, 7, 3, 9, 25, 27]; (2) realistic but proprietary federated datasets, e.g., data from an unnamed social network in [16], crowdsourced voice commands in [15], and proprietary data by Huawei in [4]; and (3) realistic federated datasets that are derived from publicly available data, but which are not straightforward to reproduce, e.g., FaceScrub in [19], Shakespeare in [16] and Reddit in [10, 18, 3].

In this work, we aim to bridge the gap between datasets that are popular and accessible for benchmarking, and those that realistically capture the characteristics of a federated scenario but are either proprietary or difficult to process. Moreover, beyond just establishing a suite of federated datasets, we propose a clear methodology for evaluating methods and reproducing results. To this end, we present LEAF, a modular benchmarking framework geared towards learning in massively distributed federated networks of remote devices. We note that while federated learning is a canonical application for LEAF, the framework in fact encompasses a wide range of potential learning settings, such as on-device learning or model inference, multi-task learning, meta-learning, transfer learning, life-long learning, and the development of personalized learning models with implications for fairness. For example, the multi-task learning (MTL) community can benefit from LEAF by considering each device as a different task. LEAF’s datasets would then allow researchers and practitioners to test MTL methods in regimes with large numbers of tasks and samples, contrary to traditional MTL datasets (e.g., the popular Landmine Detection [30, 20, 29, 23], Computer Survey [2, 1, 11] and Inner London Education Authority School [20, 14, 1, 2, 11] datasets, each of which has a limited number of tasks). Similarly, these devices could also be naturally interpreted as tasks in meta-learning settings, in contrast to the artificially generated tasks considered in popular benchmarks such as Omniglot [12, 6, 26, 24] and miniImageNet [22, 6, 26, 24].


2 LEAF

LEAF is an open-source benchmarking framework for federated settings (all code and documentation can be found at https://talwalkarlab.github.io/leaf/). It consists of (1) a suite of open-source datasets, (2) an array of statistical and systems metrics, and (3) a set of reference implementations. As shown in Figure 1, LEAF’s modular design allows these three components to be easily incorporated into diverse experimental pipelines. We now detail LEAF’s core components.

Figure 1: LEAF modules and their connections. The “Datasets” module preprocesses the data and transforms it into a standardized JSON format, which can be integrated into an arbitrary ML pipeline. LEAF’s “Reference Implementations” module is a growing repository of common methods used in the federated setting, with each implementation producing a log of various statistical and systems metrics. This log (or any log generated in an appropriate format) can be used to aggregate and analyze these metrics in various ways. LEAF performs this analysis through its “Metrics” module.


Datasets:   We have curated a suite of realistic federated datasets. We focus on datasets where (1) the data has a natural keyed generation process (where each key refers to a particular device); (2) the data is generated from networks of thousands to millions of devices; and (3) the number of data points is skewed across devices. Currently, LEAF consists of three datasets:


  • Federated Extended MNIST (FEMNIST), which serves as a similar (and yet more challenging) benchmark to the popular MNIST [13] dataset. It is built by partitioning the data in Extended MNIST [5] based on the writer of the digit/character.

  • Sentiment140 [8], an automatically generated sentiment analysis dataset that annotates tweets based on the emoticons present in them. In this dataset, each device is a different Twitter user.

  • Shakespeare, a dataset built from The Complete Works of William Shakespeare [28, 16]. Here, each speaking role in each play is considered a different device.

We provide statistics on these datasets in Table 1. In LEAF, we provide all necessary pre-processing scripts for each dataset, as well as small/full versions for prototyping and final testing. Moving forward, we plan to add datasets from different domains (e.g., audio, video) and to increase the range of machine learning tasks (e.g., text to speech, translation, compression).

Dataset | Number of devices | Total samples | Samples per device (mean) | Samples per device (stdev)
Table 1: Statistics of datasets in LEAF.
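The keyed generation process and the quantities named in the Table 1 header can be illustrated with a minimal sketch. This is not LEAF's pre-processing code: the record layout, the field name `writer`, and the helper names are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
import json
import statistics
from collections import defaultdict


def partition_by_key(records, key_field):
    """Group records into per-device shards using a natural key
    (e.g. the writer, the user, or the speaking role)."""
    shards = defaultdict(list)
    for record in records:
        shards[record[key_field]].append(record)
    return dict(shards)


def dataset_statistics(shards):
    """Compute the quantities named in the Table 1 header."""
    counts = [len(samples) for samples in shards.values()]
    return {
        "num_devices": len(counts),
        "total_samples": sum(counts),
        "mean_samples_per_device": statistics.mean(counts),
        # Assumption: sample standard deviation; undefined for one device.
        "stdev_samples_per_device": (
            statistics.stdev(counts) if len(counts) > 1 else 0.0
        ),
    }


# Toy records; the field names are hypothetical.
records = [
    {"writer": "w1", "x": [0.1], "y": 3},
    {"writer": "w1", "x": [0.4], "y": 7},
    {"writer": "w1", "x": [0.9], "y": 1},
    {"writer": "w2", "x": [0.2], "y": 3},
]
shards = partition_by_key(records, "writer")
stats = dataset_statistics(shards)
print(json.dumps(stats, indent=2))
```

Even in this toy case the sample counts are skewed across devices (three samples versus one), which is the property the selection criteria above ask for.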


Metrics:   Rigorous evaluation metrics are required to appropriately assess how a learning solution behaves in federated scenarios. Currently, LEAF establishes an initial set of metrics chosen specifically for this purpose. For example, we introduce metrics that better capture the entire distribution of performance across devices: performance at the 10th and 90th percentiles and performance stratified by natural hierarchies in the data (e.g., “play” in the case of the Shakespeare dataset). We also introduce metrics that account for the amount of computing resources needed from the edge devices in terms of number of FLOPs and number of bytes downloaded/uploaded. Finally, LEAF also recognizes the importance of specifying how the accuracy is weighted across devices, e.g., whether every device is equally important, or every data point is equally important (implying that power users/devices get preferential treatment). Notably, considering stratified systems and accuracy metrics is particularly important in order to evaluate whether a method will systematically exclude groups of users (e.g., because they have lower-end devices) and/or will underperform for segments of the population (e.g., because they produce less data).
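To make the distinction between percentile metrics and the two accuracy weightings concrete, here is a small sketch. It is not LEAF's “Metrics” module: the function names are hypothetical, the per-device accuracies and sample counts are made-up toy values, and the percentile uses linear interpolation between order statistics, which is one of several common conventions.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) of values, using linear
    interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def weighted_accuracy(accuracies, num_samples, per_sample=True):
    """Aggregate per-device accuracy.

    per_sample=True  -> every data point is equally important.
    per_sample=False -> every device is equally important.
    """
    if per_sample:
        total = sum(num_samples)
        return sum(a * n for a, n in zip(accuracies, num_samples)) / total
    return sum(accuracies) / len(accuracies)


# Toy values: four small devices and one power user.
device_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90]
device_samples = [10, 10, 10, 10, 160]

p10 = percentile(device_accuracy, 10)
p90 = percentile(device_accuracy, 90)
by_device = weighted_accuracy(device_accuracy, device_samples, per_sample=False)
by_sample = weighted_accuracy(device_accuracy, device_samples, per_sample=True)
print(p10, p90, by_device, by_sample)
```

In this toy setting the per-sample average is pulled toward the power user's accuracy, while the per-device average and the 10th percentile expose how the low-data devices fare.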

Reference implementations:   In order to facilitate reproducibility, LEAF also contains a set of reference implementations of algorithms geared towards federated scenarios. Currently, this set is limited to the federated learning paradigm, and in particular includes reference implementations of minibatch SGD and FedAvg [16]. Moving forward, we aim to equip LEAF with implementations for additional methods and paradigms with the help of the broader research community.
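For readers unfamiliar with FedAvg, the following is a minimal sketch of its general structure: local training on a sample of devices followed by a sample-size-weighted average of the resulting models. It is not LEAF's reference implementation. The linear model, the squared-error loss, the per-sample (batch size 1) local updates, and all function names and hyperparameter values are assumptions chosen to keep the example self-contained.

```python
import random


def local_sgd(weights, data, epochs, lr):
    """Run `epochs` passes of per-sample SGD on squared error
    for a linear model, starting from the global weights."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w


def federated_average(client_weights, client_sizes):
    """Average client models, weighting each by its number of samples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


def fedavg(clients, dim, rounds, clients_per_round, epochs, lr, seed=0):
    """Each round: sample devices, train locally, then average."""
    rng = random.Random(seed)
    global_w = [0.0] * dim
    for _ in range(rounds):
        selected = rng.sample(clients, clients_per_round)
        updates = [local_sgd(global_w, data, epochs, lr) for data in selected]
        sizes = [len(data) for data in selected]
        global_w = federated_average(updates, sizes)
    return global_w


# Toy devices: every sample follows y = 2 * x, with skewed sample counts.
clients = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([3.0], 6.0)],
    [([0.5], 1.0), ([1.5], 3.0), ([2.5], 5.0)],
]
model = fedavg(clients, dim=1, rounds=20, clients_per_round=2, epochs=1, lr=0.05)
print(model)
```

In this sketch the `epochs` argument controls how much local computation happens between communication rounds.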

3 LEAF in action

We now show a glimpse of LEAF in action. In particular, we highlight two of LEAF’s characteristics:

LEAF enables reproducible science:   To demonstrate the reproducibility enabled via LEAF, we focus on qualitatively reproducing the results that [16] obtained on the Shakespeare dataset for a next-character prediction task. In particular, it was noted that for this particular dataset, the FedAvg method surprisingly diverges as the number of local epochs increases. This is therefore a critical setting to understand before deploying methods such as FedAvg. To show how LEAF allows for rapid prototyping of this scenario, we use the reference FedAvg implementation and subsample the devices in our Shakespeare data (which can be easily done through our framework). Results are shown in Figure 2, where we indeed see similar divergence behavior in terms of the training loss as we increase the number of epochs.

Figure 2: Convergence behavior of FedAvg on a subsample of the Shakespeare dataset. We use the same learning rate and number of devices per round for all experiments. We are able to achieve test accuracy comparable to the results obtained in [16]. We also qualitatively replicate the divergence in training loss that is observed for large numbers of local epochs.

LEAF provides granular systems and statistical metrics:   As illustrated in Figure 3, our proposed systems and statistical metrics are important to consider when serving multiple clients simultaneously. For statistical metrics, we show the effect of varying the minimum number of samples per user in Sentiment140. We see that, while median performance degrades only slightly with data-deficient users, the 25th percentile (bottom of box) degrades dramatically. Meanwhile, for systems metrics, we run minibatch SGD and FedAvg for FEMNIST and calculate the systems budget needed to reach a fixed accuracy threshold. We characterize the budget in terms of the total number of FLOPs across all devices and the total number of bytes uploaded to the network. Our results demonstrate the improved systems profile of FedAvg when it comes to the communication vs. local computation trade-off, though we note that in general methods may vary across these two dimensions, and it is thus important to consider both aspects depending on the problem at hand.

Figure 3: Statistical and systems analyses for Sentiment140 and FEMNIST. The parameters varied are the minimum number of samples per user, the percentage of local data used by each device, and the number of clients selected per round. Orange lines represent the median device accuracy, green triangles represent the mean, boxes cover the 25th to the 75th percentile, and whiskers cover the 10th to the 90th percentile. We use the same learning rate for all experiments.
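The systems budget discussed above (total FLOPs across devices and total bytes uploaded) can be tallied with simple bookkeeping, sketched below. This is an illustration of the accounting only, not LEAF's measurement code: the function name, the assumption that every selected device uploads a full model of 4-byte parameters each round, and every input value are hypothetical and are not measurements from the experiments.

```python
def systems_budget(rounds, clients_per_round, model_params, flops_per_sample,
                   samples_per_client, local_epochs, bytes_per_param=4):
    """Tally total upload volume and total on-device computation.

    Assumes each selected device uploads one full model per round and
    processes all of its local samples in every local epoch.
    """
    total_bytes_uploaded = (
        rounds * clients_per_round * model_params * bytes_per_param
    )
    total_flops = (
        rounds * clients_per_round * local_epochs
        * samples_per_client * flops_per_sample
    )
    return {
        "total_flops": total_flops,
        "total_bytes_uploaded": total_bytes_uploaded,
    }


# Hypothetical inputs: many cheap rounds versus fewer rounds with more
# local epochs. The round counts are invented for illustration.
many_rounds = systems_budget(rounds=100, clients_per_round=10,
                             model_params=1000, flops_per_sample=2000,
                             samples_per_client=50, local_epochs=1)
few_rounds = systems_budget(rounds=20, clients_per_round=10,
                            model_params=1000, flops_per_sample=2000,
                            samples_per_client=50, local_epochs=5)
print(many_rounds)
print(few_rounds)
```

With these invented inputs the two configurations spend the same total FLOPs while the second uploads one fifth of the bytes, which shows why both dimensions need to be reported together.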

4 Conclusion

We present LEAF, a modular benchmarking framework for learning in federated settings, or ecosystems marked by massively distributed networks of devices. Learning paradigms applicable in such settings include federated learning, multi-task learning, meta-learning, and on-device learning. LEAF allows researchers and practitioners in these domains to reason about newly proposed solutions under more realistic assumptions than previous benchmarks. We intend to keep LEAF up to date with new datasets, metrics and open-source solutions in order to foster informed and grounded progress in this field.