Flower: A Friendly Federated Learning Research Framework

by Daniel J. Beutel, et al.

Federated Learning (FL) has emerged as a promising technique for edge devices to collaboratively learn a shared prediction model, while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store the data in the cloud. However, FL is difficult to implement and deploy in practice, considering the heterogeneity in mobile devices, e.g., different programming languages, frameworks, and hardware accelerators. Although there are a few frameworks available to simulate FL algorithms (e.g., TensorFlow Federated), they do not support implementing FL workloads on mobile devices. Furthermore, these frameworks are designed to simulate FL in a server environment and hence do not allow experimentation in distributed mobile settings for a large number of clients. In this paper, we present Flower (https://flower.dev/), a FL framework which is both agnostic towards heterogeneous client environments and also scales to a large number of clients, including mobile and embedded devices. Flower's abstractions let developers port existing mobile workloads with little overhead, regardless of the programming language or ML framework used, while also allowing researchers flexibility to experiment with novel approaches to advance the state-of-the-art. We describe the design goals and implementation considerations of Flower and show our experiences in evaluating the performance of FL across clients with heterogeneous computational and communication capabilities.


1 Introduction

Figure 1: Flower framework architecture.

There has been tremendous progress in enabling the execution of deep learning models on mobile and embedded devices to infer user contexts and behaviors Fromm et al. (2018); Chowdhery et al. (2019); Malekzadeh et al. (2019); Lee et al. (2019); Yao et al. (2019); LiKamWa et al. (2016); Georgiev et al. (2017). This has been powered by the increasing computational abilities of mobile devices as well as novel algorithms which apply software optimizations to enable pre-trained cloud-scale models to run on resource-constrained devices. However, when it comes to the training of these mobile-focused models, a working assumption has been that the models will be trained centrally in the cloud, using training data aggregated from several users. Given recent events such as the Cambridge Analytica scandal Guardian (2020) and increasing regulatory attempts to strengthen data protection, e.g., the EU General Data Protection Regulation (GDPR), the assumption that user data can be aggregated on a central server may no longer hold, which in turn could severely limit the progress of deep learning applications.

Federated Learning (FL) McMahan et al. (2017) is an emerging area of research in the machine learning community which aims to enable distributed edge devices (or users) to collaboratively train a shared prediction model while keeping their personal data private. At a high level, this is achieved by repeating three basic steps: i) local weight updates to a shared prediction model on each edge device, ii) sending the local weight updates to a central server for aggregation, and iii) receiving the aggregated model back for the next round of local updates. We provide a more detailed primer on FL in §2.1. An example of FL is the mobile keyword prediction in the Google Keyboard on Android Google (2020a). The keystroke data collected by the keyboard is extremely sensitive, as it could reveal a user’s passwords or bank account details; Google therefore employs FL to train a keyword prediction model on this data in a distributed and privacy-preserving way.

From a systems perspective, a major bottleneck to FL research is the paucity of frameworks that support implementation of FL methods on mobile and wireless devices. While there are frameworks such as TensorFlow Federated Google (2020b); Abadi et al. (2016a) (TFF) and LEAF Caldas et al. (2018) that enable experimentation on FL algorithms, they do not provide support for deployment of FL methods on edge devices. Deploying FL methods on mobile and wireless edge devices poses challenges that server-side implementations do not face. First, there is heterogeneity in the software stack on mobile and embedded devices (e.g., Android, iOS, Raspbian). Next, these devices may have drastically different compute and memory availability, which will impact the local training of models on them. Finally, owing to the wireless communications channel, the network bandwidth on mobile devices is much lower and more prone to fluctuations than on cloud machines, which influences the amount of time it takes to synchronize the local weight updates in FL with the central server. All these system-related factors, in combination with the choice of the weight aggregation algorithm, can impact the accuracy and training time of models trained in a federated setting.

In this paper, we present Flower, a new framework for FL that supports experimentation on both algorithmic and systems-related challenges in FL. Flower offers a stable, language- and ML framework-agnostic implementation of the core components of an FL system, and provides higher-level abstractions to enable researchers to quickly experiment and implement new ideas on top of a reliable stack. Moreover, Flower allows users to quickly transition existing training pipelines and workloads into an FL setup in order to evaluate their convergence properties and training time in a federated setting. Most importantly, Flower provides support for extending FL implementations to mobile and wireless clients with heterogeneous compute, memory and communication resources.

While extending FL research to mobile clients is at its heart, Flower can also support algorithmic research on FL by allowing researchers to run their experiments on distributed cloud servers, or even on a single machine. As system-level challenges of limited compute, memory and network bandwidth in mobile devices are not a major bottleneck for the powerful cloud servers, Flower provides built-in tools to simulate many of these challenging conditions in a cloud environment and allows a realistic evaluation of FL algorithms. Finally, Flower is designed with scalability in mind and enables research which leverages both a large number of connected clients and a large number of clients training concurrently.

In §3, we present the core design and architecture details of Flower, followed by the system implementation specifics in §4 and §5. In addition to the system-level abstractions, Flower also provides an abstraction for implementing various FL algorithms through a Strategy interface (with three popular variants of FedAvg already included), which we describe in §5.4. Finally, in §6, through a series of experiments on realistic mobile-focused ML workloads, we show that Flower can support reproducible FL experiments by varying both system-level parameters such as the total number of concurrent device clients, the amount of computing heterogeneity in the client pool or the network bandwidth available to each client, as well as algorithm-specific parameters such as the choice of averaging algorithm or the client sampling strategies for training and evaluation. We also demonstrate the scalability of Flower by concurrently training FL models on 100 clients and exchanging as much as 20GB of data (i.e., model parameters) in every round of FL. In summary, we make the following contributions to the mobile systems literature:

  • We present Flower, a new Federated Learning framework that supports implementation and experimentation of FL methods on mobile and wireless devices. At the same time, Flower can support algorithmic research in FL by simulating real-world system conditions, such as limited computational resources and slow network speeds, which are common for mobile and wireless clients.

  • We describe the design principles and implementation details of Flower, along with several examples of integrating it with cloud-based and mobile clients. In addition to being language- and ML framework-agnostic, Flower is also fully extendable and can incorporate emerging weight averaging algorithms, new FL training strategies and different communication protocols.

  • Using Flower as the underlying framework, we present experiments that explore both algorithmic and system-level aspects of FL on three mobile-focused machine learning workloads. Our results quantify the impact of various system bottlenecks such as client heterogeneity and fluctuating network speeds on FL performance.

  • Flower is open-sourced under the Apache 2.0 License (https://github.com/adap/flower) with the hope that it will assist the research community in quick experimentation on FL-focused research questions, and that community members will further extend the framework to support new communication protocols and mobile clients.

2 Background and Related Work

We begin by describing the general idea of Federated Learning (FL), along with prior work and how it relates to Flower.

2.1 Federated Learning Approach

FL, proposed in McMahan et al. (2017), introduced a way of learning models from decentralized data residing on mobile devices. It is based on the assumption that data remains on-device, i.e., there are a number of data partitions and we have no influence on the data distribution within each partition. The described problem setup was proposed under the name federated optimization, with federated learning being an approach to solve federated optimization problems.

FL operates under the assumption that there is a server (which usually does not hold any data) and a number of mobile clients connected to that server, each holding their respective local data set (or “partition”). The FL process begins by initializing a shared model, e.g., a CNN. The learning process then progresses in so-called rounds. As shown in Figure 2, each round can be imagined as a sequence of the following steps.

Figure 2: A high-level overview of FL detailing the steps performed in a single round.
  1. Select a subset of connected clients which are going to participate in the next round of FL.

  2. Distribute the current global model weights to the selected clients, along with instructions on how to perform training (e.g., the number of epochs).

  3. The (selected) clients receive the model weights and train those weights on their locally held data set (according to the instructions received). On completion, the clients send the updated model weights back to the server.

  4. Receive all locally updated models, aggregate them, and use the aggregated model to replace the previous version of the global model. For aggregation, the initial proposal was to build a weighted average which weights updates based on the number of training examples used by each participating client to compute the respective update.

Evaluation needs to be performed in a similarly distributed fashion, as the server does not hold any data. It can either be performed in parallel to training on a different set of clients, or after each training round. For efficiency reasons, only a subset of clients is used for training and evaluation.
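The server-side steps of a round can be sketched in a few lines of Python. This is a minimal illustration, not code from Flower: the names `select_clients` and `aggregate`, the `fraction` parameter, and the representation of a model as a flat list of floats are all assumptions made for the example.

```python
import random
from typing import List, Tuple

# Assumption: a model is represented as a flat list of floats.
Weights = List[float]


def select_clients(client_ids: List[str], fraction: float, rng: random.Random) -> List[str]:
    """Sample a subset of the connected clients for the next round."""
    k = max(1, int(len(client_ids) * fraction))
    return rng.sample(client_ids, k)


def aggregate(results: List[Tuple[Weights, int]]) -> Weights:
    """Weighted average of client weights.

    Each entry is (updated weights, number of local training examples);
    clients with more examples contribute proportionally more.
    """
    total_examples = sum(n for _, n in results)
    size = len(results[0][0])
    return [
        sum(weights[i] * n for weights, n in results) / total_examples
        for i in range(size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    selected = select_clients([f"client-{i}" for i in range(10)], fraction=0.3, rng=rng)
    print(len(selected), "clients selected")
    # Two clients report updates; the second trained on three times as many examples.
    print(aggregate([([1.0, 2.0], 1), ([3.0, 4.0], 3)]))
```

Distribution of weights and local training are omitted here because they happen on the network and on the clients, respectively.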

2.2 Prior Work

FL builds on a vast body of prior work and has since been expanded in different directions. McMahan et al. (2017) introduced the basic federated averaging (FedAvg) algorithm and evaluated it in terms of communication efficiency. Bonawitz et al. (2019) followed by providing details on scalability considerations of a Google-internal FL system. The optimization of distributed training with and without federated concepts has been covered from many angles Dean et al. (2012); Jia et al. (2018); Chahal et al. (2018); Sergeev and Balso (2018); Dryden et al. (2016).

There is also active work on privacy and robustness improvements for FL. A targeted model poisoning attack using Fashion-MNIST Xiao et al. (2017) (along with possible mitigation strategies) was demonstrated by Bhagoji et al. (2018). Abadi et al. (2016b) proposed a way to translate the idea of differential privacy to deep learning. Secure aggregation Bonawitz et al. (2017) is a way to hide model updates from “honest but curious” attackers. Robustness and fault-tolerance improvements at the optimizer level are commonly studied and demonstrated, for example by Zeno Xie et al. (2019a).

Most of these approaches have in common that they implement their own systems to obtain the described results. The main intention of Flower is to provide a framework which would (a) allow researchers to perform similar research using a common framework and (b) enable them to run those experiments on a large number of real mobile devices.

3 Flower Design Goals

In this section we describe the underlying design goals and some exemplary use cases addressed by Flower.

3.1 Design Goals

The creation of Flower was motivated by the observations that real-world FL workloads often face heterogeneous client environments, that the FL state-of-the-art advances quickly, that FL is difficult to implement, and that ML frameworks evolve rapidly. Based on those observations we define five major design goals for Flower:

  • ML-framework agnostic: Given that ML-frameworks for the mobile setting are just emerging, Flower should be compatible with existing and future ML-frameworks.

  • Client agnostic: Given the heterogeneous environments on mobile clients, Flower should be interoperable with different programming languages, operating systems, and hardware settings.

  • Expandable: Given the rate of change in FL, Flower should be expandable to enable both experimental research and adoption of recently proposed approaches.

  • Accessible: Given the number and complexity of existing ML workloads, Flower should enable application developers to federate those pipelines with low engineering overhead.

  • Scalable: Given the often large number of mobile application users, Flower should scale to a large number of concurrent clients to foster research on a larger scale.

3.2 Use Cases

Flower aims to support a wide range of use cases for both researchers and application developers. We present some of them and outline the role of Flower in their implementation.

3.2.1 Reproducible research

FL research requires the implementation of several components before one can focus on the actual question at hand. Flower offers proven implementations of these components in an easy to use package, thus enabling researchers to quickly experiment and implement new ideas on top of a reliable stack. Furthermore, having a library of existing methods already implemented in Flower allows researchers to compare against existing ideas very quickly or base their experimentation on those ideas by modifying the provided implementations. Flower will also offer a common platform for mobile and wireless FL as it is the only framework that allows for fully-heterogeneous setups and was conceived with a focus on the requirements imposed by those workloads. A common platform for benchmarking FL approaches against each other is important because subtle implementation differences in the underlying distribution mechanisms can substantially harm comparability of results.

3.2.2 Federating existing workloads

While FL has emerged as a promising technique, the breadth of ML tasks that have been translated to the federated setting remains quite narrow. Flower, by linking with existing ML-frameworks and providing capabilities to run on both mobile devices and the cloud, will allow someone to take an existing ML training codebase and federate it in under a day (by our estimate). It allows users to quickly re-use existing model training pipelines, add the capability for federated training, understand their convergence properties in bandwidth-limited environments using cloud simulation, and then translate them to run on real mobile devices, all in the same framework.

3.2.3 (Wireless) network fluctuations

Fluctuating network connectivity is an inherent property of the wireless/mobile setting. Even modern devices in urban environments face occasional connectivity limitations, for example, when driving through a tunnel. FL training usually runs for an extended amount of time, which implies a changing pool of available devices. These dynamics are generally known and discussed, but rarely incorporated in experimental setups. Flower enables researchers to quantify the resulting effects on the learning process, either by running on real mobile devices or by simulating bandwidth constraints in the cloud.

3.2.4 Cross-platform workloads

The norm of mobile FL will be heterogeneous devices collaborating on model training. Yet existing frameworks have very limited support for heterogeneity: they either focus on server-side definition of client-side computations or ignore mobile clients completely by only offering cloud-based simulation of federated computations. Flower offers robust support for vastly different client device configurations collaborating in a single federation. It therefore allows researchers to test the performance of existing algorithms in heterogeneous environments, but perhaps more importantly also enables research on algorithms which acknowledge and actively incorporate the heterogeneity assumption. An example could be the training of a vision model that needs to work across Android, iOS, and Raspberry Pi clients. Those devices run different operating systems, have different hardware both in terms of processing and in terms of camera, and require different programming languages. Flower enables such research by providing high-level abstractions for common platforms but also language-independent low-level primitives for more exotic platforms, thus allowing users to integrate arbitrary devices into their experiments.

3.2.5 Scaling research

Real-world FL setups often have a pool of hundreds or thousands of clients available for training. Yet some research configurations appear to be over-simplified in the sense that they only use, for example, ten clients. Perhaps this is due to the implementation effort of larger systems. Flower was designed with scalability in mind and enables research which leverages both a large number of connected clients and a large number of clients training concurrently. We hope this will encourage research that generalises better to the properties of real-world FL.

3.3 Framework Architecture

The Flower framework comprises a number of components, illustrated in Figure 1. We describe how Flower attempts to offer easy-to-use building blocks that are flexible enough for researchers to build their own FL algorithms by providing different integration points for both client-side and server-side logic.

3.3.1 Flower Client

Researchers can federate existing ML workloads by implementing a Flower interface called Client. The Client interface allows Flower to orchestrate the FL process by calling user code to perform certain aspects of the FL process, e.g., training or evaluation. An SDK handles many details such as connection management, the Flower protocol, and serialization for the user. The current version of Flower provides this SDK for Android (Java-based) and Python; other common mobile setups will be supported in future releases (e.g., Objective-C on iOS).

An important property of Flower is that clients implemented in different languages running on different platforms can collaborate on training, which reflects the common case for mobile applications. The fact that those clients run on mobile devices also means that we can evaluate FL in a wireless setting. We provide more details about successful implementations in §4 and §6.
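To make the idea of a client-side integration point concrete, the following sketch shows what such an interface could look like in Python. It is an illustration under stated assumptions: the method names `fit` and `evaluate`, their signatures, and the toy `MeanEstimatorClient` workload are hypothetical and are not taken from the Flower SDK.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Assumption: a model is represented as a flat list of floats.
Weights = List[float]


class Client(ABC):
    """Hypothetical client interface; the framework would call these methods."""

    @abstractmethod
    def fit(self, weights: Weights, config: Dict[str, int]) -> Tuple[Weights, int]:
        """Train on local data; return updated weights and number of examples used."""

    @abstractmethod
    def evaluate(self, weights: Weights) -> Tuple[float, int]:
        """Evaluate on local data; return the loss and number of examples used."""


class MeanEstimatorClient(Client):
    """Toy workload: a single weight that estimates the mean of the local data."""

    def __init__(self, data: List[float]) -> None:
        self.data = data

    def fit(self, weights: Weights, config: Dict[str, int]) -> Tuple[Weights, int]:
        w = weights[0]
        learning_rate = 0.1
        for _ in range(config.get("epochs", 1)):
            for x in self.data:
                # Gradient step on the squared error (w - x)^2.
                w -= learning_rate * 2 * (w - x)
        return [w], len(self.data)

    def evaluate(self, weights: Weights) -> Tuple[float, int]:
        w = weights[0]
        loss = sum((w - x) ** 2 for x in self.data) / len(self.data)
        return loss, len(self.data)
```

In this style the existing training code stays inside `fit`, so a workload written in any ML framework only needs a thin wrapper that converts its model to and from the exchanged weights.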

3.3.2 Flower Server

The Flower server is responsible for connection handling, client life cycle management, performing (customizable) rounds of federated learning, distributed and/or centralized validation, customizable (weight) update aggregation, metric collection, and error handling.

Although not explicitly stated in the design goals, we tried to move complexity into the server to keep client implementation(s) as lightweight as possible (by definition, there is only one server, but many clients which often run on low-powered mobile devices). The server also offers a plug-in architecture which allows users to customize considerable parts of FL for both research and workload adaptation.

3.3.3 Federation Strategy

New FL algorithms can be easily implemented using an abstraction called Strategy. The most popular FL algorithm is FedAvg (see §5.5 for more), and its variants are used in many settings. Flower already implements a number of these, including plain FedAvg, fault-tolerant FedAvg, and other state-of-the-art FL algorithms designed for heterogeneous settings (see Table 1). Users can also choose to customize aspects of FedAvg by providing their own Strategy implementation, for example, to influence the way clients are selected for training. First and foremost, the Strategy abstraction enables researchers to experiment with completely new FL algorithms. §5.4 and §5.5 go into more detail about how strategy implementations work under the hood and also showcase how they can be used to implement a (simple) fault-tolerant version of FedAvg in a few lines of Python.

Strategy | Description
FedAvg | Vanilla Federated Averaging algorithm proposed in McMahan et al. (2017).
Fault-tolerant FedAvg | A variant of FedAvg that can tolerate faulty client conditions such as client disconnections and laggard clients.
FedProx | Implementation of the algorithm proposed by Li et al. (2018) to extend FL to heterogeneous network conditions.
QFedAvg | Implementation of the algorithm proposed by Li et al. (2019) to encourage fairness in FL.
FedFS | A new strategy for FL in scenarios of heterogeneous client computational capabilities. This strategy identifies fast and slow clients and intelligently schedules FL rounds across them to minimize the total convergence time.
Table 1: Built-in federated learning strategies (or algorithms) available in Flower. A new strategy could be implemented using Flower’s Strategy interface.
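As a rough illustration of how a server-side plug-in point might separate the algorithm from the rest of the server, the sketch below defines a hypothetical strategy interface and a simple fault-tolerant averaging rule. The class names, the `aggregate_fit` method, and the `min_completion_rate` parameter are assumptions made for this example and should not be read as Flower's actual Strategy signature.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple

# Assumption: a model is a flat list of floats; a result is (weights, num_examples).
Weights = List[float]
FitResult = Tuple[Weights, int]


class Strategy(ABC):
    """Hypothetical server-side plug-in point for the aggregation rule."""

    @abstractmethod
    def aggregate_fit(self, results: List[FitResult], failures: int) -> Optional[Weights]:
        """Return new global weights, or None to keep the previous global model."""


class FaultTolerantFedAvg(Strategy):
    """Weighted averaging that skips a round if too few clients report back."""

    def __init__(self, min_completion_rate: float = 0.5) -> None:
        self.min_completion_rate = min_completion_rate

    def aggregate_fit(self, results: List[FitResult], failures: int) -> Optional[Weights]:
        attempted = len(results) + failures
        if attempted == 0 or len(results) / attempted < self.min_completion_rate:
            return None  # Not enough successful updates in this round.
        total_examples = sum(n for _, n in results)
        if total_examples == 0:
            return None  # Nothing to weight the average by.
        size = len(results[0][0])
        return [
            sum(weights[i] * n for weights, n in results) / total_examples
            for i in range(size)
        ]
```

With this split, disconnections and laggards are handled entirely by the aggregation rule, while connection handling and scheduling remain the responsibility of the server.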

3.3.4 Flower Protocol

Flower offers an additional low-level way of building highly specialized clients, for example, to enable FL research on devices which were previously not supported and could therefore not be considered for FL. Server and client communicate through the Flower protocol, which clients can implement directly using a large number of supported programming languages. Using this interface gives users full flexibility to target an even wider range of client environments (e.g., C++ for embedded devices), but it requires more effort in the sense that connection management, serialization, and protocol semantics must be handled manually. This is somewhat similar to the concept of strategies, which allow for heavy customization on the server side. Integrating directly with the Flower protocol allows for sophisticated implementations, which can enable new kinds of FL methods.

3.3.5 Flower Datasets

The performance of FL algorithms is often influenced by the local datasets on each client. To compare different FL algorithms in a fair and reproducible manner, it is therefore important that they are implemented assuming the same data partitions across clients. Hence, with the goal of encouraging reproducible research, Flower provides a set of built-in datasets and partition functions that can be used to dynamically distribute the datasets across FL clients. Currently, Flower offers three datasets, namely Fashion-MNIST, CIFAR-10 and Speech Commands. While the first two datasets are commonly used for evaluating vision models, Speech Commands is a popular speech dataset used to train spoken keyword detection models. On top of these datasets, Flower has implemented partitioning functions which can split the dataset across clients in a user-defined way, e.g., {100% i.i.d.} or {50% i.i.d., 50% non-i.i.d.}. For example, in the {100% i.i.d.} case, the partitions across clients follow the same distribution as the original dataset. In the {50% i.i.d., 50% non-i.i.d.} setting, however, half of the data on each client is i.i.d., while the remaining half is sampled from only one of the classes. Other types of datasets and partition functions could be added to Flower in the future.
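A partition function of this kind can be sketched as follows. This is an illustrative reconstruction of the described behaviour, not Flower's implementation: the function name `partition`, its parameters, the round-robin assignment of a class to each client, and sampling with replacement are all simplifying assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: an example is (example id, class label).
Example = Tuple[int, int]


def partition(
    examples: List[Example],
    num_clients: int,
    per_client: int,
    iid_fraction: float,
    seed: int = 0,
) -> List[List[Example]]:
    """Give each client an i.i.d. part followed by a single-class part."""
    rng = random.Random(seed)
    by_class: Dict[int, List[Example]] = defaultdict(list)
    for example in examples:
        by_class[example[1]].append(example)
    classes = sorted(by_class)

    n_iid = int(per_client * iid_fraction)
    partitions = []
    for client in range(num_clients):
        # i.i.d. part: drawn from the full dataset (with replacement, for simplicity).
        iid_part = rng.choices(examples, k=n_iid)
        # Non-i.i.d. part: drawn from one class only, assigned round-robin.
        label = classes[client % len(classes)]
        skewed_part = rng.choices(by_class[label], k=per_client - n_iid)
        partitions.append(iid_part + skewed_part)
    return partitions
```

Setting `iid_fraction=1.0` corresponds to the fully i.i.d. case, and `iid_fraction=0.5` to the mixed setting; fixing `seed` keeps the partitions identical across experiments, which is the property needed for fair comparisons.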

3.3.6 Flower Baselines

Flower also provides end-to-end FL implementations by combining Flower Datasets and Flower Strategies with popular neural network architectures for training client models. For example, one can run end-to-end FL training by using CIFAR-10 as the FL dataset, ResNet50 as the client model, and FedAvg as the FL Strategy. These experiments are enabled by a Domain-Specific Language (DSL) interface where users can just specify their desired experiment conditions and let Flower take care of creating an end-to-end FL pipeline.

3.3.7 Flower Tools

Finally, Flower provides a set of system-level tools to deploy, simulate and monitor various parameters of interest during FL. At this stage, these tools include (i) a Deployment Engine, which can automatically spawn cloud instances (currently AWS) to simulate server and clients, (ii) a Compute Simulator, which simulates the slow or fast computational capabilities of different clients, and (iii) a Network Simulator, which can change the network bandwidth between server and clients to simulate challenging heterogeneous network conditions.
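To see why compute speed and bandwidth are the parameters worth simulating, consider a back-of-the-envelope model of one synchronous round, in which the server waits for the slowest selected client. The model below is an assumption made for illustration (symmetric bandwidth, no latency or protocol overhead) and does not describe how the Flower tools work internally; `ClientProfile` and both functions are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientProfile:
    bandwidth_mbit_s: float   # assumed symmetric link speed to the server
    seconds_per_epoch: float  # local training speed on this device


def client_round_seconds(profile: ClientProfile, model_mbytes: float, epochs: int) -> float:
    """Time for one client: download the model, train locally, upload the update."""
    transfer = 2 * model_mbytes * 8 / profile.bandwidth_mbit_s
    return transfer + epochs * profile.seconds_per_epoch


def round_seconds(profiles: List[ClientProfile], model_mbytes: float, epochs: int) -> float:
    """A synchronous round finishes when the slowest client has reported back."""
    return max(client_round_seconds(p, model_mbytes, epochs) for p in profiles)


if __name__ == "__main__":
    fast = ClientProfile(bandwidth_mbit_s=100.0, seconds_per_epoch=10.0)
    slow = ClientProfile(bandwidth_mbit_s=10.0, seconds_per_epoch=30.0)
    print(round_seconds([fast, slow], model_mbytes=25.0, epochs=2))
```

Under these assumptions a single slow client dominates the round time, which is the kind of effect that heterogeneous compute and network settings are meant to expose.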

3.4 Framework Comparison

We now compare the properties of Flower to other FL toolkits, namely TensorFlow Federated (TFF) Google (2020b); Abadi et al. (2016a), PySyft Ryffel et al. (2018), and LEAF Caldas et al. (2018). Table 2 provides an overview, with a more detailed description of those properties following thereafter.

TFF PySyft LEAF Flower
Heterogeneous clients
Network agnostic ?
Server-side definitions
FL research
ML-framework agnostic ✓**
Python SDK
Mobile SDKs ✓***
Language agnostic API
Benchmark suite ✓****

* per-process limitation; ** limited to PyTorch and TF/Keras; *** preliminary on iOS; **** preliminary

Table 2: Comparison of different FL frameworks.

  • Heterogeneous clients refers to the ability to run workloads which include clients running on different platforms using different languages, all in the same workload. FL targeting mobile devices will clearly have to assume pools of clients of many different types (e.g., iPhone, Android phone, Raspberry Pi). Flower supports such heterogeneous client pools through simple client-side integration points. It is the only framework in our comparison that does so, with TFF and PySyft expecting a compatible client environment, and LEAF being focused on Python-based simulations.

  • Network agnostic FL describes the capability to run FL irrespective of the network connection of individual clients participating in the process. Mobile devices are known to have fluctuating network connections, ranging from poor 2G/3G mobile networks to fast Wi-Fi. Flower uses a communication layer based on gRPC, which in turn uses an efficient binary serialization format. Flower is also careful to avoid unnecessary overhead for establishing connections, which is critical on low-end mobile networks. Other frameworks either do not support fully networked setups or make no explicit mention of connection efficiency (TFF seems to be based on gRPC as well though).

  • Server-side definition of computations executed on the client describes a programming model which attempts to control the entire training process from a single point, the server. This approach is used by TFF and PySyft, which try to describe computations by taking a system-wide perspective. This approach can be advantageous to understand the entirety of computations run, but it requires a full re-write of existing client-side ML pipelines.

  • FL research is advancing quickly, but the results are often difficult to reproduce because common frameworks are just starting to emerge. Being able to use a framework to quickly implement research ideas without having to re-implement all the moving parts of FL is an important property. Flower enables implementation of new FL algorithms through a plug-in architecture on the server-side, but also through easy integration of existing ML workloads. TFF and PySyft enable research on federated computations, but require users to rewrite ML workloads with the primitives provided.

  • ML-framework agnostic libraries allow researchers and users to leverage their previous investments in existing ML-frameworks by providing universal integration points. This is a unique property of Flower: The ML-framework landscape is evolving quickly and we believe that it is best to let the user choose which framework to use for their local training pipelines. TFF is tightly coupled with TensorFlow and LEAF also has a dependency on TensorFlow. PySyft supports two ML-frameworks (PyTorch and TensorFlow), but does not allow for integration with arbitrary tools.

  • SDK implementations for popular mobile and server platforms such as Android (Java), iOS (Objective-C), and Python make a framework more approachable and easier to use, especially on mobile devices, which usually require tighter integration with the provided application frameworks than more flexible server environments do. Flower provides a Java SDK (Android) and a Python SDK (server). Objective-C (iOS) is supported on the protocol level, with a full SDK following at a later point. Other frameworks provide strong Python support; support for mobile platforms, however, is either in the planning stage or not planned at all.

  • Language agnostic API. The mobile ecosystem is very diverse. Apart from common platforms like Android and iOS, there are also other platforms worth supporting. Researchers who run, for example, a Raspberry Pi or an Nvidia Jetson probably have very specific requirements. For adoption it is therefore paramount not to have programming language or platform as a limiting factor. Flower achieves this language agnostic interface by offering protocol-level integration through gRPC, which is compatible with a wide range of programming languages. Other frameworks in our comparison are based on Python, with some of them indicating that they plan to support Android and iOS in the future.

  • Benchmark suite. Experimentation with new FL algorithms requires comparison with existing methods. Having existing implementations at one's disposal can greatly accelerate research progress. Flower currently implements a number of FL methods in the context of popular ML benchmarks, e.g., federated training of CIFAR-10 Krizhevsky et al. (2005) image classification. LEAF also comes with a number of benchmarks built-in.

4 Flower Framework Client

We start to detail the client perspective on Flower since it is the main point of integration between Flower and different workloads, for both mobile device and server settings.

4.1 Architecture

From a client perspective there are two ways of interacting with the framework: Using a high-level abstraction in supported languages such as Python (figure 3, top), or using the raw Flower remote procedure call (RPC) protocol otherwise (figure 3, bottom). These two possibilities help to bridge the gap between usability and flexibility. Figure 3 shows this on a conceptual level, with a distinction between functionality provided by the framework and functionality implemented by the user. We first describe the underlying protocol.

Figure 3: Flower client architecture.

4.2 Flower Protocol

The Flower protocol is comprised of two broad message categories: Instructions and connection management. Instructions are sent from the server to the client. The server might instruct the client to evaluate a given set of weights on the local data and the client then replies with the evaluation result. Connection related messages can originate on both the client and the server: The server might decide that it will not select the client for some time and instruct it to reconnect later, or the client might change its state such that it becomes necessary to disconnect from the server (e.g., a mobile client not being plugged in for charging any more).

From an implementation perspective, the client first establishes a connection and then waits for instructions from the server. The current implementation of Flower uses gRPC  Foundation (a), which uses an interface description language (IDL) to define the types of messages exchanged. Compilers then generate efficient implementations for different languages such as Python, Java, or C++.

The major reason for choosing gRPC is that it uses an efficient binary serialization format, which is especially important on low-bandwidth mobile connections. Bi-directional streaming allows for the exchange of multiple messages without the overhead incurred by re-establishing a connection for every request/response pair, which is again an important consideration for mobile wireless connections. Further benefits are typed RPC calls and good support for a variety of languages.
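The two message categories and the client's wait-for-instructions loop can be pictured with a small in-memory sketch. It uses no gRPC at all: the message classes (EvaluateIns, EvaluateRes, Reconnect) and the functions handle and run_client are hypothetical names chosen for illustration, not the actual Flower IDL, and the list passed to run_client merely stands in for a long-lived stream.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class EvaluateIns:
    """Instruction (server -> client): evaluate these weights locally."""
    weights: List[float]


@dataclass
class EvaluateRes:
    """Reply (client -> server): result of the local evaluation."""
    loss: float
    num_examples: int


@dataclass
class Reconnect:
    """Connection management (server -> client): come back later."""
    seconds: int


ServerMessage = Union[EvaluateIns, Reconnect]


def handle(msg: ServerMessage, local_data: List[float]) -> Optional[EvaluateRes]:
    """Process one server message; None means the client should disconnect."""
    if isinstance(msg, Reconnect):
        return None
    # Toy "model": a single weight w, scored by mean squared error.
    w = msg.weights[0]
    loss = sum((x - w) ** 2 for x in local_data) / len(local_data)
    return EvaluateRes(loss=loss, num_examples=len(local_data))


def run_client(stream: List[ServerMessage], local_data: List[float]) -> List[EvaluateRes]:
    """Answer instructions on one stream until told to reconnect."""
    replies = []
    for msg in stream:
        reply = handle(msg, local_data)
        if reply is None:
            break
        replies.append(reply)
    return replies


# The second instruction is never processed: the client disconnects first.
replies = run_client(
    [EvaluateIns([1.0]), Reconnect(60), EvaluateIns([2.0])],
    local_data=[1.0, 3.0],
)
```

All replies travel over the same stream that delivered the instruction, which is the property the text attributes to bi-directional streaming.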

4.3 Client Interface

For evaluation (§6 and §7) we implemented a number of workloads running on different platforms, for example, an object recognition model running on five different kinds of Android devices. The described protocol is general enough to implement arbitrary FL workloads. However, it does not offer the ease-of-use Flower intends to provide because users have to implement a number of RPC-related tasks such as connection management, Flower protocol, and serialization.

We therefore offer higher-level SDKs for common platforms and languages, the first two being Python on Linux and Java on Android. Their intent is to implement functionality which is universal across workloads whilst allowing the user to focus on workload specific logic. The SDK calls user code through an easy to implement interface.

Let us assume that the user wants to train a model for the CIFAR-10  Krizhevsky et al. (2005) image classification task in a federated way and that the necessary client-side logic is implemented in a class called CifarClient. One can then start a client by providing both an instance of CifarClient and details of the server to connect to:

    import flwr

    server_address = "[::]:8080"  # IPv6
    client = CifarClient(…)
    flwr.client.app.start_client(server_address, client)

The function start_client handles connection management, Flower protocol, and serialization. It interacts with user code through callback methods which are defined in the abstract base class Client and implemented in a concrete user class (e.g., CifarClient).
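To make the callback relationship concrete, the following sketch defines a stand-in abstract base class with the three methods named in this paper (get_weights, fit, evaluate) and a hypothetical dispatch helper that routes a server instruction to the matching callback. It is an illustration under these assumptions, not the code inside start_client; EchoClient and the string instruction names are invented for the example.

```python
from abc import ABC, abstractmethod
from typing import List

Weights = List[float]


class Client(ABC):
    """Stand-in for the abstract base class; method names follow the text."""

    @abstractmethod
    def get_weights(self) -> Weights: ...

    @abstractmethod
    def fit(self, weights: Weights) -> Weights: ...

    @abstractmethod
    def evaluate(self, weights: Weights) -> float: ...


def dispatch(client: Client, instruction: str, weights: Weights):
    """Hypothetical helper: route one server instruction to a callback."""
    if instruction == "get_weights":
        return client.get_weights()
    if instruction == "fit":
        return client.fit(weights)
    if instruction == "evaluate":
        return client.evaluate(weights)
    raise ValueError(f"unknown instruction: {instruction}")


class EchoClient(Client):
    """Trivial user class: 'training' adds 1 to every weight."""

    def __init__(self) -> None:
        self.weights: Weights = [0.0]

    def get_weights(self) -> Weights:
        return self.weights

    def fit(self, weights: Weights) -> Weights:
        self.weights = [w + 1.0 for w in weights]
        return self.weights

    def evaluate(self, weights: Weights) -> float:
        return sum(abs(w) for w in weights)


updated = dispatch(EchoClient(), "fit", [0.5, 1.5])
```

The user class contains only workload logic; everything about connections and serialization stays on the framework side of dispatch.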

Figure 4: Flower Android Client UI. A user can specify parameters (e.g., a time range, minimum battery level) which will dictate their participation in FL.

4.4 Client Implementation Example

We continue with the previous example and assume that the user wants to implement CifarClient using TensorFlow/Keras. The following Python code shows a simplified example of such an implementation attempt, omitting some (necessary) details for the sake of brevity.

    class CifarClient(flwr.Client):
        def __init__(self, m, xy_train, xy_test):
            self.model = m  # a compiled Keras model
            self.x_train, self.y_train = xy_train
            self.x_test, self.y_test = xy_test

        def get_weights(self):
            # Return the current local model weights.
            return self.model.get_weights()

        def fit(self, weights):
            # Start from the global weights, train locally, return the update.
            self.model.set_weights(weights)
            self.model.fit(self.x_train, self.y_train)
            return self.model.get_weights()

        def evaluate(self, weights):
            # Evaluate the global weights on the local test data.
            self.model.set_weights(weights)
            return self.model.evaluate(self.x_test, self.y_test)

Full implementations will of course be more involved because they require, for example, learning rate schedules. However, the example hopefully illustrates that the interface methods provided by flwr.Client map well to concepts commonly found in ML workloads. Implementing a client using a different ML-framework would work in a similar manner. Instead of calling Keras methods, one would call, for example, PyTorch Paszke et al. (2019) methods. This alignment of the provided interface with universal ML concepts has allowed us to port existing workloads with little overhead, irrespective of the ML-framework used.
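As a rough illustration of this framework independence, the sketch below implements the same three methods with no ML framework at all: a one-dimensional linear model trained by plain gradient descent. The class name PlainPythonClient, the learning rate, and the toy data are assumptions made for the example; the class deliberately does not inherit from flwr.Client so that it runs on its own.

```python
from typing import List, Tuple

Weights = List[float]


class PlainPythonClient:
    """Same three methods as CifarClient, but with no ML framework.

    The model is y = w * x + b, trained by gradient descent on squared error.
    """

    def __init__(self, data: List[Tuple[float, float]], lr: float = 0.05, steps: int = 500):
        self.data = data
        self.lr = lr
        self.steps = steps
        self.weights: Weights = [0.0, 0.0]  # [w, b]

    def get_weights(self) -> Weights:
        return list(self.weights)

    def fit(self, weights: Weights) -> Weights:
        w, b = weights
        n = len(self.data)
        for _ in range(self.steps):
            # Gradients of the mean squared error with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in self.data) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in self.data) / n
            w, b = w - self.lr * grad_w, b - self.lr * grad_b
        self.weights = [w, b]
        return self.get_weights()

    def evaluate(self, weights: Weights) -> float:
        w, b = weights
        return sum((w * x + b - y) ** 2 for x, y in self.data) / len(self.data)


# Local data drawn from y = 2x + 1, so training should recover w=2, b=1.
client = PlainPythonClient([(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)])
initial = client.get_weights()
trained = client.fit(initial)
```

Only the bodies of fit and evaluate change between frameworks; the shape of the interface stays the same.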

4.5 Integration Example: Android

A good example of Flower’s platform-agnostic capabilities is its seamless integration with clients running on Android. We first compile the gRPC interface definition (which describes the messages exchanged between the FL clients and the server) using a compiler provided with the Android SDK. Thereafter, in the Android application, we spawn a thread that runs in the background and sets up bi-directional streaming RPC with the Flower server using the StreamObserver class. The on-device training is done using TensorFlow Lite (TFLite) and is handled by a class named TaskClient which implements the Client interface provided by Flower for the ML task (e.g., CIFAR-10) at hand. While TFLite is primarily designed for mobile inference, we exploit its capabilities for model personalization Lite (2020) to implement on-device training. Upon receiving messages from the server, the background thread calls the appropriate TFLite training methods exposed by the TaskClient class and returns their output to the server.

5 Flower Framework Server

The server-side faces similar questions of customization and ease-of-use as the client-side, but in an arguably less heterogeneous environment compared to the landscape of mobile clients. We describe the high-level architecture of the Flower server and how researchers can leverage it for FL.

5.1 Architecture

The Flower server is considerably more complex than the client since it has to coordinate the FL process over clients, whereas each client only needs to coordinate its behaviour with one server. Amongst the responsibilities of the server are both high-level tasks like the coordination of the core FL process and low-level tasks such as connection management and error handling. Its components can be grouped into three categories: Flower core, RPC stack, and strategy (see Fig. 5).

Figure 5: Flower server architecture.

5.2 Protocol

The server-side protocol is based on the same IDL description as the client side. Our current implementation is based on gRPC Foundation (a) and uses bi-directional streaming to decrease the overhead incurred by connection establishment, which is especially relevant on mobile devices. The client opens a connection (without sending any payload) and then waits for the server to send instructions such as Fit or Evaluate. The payload of these messages includes all necessary parameters, e.g., the current global model weights in the case of Evaluate.

One important property of this architecture is that the server is unaware of the nature of connected clients, which is important because it allows models to be trained across heterogeneous client platforms and implementations. It not only allows workloads to be implemented for one platform or another, but even allows for fully heterogeneous workloads which learn across a set of heterogeneous devices, for example, a Java client running on Android collaborating with a Swift client running on iOS. Please note that even though the current implementation uses gRPC, there is no inherent reliance on it. The internal server architecture uses modular abstractions such that components that are not inherently tied to gRPC are unaware of it. This could enable future versions of the server to support other RPC frameworks (e.g., Apache Avro Foundation (b)), and even learn across heterogeneous clients where some are connected through gRPC, and others are connected through other RPC frameworks.
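One way to picture such a modular abstraction is a server-side handle that hides the transport from the core logic. The sketch below is an assumption-laden illustration, not Flower's internal design: ClientProxy, GrpcLikeProxy, OtherRpcProxy, and train_once are stand-in names, and no network traffic takes place.

```python
from abc import ABC, abstractmethod
from typing import List

Weights = List[float]


class ClientProxy(ABC):
    """Stand-in server-side handle for one connected client."""

    @abstractmethod
    def fit(self, weights: Weights) -> Weights: ...


class GrpcLikeProxy(ClientProxy):
    """Stands in for a client reached over gRPC (no real network here)."""

    def fit(self, weights: Weights) -> Weights:
        return [w + 1.0 for w in weights]


class OtherRpcProxy(ClientProxy):
    """Stands in for a client reached over some other RPC framework."""

    def fit(self, weights: Weights) -> Weights:
        return [w + 3.0 for w in weights]


def train_once(proxies: List[ClientProxy], weights: Weights) -> Weights:
    """Core logic sees only ClientProxy, never the transport behind it."""
    updates = [p.fit(weights) for p in proxies]
    # Unweighted coordinate-wise mean of the returned updates.
    return [sum(col) / len(updates) for col in zip(*updates)]


mixed = train_once([GrpcLikeProxy(), OtherRpcProxy()], [0.0, 10.0])
```

Because train_once depends only on the abstract interface, clients connected through different RPC frameworks can take part in the same round.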

5.3 Federated Learning Core

Having a number of connected and available clients at its disposal, the server can continue to perform the core duties of an FL system. FL progresses in rounds of learning, each of which can be described as a sequence of client selection, client-side training, result collection, result aggregation, and (optional) evaluation of the aggregated result.
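The round structure can be sketched as a single function over in-memory clients. This is a minimal illustration under stated assumptions: clients are modelled as plain functions, aggregation is an example-weighted average, the optional evaluation step is omitted, and run_round and make_client are hypothetical names.

```python
import random
from typing import Callable, List, Tuple

Weights = List[float]
# A client is modelled as a function: global weights -> (new weights, num examples).
FitFn = Callable[[Weights], Tuple[Weights, int]]


def run_round(global_weights: Weights, clients: List[FitFn], k: int, rng: random.Random) -> Weights:
    """One FL round: select clients, train, collect results, aggregate."""
    selected = rng.sample(clients, k)                     # client selection
    results = [fit(global_weights) for fit in selected]   # training + collection
    total = sum(n for _, n in results)
    # Aggregation: example-weighted average of each coordinate.
    return [
        sum(w[i] * n for w, n in results) / total
        for i in range(len(global_weights))
    ]


def make_client(target: float, num_examples: int) -> FitFn:
    """Toy client whose local training moves weights halfway to its target."""
    def fit(weights: Weights) -> Tuple[Weights, int]:
        return [w + 0.5 * (target - w) for w in weights], num_examples
    return fit


rng = random.Random(0)
clients = [make_client(1.0, 10), make_client(3.0, 30)]
weights = [0.0]
for _ in range(20):
    weights = run_round(weights, clients, k=2, rng=rng)
```

With both toy clients selected in every round, the global weight converges towards 2.5, the example-weighted mean of the two local targets.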

The server implements those primitives commonly found in different FL approaches. It does, however, avoid prescribing a specific FL algorithm and instead delegates the decision of how to continue to a so-called strategy. One can think of a strategy as a server plug-in that allows users to build a vast range of custom Flower server behaviours (detailed in §5.4).

Both distributed and centralized validation are supported. Real-world workloads which have their data distributed over the participating clients will usually use distributed validation. It works by distributing the global model to (a subset of) available clients, having them evaluate the model on their local data set, and returning the individual evaluation results (e.g., loss) to the server. Distributed evaluation is part of the previously described Flower protocol. Centralized validation, on the other hand, can happen on the server (if validation data is available on the server). It is an important tool for benchmarking since the performance estimates it provides are more stable compared to those of distributed evaluation (which usually validates the model on only a small subset of the available validation data). Flower supports centralized validation through the strategy abstraction.
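The stability argument can be illustrated with a toy calculation. In the sketch below, which assumes a mean-squared-error loss and an example-weighted combination of client results (neither is prescribed by the text), the distributed estimate equals the centralized one only when every client is evaluated; a subset gives a different number. The names distributed_loss and mse are hypothetical.

```python
from typing import List, Tuple


def distributed_loss(results: List[Tuple[float, int]]) -> float:
    """Combine per-client (loss, num_examples) pairs into one estimate."""
    total = sum(n for _, n in results)
    return sum(loss * n for loss, n in results) / total


def mse(w: float, data: List[float]) -> float:
    """Loss of a one-parameter toy model on a list of scalar samples."""
    return sum((x - w) ** 2 for x in data) / len(data)


shards = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0]]  # validation data held by two clients
w = 3.0                                       # the "global model"

per_client = [(mse(w, shard), len(shard)) for shard in shards]
all_clients = distributed_loss(per_client)      # every client evaluated
first_only = distributed_loss(per_client[:1])   # only a subset evaluated
central = mse(w, [x for shard in shards for x in shard])
```

Here all_clients matches central, while first_only depends on which client happened to be sampled.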

5.4 Strategy Abstraction

The strategy abstraction is the heart of the FL activity — it is basically synonymous with the FL algorithm performed. One design goal of Flower was to enable flexibility for both researchers to experiment with state-of-the-art approaches and application developers to tune the behaviour for their respective workload. The server achieves this flexibility through a plug-in architecture which delegates certain decisions to a user-provided implementation of the abstract base class Strategy. This strategy abstraction can therefore be used to inject arbitrary logic and customize core aspects of the FL process, e.g., client selection and update aggregation. As of now, we have successfully implemented five kinds of FL algorithms as summarized in Table 1. These algorithms include FedAvg and its fault-tolerant variant and several state-of-the-art algorithms for extending FL to heterogeneous environments.

Average users should still be able to run the system without having to implement the details of their own FL algorithms. For users who do not intend to supply their own strategy instance, the server therefore loads a DefaultStrategy which performs FedAvg with sensible defaults. See §5.5 for an example. One implication of this design is that server strategies developed for one use case can easily be used in others. It even enables the creation of an open ecosystem: researchers can propose new strategies and offer them in stand-alone libraries, and application developers can compose those with the core framework and their individual workload.

5.5 Strategy Example

The following example demonstrates how one of the most prominent FL algorithms to date, namely FedAvg, can be implemented and even customized to incorporate a basic level of fault-tolerance, as proposed by Bonawitz et al. (2019). Such implementations are based on the high-level Strategy interface provided by Flower. Note that a strategy implementation does not necessarily need to override all methods defined by flwr.Strategy, but it can instead decide to rely on sensible default implementations by only overriding selected methods of other flwr.Strategy implementations.

    class FedAvg(flwr.Strategy):
        def on_configure_fit(self, rnd, weights, client_manager):
            # Select ten clients for this round of training.
            clients = client_manager.sample(10)
            return [configure(c) for c in clients]

        def on_configure_evaluate(self, rnd, weights, client_manager):
            # Select ten clients for distributed evaluation.
            clients = client_manager.sample(10)
            return [configure(c) for c in clients]

        def on_aggregate_fit(self, rnd, results, failures):
            # Keep the previous model if fewer than eight clients returned.
            if len(results) < 8:
                return None
            return weighted_avg(results)

        def on_aggregate_evaluate(self, rnd, results, failures):
            return loss_avg(results)

A Flower server running with this strategy performs FedAvg. It randomly samples ten clients for training and aggregates the updates provided by them if at least eight of them return successfully. This provides a basic level of fault tolerance by allowing clients to fail while ensuring that the model is only replaced if the completion rate reaches a certain threshold (80% in this case). It then samples another ten clients for distributed evaluation and averages the results (this time ignoring the number of failures). It does not implement other possibilities offered by the Strategy interface, for example, centralized validation.
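The listing relies on a helper called weighted_avg that is not shown. One plausible reading, sketched here as an assumption rather than the framework's implementation, is the example-weighted average used by FedAvg; the result format (weights, number of examples) and the standalone function aggregate_fit are likewise hypothetical.

```python
from typing import List, Optional, Tuple

Weights = List[float]
FitResult = Tuple[Weights, int]  # (client weights, number of local examples)


def weighted_avg(results: List[FitResult]) -> Weights:
    """Average weights coordinate-wise, weighting clients by example count."""
    total = sum(n for _, n in results)
    dim = len(results[0][0])
    return [sum(w[i] * n for w, n in results) / total for i in range(dim)]


def aggregate_fit(results: List[FitResult], min_results: int = 8) -> Optional[Weights]:
    """Mirror of on_aggregate_fit: no new model if too few clients returned."""
    if len(results) < min_results:
        return None
    return weighted_avg(results)


# Eight successful clients: six small ones and two large ones.
ok = aggregate_fit([([1.0, 0.0], 10)] * 6 + [([3.0, 4.0], 30)] * 2)
# Seven successful clients out of ten: below the 80% threshold.
too_few = aggregate_fit([([1.0, 0.0], 10)] * 7)
```

Returning None signals that the round produced no replacement model, which matches the fault-tolerance behaviour described above.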

6 Evaluation Methodology

We now present a series of experiments that demonstrate Flower’s abilities in supporting the implementation of real-world FL workloads across heterogeneous clients and varied networking speeds.

6.1 Datasets

Three widely-used benchmark datasets are used in our evaluation, namely CIFAR-10 Krizhevsky et al. (2005), Fashion-MNIST Xiao et al. (2017) and Office-31 off (2020). CIFAR-10 and Office-31 are examples of object recognition datasets and hence are representative of the ML workloads expected in mobile vision use-cases such as supporting object discovery for vision-impaired users see (2020) and mobile lifelogging Mathur et al. (2017). Fashion-MNIST, on the other hand, contains images of fashion items such as trousers or pullovers and is representative of the workloads of e-commerce mobile applications that provide clothing recommendations to users.

CIFAR-10 consists of 60,000 images from 10 different object classes. The images are 32 x 32 pixels in size and in RGB format. We use the training and test splits provided by the dataset authors: 50,000 images are used as training data and the remaining 10,000 images are reserved for testing.

Fashion-MNIST consists of images of fashion items (60,000 training, 10,000 test) with 10 classes. The images are 28 x 28 pixels in size and in grayscale format.

Office-31 contains images of common office objects (e.g., printers, tables) belonging to 31 different classes. The dataset has images captured from different sources (a web camera, a DSLR camera and Amazon product images) — for our experiments, we only use the object images from ‘Amazon’ as the FL workload primarily because the images from other sources are fewer in number. In total, we use 2900 images (10% are held out for testing) which are 300 x 300 pixels in size and in RGB format.

6.2 Experiment Settings

Flower Server. We implement the Flower Server in Python following the FedAvgStrategy described earlier, which allows varying the number of FL clients used in the training process. As mentioned earlier, Flower is designed for scalability; as such, in our experiments we vary the number of clients concurrently participating in the training process from 4 to 100, and evaluate their impact on the FL accuracy and training time.

Mobile Clients. FL systems are likely to be driven by mobile clients owned by users distributed around the world. As highlighted in §4, Flower is agnostic of the programming language and machine learning framework used for model training and therefore, it can incorporate any client that supports on-device training, including mobile clients.

For our evaluation, we implement FL clients on Android phones that run TensorFlow Lite (TFLite). While TFLite is primarily designed for mobile inference, we exploit its capabilities to do on-device model personalization to implement a FL client application on it (refer to §4.5 for details). Specifically, we train an object recognition model in a federated mobile setting to recognize everyday office objects in the Office-31 dataset. We use a pre-trained and frozen MobileNetV2 base model for extracting image features and train a 2-layer DNN (using FL) as the classifier that operates on the extracted features. In order to scale our experiments to a reasonably large number of mobile clients with different operating systems, we deploy Flower on the Amazon AWS Device Farm dev (2020), which enables deploying applications on real mobile devices accessed through AWS. Table 3 lists the mobile devices from AWS Device Farm used in our evaluation.

Cloud-based Clients. To experiment with bigger and more compute-intensive FL workloads that cannot be run on today’s mobile devices due to limited runtime memory or computational resources, we also implement Flower clients on AWS VMs with TensorFlow as the ML training framework. This allows us to flexibly scale the number of clients and also train deeper architectures (e.g., ResNet50) with larger datasets (e.g., CIFAR-10). With the emergence of edge AI accelerators such as Google Coral equipped with purpose-built ASIC chips, we anticipate that in the future, these workloads could be executed directly on edge devices using Flower.

Device Name            Type    OS Version
Google Pixel 4         Phone   10
Google Pixel 3         Phone   10
Google Pixel 2         Phone   9
Samsung Galaxy Tab S6  Tablet  9
Samsung Galaxy Tab S4  Tablet  8.1.0
Table 3: Flower clients on AWS Device Farm.

6.3 Evaluation Goals

We use FedAvg McMahan et al. (2017), the most common federated learning algorithm, for our evaluation; however, Flower can also support other FL algorithms such as Mocha Smith et al. (2017). We first show the performance of Flower in training a model for our three datasets under federated settings. For the Office-31 dataset, we train a MobileNetV2 + 2 layer DNN model on Android-based mobile clients hosted in AWS Device Farm. For the CIFAR-10 and Fashion-MNIST datasets, we train models using the ResNet50 and two-layered CNN architectures on cloud-based clients. These experiments are primarily designed to explore how Flower scales with the number of client nodes and we report the test accuracy and training time for these experiments.

Next, we explore the capabilities of Flower to support practical scenarios of system heterogeneity. To this end, we implement Flower clients on machines with different computational capabilities and report its impact on the training time of FL. We also investigate the effect of a client’s network bandwidth on the FL convergence time, by simulating speeds similar to 4G networks. Finally, we study the scalability of Flower to extreme cases wherein it has to support FL across hundreds of active clients.

7 Evaluation Results

We now present our key results that highlight the abilities of Flower in supporting the implementation of real-world FL workloads on heterogeneous clients. Due to a lack of mobile and wireless support in earlier FL tools, many of the individual experiments shown are the first of their kind. More specifically, our results show that:

  • Using Flower, researchers can easily change various FL experimental parameters such as number of participating clients per round or the averaging strategy on the server, which in turn help in evaluating the accuracy and scalability of the FL system.

  • Flower can work seamlessly with both mobile (Android) clients as well as cloud-based clients implemented in Python. By enabling support for heterogeneous clients with different computation capabilities and network bandwidths, Flower enables accurate estimation of metrics such as FL training time in real-world scenarios.

  • Flower can be used to uncover and quantify the effect of various system-related bottlenecks on FL performance. For example, we find that transitioning from cloud-scale clients to clients running on average 4G speeds can increase the training time of a ResNet50 model by 2.15 times.

  • Flower can scale to hundreds of concurrent clients and seamlessly handle the large amounts of data transfer that take place in a FL training cycle.

Performance on FL workloads. In Figure 6, we present the test accuracy obtained on the Office-31 dataset by training a 2-layer DNN on top of a pre-trained MobileNetV2 on Android clients. For this experiment, we first vary the number of Android clients (C) participating in FL in each round. The training dataset is split equally across all participating clients without overlap; for example, if there are 10,000 samples in the training set and 10 participating clients, then each client is assigned 1,000 samples as its local training data. From Figure 6(a), we observe that by increasing the number of participating clients in each round, a more accurate model can be trained. Intuitively, as more clients participate in the training, the model is exposed to more, and more diverse, training examples, which improves its generalization to unseen test samples. Notably, the test accuracy obtained using FL is on par with the centralized test accuracy on the Office-31 dataset using MobileNetV2. Next, we vary the number of local training epochs (E) on each Android client and observe which value results in the best test accuracy (Figure 6(b)). Using a high number of local epochs causes the local weight updates to diverge significantly across the clients, which leads to poor performance of FedAvg.

It is worth highlighting that these results are dependent on the hyper-parameters (e.g., learning rate, optimizer) used for training – as such, they could be further enhanced. Our primary goal, through this experiment, was instead to demonstrate that researchers can use Flower to conduct different types of FL experiments with minimal effort.
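The equal, non-overlapping split of the training set described above can be sketched as follows. The function name partition is hypothetical, and dropping any remainder is a simplification made here so that all shards have the same size; a real pipeline would typically also shuffle the samples first.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition(samples: Sequence[T], num_clients: int) -> List[List[T]]:
    """Split samples into equal, non-overlapping contiguous shards.

    Any remainder (len(samples) % num_clients) is dropped so that all
    shards have the same size.
    """
    shard_size = len(samples) // num_clients
    return [
        list(samples[i * shard_size:(i + 1) * shard_size])
        for i in range(num_clients)
    ]


# 10,000 samples over 10 clients: 1,000 samples per client.
shards = partition(range(10_000), 10)
```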

(a) Varying the number of clients (C).
(b) Effect of local training epochs (E).
Figure 6: Flower supports implementation of FL clients on Android devices and, in general, on any device that provides on-device training support. Here we show various experiments on Android devices enabled by Flower. (a) shows that varying the number of clients (C) in FL can impact the overall test accuracy. For this experiment, we used E=5. (b) shows the effect of local training epochs (E) on FL accuracy. For this experiment, we use C=10.

In Figure 7, we show the effect of varying the number of FL clients on larger workloads such as CIFAR-10 and Fashion-MNIST running on AWS Virtual Machines. We instantiate 100 Flower Clients and randomly sample C clients for each training round; this can be achieved by simply modifying the FedAvgStrategy (§5.5) provided by Flower. We observe that even with 10 clients per round, we can achieve test accuracy similar to 20 or 30 clients. This experimental result becomes important in the context of battery-powered edge devices: if a model can be accurately learned with fewer devices per round, this knowledge could be used to optimize the scheduling of FL workloads across devices to save energy and communication costs. Finally, in Figure 8, we show the effect of varying the local training epochs on FL test accuracy with 100 clients. We observe that more local training epochs are beneficial for both the CIFAR-10 and Fashion-MNIST workloads.

(a) Fashion-MNIST
(b) CIFAR-10
Figure 7: Effect of varying the number of clients (C out of 100) on the FL test accuracy. For this experiment, we used E=5.
(a) Fashion-MNIST
(b) CIFAR-10
Figure 8: Effect of varying local training epochs (E) on each client on the FL test accuracy. We set C=100.

Effect of Computational Heterogeneity. In practice, FL clients which are often mobile and embedded devices are likely to have vastly different computational capabilities. While some newer smartphones are now equipped with mobile GPUs, other phones or wearable devices may have a much less powerful processor. Certainly, the computational power of each client will impact the time taken to perform local weight updates on the shared model, which in turn would influence the total time for FL. A researcher interested in doing such an experiment to evaluate the effect of compute heterogeneity on convergence time could simply implement and run Flower Clients on the platform of interest and execute them with the Flower framework to obtain the total FL training time and convergence properties.

In Figure 9(a), we show the result of synchronously training a ResNet50 He et al. (2016) model on the CIFAR-10 dataset with 10 clients. First, we only use GPU-enabled machines as the FL clients and obtain the total training time for 60 rounds of FL; for example, it takes around 270 minutes to collaboratively train the model on GPU (Nvidia V100) machines. Next, we add just one CPU-only machine to the client pool, i.e., the model is now trained with 9 GPU-enabled clients and 1 CPU-only client. We observe that the training time increases to 970 minutes (3.5x) due to the computational bottleneck of the CPU-only machine. While this finding is not surprising, it can certainly assist a researcher or a developer to appropriately schedule their FL workloads.
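The reason a single slow machine dominates is that a synchronous round cannot finish before its slowest client does. The sketch below makes this explicit using per-round times derived from the totals reported above (270 and 970 minutes over 60 rounds); attributing the whole round duration to one client's compute time is a simplifying assumption, and round_minutes is a hypothetical helper.

```python
from typing import List


def round_minutes(client_minutes: List[float]) -> float:
    """In synchronous FL the server waits for the slowest selected client."""
    return max(client_minutes)


ROUNDS = 60
# Per-round times implied by the reported totals; treating them as
# per-client compute times is a simplification for this sketch.
gpu_round = 270 / ROUNDS
cpu_round = 970 / ROUNDS

all_gpu = ROUNDS * round_minutes([gpu_round] * 10)
one_cpu = ROUNDS * round_minutes([gpu_round] * 9 + [cpu_round])
slowdown = one_cpu / all_gpu
```

Under this model nine fast clients sit idle for most of each round, which is why scheduling decisions matter in heterogeneous pools.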

Effect of Network Bandwidth. The network bandwidth on the client end is another key factor that influences the training time of FL systems. To study this aspect, Flower clients could either be implemented on real mobile devices or on a server which simulates different network bandwidths. As an example, we show in Figure 9(b) the effect of various network bandwidths on the training time (60 rounds) of the CIFAR-10 FL workload. We simulate four bandwidths for the clients: 1 Gbps (cloud-scale clients), 100 Mbps (a high-speed 4G mobile client), 30 Mbps (an average-speed 4G mobile client) and 20 Mbps (a slow 4G mobile client). These speeds correspond to the real-world 4G speeds in different parts of the world Test (2020). Figure 9(b) shows that FL training time increases from 200 minutes (on cloud-based clients) to above 430 minutes (on average 4G speeds).

Federated Learning at Scale. Finally, we demonstrate that Flower can efficiently handle the large amounts of data that are typically transferred in a FL training process. From Figure 10, it can be observed that when we train a ResNet50 model on the CIFAR-10 dataset, as much as 20GB of data (i.e., model parameters) is exchanged between the server and clients during each round of FL, when using a sampling rate of 1.0 (i.e., using all available clients for FL). The underlying gRPC protocol used in Flower hides the implementation complexities of data transfer, and enables researchers to focus on the algorithmic or systems related challenges of FL.
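A back-of-the-envelope model relates the sampling rate to the data exchanged per round and gives a lower bound on per-client transfer time at the four simulated bandwidths. The sketch assumes that every sampled client downloads and uploads one full copy of the model per round and that units are decimal; under these assumptions, 20GB over 100 clients implies roughly 0.1 GB per model copy. The function names are hypothetical, and the estimate ignores latency, protocol overhead, and compute time, so it is not expected to reproduce the measured training times.

```python
def per_round_gb(model_gb: float, num_clients: int, sampling_rate: float) -> float:
    """Data per round if each sampled client downloads and uploads the model."""
    return 2 * model_gb * num_clients * sampling_rate


def transfer_seconds(gigabytes: float, mbps: float) -> float:
    """Lower bound on transfer time, ignoring latency and protocol overhead."""
    return gigabytes * 8_000 / mbps  # 1 GB = 8,000 megabits (decimal units)


# Assumption: 20 GB per round over 100 clients, i.e. ~0.1 GB per model copy.
MODEL_GB = 0.1

full = per_round_gb(MODEL_GB, num_clients=100, sampling_rate=1.0)
tenth = per_round_gb(MODEL_GB, num_clients=100, sampling_rate=0.1)

# Per-client time to move one download plus one upload at each bandwidth.
per_client_seconds = {
    mbps: transfer_seconds(2 * MODEL_GB, mbps) for mbps in (1000, 100, 30, 20)
}
```

The per-round volume grows linearly with the sampling rate, while the per-client transfer time grows inversely with bandwidth.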

(a) Fashion-MNIST
(b) CIFAR-10
Figure 9: (a) Effect of computation heterogeneity of the clients on FL training time. Even a single client with lower compute capabilities can lead to a significant increase in the total convergence time. (b) Effect of communication bandwidth on FL training time. The training time increases significantly as we transition to 4G network speeds (20-100 Mbps).
Figure 10: Effect of different client sampling rates on data transferred per round of training (ResNet50, CIFAR-10, C=100). Flower handles as much as 20GB of data exchanged per round of training when we sample all available clients.

8 Limitations and Next Steps

In the following, we highlight limitations of the current Flower design and evaluation – along with discussing future areas of research.


Our existing evaluation has considered three popular datasets that stretch Flower in important dimensions. But we appreciate the need to expand this set to include certain larger-scale examples (e.g., ImageNet), as well as those based on different modalities (such as audio, text and tabular data) that are anticipated to be relevant to mobile devices. Alongside the expansion to more datasets, we also understand the value in testing Flower under a wider variety of neural architectures. Currently our Flower evaluation focuses only on CNN- and DNN-based architectures. On both these counts (datasets and architectures), however, we are not aware of any reason to expect surprising empirical results under Flower, as it builds upon existing ML frameworks and supports existing FL algorithms, for which prior results in the literature already show such breadth.

Libraries for efficient training on mobile devices are still in a nascent stage. But for any FL solution it is clearly a critical ingredient. By design, Flower can leverage any ML framework (e.g., TensorFlow or PyTorch) which maximizes its ability to use existing training pipelines. However, it also means Flower inherits the limitations of these frameworks that currently offer very limited support for on-device training (unlike the extensive solutions for on-device inference). It is anticipated that ML frameworks will address this short-coming within the next 12 months. Furthermore, Flower already includes expanded on-device training via the integration of TFLite model personalization routines (described in §4) which can be treated as a proof-of-concept for future support. Such restrictions may prevent certain models out-of-the-box from being easily deployed, largely due to the memory requirements of the backward pass. However because of the rich set of APIs (e.g., Java/Android) it still remains possible for Flower users to implement solutions that circumvent such barriers.

Looking Ahead. The most exciting and timely next step for Flower will be to examine a variety of FL algorithms and results at a much larger scale and a much greater level of heterogeneity than has previously been possible. To date, FL approaches in the literature are rarely evaluated with large numbers of client participants (such as in the thousands), and are virtually never tested under a pool of different mobile devices, as would be the norm for FL systems targeting mobile platforms. Similarly, evaluations also completely neglect the impact of diverse wireless conditions as devices collaborate. As a result, we plan to use Flower to revisit a number of key FL results and test if these results hold up under more realistic conditions. Performing such experiments becomes much more feasible under Flower, and we anticipate this will uncover many situations in which existing FL solutions do not perform as expected.

Because Flower makes it relatively easy to federate existing training pipelines, we will also immediately begin to test a diversity of new application areas to see how they behave with the relaxed non-centralized assumptions of FL. Example applications we expect to be suitable are those that would significantly benefit from personalization and adaptation to distinctive deployment environments in which they fine-tune themselves. One example of this is device microphone adaptation for improved audio and speech modeling Mathur et al. (2019).

Complementing conventional supervised applications, we also expect Flower to be indispensable in the exploration of the rapidly maturing area of unsupervised, semi-supervised and self-learning Xie et al. (2019b). FL using supervised methods is often not practical simply because it is unnatural to acquire labeled data from users. In contrast, devices have access to virtually unlimited amounts of unlabeled data. Furthermore, these learning approaches significantly increase the amount of data to be trained upon, as unlabeled data is much more prevalent, and so benefit from FL's ability to distribute the training computation.

9 Conclusion

We have presented in this paper Flower, a new framework for FL that is specifically designed for application to mobile devices and the wireless links that connect these devices. Although Flower is broadly useful across a range of FL settings (such as fixed position homogeneous clients), we believe it will be a true game-changer for adoption and innovation within FL for mobile. Flower helps researchers and practitioners devise new FL algorithmic and system solutions and prototype end-to-end FL applications; perhaps most importantly, it offers a solid foundation from which methods can be tested and compared at scale, under device heterogeneity and a diversity of wireless connections.