The Internet of Federated Things (IoFT): A Vision for the Future and In-depth Survey of Data-driven Approaches for Federated Learning

by Raed Kontar, et al.
University of Michigan

The Internet of Things (IoT) is on the verge of a major paradigm shift. In the IoT system of the future, IoFT, the cloud will be substituted by the crowd where model training is brought to the edge, allowing IoT devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift was set into motion by the tremendous increase in computational power on IoT devices and the recent advances in decentralized and privacy-preserving model training, coined as federated learning (FL). This article provides a vision for IoFT and a systematic overview of current efforts towards realizing this vision. Specifically, we first introduce the defining characteristics of IoFT and discuss FL data-driven approaches, opportunities, and challenges that allow decentralized inference within three dimensions: (i) a global model that maximizes utility across all IoT devices, (ii) a personalized model that borrows strengths across all devices yet retains its own model, (iii) a meta-learning model that quickly adapts to new devices or learning tasks. We end by describing the vision and challenges of IoFT in reshaping different industries through the lens of domain experts. Those industries include manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.




I Introduction

I-A Preamble

At the early stages of the COVID-19 pandemic, companies that mass-produce personal protective equipment (PPE) required long ramp-up times to fulfill the urgent demand [demandsurge, 3drapidresponse]. The ramp-up time took longer than expected as supply chains across the globe were critically disrupted, with entire countries in lockdown and essential workers succumbing to the virus [fordcorona]. Realizing this, many citizens and small businesses tried to bridge the supply gap using readily available, low-cost 3D printers [civilian3d, 12yearold]. This attempt at so-called massively distributed manufacturing [MDM] helped fill PPE production gaps to some extent [civilian3d, 12yearold]. However, it also revealed critical impediments to realizing massively distributed manufacturing in terms of standardizing production requirements, guaranteeing quality and reliability, and attaining high production efficiencies that can rival those of mass production [MDM]. For example, a large percentage of parts printed by citizens did not meet the quality requirements [fillgap, faultymasks]. Even when following standard 3D printing guidelines, several prints failed [defectivemasks] while others experienced recurrent defects due to the use of models or methods that did not account for the specific environment in which the 3D printer was operating [salmi20203d]. On the other hand, citizens who succeeded struggled to effectively broadcast their improved models or methods to other users to help improve quality across the network of manufacturers [cant3dprint].

Now imagine an alternative future based on a cyber-physical operating system for massively distributed manufacturing. All 3D printers are IoT-enabled through wifi and smart sensors. In addition, printers now have computation power through AI chips (many 3D printers nowadays have such capabilities, ex: Raspberry Pi’s [baumann2017additive, okwudire2018low]). The printers collaboratively learn a model for 3D printing PPE accurately with the help of a central orchestrator, guiding the production to the desired quality level. To preserve privacy and intellectual property and allow for massive parallelization, raw data from each 3D printer is never shared with the central server; instead, printers exploit their compute resources at the edge by running small local computations and only sharing the minimal information needed to learn the model. This model, despite having a global state, is personalized to form a local model that accounts for individual-level external factors affecting each 3D printer.

In this alternative reality, responders can 3D print PPE at the desired quality level with little or no defects. Responders act quickly due to the massively parallelized efforts from many 3D printers and the effective utilization of network bandwidth. In addition, with their personalized 3D printing models, the responders are able to push 3D printers at faster speeds to shorten printing time while maintaining quality [UMtech, duan2018limited, okwudire2018low]. Accordingly, the PPE supply gap is successfully filled until mass production ramps up.

In this future, not only manufacturing benefits. Take healthcare wearable devices as an example. Compute power on such devices has increased immensely over the years. Now, personal data need not be uploaded to a central cloud system to learn an anomaly detection model for health signals. Instead, the “cloud” is replaced by the “crowd”, where wearable devices store the necessary data, perform local computations, and send only the needed model updates to the central authority. This decouples the ability to learn the model from storing data in the cloud by bringing training to the device as well, where a model can be learned across thousands to millions of wearable devices in geographically dispersed locations.

Let us now switch paradigms and replace smart devices with “smart” institutes. Different medical institutions can join efforts and collaboratively learn diagnostic models without directly sharing their electronic health records, as imposed by the Health Insurance Portability and Accountability Act (HIPAA). Now, diagnostic models can leverage largely diverse datasets and promote fairness through a decentralized learning framework that mitigates the privacy risks and costs associated with centralized modeling. Learning can be done across institutes and individuals at multiple scales and in areas where this has not been possible or allowed before.

The future described above is not far off. It has already been set into action as the immediate yet bold next step for the Internet of Things (IoT). It is the culmination of Industry 4.0: a culmination of advances over the past two decades in interdisciplinary fields ranging from data science, edge computing, machine learning, operations research, optimization, data acquisition technologies, and physics-guided modeling to privacy, amongst many others.

In this article, we term this future of IoT as the Internet of Federated Things (IoFT). The term “federated” refers to some level of internal autonomy of IoT devices and is inspired by the explosive interest during the past two years in Federated Learning (FL): an approach that allows decentralized and privacy-preserving training of models [fedavg]. With the help of FL, the decentralized paradigm in IoFT exploits edge compute resources in order to enable devices to collaboratively extract knowledge and build smart analytics/models while keeping their personal data stored locally. This paradigm shift not only reduces privacy concerns but also sets forth many intrinsic advantages including cost efficiency, diversity, and reduced computation, amongst many others to be detailed in the following sections.

I-B Purpose and Uniqueness

This paper is a joint effort of researchers across a wide variety of expertise to address the three questions below:

  1. What are the defining characteristics of IoFT?

  2. What are key recent advances and potential data-driven methods in IoFT that allow learning in one of the three dimensions stated below? What modeling, optimization, and statistical challenges do they face? And what are potential promising solutions?

    • A Global model: that maximizes utility across all devices. The global model aims at capturing the commonalities and intrinsic relatedness across data from all devices to improve prediction and learning accuracy.

    • A Personalized model: that tries to personalize and adapt the global model to data and external conditions from each device. This embodies the principle of multi-task learning [pan2009survey], where each device retains its own model while borrowing strength across all IoFT devices.

    • A Meta-learning model: that learns a global model which can quickly adapt to a new task with only a small number of training samples and learning steps. This embodies the principle of “learning to learn fast” [snell2017prototypical], where the goal of the global model is not to perform well on all tasks in expectation, but rather to find a good initialization that can quickly adapt to a specific task.

  3. How will IoFT shape different industries and what are the domain specific challenges it faces for it to become the standard practice? Through the lens of domain experts, we shed light on the following sectors: manufacturing, transportation, energy, healthcare, quality & reliability, business and computing.
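In common FL notation (our sketch, not the paper's; $F_k$ denotes the local objective of device $k$, $p_k$ its weight, $K$ the number of devices, and $\alpha$, $\lambda$ are step-size and regularization hyperparameters), the three dimensions above can be summarized as:

```latex
% Global model: one shared parameter w for all K devices
\min_{w} \; \sum_{k=1}^{K} p_k F_k(w)

% Personalized models: each device keeps its own w_k, with a
% coupling term R that borrows strength across devices
\min_{w_1,\dots,w_K} \; \sum_{k=1}^{K} F_k(w_k) + \lambda \, \mathcal{R}(w_1,\dots,w_K)

% Meta-learning (MAML-style): find an initialization w that
% performs well after one local adaptation step of size \alpha
\min_{w} \; \sum_{k=1}^{K} p_k F_k\big(w - \alpha \nabla F_k(w)\big)
```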

Besides defining the central characteristics of IoFT, our paper’s focus is twofold. The first is data-driven modeling, where we categorize FL approaches in IoFT into learning a global, personalized, or meta-learning model and then provide an in-depth analysis of modeling techniques, recent advances, possible alternatives, and statistical/optimization challenges. The second is a vision of IoFT’s potential use cases, application-specific models, and obstacles within different application domains. Our overarching goal is to encourage researchers across different industries to explore the transformation from IoT to IoFT so that the critical societal impacts brought by this emerging technology can be fully realized.

We note here that some excellent surveys on FL have been recently released. Most notably, lim2020federated address FL challenges in mobile edge networks with a focus on communication cost, privacy, and security; niknam2020federated discuss FL applications in wireless communications, especially under 5G networks; li2020federated provide a thorough overview of implementation challenges in FL; yang2019federated categorize different architectures for FL; rahman2020survey discuss the evolution of deployment architectures with an in-depth discussion of privacy and security, while aledhari2020federated highlight the protocols and platforms needed for such architectures; and kairouz2019advances study open problems in FL and recent initiatives while providing a remarkable survey of privacy-preserving mechanisms. Along this line, lyu2020threats highlight threats and major attacks in FL. While our focus is on data-driven modeling for IoFT and how various application fields will be affected by the shift from IoT to IoFT, the surveys above serve as excellent complementary work for a bird’s-eye view of FL and hence IoFT.

The remainder of this paper is organized as follows. Sec. II highlights the past and present features of IoT-enabled systems leading to IoFT. Secs. III - V provide data-driven modeling approaches for learning a global, personalized, and meta-learning model, along with their challenges and promising solutions. Sec. VI then poses central statistical and optimization open problems in IoFT, from both a theoretical and an applied perspective. Finally, Sec. VII provides a vision for IoFT within manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

Throughout this paper, we use IoFT to denote the future IoT system we envision, while FL denotes the underlying data analytics approach for data-driven model learning within IoFT. Also, edge device, local device, node, user, or client are used interchangeably to denote the end-user based on the problem context.

I-C IoFT Website and Central Directory

While exploring data-driven modeling approaches to FL in IoFT, it became clear that real-life datasets (in engineering, health sciences, etc.) are pressingly needed to fully explore the disruptive potential of IoFT. While a few datasets already exist, they are based on artificial examples, and the few non-artificial datasets are mostly focused on mobile applications. However, for IoFT to become a norm in different industries, real-life datasets with the defining features of the underlying system are needed to unveil the challenges and opportunities faced within different domains. Only with a deep understanding of the underlying system and domain can one formulate the right analytics. Towards this end, this paper features a supplementary website managed by the University of Michigan. The website will serve as a central directory for IoFT-based datasets and will feature brief descriptions of each dataset, categorized by its respective field, with a link to the repository (research lab website, GitHub account, papers, etc.) where the data is contained. Our hope is to provide a means for model validation within different domains, encourage researchers to develop real-life datasets for IoFT, and help with the outreach and visibility of their datasets and corresponding papers.



II Internet of Things: The Past, Present and Future

Fig. 1: Key components of an IoT enabled system

IoT-enabled systems possess three defining characteristics: tangible physical components that comprise the system, connectivity among components that enable data acquisition and sharing, and data analytics and decision-making capabilities that transform a merely “connected” system into a “smart and connected” system. These defining features of IoT enabled systems [porter2014smart, analytics2016age, chen2015data] are shown in Fig. 1. IoT has brought broad disruptive societal impacts, particularly on economic competitiveness, quality of life, public health, and essential infrastructure [madakam2015internet]. Companies around the globe have invested heavily in IoT, including: Google’s Cloud IoT [googleiot], Samsung’s Active wearable device [sumsung], Amazon’s Webservices solutions [awsamazon], Rockwell’s Connected Enterprise [rockwell], Welbilt’s Smart Home Appliances, to name a few. The value at stake is more than 15 trillion dollars, a number expected to triple in the next decade [atzori2010internet].

The essential feature of an IoT system is that data from multiple similar units and across multiple components within the system are collected during their operation, often in real-time. Since we have observations from potentially a large number of similar units, we can compare their operations, share information, and extract common knowledge to enable accurate prediction and control. One can argue that such a notion of IoT dates back a long time before the Industrial Revolution, to the time when artisans producing crafts in geographically close locations used to gather to share knowledge and perfect/standardize the quality of their crafted product [srai2016distributed]. A lot has changed since then.

II-A IoT: The Present

Fig. 2: Present Day IoT system

The industrial revolution set in motion rapid advances in connectivity, automation, data science, and cloud-based systems, among many others [manufcaturingtrends, predictivemaintanence]. The average price of an IoT sensor dropped to $0.48, and wide-area communication became readily available, with around 36.13 billion connected IoT devices in 2018 [sensorprice]. Distributed computing allowed handling larger datasets than was previously thought possible, and cloud-based solutions for data storage and processing became widely available for commercial use (ex: Amazon’s AWS [amazonaws] or Microsoft’s Azure [azurevideo]). This ushered in the present-day era of Industry 4.0, characterized by IoT-enabled systems [atzori2010internet]. In this present era, a typical IoT-enabled system structure is shown in Fig. 2. Take, for example, GM’s OnStar®  or Ford’s SYNC®  teleservice systems [onstar, GMIOT, sync]. Vehicles enrolled in this service have their data, in the form of condition monitoring (CM) signals, uploaded to the cloud regularly. The cloud then acts as a back-office or data center that processes the data to keep drivers informed about the health of their vehicle. In the cloud, GM and Ford train models that can monitor and predict maintenance needs, amongst others. The data is also used to cross-validate the behavior of their learned models for continuous improvement. When the need arises, service alerts are then sent to drivers.

Much like other IoT giants such as Google, Amazon and Facebook, GM and Ford have long adopted this centralized approach towards IoT: (i) gigantic amounts of data are uploaded and stored in the cloud (ii) models (such as predictive maintenance, diagnostics, text prediction) are trained in these data centers (iii) the models are then deployed to the edge devices. Needless to say, the need to upload large amounts of data to the cloud raises privacy concerns, incurs high costs, and benefits large enterprises capable of building their own private cloud infrastructures at the expense of smaller entities.

In such centralized systems, distributed learning is often implemented to alleviate the huge computational burden via parallelization. The clients are then computing nodes within the centralized framework and can access any part of the dataset, as data partitions can be continuously adjusted. In contrast, as described in the following sections, data in IoFT resides at the edge and is not centrally stored. As a result, data partitions are fixed and cannot be changed, shuffled, or randomized.

II-B IoT: The Future

With the tremendous increase in computational power on edge devices, IoT is moving from the cloud/datacenter to the edge device, hence the aforementioned notion of substituting the “cloud” by the “crowd”. In this IoT system of the future (IoFT), devices collaboratively extract knowledge from each other and achieve the “smart” component of IoT, often with the orchestration of a central server, while keeping their personal data stored locally. This paradigm shift is based on one simple yet powerful idea: with the availability of computing resources at the edge, clients can execute small computations locally instead of learning models on the cloud, and then share only the minimum information needed to learn the model. As a result, IoFT decouples the ability to do analytics from storing data in the cloud by bringing training to the edge device as well. The underlying premise is that IoFT devices have computational (ex: AI chips) and communication (ex: wifi) capabilities.

Fig. 3: IoFT: IoT system of the Future

Let us start with a simple example: assume the central orchestrator in Fig. 3 wants to learn the mean $\mu$ of a single feature $x$ over all clients, and that clients have some computational capabilities. To calculate $\mu$, client $k$ only needs to run a small calculation to compute its own mean $\bar{x}_k$ (along with its sample size $n_k$) and share it, rather than sharing its entire feature vector $\boldsymbol{x}_k$. Indeed, $(\bar{x}_k, n_k)$ is a sufficient statistic to learn $\mu$.
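The federated mean can be sketched in a few lines (our own illustration; function names are invented), where clients expose only their local mean and sample size and the orchestrator forms the size-weighted average:

```python
import numpy as np

def client_update(x_k):
    # Each client computes only its local mean and sample size;
    # the raw feature vector x_k never leaves the device.
    return float(np.mean(x_k)), len(x_k)

def orchestrator_aggregate(updates):
    # A size-weighted average of local means recovers the global
    # mean exactly, without ever pooling the raw data.
    total = sum(n_k for _, n_k in updates)
    return sum(mean_k * n_k for mean_k, n_k in updates) / total

# Three clients with private feature vectors of different sizes
clients = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0]), np.array([6.0])]
updates = [client_update(x) for x in clients]
mu = orchestrator_aggregate(updates)  # equals the mean of the pooled data
```

Note that an unweighted average of the local means would be wrong whenever sample sizes differ, which is why the sample size is part of the focused update.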

In reality, models are often more complicated and require multiple communications between the central orchestrator and clients. For instance, and without loss of generality, assume that IoFT devices cooperate to learn a deep learning model by borrowing strength from each other, rather than using their own knowledge in isolation. In the decentralized realm of IoFT, model learning is often administered by a central orchestrator and follows the cycle shown in Fig. 3. (i) The orchestrator (i.e., the central server) selects a set of IoFT devices meeting certain eligibility requirements and broadcasts an initial model to the selected clients. This model contains the neural network (NN) architecture, initial weights, and a training program. (ii) IoFT devices perform local computations by executing the program on their local data, and each device reports its focused update to the orchestrator. Here, the program can be running stochastic gradient descent (SGD) on local data, and the focused update can be updated weights or a gradient. It is worth noting that a client might choose to encrypt its focused update or add noise to it for enhanced privacy at this stage. (iii) The central orchestrator collects the focused updates from clients and aggregates them to update the global model. (iv) This procedure is then iterated over several rounds until a stopping criterion, such as validation accuracy, is met. Through this process, the global model can account for knowledge from all IoFT clients, and each client can indirectly make use of the knowledge of other clients. Finally, the learned global model goes through a testing phase, such as A/B testing on held-out devices and a staged rollout to a gradually increasing number of devices.
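The four-step cycle can be sketched as a minimal FedAvg-style loop. The sketch below is our own illustration (all names invented), with a linear model standing in for the NN and full-batch gradient steps standing in for SGD; it follows the same broadcast, local-update, and weighted-aggregation pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, epochs=5):
    # Step (ii): the client runs a few gradient steps on its own
    # data (full-batch here for simplicity) and reports only the
    # updated weights; the raw data never leaves the device.
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Synthetic private datasets on three devices, all generated
# from the same underlying linear model w_true.
w_true = np.array([1.0, -2.0])
devices = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    devices.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

# Step (i): the orchestrator broadcasts an initial model.
w_global = np.zeros(2)
for _ in range(20):
    # Steps (ii)-(iii): collect focused updates and aggregate them,
    # weighting each client by its local sample size.
    updates = [local_update(w_global, X, y) for X, y in devices]
    sizes = [len(y) for _, y in devices]
    w_global = np.average(updates, axis=0, weights=sizes)
    # Step (iv): in practice, iterate until a validation criterion is met.
```

After a few rounds, the aggregated model approaches the data-generating weights even though no device ever shared its raw data.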

This decentralized paradigm shift, made possible by compute resources at the edge, sets forth many intrinsic advantages that include:

  • Privacy: By bringing training to the edge device, users no longer have to share their valuable information, instead, it is kept local and never shared.

  • Autonomy: IoFT devices can be under independent control and opt-out of the collaborative training process at any time. Yet, with enhanced privacy in IoFT, clients will be more inclined to collaborate and build better models.

  • Computation: As the number of IoT devices skyrockets, the computational and storage needs accumulated from these devices (say smartphones) are far beyond what any data center or cloud computing system can handle [FLfuture]. Instead, by exploiting compute and storage capacity at the edge, massive parallelization becomes a reality [singh2019detailed, huo2018training].

  • Cost: Focused updates embody the principle of data minimization and contain the minimum information needed for a specific learning task. As a result, less information is transmitted to the orchestrator, which reduces communication costs and efficiently utilizes network bandwidth. Also, compute power at the edge device is now utilized. Hence storage and computational needs of the orchestrator are minimal. This is in contrast to distributed systems where massive utilization and synchronization of GPU and CPU power in the cloud is needed.

  • Fast Alerts and Decisions: In IoFT, upon deployment of the final model to clients, real-time decisions or service alerts are achieved locally at the edge. In contrast, cloud-based systems incur a lag in deployment, as decisions made in the cloud need to be transmitted to the clients (as shown in Fig. 2).

  • Minimal Infrastructure: With the increase in computing power of IoT devices and the gradual market penetration of AI chips [aichipsbillion], minimal hardware is required to achieve the transition to IoFT.

  • Fast encryption: Encryption of focused updates can be done readily and with better guarantees compared to encrypting entire datasets.

  • Resilience: Edge devices are resilient to failures at the orchestrator level due to the existence of a local model.

  • Diversity and Fairness: IoFT allows integrating information across uniquely diverse datasets, some of which were previously restricted from being shared (recall the medical institutes example). This diversity and the ability to learn across geographically dispersed locations promote fairness by combining data across boundaries [brisimi2018federated, chang2018distributed].

Having recently realized IoFT’s disruptive potential relative to traditional IoT, industries are eagerly trying to exploit it in their operating systems and production. However, these efforts are in their infancy, awaiting broad implementation. Google pioneered some of the IoFT applications in their mobile keyboard “Gboard” [hard2018federated, chen2019federated, yang2018applied, ramaswamy2019federated] and Android messaging [googlesupport] to improve next-word prediction while preserving privacy. Additionally, they introduced a decentralized framework to update Android models on their Pixel phones [mcmahan2017communication]. In this framework, each Android phone updates its model parameters locally and sends the updated parameters to the Android cloud, which trains its central model from the aggregated parameters. BigTech giants have since started to catch up and utilize FL in their systems. Most notably, Apple adopted FL in their QuickType keyboard, “Siri”, and privacy protection protocols [bhowmick2018protection, appleprivacy], and Microsoft did the same with their devices’ telemetry data [ding2017collecting]. Further, FL has seen some application in optimizing mobile edge computing and communication [wang2019edge, lim2020federated], computational offloading [wang2019edge], and reliable network communication [samarakoon2019distributed].

Most of the current IoFT applications are present within the technology industry, specifically tailored for mobile applications and a few others. However, IoFT is expected to infiltrate all industries that benefit from knowledge sharing, data analytics, and decision-making. Indeed, the gradual use of FL in the technology industry has set in motion a timid yet insuppressible momentum for IoFT applications in other sectors. For instance, in the healthcare field, FL is lately being used as a medium of collaboration between hospitals to share patients’ electronic records and other medical data [brisimi2018federated, nvidiaclara, huang2020loadaboost, futurefl]. In Sec. VII, we present a deeper vision of how IoFT and FL will shape the future of various industries, including manufacturing, transportation, energy, healthcare, quality & reliability, business, and computing.

II-B1 Challenges

IoFT as an emerging technology poses significant intellectual challenges. Interdisciplinary skills across diverse fields are needed to bring the great promise of IoFT into reality. Below we highlight some of the challenges and shed light on their uniqueness compared to centralized IoT systems. This is by no means an exhaustive list as IoFT challenges vary widely across different application sectors as highlighted in Sec. VII.

  • Statistical Heterogeneity: IoFT devices often have local datasets that differ in both size and distribution. Recent papers have shown the unfortunate wide gap in the global model’s performance across different devices due to their heterogeneity in distribution [zhao2018federated, wang2019federated] and size [duan2019astraea]. For instance, IoFT devices may have (i) unique outputs, labels, or features observed only within certain devices, or (ii) similar outputs but dissimilar features (i.e., feature distribution skew), or vice versa. This statistical heterogeneity directly affects IoFT’s ability to reach out to many devices operating under different external factors and subject to geographic, cultural, and socio-economic differences. In contrast, traditional IoT systems offer a key, yet often subtle, fundamental advantage: the ability to handle non-independent and identically distributed (non-i.i.d.) data by shuffling/randomizing the raw data collected in the cloud before learning, be it through distributed computing or learning on a single machine. This is not a luxury that IoFT possesses; rather, it is a price paid for enhanced privacy.

  • Personalization and Negative Transfer: In the IoFT process described in Sec. II-B, all clients collaborate to learn a global model; “one model that fits all”. This integrative analysis of multiple clients implicitly assumes that the local datasets share some commonalities. However, with heterogeneity, negative transfer of knowledge may occur, which leads to decreased performance relative to learning tasks separately [kontar2020minimizing, li2020negative]. One possible solution is personalized modeling, where global models are adapted for local clients (refer to Sec. IV for data-driven personalization approaches). Indeed, personalization may be the fundamental tool to overcome the heterogeneity barrier intrinsic to IoFT. Yet developing validation techniques to identify and minimize negative transfer remains a critical problem in FL.

  • Communication Efficiency and Resource Management: Communication can be a critical bottleneck for IoFT, especially with a large number of participants. Unlike cloud datacenters, edge devices in IoFT often have limited communication bandwidth with unstable and slow connection [konevcny2016federated]. As a result, IoFT devices are often unreliable and can drop out due to battery loss or connectivity loss. Besides that, devices themselves are heterogeneous in their computational capabilities and memory budgets. Therefore, resource management in IoFT is of critical importance. Methods such as compressed communication [tang2019doublesqueeze, koloskova2019decentralized], client selection [xu2020client] and optimal trade-offs between convergence rates, accuracy, energy consumption, latency and communications [nguyen2020resource, reisizadeh2019robust] are of high future relevance. Another possible approach is through incentive design to encourage reliable clients to participate in the training process and minimize dropout rates [kang2019incentive].

  • Privacy: Privacy remains one of the key challenges and motivators behind IoFT. IoFT systems are prone to poisoning attacks on both edge devices and the central server. Targeted data perturbations [bagdasaryan2020backdoor, chen2017targeted, liu2017trojaning] to specific labels/instances or corrupting a large number of devices (i.e., fake devices) can immensely reduce accuracy. Further, a malicious server might be able to reconstruct raw data even through a focused update. As a result, secure computation, aggregation, and communication are needed in IoFT [beimel2019power, bittau2017prochlo]. So is adversarial data modeling to ensure robustness against corrupted data in case breaches are inevitable [madry2017towards].

  • Bias and Fairness: IoFT systems can raise bias and fairness concerns. For example, sampling reliable phones with a larger bandwidth (i.e., more expensive phones) can lead to models mostly representative of people with certain socioeconomic statuses. Further, it is often important to build models that are competitive over different groups or attributes. This becomes a bigger challenge if such sensitive attributes are not shared. Therefore, fair FL is an important challenge to tackle within IoFT [yue2021gifair, li2019fair].

  • Other Statistical and Optimization Challenges: We also refer readers to Sec. VI for both statistical and optimization challenges/opportunities and Sec. VII for domain-specific challenges in different sectors.
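To make the statistical heterogeneity challenge concrete: a common way FL benchmarks simulate label-distribution skew across devices is a Dirichlet partition of a centralized dataset. The sketch below is our own illustration (function name invented), not a method from this paper:

```python
import numpy as np

def dirichlet_label_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with label-distribution skew.

    Small alpha -> each client sees mostly a few labels (highly non-i.i.d.);
    large alpha -> a near-uniform label mix on every client (close to i.i.d.).
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Each client's share of class c is drawn from Dirichlet(alpha).
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = np.repeat(np.arange(3), 100)  # 3 classes, 100 samples per class
parts = dirichlet_label_partition(labels, n_clients=5, alpha=0.1)
```

With alpha = 0.1, most clients end up dominated by one or two labels, mimicking the feature/label skew described above; the cloud-side shuffling available to traditional IoT would erase exactly this skew.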

We note here that Secs. III, IV, and V shed light on data-driven modeling approaches (global, personalized, and meta-learning) aimed at tackling some of the challenges above. However, we exclude (i) privacy and communication efficiency, since there are excellent surveys focused mainly on these challenges (refer to Sec. I-B), and (ii) resource management, since literature in that area is still scarce.

II-B2 IoFT structures

The underlying structure and overall architecture of IoFT should be tailored to fit certain applications and overcome specific challenges. Current IoFT architectures are influenced by the data composition and the FL learning process. For instance, in the situation where multiple clients collaborate to learn a global model with the orchestration of a central server (as seen in Fig. 3), it is implicitly assumed that local datasets share a common feature space but have a different sample space, i.e. different clients. Such data composition is technically referred to as Horizontally partitioned data [yang2019federated]. A typical FL system architecture for Horizontally partitioned data (also known as Horizontal FL (HFL)) would exploit the availability of a common feature space. Notably, horizontally partitioned data are very common across different applications, making HFL the common practice in IoFT [hard2018federated, chen2019federated, yang2019federated, yang2018applied, ramaswamy2019federated].

Fig. 4: IoFT with vertically partitioned data

However, not all datasets share a common feature space, which naturally poses the need for a different architecture. Vertically partitioned data, which refers to datasets sharing a different feature space but a similar sample space, is another familiar theme in various applications. Such datasets mostly appear in scenarios that involve joint collaboration between large enterprises. Consider as an example two different health institutes, each owning different health records yet sharing the same patients. Suppose you wish to build a predictive model for a patient’s health using a complete portfolio of medical records from both healthcare institutes. Unlike HFL where each client trains a local model using their own data, training a local model requires data owned by other clients since each client holds a disjoint subset of the data. Accordingly, a typical FL system architecture for Vertically partitioned data (also known as Vertical FL (VFL) [yang2019federated]) is designed to introduce secure communication channels between clients to share the needed training data, while preserving privacy and preventing data leakage from one provider to another. For this, the VFL architecture may involve a trusted, neutral party to orchestrate the federation. The orchestrator aligns and aggregates data from participants to allow for collaborative model building using the joint data, see Fig. 4. Nonetheless, VFL remains less explored than HFL, and most of the currently developed structures can only handle two participants [nock2018entity, hardy2017private, yang2019parallel]. More challenging scenarios occur when clients have datasets that share only a partial overlap in the feature and sample spaces. FL in these cases can leverage transfer learning techniques to allow for collaborative model training [pan2010survey, yang2019federated].

Fig. 5: Peer-to-peer network

The structures described above are designed to handle challenges arising from dataset partitioning. However, different challenges require new structures. One notable commonality of the above structures is the usage of a central orchestrator that coordinates the FL process in IoFT. The caveat, however, is that a central orchestrator is a single point of failure and can lead to a communication bottleneck with a large number of clients [lian2017can]. Accordingly, fully decentralized solutions can be explored to nullify the dependency on a central orchestrator. In fully decentralized architectures, communication with the central server is replaced by peer-to-peer communication, as seen in Fig. 5. In this setting, no central location receives model updates/data or maintains a global model over all clients; instead, clients communicate with each other to reach the desired solutions. Notably, such peer-to-peer networks are better able to achieve scalability in situations with a large number of clients, thanks to their fully decentralized mechanism [kermarrec2015want]; the current success of blockchains is a clear demonstration of this. Further, they offer additional security guarantees as it is difficult to observe the system’s full state [bellet2018personalized]. However, such architectures raise performance concerns. Some clients could be malicious in peer-to-peer networks and potentially corrupt the network (e.g., violate data privacy). Others could be unreliable and thus disrupt the communication channels. Consequently, keeping a level of trust in a central authority within a peer-to-peer architecture can be beneficial for regulating the network’s protocols.

The structures discussed here are by no means comprehensive, and several others exist in the literature (see [yang2019federated, kairouz2019advances, rahman2020survey, li2020federated]). However, the common denominator here is that IoFT structures spawn from challenges of FL applicability to different scenarios. As IoFT is poised to infiltrate more and more fields, domain-specific challenges will dictate its architecture.

III Learning a Global Model

Hereon, we discuss data-driven approaches for FL within IoFT. As aforementioned, we classify model building in FL into three categories: (i) a global model, (ii) a personalized model, and (iii) a meta-learning model. We then provide an in-depth overview of data-driven models, open challenges, and possible alternatives within these three categories.

As will become clear shortly, current FL techniques mostly focus on predictive modeling using deep learning and first-order optimization techniques, specifically stochastic gradient descent (SGD). This is understandable, as the immense data collected within IoFT often necessitates such an approach. Yet, as we discuss in the statistical/optimization perspective (Sec. VI) and applications (Sec. VII) sections, exploring FL beyond predictive models and deep learning is critical for its wide-scale implementation. Topics such as graphical models, correlated inference, zeroth- and second-order distributed optimization, validation & hypothesis testing, uncertainty quantification, design of experiments, Bayesian optimization, optimization under conflicting objectives (see Sec. VII-E), game theory, and reinforcement learning, amongst others, are yet to be explored in the IoFT realm.

III-A A General Framework for FL

As highlighted in Fig. 3, IoFT allows multiple clients to collaborate and learn a shared model while keeping their personal data stored locally. This shared model is referred to as the global model as it aims to maximize utility across all devices. One can view the global model as: “one model that fits all”, where the goal is to yield better performance in expectation across all clients relative to each client learning a separate model using its own data.

We start by constructing the objective function of a global model. Assume there are $N$ clients (or local IoFT devices) and each client $i$ has $n_i$ observations. The general objective of training a global model is to minimize the average of the objectives over all clients:

$\min_w F(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w),$   (1)

where $f_i$ is usually a risk function on client $i$. This risk function can be expressed as

$f_i(w) = \mathbb{E}_{\xi \sim \mathcal{P}_i}\left[\ell\big(h(\xi_x; w), \xi_y\big)\right],$

where $\mathcal{P}_i$ indicates the data distribution of the $i$-th client's data observations $\xi = (\xi_x, \xi_y)$, $h$ is the model to be learned parametrized by weights $w$, and $\ell$ is a loss function.

The risk function is usually approximated by the empirical risk given as $f_i(w) \approx \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\big(h(\xi_{x,j}; w), \xi_{y,j}\big)$. Therefore, learning a global model in FL aims at minimizing the average of the empirical risks over all clients. However, unlike centralized training, in IoFT client $i$ can only evaluate its own risk function $f_i$, and the central server does not have access to the data from the clients. Client and central server training are thus decoupled.

Given this setting, Algorithm 1 is a general “computation then aggregation” [zhang2020fedpd] framework for FL. In each communication round, a central orchestrator selects a subset of clients $S^t \subseteq [N]$ and broadcasts the global model information to the subset. Each client then updates the global model using its own local data. Afterwards, clients send their updated models back to the central orchestrator/server. The orchestrator aggregates and revises the global model based on the input from clients. The process repeats for several communication rounds until a stopping criterion, such as validation accuracy, is met. Note that we use $[N]$ to denote the set $\{1, \ldots, N\}$, $\mathcal{D}_i$ to denote client $i$'s dataset, and the superscript $t$ to represent the $t$-th communication round between the central server and selected clients, where $t \in \{1, \ldots, T\}$.

1:  Input: Client datasets $\mathcal{D}_i$, $i \in [N]$, initialization $w^0$ for the global model
2:  for $t = 1, 2, \ldots, T$ do
3:     Orchestrator selects a subset of clients $S^t \subseteq [N]$, broadcasts global model $w^{t-1}$, or a part of it, to clients in $S^t$.
4:     for each $i \in S^t$ do
5:        Clients update model parameters $w_i^t = \texttt{client\_update}(w^{t-1}, \mathcal{D}_i)$
6:        Clients send updated parameters $w_i^t$ to server.
7:     end for
8:     Orchestrator updates $w^t = \texttt{server\_update}(\{w_i^t\}_{i \in S^t})$
9:  end for
Algorithm 1 Framework for Learning a Global Model
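As a concrete illustration, the loop of Algorithm 1 can be sketched in a few lines of Python. This is a toy sketch, not a production FL stack: each client's risk is a 1-D quadratic $f_i(w) = (w - d_i)^2$, client_update runs a few gradient steps, and server_update plainly averages (all names and values are illustrative):

```python
import random

def client_update(w, d, lr=0.1, local_steps=5):
    """Local training stand-in: gradient steps on a 1-D quadratic risk
    f_i(w) = (w - d)^2, whose minimizer is the client's own data value d."""
    for _ in range(local_steps):
        w -= lr * 2.0 * (w - d)
    return w

def server_update(client_weights):
    """Aggregation stand-in: plain averaging of the returned client models."""
    return sum(client_weights) / len(client_weights)

def run_fl(client_data, rounds=50, clients_per_round=3, seed=0):
    rng = random.Random(seed)
    w_global = 0.0  # initialization w^0
    for _ in range(rounds):
        # the orchestrator selects a subset S^t and broadcasts w_global
        subset = rng.sample(range(len(client_data)), clients_per_round)
        # selected clients update locally and send their weights back
        updates = [client_update(w_global, client_data[i]) for i in subset]
        # the orchestrator aggregates the client updates
        w_global = server_update(updates)
    return w_global

# The global optimum of the averaged quadratic risks is the mean, 3.0;
# subsampling clients each round leaves some fluctuation around it.
w = run_fl([1.0, 2.0, 3.0, 4.0, 5.0])
```

The subsampling-induced fluctuation in the final iterate is precisely the client-selection issue revisited in Sec. III-D.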

One of the simplest FL algorithms is FedSGD [parallelsgd, fedsgd], a distributed version of SGD. FedSGD was initially used for distributed computing in a centralized regime. FedSGD partitions the data across multiple computing nodes. In every communication round, each node calculates the gradient from its local data using a single SGD step. The calculated weights are then averaged across all nodes. As a data-parallelization approach, FedSGD utilizes the computation power of several compute nodes instead of one. This approach accelerates vanilla SGD and has been widely used due to the growing size of datasets collected nowadays. Furthermore, since FedSGD only performs one step of SGD on a local node, averaging the updated weights is equivalent to averaging the gradients ($\eta$ denotes the step size):

$w^t = \frac{1}{N}\sum_{i=1}^{N} w_i^t = w^{t-1} - \frac{\eta}{N}\sum_{i=1}^{N} \nabla f_i(w^{t-1}).$
Despite being a viable option, traditional distributed optimization algorithms are often unsuitable in IoFT due to the large communication cost and the presence of heterogeneity. FedSGD transmits the gradient vector from one machine to the other after each single local optimization iterate. This issue is not critical in centralized distributed training when computation nodes are usually connected by large bandwidth infrastructure. However in IoFT, data lives on the edge device and not on a computing node. Communication with the central orchestrator at each gradient calculation is not feasible and may suffer immensely when the edge devices have limited communication bandwidth with unstable or slow connection.

To resolve this challenge, the seminal work of fedavg proposed a simple solution: FedAvg. The fundamental idea is that clients run multiple updates of the model parameters before passing the updated weights to the central orchestrator. Specifically, in FedAvg, clients update local models by running multiple steps (e.g., $\tau$ local steps) of SGD on their local objective $f_i$. Upon receiving the updated weights from clients, the server_update function simply calculates the average of the client models: $w^t = \frac{1}{|S^t|}\sum_{i \in S^t} w_i^t$. An illustration contrasting FedAvg and FedSGD is shown in Fig 6. Here one can also add flexibility by re-scaling the global update with a step size $\eta_g$: $w^t = w^{t-1} + \eta_g\big(\frac{1}{|S^t|}\sum_{i \in S^t} w_i^t - w^{t-1}\big)$.
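To contrast the two schemes, the toy sketch below (illustrative names; 1-D quadratic client risks) runs one gradient step per round for FedSGD versus several local steps per round for FedAvg. Both head toward the same global optimum, but FedAvg covers far more ground per communication round:

```python
def local_sgd(w, grad_fn, lr, steps):
    """Run `steps` SGD steps on one client's local objective."""
    for _ in range(steps):
        w -= lr * grad_fn(w)
    return w

def fed_round(w_global, grad_fns, lr, local_steps):
    """One communication round: every client runs local SGD from the global
    model, then the server averages. local_steps=1 is FedSGD; >1 is FedAvg."""
    updated = [local_sgd(w_global, g, lr, local_steps) for g in grad_fns]
    return sum(updated) / len(updated)

# Toy quadratic clients f_i(w) = (w - d_i)^2 with optima d_i; global optimum 2.0
grad_fns = [lambda w, d=d: 2.0 * (w - d) for d in (0.0, 2.0, 4.0)]

w_sgd = w_avg = 1.0
for _ in range(20):
    w_sgd = fed_round(w_sgd, grad_fns, lr=0.1, local_steps=1)   # FedSGD
    w_avg = fed_round(w_avg, grad_fns, lr=0.1, local_steps=10)  # FedAvg
# After the same number of communication rounds, FedAvg sits much closer to
# the optimum: it trades extra local computation for fewer communications.
```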

Indeed, despite its simplicity, FedAvg has seen wide empirical success within FL due to its communication efficiency and strong predictive performance on several datasets. To this day, FedAvg remains a standard benchmark that is often hard to beat. However, a major observed challenge was that the performance of FedAvg and FedSGD degrades significantly [fedavg] when data across clients are heterogeneous, i.e. non-i.i.d. data. Here one should note that empirical results have shown that FedAvg requires fewer communication rounds than FedSGD even in the presence of heterogeneity [fedavg].

Fig. 6: An illustration of FedAvg and FedSGD. Grey arrows represent gradients evaluated on the local client. Bold red arrows represent a global model update on the central server in one communication round. In FedSGD, each client performs one step of SGD, and sends the update to the server, while FedAvg allows each client to perform multiple SGD steps before averaging.

III-B Tackling Heterogeneity

As previously discussed, an intrinsic property of IoFT is that the data distribution across clients is often imbalanced and heterogeneous. Unlike centralized systems, data cannot be randomized or shuffled prior to inference as it resides on the edge. For example, wearable devices collect data on users’ health conditions such as heartbeats and blood pressure. Due to the many differences across users, the amount of data collected can vary significantly, and the statistical patterns of these data are not alike, often with unique or conflicting trends. This heterogeneity degrades the performance of FedAvg. The reason is that minimizing the local empirical risk is sometimes fundamentally inconsistent with minimizing the global empirical risk when data are non-i.i.d. Mathematically, it also implies that $\frac{1}{N}\sum_{i=1}^{N} w_i^* \neq w^*$, where the superscript $*$ indicates an optimal parameter. This phenomenon is known as client-drift [karimireddy2020scaffold]. Notice that if local datasets are i.i.d., then as the size of the local datasets approaches infinity, each local empirical risk $f_i$ converges to the global empirical risk $F$, hence the optimal solutions coincide. In the following, we introduce some works trying to address the heterogeneity challenge.

One method to allay heterogeneity in FL is regularization. In the literature, regularization has been a popular method to reduce model complexity. As less complex models usually generalize better [friedman2001elements, rademacher], regularization attains better testing accuracy. In FL, regularization places penalties on a set of parameters in the objective function to encourage the model to converge to desired critical points. Researchers in FL have proposed several notable algorithms using regularization techniques to train global models with non-i.i.d. data. Perhaps the most basic one is FedProx [fedprox], which adds a quadratic regularizer term (a proximal term) to the client objective:

$h_i(w; w^{t-1}) = f_i(w) + \frac{\mu}{2}\|w - w^{t-1}\|^2.$   (2)

The proximal term in FedProx limits the impact of client-drift by penalizing local updates that move too far from the global model in each communication round. The parameter $\mu$ controls the degree of penalization. FedProx also allows each device $i$ to run a different number of local iterations $\tau_i$, which is especially useful when IoFT devices vary in reliability and communication/computation power. Experimental results show that FedProx can partially alleviate heterogeneity while reducing communication cost, due to its often faster convergence and the ability of reliable clients to run more updates than others. Here it is important to note that, despite reducing client-drift, FedProx is still based on inexact minimization since it does not align local and global stationary solutions.
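A minimal sketch of the FedProx client update on a toy 1-D quadratic (function names and constants are illustrative): the pull-back term $\mu(w - w^{global})$ added to the local gradient keeps the local iterate from running off to the client's own optimum:

```python
def fedprox_client_update(w_global, grad_fn, mu, lr=0.05, steps=20):
    """Gradient steps on the FedProx client objective (1-D sketch):
       h_i(w) = f_i(w) + (mu/2) * (w - w_global)^2,
    whose gradient adds the pull-back term mu * (w - w_global)."""
    w = w_global
    for _ in range(steps):
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

# A client whose local optimum (5.0) sits far from the global model (0.0)
grad_f = lambda w: 2.0 * (w - 5.0)
drift_free = fedprox_client_update(0.0, grad_f, mu=0.0)    # plain local SGD
drift_capped = fedprox_client_update(0.0, grad_f, mu=2.0)  # proximal pull
# With mu > 0 the local update stays much closer to the global model,
# which is how the proximal term limits client-drift.
```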

Besides FedProx, [dane, fedsvrg, zhang2020fedpd, acarfederated] also develop frameworks to tackle heterogeneity through regularization. Among this literature, DANE [dane] was proposed for distributed optimization yet is readily amenable to FL settings. DANE uses the local objective:

$h_i(w) = f_i(w) - \langle \nabla f_i(w^{t-1}) - \nabla F(w^{t-1}), w \rangle + \frac{\mu}{2}\|w - w^{t-1}\|^2,$   (III-B)

where $\mu$ is again a parameter weighting the regularization and $w^{t-1}$ is the global update at the previous communication round. Compared with the FedProx objective, (III-B) adds one term that depends linearly on $w$. This term aligns the gradient of the local risk to that of the global risk. To see this, one can calculate the gradient of (III-B) as $\nabla h_i(w) = \nabla f_i(w) - \big(\nabla f_i(w^{t-1}) - \nabla F(w^{t-1})\big) + \mu(w - w^{t-1})$, where the term $\nabla f_i(w^{t-1}) - \nabla F(w^{t-1})$ approximates the difference between the local and global gradient by its value at the last communication round. It is shown that objective (III-B) can be interpreted as mirror descent. Interestingly, if the local loss function is quadratic, optimizing (III-B) can approximate performing Newton updates.

The exact minimization in (III-B) is sometimes infeasible, as edge devices usually have limited computation resources. To resolve the issue, the Stochastic Controlled Averaging algorithm (SCAFFOLD) [karimireddy2020scaffold] replaces the exact minimization by several gradient descent steps on the local objective below,

$h_i(w) = f_i(w) + \langle c - c_i, w \rangle,$   (3)

where the control variates $c_i$ and $c$ are defined as $c_i = \nabla f_i(w_i^{t-1})$, i.e. the local gradient at the end of the last communication round, and $c = \frac{1}{N}\sum_{i=1}^{N} c_i$. Objective (3) is akin to (III-B), since $c_i - c$ also has the alignment effect, except that it does not have the proximal term $\frac{\mu}{2}\|w - w^{t-1}\|^2$. To show the update rule in communication round $t$, we use $w_{i,k}$ to denote the weight at the $k$-th local iterate, and set $w_{i,0} = w^{t-1}$. In round $t$, the server samples a group of clients $S^t$. For client $i$ in $S^t$, the local update of SCAFFOLD is:

$w_{i,k+1} = w_{i,k} - \eta\big(\nabla f_i(w_{i,k}) - c_i + c\big),$   (4)

for $k = 0, \ldots, \tau - 1$. After $\tau$ iterations, clients send weights $w_{i,\tau}$ and updated control variates to the server. The server takes the average of the control variates, $c = \frac{1}{N}\sum_{i=1}^{N} c_i$, and re-scales the update for the weights by a global step size $\eta_g$: $w^t = w^{t-1} + \frac{\eta_g}{|S^t|}\sum_{i \in S^t}(w_{i,\tau} - w^{t-1})$. Note here that the average $c$ is taken over all $N$ clients. For those that did not participate, SCAFFOLD re-uses the previously computed gradients.

The idea behind SCAFFOLD is very intuitive. To solve (1), the ideal (centralized) update is $w_{k+1} = w_k - \frac{\eta}{N}\sum_{i=1}^{N}\nabla f_i(w_k)$, i.e. each client uses all clients' data. However, such an update rule is not possible in IoFT due to the need to communicate the gradients with the orchestrator at every optimization iterate. To mimic the ideal update, SCAFFOLD uses $c_i$ to approximate $\nabla f_i(w_{i,k})$ using the last communication round, for all $i$. Then $c = \frac{1}{N}\sum_{i=1}^{N} c_i$ also may approximate the gradient of the global risk, $\nabla F$. If this approximation holds, the update of SCAFFOLD becomes similar to the ideal (centralized) update. One caveat in such an update scheme is that $c$ may not always equal (or approximate) the ideal value $\frac{1}{N}\sum_{i=1}^{N}\nabla f_i(w_{i,k})$. Adding to that, SCAFFOLD re-uses the previously computed gradients when clients do not participate. Therefore, when the client participation rate is low, the update can deviate far from the ideal update, leading to degraded optimization performance.
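The control-variate correction can be sketched on two heterogeneous quadratic clients (an illustrative toy, not the full SCAFFOLD with client sampling and server step sizes). For quadratics, refreshing $c_i$ with the gradient at the server model makes the corrected local gradient exactly the global gradient, so client-drift vanishes:

```python
def scaffold_local(w, grad_fn, c_i, c, lr=0.1, steps=10):
    """SCAFFOLD-style local steps: each step uses the corrected gradient
    grad_i(w) - c_i + c, where c_i is the client control variate and c the
    server (average) control variate from the previous round."""
    for _ in range(steps):
        w -= lr * (grad_fn(w) - c_i + c)
    return w

# Two heterogeneous quadratic clients f_i(w) = (w - d_i)^2; global optimum 2.0
grads = [lambda w, d=d: 2.0 * (w - d) for d in (0.0, 4.0)]

w, cs = 1.0, [0.0, 0.0]
for _ in range(30):
    c = sum(cs) / len(cs)
    new_ws = [scaffold_local(w, g, c_i, c) for g, c_i in zip(grads, cs)]
    cs = [g(w) for g in grads]  # refresh control variates at the server model
    w = sum(new_ws) / len(new_ws)
# For quadratics this correction makes every local step follow the *global*
# gradient, so both clients converge to 2.0 with no client-drift.
```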

Empirically, SCAFFOLD requires fewer communication rounds to converge compared with FedAvg. A very similar algorithm is Federated SVRG [fedsvrg], which applies stochastic variance reduced gradient descent to approximately solve (3). The update rule is $w_{i,k+1} = w_{i,k} - \eta A\big(\nabla f_i(w_{i,k}) - c_i + c\big)$, where $A$ is a diagonal matrix to rescale gradients. Federated SVRG reduces to SCAFFOLD when one sets $A$ to the identity matrix.

As discussed, despite its efficiency on several FL tasks, SCAFFOLD does not work well in low client participation cases. To this end, FedDyn [acarfederated] uses a specially designed dynamic regularization to align gradients under partial participation. The objective on client $i$ is defined as:

$h_i(w) = f_i(w) - \langle \nabla f_i(w_i^{t-1}), w \rangle + \frac{\alpha}{2}\|w - w^{t-1}\|^2.$   (5)

Objective (5) is also closely related to (III-B). In (III-B), when the weight $w^{t-1}$ is near critical points of the global risk $F$, $\nabla F(w^{t-1})$ is close to $0$, thus (III-B) reduces to (5). As a simple fixed-point analysis, when all models start from a critical point of the global loss, the optimal solution of (5) is still that point, thus local updates will stay there. FedDyn is proved to converge to critical points of the global objective with a constant stepsize. Also, to deal with partial client participation, FedDyn uses a SAG-style [sag] averaging rule in server_update: instead of only averaging gradients from clients that participated in training in one communication round, FedDyn estimates gradients on disconnected clients based on historic values and averages all gradients (or gradient estimates). In practice, FedDyn is shown to achieve similar test accuracy with much fewer communication rounds compared with FedAvg and FedProx, especially when the client participation rate is low.
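The fixed-point argument can be checked numerically with a toy 1-D sketch (illustrative names; a simplified form of the FedDyn client objective with the gradient history initialized at the critical point):

```python
def feddyn_local_grad(w, grad_fn, prev_grad, alpha, w_global):
    """Gradient of a FedDyn-style client objective (1-D sketch):
       h_i(w) = f_i(w) - prev_grad * w + (alpha/2) * (w - w_global)^2,
    where prev_grad is the client's stored gradient from the previous round."""
    return grad_fn(w) - prev_grad + alpha * (w - w_global)

# Two heterogeneous quadratic clients f_i(w) = (w - d_i)^2; their average
# (the global risk) has its critical point at w* = 2.0.
grads = [lambda w, d=d: 2.0 * (w - d) for d in (0.0, 4.0)]
w_star = 2.0

# Fixed-point check: starting at a critical point of the global loss, with
# history initialized there, every client's local gradient vanishes, so
# local updates stay put -- local and global solutions are aligned.
local_grads = [feddyn_local_grad(w_star, g, g(w_star), 0.5, w_star) for g in grads]
```

Note that each client's plain local gradient at $w^*$ is nonzero; it is the dynamic regularizer that cancels it.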

A closely related algorithm to FedDyn is Federated Primal-Dual (FedPD) [zhang2020fedpd]. FedPD and FedDyn have different formulations, but end up with the same update rule under some conditions. In FedPD, the optimization problem in (1) is reformulated as the constrained optimization problem

$\min_{w, \{w_i\}} \frac{1}{N}\sum_{i=1}^{N} f_i(w_i) \quad \text{subject to} \quad w_i = w, \; \forall i \in [N].$   (6)

To solve the constrained optimization problem, FedPD introduces dual variables $\lambda_i$, then defines the augmented Lagrangian (AL) for client $i$ to be $L_i(w_i, w, \lambda_i) = f_i(w_i) + \langle \lambda_i, w_i - w \rangle + \frac{\mu}{2}\|w_i - w\|^2$. FedPD uses alternating descent on the primal and dual variables to optimize the AL. More specifically, FedPD first randomly initializes $w$, $w_i$, and $\lambda_i$ for all clients. At round $t$, the algorithm updates $w_i$ by optimizing $L_i$ while fixing $w$ and $\lambda_i$. It then updates the dual variables by $\lambda_i \leftarrow \lambda_i + \mu(w_i - w)$ and also sets $w_i^{+} = w_i + \lambda_i/\mu$. After the local updates, FedPD makes a random choice with probability $p$: with probability $p$, all clients send the updated $w_i^{+}$ back to the orchestrator, which updates $w$ and broadcasts the updated $w$. With probability $1-p$, all clients set $w = w_i^{+}$ locally and continue local training. Interestingly, by letting $p = 1$, it was shown that FedPD is equivalent to FedDyn with full client participation on an algorithmic level [equivalencefeddynandfedpd]. However, different from FedDyn, FedPD does not directly apply to partial participation settings.

Another algorithm that uses a constrained optimization formulation is FedSplit [fedsplit], which applies Peaceman-Rachford splitting [prsplit, monotoneoperator]. More specifically, FedSplit concatenates $(w_1, \ldots, w_N)$ into one long vector and finds the optimal solution of $\sum_{i=1}^{N} f_i(w_i)$ on the subspace $E = \{(w_1, \ldots, w_N): w_1 = \cdots = w_N\}$. The problem is also known as consensus optimization [consensusoptimization]. An important concept in consensus optimization is the normal cone of $E$, defined as $N_E(x) = \{v: \langle v, y - x \rangle \leq 0, \; \forall y \in E\}$ for $x \in E$ and empty otherwise. At the optimal solution, the negative gradient should be in the normal cone of $E$:

$-\big(\nabla f_1(w_1), \ldots, \nabla f_N(w_N)\big) \in N_E(w_1, \ldots, w_N).$

FedSplit treats the gradient and the normal cone as two operators, and uses Peaceman-Rachford splitting [monotoneoperator] to find a solution that satisfies this optimality condition. After some derivations, the authors propose the following update rules. At communication round $t$, clients update their local weights $w_i^t$, send them to the server, and store a local copy. In the following round, client $i$ receives the global update $w^t$ (the average of the $w_i^t$) and computes a proximal update of its local objective centered at the reflected point $2w^t - w_i^t$. The server_update simply averages the resulting client weights. Intuitively, operator splitting adds a regularization term centered at $2w^t - w_i^t$ to the local objective. The carefully designed update rule has two advantages. First, it helps alleviate client-drift: FedSplit converges linearly to critical points of the global loss on convex problems. Second, it accelerates convergence: the theoretical convergence rate is faster than that of FedAvg on strongly convex problems.

In addition to algorithms applicable to general federated optimization problems, there are models designed specifically for neural networks to handle heterogeneity. For instance, researchers pointed out that re-permutation of neurons may cause declined performance in the aggregation step of FL. This re-permutation problem is due to the fact that different neural networks created by a weight permutation might represent the same function. For example, consider a simple NN $f(x) = W_2\,\sigma(W_1 x)$, where $\sigma$ is an activation function and $W_1$ and $W_2$ are weight matrices. One can multiply by a permutation matrix $\Pi$, setting $\tilde{W}_1 = \Pi W_1$ and $\tilde{W}_2 = W_2 \Pi^{\top}$, and the function remains the same. However, updates on different clients may be attracted to networks with different permutation matrices $\Pi$. This can cause averaging over weights to fail. To cope with this, [yurochkin2019bayesian] propose a neuron matching algorithm called Probabilistic Federated Neural Matching (PFNM). PFNM assumes $W_1$ and $W_2$ are generated by a hierarchical probabilistic model whose hyper-parameters are determined by global weights. PFNM then uses Bayesian inference to estimate the hyper-parameters, and reconstructs the global model from the inference.
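The permutation-invariance issue is easy to reproduce numerically. The pure-Python sketch below (illustrative helper names) builds a tiny two-layer ReLU network, permutes its hidden units, and verifies the permuted weights compute the same function even though the weight matrices differ, which is why coordinate-wise averaging of unmatched weights can fail:

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def relu(A):
    return [[max(0.0, v) for v in row] for row in A]

def forward(x, W1, W2):
    """Two-layer network f(x) = W2 * relu(W1 * x), with x a column vector."""
    return matmul(W2, relu(matmul(W1, x)))

W1 = [[1.0, 2.0], [3.0, -1.0]]
W2 = [[0.5, -0.5]]
P = [[0.0, 1.0], [1.0, 0.0]]       # permutation of the two hidden neurons
W1_p = matmul(P, W1)               # permute the rows of W1 ...
W2_p = matmul(W2, transpose(P))    # ... and the columns of W2 to match

x = [[1.0], [1.0]]
y, y_p = forward(x, W1, W2), forward(x, W1_p, W2_p)
# y == y_p: the two weight settings realize the same function, yet their
# coordinate-wise average is generally a different (worse) function --
# the failure mode that neuron matching (PFNM/FedMA) addresses.
```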

However, [wang2020federated] argue that PFNM can only work on simple fully connected neural networks. To solve the problem, they extend PFNM to the Federated Matched Averaging (FedMA) algorithm. FedMA updates the weights of a neural network layer by layer. First, clients train local NNs and send the trained first-layer weights $W_1^{(i)}$ to the orchestrator, where $W_l^{(i)}$ denotes the weight matrix of layer $l$ from client $i$. The server uses matching algorithms, such as the Hungarian algorithm in PFNM, to estimate the permutation $\Pi_i$ of the first-layer model weights for client $i$. Thus $\Pi_i^{\top} W_1^{(i)}$ become the matched weights after re-permutation. The server then averages the results, $W_1 = \frac{1}{N}\sum_{i=1}^{N} \Pi_i^{\top} W_1^{(i)}$, and broadcasts the averaged $W_1$. After receiving $W_1$, clients continue to train the remaining layers with the first layer fixed to $W_1$. A similar match-then-average process repeats for the remaining layers. For FedMA, the number of communication rounds equals the number of network layers. FedMA is reported to have strong performance on CIFAR-10 and a well-known language dataset called Shakespeare. Additionally, the performance of FedMA improves as the number of local epochs increases, while that of FedAvg and FedProx drops after a threshold due to the discrepancy between local models (i.e. local weights wander away from each other). Thus FedMA enables clients to train more epochs between consecutive communications.

All approaches described above are of a frequentist nature. However, there has also been a recent push on improving global modeling through a Bayesian framework. The intuition is simple; rather than betting our results on one hypothesis $w^*$ obtained via optimizing the empirical risk, one may average over a set of possible $w$'s or integrate over all $w$ weighted by their posterior probability $p(w \mid \mathcal{D})$. This is the underlying philosophy of marginalization compared to optimization, whereby in the frequentist approach predictions are obtained through substituting the posterior by $\delta(w = w^*)$, where $w^*$ is the single optimized weight and $\delta$ is an indicator function. Indeed, this notion of Bayesian ensembling has seen a lot of empirical success in Bayesian deep learning [maddox2019simple, izmailov2018averaging].

One such approach is Fed-ensemble [Fedensemble]. Fed-ensemble is a simple plug-in into any FL algorithm that aims to learn an ensemble of $K$ models without additional communication costs. To do so, Fed-ensemble follows a random permutation sampling scheme where at each communication round, every client trains one of the $K$ models and then aggregation happens for each model separately (using FedAvg or other FL approaches). This approach corresponds to a variational inference scheme [blei2017variational, zhang2018advances] for estimating a Gaussian mixture variational distribution whose centers are randomly initialized at the beginning. Predictions on a new input $x$ are then obtained by taking an average over the predictions of the $K$ models:

$\hat{y}(x) = \frac{1}{K}\sum_{k=1}^{K} h(x; w_k).$   (7)

Fed-ensemble is also able to quantify predictive uncertainty. Using a neural tangent kernel argument, the authors show that predictions from all $K$ models converge to samples from the same limiting Gaussian process in sufficiently overparameterized regimes (see Fig. 7), where each mode can behave like a model trained by centralized training.
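The random-permutation scheme can be sketched with toy scalar "models" (illustrative names and a quadratic client risk; the actual Fed-ensemble trains K neural networks). Each round, a random permutation assigns each of the K models to a different client stratum, and predictions average the K models as in (7):

```python
import random

def train_on(w, d, lr=0.1, steps=5):
    """Toy local training on the quadratic client risk (w - d)^2."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - d)
    return w

def fed_ensemble(client_data, K=3, rounds=30, seed=1):
    rng = random.Random(seed)
    models = [float(k) for k in range(K)]           # K randomly initialized models
    strata = [client_data[i::K] for i in range(K)]  # fixed client strata
    for _ in range(rounds):
        perm = list(range(K))
        rng.shuffle(perm)                           # model k trains on stratum perm[k]
        for k in range(K):
            trained = [train_on(models[k], d) for d in strata[perm[k]]]
            models[k] = sum(trained) / len(trained) # FedAvg-style aggregation per model
    return models

def ensemble_predict(models):
    """Eq.-(7)-style prediction: average the K models' outputs (for this toy
    scalar model, each 'prediction' is just the weight itself)."""
    return sum(models) / len(models)

models = fed_ensemble([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = ensemble_predict(models)
# Because each round's permutation covers all strata, the ensemble mean
# contracts toward the global optimum (2.5) even though individual models
# keep moving across strata.
```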

Fig. 7: An illustration of an ensemble of models. The three model weights on the left figure correspond to the three predictions on the right. Although the weights are well separated, the predictions admit the same limiting posterior distribution.

Another recent work taking insights from Bayesian inference is FedBE [chen2020fedbe]. It performs statistical inference on the client-trained models and uses knowledge distillation (KD) to update the global model. Intuitively, the goal of KD is to use high-quality base models sampled from a global distribution to direct the global model update. More specifically, after receiving $\{w_i^t\}$ from clients, the server fits them with a Gaussian or Dirichlet distribution and then samples from the estimated distribution to form an ensemble of $K$ models $\{w_k\}_{k=1}^{K}$. Similar to (7), the ensemble prediction on a new point $x$ is given by $\hat{y}(x) = \frac{1}{K}\sum_{k=1}^{K} h(x; w_k)$. In server_update, the global model $w^t$ is trained to mimic the average prediction of the models in the ensemble by minimizing the discrepancy between the two predictions evaluated on an additional unlabeled dataset $\mathcal{U}$ on the server:

$w^t = \arg\min_{w} \; \mathbb{E}_{x \in \mathcal{U}}\left[\mathrm{Div}\big(\hat{y}(x), h(x; w)\big)\right],$

where Div denotes a divergence measure, here cross-entropy. The updated $w^t$ is then sent to all clients. The authors empirically show that the ensemble and knowledge distillation turn out to be more robust to non-i.i.d. data than FedAvg. This approach, however, requires storing additional data on the server, which is not always feasible.

III-C Efficient & Effective Optimization

Several studies attempt to improve FedAvg by adapting adaptive optimization algorithms to the FL realm. They show theoretically or empirically that the improved algorithms can converge faster and accelerate global model training. In general, acceleration can be achieved by either improving the server aggregation step (server_update) or the client updates (client_update). FedAdam and FedYogi [fedadam] bring the well-known Adam [kingma2014adam] and Yogi [reddi2018adaptive] algorithms to FL by augmenting the server_update function with adaptive stepsizes. More specifically, FedAdam and FedYogi use a second-order moment estimate $v^t$ to adaptively adjust the learning rate. $v^0$ is initialized at the beginning. Upon receiving $w_i^t$ from clients, the server calculates the pseudo-gradients $\Delta_i^t = w_i^t - w^{t-1}$ and averages them: $\Delta^t = \frac{1}{|S^t|}\sum_{i \in S^t}\Delta_i^t$. FedAdam updates $v^t$ as:

$v^t = \beta_2 v^{t-1} + (1 - \beta_2)(\Delta^t)^2,$

and FedYogi as:

$v^t = v^{t-1} - (1 - \beta_2)(\Delta^t)^2 \,\mathrm{sign}\big(v^{t-1} - (\Delta^t)^2\big),$

where $\beta_2$ is a parameter for exponential weighting. The update rule for both FedAdam and FedYogi is:

$w^t = w^{t-1} + \eta \frac{\Delta^t}{\sqrt{v^t} + \tau},$

where $\tau$ is a small constant for numerical stability. Though the proven theoretical convergence rates of FedAdam and FedYogi are only comparable to those of FedAvg, the adaptive methods show strong performance on several FL tasks. Considering the success of adaptive stepsize methods in numerous important fields, including language models [attention] and GANs [gantraining2019, gantraining2020], amongst others, we believe their use in FL is promising. A related algorithm in this vein is federated averaging with server momentum (FedAvgM) [liu2020accelerating], which uses server momentum in the server_update step.
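A 1-D sketch of the FedAdam-style server step (illustrative names and constants; FedYogi differs only in the second-moment update): the server averages client pseudo-gradients, maintains an exponential second-moment estimate, and takes an adaptive step:

```python
import math

def local_train(w, d, lr=0.1, steps=5):
    """Toy client: a few SGD steps on the quadratic risk f_i(w) = (w - d)^2."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - d)
    return w

def fedadam_server_update(w, client_ws, v, lr=0.1, beta2=0.99, tau=1e-3):
    """FedAdam-style server step (1-D): average the client pseudo-gradients
    Delta_i = w_i - w, update the second-moment estimate v, then take the
    adaptive step w <- w + lr * Delta / (sqrt(v) + tau)."""
    delta = sum(wi - w for wi in client_ws) / len(client_ws)
    v = beta2 * v + (1.0 - beta2) * delta ** 2
    return w + lr * delta / (math.sqrt(v) + tau), v

# Two heterogeneous clients with optima 0 and 4; global optimum is 2.0
w, v = 0.0, 0.0
for _ in range(200):
    client_ws = [local_train(w, d) for d in (0.0, 4.0)]
    w, v = fedadam_server_update(w, client_ws, v)
# w settles near the global optimum; the adaptive denominator damps the
# step size as the averaged pseudo-gradients shrink.
```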

Besides modifying server_update, a multitude of algorithms redesign the client_update function. For instance, there are attempts to expedite local training by incorporating acceleration techniques from optimization. FedAc [fedac] is a federated version of accelerated SGD. Instead of updating a single variable as FedAvg does, FedAc iteratively updates three sequences on the client side for several steps, governed by four hyper-parameters: two exponential averaging parameters and two stepsize parameters. The server averages two of these sequences from the sampled clients and broadcasts the averages, which clients take as initializations in the next communication round. The algorithm then proceeds till convergence. [fedac] theoretically prove that FedAc can achieve a linear convergence rate faster than FedAvg when the global risk in (1) is strongly convex. Empirical results show that FedAc saves communication cost when there are many devices in the network.

LoAdaBoost [loadaboost] adaptively determines the training epochs of clients by monitoring the training loss on each client and adjusting the training schedule accordingly. More specifically, after one communication round, clients send their training losses, in addition to the updated weights, to the server. The server estimates the median training loss $\ell_{med}$. In the next round, all clients first train for a number of epochs determined by the average epoch budget $b$. If a client's training loss is lower than $\ell_{med}$, local training is deemed to have reached its goal in this round, and the updated weights are directly sent back to the server. If the training loss on client $i$ is higher than $\ell_{med}$, then the model underfits client $i$. As a result, LoAdaBoost trains the model on client $i$ for extra epochs until the local training loss drops below $\ell_{med}$ or the total number of epochs exceeds a cap, whichever comes first. Such dynamic training schedules allow LoAdaBoost to take clients' resources into consideration, and thus better utilize the computation power on edge devices and enable faster training.
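The schedule can be sketched as follows (an illustrative simplification: "training" just shrinks each client's loss by a constant factor per epoch, and the budget/cap values are made up for the example):

```python
def client_train(loss, epochs, decay=0.7):
    """Toy local training: each epoch shrinks the training loss by `decay`."""
    return loss * decay ** epochs

def loadaboost_round(client_losses, budget=3, max_epochs=5):
    """LoAdaBoost-style scheduling sketch: every client first trains for
    `budget` epochs; clients still above the previous round's median loss
    keep training, one epoch at a time, until they pass the median or hit
    `max_epochs` total epochs."""
    median = sorted(client_losses)[len(client_losses) // 2]
    new_losses, epochs_used = [], []
    for loss in client_losses:
        epochs = budget
        loss = client_train(loss, budget)
        while loss > median and epochs < max_epochs:
            loss = client_train(loss, 1)
            epochs += 1
        new_losses.append(loss)
        epochs_used.append(epochs)
    return new_losses, epochs_used

# Only the worst-fitting client (loss 16.0) needs an extra epoch to pass the
# median loss of 4.0; the rest stop at the budget of 3 epochs.
losses, epochs = loadaboost_round([1.0, 2.0, 4.0, 8.0, 16.0])
```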

III-D Sampling Clients

Due to the often sheer size and unreliability of edge devices participating within IoFT, not all clients can participate in each communication round of the training process as shown in Algorithm 1. Therefore, choosing the appropriate subset at each communication round between the orchestrator and client is of utmost importance in FL. Here we shed light on some existing schemes, other possible alternatives and their implications.

We first note that an alternative way to write the global objective in (1) is by giving different weights to the client risk functions:

$\min_w F(w) := \sum_{i=1}^{N} p_i f_i(w),$   (8)

where $p_i$ is a weight such that $p_i \geq 0$ and $\sum_{i=1}^{N} p_i = 1$. In IoFT, it is common to have datasets of different sizes. Thus, a natural choice is to set $p_i = n_i/n$, where $n = \sum_{i=1}^{N} n_i$ is the total data size across all clients. Clearly, if all clients have the same dataset size, objective (8) reduces to (1).

Indeed, although most FL algorithms use (1), both FedAvg and FedProx (among the earliest methods) use (8) with weights $p_k = n_k / N$. FedAvg samples $m$ clients uniformly at random and averages the sampled client models with weights proportional to their local dataset sizes. FedProx, on the other hand, samples clients with probability $p_k$ and averages the sampled client models with equal weights $1/m$. In both cases, the sampling probabilities and aggregation weights are chosen to make the aggregated client updates unbiased estimates of the full-participation global update, i.e., unbiased estimates of

$$\sum_{k=1}^{K} p_k \, \Delta w_k,$$

where $\Delta w_k$ denotes the update client $k$ would contribute if it participated.
However, both sampling schemes have potential drawbacks. For example, uniform sampling may be inefficient since the orchestrator can often sample unreliable clients or clients with very small datasets. Dataset-size-based sampling addresses this issue, but it may raise fairness concerns as some clients are rarely sampled and trained. It also makes the training procedure more vulnerable to adversarial clients with large datasets that can directly sway the training process.

To form better sampling schemes and accelerate training, adaptive sampling techniques have also been proposed. These FL algorithms update the sampling probabilities after each communication round based on historical statistics [clientsamplewithloss, clientsamplewiththeta, oort, clientsamplingwithgn]. Such methods usually sample more often the clients on which the model fits poorly. Intuitively, when the model incurs a high training loss or large gradient norm on client $k$, that client is not well served by the current model and should be trained for more epochs.

There is a range of choices for measuring the performance of a model on a client. Among them, one line of literature calculates the sampling probabilities adaptively using the gradient norms of the clients [convexadaptivesampling, nonconvexadaptivesampling, adambs]. Generally, this can be written as

$$p_k = \frac{\lVert \nabla F_k(w) \rVert}{c},$$

where $c = \sum_{j=1}^{K} \lVert \nabla F_j(w) \rVert$ is a normalizing constant. Other approaches sample clients based on their training loss [banditsurvey, clientsamplewithloss], where the gradient norm is substituted by the local loss at the end of each training round. In this line of work, exploration-exploitation schemes are also used to continuously update the sampling probabilities. Client selection usually improves model performance and speeds up the training process; for example, Oort [oort] reports substantial speed-ups in time-to-accuracy compared with vanilla FedProx or FedYogi.
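Norm-based adaptive sampling can be sketched in a few lines. This is a simplified illustration under stated assumptions: real methods add exploration terms and staleness corrections, and the function names here are invented for the sketch.

```python
import random

def sampling_probs(grad_norms):
    """p_k = ||g_k|| / c with normalizing constant c = sum_j ||g_j||.

    Clients with larger gradient norms (worse model fit) are sampled
    more often. A minimal sketch of norm-based adaptive sampling.
    """
    c = sum(grad_norms)
    return [g / c for g in grad_norms]

def sample_clients(probs, m, rng=random):
    # Draw m client indices (with replacement) according to probs.
    return rng.choices(range(len(probs)), weights=probs, k=m)

probs = sampling_probs([4.0, 1.0, 3.0, 2.0])
print(probs)                    # [0.4, 0.1, 0.3, 0.2]
print(sample_clients(probs, 3))
```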

Due to prevailing statistical and system heterogeneity among clients, we believe client sampling techniques will be of great significance when practitioners deploy FL frameworks in IoFT. An effective sampling scheme can efficiently exploit differences in clients' resources while at the same time improving training speed and accuracy. Further, studying the connections between adaptive sampling and the client re-weighting schemes (see Sec. III-E) used for fairness is an interesting topic worthy of investigation.

III-E Fairness across clients

In IoFT, it is crucial to ensure that all edge devices attain good predictive performance. The key challenge is that devices with insufficient data, limited bandwidth, or unreliable internet connections are not favored by conventional FL algorithms. Such devices can end up with poor predictive ability. Besides this notion of individual fairness, group fairness also deserves attention in FL. As FL penetrates many practical applications, it is important to achieve fair performance across groups of clients characterized by their gender, ethnicity, etc. Before diving into the literature, we first formally define the notion of fairness. Suppose there are $G$ groups (e.g., ethnicities) and each client can be assigned to one of those groups. Group fairness can be defined as follows.

Definition 1.

Denote by $\{a_1, \dots, a_G\}$ the set of performance measures (e.g., testing accuracies) of a trained model $w$ on the $G$ groups. For trained models $w$ and $\tilde{w}$, we say $w$ is more fair than $\tilde{w}$ if $\mathrm{Var}(a_1, \dots, a_G) < \mathrm{Var}(\tilde{a}_1, \dots, \tilde{a}_G)$, where $\mathrm{Var}$ denotes the variance.

When each group consists of a single client ($G = K$), this definition reduces to individual fairness. Definition 1 is widely adopted in the FL literature [mohri2019agnostic, li2019fair, huang2020fairness, zhang2020fairfl, zeng2021improving]. This notion of fairness differs from traditional definitions such as demographic disparity [feldman2015certifying], equal opportunity, and equalized odds [hardt2016equality] in centralized systems. The reason is that those conventional definitions cannot be directly extended to FL, as there is no clear notion of an outcome that is optimal for an edge device [kairouz2019advances]. Instead, fairness in FL can be defined as equal access to effective models (e.g., accuracy disparity [zafar2017fairness] or representation disparity [li2019fair]). Specifically, the goal is to train a global model that attains uniformly good performance across all devices or groups [kairouz2019advances].
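The variance-based comparison in Definition 1 is straightforward to compute; the following sketch (with illustrative accuracy numbers) shows how two trained models would be compared.

```python
from statistics import pvariance

def more_fair(perf_a, perf_b):
    """Definition 1: model A is more fair than model B if the variance of
    its per-group performance measures is lower."""
    return pvariance(perf_a) < pvariance(perf_b)

uniform = [0.80, 0.81, 0.79]   # similar accuracy across the G groups
skewed  = [0.95, 0.60, 0.85]   # one group is clearly disadvantaged
print(more_fair(uniform, skewed))   # True
```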

Despite the importance of fairness, unfortunately, very limited work exists along this line in FL. As will become clear shortly, the few works in this area mainly focus on client re-weighting schemes that exploit the weighted global objective in (8) instead of (1).

GIFAIR-FL [yue2021gifair] is the first algorithm that can handle both group and individual fairness in FL. Specifically, it achieves fairness by penalizing the spread of the loss among client groups. This translates into the following optimization problem:

$$\min_{w} \; \sum_{k=1}^{K} p_k F_k(w) + \lambda \sum_{i < j} \left| L_i(w) - L_j(w) \right|,$$

where $\lambda$ is a regularization parameter,

$$L_g(w) = \frac{1}{|A_g|} \sum_{k \in A_g} F_k(w)$$

is the average loss of group $g$, and $A_g$ is the set of indices of devices belonging to group $g$. The original formulation can be further simplified as

$$\min_{w} \; \sum_{k=1}^{K} \left( p_k + \lambda r_k \right) F_k(w),$$

where $r_k$ is a scalar related to the statistical ordering of the group loss $L_{g(k)}(w)$ among all client group losses, and $g(k)$ is the group index of device $k$. Therefore, to collaboratively minimize the global objective, each edge device minimizes $(p_k + \lambda r_k) F_k(w)$, a scaled version of its original local loss function $F_k$. The central server aggregates local parameters and updates the scalars $r_k$ at every communication round. From this expression, one can see that a higher value of $r_k$ is assigned to clients belonging to groups with higher loss. GIFAIR-FL therefore imposes higher weights on clients with poor performance. Furthermore, those weights are dynamically updated at every communication round to avoid possible model over-fitting. [yue2021gifair] show that GIFAIR-FL converges to an optimal solution, or a stationary point in the non-convex case, even when heterogeneity exists.

Agnostic federated learning (AFL) [mohri2019agnostic] is another algorithm that re-weights clients at each communication round. Specifically, it solves a robust optimization problem of the form

$$\min_{w} \max_{\boldsymbol{\lambda} \in \Lambda} \; \sum_{k=1}^{K} \lambda_k F_k(w),$$

where $\Lambda$ is the set of feasible mixing weights. AFL thus computes the worst-case combination of weights among edge devices. This approach is robust but may be conservative, since it focuses only on the largest loss and can therefore yield pessimistic performance for other clients. [du2021fairness] further refine the notion of AFL by linearly parametrizing the weight parameters with kernel functions. Building on this, [hu2020fedmgda+] combine minimax optimization with gradient normalization to formulate a new fair algorithm, FedMGDA+.

Inspired by fair resource allocation problems in wireless networks, [li2019fair] propose the q-FFL algorithm for fairness. They slightly modify the loss function by raising each user's loss to a power:

$$\min_{w} \; \sum_{k=1}^{K} \frac{p_k}{q+1} F_k^{\,q+1}(w).$$

The intuition is that $q$ tunes the amount of fairness: a larger $q$ makes the algorithm place more emphasis on users with poor performance (high loss). Therefore, q-FFL encourages more uniform accuracy across all users.
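The effect of $q$ can be checked numerically. The sketch below evaluates the reweighted objective for two clients, one with a much higher loss; setting $q = 0$ recovers the plain weighted average, while $q = 1$ amplifies the high-loss client's contribution. The function name is illustrative.

```python
def qffl_objective(losses, weights, q):
    """q-FFL objective: sum_k p_k / (q+1) * F_k^(q+1).

    q = 0 recovers the weighted objective (8); larger q amplifies
    the contribution of clients with high loss.
    """
    return sum(p * (f ** (q + 1)) / (q + 1) for p, f in zip(weights, losses))

losses, weights = [0.5, 2.0], [0.5, 0.5]
print(qffl_objective(losses, weights, q=0))   # plain weighted average: 1.25
print(qffl_objective(losses, weights, q=1))   # 1.0625: high-loss client dominates
```

Note how, relative to $q=0$, the $q=1$ objective draws almost all of its value (1.0 of 1.0625) from the high-loss client, so its gradient dominates the update.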

To the best of our knowledge, fairness is an under-investigated yet critical area in the FL setting. We hope this section can inspire the continued exploration of fair FL algorithms.

IV Learning a Personalized Model

Fig. 8: Personalized IoFT

As highlighted in previous sections, heterogeneity is a fundamental challenge for IoFT. IoFT devices often exhibit highly heterogeneous trends and behaviors due to differences in operational, environmental, cultural, socio-economic, and specification conditions [kontar2017nonparametric, kontar2018nonparametric, yue2019variational]. For instance, in manufacturing, operational differences involve changes in the speed, load, or temperature a product experiences. As a result, data distributions across edge devices can be so heterogeneous that a single global model cannot perform consistently well on all edge devices. This also has severe fairness implications, as devices with limited data and unreliable connections are not favored by many FL algorithms due to the higher weights (recall FedAvg and its variants) given to devices with more data or those that can participate more often in the training process. Indeed, in the past few years, multiple papers have shown the wide gap in a global model's performance across different devices when heterogeneity exists [jiang2019improving, hard2018federated, wang2019federated, smith2017federated, kairouz2019advances].

One straightforward solution to the challenges above is personalization. As shown in Fig. 8, instead of using one global model for all edge devices, personalized FL fits tailor-made models for IoFT devices while leveraging information across all those devices. The rest of this section discusses current personalization approaches, their drawbacks, and potential alternatives. We divide personalization techniques into fully personalized and semi-personalized. For fully personalized algorithms, each edge device retains its own individualized model; for semi-personalized algorithms, models are tailor-made only for groups of clients. In Sec. V, we further discuss personalization from a meta-learning perspective.

IV-A Fully Personalized

From a statistical perspective, let the marginal distribution $P(x)$ be shared across clients, but let the conditional distribution of $y$ given $x$ vary across IoFT devices. One can write this as $y = f(x; w_k)$, where clients share the same functional form $f$ (e.g., a linear model or a neural network) yet with different parameters $w_k$. In this situation, the difference in the data distributions across clients can be explained by the difference in $w_k$. This is often referred to as a concept shift and implies a change in the input-output relationship across clients [machinelearninginnonstationaryenvironments, mcmahan2021advances]. For example, in manufacturing, the same design setting can have different effects on the manufactured product given external factors such as operational speed or load. Also, take the sequence prediction task on mobile phones as an example: for different users, the word following "I live in …" should differ [surveyonpersonalization]. This example corresponds to a concept shift: $x$ is the given part of the sentence "I live in", and $y$ is the next word to predict. Here, $P(y \mid x)$ should be customized for different clients even if $P(x)$ is the same.

This section discusses current approaches to address a concept shift across clients, their drawbacks, and promising alternatives. Modeling a shift in $P(x)$ is highlighted in our statistical perspective (Sec. VI-A).

To accommodate client-specific concept shifts while leveraging global information, one can extend the global FL model in (1) to the following general objective for personalized FL:

$$\min_{w_g, \, w_1, \dots, w_K} \; \sum_{k=1}^{K} p_k F_k(w_g, w_k), \tag{10}$$

where $w_g$ are shared global parameters while $w_k$ is a set of unique parameters for client $k$.

The current literature aiming to address a concept shift can be broadly split into two categories: (i) weight sharing and (ii) regularization. It will also become clear shortly that many current approaches follow a train-then-personalize philosophy which may be dangerous in some instances.

IV-A1 Weight Sharing

The first set of literature solves (10) by using different layers of a neural network to represent $w_g$ and $w_k$ [fedper, lg-fedavg]. The underlying idea is that base layers process the input to learn a shared feature representation across clients, while top layers learn task-dependent weights on top of those features.

FedPer [fedper] fits global base layers and personalizes the top layers. As an example, a fully connected multi-layer neural network can be expressed as $f(x) = W_L \, \sigma(W_{L-1} \cdots \sigma(W_1 x))$, where $L$ is the number of network layers. Recall from Sec. III-B that $\sigma$ denotes an activation function and the $W_l$'s are weight matrices. In this example, FedPer takes layers $W_1$ through $W_{L_B}$ as base layers that characterize $w_g$, and layers $W_{L_B+1}$ through $W_L$ as personalized layers that characterize $w_k$ in (10). In one communication round, client $k$ uses SGD to update $w_g$ and $w_k$ simultaneously. However, unlike FedAvg, only $w_g$ is transmitted to the server, where it is then aggregated. FedPer is found to perform better than FedAvg on image classification tasks such as CIFAR-10 and CIFAR-100. On these datasets, the authors show that personalizing the last one or two basic residual blocks of ResNet-34 yields the best testing performance. Similarly, LG-FedAvg [lg-fedavg] takes the top layers as the global weights $w_g$ and the base layers as the personalized weights $w_k$. The intuition is to learn customized representation layers for different clients, and to train a global model that operates on the local representations. Additionally, by carefully designing the representation-learning loss, the generated local representations can confound protected attributes like gender, race, etc.
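The base/top split can be sketched without any deep-learning framework by treating a model as an ordered list of per-layer parameters. This is a toy illustration of the communication pattern only: layer values are scalars standing in for weight matrices, and all names are invented for the sketch.

```python
def split_weights(weights, n_base):
    """FedPer-style split: the first n_base layers are shared (base),
    the rest stay personalized on the device. `weights` is an ordered
    list of per-layer parameters (scalars here, matrices in practice)."""
    return weights[:n_base], weights[n_base:]

def server_aggregate(base_updates):
    # Average only the base layers received from clients; personalized
    # top layers never leave the devices.
    n_layers = len(base_updates[0])
    return [
        sum(client[l] for client in base_updates) / len(base_updates)
        for l in range(n_layers)
    ]

client_a = [1.0, 2.0, 9.0]    # two base layers + one personalized head
client_b = [3.0, 4.0, -5.0]
base_a, head_a = split_weights(client_a, n_base=2)
base_b, head_b = split_weights(client_b, n_base=2)
print(server_aggregate([base_a, base_b]))   # [2.0, 3.0]; heads stay local
```

Note that LG-FedAvg would invert the split: the server would aggregate the top layers while the base (representation) layers stay local.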

IV-A2 Regularization

In contrast to splitting global and local layers, other recent works treat the neural network holistically and learn personalized weights $w_k$ by exploiting regularization [mocha, multitask-fl, pfedme].

Perhaps the most straightforward way to personalize via regularization is to follow a train-then-personalize (TTP) approach. As the name suggests, this approach trains the global model on all clients and then adapts it to individual devices. The simplest way to adapt is fine-tuning [local-adaptation, finetune], which is also widely employed in computer vision and natural language processing [gpt2]. More specifically, TTP is a two-step procedure. Step 1 - Train: clients collaborate to train a global model $w_g^{(T)}$ using FedAvg (or its variants), where $T$ is the last communication round. Step 2 - Personalize: each client makes small local adjustments to $w_g^{(T)}$ based on its own data to obtain $w_k$. Notice that for such methods, the $w_k$'s and $w_g$ live in the same parameter space, so one can add or subtract these weight vectors. Weight-regularizing methods thus usually allow the entire weight vector to differ across clients, instead of forcing some coordinates of the weight vectors to be exactly the same.

A simple means for the personalization step is to start from $w_g^{(T)}$ and perform a few steps of SGD to minimize the local loss function $F_k$. Indeed, this approach to fine-tuning is shown to generalize better than fully local training or global modeling on next-word prediction [finetunebert] and image classification tasks (e.g., [finetuneface, finetunecovid]). In the same spirit, one may exploit regularization to encourage the weights of personalized models to stay in the vicinity of the global model parameters, balancing each client's shared knowledge and unique characteristics. For instance, using ideas from FedProx, the personalization step can encourage $w_k$ to remain within a vicinity of the global solution:

$$\min_{w_k} \; F_k(w_k) + \frac{\mu}{2} \left\lVert w_k - w_g^{(T)} \right\rVert^2.$$
Other forms of regularization can also be used. For instance, by employing the popular elastic weight consolidation (EWC) model [kirkpatrick2017overcoming], often used in continual learning, we can control $w_k$ as

$$\min_{w_k} \; F_k(w_k) + \frac{\mu}{2} \sum_{j} \mathcal{F}_{jj} \left( w_{k,j} - w_{g,j}^{(T)} \right)^2,$$

where the $\mathcal{F}_{jj}$ are the diagonal elements of the Fisher information matrix evaluated at $w_g^{(T)}$.

Some recent approaches [ditto, pfedme] have exploited the ideas above in an iterative manner, where local and global parameters are obtained by alternating optimization rather than a single train-then-personalize pass. Among them, Ditto [ditto] proposes the following bi-level optimization problem for client $k$:

$$\min_{w_k} \; F_k(w_k) + \frac{\lambda}{2} \left\lVert w_k - w_g^{\ast} \right\rVert^2 \quad \text{s.t.} \quad w_g^{\ast} \in \arg\min_{w} f(w), \tag{12}$$

where $f(w)$ is the global objective in (1).
To solve this formulation, Ditto uses the following update rule. In communication round $t$, client $k$ first receives a copy of the global weights $w_g^{(t)}$, which it updates using multiple (S)GD steps on the local risk function $F_k$, much like FedAvg. In the meantime, the client also updates its personalized weights $w_k$ through multiple descent steps on the regularized loss (12):

$$w_k \leftarrow w_k - \eta \left( \nabla F_k(w_k) + \lambda \left( w_k - w_g^{(t)} \right) \right).$$

At the end of the training round, client $k$ sends only the updated global weights back to the server, which simply averages the received weights. Empirically, Ditto has shown strong personalization accuracy on multiple commonly used FL datasets.
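A single client's Ditto round can be illustrated on a one-dimensional problem. This is a minimal scalar sketch under a quadratic local loss; the function and variable names are invented, and a real implementation would operate on full weight vectors with stochastic gradients.

```python
def ditto_local_update(w_global, v_pers, grad_fn, lam, lr, steps):
    """One client's Ditto round (scalar sketch).

    The received global copy is updated with plain gradient steps on the
    local loss (as in FedAvg), while the personalized weight descends the
    proximally regularized loss F_k(v) + lam/2 * (v - w*)^2, pulling it
    toward the received global weights.  grad_fn returns dF_k/dw.
    """
    w = w_global
    for _ in range(steps):
        w -= lr * grad_fn(w)                              # global-copy step
    v = v_pers
    for _ in range(steps):
        v -= lr * (grad_fn(v) + lam * (v - w_global))     # personalized step
    return w, v   # only w is sent back to the server

# Quadratic local loss F_k(w) = (w - 3)^2, so grad_fn(w) = 2*(w - 3).
w_new, v_new = ditto_local_update(
    0.0, 0.0, lambda w: 2 * (w - 3), lam=1.0, lr=0.1, steps=50
)
# w_new converges to the local optimum 3.0; v_new settles near 2.0,
# between the local optimum and the received global weight 0.0.
```

The gap between `w_new` and `v_new` shows the role of $\lambda$: as $\lambda \to 0$ the personalized weight becomes purely local, while a large $\lambda$ pins it to the global model.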

In Ditto, the global weight update is independent of the personalized weights and follows the FedAvg procedure; hence the global weights cannot learn from the performance of the personalized weights. To couple the updates of $w_g$ and $w_k$, [pfedme] proposes Moreau-envelope FL (pFedMe) for personalization. pFedMe formulates the following bi-level optimization problem:

$$\min_{w_g} \; \sum_{k=1}^{K} p_k \hat{F}_k(w_g), \qquad \hat{F}_k(w_g) = \min_{w_k} \; F_k(w_k) + \frac{\lambda}{2} \left\lVert w_k - w_g \right\rVert^2.$$

pFedMe gets its name because $\hat{F}_k$ is the Moreau envelope of $F_k$. In the inner-level optimization, the personalized weights $w_k$ minimize the local risk function in the vicinity of the reference point $w_g$, and in the outer-level minimization, the average Moreau envelope is minimized to produce a better reference point. This objective is closely related to model-agnostic meta-learning (MAML); Sec. V covers meta-learning algorithms in more detail. The optimal solution of pFedMe satisfies the relation