
Federated Learning: Balancing the Thin Line Between Data Intelligence and Privacy

by   Sherin Mary Mathews, et al.
US Bank

Federated learning holds great promise in learning from fragmented sensitive data and has revolutionized how machine learning models are trained. This article provides a systematic overview and detailed taxonomy of federated learning. We investigate the existing security challenges in federated learning and provide a comprehensive overview of established defense techniques for data poisoning, inference attacks, and model poisoning attacks. The work also presents an overview of current training challenges for federated learning, focusing on handling non-i.i.d. data, high dimensionality issues, and heterogeneous architecture, and discusses several solutions for the associated challenges. Finally, we discuss the remaining challenges in managing federated learning training and suggest focused research directions to address the open questions. Potential candidate areas for federated learning, including IoT ecosystem, healthcare applications, are discussed with a particular focus on banking and financial domains.



1. Landscape of Federated Learning

1.1 Need for Federated Learning

Data has always been a significant priority for businesses of all sizes, especially in Financial Services. As technology advances, companies capture customer data from many sources, for example by tracking customers' activities and appending external data sources to proprietary ones, to better contextualize data and draw new insights. However, given the sensitive nature of customer data, preserving its privacy is vital hahn2018security; yang2019federated.

Several factors have driven a global consensus on preserving the data privacy and security of individuals, businesses, and societies. As a result, governments are strengthening data security and privacy protection measures. For example, the General Data Protection Regulation (GDPR) aims to protect users' privacy by requiring operators to obtain clear user consent and adhere to sufficient data privacy requirements khan2019data. The establishment of these laws and regulations poses new challenges, to varying degrees, to the traditional data processing mode of Artificial Intelligence (AI). With frequent incidents of personal data breaches and individual and institutional data rights not being equal, there is a need for strict data privacy regulations. As traditional machine learning exposes more of its drawbacks, finding new, secure, and effective ways to collect and learn from data becomes crucial, thereby opening new research avenues and methods to build personalized models without violating user privacy.

In addition to the privacy-preserving dilemma, there is an additional dilemma related to “data isolated islands” li2020preserving. Data is foundational for building AI models, and data islands lead to data being stored, maintained, and isolated in different organizations li2019survey. Integrating the data scattered in various organizations is challenging and could introduce a considerable cost. These challenges are particularly prevalent in the Financial Services industry, where data is often stored and processed in a highly segregated manner due to regulatory and privacy concerns. The following sections discuss the two key challenges and why federated learning is suited for their resolution.

1.2 Definition of Federated Learning

Federated learning is a distributed machine learning architecture that addresses the dilemma of learning across data silos by enabling training over multiple decentralized data stores. The federated learning setting can also yield more robust models without sharing data, leading to privacy-preserving solutions with higher security and tighter access privileges for data konevcny2015federated; konevcny2016federated. Traditional machine learning frameworks are primarily centralized: data is collected, processed, cleaned, and used for training in one place, which requires the training data to reside on the same server. In contrast, federated learning adheres to two important ideas, local computing and model transmission, which reduce the systematic privacy risks and costs of traditional centralized machine learning methods. The inherent focus on decentralized training is aimed at ensuring the data privacy of each device zhang2021survey.

Federated learning brings the model and code to the data rather than taking the data to where the model resides, as is currently the case in most learning approaches; i.e., the data remains locked on a server or edge device while only the algorithm travels between the servers xu2021federated. Federated learning is also known as collaborative learning: algorithms are trained across multiple devices or servers holding decentralized data samples, without exchanging the actual data. This approach differs radically from more established techniques, such as uploading the data samples to servers or keeping data in some form of distributed infrastructure.
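The two ideas above, local computing and model transmission, can be sketched with a toy federated averaging loop. This is a minimal illustration under stated assumptions, not the procedure of any particular framework: the model is a one-dimensional linear regression, and the names `local_update`, `federated_average`, and `run_rounds` are hypothetical. Only model weights move between the clients and the server; the raw data never leaves each client.

```python
import random

def local_update(weights, data, lr=0.1, epochs=5):
    """Local computing: a few gradient steps of y = w*x + b on one client's data."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(client_weights, client_sizes):
    """Model transmission: the server averages weights, weighted by sample count."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def run_rounds(clients, rounds=50):
    """Repeat: broadcast the global model, train locally, aggregate."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        global_weights = federated_average(updates, [len(d) for d in clients])
    return global_weights

# Hypothetical toy data: every client observes the same relation y = 3x + 1.
rng = random.Random(0)

def make_client(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 1.0) for x in xs]

clients = [make_client(20), make_client(30), make_client(50)]
final_w, final_b = run_rounds(clients)
print(round(final_w, 3), round(final_b, 3))
```

Weighting each client by its sample count is one common design choice; it keeps a client with very little data from dominating the global model.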

2. Taxonomy and System Architecture of Federated Learning

As data privacy increasingly becomes a critical societal concern, federated learning has been a crucial research topic in enabling the collaborative training of machine learning models among different organizations under privacy restrictions. As researchers try to support more machine learning models with different privacy-preserving approaches, there is a need to develop systems and infrastructures to reduce the complexity and effort of various federated learning algorithms. This section conducts a review of federated learning architectures li2019survey and analyzes the system components to understand the critical system design components and guide future research. Compared to other federated learning reviews focused on general communication architecture yang2019federated, platforms aledhari2020federated, and protocols lim2020federated; lin2020ensemble, this paper mainly provides an overview of the federated learning paradigm, focusing primarily on machine learning and security criteria. Specifically, we provide a taxonomy for federated learning systems according to the following four aspects: data distribution, privacy mechanisms, federation scale, and open-source frameworks.


2.1. Data Distribution

Depending on how the training and inference data are distributed, existing federated learning approaches can be classified into horizontal federated learning, vertical federated learning, and federated transfer learning.


Horizontal federated learning is a federated learning approach in which the datasets on the devices share the same attributes but contain different instances yang2019federated. In this category of federated learning, users have similar attributes in terms of domain usage style and derived statistical information. Examples include a machine learning model that predicts the probability of cancer cells occurring, or a model for next-word prediction and keyword spotting leroy2019federated. Federated learning allows models to learn from user-sensitive data (e.g., date of birth, account number, medical images) while exchanging only aggregated updates from each client.

Vertical federated learning is applicable where data shared between unrelated domains is used to train the global model. Clients using this approach rely on an intermediate resource that provides encryption logic to guarantee that only the common data statistics are shared between separate organizations yan2016survey; cheng2021secureboost. A real-world use case would be a scenario where the marketing team of a bank's credit card division would like to improve their model by learning the most purchased items from external online shopping domains. User information from the bank and details from separate shopping sites are shared to train the model with the intermediate encryption logic, ensuring a restricted and secure exchange of only derived statistics. With this exchange of information, banking domains can serve customers better with relevant offers, and online shopping domains can revise their points allocation for customers using credit card transactional data.

Federated transfer learning applies the classic transfer learning technique pan2009survey to train for a new requirement on a pre-trained model that has already been trained on a similar dataset to solve an entirely different problem liu2020secure. Training on a pre-trained model gives an advantage compared to using a fresh model built from scratch. A real-world example would be transferring a global model to an individual user and adjusting it to provide a customized model on that user's wearable device. Similar to vertical federated learning, participants can benefit from larger datasets and well-trained machine learning model statistics to serve their unique requirements using a federated transfer learning approach.
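The difference between horizontal and vertical partitioning can be illustrated with a small table of records. The sketch below is only an illustration: the field names and the helpers `horizontal_split`, `vertical_split`, and `aligned_ids` are hypothetical, and the plain set intersection stands in for the encrypted entity alignment that a real vertical federated learning system would use.

```python
# Hypothetical customer records; every field name here is an assumption.
records = [
    {"id": 1, "age": 34, "income": 52000, "purchases": 12},
    {"id": 2, "age": 45, "income": 61000, "purchases": 3},
    {"id": 3, "age": 29, "income": 48000, "purchases": 7},
    {"id": 4, "age": 52, "income": 75000, "purchases": 1},
    {"id": 5, "age": 38, "income": 58000, "purchases": 9},
    {"id": 6, "age": 41, "income": 67000, "purchases": 5},
]

def horizontal_split(rows, n_parties):
    """Horizontal: each party holds different instances with the same attributes."""
    return [rows[i::n_parties] for i in range(n_parties)]

def vertical_split(rows, feature_groups):
    """Vertical: each party holds different attributes of the same instances."""
    return [
        [{"id": r["id"], **{f: r[f] for f in group}} for r in rows]
        for group in feature_groups
    ]

def aligned_ids(party_a, party_b):
    """Instances known to both parties (done in the clear here, for illustration only)."""
    return sorted({r["id"] for r in party_a} & {r["id"] for r in party_b})

h_parties = horizontal_split(records, 2)
v_parties = vertical_split(records, [["age", "income"], ["purchases"]])

# Suppose the bank knows customers 1-5 and the shopping site knows customers 2-6.
bank, shop = v_parties[0][:5], v_parties[1][1:]
common = aligned_ids(bank, shop)
print(common)
```

In the vertical case only the instances in `common` can contribute to joint training, which is why entity alignment is usually the first step.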

2.2. Privacy Mechanisms

Privacy mechanisms play a key role in federated learning and offer another way to group algorithms. The two state-of-the-art privacy mechanisms for federated learning-based data protection include differential privacy and cryptographic methods.

Differential privacy prevents the federated learning server from identifying the owner of a local update and ensures that a single record does not unduly influence the output of a function dwork2006our. Differential privacy adds a certain degree of noise to the original local update while furnishing theoretical guarantees on the model quality and protection against inference attacks on the model cheng2021secureboost; truex2019hybrid. However, due to the noise injected in the learning process, such systems tend to produce less accurate models.
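A common way to add noise to a local update is to clip its norm and then add Gaussian noise scaled to the clipping bound. The sketch below assumes that approach; the function names and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and calibrating the noise to a formal privacy budget is deliberately left out.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale the update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip the update, then add zero-mean Gaussian noise to every coordinate.

    The noise standard deviation is noise_multiplier * clip_norm. Choosing
    noise_multiplier for a target privacy budget is out of scope for this sketch.
    """
    rng = rng or random.Random()
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]

raw_update = [3.0, 4.0]  # L2 norm 5.0
clipped = clip_update(raw_update, clip_norm=1.0)
noisy = privatize_update(raw_update, clip_norm=1.0, noise_multiplier=0.5,
                         rng=random.Random(42))
print(clipped, noisy)
```

Clipping bounds how much any single client can move the aggregate, which is what makes the added noise meaningful; larger noise gives stronger protection at the cost of accuracy, as noted above.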

Cryptographic methods include homomorphic encryption and secure multi-party computation (SMC) bonawitz2017practical; chai2020secure; fontaine2007survey. The parties need to encrypt their messages before sending them and decrypt the encrypted output, leading to high computation overhead. The user privacy of these federated learning systems is usually well protected, as the cryptographic method guarantees that no party can learn anything except the output hardy2017private.
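The idea that parties learn nothing except the output can be illustrated with pairwise additive masking, one building block used in secure aggregation. This is a toy sketch under strong assumptions: updates are already quantized to integers, no client drops out, and a single seeded generator stands in for the pairwise key agreement a real protocol would use. All names are hypothetical.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo this value

def pairwise_masks(n_clients, length, seed=0):
    """Build one mask per client such that all masks sum to zero mod MODULUS.

    For each pair (i, j), client i adds a shared random vector and client j
    subtracts it. In a real protocol each pair derives this vector privately.
    """
    rng = random.Random(seed)
    masks = [[0] * length for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.randrange(MODULUS) for _ in range(length)]
            for k in range(length):
                masks[i][k] = (masks[i][k] + shared[k]) % MODULUS
                masks[j][k] = (masks[j][k] - shared[k]) % MODULUS
    return masks

def mask_update(update, mask):
    """What a client actually sends to the server."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def aggregate(masked_updates):
    """The server sums the masked vectors; the pairwise masks cancel out."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]

updates = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
masks = pairwise_masks(len(updates), length=3)
masked = [mask_update(u, m) for u, m in zip(updates, masks)]
total = aggregate(masked)
print(total)  # [111, 222, 333]
```

The server sees only the masked vectors, which look random on their own, yet their sum equals the sum of the true updates.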

2.3. Scale of Federated Learning

The scale of federation is another critical factor in designing effective algorithms. Based on the scale of data and the number of client nodes, federated learning can be labelled as Cross-device federated learning or Cross-silo federated learning.

Cross-device Federated Learning

Cross-device federated learning has many clients in an analogous domain with similar interests. This type is an excellent fit for IoT or mobile applications use-cases yu2020sustainable. Due to the significant number of clients, tracking and maintaining transaction history logs is not easy. Most clients connect using unreliable networks where participation in training rounds happens randomly. Similar to data partitioning in horizontal federated learning, resource allocation strategies like client selection and device scheduling are used to make updates.

Cross-silo Federated Learning

Cross-silo federated learning involves a small number of clients, typically ranging from 2 to 100 indexed devices, that are almost always available for training rounds. Cross-silo federated learning is more flexible than cross-device federated learning zhang2020batchcrypt. It is used in scenarios within organizations or within groups of organizations to train the machine learning model with their confidential data. Training data can be partitioned horizontally or vertically, with vertical learning methods resulting in significant communication bottlenecks and computation issues. Similar to vertical federated learning and federated transfer learning implementations, the inference information is restricted using the homomorphic encryption technique. The batch encryption technique reduces computation and communication costs zhang2020batchcrypt.

2.4 Open-source Federated Learning frameworks

PySyft, FATE, and TensorFlow Federated are a few of the open-source frameworks currently available to researchers and developers for exploring federated learning solutions. PySyft is written in Python on top of the PyTorch framework and provides a virtual hook for connecting to clients through a websocket port ryffel2018generic. The FATE framework provides production-ready APIs with Kubernetes integration to implement federated learning in horizontal, vertical, and transfer learning modes Fed). TensorFlow Federated Github includes integration with Google Kubernetes Engine (GKE) or a Kubernetes cluster for orchestrating interaction with clients and the central server for federated learning. Google's TensorFlow Federated (TFF) Github is one of the first attempts in the community to bring federated learning to practical reality, and Gboard applies federated learning to give Android mobile users next-word predictions on the local mobile phone keyboard.

3. An empirical study on current challenges in Federated Learning

The topic of federated learning is still in its infancy and will continue to be an active area of research for the foreseeable future. As federated learning evolves, so will the attack mechanisms, and hence it is essential to provide a broad overview of current challenges in federated learning. In this section, we outline the different challenges and potential vulnerabilities to improve the robustness of federated learning systems. Our ultimate goal is to investigate these areas, promote research collaboration, and develop general-purpose defense mechanisms robust against various attack modalities without degrading model performance.

While the applications are many, several challenges are associated with federated learning. These challenges can be broadly classified into two categories: security-related challenges and training challenges.

3.1 Security challenges

Security-related challenges include the privacy and security threats that arise due to the presence of an adversary who gains access to a user device and installs a malicious client that gains access to the black-box algorithm. Security attacks can be induced mainly by malicious actors in the learning process, and they can be either targeted or non-targeted. In targeted attacks, the adversary wants to influence the prediction on specific tasks, while in non-targeted attacks, the adversary’s motivation is to compromise the accuracy of the global model. We will concentrate on model poisoning attacks, data poisoning attacks, and inference attacks within the security vulnerabilities.

(A) Data Poisoning Attacks

In a data poisoning attack, an attacker poisons the training data of a certain number of participating devices during the learning process, compromising the accuracy of the global model. The attacker injects poisoned data directly into the targeted device or into other connected devices; this is one of the most commonly used attack techniques against machine learning models chen2017targeted; shafahi2018poison. Data poisoning attacks can also leverage gradient descent to generate adversarial training examples.

Label-flipping attacks and backdoor attacks fall under data poisoning attacks. In a label-flipping attack, an adversary alters its local data by flipping the labels of training instances from a source class to a target class while keeping the training data features intact, resulting in substantial drops in the global model's accuracy fung2018mitigating; biggio2012poisoning. In a backdoor poisoning attack, an adversary inserts backdoored inputs into local data to tweak individual features that are then transferred into the global model gu2017badnets.
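To make the label-flipping threat model concrete, the transformation applied to a compromised client's data can be written in a few lines. The dataset layout (a list of feature/label pairs) and the function name `flip_labels` are assumptions made for illustration.

```python
def flip_labels(dataset, source_class, target_class):
    """Relabel every source_class instance as target_class; features are untouched."""
    return [
        (features, target_class if label == source_class else label)
        for features, label in dataset
    ]

# Hypothetical local dataset of (features, label) pairs.
local_data = [([0.1, 0.9], 1), ([0.8, 0.2], 0), ([0.2, 0.7], 1), ([0.5, 0.5], 2)]
poisoned = flip_labels(local_data, source_class=1, target_class=0)
print([label for _, label in poisoned])  # [0, 0, 0, 2]
```

Because the features are unchanged, the poisoned data looks plausible on inspection, which is part of what makes this attack hard to spot from the data alone.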

Mitigation approaches can use PCA-based clustering and techniques such as the FoolsGold framework to defend against data poisoning attacks fung2018mitigating. Clustering approaches inspect model updates at the aggregator and group them into two clusters after applying dimensionality reduction. A cluster containing fewer than n/2 clients is flagged as a suspicious cluster of malicious clients. The FoolsGold scheme limits the contribution to the global model of potentially malicious clients that submit similar model updates by reducing their learning rates. It shows promising results when the training data is non-i.i.d. but fails when the training data is i.i.d., as it incorrectly penalizes honest clients with similar data distributions, resulting in substantial drops in test accuracy. Another way to defend against backdoor attacks is to identify malicious participants based on their model updates before model averaging in each round of learning.
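The intuition of penalizing clients whose updates look alike can be sketched as follows. This is a deliberately simplified illustration inspired by that intuition, not the FoolsGold algorithm itself: each client's weight is reduced according to its highest cosine similarity with any other client. The function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two update vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_weights(updates):
    """Give each client a weight in [0, 1] that shrinks as its update
    becomes more similar to another client's update."""
    weights = []
    for i, u in enumerate(updates):
        max_sim = max(
            (cosine(u, v) for j, v in enumerate(updates) if j != i), default=0.0
        )
        weights.append(1.0 - max(0.0, max_sim))
    return weights

def weighted_average(updates, weights):
    """Aggregate updates with the given weights; all-zero weights yield zeros."""
    total = sum(weights)
    if total == 0:
        return [0.0] * len(updates[0])
    return [sum(w * u[k] for w, u in zip(weights, updates)) / total
            for k in range(len(updates[0]))]

# Two diverse honest updates and two identical (colluding) updates.
updates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
weights = similarity_weights(updates)
global_update = weighted_average(updates, weights)
print(weights, global_update)
```

The same rule also shows the failure mode described above: if honest clients hold i.i.d. data, their updates are similar too, and they would be down-weighted along with the attackers.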

(B) Model Poisoning Attacks

A formidable challenge in federated learning is the possibility of an adversary initiating an attack to poison the local clients' models instead of the local data. The attacker compromises some of the local devices and modifies their local model parameters, thus shifting the model's boundary, introducing global model errors, and affecting its accuracy. For example, the attacker can compromise one or several participants and introduce a stealthy backdoor functionality into the global model. The attacker trains a model on the backdoor data using a constrain-and-scale technique and submits the resulting model, which replaces the global model with the attacker's backdoored model.

The most common model poisoning defense measures combine secure aggregation, anomaly detection, and participant-level differential privacy. Secure aggregation acts as a robust defense mechanism as the individual updates from each participant are invisible to the aggregator. Integrating these combinations of solutions into an automatic, predictable model helps prevent poisoning attacks bhagoji2019analyzing; bagdasaryan2020backdoor. Additional defense mechanisms against model poisoning attacks include rejections based on error rate, loss function, or a combination of both. In error rate-based rejections, the framework rejects the models with a significant effect on the error rate of the global model. In loss function-based rejections, the models with a significant impact on the loss function of the global model will be rejected.
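Loss function-based rejection can be sketched as a leave-one-out comparison: for each submitted model, the server measures how much the aggregated model's validation loss changes when that model is included. The sketch below assumes the server holds a small validation set and that each model is a single scalar weight for y = w*x; these assumptions and all function names are hypothetical.

```python
def mean(values):
    """Plain averaging, used here as the aggregation rule."""
    return sum(values) / len(values)

def validation_loss(w, data):
    """Mean squared error of the scalar model y = w * x on validation data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def loss_impacts(updates, aggregate_fn, loss_fn):
    """Loss with all updates minus loss with update i left out, for every i.
    A large positive value means update i makes the global model worse."""
    loss_all = loss_fn(aggregate_fn(updates))
    return [
        loss_all - loss_fn(aggregate_fn(updates[:i] + updates[i + 1:]))
        for i in range(len(updates))
    ]

def reject_by_loss(updates, aggregate_fn, loss_fn, n_reject):
    """Drop the n_reject updates with the largest harmful impact on the loss."""
    impacts = loss_impacts(updates, aggregate_fn, loss_fn)
    order = sorted(range(len(updates)), key=lambda i: impacts[i], reverse=True)
    worst = set(order[:n_reject])
    return [u for i, u in enumerate(updates) if i not in worst]

# Validation data follows y = 2x; the last submitted model is poisoned.
val_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
submitted = [2.1, 1.9, 2.0, -5.0]
kept = reject_by_loss(submitted, mean, lambda w: validation_loss(w, val_data),
                      n_reject=1)
print(kept)  # [2.1, 1.9, 2.0]
```

Error rate-based rejection follows the same pattern with the loss function replaced by a classification error rate; the two criteria can also be combined by rejecting the union of the flagged models.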

(C) Inference Attacks

Despite the substantial privacy promise of federated learning, inference attacks have demonstrated that, in some scenarios, it is possible to infer sensitive personal information from the training data used in model updates during the learning process. In inference attacks, an attacker can infer sensitive information to which no access is granted by querying the model several times or by using prevailing common knowledge. A commonly used method to mitigate inference attacks is differential privacy, which provides efficient statistical guarantees that limit what an adversary can learn su2018securing. Noise is added to the data to obscure sensitive items so that the other party cannot distinguish an individual's information, making it hard to restore the original data and thereby reducing the effectiveness of inference attacks shokri2017membership.

Other defense techniques include calibrated domain-specific data augmentation, which reduces the distinctiveness of model updates. Additionally, running the framework in a trusted execution environment and using secure computation are good defense techniques to counteract inference attacks. A Trusted Execution Environment (TEE) presents a secure platform for running the federated learning process with low computational overhead but is suitable only for CPU devices. The most commonly used methods in secure computation are homomorphic encryption and Secure Multiparty Computation (SMC). In homomorphic encryption, computations are executed on encrypted inputs without decrypting the data. In SMC, two or more parties agree to jointly compute on the inputs provided by the clients and expose the outputs only to a subset of clients aono2017privacy; melis2019exploiting.

3.2 Training challenges

Training-related challenges encompass the issues related to high dimensional models, the overhead required during multiple training iterations, and heterogeneity of the models participating in the learning. Here, we will focus on high dimensionality, heterogeneous architectures, optimization of defense mechanisms, and challenges of non-i.i.d. datasets.

(A) Curse of Dimensionality

Large models with high-dimensional parameter vectors are particularly susceptible to privacy and security attacks gao2019privacy; chang2019cronus. Most federated learning algorithms require overwriting the local model parameters with the global model, making them susceptible to poisoning and backdoor attacks. The adversary can make minuscule but detrimental changes in high-dimensional machine learning models without being detected. Thus, sharing the model parameters may not be an ideal design choice in federated learning as it opens all the model's internal state to inference attacks and maximizes the model's malleability by poisoning attacks. Understanding and determining the need for sharing model updates is essential to address these fundamental shortcomings of federated learning. Sharing less sensitive information or only sharing model predictions in a black-box manner can result in more robust privacy protection for federated learning gao2019privacy.

(B) Heterogeneous Architectures

Sharing model updates is presently limited to homogeneous federated learning architectures. However, it would be compelling to study how to extend federated learning to collaboratively train models with heterogeneous architectures and to investigate whether state-of-the-art privacy-preserving techniques are suited to such heterogeneous federated learning paradigms gao2019privacy; chang2019cronus.

(C) Decentralized Federated Learning

Decentralized federated learning is a potential learning framework for collaboration among businesses that do not trust any third party, as no centralized server is required in the system. In this setting, each party could be elected as the server in a round-robin manner. It would be interesting to examine whether existing threats to server-based federated learning apply in this scenario yang2019federated. Any adversarial participant can steal the training data from its neighbors if decentralized training is conducted in a "ring all reduce" manner lyu2019towards. Decentralization might also open new attack surfaces, as the last party selected as the server may be more likely to effectively poison the whole model if it chooses to insert backdoors. This scenario resembles server-based federated learning models, which were more vulnerable to backdoor attacks in later training rounds nearing convergence.

(D) Optimization for Defense Mechanisms

Federated learning servers incur an extra computational cost when deploying defense mechanisms to identify an adversary attacking the system. In addition, different defense mechanisms may vary in effectiveness against various attacks and incur different costs. Therefore, it is crucial to study optimization methods for deploying multiple defense mechanisms and deterrence measures. Game-theory frameworks hold exceptional promise in addressing this challenge.

(E) Challenges on non-i.i.d. datasets

Although many remedies have been recommended for handling non-i.i.d. data distributions in federated learning, many challenges remain open. Federated learning involves many hyperparameters, e.g., the number of local epochs, the total number of clients, and the client dropout probability, which vary from algorithm to algorithm, making it hard to benchmark the actual non-i.i.d. performance of these algorithms. Additionally, although a few real image datasets have been proposed, a universal homogeneous and heterogeneous benchmark dataset has still not emerged in the field of federated learning. Synthetic non-i.i.d. data generated by arbitrarily partitioning datasets may not effectively evaluate the performance of a method proposed for handling non-i.i.d. data. Though the vertical federated learning framework can be widely adopted in practical industry scenarios, only a handful of works have attempted to cope with the potential problems caused by non-i.i.d. distributions. The issue of overlapping data features, non-i.i.d. cases with both attribute and label skewness, and features with crowdsourcing skew deserve more attention.
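As an illustration of the kind of synthetic partitioning discussed above, the sketch below splits a labeled dataset across clients with label skew. Drawing per-class client proportions from a Dirichlet distribution is an assumption of this sketch (a commonly used choice, but not one prescribed here), and the names `dirichlet` and `label_skew_partition` are hypothetical. Smaller `alpha` values produce more skewed clients.

```python
import random
from collections import Counter

def dirichlet(alpha, k, rng):
    """Sample k proportions from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0:  # guard against numerical underflow for tiny alpha
        return [1.0 / k] * k
    return [d / total for d in draws]

def label_skew_partition(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients, class by class, with skewed proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = [[] for _ in range(n_clients)]
    for y, idxs in sorted(by_class.items()):
        rng.shuffle(idxs)
        props = dirichlet(alpha, n_clients, rng)
        start, cumulative = 0, 0.0
        for c, p in enumerate(props):
            cumulative += p
            # The last client takes whatever remains so that no index is lost.
            end = len(idxs) if c == n_clients - 1 else min(
                len(idxs), round(cumulative * len(idxs)))
            clients[c].extend(idxs[start:end])
            start = end
    return clients

labels = [i % 3 for i in range(90)]  # hypothetical dataset with 3 balanced classes
parts = label_skew_partition(labels, n_clients=4, alpha=0.5)
for c, idxs in enumerate(parts):
    print(c, dict(Counter(labels[i] for i in idxs)))
```

As the paragraph above cautions, such synthetic splits capture only label skew; they do not reproduce attribute skew or the other forms of heterogeneity found in real deployments.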

Privacy protection is an essential purpose of federated learning. Still, several methods designed to address non-i.i.d. data, such as data sharing and knowledge distillation, inevitably increase the risk of privacy exposure. It is still unclear to what extent these methods harm data privacy, as there are no quantitative measures to identify the degree of privacy leakage.

There is an increasing demand for Automated Machine Learning (AutoML) cui2019fast, and there are practical examples of self-learning within the field of Neural Architecture Search (NAS). However, only a limited amount of research on the influence of non-i.i.d. distributions on federated NAS has been reported he2020fednas; singh2020differentially. Adversarial training was primarily developed for i.i.d. data, and how to adapt it to non-i.i.d. settings remains a challenging problem.

As future directions, it would be worthwhile to have clearly defined quantitative criteria for measuring the degree of privacy leakage so that the maximum amount of shared data can be bounded. In order to compare federated learning algorithms fairly, the industry needs defined benchmark problems that reflect real-world requirements and challenges, along with standardized federated learning hyperparameter settings. Federated neural architecture search (FNAS) is an emerging research direction, and handling non-i.i.d. problems in FNAS along with vertical federated learning is an interesting future direction zhao2018federated.

4. Promising Research Directions

Federated learning finds excellent applications in almost every industry as it removes the barriers related to data sharing. Banking, financial services, healthcare, Internet-of-things (IoT), and natural language processing (NLP) applications related to next-word prediction and content suggestions represent promising areas to apply federated learning to increase data security and privacy.

4.1 Applications in Banking & Financial Services

One of the best uses of federated learning in finance is in the banking sector, for example in credit risk assessment cheng2020federated; yang2019federated. Typically, banks use white-listing techniques to screen out customers based on their credit card reports from the central banks. Factors such as taxation and reputation could be incorporated into risk management by collaborating with other financial institutions and e-commerce companies. Federated learning could help build a risk assessment machine learning model while keeping customers' information private among organizations. For instance, several banks could jointly generate a total credit score for a customer without sharing their data. With the development of research in federated learning, many companies or research teams can establish various tools oriented to federated learning-based research and subsequent product development kawa2019credit.

Financial institutions can train deep learning models in a federated setting by sending encrypted model weights and bias coefficients back and forth with the server yang2019federated. Federated learning systems maintain client confidentiality relating to the portfolio components and have been used to optimize expense ratios and pricing for portfolio management in banking and financial services. These techniques allow managers, financial advisors, and robo-advisors to connect with other investment banks that can provide a fair purchase price when buying or selling a client's portfolio. Federated learning systems can potentially improve current efforts to curb financial crimes such as money laundering and fraud, as well as enhance regulatory compliance efforts, such as with the GDPR, by enabling shared machine learning without sharing data.

Another avenue is the open banking ecosystem. Open banking empowers individual customers to own their banking data, with substantial potential benefits for customer experience, revenue, and the inclusion of more small and medium-sized players with innovative ideas and fine-grained service models long2020federated; brodsky2017data. The most impressive aspect of federated learning is its ability to decompose model training into distributed nodes and a centralized server without collecting private data. This kind of decomposed learning framework has great potential to protect users' privacy and sensitive data, and therefore, federated learning combines naturally with open banking data marketplaces. With federated learning, decentralized data ownership in the finance sector becomes foreseeable, thereby boosting a new ecosystem of data marketplaces and financial services. This just-in-time technology can learn intelligent models in a decentralized training manner.

4.2 Applications in IoT and smart retail

The concept of edge computing as computing simple queries across distributed, low-powered devices has been investigated for over a decade in the topic of fog computing, computing at the edge, and sensor networks. Federated learning is an excellent fit for resource-constrained mobile devices, Internet-of-things (IoT), industrial sensor applications, smart retail, and other privacy-sensitive use cases.

Modern IoT systems, such as wearable devices, smart homes, and autonomous vehicles, contain numerous sensors that collect and adapt to incoming data in real time. For example, a fleet of independent sensors for self-driving vehicles may require an up-to-date traffic or pedestrian behavior model to operate safely, as in traffic flow prediction samarakoon2018federated. Because existing autonomous driving models are trained offline, their decisions are constrained by the dynamic nature of the surroundings. Building aggregate models among various organizations may be challenging due to the limited connectivity of each device and the private nature of the data. Federated learning methods can help train online models that efficiently adapt to changes in these systems while maintaining user privacy tan2020federated; liu2020privacy.

Federated learning can also be an excellent application in smart retail as smart retail aims to use machine learning technology to provide personalized services to customers based on customer data such as user purchasing power and product characteristics for product recommendation and sales services zhao2020mobile.

4.3 Application in Healthcare

Electronic Health Records (EHR) are considered the primary healthcare data sources for machine learning applications miotto2018deep. Training machine learning models using only the limited data available in a single hospital might introduce bias in the predictions. Making the models more generalizable requires training with more data than a single hospital holds, which traditionally can only be realized by sharing data among organizations. Given the sensitive nature of the data, it might not be feasible to share patients' electronic health records among hospitals. Federated learning can be an excellent option for building a robust collaborative learning model and bringing together the research knowledge from different medical institutions min2019predictive.

4.4 Application in Natural Language Processing (NLP)

Machine learning and natural language processing have several facets: conversational dialogue systems, information extraction, structured prediction, clustering, language understanding, topic modeling, and ranking. NLP helps us better understand human language semantics. Building an NLP framework requires a considerable amount of data from multiple sources, such as mobile phones and tablets, to train highly accurate language models. However, privacy is a bottleneck for centralized language models, as the information from each edge device contains individual user information that needs to be protected. We can deploy federated learning methods to address these growing concerns about data risks, data rights, privacy, and security garcia2020decentralizing. It would be compelling to investigate the explainability of federated learning for NLP mathews2019explainable, especially interpreting how NLP models work under data heterogeneity.

5. Conclusion and Future Work

Federated learning is a new learning paradigm with a recent surge in popularity and helps train a high-quality shared global model with a central server from decentralized data scattered among several clients. As research in federated learning is still in the nascent stage, we believe that the issues presented in this paper are pivotal in shaping the developments in this area. We provided an overview of the existing system abstractions and building blocks for different federated learning systems. We have presented open-source tools to facilitate both the reproducibility of federated learning results and the dissemination of new solutions.

This article investigates the existing training and security challenges in federated learning and discusses the corresponding solutions. The proposed discussions can help build fully-fledged solutions for data privacy protection via federated learning. Applications related to finance, banking, healthcare, autonomous-driving, and the IoT ecosystem are promising candidate areas for future exploration for federated learning. Finally, we outlined a set of open problems that need to be addressed for federated learning to have a broader impact.