The proliferation of smart devices, mobile networks and computing technology has sparked a new era of the Internet of Things (IoT), which is poised to make substantial advances in all aspects of modern life, including smart healthcare systems and intelligent transportation infrastructure. With a huge number of smart devices connected together in IoT, we can access massive user data to yield insights, train task-specific machine learning models and ultimately provide high-quality smart services and products. To reap the benefits of IoT data, the predominant approach is to collect distributed user data in a central cloud for modeling and then transfer the trained model to user devices for task inference. This approach can be ineffective, as data transmission and model transfer result in high communication cost and latency. Moreover, it may raise data privacy concerns, as user-sensitive data must be uploaded to the remote cloud. An alternative is to train and update the models at each IoT device with its local data, in isolation from other devices. However, one key impediment of this approach lies in the high resource demand of deploying and training models on IoT devices with limited computational, energy and memory resources. Besides, insufficient data samples and local data shifts will lead to an even worse model.
A sophisticated solution for training on distributed data is federated learning, which enables collaboratively training a high-quality shared model by aggregating locally-computed updates uploaded by IoT devices. The primary advantage of this approach is the decoupling of model training from the need for direct access to the training data. Thus, federated learning is able to learn a satisfactory global model without compromising user data privacy. Nevertheless, three major challenges arise in key aspects of the federated learning process in complex IoT environments, making it unsuitable to directly deploy federated learning in IoT applications.
These three challenges faced by federated learning can be summarized as (1) device heterogeneity, such as varying storage, computational and communication capacities; (2) statistical heterogeneity, such as the non-IID nature of data generated by different devices; and (3) model heterogeneity, the situation where different devices want to customize their models to suit their application environments. Specifically, resource-constrained IoT devices can only train lightweight models under certain network conditions, which may further result in high communication cost, stragglers and fault tolerance issues that cannot be handled by traditional federated learning. As federated learning focuses on achieving a high-quality global model by extracting the common knowledge of all participating devices, it fails to capture the personal information of each device, resulting in degraded performance for inference or classification. Furthermore, traditional federated learning requires all participating devices to agree on a common model for collaborative training, which is impractical in realistic complex IoT applications.
To tackle these heterogeneity challenges, an obvious way is to perform personalization at the device, data and model levels to mitigate heterogeneity and attain a high-quality personalized model for each device. Due to its broad application scenarios, personalization has recently attracted great attention. We investigate the emerging personalized federated learning approaches that can be viable alternatives to traditional federated learning and summarize them into four categories: federated transfer learning, federated meta learning, federated multi-task learning and federated distillation. These approaches are able to alleviate different heterogeneity problems in complex IoT environments and can be promising enabling techniques for many emerging intelligent IoT applications.
In this article, we propose a synergistic cloud-edge framework named PerFit for personalized federated learning, which mitigates the device heterogeneity, statistical heterogeneity and model heterogeneity inherent in IoT applications in a holistic manner. To tackle the high communication cost issue in device heterogeneity, we resort to edge computing, which brings the necessary on-demand computing power into the proximity of IoT devices. Therefore, each IoT device can choose to offload its computationally-intensive learning task to the edge, which fulfills the requirement for fast processing and low latency. Besides, edge computing can mitigate privacy concerns by storing the data locally in proximity (e.g., in the smart edge gateway at home for smart home applications) without uploading the data to the remote cloud. Moreover, privacy and security protection techniques such as differential privacy and homomorphic encryption can be adopted to enhance the privacy protection level. For statistical and model heterogeneity, this framework also enables edge servers to jointly train a global model under the coordination of a central cloud server in a cloud-edge paradigm. After the global model has been trained by federated learning, different kinds of personalized federated learning approaches can then be adopted at the device side to enable personalized model deployments for different devices tailored to their application demands. We further illustrate a representative case study based on a specific application scenario, IoT-based activity recognition, which demonstrates the superior performance of PerFit in terms of high accuracy and low communication size.
The remainder of this article is organized as follows. The following section discusses the main challenges of federated learning in IoT environments. To cope with these challenges, we advocate a personalized federated learning framework based on a cloud-edge architecture and investigate some emerging solutions to personalization. Then we evaluate the performance of personalized federated learning methods with a motivating case study of human activity recognition. Finally, we conclude the article.
II Main Challenges of Federated Learning in IoT Environments
In this section, we elaborate on the main challenges and their potential negative effects on the performance of traditional federated learning in IoT environments.
II-A Device Heterogeneity
IoT applications typically involve a large number of devices that differ in hardware (CPU, memory), network conditions (3G, 4G, WiFi) and power (battery level), resulting in diverse computational, storage and communication capacities. Thus, device heterogeneity challenges arise in federated learning, such as high communication cost, stragglers and fault tolerance. In the federated setting, communication costs are the principal constraint, considering that IoT devices are frequently offline or on slow or expensive connections. In a federated learning process that performs synchronous updates, the devices with limited computing capacity become stragglers. Moreover, participating devices may drop out of the learning process due to connectivity and energy constraints, which negatively affects federated learning. As straggler and fault issues are very prevalent due to device heterogeneity in complex IoT environments, it is of great significance to address the practical issues of heterogeneous device communication and computation resources in the federated learning setting.
II-B Statistical Heterogeneity
Due to users’ different usage preferences and patterns, the personally-generated data from different devices may naturally exhibit non-IID and unbalanced distributions. For example, in healthcare applications, the distributions of users’ activity data differ greatly according to users’ diverse physical characteristics and behavioral habits. Moreover, the number of data samples across devices may vary significantly. This kind of statistical heterogeneity is pervasive in complex IoT environments. To address this heterogeneity challenge, the canonical federated learning approach, FederatedAveraging (FedAvg), has been demonstrated to work with certain non-IID data. However, FedAvg may suffer severely degraded performance when facing strongly skewed data distributions. Specifically, on the one hand, non-IID data result in weight divergence between the federated learning process and the traditional centralized training process, which indicates that FedAvg will finally obtain a worse model than centralized methods and thus perform poorly. On the other hand, FedAvg only learns the coarse features from IoT devices, while failing to learn the fine-grained information on a particular device.
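To make the notion of weight divergence concrete, the following minimal sketch computes a relative distance between two flattened weight vectors. The normalized form used here (distance divided by the norm of the centrally trained weights) is an assumption about how the quantity is usually measured, and the function and variable names are hypothetical.

```python
import numpy as np

def weight_divergence(w_federated, w_centralized):
    """Relative L2 distance between federated and centralized weights.

    Assumed definition: ||w_fed - w_cen|| / ||w_cen||.
    """
    w_federated = np.asarray(w_federated, dtype=float)
    w_centralized = np.asarray(w_centralized, dtype=float)
    return np.linalg.norm(w_federated - w_centralized) / np.linalg.norm(w_centralized)

# Hypothetical flattened weights from the two training schemes.
w_cen = np.array([1.0, -2.0, 0.5])
same = weight_divergence(w_cen, w_cen)          # identical models
doubled = weight_divergence(2.0 * w_cen, w_cen) # federated weights twice as large
```

A value of zero means both schemes reached the same weights; larger values indicate that local training on skewed data has pulled the averaged model away from the centralized solution.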
II-C Model Heterogeneity
In the original federated learning framework, participating devices have to agree on a particular architecture of the training model so that the global model can be obtained by simply aggregating the model weights gathered from local models. However, in practical IoT applications, different devices want to craft their own models suited to their application environments and resource constraints (e.g., computing capacity). They may also be unwilling to share model details due to privacy concerns. As a consequence, the architectures of different local models take various shapes, making it impossible to perform naive aggregation as in traditional federated learning. In this case, the problem of model heterogeneity becomes how to enable a deep network to understand the knowledge of others without sharing data or model details. Model heterogeneity inherent in IoT environments has attracted considerable research attention due to its practical significance for intelligent IoT applications.
III Cloud-Edge Framework for Personalized Federated Learning
As elaborated in Section II, device heterogeneity, statistical heterogeneity and model heterogeneity all exist in IoT applications and pose great challenges to traditional federated learning. The potential solutions for all these heterogeneities may come down to personalization. That is to say, by devising and leveraging more advanced federated learning methods, personal devices can craft their own models that meet their distinct specifications and are tailored to their own data distributions.
In this article, we advocate a personalized federated learning framework for intelligent IoT applications which tackles the heterogeneity challenges in a holistic manner. As depicted in Fig. 1, our proposed PerFit framework adopts a cloud-edge architecture, which brings the necessary on-demand edge computing power into the proximity of IoT devices. Therefore, each IoT device can choose to offload its intensive computing tasks to the edge (e.g., an edge gateway at home, an edge server at the office, or a 5G MEC server outdoors) via wireless connections, so that the requirements of IoT applications for high processing efficiency and low latency can be fulfilled.
To support collaborative learning for intelligent IoT applications, federated learning is then adopted between the edge servers and the remote cloud, which enables jointly training a shared global model by aggregating locally-computed models from the IoT users at the edge while keeping all the sensitive data on device. To tackle the heterogeneity issues, we further carry out personalization and adopt personalized federated learning methods to fine-tune the learning model for each individual device.
Specifically, the collaborative learning process in PerFit mainly consists of the following three stages as shown in Fig. 1:
Offloading stage. When the edge is trustworthy (e.g., edge gateway at home), the IoT device user can offload its whole learning model and data samples to the edge for fast computation. Otherwise, the device user will carry out model partitioning by keeping the input layers and its data samples locally on its device and offload the remaining model layers to the edge for device-edge collaborative computing.
Learning stage. The device and the edge collaboratively compute the local model based on personal data samples and then transmit the local model information to the cloud server. The cloud server aggregates the local model information submitted by participating edges and averages it into a global model, which is sent back to the edges. This exchange of model information repeats until the model converges after a certain number of rounds. A shared global model is thus obtained and then transmitted to the edges for further personalization.
Personalization stage. To capture the specific personal characteristics, each device will train a personalized model based on global model and its own personal information (i.e., local data).
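The aggregation step of the learning stage can be sketched as follows. This is a minimal illustration rather than the PerFit implementation: it assumes that each edge reports its parameters as a dictionary of NumPy arrays and that the cloud weights each edge by its number of local samples, as FedAvg commonly does. The names `fedavg_aggregate`, `edge_a` and `edge_b` are hypothetical.

```python
import numpy as np

def fedavg_aggregate(local_models, num_samples):
    """Average per-edge parameter dicts, weighted by local sample counts."""
    total = float(sum(num_samples))
    keys = local_models[0].keys()
    return {
        k: sum((n / total) * m[k] for m, n in zip(local_models, num_samples))
        for k in keys
    }

# Two hypothetical edges holding a tiny one-layer model.
edge_a = {"W": np.array([[1.0, 2.0]]), "b": np.array([0.0])}
edge_b = {"W": np.array([[3.0, 4.0]]), "b": np.array([1.0])}

# Edge B holds three times as much data, so it contributes 75% of the average.
global_model = fedavg_aggregate([edge_a, edge_b], num_samples=[100, 300])
```

In a full training loop this function would be called once per communication round, with the returned dictionary sent back to every edge as the starting point of the next round.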
The adopted personalized federated learning mechanism is the core of collaborative learning in PerFit, and it also determines what model information is exchanged between the cloud server and the edges. For example, only part of the model parameters may be transmitted under the specific setting of federated transfer learning, as we will discuss in the coming section. If different models are trained on different IoT devices, the output class probabilities of the local models can be encapsulated as local information and sent to the cloud server via federated distillation approaches, so that model heterogeneity can be mitigated. PerFit can flexibly integrate many kinds of personalized federated learning methods by transmitting different kinds of information. In the following section, we elaborate on some emerging personalized federated learning approaches that can mitigate the inherent heterogeneity challenges in IoT applications from different aspects.
IV Personalized Federated Learning Mechanisms
As mentioned above, there exist different types of heterogeneity in complex IoT environments, which pose new challenges for the federated learning paradigm when deployed in IoT applications. Generally, the heterogeneity challenges derive from the personal characteristics of the participating devices at the device, data and model levels. Thus we resort to some emerging personalized federated learning methods to mitigate these heterogeneities. These personalized federated learning schemes can be categorized into federated transfer learning, federated meta learning, federated multi-task learning and federated distillation, which are elaborated as follows.
IV-A Federated Transfer Learning
Transfer learning aims at transferring knowledge (i.e., the trained model parameters) from a source domain to a target domain. In the setting of federated learning, the domains are often different but related, which makes knowledge transfer possible. The basic idea of federated transfer learning is to transfer the globally-shared model to distributed IoT devices for further personalization in order to mitigate the statistical heterogeneity (non-IID data distributions) inherent in federated learning. Considering the architecture of deep neural networks and communication overload, there are two main approaches to perform personalization via federated transfer learning.
Chen et al. first train a global model through traditional federated learning and then transfer the trained global model back to each device. Each device is then able to build a personalized model by refining the global model with its local data. In model transfer, considering the different functions of different layers in deep networks, only the model parameters of specified layers are fine-tuned instead of retraining the whole model. As presented in Fig. 2 (a), the model parameters in the lower layers of the global model can be transferred and reused directly for the local model, as the lower layers of deep networks focus on learning common, low-level features. In contrast, the model parameters in the higher layers should be fine-tuned with local data, as they learn more specific features tailored to the current device.
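The layer-wise fine-tuning idea can be illustrated with a deliberately small example. The sketch below assumes a two-layer network with a ReLU hidden layer and a mean squared error objective; these choices, and all names in the code, are illustrative assumptions rather than the setup used in the cited work. Only the higher layer `W2` is updated with local data, while the lower layer `W1` received from the global model is reused as-is.

```python
import numpy as np

def mse(params, x, y):
    """Mean squared error of the two-layer network on (x, y)."""
    h = np.maximum(x @ params["W1"], 0.0)
    return float(np.mean((h @ params["W2"] - y) ** 2))

def fine_tune_head(params, x, y, lr=0.05, steps=50):
    """Update only the higher layer W2; the lower layer W1 stays frozen."""
    w1, w2 = params["W1"], params["W2"].copy()
    h = np.maximum(x @ w1, 0.0)  # frozen low-level features (ReLU)
    for _ in range(steps):
        # Gradient of the squared error w.r.t. W2, averaged over local samples.
        grad = 2.0 * h.T @ (h @ w2 - y) / len(x)
        w2 -= lr * grad
    return {"W1": w1, "W2": w2}

rng = np.random.default_rng(0)
# Hypothetical global model received from the cloud.
global_params = {"W1": 0.5 * rng.normal(size=(4, 8)), "W2": np.zeros((8, 1))}
# Hypothetical local data on one device.
x_local = rng.normal(size=(32, 4))
y_local = rng.normal(size=(32, 1))

personal = fine_tune_head(global_params, x_local, y_local)
loss_before = mse(global_params, x_local, y_local)
loss_after = mse(personal, x_local, y_local)
```

Because the lower layer is frozen, its activations can be computed once and reused across all fine-tuning steps, which is one reason this style of personalization is cheap on constrained devices.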
Arivazhagan et al. propose FedPer, which takes a different approach to personalization through federated transfer learning. FedPer advocates viewing deep learning models as base + personalization layers, as illustrated in Fig. 2 (b). The base layers act as shared layers that are trained collaboratively using an existing federated learning approach (e.g., the FedAvg method), while the personalization layers are trained locally, which enables them to capture the personal information of IoT devices. In this way, after the federated training process, the globally-shared base layers can be transferred to participating IoT devices, which combine them with their unique personalization layers to constitute their own personalized deep learning models. Thus, FedPer is able to capture the fine-grained information on a particular device for superior personalized inference or classification, and to address statistical heterogeneity to some extent. Besides, by uploading and aggregating only part of the model, FedPer incurs less computation and communication overhead, which is essential in IoT environments.
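The base + personalization split can be sketched as an aggregation that touches only the shared keys. The code below is an illustrative assumption, not FedPer's reference implementation: parameters are stored in dictionaries, the key names `base.W` and `personal.W` are hypothetical, and clients are averaged with equal weight for brevity.

```python
import numpy as np

BASE_KEYS = ("base.W",)  # shared layers, trained federatedly

def aggregate_base(clients):
    """Average only the base layers; personalization layers never leave the device."""
    return {k: np.mean([c[k] for c in clients], axis=0) for k in BASE_KEYS}

def install_base(client, base):
    """Return a client model with refreshed base layers and untouched personal layers."""
    updated = dict(client)
    updated.update(base)
    return updated

# Two hypothetical devices with different personalization layers.
dev_1 = {"base.W": np.array([1.0, 3.0]), "personal.W": np.array([10.0])}
dev_2 = {"base.W": np.array([3.0, 5.0]), "personal.W": np.array([-7.0])}

shared = aggregate_base([dev_1, dev_2])
dev_1_new = install_base(dev_1, shared)
dev_2_new = install_base(dev_2, shared)
```

Only the entries listed in `BASE_KEYS` are uploaded, so the per-round payload shrinks in proportion to the size of the personalization layers that stay on the device.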
IV-B Federated Meta Learning
Federated learning in IoT environments generally faces statistical heterogeneity such as non-IID and unbalanced data distributions, which makes it challenging to ensure high-quality performance for each participating IoT device. To tackle this problem, some researchers concentrate on improving the FedAvg algorithm by leveraging the personalization power of meta learning. In meta learning, the model is trained by a meta-learner which learns from a large number of similar tasks, and the goal of the trained model is to quickly adapt to a new similar task from a small amount of new data. By regarding the similar tasks in meta learning as the personalized models for the devices, it is a natural choice to integrate federated learning with meta learning to achieve personalization through collaborative learning.
Jiang et al.  propose a novel modification of FedAvg algorithm named Personalized FedAvg by introducing a fine-tuning stage using model agnostic meta learning (MAML), a representative gradient-based meta learning algorithm. Thus, the global model trained by federated learning can be personalized to capture the fine-grained information for individual devices, which results in an enhanced performance for each IoT device. MAML is flexible to combine with any model representation that is amenable to gradient-based training. Besides, it can learn and adapt quickly from only a few data samples.
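The interplay between a shared initialization and fast per-device adaptation can be sketched on a toy problem. The example below uses a first-order approximation of MAML (second-order terms are dropped) rather than the full algorithm, and it assumes each device has the quadratic loss 0.5 * ||w - c_i||^2 with a device-specific optimum c_i. It also reuses the same data for the inner and outer steps. All of these are simplifying assumptions for illustration, and the names are hypothetical.

```python
import numpy as np

def fomaml_step(w, centers, alpha=0.1, beta=0.5):
    """One first-order meta-update over all devices.

    Device i has loss 0.5 * ||w - centers[i]||^2, whose gradient is w - centers[i].
    """
    adapted = w - alpha * (w - centers)           # one inner gradient step per device
    meta_grad = (adapted - centers).mean(axis=0)  # gradient evaluated at adapted weights
    return w - beta * meta_grad

def personalize(w, center, alpha=0.1):
    """Fine-tuning stage: one local gradient step from the shared initialization."""
    return w - alpha * (w - center)

def loss(w, center):
    return 0.5 * float(np.sum((w - center) ** 2))

# Hypothetical per-device optima.
centers = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
w = np.zeros(2)
for _ in range(60):
    w = fomaml_step(w, centers)

w_dev1 = personalize(w, centers[1])
```

On this toy problem the shared initialization settles at the mean of the device optima, and a single local step already lowers each device's own loss, which mirrors the fine-tuning stage described above.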
Since the federated meta learning approach often utilizes complicated training algorithms, it has higher implementation complexity than the federated transfer learning approach. Nevertheless, the learned model by federated meta learning is more robust and can be very useful for those devices with very few data samples.
IV-C Federated Multi-Task Learning
In general, federated transfer learning and federated meta learning aim to learn a shared model of the same or similar tasks across the IoT devices with fine-tuned personalization. Along a different line, federated multi-task learning (MTL) aims at learning multiple distinct tasks for different devices simultaneously and tries to capture the model relationships among them. Through these model relationships, the model of each device may be able to benefit from other devices’ information. Moreover, the model learned for each device is always personalized. As shown in Fig. 3, the model-specific layers aim to capture the relationships among the models of different learning tasks, while the task-specific layers are for personalization. Therefore, each device is able to attain a well-trained personalized model and yield better performance. By integrating MTL, federated multi-task learning collaboratively trains local models for IoT devices together with their model relationships under a federated learning framework, so as to mitigate statistical heterogeneity and obtain high-quality personalized models for all participating IoT devices.
Smith et al. develop a distributed optimization method, MOCHA, within a federated multi-task learning framework. To address high communication cost, MOCHA allows flexibility in local computation, which directly benefits communication, since performing additional local computation results in fewer communication rounds in the federated setting. To mitigate stragglers, the authors propose to compute the local update only approximately on devices with limited computing resources. Besides, an asynchronous updating scheme is an alternative approach for straggler avoidance. Furthermore, by allowing participating devices to drop out periodically, MOCHA is robust to device faults. As device heterogeneity inherent in complex IoT environments is critical to the performance of federated learning, federated multi-task learning is of great significance for intelligent IoT applications.
IV-D Federated Distillation
In the original federated learning framework, all clients have to agree on a particular architecture for the model trained on both the global server and the local clients. However, in some realistic business settings, such as healthcare and finance, each participant may have the capacity and desire to design its own unique model, and may not be willing to share the model details due to privacy and intellectual property concerns. This kind of model heterogeneity poses a new challenge to traditional federated learning.
To tackle this challenge, Li et al. propose FedMD, a new federated learning framework that enables participants to independently design their own models by leveraging the power of knowledge distillation. In FedMD, each client needs to translate its learned knowledge into a standard format that can be understood by others without sharing data or model architecture. A central server then collects this knowledge to compute a consensus, which is further distributed to the participating clients. The knowledge translation step can be implemented by knowledge distillation, for example, using the class probabilities produced by the client model as the standard format, as shown in Fig. 4. In this way, the cloud server aggregates and averages the class probabilities for each data sample and then distributes them to clients to guide their updates. Jeong et al. propose federated distillation, where each client treats itself as a student and regards the mean model output of all the other clients as its teacher’s output. The teacher-student output difference provides the learning direction for the student. It is worth noting that, to operate knowledge distillation in federated learning, a public dataset is required because the teacher and student outputs should be evaluated on identical training data samples. Moreover, federated distillation can significantly reduce the communication cost, as it exchanges model outputs rather than model parameters.
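The two aggregation rules described above can be sketched directly on arrays of class probabilities. The code is a simplified illustration that follows the description in this section, not the reference implementation of either method: it assumes every client evaluates the same public samples, uses cross-entropy against soft labels as the distillation loss, and all function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def consensus(client_probs):
    """Server side: average class probabilities over clients, per public sample."""
    return np.mean(client_probs, axis=0)

def teacher_for(client_probs, i):
    """Student-teacher variant: mean output of all clients except client i."""
    return np.delete(client_probs, i, axis=0).mean(axis=0)

def distill_loss(student_probs, teacher_probs, eps=1e-12):
    """Cross-entropy of the student's outputs against the teacher's soft labels."""
    return float(-(teacher_probs * np.log(student_probs + eps)).sum(axis=1).mean())

rng = np.random.default_rng(0)
# 3 hypothetical clients, 5 public samples, 10 classes; model architectures may differ.
client_probs = np.stack([softmax(rng.normal(size=(5, 10))) for _ in range(3)])

cons = consensus(client_probs)
teacher_0 = teacher_for(client_probs, 0)
loss_0 = distill_loss(client_probs[0], cons)
```

Each client would then take gradient steps on its own model to reduce this loss on the public samples, so that only arrays of shape (samples, classes) ever cross the network.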
IV-E Data Augmentation
As users’ personally-generated data naturally exhibit highly-skewed and non-IID distributions, which may greatly degrade model performance, emerging works focus on data augmentation to facilitate personalized federated learning. Zhao et al. propose a data-sharing strategy that distributes a small amount of global data, uniformly distributed over classes, from the cloud to the edge clients. In this way, the highly-unbalanced distribution of client data can be alleviated to some extent, and the model performance after personalization can be improved. However, directly distributing the global data to edge clients imposes a great risk of privacy leakage, so this approach requires a trade-off between data privacy protection and performance improvement. Moreover, the distribution difference between the globally shared data and a user’s local data can also bring performance degradation.
To rectify unbalanced and non-IID local datasets without compromising user privacy, some over-sampling techniques and deep learning approaches with generative ability are adopted. For example, Jeong et al. propose federated augmentation (FAug), where the clients collectively train a generative model and each client thereby augments its local data towards an IID dataset. Specifically, each edge client identifies the labels lacking in its data samples, referred to as target labels, and then uploads a few seed data samples of these target labels to the server. The server oversamples the uploaded seed data samples and then trains a generative adversarial network (GAN). Finally, each device can download the trained GAN’s generator to replenish its target labels until reaching a balanced dataset. With data augmentation, each client can train a more personalized and accurate model for classification or inference based on the generated balanced dataset.
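The first step of this procedure, identifying target labels and how many samples to replenish, can be sketched as below. The threshold `min_count` for deciding that a label is lacking, and the choice to top every target label up to the size of the largest class, are illustrative assumptions that the description above does not specify; the function names are hypothetical. Training the GAN itself is outside the scope of this sketch.

```python
import numpy as np

def find_target_labels(labels, num_classes, min_count):
    """Return labels with fewer than min_count local samples, plus all class counts."""
    counts = np.bincount(labels, minlength=num_classes)
    return np.flatnonzero(counts < min_count), counts

def samples_to_generate(counts, targets):
    """Number of generated samples needed to bring each target label up to the largest class."""
    return {int(c): int(counts.max() - counts[c]) for c in targets}

# Hypothetical local label set on one device: class 2 is rare, class 3 is absent.
local_labels = np.array([0, 0, 0, 1, 1, 1, 2])
targets, counts = find_target_labels(local_labels, num_classes=4, min_count=2)
needed = samples_to_generate(counts, targets)
```

The resulting dictionary tells the device how many samples of each target label to draw from the downloaded generator.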
V Case Study
In this section, we first describe the experiment settings and then evaluate different personalized federated learning approaches in terms of accuracy and communication size.
V-A Dataset and Implementation Details
In the experiments, we focus on human activity recognition task based on a publicly accessible dataset called MobiAct . Each volunteer participating in the generation of MobiAct dataset wears a Samsung Galaxy S3 smartphone with accelerometer and gyroscope sensors. The tri-axial linear accelerometer and angular velocity signals are recorded by embedded sensors while volunteers perform predefined activities. There are ten kinds of activities recorded in MobiAct, such as walking, stairs up/down, falls, jumping, jogging, step in a car, etc. To practically mimic the environment of federated learning, we randomly select 30 volunteers and regard them as different clients. For each client, we take a random number of samples for each activity and finally, each client has 480 samples for model training. The test data for each client is composed of 160 samples under a balanced distribution.
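One way to mimic such a client split is sketched below. The text does not state how the random number of samples per activity is drawn, so the Dirichlet-multinomial sampling used here is purely an assumption for illustration, as are the function name and the seed. The sketch only generates per-class sample counts; loading the actual sensor recordings is omitted.

```python
import numpy as np

NUM_CLIENTS, NUM_CLASSES = 30, 10
TRAIN_PER_CLIENT, TEST_PER_CLIENT = 480, 160

def sample_client_counts(rng, num_classes=NUM_CLASSES, total=TRAIN_PER_CLIENT):
    """Draw a skewed class distribution, then split `total` samples according to it."""
    proportions = rng.dirichlet(np.ones(num_classes))
    return rng.multinomial(total, proportions)

rng = np.random.default_rng(0)
# One row of per-activity training counts for each client.
train_counts = np.stack([sample_client_counts(rng) for _ in range(NUM_CLIENTS)])

# A balanced test set gives every activity the same number of samples.
test_counts = np.full(NUM_CLASSES, TEST_PER_CLIENT // NUM_CLASSES)
```

Every row of `train_counts` sums to 480 while the per-activity shares differ from client to client, and the balanced test set works out to 16 samples per activity when all ten activities are used.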
Specifically, we study the personalization performance of two widely-adopted approaches: federated transfer learning (FTL) and federated distillation (FD). We design two kinds of models for training on both the cloud server and the clients: 1) a multi-layer perceptron composed of three fully-connected layers with 400, 100 and 10 neural units (521,510 total parameters), which we refer to as the 3NN; 2) a convolutional neural network (CNN) with three convolutional layers (the first with 32 channels, the second with 16, the last with 8, with each of the first two layers followed by a max-pooling layer), a fully-connected layer with 128 units and ReLU activation, and a final softmax output layer (33,698 total parameters). Both the 3NN and the CNN are trained by a minibatch stochastic gradient descent (SGD) optimizer with a learning rate of 0.01.
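As a sanity check on the 3NN figure, the parameter count of a fully-connected network can be computed from its layer widths. The input dimension is not stated in the text; the sketch below assumes every layer has a bias term and solves for the input size implied by 521,510 parameters, so the resulting value is an inference rather than a reported setting. The helper name is hypothetical.

```python
def mlp_param_count(input_dim, layer_units):
    """Count weights and biases of a fully-connected network."""
    total, fan_in = 0, input_dim
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total

REPORTED_3NN_PARAMS = 521_510

# Parameters of the second and third layers, which do not depend on the input size.
tail = mlp_param_count(400, [100, 10])

# The first layer contributes 400 * input_dim weights plus 400 biases.
implied_input_dim = (REPORTED_3NN_PARAMS - tail - 400) // 400
check = mlp_param_count(implied_input_dim, [400, 100, 10])
```

Under these assumptions the reported total is matched exactly by a 1,200-dimensional input vector.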
V-B Experimental results
We compare the performance of personalized federated learning with both the centralized scheme and traditional federated learning. Centralized CNN (cCNN) collects a large amount of data (the training data of the 30 clients in our experiment) in a centralized cloud server to train a satisfactory CNN model. In the traditional federated setting, we adopt the CNN as both the cloud and the client model and train a single global model by FederatedAveraging (FedAvg), which combines local stochastic gradient descent (SGD) on each client with a cloud server that performs model averaging. The well-trained global model is then directly distributed to edge clients for classification or inference. For FTL, each client fine-tunes the model downloaded from the cloud server with its personal data, while in personalized federated distillation, each client can customize its own model according to its own constraints.
Fig. 5 illustrates the test accuracy of the 30 clients under different learning approaches. We can see that under the coordination of a central cloud server, the edge clients in traditional federated learning (FL) are able to collectively reap the benefits of each other’s information without compromising data privacy and achieve a competitive average accuracy of similar to cCNN. However, as the data distribution on which the globally shared model is trained differs from that of each client, the global model may perform poorly on some clients. For example, the accuracy of some clients may be lower than while some clients can reach a high accuracy of more than . With personalization performed by each client with its own data, the average accuracy of FTL can reach . Moreover, the accuracies of the 30 clients vary within a very narrow range, which indicates that personalization can significantly reduce the performance degradation caused by non-IID distributions. The federated distillation approach has an accuracy improvement of compared with FL, and the performance differences between clients have also been narrowed.
The critical nature of communication constraints in cloud-edge scenarios also needs to be considered in the federated setting because of limited bandwidth and slow, expensive connections. We compare both the accuracy and the communication size of different training models for FTL and FD. In FTL-3NN and FTL-CNN, we utilize the 3NN and the CNN, respectively, as the model trained on both the cloud and the edge clients. For federated distillation, we consider two cases: (1) FD-1: 10 clients choose the 3NN as their local model while the remaining 20 clients choose the CNN; (2) FD-2: the local models of 20 clients are 3NN and the models of the remaining 10 clients are CNN. As depicted in Fig. 6, all four personalized federated learning methods can achieve a high accuracy of more than . However, the communication sizes vary dramatically. As all these methods converge within hundreds of communication rounds, we only compare the communication size in each communication round. The communication payload size for FTL depends on the number of model parameters, which is 521,510 and 33,698 for FTL-3NN and FTL-CNN, respectively. In contrast, the communication size for FD is proportional to the output dimension, which is 10 in our human activity recognition task. In each communication round, we randomly select 500 samples from the globally-shared data and transmit the class scores predicted by each participating device to the cloud server, so the communication size for both FD-1 and FD-2 is 5000. Fig. 6 shows that we are able to achieve superior prediction performance with lightweight models or small communication overhead, which is of great significance to cloud-edge based IoT systems.
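The per-round payload figures quoted above follow from simple arithmetic, reproduced in the short sketch below. It counts transmitted scalar values only; converting to bytes would require an assumption about the numeric precision, which the text does not state. The variable names are hypothetical.

```python
# Per-round payload, in number of transmitted scalar values.
PUBLIC_SAMPLES_PER_ROUND = 500
NUM_CLASSES = 10

payload = {
    "FTL-3NN": 521_510,  # all 3NN parameters
    "FTL-CNN": 33_698,   # all CNN parameters
    "FD-1": PUBLIC_SAMPLES_PER_ROUND * NUM_CLASSES,  # class scores only
    "FD-2": PUBLIC_SAMPLES_PER_ROUND * NUM_CLASSES,
}

# How many times larger each payload is than the federated distillation payload.
ratio_to_fd = {name: size / payload["FD-1"] for name, size in payload.items()}
```

With these numbers, FTL-3NN sends roughly 104 times as many values per round as federated distillation, and FTL-CNN roughly 6.7 times as many; the FD payload is independent of which local architectures the clients choose.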
In this article, we propose PerFit, a personalized federated learning framework in a cloud-edge architecture for intelligent IoT applications with data privacy protection. By leveraging the merits of edge computing, PerFit enables learning a globally-shared model by aggregating local updates from distributed IoT devices. To tackle the statistical, device and model heterogeneities in IoT environments, PerFit can naturally integrate a variety of personalized federated learning methods and thus achieve personalization and enhanced performance for devices in IoT applications. We demonstrate the effectiveness of PerFit through a case study of a human activity recognition task, which corroborates that PerFit can be a promising approach for enabling many intelligent IoT applications.
- (2019) Federated learning with personalization layers. arXiv preprint arXiv:1912.00818.
- (2019) FedHealth: a federated transfer learning framework for wearable healthcare. arXiv preprint arXiv:1907.09173.
- (2019) Cartel: a system for collaborative transfer learning at the edge. In Proceedings of the ACM Symposium on Cloud Computing, pp. 25–37.
- (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
- (2018) Communication-efficient on-device machine learning: federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479.
- (2019) Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488.
- (2019) FedMD: heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581.
- (2019) Federated learning: challenges, methods, and future directions. arXiv preprint arXiv:1908.07873.
- (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
- (2017) Hybrid mobile edge computing: unleashing the full potential of edge computing in mobile device use cases. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 935–944.
- (2017) Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434.
- (2016) The mobiact dataset: recognition of activities of daily living using smartphones. In ICT4AgeingWell, pp. 143–151.
- (2019) Federated learning for healthcare informatics. arXiv preprint arXiv:1911.06270.
- (2018) Federated learning with non-iid data. arXiv preprint arXiv:1806.00582.