Data Heterogeneity-Robust Federated Learning via Group Client Selection in Industrial IoT

Nowadays, the industrial Internet of Things (IIoT) plays an integral role in Industry 4.0 and produces massive amounts of data for industrial intelligence. These data reside on decentralized devices in modern factories. To protect the confidentiality of industrial data, federated learning (FL) was introduced to collaboratively train shared machine learning models. However, the local data collected by different devices are skewed in class distribution, which degrades industrial FL performance. This challenge has been widely studied at the mobile edge, but existing approaches ignore the rapidly changing streaming data and the clustering nature of factory devices, and, more seriously, they may threaten data security. In this paper, we propose FedGS, a hierarchical cloud-edge-end FL framework for 5G-empowered industries, to improve industrial FL performance on non-i.i.d. data. Taking advantage of naturally clustered factory devices, FedGS uses a gradient-based binary permutation algorithm (GBP-CS) to select a subset of devices within each factory and build homogeneous super nodes participating in FL training. We then propose a compound-step synchronization protocol to coordinate the training process within and among these super nodes, which shows great robustness against data heterogeneity. The proposed methods are time-efficient, can adapt to dynamic environments, and do not expose confidential industrial data to risky manipulation. We prove that FedGS has better convergence performance than FedAvg and give a relaxed condition under which FedGS is more communication-efficient. Extensive experiments show that FedGS improves accuracy by 3.5% and reduces training rounds by 59% on average, demonstrating superior effectiveness and efficiency on non-i.i.d. data.



I Introduction

In recent years, the Internet of Things (IoT) has played an increasingly integral role in the industrial community. Take logistics sorting and automatic object identification as examples. Optical character recognition (OCR) cameras on logistics pipelines detect and read characters on packing boxes in order to sort them[1]. At the same time, the surrounding surveillance cameras constantly monitor the scene, automatically identifying objects by optically recognizing the characters on their badges and confirming whether the machines, robots, vehicles, and workers in the factory are legal entrants[2]. These optical sensors collect a huge amount of industrial data. To tap the value of these data, advanced data mining technologies are needed, especially machine learning (ML). However, gathering industrial big data in the cloud leads to unbearable transmission overhead and also violates data privacy regulations. Borrowing the idea of task offloading, federated learning (FL)[3] sinks model training from the cloud to the edge. OCR cameras use local optical data to train local OCR models, then upload their local model updates to the cloud to update the global model. The global model is then synchronized back to the OCR cameras. These steps are repeated until the global model converges. In this way, FL preserves data confidentiality because the raw data never leave the devices.
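The round-based workflow just described can be sketched in a few lines. This is a minimal illustration, not the paper's system: the toy linear model, synthetic client data, and hyperparameters are all our own stand-ins.

```python
import numpy as np

def local_update(weights, data, lr=0.1):
    """One gradient-descent step on a device's local data (toy linear model, MSE loss)."""
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, client_datasets):
    """One FL round: each client trains locally, the server averages the updates."""
    local_models = [local_update(global_weights.copy(), d) for d in client_datasets]
    sizes = np.array([len(d[1]) for d in client_datasets], dtype=float)
    # Data-size-weighted average of local models, as in FedAvg
    return np.average(local_models, axis=0, weights=sizes / sizes.sum())

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
w = np.zeros(3)
for _ in range(5):          # repeat until convergence (truncated here)
    w = federated_round(w, clients)
```

Only the model parameters `w` cross the network; the tuples in `clients` stay on their devices.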

The combination of industrial IoT (IIoT) and FL opens a door for smart industry[4, 5, 6]. FL provides powerful privacy-preserving tools for mining decentralized industrial data, and IIoT technologies such as smart sensors and mobile robots provide rich resources (e.g., data, computation) for FL. Despite these benefits, compared with OCR in natural scenes, FL in industries requires higher accuracy to ensure the reliability of industrial operations. However, sensors' local data distributions can be highly heterogeneous due to differences in time, location, function, and so on. Taking the logistics industry of cross-border e-commerce as an example, the Singapore warehouse ships more packing boxes to Singapore than to other countries, so the optical characters in the word “Singapore” appear more often than other characters. For this reason, the number of optical images of each character captured by different OCR cameras (in different warehouses) is skewed and inconsistent. Other examples are device failure detection[7] and object detection[8], which also demonstrate the existence of data heterogeneity in real-world IIoT. Such skewed data are called non-independent and identically distributed (non-i.i.d.) and can lead to FL performance degradation[9], which becomes even more challenging when sensors' local data are constantly changing.

The non-i.i.d. data challenge has inspired the research field of heterogeneous FL, especially in mobile edge computing (MEC)[5, 10, 12, 13, 11, 14, 15, 16, 17, 18, 19], where it currently remains open. These works have achieved great success in the context of MEC, but the following characteristics of IIoT limit their applicability:

  1. Higher requirements for data security. Industries (e.g., manufacturing, logistics, and transportation) often face even more serious data threats because they own vast amounts of valuable information, so they have the most urgent and critical requirements for data security. Therefore, any form of disclosure[10, 11, 12, 13] or tampering[14, 15, 16] of the confidential raw data is not allowed.

  2. Rapidly changing streaming data on data-intensive sensors. Data-intensive IIoT sensors such as OCR cameras require high sampling rates to capture real-time phenomena and produce large amounts of data. To save storage space, new data overwrite old data that have already been processed, forming a data stream similar to a first-in-first-out (FIFO) queue. In such a dynamic environment, static approaches, for example, [17] and the K-Center clustering algorithm in [18], no longer work for IIoT.

  3. Natural geographical clustering property. In modern industrial parks, IIoT devices in each factory are geographically adjacent, which naturally clusters them into groups interconnected by highly reliable networks, for example, through regional 5G base stations (see Fig. 1). However, this valuable property is often ignored, and the rich communication resources at the edge are not fully utilized[12, 19, 20], which limits the improvement of industrial FL.

The above characteristics distinguish “FL in IIoT” from “FL in non-IIoT” (e.g., FL in Edge). Little work has been proposed to tackle the non-i.i.d. data challenge of FL in IIoT, such as approaches based on centroid distance weighted averaging[7], reinforcement learning[21], and kmeans-based cohorts[22]. However, none of them take into account the changing local data distribution or the natural geographic clustering property of devices in IIoT. More importantly, they do not address the fundamental problem causing FL model performance to degrade, namely the divergence in class distributions[9].

Fig. 1: A network architecture example in the modern industrial park. End devices within each factory submit locally trained ML models to nearby 5G edge base stations. Edge servers synchronize these models and upload the synchronized models to the cloud server for global synchronization. The cloud center can be located on the cloud to synchronize ML models from multiple industrial parks, or it can be a micro data center located in an industrial park to synchronize ML models of edge servers in this park.

To address the root cause of the non-i.i.d. data problem, this paper aims to propose an effective approach to minimize the divergence in class distributions among heterogeneous devices. Taking advantage of the natural property of geographical clustering, we can select a subset of devices in each factory to construct “FL super nodes” with consistent class distributions. These super nodes can be treated as homogeneous clients participating in FL training, without exposing the confidential industrial FL process to risky data manipulation.

However, designing such an approach is not trivial. Firstly, selecting a subset of devices in each group to minimize the class distribution divergence among groups is a 0-1 integer programming problem with vector weight constraints, which we prove to be NP-complete. More challenging, this procedure needs to be invoked frequently to adapt to rapidly changing local data and the mobility of mobile IIoT devices (e.g., robots, drones), which places high demands on execution latency. Secondly, even if class distributions among FL super nodes are forced to be homogeneous, devices' local data within each FL super node can still be skewed. If not handled properly, these challenges will still degrade FL model performance.

To minimize the data heterogeneity among groups and realize efficient client selection, this paper proposes a novel gradient-based binary permutation optimizer, GBP-CS, to solve the above NP-complete client selection problem. GBP-CS runs a constraint-preserving gradient descent optimization procedure directly in the 0-1 integer space and can build homogeneous FL super nodes in a very short time. Then, we propose Federated Group Synchronization (FedGS), a hierarchical cloud-edge-end FL framework for 5G-empowered modern industries, to improve industrial FL performance on non-i.i.d. data. FedGS uses a compound-step synchronization protocol to train ML models, which suppresses data heterogeneity within and among FL super nodes. More specifically, FedGS uses a single-step synchronization protocol (e.g., SSGD[23]) within super nodes because of its robustness against data heterogeneity, and a multi-step synchronization protocol (e.g., FedAvg[24]) among homogeneous super nodes to reduce communication overhead. Theoretical analysis shows that FedGS has both a tighter convergence upper bound and a smaller optimality gap than FedAvg in the presence of non-i.i.d. data, and can be more time-efficient under a relaxed condition. Finally, we evaluate FedGS on the most widely adopted non-i.i.d. benchmark dataset, FEMNIST[25], and compare it with 10 advanced approaches, including FedAvg, FedMMD[26], FedFusion[27], FedProx[28], IDA[29], CGAU[30], FedAvgM[31], and FedAdagrad, FedAdam, and FedYogi from [32]. The main contributions of this paper are summarized as follows.

  • We propose a hierarchical cloud-edge-end FL framework FedGS for 5G empowered modern industries, which uses a novel compound-step synchronization protocol to coordinate the training process within and among groups. The new protocol is robust against data heterogeneity and can effectively improve industrial FL performance.

  • We propose a novel GBP-CS algorithm to select a subset of devices from each group to build homogeneous FL super nodes, which can find a desirable selection strategy in a very short time. GBP-CS is a general optimizer for constrained 0-1 integer programming problems and can be used for other practical cases such as game matching.

  • We analyze the convergence rate and optimality gap of FedGS and give a relaxed condition under which FedGS is more time-efficient than FedAvg. Theoretical results show that FedGS not only converges closer to the optimal, but also faster.

  • Extensive experiments compared to 10 advanced approaches show that FedGS improves FL accuracy by 3.5% and reduces training rounds by 59% on average. The results highlight the superior effectiveness and efficiency of FedGS on non-i.i.d. data.

II Related Work

In this section, we categorize related works into four types according to the techniques they use.

Data Sharing and Augmentation. This type of approach aims to minimize the class distribution divergence among devices by sharing or augmenting FL clients' local datasets. For the sharing-based approaches, Zhao et al. propose to distribute a small portion of globally shared data (e.g., openly available data) to clients' devices[9]. Yao et al. collect metadata shared by voluntary clients to perform controllable meta updating[10]. Yoshida et al. reward FL clients for contributing local datasets and propose a hybrid learning mechanism wherein the server updates the model using the shared data and clients' local models[11]. These approaches achieve a considerable improvement in FL accuracy, but they risk leaking private data because clients' local datasets must be shared. Besides, openly available datasets do not always exist, especially in fields where data is highly confidential.

For the augmentation-based approaches, Duan et al. observe that the imbalance among different classes can also degrade FL accuracy[14]. Hence, they augment classes with fewer samples by simple random offset, rotation, cropping, and scaling. Jeong et al.[33] propose to generate new samples using a globally trained conditional generative adversarial network (CGAN) to build unskewed local datasets. Similarly, Wang et al. generate synthetic data for the minority class based on linear interpolation to re-balance local datasets on edge devices. These approaches avoid the leakage of FL clients' private data. However, they still bring credibility crises. Speculative clients can use the synthetic data generated out of thin air to participate in FL training while hiding their original data. Also, they can pretend that the synthetic data is a large volume of high-quality data to gain more rewards. Therefore, these operations (i.e., data sharing, data augmentation) are high-risk and should be prohibited.

Hyperparameter Tuning. Hyperparameters play an important role in FL training convergence. Some works have explored hyperparameter tuning, such as tuning the number of local iterations and the learning rate. Wang et al. point out that the optimal performance can be achieved when the number of local iterations equals one[34]. However, constrained resources (e.g., bandwidth, time, power) prevent us from doing this. In practice, large numbers of local iterations are more commonly used. For example, Yu et al. carefully set the number of local iterations and obtain a considerable convergence rate[35]. In addition, Li et al. point out that decaying the learning rate is necessary for FL convergence with large local iterations[36]. For a strongly convex and smooth objective function, FedAvg can converge to the optimum after applying learning rate decay, with a convergence rate of O(1/T), where T is the total number of local updates on a single device. These works give rigorous proofs for convergence analysis, which guide follow-up optimization of FL. However, carefully tuning these hyperparameters (e.g., the number of local iterations, learning rate, and decay rate) requires multiple attempts and incurs high time costs.

Client-Side Adaption. This type of approach emphasizes that FL clients should adaptively retain global knowledge while improving local knowledge. Several works integrate the two. Yao et al. point out that the global model contains more global knowledge and should be kept as a reference rather than simply thrown away. Based on this idea, they adopt a two-stream model to transfer the global knowledge to the local model[26]. By minimizing the maximum mean discrepancy (MMD) loss, the two-stream model can extract more generalized features and learn better local representations. Then, in [27], they use convolution, vector weighted average, and scalar weighted average operators to fuse the global and local features. Li et al. point out that too many local updates will cause the FL training to diverge, especially under the non-i.i.d. data setting[28]. Hence, they add a proximal penalty term to local objective loss functions to constrain the local model to be closer to the global model and avoid excessive divergence. Rieger et al. point out that clients express representations in different patterns and their shared knowledge may be obfuscated after synchronization[30]. Hence, they adopt conditional gated activation units to enable clients to condition their units. In this way, clients can identify whether a global feature is expressed and how to modulate the global pattern. These approaches impose more storage footprint and computation on resource-constrained client-side devices, requiring higher resource allocation and also higher energy consumption.

Server-Side Adaption. This type of approach explores how local models can be adaptively aggregated and how the global model can be adaptively optimized on the server side. Yeganeh et al. aggregate clients' local models using an adaptive weighting approach based on the inverse distance between the local model parameters and the averaged model parameters[29]. With this approach, out-of-distribution models are weighed down and the global model can have a lower variance. The authors also explored combinations with other metrics, such as the training accuracy and the data size. However, these variants did not perform well in our experiments, probably because some honest but “out-of-distribution” devices were over-suppressed. On the other hand, inspired by the ability of momentum accumulation to dampen oscillations[37], Hsu et al. adopt the momentum optimizer on the server side and observe a significant improvement in FL accuracy[31]. Then, Reddi et al. introduce three advanced adaptive optimizers (i.e., Adagrad[38], Adam[39], and Yogi[40]) to update the server-side global model[32]. These adaptive federated optimizers enable the use of adaptive learning rates for different gradients and achieve great success, but unfortunately, they also require careful tuning of initial learning rates, and we observed drastic accuracy oscillation in our experiments.

III System Model

Symbol Explanation
Loss function (e.g., cross-entropy).
Maximum training rounds.
Number of iterations in each round.
, Number of devices (in factory ).
Number of devices to be selected per factory.
Number of devices to be randomly pre-sampled per factory.
Number of devices to be selected by GBP-CS per factory.
Number of factories (also the number of groups).
Number of classification classes.
Local learning rate.
Parameters of the global ML model at -th iteration.
Parameters of the ML model on BS at -th iteration.
Parameters of the local ML model on device in factory at -th iteration.
Local dataset of device in factory .
A mini-batch data of device in factory at -th iteration.
Size of local dataset of device in factory .
, Batch data size (of device in factory ).
Total size of data batches in factory .
Data size vector of label classes of mini-batch data .
Data size matrix of of devices.
Total data size vector of the pre-sampled devices.
Set of all devices in factory .
Set of selected devices in factory at -th iteration.
Local data distribution of device in factory .
Local data distribution of mini-batch data .
Mean data distribution of over selected devices .
Real-world global data distribution.
TABLE I: Summary of Main Symbols.

IoT devices in modern industrial parks can be divided into two types: fixed devices (e.g., monitoring cameras, temperature and humidity sensors) and mobile devices (e.g., patrol drones and logistics robots). In the industrial park, owing to its improved performance, reduced communication cost, and decentralized scalability, FL plays an important role in many industrial applications. For example, an anomaly detection application based on on-device federated monitors could be applied to IIoT scenarios, where sensing and monitoring devices may be located in harsh environments with high voltage and high radiation and may move around the factory, making it impractical for them to access wired networks[41]. 5G mobile networks are being enhanced to support key performance features of industrial applications such as high throughput, low latency, and high scalability[42]. These features enable industrial FL applications to transmit model data of a large number of IIoT devices at a high cycle frequency, with a high data rate and low latency. Therefore, we consider a hierarchical cloud-edge-end network architecture empowered by 5G cellular wireless networks for training industrial FL applications, as shown in Fig. 1.

In this case, a modern industrial park contains several factories, and each factory hosts a number of smart devices. We consider the devices in the same factory as a group. These devices are equipped with embedded computing chips to perform lightweight processing, for example, training ML models on local data streams. The devices connect to the nearby base station (BS) through 5G cellular wireless networks. Then, BSs exchange FL model data with the cloud center through the Internet. Specifically, each device of a factory collects streaming sensory data in real time, continually changing its local dataset and thus its local class distribution. The class distributions of different devices can be highly heterogeneous due to diverse local usage patterns and generally differ from the real-world global class distribution. The goal of industrial FL is to find the optimal model parameters that minimize the global loss function,
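The displayed objective was lost in extraction. The standard federated objective it follows can be written, in our own notation (with $M$ factories, $N_m$ devices in factory $m$, local datasets $D_{m,n}$, and local losses $F_{m,n}$; these symbols are ours, since the originals did not survive), as:

```latex
\min_{w}\; F(w) \;=\; \sum_{m=1}^{M} \sum_{n=1}^{N_m}
  \frac{|D_{m,n}|}{\sum_{m'=1}^{M}\sum_{n'=1}^{N_{m'}} |D_{m',n'}|}\, F_{m,n}(w)
```

That is, the global loss is the data-size-weighted average of the devices' local empirical losses.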


The traditional workflow is briefly described below. In each round, a small subset of devices is randomly selected to participate in FL training. These devices use their local datasets for several epochs to train their local ML models and upload these local models to the connected BSs. The BSs aggregate these local models and upload the aggregated models to the top server in the cloud. The top server globally aggregates the BSs' models, updates the global ML model, and synchronizes the updated model back to all BSs and end devices. These steps are repeated until the global model parameters converge. However, this approach suffers performance degradation due to data heterogeneity among devices, and it ignores the streaming nature of industrial data. Hence, in Section IV, a data heterogeneity-robust federated group synchronization approach is presented to address this issue.

We summarize the main symbols used in this paper in Table I. The framework and workflow of the proposed FedGS are illustrated in Fig. 2.

Fig. 2: Overview of FedGS framework and workflow. 1⃝ Each BS selects devices to form a super node. 2⃝ Each selected device trains its local model for one mini-batch SGD step. 3⃝ Each BS synchronizes local models in its group. 4⃝ Steps 1⃝2⃝3⃝ loop for a number of iterations before every external synchronization 4⃝.

IV FedGS: Framework and Workflow

The core idea is to strategically select a small subset of devices in each group to form FL super nodes with homogeneous data distributions. These super nodes can then be regarded as homogeneous clients participating in FL. To resolve the heterogeneity in the local datasets of devices inside each super node, the one-step synchronization protocol (e.g., SSGD) is useful because it has been proved equivalent to centralized SGD, which makes it robust against data heterogeneity. Meanwhile, the multi-step synchronization protocol (e.g., FedAvg) can be used among super nodes to keep FedGS communication-efficient, since the class distributions of super nodes are aligned. In this way, the problem of data heterogeneity is decomposed from the entire population of devices to a small number of devices in multiple groups, where it can be solved efficiently and effectively by the compound-step synchronization protocol, and the performance degradation of industrial FL is addressed.

The detailed design is given in Alg. 1. In the initialization stage, the top server first initializes the global ML model parameters and synchronizes them to the BSs. Then, it collects local class distributions from all devices to estimate the real-world global class distribution by Eq. (2), where each device's contribution is weighted by its local data size and a probability normalization function is applied.
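A concrete sketch of this estimation step: assuming each device reports only a per-class sample-count vector (so no raw data leaves the device; the counts below are made up), the server can estimate the global class distribution as a normalized sum.

```python
import numpy as np

def estimate_global_distribution(local_counts):
    """Estimate the global class distribution from per-device label counts.

    local_counts: list of length-C integer vectors, one per device, giving
    how many samples of each class the device holds (this already encodes
    the data-size weighting of Eq. (2)).
    """
    total = np.sum(local_counts, axis=0).astype(float)
    return total / total.sum()  # probability normalization

# Three toy devices with highly skewed 3-class data
counts = [np.array([90, 5, 5]), np.array([10, 80, 10]), np.array([0, 10, 90])]
p_g = estimate_global_distribution(counts)
```

Here `p_g` is the estimate later used as the target for client selection.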

Client Selection. In each iteration , each BS selects devices from its group to obtain a homogeneous super node via the Select-Clients-Via-GBP-CS interface. The detailed algorithm GBP-CS is presented in Section V.

Local Training. In each group , each selected device fetches a mini-batch of data from local dataset with batch size . These mini-batch streaming data are one-shot and will not be used again. Then, the device downloads the model from the connected BS and trains for one mini-batch gradient descent step with learning rate ,


Internal Synchronization. The locally trained model is uploaded to the BS for internal synchronization by Eq. (4), where the normalizing factor is the total data size of all used mini-batches in the group. Then, the BS updates its model with the aggregated result and synchronizes it in its group.
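A minimal sketch of this weighted aggregation, with NumPy arrays standing in for model parameters and mini-batch sizes as weights (the function name and toy values are illustrative, not the reference code):

```python
import numpy as np

def internal_sync(local_models, batch_sizes):
    """BS-side internal synchronization: average device models, weighting
    each by the size of the mini-batch it was trained on."""
    weights = np.asarray(batch_sizes, dtype=float)
    weights /= weights.sum()  # normalize by the total mini-batch data size
    return sum(w * m for w, m in zip(weights, local_models))

# Two toy 2-parameter models trained on mini-batches of size 1 and 3
merged = internal_sync([np.array([1.0, 1.0]), np.array([3.0, 3.0])], [1, 3])
```

With batch sizes 1 and 3, the second model receives weight 0.75, so `merged` is `[2.5, 2.5]`.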

We call the above client selection, local training, and internal synchronization a one-step synchronization iteration, because the local update on each device is performed only once before each synchronization. The one-step synchronization loops for a number of iterations before each round of external synchronization.

External Synchronization. Every time the one-step synchronization has been performed for the prescribed number of iterations, the BSs upload their models to the top server for global aggregation,


The globally aggregated model is used to update the global model on the top server and is then synchronized to the BSs.

Since the external synchronization is performed after every set of one-step synchronization iterations, we call it a multi-step synchronization round. The above steps are repeated for the maximum number of rounds to obtain the converged model parameters.

The above workflow can be seen as an equivalent version of FedAvg that performs local updates on FL super nodes with larger batch sizes but homogeneous local datasets among super nodes. By capitalizing on an effective client selection strategy to make these super nodes homogeneous, the FL training process becomes robust against data heterogeneity, and FL model performance is improved. In the following section, we give our solution, GBP-CS.
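The compound-step loop can be sketched with toy scalar "models", where each device is represented by the mean of its local data. Everything here is an illustrative stand-in: random sampling replaces GBP-CS, and the one-step update is a single SGD-like pull toward the device's data.

```python
import numpy as np

def fedgs_loop(bs_models, groups, gamma=2, rounds=2, k=2, rng=None):
    """Toy sketch of the compound-step protocol: `gamma` one-step internal
    synchronizations per external round."""
    rng = rng if rng is not None else np.random.default_rng(0)
    global_model = float(np.mean(bs_models))
    for _ in range(rounds):
        for _ in range(gamma):                       # one-step internal iterations
            for g, devices in enumerate(groups):
                # client selection stand-in (FedGS uses GBP-CS here)
                selected = rng.choice(len(devices), size=k, replace=False)
                # each selected device takes one SGD-like step toward its data
                locals_ = [bs_models[g] - 0.1 * (bs_models[g] - devices[i])
                           for i in selected]
                bs_models[g] = float(np.mean(locals_))   # internal synchronization
        global_model = float(np.mean(bs_models))         # external synchronization
        bs_models = [global_model] * len(bs_models)
    return global_model

final = fedgs_loop([0.0, 0.0], [[0.0, 1.0, 2.0], [4.0, 5.0, 6.0]])
```

Note the structure: selection and internal synchronization happen every iteration, while external synchronization happens once per round, matching Alg. 1.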

1:Number of iterations in each round ; Maximum training rounds ; Number of groups ; Number of selected devices per group .
2:Well-trained global FL model .
3:Initialize and and estimate by Eq. (2);
4:for each internal synchronization  do
5:     for each BS in in parallel do
6:         Client Selection: Select devices from group to form a homogeneous super node : Select-Clients-Via-GBP-CS();
7:         for each device in in parallel do
8:              Local Training: Fetch a mini-batch of data and update the local model by Eq. (3);
9:         end for
10:         Internal Synchronization: BS aggregates local models by Eq. (4) and updates its model;
11:         if  then
12:              External Synchronization: The top server globally aggregates by Eq. (5), and synchronizes the updated global model to BSs: ;
13:         end if
14:     end for
15:end for
16:return ;
Algorithm 1 Federated-Group-Synchronization (Main)

V Client Selection Via GBP-CS

In this section, we formulate the client selection problem as a 0-1 integer programming problem with vector weight constraints and present our novel gradient-based binary permutation approach, GBP-CS, which solves this problem in an acceptably short time with a desirable solution.

V-A Problem Modeling

Given a factory (i.e., group) with a set of industrial devices, the next data batch of each device follows the device's local distribution, with a data size vector over the label classes. Our goal is to select a subset of devices from the group at each iteration to form an FL super node whose overall class distribution satisfies,


Note that if the selection were deterministic, Eq. (6) would always find the same device set and other devices would have no chance of being selected. To keep client selection random so that each device has the same probability of being selected, we use a trick: we randomly pre-sample some devices before strategically selecting the remaining ones. Formally speaking, in each group, we first sample a number of devices at random, whose next data batches have a given total class-wise data size vector. Then, the remaining devices are further selected from the rest of the group, whose data size matrix is given, with the goal of minimizing Eq. (6).

We use the following mathematical model to describe the above problem. Let and , the objective is to find a solution , where and , that


Let the batch sizes of all data batches be the same; then we obtain a simplified model,
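The displayed equations (10)-(13) did not survive extraction. A reconstruction of the simplified model in our own notation (data size matrix $D$ of the remaining candidates, pre-sampled class counts $c$, equal batch size $b$, total number of selected devices $B$ of which $B'$ are pre-sampled, and target distribution $P_g$; all of these symbol names are ours) might read:

```latex
\min_{x}\;\; \left\| \frac{D^{\top} x + c}{b\,B} - P_g \right\|_2
\qquad \text{s.t.}\quad x \in \{0,1\}^{n},\qquad \mathbf{1}^{\top} x = B - B'
```

That is, a 0-1 least-squares problem whose cardinality constraint fixes how many candidates are selected.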


Note that in the mathematical model above, we considered data size and data quality in the objective (i.e., minimizing the distribution divergence) for client selection, but assumed that IIoT devices have similar hardware capabilities. If system heterogeneity must also be considered, ESync[43] is compatible and can be useful. We now give a simple proof that the above problem is NP-complete.

Lemma 1 (Problem A)

Given an integer matrix and an integer vector , the goal is to find whether there is a 0-1 vector such that . This 0-1 integer programming problem is NP-complete[44].

Proposition 1 (Problem B)

Let Lemma 1 hold and constrain the number of 1s in the solution vector to a fixed cardinality (Eq. (13)). This variant problem is also NP-complete.

Proof 1

To solve problem A, we can solve problem B once for each possible value of the cardinality constraint, from 0 up to the size of the solution vector. This input transformation has linear complexity. Problem B outputs YES if Eq. (10) can reach 0; otherwise, it outputs NO. This output transformation has constant complexity. Hence, problem A reduces to problem B in polynomial time, which makes problem B also NP-complete.

We can see from Proposition 1 that it is almost impossible to find the optimal solution in polynomial time. To make FedGS time-efficient, a sub-optimal but fast solution is preferred, as described in the following subsection.

V-B Gradient-based Binary Permutation Client Selection

To make the NP-complete client selection problem tractable, we propose a novel gradient-based approximate approach, GBP-CS. The core idea is to permute (0,1) pairs of binary selection variables with the steepest opposite gradients. In other words, the variable with a selection value of 0 and the smallest gradient is permuted with the variable with a selection value of 1 and the largest gradient. In this way, the number of variables whose selection value equals one (i.e., the vector weight constraint) is maintained, and constraints (12) and (13) remain satisfied. This binary permutation operation is performed iteratively to minimize the objective Eq. (10).

1:Number of selected devices per group ; Device set of group ; Global class distribution .
2: selected devices .
3:Pre-sample devices at random, and construct from and from ;
4:Initialize , by Eq. (11), and by Eq. (14);
5:Calculate the distance ;
6:repeat
7:     Calculate the gradient ;
8:     Select an index pair by Eq. (15)-(16);
9:     Make a copy and permute and by Eq. (17);
10:     Update the distance ;
11:     ;
12:until ;
13:Construct the set of selected devices , where is defined as Eq. (18);
14:return ;
Algorithm 2 Select-Clients-Via-GBP-CS

The pseudo code of GBP-CS is given in Alg. 2. Given the data size matrix and , GBP-CS first initializes the solution variable as follows,


where is the Moore-Penrose inverse solution, and denotes setting the largest values of to 1 and the others to 0. Then, GBP-CS calculates the objective distance and the gradient .

The gradient indicates the opposite direction in which should be updated. The greater the absolute value of , the larger the reduction in that can be obtained by updating , so appears as a key gradient [45]. Based on this idea, GBP-CS selects a pair of selection variables () with opposite key gradients for permutation. More specifically, is the identity of the device with selection value 0 and the smallest gradient,


Similarly, is the identity of the device with the selection value of 1 and the largest gradient,


Then, GBP-CS permutes the values of and to obtain a new solution ,


Eqs. (15)-(17) will be repeated until the objective distance no longer decreases. Finally, we can construct


and obtain the set of selected devices .
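Under the assumption that the objective Eq. (10) takes the squared-distance form f(x) = ||D x - c||^2 between the selected group's class totals and a target vector (the function names, matrix `D`, and target `c` below are hypothetical stand-ins for the paper's symbols), the GBP-CS loop of Eqs. (15)-(17), with its MPInv initialization, can be sketched in NumPy as:

```python
import numpy as np

def gbp_cs(D, c, m, max_iter=100):
    """Sketch of GBP-CS for the assumed objective f(x) = ||D @ x - c||^2,
    with exactly m entries of x set to 1 (assumes 0 < m < n).
    D: (classes, devices) data-size matrix; c: target class vector."""
    n = D.shape[1]
    # MPInv initialization: least-squares solution, then top-m thresholding.
    x_ls = np.linalg.pinv(D) @ c
    x = np.zeros(n, dtype=int)
    x[np.argsort(x_ls)[-m:]] = 1

    dist = np.sum((D @ x - c) ** 2)
    for _ in range(max_iter):
        g = 2 * D.T @ (D @ x - c)              # gradient of f w.r.t. x
        zeros = np.where(x == 0)[0]
        ones = np.where(x == 1)[0]
        i = zeros[np.argmin(g[zeros])]          # value 0, smallest gradient
        j = ones[np.argmax(g[ones])]            # value 1, largest gradient
        x_new = x.copy()
        x_new[i], x_new[j] = 1, 0               # permute the (0, 1) pair
        new_dist = np.sum((D @ x_new - c) ** 2)
        if new_dist >= dist:                    # stop once the distance
            break                               # no longer decreases
        x, dist = x_new, new_dist
    return x, dist
```

Each permutation preserves the number of ones, so the vector weight constraint holds throughout, mirroring Alg. 2.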

GBP-CS has a complexity of , where and is the number of GBP-CS iterations. In our experiments, GBP-CS obtains a near-optimal solution with execution efficiency comparable to the random sampling approach.

VI Performance Analysis

In this section, we analyze the optimality gap and convergence rate of FedGS, and qualitatively compare them with those of FedAvg in the presence of non-i.i.d. data. Then, we give the condition under which FedGS is time efficient.

VI-A Convergence Analysis

Assumption 1

The local functions are -strongly convex, -smooth, and -Lipschitz.

Proposition 2

For the gradient on FL node in the federated setting and the gradient in the centralized setting, we have an upper bound of the gradient divergence proportional to the distribution divergence .

Proof 2

Note that captures the impact of divergence in class distributions . Generally speaking, the smaller the distribution divergence, the smaller the upper bound of the gradient divergence .

Proposition 3

Let , the convergence upper bound of FedGS is , and its optimality gap is bounded by .

Proof 3

As mentioned above, the convergence performance of FedGS is theoretically equivalent to that of FedAvg, in which FL super nodes run mini-batch SGD with batch size for local iterations in each round. Then, the convergence upper bound of FedGS after iterations can be inferred from Lemma 2 in [34],

where . When , the optimality gap is bounded by

Since GBP-CS forces FL super nodes to have aligned class distributions, FedGS has a smaller upper bound on the gradient divergence than FedAvg . It then follows from Proposition 3 that FedGS has both a smaller convergence upper bound and a smaller optimality gap than FedAvg, and thus improves FL convergence speed and accuracy.

VI-B Time-Efficiency Condition

We analyze the time cost of FedGS in each round ( one-step synchronization iterations) and that of FedAvg in each round ( local iterations), and give the condition under which FedGS can achieve higher time efficiency than FedAvg.

Time Cost of FedGS. The time cost of FedGS  is determined by communication, computation, and client selection. The communication delays are brought by internal synchronizations and external synchronizations . For the internal synchronization, the delay of uploading local models of size from devices to their BS is , and the delay of synchronizing the model of size from the BS to devices is , where and are the uplink and downlink bandwidths between devices and BS, and and are the received signal-to-noise ratios (SNRs) of the BS and devices. For the external synchronization, the delay of uploading models of size from BSs to the top server is , and the delay of synchronizing the global model of size from the top server to BSs is , where and are the uplink and downlink bandwidths between BSs and the top server, and is the SNR of the top server. Hence, the communication time costs for each internal synchronization and each external synchronization are as follows,


Let the delay of each local update be and the delay of the client selection procedure be . Since the internal synchronization is performed times in each round, the total delay is


Time Cost of FedAvg. The time cost of FedAvg is mainly determined by communication and computation, because its client selection procedure is simple random sampling, whose delay is negligible. The communication delays come from the uplink and downlink model transmission between the top server and devices. The delay of uploading local models of size from devices to the top server is , and the delay of synchronizing the global model of size from the top server to devices is . Hence, the communication time cost for each round of synchronization is as follows,


Then, the total time cost when FedAvg performs local updates and one round of synchronization is


In order to simplify the analysis result, we make the following assumptions.

Assumption 2

(a) The uplink and downlink bandwidths are equal: , ; (b) The SNRs of the top server, BSs, and devices are equal: .

Then, we can give the following condition for hyperparameter setting, under which FedGS can achieve higher time efficiency than FedAvg.

Proposition 4

Let Assumption 2 hold and . Then the time cost per iterations in each round satisfies if , where

Proof 4

Eq. (24) can be obtained by combining Eqs. (19)-(21), and Eq. (25) by combining Eqs. (22)-(23). Letting , we have

In our experiment, GBP-CS is quite fast, whose time cost (15 milliseconds) is negligible compared to other delays. Therefore, we assume to simplify the result and obtain

In modern industrial applications, 5G enables indoor industrial use cases that were impossible before, supported by high data rates, ultra-low delay, and extreme density of wireless communications [46]. In reality, the data rate of the 5G edge is about 10-100 times that of the WAN, that is, . Therefore, we can easily set to satisfy the condition in Proposition 4, so that the time efficiency of FedGS is guaranteed.
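The intuition behind Proposition 4 can be illustrated with a toy per-round latency model. All numeric values below are illustrative assumptions (not measurements from the paper, except the ~15 ms selection delay): as the edge-to-WAN rate ratio grows, the extra cost of FedGS's internal synchronizations shrinks relative to the WAN synchronization both frameworks pay.

```python
# Toy per-round latency model illustrating the intuition of Proposition 4.
# All numeric values are illustrative assumptions, not measurements.
MODEL_MBIT = 200.0    # model size in megabits (assumed)
WAN_MBPS = 50.0       # WAN data rate to the top server (assumed)
T_LOCAL = 0.2         # delay of one local update, seconds (assumed)
T_SELECT = 0.015      # GBP-CS selection delay, ~15 ms as reported in the paper
TAU = 10              # synchronization iterations per round (assumed)

def round_time_fedgs(edge_mbps):
    """tau one-step internal syncs at the 5G edge + one external WAN sync."""
    t_int = 2 * MODEL_MBIT / edge_mbps      # upload + download at the edge
    t_ext = 2 * MODEL_MBIT / WAN_MBPS       # upload + download over the WAN
    return TAU * (T_LOCAL + T_SELECT + t_int) + t_ext

def round_time_fedavg():
    """tau local updates + one device-to-cloud WAN synchronization."""
    return TAU * T_LOCAL + 2 * MODEL_MBIT / WAN_MBPS

# With a 5G edge 10-100x faster than the WAN, the internal-sync overhead of
# FedGS becomes a small fraction of the shared WAN synchronization cost.
for ratio in (10, 100):
    overhead = round_time_fedgs(ratio * WAN_MBPS) - round_time_fedavg()
    print(f"edge/WAN = {ratio:3d}x -> extra FedGS cost per round: {overhead:.2f} s")
```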

Fig. 3: Distribution divergence optimization curves of GBP-CS with different initializers.
(a) Distribution Divergence
(b) Execution Time
(c) Optimization Curve
Fig. 4: Comparison of (a) distribution divergence, (b) execution time, and (c) optimization curve among different samplers.

VII Experimental Evaluation

VII-A Experiment Setup

Environment and Hyperparameter Setup. In the experiment, we consider an IIoT application where OCR technology is used to identify packing boxes, machines, robots, vehicles, and workers by recognizing the optical characters on their badges. To this end, we aim to train a high-accuracy OCR model in the federated setting, where sensors' local character images are confidential and skewed in class distribution. The real-world FEMNIST [25] dataset is chosen to train our federated OCR model, as it partitions 805,263 optical digit and character images across 3,550 devices, following non-i.i.d.-like class distributions and uneven data sizes. Our experiment platform contains OCR cameras and factories; each factory has OCR cameras (hereinafter referred to as devices). In each iteration, devices are selected from each factory to participate in the federated OCR training. A four-layer convolutional neural network [Conv2D(32), MaxPool, Conv2D(64), MaxPool, Dense(2048), Dense(62)] is used as the training model because it is lightweight and suitable for resource-constrained industrial devices. Unless otherwise specified, we use standard mini-batch SGD to train local ML models, with the learning rate , the batch size , the number of iterations per round , and the maximum number of rounds .
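As a sanity check on the model's footprint, its parameter count can be tallied by hand. The 5x5 kernels and halving pooling below are assumptions, since the paper specifies only the layer widths and the 28x28 FEMNIST input:

```python
# Parameter count of the four-layer CNN used for federated OCR training.
# The 5x5 kernels, 'same' padding, and 2x2 pooling are assumptions; the
# paper specifies only the layer widths and the 28x28 FEMNIST input.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out      # kernel weights + biases

def dense_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + biases

h = w = 28                                   # FEMNIST image size
p1 = conv_params(5, 1, 32)                   # Conv2D(32)
h, w = h // 2, w // 2                        # MaxPool halves to 14x14
p2 = conv_params(5, 32, 64)                  # Conv2D(64)
h, w = h // 2, w // 2                        # MaxPool halves to 7x7
p3 = dense_params(h * w * 64, 2048)          # Dense(2048) on flattened 7x7x64
p4 = dense_params(2048, 62)                  # Dense(62): 62 FEMNIST classes
total = p1 + p2 + p3 + p4
print(total)                                 # ~6.6M parameters under these assumptions
```

Under these assumptions the model stays in the single-digit-millions range, consistent with its description as lightweight for resource-constrained devices.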

GBP-CS Initialization. The choice of the initial point in GBP-CS is critical to the quality of the solution, because a bad initial point may cause GBP-CS to fall into a local minimum. In the experiment, devices are pre-sampled at random, and the other devices are selected using GBP-CS, with the following three initialization methods.

  1. Random Initialization. Set values in to 1 at random and leave other values at 0.

  2. Zero Initialization. All values in are first initialized to 0. Then, a warm-up step is performed to meet the vector weight constraint in Eq. (13): the value with the smallest gradient is iteratively set to 1 until the number of ones in reaches . This warm-up step requires additional iterations.

  3. Moore-Penrose Inverse Initialization (MPInv). MPInv is first used to compute the least-squares solution of the unconstrained objective function . Then, the elements with the largest values in are set to 1 and the others are left at 0 to obtain the initial point .
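The Zero-initialization warm-up can be sketched as follows, again assuming the hypothetical squared-distance objective f(x) = ||D x - c||^2 used earlier for illustration (`D`, `c`, and the function name are stand-ins, not the paper's symbols):

```python
import numpy as np

def zero_init_warmup(D, c, m):
    """Zero-initialization warm-up sketch for the assumed objective
    f(x) = ||D @ x - c||^2: starting from the all-zeros vector, greedily
    set the unselected entry with the smallest gradient to 1, repeating
    until the vector-weight constraint (m ones) is met."""
    n = D.shape[1]
    x = np.zeros(n, dtype=int)
    for _ in range(m):                          # m additional warm-up iterations
        g = 2 * D.T @ (D @ x - c)               # gradient at the current point
        g_masked = np.where(x == 0, g, np.inf)  # consider only unselected entries
        x[np.argmin(g_masked)] = 1
    return x
```

This makes the cost of the warm-up explicit: m extra gradient evaluations before the permutation loop can start, which is why MPInv initialization is faster in practice.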

Comparison Algorithms. To highlight the efficiency and effectiveness of the proposed GBP-CS, we consider the following five benchmark client selection methods for comparison.

  1. Random Sampler (Random): From each group, devices are uniformly and randomly sampled.

  2. Monte Carlo Sampler (MC): Repeat the random sampler 1000 times and use the solution that minimizes Eq. (10).

  3. Brute Sampler (Brute): Exhaustively search for the optimal devices by traversing all feasible solutions to meet Eqs. (10)-(13).

  4. Bayesian Sampler (Bayesian): Search for a sub-optimal devices using Bayesian optimization[47] to meet Eqs. (10)-(13). By default, we set the number of initial points to 5 and exploration iterations to 25.

  5. Genetic Sampler (GA): Search for sub-optimal devices using a genetic algorithm [48] to meet Eqs. (10)-(13), in which the constrained 0-1 vector solutions are regarded as genes and undergo selection, crossover, mutation, and elimination. By default, we set the population size to 100, the mutation probability to 0.001, and the number of generations to 100.
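The gap between the Random and MC samplers can be reproduced with a small simulation. The divergence measure below is the L1 distance between the selected subset's class distribution and the global one, an assumed stand-in for the paper's Eq. (6), and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
N_DEVICES, N_CLASSES, M = 30, 10, 8     # hypothetical group size and budget

# Per-device class histograms (skewed at random) and the global distribution.
sizes = rng.integers(0, 50, size=(N_DEVICES, N_CLASSES))
global_dist = sizes.sum(axis=0) / sizes.sum()

def divergence(selected):
    """L1 distance between the subset's class distribution and the global
    one (an assumed stand-in for the paper's Eq. (6))."""
    sub = sizes[selected].sum(axis=0)
    return np.abs(sub / sub.sum() - global_dist).sum()

def random_sampler():
    """Uniformly sample M devices from the group (FedAvg's default)."""
    return rng.choice(N_DEVICES, size=M, replace=False)

def monte_carlo_sampler(trials=1000):
    """Keep the best of `trials` random draws, as the MC benchmark does."""
    return min((random_sampler() for _ in range(trials)), key=divergence)
```

Taking the minimum over many random draws is exactly why the MC sampler bounds the random sampler's divergence from below in expectation, at 1000x the sampling cost.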

In addition to the baseline FedAvg [24], nine other advanced approaches are experimentally compared with FedGS in the presence of non-i.i.d. data. They are FedMMD [26], FedFusion [27], FedProx [28], IDA [29], CGAU [30], FedAvgM [31], and FedAdagrad, FedAdam, and FedYogi from [32].

Implementation. We implement FedGS on a standard FL simulator, Leaf-MX (an MXNet [49] implementation of LEAF [25]). The code implementation is openly available on GitHub.

VII-B Results and Discussion

Comparison of initialization methods in GBP-CS. The optimization curves of the class distribution divergence for the Zero, Random, and MPInv initializers are shown in Fig. 3. Both the Zero and MPInv initializers successfully find high-quality solutions (0.029 and 0.030, respectively) close to the optimum of the brute force search (0.028). In contrast, the Random initializer falls into a poor local optimum (0.044). Furthermore, the MPInv initializer is much faster because it does not require an additional warm-up procedure like the Zero initializer. Therefore, GBP-CS uses the MPInv initializer by default.

(a) Effect of and
(b) Effect of and
Fig. 5: Accuracy surface of FedGS over different (a) batch size and iterations per round ; (b) number of groups and number of selected devices per group .

Comparison among GBP-CS and other samplers. Since the procedure of GBP-CS client selection is performed every iteration, both the quality and time cost of the solution are critical to FedGS performance.

We first compare the distribution divergence (defined in Eq. (6)) among GBP-CS and the other five benchmark samplers. Generally speaking, the smaller the gap between the class distribution of the group and the global distribution , the smaller the distribution divergence and the better the sampler. The distribution divergence of factories is shown in Fig. 4(a). As expected, the random sampler most commonly used in FedAvg leads to a high divergence in class distribution () and causes non-i.i.d. data among groups, while the brute force sampler always minimizes the divergence (). The random sampler and the brute force sampler give the upper and lower bounds of the distribution divergence, and the solutions of the other samplers should lie in this interval, among which the GA and GBP-CS samplers perform best ( and , respectively).

In terms of execution time, FedGS prefers samplers with a very short execution time, because high-frequency client selection may introduce non-negligible latency to FedGS and significantly slow down FL training. Fig. 4(b) compares the execution time of the above samplers. Let us focus on the brute force, GA, and GBP-CS samplers, because their solutions are of the best quality. The brute force sampler requires 979 seconds to find the optimal solution, a latency too long to be acceptable. Therefore, FedGS prefers a sub-optimal solution found in an acceptably short time. The GA and GBP-CS samplers are good choices, and the proposed GBP-CS sampler is 66× faster than the GA sampler, with a negligible 15-millisecond latency and a loss of distribution divergence of only 0.001.

To highlight GBP-CS more intuitively, we draw the optimization curve of distribution divergence over execution time in Fig. 4(c). The results show that the proposed GBP-CS sampler converges to a high-quality solution (0.029), closest to the optimum (0.028), in the shortest time, demonstrating the superior effectiveness and efficiency of GBP-CS.

Effects of hyperparameters in FedGS. Hyperparameters may have great effects on FedGS. To explore these effects, we perform a grid search on the experimental hyperparameters, including the batch size , the number of iterations per round , and the number of devices selected per group , as well as the environmental hyperparameter, the number of groups . Fig. 5(a) visualizes the test accuracy over different and settings, where is chosen from and is chosen from . The results show that a moderately large can improve the accuracy of FedGS, while the batch size has little effect. Fig. 5(b) visualizes the test accuracy over different and settings, where is chosen from and is chosen from . In general, both more groups and more selected devices bring gains in FL model accuracy because more devices' data is included. In this paper, , and are used by default to meet the condition in Proposition 4. Note that is determined by the real-world environment rather than being an adjustable hyperparameter.

Approach             Test Accuracy   Test Loss   Rounds to 82%
FedAvg (Baseline)    82.1%           0.587       478
FedProx              82.0%           0.586       497
IDA                  81.0%           0.628       –
IDA+INTRAC           81.0%           0.618       –
IDA+FedAvg           80.5%           0.687       –
CGAU                 83.3%           0.509       202
FedMMD               83.0%           0.564       378
FedFusion+Conv       81.7%           0.624       –
FedFusion+Multi      82.0%           0.591       486
FedFusion+Single     80.7%           0.627       –
FedAvgM              84.4%           0.820       68
FedAdagrad           83.8%           0.583       264
FedAdam              85.0%           0.662       71
FedYogi              84.6%           0.590       76
FedGS                86.0%           0.435       147
TABLE II: Test accuracy, test loss, and convergence speed of FedGS vs. ten federated approaches ("–" marks approaches that never reach 82% accuracy).

Comparison among FedGS and other federated approaches. We take ten advanced federated approaches for comparison to show the state-of-the-art performance of the proposed FedGS in the presence of non-i.i.d. data. The test accuracy, test loss, and training rounds required to reach an accuracy of 82% are listed in Table II, and detailed training curves are given in Fig. 6. Unless otherwise specified, all the comparison approaches use the local epoch by default. In the following, we compare these approaches and analyze their results.

(a) FedGS vs FedProx
(b) FedGS vs IDA
(c) FedGS vs CGAU
(d) FedGS vs FedProx
(e) FedGS vs IDA
(f) FedGS vs CGAU
(g) FedGS vs FedMMD
(h) FedGS vs FedFusion
(i) FedGS vs FedOpt
(j) FedGS vs FedMMD
(k) FedGS vs FedFusion
(l) FedGS vs FedOpt
Fig. 6: Comparison of FedGS and FedAvg, FedProx, IDA, CGAU, FedMMD, FedFusion, FedAvgM, FedAdagrad, FedAdam, FedYogi.

FedGS vs FedProx. FedProx adds a proximal term to local loss functions to penalize divergent local models. We tune the penalty constant to find the best result in Figs. 6(a) and 6(d). However, FedProx performs poorly in our case, with an accuracy of 82.0%, not even exceeding the baseline accuracy of 82.1% of FedAvg. The reason may be that the proximal penalty term slows convergence by forcing local models to stay close to the starting point [28]. In contrast, the proposed FedGS improves the baseline accuracy by 3.9% and achieves an accuracy of 86.0%.

FedGS vs IDA. IDA weights the model parameters of devices by their inverse distance to the averaged model parameters during aggregation. We combine IDA with inverse training accuracy coefficients (IDA+INTRAC) and normalized data size coefficients (IDA+FedAvg) as suggested by the authors. However, Figs. 6(b) and 6(e) show that the IDA-series approaches suffer an accuracy degradation (). That is because devices with large parameter deviations are over-suppressed, causing the global model to lose the data knowledge on these devices. Besides, IDA must cache the model parameters uploaded by all devices until the average model parameters and inverse distance coefficients are calculated, which takes up huge memory space on the server.

(a) FedAvgM
(b) FedAdagrad
(c) FedAdam
(d) FedYogi
Fig. 7: Accuracy Heatmap of (a) FedAvgM, (b) FedAdagrad, (c) FedAdam and (d) FedYogi.

FedGS vs CGAU.

CGAU uses gated activation units on top of a pre-trained model to enable client-specific expression of heterogeneous data. We train a 1-layer and a 2-layer CGAU classifier with 256 units, respectively (namely FineTuning+1CGAU and FineTuning+2CGAU). The dropout layer is not used as the authors did, because we observed a 3.3% drop in accuracy after using it. Figs. 6(c) and 6(f) show that FineTuning+1CGAU achieves a higher accuracy of 83.3%, which improves the baseline accuracy by 1.2% and benefits from the fast convergence speed of the pre-trained model. Despite these gains, the proposed FedGS still achieves 2.7% higher accuracy, lower test loss, and faster convergence.

FedGS vs FedMMD.

FedMMD uses transfer learning [50] to better merge the knowledge of the global model into the local model. As suggested by the authors, we use the MMD distance and the penalty coefficient . As shown in Figs. 6(g) and 6(j), FedMMD improves the baseline accuracy by 0.9% and achieves an accuracy of 83.0%, but the proposed FedGS further improves that by another 3%.

FedGS vs FedFusion. FedFusion fuses the global and local features using operators such as convolution (FedFusion+Conv), vector weighted average (FedFusion+Multi), and scalar weighted average (FedFusion+Single). However, the results in Figs. 6(h) and 6(k) show that FedFusion+Multi and FedFusion+Conv only achieve accuracy similar to the baseline (82.0% and 81.7%, respectively), and FedFusion+Single even decreases the accuracy by 1.4%. In contrast, the proposed FedGS is clearly better and faster.

FedGS vs FedOpt. FedOpt is a general paradigm for a series of adaptive federated optimizers, including FedAvgM, FedAdagrad, FedAdam, and FedYogi, which dynamically adjust the learning rates of gradients to accelerate convergence. Preliminary experiments show that the convergence performance of these approaches is indeed significantly improved, but they are particularly sensitive to initial learning rates. Following the authors, we search for the best setting of the client-side and server-side initial learning rates in Fig. 7 and give the best results in Figs. 6(i) and 6(l). Other hyperparameters follow the authors' settings, for example, for FedAvgM, for FedAdagrad, for FedAdam and FedYogi, and . The results show that these approaches can improve the baseline accuracy by with fast convergence speed, especially FedAdam. However, the accuracy of the proposed FedGS is still 1% higher.

To sum up, FedAvgM, FedAdagrad, FedAdam, and FedYogi are generally better than the other comparison approaches (some of which cannot even reach the accuracy of 82%, marked with "–" in Table II). In contrast, FedGS achieves the state-of-the-art accuracy of 86.0%, which is 3.9% higher than the baseline and 3.5% higher than the average accuracy of the compared approaches. In addition, FedGS reaches the accuracy of 82% in only 147 rounds, which is 3.3× faster than FedAvg and reduces training rounds by 59% on average. These comprehensive experiments prove the effectiveness and efficiency of FedGS.

VIII Conclusion

FL in IIoT is emerging as a field of great value, with increasing interest from both academia and industry; however, it still faces the open challenge of non-i.i.d. data. In this paper, we propose FedGS, a hierarchical cloud-edge-end FL framework for 5G-empowered modern industries. To minimize the divergence in data distributions among factories, we propose a constrained gradient-based optimizer, namely GBP-CS, to select a subset of devices in each factory and construct homogeneous FL super nodes. GBP-CS finds a desirable selection strategy in a very short time and can also be applied to other practical cases such as game matching. Then, to eliminate the impact of residual non-i.i.d. data within the super nodes, we use a compound-step synchronization protocol to coordinate the training process. This protocol uses the data heterogeneity-insensitive one-step synchronization protocol within the super nodes to suppress the negative impact of data heterogeneity, and the multi-step synchronization protocol among the super nodes to reduce communication frequency. The proposed approach takes into account the natural geographical clustering of factory devices and can adapt to rapidly changing streaming data at runtime, without exposing confidential data to high-risk manipulation. Theoretical analysis shows that FedGS achieves both a better convergence rate and a smaller optimality gap than the benchmark FedAvg, and can be more time-efficient under a relaxed hyperparameter condition. Extensive experiments against ten advanced approaches demonstrate the state-of-the-art performance of FedGS on non-i.i.d. data.


  • [1] Chen, Fei, Bo Li, Rong Dong, et al.: "High-performance OCR on packing boxes in industry based on deep learning." In: Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. 1018-1030, Springer, Cham, 2018.
  • [2] Liukkonen, Mika, and Tsung-Nan Tsai: “Toward decentralized intelligence in manufacturing: Recent trends in automatic identification of things.” International Journal of Advanced Manufacturing Technology (JAMT) 87, no. 9, pp. 2509-2531, 2016.
  • [3] Kairouz, Peter, H. Brendan McMahan, Brendan Avent, et al.: “Advances and open problems in federated learning.” Foundations and Trends in Machine Learning 14, no. 1-2, pp. 1-210, 2021.
  • [4] Hiessl, Thomas, Daniel Schall, Jana Kemnitz, et al.: “Industrial federated learning: Requirements and system design.” In: International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS), pp. 42-53, Springer, Cham, 2020.
  • [5] Lim, Wei Yang Bryan, Nguyen Cong Luong, Dinh Thai Hoang, et al.: “Federated learning in mobile edge networks: A comprehensive survey.” IEEE Communications Surveys & Tutorials (COMST) 22, no. 3, pp. 2031-2063, 2020.
  • [6] Pham, Quoc-Viet, Kapal Dev, Praveen Kumar Reddy Maddikunta, et al.: “Fusion of federated learning and industrial internet of things: A survey.” arXiv preprint arXiv:2101.00798, 2021.
  • [7] Zhang, Weishan, Qinghua Lu, Qiuyu Yu, et al.: "Blockchain-based federated learning for device failure detection in industrial IoT." IEEE Internet of Things Journal (IOTJ) 8, no. 7, pp. 5926-5937, 2020.
  • [8] Luo, Jiahuan, Xueyang Wu, Yun Luo, et al.: “Real-world image datasets for federated learning.” arXiv preprint arXiv:1910.11089, 2019.
  • [9] Zhao, Yue, Meng Li, Liangzhen Lai, et al.: “Federated learning with non-iid data.” arXiv preprint arXiv:1806.00582, 2018.
  • [10] Yao, Xin, Tianchi Huang, Rui-Xiao Zhang, et al.: “Federated learning with unbiased gradient aggregation and controllable meta updating.” In: Workshop on Federated Learning for Data Privacy and Confidentiality (FL-NeurIPS 2019, in Conjunction with NeurIPS 2019), 2019.
  • [11] Yoshida, Naoya, Takayuki Nishio, Masahiro Morikura, et al.: “Hybrid-fl for wireless networks: Cooperative learning mechanism using non-iid data.” In: ICC 2020-2020 IEEE International Conference on Communications (ICC), pp. 1-7, IEEE, 2020.
  • [12] Zhang, Wenyu, Xiumin Wang, Pan Zhou, et al.: “Client selection for federated learning with non-iid data in mobile edge computing.” IEEE Access 9, pp. 24462-24474, 2021.
  • [13] Zhao, Zhongyuan, Chenyuan Feng, Wei Hong, et al.: “Federated learning with non-iid data in wireless networks.” IEEE Transactions on Wireless Communications (TWC), 2021.
  • [14] Duan, Moming, Duo Liu, Xianzhang Chen, et al.: “Astraea: Self-balancing federated learning for improving classification accuracy of mobile deep learning applications.” In: IEEE 37th International Conference on Computer Design (ICCD), pp. 246-254, 2019.
  • [15] Wang, Han, Luis Muñoz-González, David Eklund, et al.: “Non-iid data re-balancing at IoT edge with peer-to-peer federated learning for anomaly detection.” In: Proceedings of the 14th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), pp. 153-163, 2021.
  • [16] Wen, Hui, Yue Wu, Chenming Yang, et al.: “A unified federated learning framework for wireless communications: Towards privacy, efficiency, and security.” In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops, pp. 653-658, IEEE, 2020.
  • [17] Sattler, Felix, Klaus-Robert Müller, and Wojciech Samek: "Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints." IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2020.
  • [18] Wang, Hao, Zakhary Kaplan, Di Niu, et al.: “Optimizing federated learning on non-iid data with reinforcement learning.” In: IEEE INFOCOM 2020-IEEE Conference on Computer Communications, pp. 1698-1707, IEEE, 2020.
  • [19] Zeng, Shenglai, Zonghang Li, Hongfang Yu, et al.: “Heterogeneous federated learning via grouped sequential-to-parallel training.” In: 27th International Conference on Database Systems for Advanced Applications (DASFAA), 2022.
  • [20] Nishio, Takayuki, and Ryo Yonetani: “Client selection for federated learning with heterogeneous resources in mobile edge.” In: ICC 2019-2019 IEEE International Conference on Communications (ICC), pp. 1-7, IEEE, 2019.
  • [21] Pang, Junjie, Yan Huang, Zhenzhen Xie, et al.: “Realizing the heterogeneity: A self-organized federated learning framework for IoT.” IEEE Internet of Things Journal (IOTJ) 8, no. 5, pp. 3088-3098, 2020.
  • [22] Hiessl, Thomas: “Cohort-based federated learning services for industrial collaboration on the edge.” TechRxiv, Preprint, 2021.
  • [23] Zinkevich, Martin, Markus Weimer, Alexander J. Smola, et al.: "Parallelized stochastic gradient descent." In: 24th Conference on Neural Information Processing Systems (NeurIPS) 4, no. 1, p. 4, 2010.
  • [24] McMahan, Brendan, Eider Moore, Daniel Ramage, et al.: “Communication-efficient learning of deep networks from decentralized data.” In: 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1273-1282, 2017.
  • [25] Caldas, Sebastian, Sai Meher Karthik Duddu, Peter Wu, et al.: “Leaf: A benchmark for federated settings.” In: 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019.
  • [26] Yao, Xin, Chaofeng Huang, and Lifeng Sun: “Two-stream federated learning: Reduce the communication costs.” In: 2018 IEEE Visual Communications and Image Processing (VCIP), pp. 1-4, 2018.
  • [27] Yao, Xin, Tianchi Huang, Chenglei Wu, et al.: “Towards faster and better federated learning: A feature fusion approach.” In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 175-179, 2019.
  • [28] Li, Tian, Anit Kumar Sahu, Manzil Zaheer, et al.: “Federated optimization in heterogeneous networks.” In: Conference on Machine Learning and Systems (MLSys), 2018.
  • [29] Yeganeh, Yousef, Azade Farshad, Nassir Navab, et al.: “Inverse distance aggregation for federated learning with non-iid data.” In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pp. 150-159, Springer, Cham, 2020.
  • [30] Rieger, Laura, Rasmus M. Th Høegh, and Lars K. Hansen: “Client adaptation improves federated learning with simulated non-iid clients.” In: International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML, 2020.
  • [31] Hsu, Tzu-Ming Harry, Hang Qi, and Matthew Brown: “Measuring the effects of non-identical data distribution for federated visual classification.” In: International Workshop on Federated Learning for Data Privacy and Confidentiality in Conjunction with NeurIPS, 2019.
  • [32] Reddi, Sashank, Zachary Charles, Manzil Zaheer, et al.: “Adaptive federated optimization.” In: International Conference on Learning Representations (ICLR), 2021.
  • [33] Jeong, Eunjeong, Seungeun Oh, Hyesung Kim, et al.: “Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data.” In: Workshop on Machine Learning on the Phone and other Consumer Devices, Montréal, Canada, 2018.
  • [34] Wang, Shiqiang, Tiffany Tuor, Theodoros Salonidis, et al.: “Adaptive federated learning in resource constrained edge computing systems.” IEEE Journal on Selected Areas in Communications (JSAC) 37, no. 6, pp. 1205-1221, 2019.
  • [35] Yu, Hao, Sen Yang, and Shenghuo Zhu: “Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning.” In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 33, no. 1, pp. 5693-5700, 2019.
  • [36] Li, Xiang, Kaixuan Huang, Wenhao Yang, et al.: “On the convergence of fedavg on non-iid data.” In: International Conference on Learning Representations (ICLR), 2020.
  • [37] Nesterov, Yu: “Gradient methods for minimizing composite functions.” Mathematical Programming 140, no. 1, pp. 125-161, 2013.
  • [38] Ward, Rachel, Xiaoxia Wu, and Leon Bottou: “AdaGrad stepsizes: Sharp convergence over nonconvex landscapes.” In: International Conference on Machine Learning (ICML), pp. 6677-6686, 2019.
  • [39] Kingma, Diederik P., and Jimmy Ba: “Adam: A method for stochastic optimization.” In: International Conference on Learning Representations (ICLR), 2015.
  • [40] Zaheer, Manzil, Sashank Reddi, Devendra Sachan, et al.: “Adaptive methods for nonconvex optimization.” In: 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018.
  • [41] Liu, Yi, Sahil Garg, Jiangtian Nie, et al.: “Deep anomaly detection for time-series data in industrial IoT: A communication-efficient on-device federated learning approach.” IEEE Internet of Things Journal (IOTJ), 8, no. 8, pp. 6348-6358, 2020.
  • [42] Akpakwu, Godfrey Anuga, Bruno J. Silva, Gerhard P. Hancke, et al.: “A survey on 5G networks for the internet of things: Communication technologies and challenges.” IEEE Access 6, pp. 3619-3647, 2017.
  • [43] Li, Zonghang, Huaman Zhou, Tianyao Zhou, et al.: "ESync: Accelerating intra-domain federated learning in heterogeneous data centers." IEEE Transactions on Services Computing (TSC), 2020.
  • [44] Rice, Bart: “The 0-1 integer programming problem in a finite ring with identity.” Computers and Mathematics with Applications 7, no. 6, pp. 497-502, 1981.
  • [45] Zhou, Huaman, Zonghang Li, Qingqing Cai, et al.: “DGT: A contribution-aware differential gradient transmission mechanism for distributed machine learning.” Future Generation Computer Systems (FGCS) 121, pp. 35-47, 2021.
  • [46] Varga, Pal, Jozsef Peto, Attila Franko, et al.: “5G support for industrial IoT applications: Challenges, solutions, and research gaps.” Sensors 20, no. 3, p. 828, 2020.
  • [47] Nogueira, Fernando, et al.: "Bayesian optimization: Open source constrained global optimization tool for Python." [Online]:, 2014.
  • [48] Whitley, Darrell: “A genetic algorithm tutorial.” Statistics and Computing 4, no. 2, pp. 65-85, 1994.
  • [49] Chen, Tianqi, Mu Li, Yutian Li, et al.: “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems.” In: 30th Conference on Neural Information Processing Systems (NeurIPS), 2016.
  • [50] Pan, Sinno Jialin, and Qiang Yang: “A survey on transfer learning.” IEEE Transactions on Knowledge and Data Engineering (TKDE) 22, no. 10, pp. 1345-1359, 2009.