I Introduction
In recent years, the Internet of Things (IoT) has played an increasingly integral role in industry. Take logistics sorting and automatic object identification as examples. Optical character recognition (OCR) cameras on logistics pipelines detect and read the characters on packing boxes in order to sort them[1]. At the same time, the surrounding surveillance cameras constantly monitor the scene, automatically identifying objects by optically recognizing the characters on their badges and confirming whether the machines, robots, vehicles, and workers in the factory are legal entrants[2]. These optical sensors collect a huge amount of industrial data. In order to tap the value of these data, advanced data mining technologies are needed, especially machine learning (ML). However, gathering industrial big data to the cloud leads to unbearable transmission overhead and also violates data privacy regulations. Borrowing the idea of task offloading, federated learning (FL)[3] sinks model training from the cloud to the edge. OCR cameras use local optical data to train local OCR models, then upload their local model updates to the cloud to update the global model. The global model is then synchronized back to the OCR cameras. These steps are repeated until the global model converges. In this way, FL preserves data confidentiality because the raw data never leave the devices.
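The FL loop just described can be sketched as a small simulation (the linear model, device counts, and hyperparameters below are illustrative assumptions, not this paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_train(w, X, y, lr=0.1, epochs=3):
    """A few epochs of full-batch gradient descent on a linear model."""
    for _ in range(epochs):
        w = w - lr * (X.T @ (X @ w - y)) / len(y)
    return w

# Illustrative setup: 20 devices with 50 private local samples each.
dim, n_devices = 10, 20
w_true = rng.normal(size=dim)
data = []
for _ in range(n_devices):
    X = rng.normal(size=(50, dim))
    data.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w_global = np.zeros(dim)
for _ in range(30):                                        # training rounds
    chosen = rng.choice(n_devices, size=5, replace=False)  # random subset
    local_models = [local_train(w_global, *data[k]) for k in chosen]
    w_global = np.mean(local_models, axis=0)               # cloud averaging

print(round(float(np.linalg.norm(w_global - w_true)), 3))
```

Here each "device" holds a private dataset that never leaves it; only model parameters are exchanged with the cloud.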
The combination of the industrial IoT (IIoT) and FL opens a door for smart industry[4, 5, 6]. FL provides powerful privacy-preserving tools for mining decentralized industrial data, and IIoT technologies such as smart sensors and mobile robots provide rich resources (e.g., data, computation) for FL. Despite these benefits, compared with OCR in natural scenes, FL in industry requires higher accuracy to ensure the reliability of industrial operations. However, sensors' local data distributions can be highly heterogeneous due to differences in time, location, function, and so on. Taking the logistics industry of cross-border e-commerce as an example, the Singapore warehouse ships more packing boxes to Singapore than to other countries, so the optical characters in the word “Singapore” appear more often than other characters. For this reason, the number of optical images of each character captured by different OCR cameras (in different warehouses) is skewed and inconsistent. Other examples are device failure detection[7] and object detection[8], which also confirm the existence of data heterogeneity in real-world IIoT. Such skewed data are called non-independent and identically distributed (non-i.i.d.) and can lead to FL performance degradation[9], which becomes even more challenging when the local data of sensors are constantly changing.
The non-i.i.d. data challenge has inspired the research field of heterogeneous FL, especially in mobile edge computing (MEC)[5, 10, 12, 13, 11, 14, 15, 16, 17, 18, 19], and the field currently remains open. This body of literature has achieved great success in the context of MEC, but the following characteristics of IIoT still limit its applicability:

Higher requirements for data security. Industries (e.g., manufacturing, logistics, and transportation) often face even more serious data threats because they own a vast amount of valuable information, so they have the most urgent and critical need to strengthen data protection. Therefore, any form of disclosure[10, 11, 12, 13] or tampering[14, 15, 16] of the confidential raw data is not allowed.

Rapidly changing streaming data on data-intensive sensors. Data-intensive IIoT sensors such as OCR cameras require high sampling rates to capture real-time phenomena, and they produce large amounts of data. In order to save storage space, new data overwrite the old data that have already been processed, forming a data stream similar to a first-in-first-out (FIFO) queue. In such a dynamic environment, static approaches no longer work for IIoT, for example, [17] and the K-Center clustering algorithm in [18].

Natural geographical clustering property. In modern industrial parks, the IIoT devices in each factory are geographically adjacent, which makes them naturally cluster into groups interconnected by highly reliable networks, for example, through regional 5G base stations (see Fig. 1). However, this valuable property is often ignored, and the rich communication resources at the edge are not fully utilized[12, 19, 20], which limits the improvement of industrial FL.
The above characteristics distinguish “FL in IIoT” from “FL in non-IIoT” (e.g., FL at the edge). Little literature has been proposed to tackle the non-i.i.d. data challenge of FL in IIoT; examples include approaches based on centroid-distance-weighted averaging[7][21] and k-means-based cohorts[22]. However, none of them takes into account the changing local data distributions or the natural geographical clustering property of devices in IIoT. More importantly, they do not address the fundamental cause of FL model performance degradation, namely the divergence in class distributions[9]. To address this root cause of non-i.i.d. data, this paper aims to propose an effective approach to minimize the divergence in class distributions among heterogeneous devices. Taking advantage of the natural property of geographical clustering, we can select a subset of devices in each factory to construct “FL super nodes” with consistent class distributions. These super nodes can be treated as homogeneous clients participating in FL training, without exposing the confidential industrial FL process to risky data manipulation.
However, designing such an approach is not trivial. Firstly, selecting a subset of devices in each group to minimize the class distribution divergence among groups is a 0-1 integer programming problem with vector weight constraints, which is proven to be NP-complete.
More challengingly, this procedure needs to be invoked frequently to adapt to rapidly changing local data and the mobility of mobile IIoT devices (e.g., robots, drones), which places high demands on execution latency. Secondly, even if class distributions among FL super nodes are forced to be homogeneous, devices' local data in each FL super node can still be skewed. If not handled properly, these challenges will still degrade FL model performance. To minimize the data heterogeneity among groups and realize efficient client selection, this paper proposes a novel gradient-based binary permutation optimizer, GBPCS, to solve the above NP-complete client selection problem. GBPCS runs a constraint-preserving gradient descent optimization procedure directly in the 0-1 integer space and can build homogeneous FL super nodes in a very short time. We then propose Federated Group Synchronization (FedGS), a hierarchical cloud-edge-end FL framework for 5G-empowered modern industries, to improve industrial FL performance on non-i.i.d. data. FedGS uses a compound-step synchronization protocol to train ML models, which suppresses data heterogeneity both within and among FL super nodes. More specifically, FedGS uses a single-step synchronization protocol (e.g., SSGD[23]) within super nodes because of its robustness against data heterogeneity, and a multi-step synchronization protocol (e.g., FedAvg[24]) among the homogeneous super nodes to reduce communication overhead. Theoretical analysis shows that FedGS has both a convergence upper bound and an optimality gap better than those of FedAvg in the presence of non-i.i.d. data, and can be more time-efficient under a relaxed condition. Finally, we evaluate FedGS on the most widely adopted non-i.i.d. benchmark dataset, FEMNIST[25], and compare it with 10 advanced approaches, including FedAvg, FedMMD[26], FedFusion[27], FedProx[28], IDA[29], CGAU[30], FedAvgM[31], and FedAdagrad, FedAdam, and FedYogi from [32].
The main contributions of this paper are summarized as follows.

We propose a hierarchical cloud-edge-end FL framework, FedGS, for 5G-empowered modern industries, which uses a novel compound-step synchronization protocol to coordinate the training process within and among groups. The new protocol is robust against data heterogeneity and can effectively improve industrial FL performance.

We propose a novel GBPCS algorithm to select a subset of devices from each group to build homogeneous FL super nodes, which can find a desirable selection strategy in a very short time. GBPCS is a general optimizer for constrained 0-1 integer programming problems and can be used in other practical cases such as game matching.

We analyze the convergence rate and optimality gap of FedGS and give a relaxed condition under which FedGS is more time-efficient than FedAvg. Theoretical results show that FedGS not only converges closer to the optimum, but also does so faster.

Extensive experiments against 10 advanced approaches show that FedGS improves FL accuracy by 3.5% and reduces training rounds by 59% on average. The results highlight the superior effectiveness and efficiency of FedGS on non-i.i.d. data.
II Related Work
In this section, we categorize related works into four types according to the techniques they use.
Data Sharing and Augmentation. This type of approach aims to minimize the class distribution divergence among devices by sharing or augmenting FL clients' local datasets. For the sharing-based approaches, Zhao et al. propose distributing a small portion of globally shared data (e.g., openly available data) to clients' devices[9]. Yao et al. collect metadata shared by voluntary clients to perform controllable meta updating[10]. Yoshida et al. reward FL clients for contributing local datasets and propose a hybrid learning mechanism wherein the server updates the model using the shared data and the clients' local models[11]. These approaches achieve a considerable improvement in FL accuracy, but they risk leaking private data because clients' local datasets must be shared. Besides, openly available datasets do not always exist, especially in fields where data are highly confidential.
For the augmentation-based approaches, Duan et al. observe that the imbalance among different classes can also degrade FL accuracy[14]. Hence, they augment classes with fewer samples by simple random offsets, rotation, cropping, and scaling. Jeong et al.[33] propose generating new samples with a globally trained conditional generative adversarial network (CGAN) to build unskewed local datasets. Similarly, Wang et al. generate synthetic data for the minority classes based on linear interpolation to rebalance local datasets on edge devices[15]. These approaches avoid leaking FL clients' private data. However, they still raise credibility crises. Speculative clients can use synthetic data generated out of thin air to participate in FL training while hiding their original data. They can also pretend that the synthetic data is a large volume of high-quality data to claim more rewards. Therefore, these operations (i.e., data sharing and data augmentation) are high-risk and should be prohibited.

Hyperparameter Tuning. Hyperparameters play an important role in FL training convergence. Some works have explored hyperparameter tuning, such as tuning the number of local iterations and the learning rate. Wang et al. point out that the optimal performance can be achieved when the number of local iterations equals one[34]. However, constrained resources (e.g., bandwidth, time, power) prevent us from doing this; in practice, larger numbers of local iterations are more commonly used. For example, Yu et al. carefully set the number of local iterations and obtain a considerable convergence rate[35]. In addition, Li et al. point out that decaying the learning rate is necessary for FL convergence with large numbers of local iterations[36]. For a strongly convex and smooth objective function, FedAvg can converge to the optimum after applying learning rate decay, with a convergence rate of O(1/T), where T is the total number of local updates on a single device. These works give rigorous convergence proofs that guide follow-up optimization of FL. However, carefully tuning these hyperparameters (e.g., the number of local iterations, the learning rate, and the decay rate) requires multiple attempts and incurs high time costs.
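As an illustration of the decay requirement discussed above, an inverse-time schedule is one common choice satisfying the diminishing-step-size condition used in such convergence proofs (the constants here are illustrative, not the cited works' settings):

```python
def decayed_lr(eta0: float, t: int, decay: float = 0.01) -> float:
    """Inverse-time decay: eta_t = eta0 / (1 + decay * t)."""
    return eta0 / (1.0 + decay * t)

# The step size diminishes over local updates, as required by the
# convergence analyses for large numbers of local iterations.
print(decayed_lr(0.1, 0))      # 0.1
print(decayed_lr(0.1, 100))    # 0.05
```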
Client-Side Adaptation. This type of approach emphasizes that FL clients should adaptively retain global knowledge while improving local knowledge. Yao et al. point out that the global model contains more global knowledge and should be kept as a reference rather than simply thrown away. Based on this idea, they adopt a two-stream model to transfer the global knowledge to the local model[26]. By minimizing the maximum mean discrepancy (MMD) loss, the two-stream model can extract more generalized features and learn better local representations. Then, in [27], they use convolution, vector-weighted-average, and scalar-weighted-average operators to fuse the global and local features. Li et al. point out that too many local updates cause FL training to diverge, especially under the non-i.i.d. data setting[28]. Hence, they add a proximal penalty term to the local objective loss functions to constrain the local model to stay close to the global model and avoid excessive divergence. Rieger et al. point out that clients express representations in different patterns and their shared knowledge may be obfuscated after synchronization[30]. Hence, they adopt conditional gated activation units to enable clients to condition their units. In this way, clients can identify whether a global feature is expressed and how to modulate the global pattern. These approaches impose more storage footprint and computation on resource-constrained client-side devices, requiring higher resource allocation and energy consumption.

Server-Side Adaptation. This type of approach explores how local models can be adaptively aggregated and how the global model can be adaptively optimized on the server side. Yeganeh et al. aggregate clients' local models according to their weights, capitalizing on an adaptive weighting approach based on the inverse distance between the local model parameters and the averaged model parameters[29]. With this approach, out-of-distribution models are weighed down and the global model has a lower variance. The authors also explored combinations with other metrics, such as the training accuracy and the data size.
However, these variants did not perform well in our experiments, probably because some honest but “out-of-distribution” devices were over-suppressed.
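The inverse-distance weighting idea described above can be sketched as follows (a simplified single-tensor version under our own assumptions, not IDA's exact implementation):

```python
import numpy as np

def inverse_distance_aggregate(local_models, eps=1e-8):
    """Weight each local model by the inverse of its distance to the
    plain average, so out-of-distribution models contribute less."""
    avg = np.mean(local_models, axis=0)
    inv_dist = np.array([1.0 / (np.linalg.norm(m - avg) + eps)
                         for m in local_models])
    weights = inv_dist / inv_dist.sum()
    return np.sum([w * m for w, m in zip(weights, local_models)], axis=0)

# Two nearby models and one outlier: the outlier is weighed down.
models = [np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([5.0, -3.0])]
aggregated = inverse_distance_aggregate(models)
plain_mean = np.mean(models, axis=0)
```

Compared with the plain mean, the aggregated model stays much closer to the two nearby models, which illustrates both the variance reduction and the over-suppression risk noted above.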
On the other hand, inspired by the ability of momentum accumulation to dampen oscillations[37], Hsu et al. adopt a momentum optimizer on the server side and observe a significant improvement in FL accuracy[31]. Then, Reddi et al. introduce three advanced adaptive optimizers (i.e., Adagrad[38], Adam[39], and Yogi[40]) to update the server-side global model[32]. These adaptive federated optimizers enable adaptive learning rates for different gradients and achieve great success, but unfortunately, they also require careful tuning of the initial learning rates, and we observed drastic accuracy oscillations in our experiments.

III System Model
Symbol  Explanation 

Loss function (e.g., cross-entropy).  
Maximum training rounds.  
Number of iterations in each round.  
,  Number of devices (in factory ). 
Number of devices to be selected per factory.  
Number of devices to be randomly presampled per factory.  
Number of devices to be selected by GBPCS per factory.  
Number of factories (also the number of groups).  
Number of classification classes.  
Local learning rate.  
Parameters of the global ML model at th iteration.  
Parameters of the ML model on BS at th iteration.  
Parameters of the local ML model on device in factory at th iteration.  
Local dataset of device in factory .  
A minibatch data of device in factory at th iteration.  
Size of local dataset of device in factory .  
,  Batch data size (of device in factory ). 
Total size of data batches in factory .  
Data size vector of label classes of minibatch data .  
Data size matrix of of devices.  
Total data size vector of the presampled devices.  
Set of all devices in factory .  
Set of selected devices in factory at th iteration.  
Local data distribution of device in factory .  
Local data distribution of minibatch data .  
Mean data distribution of over selected devices .  
Realworld global data distribution. 
IoT devices in modern industrial parks can be divided into two types: fixed devices (e.g., monitoring cameras, temperature and humidity sensors) and mobile devices (e.g., patrol drones and logistics robots). In the industrial park, thanks to the advantages of improved performance, reduced communication cost, and decentralized scalability, FL plays an important role in many industrial applications. For example, an anomaly detection application based on on-device federated monitors could be applied in IIoT scenarios, where sensing and monitoring devices may be located in harsh environments with high voltage and high radiation and may move around the factory, making wired network access impractical[41]. 5G mobile networks are being enhanced to support key performance features of industrial applications such as high throughput, low latency, and high scalability[42]. These features enable industrial FL applications to transmit the model data of a large number of IIoT devices at a high cycle frequency, with a high data rate and low latency. Therefore, we consider a hierarchical cloud-edge-end network architecture empowered by 5G cellular wireless networks for training industrial FL applications, as shown in Fig. 1.

In this setting, a modern industrial park has several factories, and each factory has a number of smart devices. We consider the devices in the same factory as a group. These devices are equipped with embedded computing chips to perform lightweight processing, for example, training ML models on local data streams. The devices connect to the nearby base station (BS) through 5G cellular wireless networks. The BSs then exchange FL model data with the cloud center through the Internet. Specifically, each device of a factory collects streaming sensory data in real time and thereby changes its local dataset and class distribution. The class distributions of different devices can be highly heterogeneous due to diverse local usage patterns, and thus generally differ from the real-world global class distribution. The goal of industrial FL is to find the optimal model parameters that minimize the global loss function,
(1) 
The traditional workflow is briefly described below. In each round, a small subset of devices is randomly selected to participate in FL training. These devices use their local datasets to train their local ML models for several epochs and upload these local models to the connected BSs. The BSs aggregate these local models and upload the aggregated models to the top server in the cloud. The top server globally aggregates the BSs' models, updates the global ML model, and synchronizes the updated model back to all BSs and end devices. These steps are repeated until the global model parameters converge. However, this approach causes performance degradation in industrial FL due to data heterogeneity among devices, and it ignores the streaming nature of industrial data. Hence, in Section IV, a data-heterogeneity-robust federated group synchronization approach is presented to address this issue.

IV FedGS: Framework and Workflow
The core idea is to strategically select a small subset of devices in each group to form FL super nodes with homogeneous data distributions. Then, these super nodes can be regarded as homogeneous clients participating in FL. To resolve the heterogeneity in the local datasets of devices inside each super node, the one-step synchronization protocol (e.g., SSGD) is useful because it has been proven equivalent to centralized SGD, which makes it robust against data heterogeneity. Meanwhile, the multi-step synchronization protocol (e.g., FedAvg) can be used to keep FedGS communication-efficient, since the class distributions of the super nodes are aligned. In this way, the problem of data heterogeneity is decomposed from the entire population of devices to a small number of devices in multiple groups, where it can be solved efficiently and effectively by the compound-step synchronization protocol, and the performance degradation of industrial FL is addressed.
The detailed design is given in Alg. 1. In the initialization stage, the top server first initializes the global ML model parameters and synchronizes them to the BSs. Then, it collects the local class distributions from all devices to estimate the real-world global class distribution,
(2)
where is the local data size of device in group , and is a probability normalization function.
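With hypothetical class counts, the estimate in Eq. (2), a data-size-weighted average of local class distributions followed by probability normalization, could be computed as:

```python
import numpy as np

def estimate_global_distribution(local_counts):
    """local_counts: one per-device class-count vector per device.
    Weighting each local distribution by its data size and then
    normalizing is equivalent to summing the raw counts and dividing
    by the grand total."""
    total = np.sum(local_counts, axis=0).astype(float)
    return total / total.sum()

# Three hypothetical devices with skewed 4-class local datasets.
counts = [np.array([90, 5, 3, 2]),
          np.array([10, 60, 20, 10]),
          np.array([5, 5, 45, 45])]
print(estimate_global_distribution(counts))
```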
Client Selection. In each iteration, each BS selects a subset of devices from its group to obtain a homogeneous super node via the SelectClientsViaGBPCS interface. The detailed algorithm, GBPCS, is presented in Section V.
Local Training. In each group, each selected device fetches a minibatch of data from its local dataset. These minibatch streaming data are one-shot and will not be used again. Then, the device downloads the model from the connected BS and trains it for one minibatch gradient descent step with the local learning rate,
(3) 
Internal Synchronization. The locally trained model will be uploaded to BS for internal synchronization,
(4) 
where the normalizing factor is the total data size of all used minibatches. Then, the BS updates its model with the aggregated parameters and synchronizes them within its group.
We call the above client selection, local training, and internal synchronization a one-step synchronization iteration, because the local update on each device is performed only once before each synchronization. The one-step synchronization loops for several iterations before each round of external synchronization.
External Synchronization. Each time the one-step synchronization has been performed for the specified number of iterations, the BSs upload their models to the top server for global aggregation,
(5) 
The globally aggregated model is then used to update the global model on the top server and is synchronized back to the BSs.
Since the external synchronization is performed after every batch of one-step synchronization iterations, we call it a multi-step synchronization round. The above steps are repeated over multiple rounds to obtain the converged model parameters.
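The compound-step structure of one multi-step round, several one-step internal synchronizations followed by a single external synchronization, can be sketched as follows (the linear local models and equal data sizes are simplifying assumptions of ours):

```python
import numpy as np

def fedgs_round(w_global, groups, E, lr=0.05):
    """One multi-step round: E one-step internal synchronizations per BS,
    then one external synchronization at the top server.
    groups: one list of (X, y) minibatches per BS (its selected devices)."""
    bs_models = [w_global.copy() for _ in groups]
    for _ in range(E):                      # one-step synchronization loop
        for g, devices in enumerate(groups):
            stepped = []
            for X, y in devices:            # each device: ONE gradient step
                grad = X.T @ (X @ bs_models[g] - y) / len(y)
                stepped.append(bs_models[g] - lr * grad)
            bs_models[g] = np.mean(stepped, axis=0)   # internal sync at BS
    return np.mean(bs_models, axis=0)       # external sync at top server
```

Repeatedly calling `fedgs_round` with fresh minibatches mirrors the workflow above: devices never accumulate more than one local step between internal synchronizations.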
The above workflow can be seen as an equivalent version of FedAvg that performs local updates on FL super nodes with larger batch sizes but homogeneous local datasets among super nodes. By capitalizing on an effective client selection strategy that makes these super nodes homogeneous, the FL training process becomes robust against data heterogeneity, and FL model performance is improved. In the following section, we present our solution, GBPCS.
V Client Selection via GBPCS
In this section, we formulate the client selection problem as a 0-1 integer programming problem with vector weight constraints and present our novel gradient-based binary permutation approach, GBPCS, which solves this problem in an acceptably short time with a desirable solution.
V-A Problem Modeling
Consider a factory (i.e., a group) with a set of industrial devices. The next data batch of each device follows its local distribution, with a known data size vector over the label classes. Our goal is to select a subset of devices from the group at each iteration to form an FL super node whose overall class distribution satisfies,
(6) 
Note that if the selection is deterministic, Eq. (6) will always find the same fixed device set, and other devices will have no chance to be selected. To keep the randomness of client selection so that each device has the same probability of being selected, we use a trick: we randomly presample some devices before strategically selecting the remaining ones. Formally speaking, in each group, we first sample a number of devices at random, whose next data batches have a known total data size vector. Then, the remaining quota of devices is selected from the rest, whose data size matrix is known, with the goal of minimizing Eq. (6).
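The presampling trick might look like this in code (equal batch sizes are assumed, as in the simplified model of this section; the function and variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(42)

def presample_and_residual(counts, global_dist, m, r):
    """Randomly presample r devices, then return the residual class-count
    target that the m - r strategically selected devices should meet.
    counts: (n, C) per-device class counts; equal batch sizes assumed."""
    n, _ = counts.shape
    pre = rng.choice(n, size=r, replace=False)
    remaining = np.setdiff1d(np.arange(n), pre)
    B = counts[0].sum()                          # common batch size
    target_total = m * B * np.asarray(global_dist)
    residual = target_total - counts[pre].sum(axis=0)
    return pre, remaining, residual
```

The residual vector is then handed to the strategic selection step, which tries to match it as closely as possible.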
We use the following mathematical model to describe the above problem. Let and , the objective is to find a solution , where and , that
(7)  
(8)  
(9) 
Let the batch sizes of all data batches be the same; then we obtain a simplified model,
(10)  
(11)  
(12)  
(13) 
Note that in the mathematical model above, we considered data size and data quality in the objective (i.e., minimizing the distribution divergence) for client selection, but assumed that IIoT devices have similar hardware capabilities. If system heterogeneity must be considered, ESync[43] is compatible and can be useful. We now give a simple proof that the above problem is NP-complete.
Lemma 1 (Problem A)
Given an integer matrix and an integer vector, decide whether there is a 0-1 vector that satisfies the corresponding linear system. This 0-1 integer programming problem is NP-complete[44].
Proposition 1 (Problem B)
Proof 1
To solve problem A, we can solve problem B several times with varying parameters. This input transformation has linear complexity. Problem B outputs YES if Eq. (10) can reach 0; otherwise, it outputs NO. This output transformation has constant complexity. Hence, problem A reduces to problem B with polynomial complexity, which makes problem B NP-complete as well.
We can see from Proposition 1 that it is almost impossible to find the optimal solution with polynomial complexity. To keep FedGS time-efficient, a suboptimal but fast solution is preferred, as described in the following subsection.
V-B Gradient-Based Binary Permutation Client Selection
To make the NP-complete client selection problem tractable, we propose a novel gradient-based approximate approach, GBPCS. The core idea is to permute (0,1) pairs of binary selection variables with the steepest opposite gradients. In other words, the variable with selection value 0 and the smallest gradient is permuted with the variable with selection value 1 and the largest gradient. In this way, the number of variables whose selection value equals one (i.e., the vector weight constraint) is maintained, and constraints (12) and (13) remain satisfied. This binary permutation operation is performed iteratively to minimize the objective in Eq. (10).
The pseudocode of GBPCS is given in Alg. 2. Given the data size matrix and the target, GBPCS first initializes the solution variable as follows,
(14) 
where the initialization is the Moore-Penrose pseudo-inverse solution with its largest entries (as many as the selection quota) set to 1 and the others set to 0. Then, GBPCS calculates the objective distance and the gradient.
The gradient indicates the opposite direction in which the solution should be updated. The greater the absolute value of a gradient entry, the larger the decrease in the objective that can be obtained by updating the corresponding variable, so it serves as a key gradient[45]. Based on this idea, GBPCS selects a pair of selection variables with opposite key gradients for permutation. More specifically, the first is the identity of the device with selection value 0 and the smallest gradient,
(15) 
Similarly, the second is the identity of the device with selection value 1 and the largest gradient,
(16) 
Then, GBPCS permutes the values of the two variables to obtain a new solution,
(17) 
Eqs. (15)-(17) are repeated until the objective distance no longer decreases. Finally, we can construct
(18) 
and obtain the set of selected devices .
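A minimal sketch of GBPCS under our own notation might look as follows, assuming the objective is the squared distance between the selected devices' summed class counts and a residual target (the matrix shapes and names are our assumptions):

```python
import numpy as np

def gbpcs(D, target, m):
    """Gradient-based binary permutation sketch.
    D: (n, C) class-count matrix of candidate devices; target: (C,)
    residual class-count target; m: number of devices to select.
    Minimizes f(x) = ||D.T @ x - target||^2 over binary x, sum(x) == m."""
    n = D.shape[0]
    # Initialize from the Moore-Penrose pseudo-inverse solution and round
    # its m largest entries to 1 (cf. Eq. (14)).
    relaxed = np.linalg.pinv(D.T) @ target
    x = np.zeros(n)
    x[np.argsort(relaxed)[-m:]] = 1.0

    def f(v):
        return float(np.sum((D.T @ v - target) ** 2))

    best = f(x)
    while True:
        grad = 2.0 * D @ (D.T @ x - target)          # gradient of f
        zeros, ones = np.where(x == 0)[0], np.where(x == 1)[0]
        if len(zeros) == 0 or len(ones) == 0:
            break
        i = zeros[np.argmin(grad[zeros])]  # 0-variable, smallest gradient
        j = ones[np.argmax(grad[ones])]    # 1-variable, largest gradient
        x_new = x.copy()
        x_new[i], x_new[j] = 1.0, 0.0      # permute the (0, 1) pair
        if f(x_new) >= best:               # stop once no longer decreasing
            break
        x, best = x_new, f(x_new)
    return x
```

Each accepted permutation strictly decreases the objective while preserving the vector weight constraint, so the loop terminates in finitely many steps.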
GBPCS has a low complexity that grows linearly with the number of GBPCS iterations. In our experiments, GBPCS obtains a highly desirable solution close to the optimum and has an execution efficiency comparable to that of the random sampling approach.
VI Performance Analysis
In this section, we analyze the optimality gap and convergence rate of FedGS and qualitatively compare them with those of FedAvg in the presence of non-i.i.d. data. Then, we give the condition under which FedGS is time-efficient.
VI-A Convergence Analysis
Assumption 1
The local loss functions are strongly convex, smooth, and Lipschitz.
Proposition 2
For the gradient on an FL node in the federated setting and the gradient in the centralized setting, there is an upper bound on the gradient divergence proportional to the distribution divergence.
Proof 2
It is easy to see that this bound captures the impact of the divergence in class distributions. Generally speaking, the smaller the distribution divergence, the smaller the upper bound on the gradient divergence.
Proposition 3
Let , the convergence upper bound of FedGS is , and its optimality gap is bounded by .
Proof 3
As mentioned above, the convergence behavior of FedGS is theoretically equivalent to that of FedAvg in which FL super nodes run minibatch SGD with larger batch sizes for multiple local iterations in each round. The convergence upper bound of FedGS after a given number of iterations can then be inferred from Lemma 2 in [34],
where . When , the optimality gap is bounded by
Since GBPCS forces the FL super nodes to have aligned class distributions, FedGS has a smaller upper bound on the gradient divergence than FedAvg. Therefore, it is easy to infer from Proposition 3 that FedGS has both a convergence upper bound and an optimality gap smaller than those of FedAvg, and thus it improves both the convergence speed and the accuracy of FL.
VI-B Time-Efficiency Condition
We analyze the time cost of FedGS in each round (consisting of multiple one-step synchronization iterations) and that of FedAvg in each round (consisting of multiple local iterations), and give the condition under which FedGS achieves higher time efficiency than FedAvg.
Time Cost of FedGS. The time cost of FedGS is determined by communication, computation, and client selection. The communication delays are brought by the internal and external synchronizations. For the internal synchronization, there is a delay for uploading the local models from the devices to their BS and a delay for synchronizing the model from the BS back to the devices, both determined by the uplink and downlink bandwidths between the devices and the BS and by the received signal-to-noise ratios (SNRs) of the BS and the devices. For the external synchronization, there is a delay for uploading the models from the BSs to the top server and a delay for synchronizing the global model from the top server back to the BSs, determined by the uplink and downlink bandwidths between the BSs and the top server and by the SNR of the top server. Hence, we have the communication time costs for each internal synchronization and each external synchronization as follows,
(19)
(20) 
Let the delays of each local update and of the client selection procedure be given. Since the internal synchronization is performed multiple times in each round, the total per-round delay is
(21) 
Time Cost of FedAvg. The time cost of FedAvg is mainly determined by communication and computation, because its client selection procedure is simple random sampling, whose delay is negligible. The communication delays come from the uplink and downlink model transmissions between the top server and the devices: a delay for uploading the local models from the devices to the top server and a delay for synchronizing the global model from the top server back to the devices. Hence, we have the communication time cost for each round of synchronization as follows,
(22) 
Then, the total time cost when FedAvg performs local updates and one round of synchronization is
(23) 
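The two per-round cost structures, Eqs. (21) and (23), can be compared numerically; in the sketch below the individual delay terms are aggregated into single hypothetical values of our own choosing:

```python
def fedgs_round_time(E, t_edge_up, t_edge_down, t_wan_up, t_wan_down,
                     t_comp, t_select):
    """Structure of Eq. (21): E internal one-step synchronizations over
    the 5G edge links, plus one external synchronization over the WAN."""
    return E * (t_select + t_comp + t_edge_up + t_edge_down) \
        + t_wan_up + t_wan_down

def fedavg_round_time(E, t_wan_up, t_wan_down, t_comp):
    """Structure of Eq. (23): E local updates, then one WAN sync."""
    return E * t_comp + t_wan_up + t_wan_down

# Hypothetical delays (seconds); edge links are much faster than the WAN.
print(fedgs_round_time(E=10, t_edge_up=0.02, t_edge_down=0.02,
                       t_wan_up=1.0, t_wan_down=1.0,
                       t_comp=0.05, t_select=0.005))
print(fedavg_round_time(E=10, t_wan_up=1.0, t_wan_down=1.0, t_comp=0.05))
```

When the edge links are fast, the extra internal synchronizations add little on top of the shared WAN cost; Proposition 4 below formalizes when FedGS comes out ahead.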
In order to simplify the analysis result, we make the following assumptions.
Assumption 2
(a) The uplink and downlink bandwidths are equal; (b) the SNRs of the top server, BSs, and devices are equal.
Then, we can give the following condition for hyperparameter setting, under which FedGS can achieve higher time efficiency than FedAvg.
Proposition 4
Let Assumption 2 hold. Then the per-round time cost of FedGS does not exceed that of FedAvg if the following condition on the hyperparameters holds, where
(24)  
(25) 
Proof 4
Eq. (24) can be obtained by combining Eqs. (19)-(21), and Eq. (25) can be obtained by combining Eqs. (22)-(23). Equating the two time costs, we have
In our experiment, GBPCS is quite fast: its time cost (15 milliseconds) is negligible compared to the other delays. Therefore, we neglect the selection delay to simplify the result and obtain
In modern industrial applications, 5G enables indoor industrial use cases that were impossible before, supported by the high data rates, ultra-low delay, and extreme connection density of its wireless communications[46]. In practice, the data rate at the 5G edge is about 10–100 times that of the WAN. Therefore, we can easily set the hyperparameters to satisfy the condition in Proposition 4, so that the time efficiency of FedGS is guaranteed.
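To make the time-efficiency argument concrete, the following back-of-the-envelope comparison uses purely hypothetical numbers (not measurements from the paper, whose symbols are elided here): internal synchronization over fast 5G edge links is cheap, and only the BSs ship aggregated models over the WAN, whereas FedAvg must ship every device's model over the shared WAN uplink each round.

```python
# All numbers below are hypothetical illustrations, not values from the paper.
K = 20                      # iterations (internal synchronizations) per round
t_update = 0.10             # seconds per local update
t_select = 0.015            # GBPCS selection delay (15 ms, as reported)
t_edge_sync = 0.02          # internal sync over fast 5G edge links
t_wan_sync_bs = 1.0         # external sync: only BSs upload aggregated models
t_wan_sync_devices = 5.0    # FedAvg: all devices share the slow WAN uplink

# FedGS: K internal synchronizations, then one external synchronization.
t_fedgs = K * (t_update + t_select + t_edge_sync) + t_wan_sync_bs
# FedAvg: K local updates, then one device-to-cloud synchronization.
t_fedavg = K * t_update + t_wan_sync_devices

print(f"FedGS: {t_fedgs:.2f}s, FedAvg: {t_fedavg:.2f}s per round")
```

Under this 5G-edge assumption, the extra internal synchronizations of FedGS are more than paid for by the cheaper external synchronization, matching the direction of Proposition 4.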
VII Experimental Evaluation
VII-A Experiment Setup
Environment and Hyperparameter Setup. In the experiment, we consider an IIoT application where OCR technology is used to identify packing boxes, machines, robots, vehicles, and workers by recognizing the optical characters on their badges. To this end, we aim to train a high-accuracy OCR model in the federated setting, where the sensors' local character images are confidential and skewed in class distribution. The real-world FEMNIST[25] dataset is chosen to train our federated OCR model, as it is built by partitioning 805,263 optical digit and character images across 3,550 devices, following non-i.i.d. class distributions and uneven data sizes. Our experimental platform contains OCR cameras grouped into factories, each factory hosting a number of OCR cameras (hereinafter referred to as devices). In each iteration, several devices are selected from each factory to participate in the federated OCR training. A four-layer convolutional neural network [Conv2D(32), MaxPool, Conv2D(64), MaxPool, Dense(2048), Dense(62)] is used as the training model because it is lightweight and suitable for resource-constrained industrial devices. Unless otherwise specified, we use standard mini-batch SGD to train the local ML models, with fixed settings of the learning rate, batch size, number of iterations per round, and maximum number of rounds.
GBPCS Initialization. The choice of the initial point in GBPCS is critical to the quality of the solution, because a bad initial point may cause GBPCS to fall into a local minimum. In the experiment, some devices are pre-sampled at random, and the other devices are selected using GBPCS, with one of the following three initialization methods.

Random Initialization. Set values in the initial vector to 1 at random and leave the other values at 0.

Zero Initialization. All values in the initial vector are first set to 0. Then, a warm-up step is performed to meet the vector weight constraint in Eq. (13), in which the value with the smallest gradient is iteratively set to 1 until the required number of 1-values is reached. The warm-up step requires additional iterations.

Moore-Penrose Inverse Initialization (MPInv). MPInv first solves the least-squares solution of the unconstrained objective function. Then, the elements with the largest values in this solution are set to 1 and the others are left at 0 to obtain the initial point.
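The MPInv initializer above can be sketched with NumPy. This is a sketch under assumptions: we take the unconstrained objective to be a least-squares fit A x ≈ b (the exact objective and symbols are elided in this extracted text), and the names `A`, `b`, and `m` are hypothetical.

```python
import numpy as np

def mpinv_init(A, b, m):
    """MPInv initialization (sketch): solve the unconstrained least-squares
    problem via the Moore-Penrose pseudoinverse, then set the m largest
    entries of the solution to 1 to obtain a feasible 0-1 initial point."""
    x_ls = np.linalg.pinv(A) @ b            # min-norm least-squares solution
    x0 = np.zeros(A.shape[1])
    x0[np.argsort(x_ls)[-m:]] = 1.0         # keep the top-m entries as selected
    return x0

# Toy example: 2 classes, 3 candidate devices, select m = 2 devices.
A = np.array([[10.0, 0.0, 5.0],
              [0.0, 10.0, 5.0]])            # per-device class counts
b = np.array([10.0, 10.0])                  # target (balanced) class counts
x0 = mpinv_init(A, b, m=2)
```

Unlike the Zero initializer, no gradient-driven warm-up loop is needed, which is consistent with the speed advantage reported below.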
Comparison Algorithms. To highlight the efficiency and effectiveness of the proposed GBPCS, we consider the following five benchmark client selection methods for comparison.

Random Sampler (Random): From each group, devices are uniformly and randomly sampled.

Monte Carlo Sampler (MC): Repeat the random sampler 1000 times and use the solution that minimizes Eq. (10).

Genetic Sampler (GA): Search for a suboptimal set of devices using a genetic algorithm[48] to meet Eqs. (10)–(13), in which the constrained 0-1 vector solutions are regarded as genes and undergo selection, crossover, mutation, and elimination. By default, we set the population size to 100, the mutation probability to 0.001, and the number of generations to 100.
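The Monte Carlo sampler can be sketched as follows. The divergence metric here is an L2 stand-in for the paper's Eq. (6), whose exact form is elided in this extracted text, and all names are hypothetical.

```python
import numpy as np

def divergence(counts, selected, p_global):
    """L2 gap between the selected group's class distribution and the global
    one (a stand-in for Eq. (6))."""
    group = counts[selected].sum(axis=0)    # class totals of the selected group
    q = group / group.sum()                 # group class distribution
    return float(np.linalg.norm(q - p_global))

def monte_carlo_sampler(counts, p_global, m, trials=1000, seed=0):
    """Repeat uniform random selection `trials` times; keep the best draw."""
    rng = np.random.default_rng(seed)
    best, best_div = None, float("inf")
    for _ in range(trials):
        sel = rng.choice(len(counts), size=m, replace=False)
        d = divergence(counts, sel, p_global)
        if d < best_div:
            best, best_div = sel, d
    return best, best_div
```

On a toy set of four devices with class counts [8,2], [2,8], [5,5], [9,1] and a balanced global distribution, the sampler finds the balanced pair of the first two devices. Brute force would instead enumerate all subsets, which is what makes it prohibitively slow at realistic scales.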
In addition to the baseline FedAvg[24], nine other advanced approaches are also experimentally compared with FedGS in the presence of non-i.i.d. data: FedMMD[26], FedFusion[27], FedProx[28], IDA[29], CGAU[30], FedAvgM[31], and FedAdagrad, FedAdam, and FedYogi from [32].
Implementation. We implement FedGS on LeafMX (an MXNET[49] implementation of the standard FL benchmark LEAF[25]): https://github.com/Lizonghang/leafmx. The code is openly available on GitHub: https://github.com/Lizonghang/fedgs.
VII-B Results and Discussion
Comparison of initialization methods in GBPCS. The optimization curves of the class distribution divergence for the Zero, Random, and MPInv initializers are shown in Fig. 3. Both the Zero and MPInv initializers find high-quality solutions (0.029 and 0.030, respectively) close to the optimum of the brute-force search (0.028). In contrast, the Random initializer falls into a poor local optimum (0.044). Furthermore, the MPInv initializer is much faster because it does not require an additional warm-up procedure like the Zero initializer. Therefore, GBPCS is initialized with the MPInv initializer by default.
Comparison between GBPCS and other samplers. Since GBPCS client selection is performed in every iteration, both the quality and the time cost of the solution are critical to FedGS performance.
We first compare the distribution divergence (defined in Eq. (6)) of GBPCS against the five other benchmark samplers. Generally speaking, the smaller the gap between the class distribution of the group and the global distribution, the smaller the distribution divergence and the better the sampler. The distribution divergence across factories is shown in Fig. 4(a). As expected, the random sampler most commonly used in FedAvg leads to a high divergence in class distribution and causes non-i.i.d. data among groups, while the brute-force sampler always minimizes the divergence. The random sampler and the brute-force sampler thus give the upper and lower bounds of the distribution divergence, and the solutions of the other samplers lie in this interval, among which the GA and GBPCS samplers perform best.
In terms of execution time, FedGS prefers samplers with a very short execution time, because high-frequency client selection may introduce a non-negligible latency to FedGS and significantly slow down FL training. Fig. 4(b) compares the execution time of the above samplers. Let us focus on the brute-force, GA, and GBPCS samplers, because their solutions are of the best quality. The brute-force sampler requires 979 seconds to find the optimal solution, a latency too long to be acceptable. Therefore, FedGS prefers a suboptimal solution obtained in an acceptably short time. The GA and GBPCS samplers are both good choices here, and the proposed GBPCS sampler is 66× faster than the GA sampler, taking a negligible 15 milliseconds at the cost of only 0.001 in distribution divergence.
To highlight GBPCS more intuitively, we plot the optimization curve of distribution divergence over execution time in Fig. 4(c). The results show that the proposed GBPCS sampler converges to a high-quality solution (0.029), closest to the optimum (0.028), in the shortest time, demonstrating the superior effectiveness and efficiency of GBPCS.
Effects of hyperparameters in FedGS. Hyperparameters may have great effects on FedGS. To explore these effects, we perform a grid search over the experimental hyperparameters, including the batch size, the number of iterations per round, and the number of devices selected per group, as well as the environmental hyperparameter, the number of groups. Fig. 5(a) visualizes the test accuracy over different batch size and iteration settings. The results show that a moderately large number of iterations per round can improve the accuracy of FedGS, while the batch size has little effect. Fig. 5(b) visualizes the test accuracy over different numbers of groups and selected devices. Without loss of generality, both more groups and more selected devices bring gains in FL model accuracy because more devices' data is included. The default settings in this paper are chosen to meet the condition in Proposition 4. Note that the number of groups is determined by the real-world environment rather than being an adjustable hyperparameter.
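The grid search described above can be sketched with `itertools.product`. The candidate grids and the scoring function below are hypothetical placeholders (the paper's actual value sets are elided in this extracted text); in practice, `evaluate` would run a full FedGS training job and return its test accuracy.

```python
from itertools import product

# Hypothetical candidate grids; the paper's actual value sets are not shown here.
batch_sizes = [16, 32, 64]
iters_per_round = [10, 20, 40]

def evaluate(batch_size, iters):
    # Placeholder for a full FedGS training run returning test accuracy.
    # Its monotone shape mimics the reported trend: more iterations per
    # round help, while the batch size matters little.
    return 0.80 + 0.0005 * iters - 0.00001 * batch_size

# Exhaustively score every (batch size, iterations) combination.
best_cfg = max(product(batch_sizes, iters_per_round),
               key=lambda cfg: evaluate(*cfg))
```

With this placeholder score, the search selects the smallest batch size and the largest iteration count, consistent with the trend reported for Fig. 5(a).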
TABLE II
Approach            | Test Accuracy | Test Loss | Rounds to 82%
FedAvg (Baseline)   | 82.1%         | 0.587     | 478
FedProx             | 82.0%         | 0.586     | 497
IDA                 | 81.0%         | 0.628     | –
IDA+INTRAC          | 81.0%         | 0.618     | –
IDA+FedAvg          | 80.5%         | 0.687     | –
CGAU                | 83.3%         | 0.509     | 202
FedMMD              | 83.0%         | 0.564     | 378
FedFusion+Conv      | 81.7%         | 0.624     | –
FedFusion+Multi     | 82.0%         | 0.591     | 486
FedFusion+Single    | 80.7%         | 0.627     | –
FedAvgM             | 84.4%         | 0.820     | 68
FedAdagrad          | 83.8%         | 0.583     | 264
FedAdam             | 85.0%         | 0.662     | 71
FedYogi             | 84.6%         | 0.590     | 76
FedGS               | 86.0%         | 0.435     | 147
Comparison between FedGS and other federated approaches. We take ten advanced federated approaches for comparison to show the state-of-the-art performance of the proposed FedGS in the presence of non-i.i.d. data. The test accuracy, test loss, and number of training rounds required to reach an accuracy of 82% are listed in Table II, and detailed training curves are given in Fig. 6. Unless otherwise specified, all the comparison approaches use the default local epoch setting. In the following, we compare these approaches and analyze their results.
FedGS vs FedProx. FedProx adds a proximal term to the local loss functions to penalize divergent local models. We tune the penalty constant to find the best result in Figs. 6(a) and 6(d). However, FedProx performs poorly in our case, with an accuracy of 82.0%, not even exceeding the baseline accuracy of 82.1% of FedAvg. The reason may be that the proximal penalty term slows convergence by forcing local models to stay close to the starting point[28]. In contrast, the proposed FedGS improves the baseline accuracy by 3.9% and achieves an accuracy of 86.0%.
FedGS vs IDA. IDA weights the model parameters of devices based on their inverse distance to the averaged model parameters during aggregation. We combine IDA with inverse training accuracy coefficients (IDA+INTRAC) and normalized data size coefficients (IDA+FedAvg) as suggested by the authors. However, Figs. 6(b) and 6(e) show that the IDA-series approaches suffer an accuracy degradation. This is because devices with large parameter deviations are over-suppressed, causing the global model to lose the data knowledge on these devices. Besides, IDA must cache the model parameters uploaded by all devices until the average model parameters and inverse distance coefficients are calculated, which consumes substantial memory on the server.
FedGS vs CGAU. CGAU uses gated activation units on top of a pre-trained model to enable client-specific expression of heterogeneous data. We train a 1-layer and a 2-layer CGAU classifier with 256 units, respectively (namely Fine-Tuning+1CGAU and Fine-Tuning+2CGAU). Unlike the authors, we do not use dropout layers, because we observed a 3.3% drop in accuracy after using them. Figs. 6(c) and 6(f) show that Fine-Tuning+1CGAU achieves a higher accuracy of 83.3%, which improves the baseline accuracy by 1.2% and benefits from the fast convergence of the pre-trained model. Despite these gains, the proposed FedGS still achieves 2.7% higher accuracy, lower test loss, and faster convergence.
FedGS vs FedMMD. FedMMD uses transfer learning[50] to better merge the knowledge of the global model into the local model. As suggested by the authors, we use the MMD distance with the recommended penalty coefficient. As shown in Figs. 6(g) and 6(j), FedMMD improves the baseline accuracy by 0.9% and achieves an accuracy of 83.0%, but the proposed FedGS improves upon that by another 3%.
FedGS vs FedFusion. FedFusion fuses the global and local features using operators such as convolution (FedFusion+Conv), vector weighted average (FedFusion+Multi), and scalar weighted average (FedFusion+Single). However, the results in Figs. 6(h) and 6(k) show that FedFusion+Multi and FedFusion+Conv only achieve accuracy similar to the baseline (82.0% and 81.7%, respectively), and FedFusion+Single even decreases the accuracy by 1.4%. In contrast, the proposed FedGS is clearly better and faster.
FedGS vs FedOpt. FedOpt is a general paradigm for a series of adaptive federated optimizers, including FedAvgM, FedAdagrad, FedAdam, and FedYogi, which dynamically adjust the learning rates of gradients to accelerate convergence. Preliminary experiments show that the convergence performance of these approaches is indeed significantly improved, but they are particularly sensitive to the initial learning rates. Following the authors, we search for the best setting of the client-side and server-side initial learning rates in Fig. 7 and give the best results in Figs. 6(i) and 6(l); the other hyperparameters follow the authors' settings. The results show that these approaches improve the baseline accuracy with fast convergence, especially FedAdam. However, the accuracy of the proposed FedGS is still 1% higher.
To sum up, FedAvgM, FedAdagrad, FedAdam, and FedYogi are generally better than the other comparison approaches (some of which cannot even reach an accuracy of 82%, marked with "–" in Table II). In contrast, FedGS achieves a state-of-the-art accuracy of 86.0%, which is 3.9% higher than the baseline and 3.5% higher than the average accuracy of the comparison approaches. In addition, FedGS reaches the accuracy of 82% in only 147 rounds, which is 3.3× faster than FedAvg and reduces the training rounds by 59% on average. These comprehensive experiments prove the effectiveness and efficiency of FedGS.
VIII Conclusion
FL in IIoT is emerging as a field of great value with increasing interest from both academia and industry; however, it still faces the challenge of non-i.i.d. data, which currently remains open. In this paper, we propose FedGS, a hierarchical cloud-edge-end FL framework for 5G-empowered modern industries. To minimize the divergence in data distributions among factories, we propose a constrained gradient-based optimizer, namely GBPCS, to select a subset of devices in each factory to construct homogeneous FL super nodes. GBPCS can find a desirable selection strategy in a very short time, and can also be used for other practical cases such as game matching. Then, to eliminate the impact of the residual non-i.i.d. data within the super nodes, we use a compound-step synchronization protocol to coordinate the training process. This protocol uses the data-heterogeneity-insensitive one-step synchronization protocol within the super nodes to suppress the negative impact of data heterogeneity, then uses the multi-step synchronization protocol among the super nodes to reduce communication frequency. The proposed approach takes into account the natural geographical clustering of factory devices and can adapt to rapidly changing streaming data at runtime, without exposing confidential data to high-risk data manipulation. Theoretical analysis shows that FedGS achieves both a convergence rate and an optimality gap better than the benchmark FedAvg, and can be more time-efficient under a relaxed hyperparameter condition. Extensive experiments against ten advanced approaches demonstrate the state-of-the-art performance of FedGS on non-i.i.d. data.
References
[1] Chen, Fei, Bo Li, Rong Dong, et al.: "High-performance OCR on packing boxes in industry based on deep learning." In: Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. 1018–1030, Springer, Cham, 2018.
[2] Liukkonen, Mika, and Tsung-Nan Tsai: "Toward decentralized intelligence in manufacturing: Recent trends in automatic identification of things." International Journal of Advanced Manufacturing Technology (JAMT) 87, no. 9, pp. 2509–2531, 2016.
[3] Kairouz, Peter, H. Brendan McMahan, Brendan Avent, et al.: "Advances and open problems in federated learning." Foundations and Trends in Machine Learning 14, no. 1–2, pp. 1–210, 2021.
[4] Hiessl, Thomas, Daniel Schall, Jana Kemnitz, et al.: "Industrial federated learning: Requirements and system design." In: International Conference on Practical Applications of Agents and Multi-Agent Systems (PAAMS), pp. 42–53, Springer, Cham, 2020.
[5] Lim, Wei Yang Bryan, Nguyen Cong Luong, Dinh Thai Hoang, et al.: "Federated learning in mobile edge networks: A comprehensive survey." IEEE Communications Surveys & Tutorials (COMST) 22, no. 3, pp. 2031–2063, 2020.
[6] Pham, Quoc-Viet, Kapal Dev, Praveen Kumar Reddy Maddikunta, et al.: "Fusion of federated learning and industrial internet of things: A survey." arXiv preprint arXiv:2101.00798, 2021.
[7] Zhang, Weishan, Qinghua Lu, Qiuyu Yu, et al.: "Blockchain-based federated learning for device failure detection in industrial IoT." IEEE Internet of Things Journal (IOTJ) 8, no. 7, pp. 5926–5937, 2020.
[8] Luo, Jiahuan, Xueyang Wu, Yun Luo, et al.: "Real-world image datasets for federated learning." arXiv preprint arXiv:1910.11089, 2019.
[9] Zhao, Yue, Meng Li, Liangzhen Lai, et al.: "Federated learning with non-iid data." arXiv preprint arXiv:1806.00582, 2018.
[10] Yao, Xin, Tianchi Huang, Rui-Xiao Zhang, et al.: "Federated learning with unbiased gradient aggregation and controllable meta updating." In: Workshop on Federated Learning for Data Privacy and Confidentiality (FL-NeurIPS 2019, in conjunction with NeurIPS 2019), 2019.
[11] Yoshida, Naoya, Takayuki Nishio, Masahiro Morikura, et al.: "Hybrid-FL for wireless networks: Cooperative learning mechanism using non-iid data." In: ICC 2020 – 2020 IEEE International Conference on Communications (ICC), pp. 1–7, IEEE, 2020.
[12] Zhang, Wenyu, Xiumin Wang, Pan Zhou, et al.: "Client selection for federated learning with non-iid data in mobile edge computing." IEEE Access 9, pp. 24462–24474, 2021.
[13] Zhao, Zhongyuan, Chenyuan Feng, Wei Hong, et al.: "Federated learning with non-iid data in wireless networks." IEEE Transactions on Wireless Communications (TWC), 2021.
[14] Duan, Moming, Duo Liu, Xianzhang Chen, et al.: "Astraea: Self-balancing federated learning for improving classification accuracy of mobile deep learning applications." In: IEEE 37th International Conference on Computer Design (ICCD), pp. 246–254, 2019.
[15] Wang, Han, Luis Muñoz-González, David Eklund, et al.: "Non-iid data rebalancing at IoT edge with peer-to-peer federated learning for anomaly detection." In: Proceedings of the 14th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), pp. 153–163, 2021.
[16] Wen, Hui, Yue Wu, Chenming Yang, et al.: "A unified federated learning framework for wireless communications: Towards privacy, efficiency, and security." In: IEEE INFOCOM 2020 – IEEE Conference on Computer Communications Workshops, pp. 653–658, IEEE, 2020.
[17] Sattler, Felix, Klaus-Robert Müller, and Wojciech Samek: "Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints." IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2020.
[18] Wang, Hao, Zakhary Kaplan, Di Niu, et al.: "Optimizing federated learning on non-iid data with reinforcement learning." In: IEEE INFOCOM 2020 – IEEE Conference on Computer Communications, pp. 1698–1707, IEEE, 2020.
[19] Zeng, Shenglai, Zonghang Li, Hongfang Yu, et al.: "Heterogeneous federated learning via grouped sequential-to-parallel training." In: 27th International Conference on Database Systems for Advanced Applications (DASFAA), 2022.
[20] Nishio, Takayuki, and Ryo Yonetani: "Client selection for federated learning with heterogeneous resources in mobile edge." In: ICC 2019 – 2019 IEEE International Conference on Communications (ICC), pp. 1–7, IEEE, 2019.
[21] Pang, Junjie, Yan Huang, Zhenzhen Xie, et al.: "Realizing the heterogeneity: A self-organized federated learning framework for IoT." IEEE Internet of Things Journal (IOTJ) 8, no. 5, pp. 3088–3098, 2020.
[22] Hiessl, Thomas: "Cohort-based federated learning services for industrial collaboration on the edge." TechRxiv, preprint, 2021.
[23] Zinkevich, Martin, Markus Weimer, Alexander J. Smola, et al.: "Parallelized stochastic gradient descent." In: 24th Conference on Neural Information Processing Systems (NeurIPS), 2010.
[24] McMahan, Brendan, Eider Moore, Daniel Ramage, et al.: "Communication-efficient learning of deep networks from decentralized data." In: 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1273–1282, 2017.
[25] Caldas, Sebastian, Sai Meher Karthik Duddu, Peter Wu, et al.: "Leaf: A benchmark for federated settings." In: 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019.
[26] Yao, Xin, Chaofeng Huang, and Lifeng Sun: "Two-stream federated learning: Reduce the communication costs." In: 2018 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4, 2018.
[27] Yao, Xin, Tianchi Huang, Chenglei Wu, et al.: "Towards faster and better federated learning: A feature fusion approach." In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 175–179, 2019.
[28] Li, Tian, Anit Kumar Sahu, Manzil Zaheer, et al.: "Federated optimization in heterogeneous networks." In: Conference on Machine Learning and Systems (MLSys), 2020.
[29] Yeganeh, Yousef, Azade Farshad, Nassir Navab, et al.: "Inverse distance aggregation for federated learning with non-iid data." In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pp. 150–159, Springer, Cham, 2020.
[30] Rieger, Laura, Rasmus M. Th. Høegh, and Lars K. Hansen: "Client adaptation improves federated learning with simulated non-iid clients." In: International Workshop on Federated Learning for User Privacy and Data Confidentiality in conjunction with ICML, 2020.
[31] Hsu, Tzu-Ming Harry, Hang Qi, and Matthew Brown: "Measuring the effects of non-identical data distribution for federated visual classification." In: International Workshop on Federated Learning for Data Privacy and Confidentiality in conjunction with NeurIPS, 2019.
[32] Reddi, Sashank, Zachary Charles, Manzil Zaheer, et al.: "Adaptive federated optimization." In: International Conference on Learning Representations (ICLR), 2021.
[33] Jeong, Eunjeong, Seungeun Oh, Hyesung Kim, et al.: "Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data." In: Workshop on Machine Learning on the Phone and other Consumer Devices, Montréal, Canada, 2018.
[34] Wang, Shiqiang, Tiffany Tuor, Theodoros Salonidis, et al.: "Adaptive federated learning in resource constrained edge computing systems." IEEE Journal on Selected Areas in Communications (JSAC) 37, no. 6, pp. 1205–1221, 2019.
[35] Yu, Hao, Sen Yang, and Shenghuo Zhu: "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning." In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) 33, no. 1, pp. 5693–5700, 2019.
[36] Li, Xiang, Kaixuan Huang, Wenhao Yang, et al.: "On the convergence of FedAvg on non-iid data." In: International Conference on Learning Representations (ICLR), 2020.
[37] Nesterov, Yu: "Gradient methods for minimizing composite functions." Mathematical Programming 140, no. 1, pp. 125–161, 2013.
[38] Ward, Rachel, Xiaoxia Wu, and Leon Bottou: "AdaGrad stepsizes: Sharp convergence over nonconvex landscapes." In: International Conference on Machine Learning (ICML), pp. 6677–6686, 2019.
[39] Kingma, Diederik P., and Jimmy Ba: "Adam: A method for stochastic optimization." In: International Conference on Learning Representations (ICLR), 2015.
[40] Zaheer, Manzil, Sashank Reddi, Devendra Sachan, et al.: "Adaptive methods for nonconvex optimization." In: 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada, 2018.
[41] Liu, Yi, Sahil Garg, Jiangtian Nie, et al.: "Deep anomaly detection for time-series data in industrial IoT: A communication-efficient on-device federated learning approach." IEEE Internet of Things Journal (IOTJ) 8, no. 8, pp. 6348–6358, 2020.
[42] Akpakwu, Godfrey Anuga, Bruno J. Silva, Gerhard P. Hancke, et al.: "A survey on 5G networks for the internet of things: Communication technologies and challenges." IEEE Access 6, pp. 3619–3647, 2017.
[43] Li, Zonghang, Huaman Zhou, Tianyao Zhou, et al.: "ESync: Accelerating intra-domain federated learning in heterogeneous data centers." IEEE Transactions on Services Computing (TSC), 2020.
[44] Rice, Bart: "The 0-1 integer programming problem in a finite ring with identity." Computers and Mathematics with Applications 7, no. 6, pp. 497–502, 1981.
[45] Zhou, Huaman, Zonghang Li, Qingqing Cai, et al.: "DGT: A contribution-aware differential gradient transmission mechanism for distributed machine learning." Future Generation Computer Systems (FGCS) 121, pp. 35–47, 2021.
[46] Varga, Pal, Jozsef Peto, Attila Franko, et al.: "5G support for industrial IoT applications: Challenges, solutions, and research gaps." Sensors 20, no. 3, p. 828, 2020.
[47] Nogueira, Fernando, et al.: "Bayesian optimization: Open source constrained global optimization tool for Python." [Online]: https://github.com/fmfn/BayesianOptimization, 2014.
[48] Whitley, Darrell: "A genetic algorithm tutorial." Statistics and Computing 4, no. 2, pp. 65–85, 1994.
[49] Chen, Tianqi, Mu Li, Yutian Li, et al.: "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems." In: 30th Conference on Neural Information Processing Systems (NeurIPS), 2016.
[50] Pan, Sinno Jialin, and Qiang Yang: "A survey on transfer learning." IEEE Transactions on Knowledge and Data Engineering (TKDE) 22, no. 10, pp. 1345–1359, 2009.