I Introduction
Beyond 5G systems are expected to leverage cross-fertilizations between wireless systems, core networking, Machine Learning (ML) and Artificial Intelligence (AI) techniques, targeting not only communication and networking tasks, but also augmented environmental perception services
[1]. The combination of powerful AI tools, e.g., Deep Neural Networks (DNN), with the massive usage of the Internet of Things (IoT) is expected to provide advanced services [2] in several domains such as Industry 4.0 [3], Cyber-Physical Systems (CPS) [4] and smart mobility [5]. Considering this envisioned landscape, it is of paramount importance to integrate emerging deep learning breakthroughs within future generation wireless networks, characterized by arbitrary distributed connectivity patterns (e.g., mesh, cooperative, peer-to-peer, or spontaneous), along with strict constraints in terms of latency
[6] (i.e., to support Ultra-Reliable Low-Latency Communications, URLLC) and battery lifetime.

Recently, federated optimization, or federated learning (FL) [7][8], has emerged as a new paradigm in distributed ML setups [9]. The goal of FL systems is to train a shared global model (i.e., a Neural Network, NN) from a federation of participating devices acting as local learners under the coordination of a central server for model aggregation. As shown in Fig. 1.a, FL alternates between a local model computation at each device and a round of communication with a server. Devices, or workers, derive a set of local learning parameters from the available training data, referred to as the local model. The local model $\mathbf{W}_{t,k}$ at time $t$ and device $k$ is typically obtained via backpropagation and Stochastic Gradient Descent (SGD)
[10] methods using the local training examples $\mathcal{E}_k$. The server obtains a global model $\mathbf{W}_t$ by fusion of the local models and then feeds such model back to the devices. Multiple rounds are repeated until convergence is reached. The objective of FL is thus to build a global model by the cooperation of a number of devices. The model is characterized by the parameters $\mathbf{W}$ (i.e., NN weights and biases for each layer) that map the observed (input) data $\mathbf{x}$ to the output quantity $\mathbf{y}$. Since it decouples the ML stages from the need to send data to the server, FL provides strong privacy advantages compared with conventional centralized learning methods.

I-A Decentralized FL and related works
Next generation networks are expected to be underpinned by new forms of decentralized, infrastructure-less communication paradigms [11] enabling devices to cooperate directly over device-to-device (D2D) spontaneous connections (e.g., multi-hop or mesh). These networks are designed to operate, when needed, without the support of a central coordinator, or with limited support for synchronization and signalling. They are typically deployed in mission-critical control applications where the edge nodes cannot rely on a remote unit for fast feedback and have to manage part of the computing tasks locally [12], cooperating with neighbors to disclose the relevant information. Typical examples are low-latency safety-related services for vehicular [13] or industrial [14] applications. Considering these trends, research activities are now focusing on fully decentralized (i.e., server-less) learning approaches. The optimization of the model running on the devices under privacy constraints is also critical [16] for human-related applications.
To the authors' knowledge, few attempts have been made to address the problem of decentralized FL. In [17][18], a gossip protocol is adopted for ML systems. Through sum-weight gossip, the local models are propagated over a peer-to-peer network. However, in FL over D2D networks, gossip-based methods cannot be fully used because of medium access control (MAC) and half-duplex constraints, which are ignored in these early approaches. More recently, [19] considers an application of distributed learning for medical data centers, where several servers collaborate to learn a common model over a fully connected network. However, network scalability/connectivity issues are not considered at all. In [20], a segmented gossip aggregation is proposed. The global model is split into non-overlapping subsets and the local learners aggregate the segmentation sets from other learners. Having split the model, by introducing ad-hoc subsets and segment management tasks, the approach is extremely application-dependent and not suitable for more general ML contexts. Finally, [21][22] propose a peer-to-peer Bayesian-like approach to iteratively estimate the posterior model distribution. The focus is on convergence speed and load balancing, yet the simulations are limited to a few nodes, and the proposed method cannot be easily generalized to NN systems trained by incremental gradient (i.e., SGD) or momentum-based methods.

I-B Contributions
This paper proposes the application of the FL principles to massively dense and fully decentralized networks that do not rely upon a central server coordinating the learning process. As shown in Fig. 1.b, the proposed FL algorithms leverage the mutual cooperation of devices that perform data operations inside the network (in-network) via consensus-based methods [23]. The devices independently perform training steps on their local dataset (batches) based on a local objective function, by using SGD and the fused models received from the neighbors. Next, similarly to gossip [17], the devices forward the model updates to their one-hop neighborhood for a new consensus step. Unlike the methods in [17][22], the proposed approach is general enough to be seamlessly applied to any NN model trained by SGD or momentum methods. In addition, in this paper we investigate the scalability problem for varying NN model layer sizes, considering large and dense D2D networks with different connectivity graphs. These topics are discussed, for the first time, by focusing on an experimental Industrial IoT (IIoT) setting.
The paper contributions are summarized in the following:

a novel class of FL algorithms based on the iterative exchange of both model updates and gradients is proposed to improve convergence and minimize the number of communication rounds, in exchange for a more intensive use of the D2D links and of local computing;

all the presented algorithms are validated over large-scale massive networks with intermittent, sporadic or varying connectivity, focusing in particular on an experimental IIoT setup, and considering complexity, convergence speed, communication overhead, and average execution time on embedded devices.
The paper is organized as follows. Sect. II reviews the FL problem. Sect. III proposes two consensus-based algorithms for decentralized FL. The validation of the proposed methods is first addressed in Sect. IV on a simple network and scenario. Then, in Sect. V, the validation is extended to a large-scale setup by focusing on a real-world problem in the IIoT domain. Finally, Sect. VI draws some conclusions and proposes future investigations.
II FL for model optimization
The FL approach defines an incremental algorithm for model optimization over a large population of devices. It belongs to the family of incremental gradient algorithms [26][29] but, unlike these setups, optimization typically focuses on the non-convex objectives that are commonly found in NN problems. The goal is to learn a global model for inference problems (i.e., classification or regression applications) that transforms the input observation vector $\mathbf{x}$ into the outputs $\hat{\mathbf{y}} \in \mathbb{R}^{M}$, with the model parameters embodied in the matrix $\mathbf{W}$, while $M$ is the output size. Observations, or input data, are stored across $K$ devices connected with a central server that coordinates the learning process. A common, and practical, assumption is that the number of participating devices is large ($K \gg 1$) and that they have intermittent connectivity with the server. The cost of communication is also much higher than that of local computation, in terms of capacity, latency and energy consumption [25]. In this paper, we will focus specifically on NN models. Therefore, considering a NN of $Q$ layers, the model iteratively computes a nonlinear function of a weighted sum of the input values, namely

$\mathbf{a}_{q} = f\!\left(\mathbf{W}_{q}\,\mathbf{a}_{q-1} + \mathbf{b}_{q}\right), \quad q = 1, \dots, Q,$ (1)

with $\mathbf{a}_{1}, \dots, \mathbf{a}_{Q-1}$ being the hidden layers and $\mathbf{a}_{0} = \mathbf{x}$ the input vector. The matrix

$\mathbf{W} = \left[\mathbf{W}_{1}, \mathbf{b}_{1}, \dots, \mathbf{W}_{Q}, \mathbf{b}_{Q}\right]$ (2)

collects all the parameters of the model, namely the weights $\mathbf{W}_{q}$ and the biases $\mathbf{b}_{q}$ for each defined layer, with $N$ and $M$ the corresponding input and output layer dimensions, respectively (to simplify the notation, here we assume that the layers have equal input/output size, but the model can be generalized to account for different dimensions; see Sect. V). FL applies independently to each layer of the network; weights and biases are also optimized independently. Therefore, in what follows, optimization focuses on one layer and the matrix (2) reduces to $\mathbf{W} = \left[\mathbf{W}_{q}, \mathbf{b}_{q}\right]$. The parameters of the convolutional layers, namely the input, output and kernel dimensions, can be easily reshaped to conform with the above general representation.
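As a concrete illustration of the layered computation in (1), a minimal NumPy sketch follows; function and variable names are ours, and the tanh activation is only an example choice:

```python
import numpy as np

def forward(x, layers, act=np.tanh):
    """Iterate a_q = act(W_q @ a_{q-1} + b_q) over the layers, with a_0 = x.
    `layers` is a list of (W, b) pairs, one per layer (illustrative names)."""
    a = x
    for W, b in layers:
        a = act(W @ a + b)
    return a
```

Stacking the per-layer `(W, b)` pairs in a single list mirrors the matrix (2), which collects all weights and biases of the model.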
In FL it is commonly assumed [7][8] that a large database of $E$ examples is (unevenly) partitioned among the $K$ devices, under non-Independent and Identically Distributed (non-IID) assumptions. Examples are thus organized as tuples $(\mathbf{x}_h, \mathbf{y}_h)$, where $\mathbf{x}_h$ represents the data, while $\mathbf{y}_h$ are the desired model outputs. The set of examples, or training data, available at device $k$ is $\mathcal{E}_k$, where $E_k = |\mathcal{E}_k|$ is the size of the $k$-th dataset under the non-IID assumption. The training data on a given device is thus not representative of the full population distribution. In practical setups (see Sect. V), data is collected individually by the devices based on their local/partial observations of the given phenomenon.
Unlike incremental gradient algorithms [30][31], FL of the model $\mathbf{W}$ is applicable to any finite-sum objective of the general form

$L(\mathbf{W}) = \sum_{k=1}^{K} \frac{E_k}{E}\, L_k(\mathbf{W}),$ (3)

where $L_k(\mathbf{W})$ is the loss, or cost, associated with the $k$-th device,

$L_k(\mathbf{W}) = \frac{1}{E_k} \sum_{h \in \mathcal{E}_k} \ell\!\left(\mathbf{x}_h, \mathbf{y}_h; \mathbf{W}\right),$ (4)

and $\ell(\mathbf{x}_h, \mathbf{y}_h; \mathbf{W})$ is the loss of the predicted model over the examples $(\mathbf{x}_h, \mathbf{y}_h)$ observed by the device $k$, assuming the model parameters $\mathbf{W}$ to hold.
In conventional centralized ML (i.e., learning without federation), used here as a benchmark, the server collects all the local training data from the devices and optimizes the model parameters by applying an incremental gradient method over a number of batches from the training dataset. At iteration $t$, the model parameters are thus updated by the server according to

$\mathbf{W}_{t+1} = \mathbf{W}_{t} - \mu\, \nabla L(\mathbf{W}_{t}),$ (5)

where $\mu$ is the SGD step size and $\nabla L(\mathbf{W}_{t})$ the gradient of the loss in (3) over the assigned batches and w.r.t. the model $\mathbf{W}_{t}$. Backpropagation is used here for gradient computation. The model estimate at convergence is denoted as $\widehat{\mathbf{W}}$.

Rather than sharing the training data with the server, in FL the model parameters are optimized collectively by the $K$ interconnected devices, acting as local learners. On each round of communication, the server distributes the current global model $\mathbf{W}_{t}$ to a subset of devices. The devices independently update the model using the gradients (SGD) from the local training data as

$\mathbf{W}_{t+1,k} = \mathbf{W}_{t} - \mu\, \nabla L_k(\mathbf{W}_{t}),$ (6)

where $\nabla L_k(\mathbf{W}_{t})$ represents the gradient of the loss (4) observed by the $k$-th device and w.r.t. the model $\mathbf{W}_{t}$. The updates (6), or local models $\mathbf{W}_{t+1,k}$, are sent back to the server, after quantization, anonymization [7] and compression stages, modelled here by the operator $\mathcal{Q}(\cdot)$. A global model update is obtained by the server through aggregation according to

$\mathbf{W}_{t+1} = \sum_{k=1}^{K} \frac{E_k}{E}\, \mathcal{Q}\!\left(\mathbf{W}_{t+1,k}\right).$ (7)

Convergence towards the centralized approach (5) is achieved if $\mathbf{W}_{t} \rightarrow \widehat{\mathbf{W}}$ for $t \rightarrow \infty$. Notice that the learning rate $\mu$ is typically kept smaller [9] compared with centralized learning (5) on large datasets. The aggregation model (7) is referred to as Federated Averaging (FA) [7][15]. As far as convergence is concerned, for strongly convex objectives and generic local solvers, a general upper bound on the number of global iterations is given in [24] and relates both the global ($\epsilon_0$) and the local ($\vartheta$) accuracy according to $\mathcal{O}\!\left[\log(1/\epsilon_0)/(1-\vartheta)\right]$.
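The FA loop of (6)-(7) can be sketched as follows; the least-squares loss, the function names and the dataset-size weighting details are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def local_update(w_global, X, y, mu=0.1):
    """Local SGD step as in (6), on a hypothetical least-squares loss."""
    grad = X.T @ (X @ w_global - y) / len(y)
    return w_global - mu * grad

def fa_round(w_global, devices, mu=0.1):
    """One FA round (7): broadcast, local updates, size-weighted average.
    `devices` is a list of (X, y) local datasets (illustrative structure)."""
    updates = [local_update(w_global, X, y, mu) for X, y in devices]
    E = np.array([len(y) for _, y in devices], dtype=float)
    # weight each local model by its dataset share E_k / E
    return sum(e * w for e, w in zip(E / E.sum(), updates))
```

Iterating `fa_round` drives the global model toward the centralized solution when the local losses are consistent, mirroring the convergence condition stated above.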
III A consensus-based approach to in-network FL
The approaches proposed in this section allow the devices to learn the model parameters, solution of (3), by relying solely on local cooperation with neighbors and on local in-network (as opposed to centralized) processing. The interaction topology of the network is modelled as a directed graph $\mathcal{G} = (\mathcal{N}, \mathcal{L})$ with set of nodes $\mathcal{N}$ and edges (links) $\mathcal{L}$. As depicted in Fig. 1, the distributed devices are connected through a decentralized communication architecture based on D2D communications. The neighbor set of device $k$ is denoted as $\mathcal{N}_k$, with cardinality $|\mathcal{N}_k|$. Notice that the set $\mathcal{N}_k$ includes node $k$, while the set $\mathcal{N}_k^{-} = \mathcal{N}_k \setminus \{k\}$ does not. As introduced in the previous section, each device $k$ has a database of examples $\mathcal{E}_k$ that are used to train a local NN model $\mathbf{W}_{t,k}$ at some time $t$ (epoch). The model maps the input features $\mathbf{x}$ into the outputs $\hat{\mathbf{y}}$ as in (1). A cost function, generally non-convex, as in (4), is used to optimize the weights of the local model.

The proposed FL approaches exploit both adaptive diffusion [31] and consensus tools [23][27] to optimally leverage the (possibly large) population of federated devices that cooperate for the distributed estimation of the global model $\mathbf{W}$, while retaining the training data. Convergence is thus obtained if it is $\mathbf{W}_{t,k} \rightarrow \widehat{\mathbf{W}}$, $\forall k$, for $t \rightarrow \infty$. Distributed in-network model optimization must satisfy convergence time constraints, as well as minimize the number of communication rounds. In what follows, we propose two strategies that differ in the way the model updates and the gradients are computed and updated.
III-A Consensus-based Federated Averaging (CFA)
The first strategy extends the centralized FA and is described in the pseudocode fragment of Algorithm 1. It is referred to as Consensus-based Federated Averaging (CFA).
After the initialization of $\mathbf{W}_{0,k}$ at time $t=0$ (each device hosts a model of the same architecture, similarly initialized), on each communication round $t$, device $k$ sends its model update (once per round) and receives the weights $\mathbf{W}_{t,i}$ from the neighbors $i \in \mathcal{N}_k^{-}$. Based on the received data, the device updates its model sequentially to obtain the aggregated model

$\boldsymbol{\psi}_{t,k} = \mathbf{W}_{t,k} + \epsilon_t \sum_{i \in \mathcal{N}_k^{-}} \beta_{k,i} \left(\mathbf{W}_{t,i} - \mathbf{W}_{t,k}\right),$ (8)

where $\epsilon_t$ is the consensus step size and $\beta_{k,i}$, $i \in \mathcal{N}_k^{-}$, are the mixing weights for the models. Next, a gradient update is performed using the aggregated model as

$\mathbf{W}_{t+1,k} = \boldsymbol{\psi}_{t,k} - \mu\, \nabla L_k(\boldsymbol{\psi}_{t,k}),$ (9)

by running SGD over a number of mini-batches of size $B$. The model aggregation (8) is similar to the sum-weight gossip protocol [17, 18] when setting $\epsilon_t = 1$. However, the mixing weights are used here to combine the model innovations $\mathbf{W}_{t,i} - \mathbf{W}_{t,k}$, $i \in \mathcal{N}_k^{-}$. In addition, the step size $\epsilon_t$ controls the consensus stability.
Inspired by FA approaches (Sect. II), the mixing weights are chosen as

$\beta_{k,i} = \frac{E_i}{\sum_{j \in \mathcal{N}_k^{-}} E_j}.$ (10)

Other choices are based on weighted consensus strategies [23], where the mixing weights are adapted on each epoch based on the current validation accuracy or loss metrics. The consensus step size can be chosen as $\epsilon_t < 1/\Delta$, where $\Delta = \max_k |\mathcal{N}_k^{-}|$ is the maximum degree of the graph [28] that models the interaction topology of the network. Notice that the graph has adjacency matrix $\mathbf{A} = [a_{k,i}]$, where $a_{k,i} = 1$ iff $i \in \mathcal{N}_k^{-}$ and $a_{k,i} = 0$ otherwise. Besides the consensus step size, it is additionally assumed that the SGD step size $\mu$ is optimized for convergence: namely, the objective function value decreases with each iteration of gradient descent, at least after an initial transient. Convergence is further analyzed in Sect. V with experimental data.
By defining $\mathcal{P}_{t,k}$ as the set of parameters to be exchanged among neighbors, CFA requires the iterative exchange of the model updates $\mathbf{W}_{t,k}$, therefore

$\mathcal{P}_{t,k} = \left\{\mathbf{W}_{t,k}\right\}.$ (11)
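A minimal sketch of one CFA round per device, combining the consensus aggregation (8) with the local SGD step (9); the uniform mixing weights and the quadratic local losses used in the example are our simplifying assumptions:

```python
import numpy as np

def cfa_round(w_k, neighbor_models, grad_fn, eps=0.5, mu=0.05, sgd_steps=1):
    """One CFA round for device k (sketch).
    neighbor_models: models received from the neighbor set (excluding k);
    grad_fn(w): gradient of the local loss L_k evaluated at w."""
    psi = w_k
    if neighbor_models:
        b = 1.0 / len(neighbor_models)  # uniform mixing weights (assumption)
        # consensus aggregation (8): combine model innovations w_i - w_k
        psi = w_k + eps * sum(b * (w_i - w_k) for w_i in neighbor_models)
    for _ in range(sgd_steps):          # local SGD on the aggregated model (9)
        psi = psi - mu * grad_fn(psi)
    return psi
```

With the consensus step size kept below the inverse maximum degree of the graph, repeated rounds drive neighboring devices toward a common model.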
III-B Consensus-based Federated Averaging with Gradients Exchange (CFA-GE)
The second strategy proposes the joint exchange of local gradients and model updates by following the four-stage iterative procedure illustrated in Fig. 2.a for epoch $t$. The new algorithm is referred to as Consensus-based Federated Averaging with Gradients Exchange (CFA-GE). The first stage (step #1) is similar to CFA and obtains $\boldsymbol{\psi}_{t,k}$ by the consensus-based model aggregation in (8). Before using $\boldsymbol{\psi}_{t,k}$ for the local model update, it is fed back to the same neighbors ("negotiation" stage in step #2 of Fig. 2.b). The model $\boldsymbol{\psi}_{t,k}$ is then used by the neighbors to compute the gradients

$\nabla L_i(\boldsymbol{\psi}_{t,k}), \quad i \in \mathcal{N}_k^{-},$ (12)

using their local data. Notice that all the gradients are computed over a single batch (or mini-batch) of local data (sending multiple gradients, corresponding to the mini-batches, is an alternative option, not considered here for bandwidth limitations), while the chosen batch/mini-batch can change on consecutive communication rounds. The gradients are sent back to the device $k$ in step #3. Compared with CFA, this step allows every device to exploit additional gradients based on the neighbor data, and makes the learning much faster. On the device $k$, the local model is thus updated using the received gradients (12) according to

$\widehat{\boldsymbol{\psi}}_{t,k} = \boldsymbol{\psi}_{t,k} - \mu \sum_{i \in \mathcal{N}_k^{-}} \sigma_{k,i}\, \nabla L_i(\boldsymbol{\psi}_{t,k}),$ (13)

where $\sigma_{k,i}$ are the mixing weights for the gradients. Finally, as done for CFA in (9), the gradient update is performed using now the aggregated model (13) and the local data mini-batches

$\mathbf{W}_{t+1,k} = \widehat{\boldsymbol{\psi}}_{t,k} - \mu\, \nabla L_k(\widehat{\boldsymbol{\psi}}_{t,k}).$ (14)

To summarize, for each device $k$, CFA-GE combines the gradients computed over the local data with the gradients $\nabla L_i(\boldsymbol{\psi}_{t,k})$ obtained by the neighbors over their batches. The negotiation stage (13)-(14) is similar to the diffusion strategy proposed in [30][31]. In particular, we aggregate the model first (8), then we run one gradient descent round using the received gradients (13) and, finally, a number of SGD rounds (14) using the local mini-batches. As revealed in Sect. V, the optimization of the mixing weights $\sigma_{k,i}$ for the gradients is critical for convergence. Considering that the gradients in (12), obtained from the neighbors, are computed over a single batch of data, as opposed to the local data mini-batches, a reasonable choice is a uniform weighting over the neighborhood. This aspect is further discussed in Sect. V.
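The per-device CFA-GE update can be sketched as follows, applying the neighbor gradients first (13) and the local SGD rounds next (14); the uniform gradient mixing weights and the quadratic losses in the example are assumptions of ours:

```python
import numpy as np

def cfa_ge_update(psi_k, neighbor_grads, local_grad_fn, mu=0.1, sgd_steps=1):
    """Sketch of steps (13)-(14) for device k.
    neighbor_grads: neighbor-loss gradients evaluated at psi_k;
    local_grad_fn(w): gradient of the local loss L_k at w."""
    if neighbor_grads:
        c = 1.0 / len(neighbor_grads)   # uniform mixing weights (assumption)
        psi_k = psi_k - mu * sum(c * g for g in neighbor_grads)   # (13)
    for _ in range(sgd_steps):          # local mini-batch SGD (14)
        psi_k = psi_k - mu * local_grad_fn(psi_k)
    return psi_k
```

Because each round descends along both the local and the neighbor losses, the update settles between the minimizers of the individual devices rather than at the purely local one.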
III-C Two-stage negotiation and implementation aspects
Unlike CFA, CFA-GE requires a larger use of the bandwidth and more communication rounds for the synchronous exchange of the gradients. More specifically, it requires a more intensive use of the D2D wireless links for sharing the models first, during the negotiations (step #2), and for forwarding the gradients next (step #3). In addition, each device should wait for the neighbor gradients before applying any model update. Here, the proposed implementation simplifies the negotiation procedure to improve the convergence time (and latency). In particular, it resorts to a two-stage scheme where, as for CFA, each device can perform the updates without waiting for a reply from the neighbors. The pseudocode is highlighted in Algorithm 2. The communication rounds vs. epochs for CFA-GE are detailed in Fig. 2.b and compared with CFA. Considering the device $k$, with straightforward generalization, the following parameters are exchanged with the neighbors at epoch $t$:

$\mathcal{P}_{t,k} = \left\{\boldsymbol{\psi}_{t,k},\, \mathbf{G}_{t,k}\right\},$ (15)

namely the model updates (aggregations) $\boldsymbol{\psi}_{t,k}$ and the gradients $\mathbf{G}_{t,k}$, organized as

$\mathbf{G}_{t,k} = \left[\widehat{\nabla L}_k(\boldsymbol{\psi}_{t,i})\right]_{i \in \mathcal{N}_k^{-}}.$ (16)
In the proposed two-stage implementation, the negotiation step (step #2 in Fig. 2.a) is not implemented, as it would require a synchronous model sharing. Therefore, the model aggregations $\boldsymbol{\psi}_{t,i}$ are not available at device $k$ at epoch $t$ or, equivalently, the device does not wait for such information from the neighbors. The gradients are now predicted as $\widehat{\nabla L}_k(\boldsymbol{\psi}_{t,i})$ using the past (outdated) models $\boldsymbol{\psi}_{t-1,i}$ from the neighbors. In line with momentum-based techniques (see [32] and also Appendix B), for the predictions we use a multivariate exponentially weighted moving average (MEWMA) of the past gradients

$\widehat{\nabla L}_k(\boldsymbol{\psi}_{t,i}) = \beta\, \nabla L_k(\boldsymbol{\psi}_{t-1,i}) + (1-\beta)\, \widehat{\nabla L}_k(\boldsymbol{\psi}_{t-1,i}),$ (17)

ruled by the hyperparameter $0 < \beta \leq 1$. Setting $\beta = 1$, the gradient is estimated using the last available model ($\boldsymbol{\psi}_{t-1,i}$): $\widehat{\nabla L}_k(\boldsymbol{\psi}_{t,i}) = \nabla L_k(\boldsymbol{\psi}_{t-1,i})$. A smaller value introduces a memory that depends on the older past models $\boldsymbol{\psi}_{t-2,i}, \boldsymbol{\psi}_{t-3,i}, \dots$ This is shown, in Sect. V, to be beneficial on real data.
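The gradient prediction (17) reduces to a one-line recursion; a hypothetical sketch (names are ours):

```python
def mewma(grad_new, grad_est_prev, beta=0.5):
    """MEWMA gradient prediction as in (17).
    beta = 1 keeps only the most recent gradient; beta < 1 adds memory
    of the gradients computed on older, outdated neighbor models."""
    return beta * grad_new + (1.0 - beta) * grad_est_prev
```

In practice, each device would keep one running estimate per neighbor and refresh it whenever a new gradient, computed on an outdated model, arrives.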
Assuming that the device $k$ is able to correctly receive and decode the messages $\mathcal{P}_{t,i}$ from the neighbors $i \in \mathcal{N}_k^{-}$ at epoch $t$, the model aggregation step changes from (8) to

$\boldsymbol{\psi}_{t,k} = \mathbf{W}_{t,k} + \epsilon_t \sum_{i \in \mathcal{N}_k^{-}} \beta_{k,i} \left(\boldsymbol{\psi}_{t-1,i} - \mathbf{W}_{t,k}\right),$ (18)

while the model update step using the received gradients is now

$\widehat{\boldsymbol{\psi}}_{t,k} = \boldsymbol{\psi}_{t,k} - \mu \sum_{i \in \mathcal{N}_k^{-}} \sigma_{k,i}\, \widehat{\nabla L}_i(\boldsymbol{\psi}_{t,k}),$ (19)

and replaces (13). Finally, a gradient update on the local data is done as in (14). Notice that Algorithm 2 implements (19) by running one gradient descent round per received gradient, to allow for asynchronous updates. In Appendix B, we discuss the application of CFA and CFA-GE to advanced SGD strategies designed to leverage momentum information [10][32].
III-D Communication overhead and complexity analysis
With respect to FA, the proposed decentralized algorithms take some load off the server, at the cost of additional in-network operations and an increased D2D communication overhead. The overhead is quantified here for both CFA and CFA-GE in terms of the size of the parameter set $\mathcal{P}_{t,k}$ that needs to be exchanged among neighbors. CFA extends FA and, similarly, requires each node to exchange only the local model updates, at most once per round. The overhead, or the size of $\mathcal{P}_{t,k}$, thus corresponds to the model size (11). For a generic DNN model of $Q$ layers, the model size is in the order of $Q \times (N \times M + M)$ parameters. This is several orders of magnitude lower than the size of the input training data in typical ML and deep ML problems. As in (15), CFA-GE requires the exchange of the local model aggregations and of one gradient per neighbor, $i \in \mathcal{N}_k^{-}$. The overhead now scales with the neighborhood size, being $(1 + |\mathcal{N}_k^{-}|)$ times the model size. This is still considerably lower than the training dataset size, provided that the number of participating neighbors is limited. In the examples of Sect. V, we show that a few neighbors are sufficient, in practice, to achieve convergence: notice that the number of active neighbors is also typically kept small to avoid traffic issues [37]. Finally, quantization of the parameters can also be applied to limit the transmission payload, with the side benefit of improving global model generalization [38].
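The per-round payload comparison reduces to a parameter count; a short sketch, where the layer shapes are illustrative and the scaling rules follow the analysis above:

```python
def model_size(layer_shapes):
    """Parameters (weights + biases) exchanged per round by CFA, as in (11).
    layer_shapes: list of (n_in, n_out) per layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_shapes)

def cfa_ge_size(layer_shapes, num_neighbors):
    """CFA-GE payload (15): the model aggregation plus one gradient
    (same size as the model) per neighbor."""
    return (1 + num_neighbors) * model_size(layer_shapes)
```

For a single 784-to-10 layer (the MNIST example of Sect. IV), CFA exchanges 7,850 parameters per round, while CFA-GE with two neighbors exchanges three times as many.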
Besides the overhead, the CFA and CFA-GE computational complexity scales with the global model size and is ruled by the number of local SGD rounds. However, unlike FA, model aggregations and local SGD are both implemented on the device. With respect to CFA, CFA-GE computes up to $|\mathcal{N}_k^{-}|$ additional gradients using the neighbor models and up to $|\mathcal{N}_k^{-}|$ additional gradient descent rounds (19) for the local model update using the neighbor gradients. A quantitative evaluation of the overhead and of the execution time of the local computations is proposed in Sect. V by comparing FA, CFA and CFA-GE using real data and low-power System on Chip (SoC) devices.
Considering now the networking aspects, the cost of a D2D communication is much lower than the cost of a server connection, which is typically long-range. D2D links cover shorter ranges and require lower transmit power: the communication cost is thus ruled by the energy spent during receiving operations (radio wake-up, synchronization, decoding). Besides, in large-scale and massive IIoT networks, sending model updates to the server, as done in conventional FA, might need several D2D communication rounds to relay the information via neighbor devices. D2D communications can serve as an underlay to the infrastructure (server) network and can thus exploit the same radio resources. Such two-tier networks are a key concept in next generation IoT and 5G [39] scenarios.
Finally, an optimal trade-off between in-network and server-side operations is also possible by alternating rounds of FA with rounds of in-network consensus (CFA or CFA-GE). This corresponds to a real-world scenario where communication with the server is available, but intermittent, sporadic, unreliable or too costly. During initialization, i.e., at time $t=0$, the devices might use the last available global model $\mathbf{W}_0$ received from the server after the communication rounds of the previous FA phase, and obtain a local update via SGD: $\mathbf{W}_{1,k} = \mathbf{W}_0 - \mu\, \nabla L_k(\mathbf{W}_0)$. This is fed back to the neighbors to start the CFA or CFA-GE iterations.
IV Consensus-based FL: an introductory example
In this section, we give an introductory example of the consensus-based FL approaches, comparing their performance to conventional FL methods. We resort here to a network of wireless devices communicating via multi-hop, as depicted in Fig. 3, without any central coordination. Although simple, the proposed topology is still useful in practice to validate the performance of FL under the assumption that no device has a direct (i.e., single-hop) connection with all the nodes in the network. More practical usage scenarios are considered in Sect. V. Without affecting generality, the devices collaboratively learn a global model that is simplified here as a NN model with only one fully connected layer ($Q=1$):

$\hat{\mathbf{y}} = f\!\left(\mathbf{W}_{1}\,\mathbf{x} + \mathbf{b}_{1}\right).$ (20)

Considering the network layout of Fig. 3, the neighbor sets follow the depicted multi-hop topology. Each $k$-th device has a database $\mathcal{E}_k$ of local training data, here taken from the MNIST (Modified National Institute of Standards and Technology) image database [40] of handwritten digits. Output labels take 10 different values (from digit 0 up to 9), model inputs have size $N = 784$ (each image is represented by $28 \times 28$ grayscale pixels), while outputs have dimension $M = 10$. In Fig. 3, each device obtains the same number of training images, taken randomly (IID) from the training database consisting of 60,000 images. The non-IID data distribution is investigated in Fig. 4.
We assume that each device has prior knowledge of the structure of the model (20) at the initial stage ($t=0$), namely the input/output sizes ($N$, $M$) and the nonlinear activation $f(\cdot)$. Moreover, each of the local models starts from the same random initialization $\mathbf{W}_{0,k}$ [15]. On every new epoch $t$, the devices perform consensus iterations using the model parameters received from the available neighbors during the previous epoch $t-1$. The local model updates for CFA (9) and CFA-GE (14) use the cross-entropy loss for the gradient computation

$L_k = -\frac{1}{B} \sum_{h=1}^{B} \sum_{u=1}^{M} y_{h,u} \log \hat{y}_{h,u},$ (21)

where the sum is computed over mini-batches of size $B$. The devices thus make one training pass over their local dataset, consisting of $E_k/B$ mini-batches. For CFA, the consensus step size is chosen as in Sect. III-A and the mixing parameters as in (10). For CFA-GE, the mixing parameters for the gradients (13) and the MEWMA hyperparameter $\beta$ are selected as discussed in Sect. III-B and Sect. III-C, respectively.
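The mini-batch cross-entropy loss (21) used for the local gradients can be sketched as follows (variable names are ours; the small epsilon guards the logarithm and is an implementation detail, not part of the paper):

```python
import numpy as np

def cross_entropy(probs, labels_onehot, eps=1e-12):
    """Average cross-entropy over a mini-batch of B examples, as in (21).
    probs: (B, M) predicted class probabilities; labels_onehot: (B, M)."""
    return float(-np.mean(np.sum(labels_onehot * np.log(probs + eps), axis=1)))
```

A perfect one-hot prediction yields a near-zero loss, while a uniform prediction over the 10 MNIST classes yields log(10).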
On every epoch $t$, the performance is analyzed in terms of the validation loss (21) for all the models $\mathbf{W}_{t,k}$. For testing, we considered the full MNIST validation dataset, consisting of 10,000 images. The loss decreases over consecutive epochs as far as the model updates $\mathbf{W}_{t,k}$ converge to the true global model $\widehat{\mathbf{W}}$.
In Figs. 3-4, we validate the performance of the CFA algorithms in case of uniform (Fig. 3) and uneven (Fig. 4) data distribution among the devices. More specifically, Fig. 3 compares CFA and CFA-GE, with CFA-GE using the two-stage negotiation algorithm of Sect. III-C (at the initial epochs, the negotiation algorithm described in Sect. III-B is used). On the other hand, in Fig. 4, we consider the general case where the data is unevenly distributed, while the partitioning among the devices is also non-IID. Herein, we compare two cases. In the first one, the device considered (Fig. 4 on the left) is located at the edge of the network and connected to one neighbor only: it obtains training images from only a subset of the available classes, namely a small percentage of the training database. A second device, connected with more neighbors, retains a larger share of the training database. In the second case (on the right), the situation is reversed. As expected, compared with the first case, convergence is more penalized in the second case, although CFA running on the edge device (red lines) can still converge. As shown in Fig. 3, CFA-GE (solid lines without markers) further reduces the loss compared with CFA (circle markers). The effect of an unbalanced database for CFA-GE is also considered in Sect. V.
The FL and consensus schemes have been implemented using the TensorFlow library
[33], while real-time D2D connectivity is simulated by a Monte Carlo approach. All simulations run for a fixed maximum number of epochs. Besides the proposed consensus strategies, the validation loss is also computed for three different scenarios. The first one is labelled as "isolated training" and it is evaluated in Fig. 3 for IID and in Fig. 4 for non-IID data. In this scenario, the models are trained without any cooperation with neighbors (or server), by using the local training data only. This use case is useful to highlight the benefits of mutual cooperation, which become significant after the first epochs for both IID and non-IID data, according to the considered network layout. Notice that isolated training is also limited by overfitting effects after the initial iterations, as clearly observable in Fig. 3 and in Fig. 4, since local/isolated model optimization is based only on a few training images, compared with the validation database of 10,000 images. Consensus and mutual cooperation among the devices thus prevent such overfitting. The second scenario, "centralized ML without federation" (dashed lines in Figs. 3 and 4), corresponds to (5) and gives the validation loss obtained when all the nodes send all their local training data directly to the server. It serves as a benchmark for the convergence analysis, as it provides the optimal parameter set $\widehat{\mathbf{W}}$ obtained from the full training database. Notice that the CFA-GE method closely approaches the optimal parameter set and converges faster than CFA. The third scenario implements the FA strategy (see Sect. II), which relies on server coordination, while cooperation among devices through D2D links is not enabled. As depicted in Fig. 3, the convergence of the FA validation loss is similar to that of the best connected devices, although the convergence speed is slightly faster after the initial epochs. In fact, for the considered network layout, the best connected devices can be considered a good replacement of the server, being directly linked with most of the other devices.
In the next section, we consider a more complex device deployment in a challenging IIoT scenario.

V Validation in an experimental IIoT scenario
The proposed in-network FL approaches of Sect. III are validated here on a real-world IIoT use case. Data are partitioned over the IIoT devices and D2D connectivity [14] is used here as a replacement of the centralized communication infrastructure [34]. As depicted in Fig. 5, the reference scenario consists of a large-scale and dense network [35] of autonomous IIoT devices that sense their surroundings using Frequency Modulated Continuous Wave (FMCW) radars [45] working in the sub-THz band. Radars in the mmWave (or sub-THz) bands are very effective in industrial production lines (or robotic cells, as in Fig. 5.b) for environment/obstacle detection [36], velocity/distance measurement and virtual reality applications [43]. In addition, mmWave radios have also been considered as candidates for 5G new radio (NR) allocation. They thus represent promising solutions towards the convergence of dense communications and advanced sensing technologies [3].
In the proposed setup, the above-cited devices are employed to monitor a shared industrial workspace during Human-Robot Collaboration (HRC) tasks, in order to detect and track the position of the human operators (i.e., the range distance of the individuals) moving near a robotic manipulator inside a fenceless space [42]. In industrial shared workplaces, measuring positions and distances is mandatory to enforce a worker protection policy, i.e., to implement collision avoidance, speed reduction, the anticipation of contacts of limited entity, etc. In addition, it is highly desirable that the operators are set free from wearable devices for generating location information [3]. The tracking of body movements must also not depend on active human involvement. For a static background, the problem of passive body detection and ranging can be solved via background subtraction methods and ML tools (see [44] and references therein). The presence of the robot, often characterized by a massive metallic body, that moves inside the shared workplace poses additional remarkable challenges in ranging and positioning, because robots induce large, non-stationary and fast RF perturbation effects [42].
The radars collect a large amount of data that cannot be shipped back to the server for training and inference, due to the latency constraints imposed by the worker safety policies. In addition, direct communication with the server is available but reserved to the monitoring of the robot activities (and the replanning of robotic tasks in case of dangerous situations) [42], and should not be used for data distribution. Therefore, to solve the scalability challenge while addressing latency, reliability and bandwidth efficiency, we let the devices perform the model training without any server coordination, using only the mutual cooperation with neighbors. We thus adopt the proposed in-network FL algorithms, relying solely on local model exchanges over the active D2D links.
In what follows, we first describe the dataset and the ML model adopted for body motion recognition. Next, we investigate the convergence properties of the CFA and CFA-GE solutions, namely the required number of communication rounds (i.e., latency) for varying connectivity layouts, network sizes and hyperparameter choices, such as mixing weights and step sizes. Finally, we provide a quantitative evaluation of the communication overhead and of the local computational complexity, comparing all the proposed algorithms.
V-A Data collection and processing
In the proposed setup, the radar (see [43] for a review) transmitting antennas radiate a swept, modulated waveform [45] with a bandwidth of GHz, a carrier frequency of GHz, and a ramp (pulse) duration of ms. The radar echoes reflected by moving objects are mixed at the receiver with the transmitted signal to obtain the beat signal. Beat signals are then converted to the frequency domain (i.e., the beat signal spectrum) by a -point Fast Fourier Transform (FFT) and averaged over consecutive frames (i.e., frequency sweeps or ramps). The FFT samples are used as model inputs and serve as training data collected by the individual devices. The network of radars is designed to discriminate body movements from those of the robot and, in turn, to detect the distance of the worker from the robot, with the purpose of identifying potentially unsafe conditions [36]. The ML model is here trained to classify potential HR collaborative situations characterized by different HR distances, corresponding to safe or unsafe conditions. In particular, the first class corresponds to the robot and the worker cooperating at a safe distance, the second class identifies the human operator as working close by the robot, and the remaining classes correspond to intermediate HR distances. The FFT range measurements (i.e., the beat signal spectrum) and the corresponding true labels in Fig. 5.c are collected independently by the individual devices and stored locally. During the initial FL stage, each device independently obtains its FFT range measurements. The data distribution is also non-IID: in other words, most of the devices have local examples only for a (random) subset of the classes of the global model. However, we assume that there are sufficient examples for each class when considering the data stored on all devices. Each local dataset corresponds to a fraction (%) of the full training database. Mini-batches for local gradients have a fixed size; each training pass thus consists of a few mini-batches, for fast model updates. Validation data, instead, consist of range measurements collected inside the industrial plant.

TABLE I: NN models (CNN and 2NN) with the layer types and the corresponding dimensions of weights and biases.
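The range-FFT preprocessing described in this subsection can be sketched as follows; the FFT size, the number of averaged frames and the distance-to-class thresholds are hypothetical placeholders introduced for illustration only, since the actual values depend on the radar configuration.

```python
import numpy as np

# Hypothetical parameters (the actual radar configuration may differ).
N_FFT = 512        # FFT size applied to each beat signal
N_FRAMES = 8       # consecutive frequency sweeps (ramps) averaged together
THRESHOLDS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]  # HR distances (m), hypothetical

def range_spectrum(beat_frames):
    """Average the magnitude FFT (beat-signal spectrum) over consecutive frames.

    beat_frames: array of shape (N_FRAMES, N_FFT) with time-domain beat signals.
    Returns the averaged one-sided spectrum used as the model input.
    """
    spectra = np.abs(np.fft.fft(beat_frames, n=N_FFT, axis=1))
    return spectra.mean(axis=0)[: N_FFT // 2]

def distance_to_class(distance_m):
    """Hypothetical mapping from a measured HR distance to a safety class."""
    for label, thr in enumerate(THRESHOLDS):
        if distance_m < thr:
            return label
    return len(THRESHOLDS)  # farthest (safest) class
```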
Unlike the previous section, we now choose a global ML model characterized by a NN with multiple trainable layers. In particular, two networks are considered, with hyperparameters and corresponding dimensions of weights and biases detailed in Table I. The first, convolutional NN model (CNN) consists of a 1D convolutional layer (filters with multiple taps) followed by max-pooling (non-trainable) and a fully connected (FC) layer. The second model (2NN) replaces the convolutional layer with an FC hidden layer followed by a ReLU activation and a second FC layer. These examples are useful to assess the convergence properties of the proposed distributed strategies for different layer types, dimensions and numbers of trainable parameters. As before, we further assume that, during the initial stage, each device has knowledge of the global ML model structure (see the layers and dimensions in Table I). At each new communication round, the model parameters of each layer are multiplexed and propagated simultaneously by using a Time Division Multiple Access (TDMA) scheme [14].

FL has been simulated in a virtual environment, but using real data from the plant. This virtual environment creates an arbitrary number of virtual devices, each configured to process an assigned training dataset and to exchange parameters that are saved in real time on temporary cache files. The files may be saved on RAM disks to speed up the simulation. The software is written in Python and uses the TensorFlow and multiprocessing modules: simplified configurations for testing both the CFA and CFA-GE setups are also provided in the repository [46].
The code script examples are available as open source and show the application of CFA and CFA-GE to different NN models. Hyperparameters such as the learning rates for weights and gradients, the number of neighbors, and the parameters in (17) are fully configurable. The datasets obtained in the scenario of Fig. 5.b are also available in the same repository. Finally, examples are provided for the implementation and the analysis of the execution time on low-power devices (Sect. V-C). The current optimization toolkit does not simulate, or account for, packet losses during communication: these are considered negligible for short-range connections. However, the network and its connectivity can be time-varying and arbitrarily defined.
TABLE II: Optimization hyperparameter configurations for CFA and CFA-GE.
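To make the layer dimensions of Table I concrete, the sketch below enumerates plausible weight and bias shapes for the two models; every dimension (input size, number of classes, filters, taps, pooling, hidden nodes) is a hypothetical placeholder for the Table I entries.

```python
import numpy as np

# Hypothetical dimensions (placeholders for the Table I values).
N_INPUT = 256   # FFT range samples per example
N_CLASSES = 8   # HR distance classes

def cnn_param_shapes(n_filters=4, n_taps=8, pool=2):
    """Weight/bias shapes of the CNN model: 1D conv -> max-pooling -> FC."""
    conv_out = (N_INPUT - n_taps + 1) // pool     # valid conv, then pooling
    return {
        "conv_W": (n_taps, 1, n_filters), "conv_b": (n_filters,),
        "fc_W": (conv_out * n_filters, N_CLASSES), "fc_b": (N_CLASSES,),
    }

def nn2_param_shapes(hidden=64):
    """Weight/bias shapes of the 2NN model: FC -> ReLU -> FC."""
    return {
        "fc1_W": (N_INPUT, hidden), "fc1_b": (hidden,),
        "fc2_W": (hidden, N_CLASSES), "fc2_b": (N_CLASSES,),
    }

def n_params(shapes):
    """Total number of trainable parameters for a shape dictionary."""
    return sum(int(np.prod(s)) for s in shapes.values())
```

With these placeholder dimensions the 2NN carries more trainable parameters than the CNN, which is consistent with the slower convergence observed for the 2NN in the experiments.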
V-B Gradient exchange optimization for NN
In what follows, FL is verified by varying the number of devices and the number of neighbors, so as to test different D2D connectivity layouts. In Fig. 6, we validate the consensus-based FL tools, for both the CNN and 2NN models, over networks with an increasing number of devices. To simplify the analysis of the different connectivity scenarios, the network is simulated as k-regular (i.e., all network devices have the same number of neighbors, or degree), while we verify realistic topologies characterized by a varying number of neighbors per node. First, in Fig. 6, we compare decentralized CFA and CFA-GE with FA and with conventional centralized ML without federation in (5). The chosen optimization hyperparameters are summarized in Table II. For all FL cases (CFA, CFA-GE and FA), we plot the validation loss vs. the communication rounds (i.e., epochs), averaged over all devices. For centralized ML (dashed lines), the validation loss is analyzed over the epochs, now running inside the server. The CFA plots (circle markers) slowly approach the curves corresponding to FA and centralized ML, while the performance improves in dense networks (dotted lines). The CFA-GE curves (solid and dotted lines without markers) are comparable with FA and converge after a few communication rounds. The use of few neighbors (solid lines) is sufficient to approach the FA performance. Increasing the number of neighbors (dotted lines) makes the validation loss comparable with that of centralized ML without federation. Running local SGD on the received gradients as in eq. (19) causes some fluctuations of the validation loss when approaching convergence. The fluctuations are due to the (large) step size used to combine the gradients at every communication round: learning rate adaptation techniques [10] can be applied for fine-tuning. Considering the 2NN model, the validation loss is larger in all cases, as the result of the larger number of model parameters to train, compared with the CNN. CFA-GE is still comparable with FA after a few rounds and converges towards centralized ML after more rounds.
In all cases, few neighbors are sufficient to approach the FA results. More neighbors provide performance improvements mostly in small networks, and are still useful to match the centralized ML performance.
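The k-regular topologies used in these experiments can be emulated with a circulant graph, and a plain consensus iteration over it shows why denser networks converge in fewer rounds; the function names and the mixing weight `eps` are illustrative assumptions.

```python
import numpy as np

def circulant_neighbors(n_devices, degree):
    """k-regular (circulant) topology: each device links to the degree/2
    nearest devices on each side of a ring (degree assumed even here)."""
    half = degree // 2
    return {
        i: [(i + d) % n_devices for d in range(-half, half + 1) if d != 0]
        for i in range(n_devices)
    }

def consensus_rounds(values, neighbors, rounds, eps=0.5):
    """Plain consensus averaging: denser graphs shrink disagreement faster."""
    v = dict(values)
    for _ in range(rounds):
        v = {
            i: v[i] + (eps / len(nbrs)) * sum(v[j] - v[i] for j in nbrs)
            for i, nbrs in neighbors.items()
        }
    return v
```

Comparing the residual spread of the local values after a fixed number of rounds, for a low-degree and a high-degree graph, reproduces qualitatively the trend seen in Fig. 6.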
In Fig. 7, we consider the CFA-GE method and analyze more deeply the effect of the hyperparameter choice on convergence, for a fixed number of devices and varying network degrees. The first case, with the lowest degree, is representative of a multi-hop wireless network; networks with larger degrees are useful to verify the performance of FL over denser networks. Line plots with bars in Fig. 7 are used to graphically represent the variability of the validation loss observed by the devices. We analyze the learning rate used to combine the gradients received from the neighbors in (19). The other hyperparameters are selected as in Table II. As expected, increasing the network degree helps convergence and makes the validation loss decrease faster, since fewer communication rounds are required. However, while for low-degree networks the learning rate can be chosen rather arbitrarily within a range without affecting the performance, denser networks, i.e., with a large degree, require the optimization of the learning rate: smaller rates improve convergence.
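A gradient-exchange step in the spirit of (19) can be sketched as follows; the exact combination rule in the paper may differ, so the averaging of the neighbor gradients and the two learning rates `mu` and `eps` are assumptions for illustration.

```python
import numpy as np

def cfage_gradient_step(w, own_grad, neighbor_grads, mu=0.01, eps=0.01):
    """Gradient-exchange update sketch (in the spirit of eq. (19)).

    w:              local model parameters (flat np.ndarray)
    own_grad:       gradient computed on the local mini-batch
    neighbor_grads: list of gradients received from the neighbors
    mu, eps:        learning rates for the local and neighbor gradients
    """
    # SGD step on the local gradient plus a step along the (averaged)
    # gradients received over the D2D links; a large eps speeds up
    # convergence but causes the loss fluctuations discussed in the text.
    g_nbr = np.mean(neighbor_grads, axis=0) if neighbor_grads else 0.0
    return w - mu * own_grad - eps * g_nbr
```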
In Table III, we analyze the latency of the FL process, measured here in terms of the number of communication rounds. CFA-GE is considered in detail, while the performance of CFA can be inferred from Fig. 6. Table III reports the number of communication rounds (or epochs) that are required to achieve a target validation loss for all devices. For the considered case, the chosen validation loss corresponds to a given (global) accuracy. Focusing on the CNN layers, a network with few neighbors per device requires a maximum number of communication rounds (and a smaller minimum) to achieve the target loss. This is in line with the theoretical bound [24] for the local accuracy (obtained by isolated training). Considering FA (not shown in the table), the number of required rounds is again comparable with that of the decentralized optimization. Increasing the number of neighbors reduces the required communication rounds. For the 2NN layers, the required number of epochs increases, due to the smaller local accuracy as well as to the larger number of parameters to be trained for each NN layer. Finally, for the proposed setup, we noticed that the performance improves by keeping the learning rate for the hidden layer parameters slightly larger than the rate for the output layer parameters. This is particularly evident when convolutional layers are used.
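The latency metric of Table III, i.e., the number of rounds needed for all devices to reach the target validation loss, can be computed with a simple helper (function and variable names are illustrative):

```python
def rounds_to_target(loss_per_round, target):
    """Number of communication rounds until ALL devices reach the target
    validation loss.

    loss_per_round: list of per-round dicts, device_id -> validation loss
    Returns the 1-indexed round count, or None if the target is never met.
    """
    for t, losses in enumerate(loss_per_round, start=1):
        if all(l <= target for l in losses.values()):
            return t
    return None  # target not reached within the simulated rounds
```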
TABLE III: Number of communication rounds required to achieve the target validation loss (CNN and 2NN models).