Federated Learning with Cooperating Devices: A Consensus Approach for Massive IoT Networks

12/27/2019
by Stefano Savazzi, et al.

Federated learning (FL) is emerging as a new paradigm to train machine learning models in distributed systems. Rather than sharing, and disclosing, the training dataset with the server, the model parameters (e.g., neural network weights and biases) are optimized collectively by large populations of interconnected devices, acting as local learners. FL can be applied to power-constrained IoT devices with slow and sporadic connections. In addition, it does not need data to be exported to third parties, preserving privacy. Despite these benefits, a main limitation of existing approaches is the centralized optimization, which relies on a server for the aggregation and fusion of local parameters; this has the drawbacks of a single point of failure and poor scaling as the network size increases. The paper proposes a fully distributed (or server-less) learning approach: the proposed FL algorithms leverage the cooperation of devices that perform data operations inside the network by iterating local computations and mutual interactions via consensus-based methods. The approach lays the groundwork for the integration of FL within 5G and beyond networks, characterized by decentralized connectivity and computing and by intelligence distributed over the end-devices. The proposed methodology is verified by experimental datasets collected inside an industrial IoT environment.


I Introduction

Beyond-5G systems are expected to leverage cross-fertilizations between wireless systems, core networking, Machine Learning (ML) and Artificial Intelligence (AI) techniques, targeting not only communication and networking tasks, but also augmented environmental perception services [1]. The combination of powerful AI tools, e.g., Deep Neural Networks (DNN), with the massive usage of the Internet of Things (IoT), is expected to provide advanced services [2] in several domains such as Industry 4.0 [3], Cyber-Physical Systems (CPS) [4] and smart mobility [5]. Considering this envisioned landscape, it is of paramount importance to integrate emerging deep learning breakthroughs within future-generation wireless networks, characterized by arbitrarily distributed connectivity patterns (e.g., mesh, cooperative, peer-to-peer, or spontaneous), along with strict constraints in terms of latency [6] (i.e., to support Ultra-Reliable Low-Latency Communications, URLLC) and battery lifetime.

Fig. 1: From left to right: a) FL based on centralized fusion of local model (or gradient) updates; b) proposed consensus-based FL with distributed fusion over an infrastructure-less network. Learning of global model parameters from local data examples is obtained by mutual cooperation between neighbors sharing the local model updates.

Recently, federated optimization, or federated learning (FL) [7][8], has emerged as a new paradigm in distributed ML setups [9]. The goal of FL systems is to train a shared global model (i.e., a Neural Network, NN) from a federation of participating devices acting as local learners under the coordination of a central server for model aggregation. As shown in Fig. 1.a, FL alternates between a local model computation at each device and a round of communication with a server. Devices, or workers, derive a set of local learning parameters from the available training data, referred to as the local model. The local model computed at each round by every device is typically obtained via back-propagation and Stochastic Gradient Descent (SGD) [10] methods using the local training examples. The server obtains a global model by fusion of the local models and then feeds such model back to the devices. Multiple rounds are repeated until convergence is reached. The objective of FL is thus to build a global model by the cooperation of a number of devices. The model is characterized by its parameters (i.e., NN weights and biases for each layer) relating the observed (input) data to the output quantity. Since it decouples the ML stages from the need to send data to the server, FL provides strong privacy advantages compared to conventional centralized learning methods.

I-A Decentralized FL and related works

Next-generation networks are expected to be underpinned by new forms of decentralized, infrastructure-less communication paradigms [11] enabling devices to cooperate directly over device-to-device (D2D) spontaneous connections (e.g., multihop or mesh). These networks are designed to operate, when needed, without the support of a central coordinator, or with limited support for synchronization and signalling. They are typically deployed in mission-critical control applications where edge nodes cannot rely on a remote unit for fast feedback and have to manage part of the computing tasks locally [12], cooperating with neighbors to mutually disclose the information. Typical examples are low-latency safety-related services for vehicular [13] or industrial [14] applications. Considering these trends, research activities are now focusing on fully decentralized (i.e., server-less) learning approaches. Optimization of the model running on the devices under privacy constraints is also critical [16] for human-related applications.

To the authors' knowledge, few attempts have been made to address the problem of decentralized FL. In [17][18], a gossip protocol is adopted for ML systems: through sum-weight gossip, local models are propagated over a peer-to-peer network. However, in FL over D2D networks, gossip-based methods cannot be fully used because of medium access control (MAC) and half-duplex constraints, which are ignored in these early approaches. More recently, [19] considers an application of distributed learning for medical data centers where several servers collaborate to learn a common model over a fully connected network; however, network scalability/connectivity issues are not considered at all. In [20], a segmented gossip aggregation is proposed: the global model is split into non-overlapping subsets and local learners aggregate the segmentation sets from other learners. Having split the model, by introducing ad-hoc subsets and segment management tasks, the approach is extremely application-dependent and not suitable for more general ML contexts. Finally, [21][22] propose a peer-to-peer Bayesian-like approach to iteratively estimate the posterior model distribution. The focus is on convergence speed and load balancing, yet simulations are limited to a few nodes, and the proposed method cannot be easily generalized to NN systems trained by incremental gradient (i.e., SGD) or momentum-based methods.

I-B Contributions

This paper proposes the application of FL principles to massively dense and fully decentralized networks that do not rely upon a central server coordinating the learning process. As shown in Fig. 1.b, the proposed FL algorithms leverage the mutual cooperation of devices that perform data operations inside the network (in-network) via consensus-based methods [23]. Devices independently perform training steps on their local dataset (batches) based on a local objective function, by using SGD and the fused models received from the neighbors. Next, similarly to gossip [17], devices forward the model updates to their one-hop neighborhood for a new consensus step. Unlike the methods in [17]-[22], the proposed approach is general enough to be seamlessly applied to any NN model trained by SGD or momentum methods. In addition, in this paper we investigate the scalability problem for varying NN model layer size, considering large and dense D2D networks with different connectivity graphs. These topics are discussed, for the first time, by focusing on an experimental Industrial IoT (IIoT) setting.

The paper contributions are summarized in the following:

  • the federated averaging algorithm [9][15] is revisited to allow local learners to implement consensus techniques by exchanging local model updates: consensus also extends existing gossip approaches;

  • a novel class of FL algorithms based on the iterative exchange of both model updates and gradients is proposed to improve convergence and minimize the number of communication rounds, in exchange for a more intensive use of D2D links and local computing;

  • all presented algorithms are validated over large-scale massive networks with intermittent, sporadic or varying connectivity, focusing in particular on an experimental IIoT setup, and considering complexity, convergence speed, communication overhead, and average execution time on embedded devices.

The paper is organized as follows. Sect. II reviews the FL problem. Sect. III proposes two consensus-based algorithms for decentralized FL. Validation of the proposed methods is first addressed in Sect. IV on a simple network and scenario. Then, in Sect. V the validation is extended to a large scale setup by focusing on a real-world problem in the IIoT domain. Finally, Sect. VI draws some conclusions and proposes future investigations.

II FL for model optimization

The FL approach defines an incremental algorithm for model optimization over a large population of devices. It belongs to the family of incremental gradient algorithms [26][29] but, unlike these setups, optimization typically focuses on the non-convex objectives commonly found in NN problems. The goal is to learn a global model for inference problems (i.e., classification or regression applications) that transforms the input observation vector into the outputs, with the model parameters embodied in a matrix whose dimensions depend on the input and output sizes. Observations, or input data, are stored across devices connected with a central server that coordinates the learning process. A common, and practical, assumption is that the number of participating devices is large and that they have intermittent connectivity with the server. The cost of communication is also much higher than the cost of local computation, in terms of capacity, latency and energy consumption [25]. In this paper, we focus specifically on NN models. Therefore, considering a NN of multiple layers, the model iteratively computes a non-linear function of a weighted sum of the input values, namely

(1)

where the intermediate variables denote the hidden-layer outputs, starting from the input vector. The matrix

(2)

collects all the parameters of the model, namely the weights and the biases for each defined layer, with the corresponding input and output layer dimensions. To simplify the notation, we assume here that the layers have equal input/output size, but the model can be generalized to account for different dimensions (see Sect. V). FL applies independently to each layer of the network, and weights and biases are also optimized independently. Therefore, in what follows, optimization focuses on one layer, and the matrix (2) reduces to the parameters of that single layer. The parameters of the convolutional layers, namely the input, output and kernel dimensions, can be easily reshaped to conform with the above general representation.
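To make the layered model (1)-(2) concrete, the following minimal NumPy sketch computes the forward pass as an iterated non-linear function of weighted sums; the layer sizes, activation choice and random initialization are illustrative assumptions only.

```python
import numpy as np

def forward(x, weights, biases, activation=np.tanh):
    """Iterated non-linear function of weighted sums, as in (1).

    weights : list of (n_out, n_in) matrices, one per layer
    biases  : list of (n_out,) vectors, one per layer
    """
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b                                       # weighted sum of the layer inputs
        h = activation(z) if l < len(weights) - 1 else z    # last layer left linear here
    return h

# Toy example: two layers, input size 4, hidden size 8, output size 3 (assumed sizes)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
biases = [np.zeros(8), np.zeros(3)]
print(forward(rng.normal(size=4), weights, biases).shape)   # (3,)
```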

In FL it is commonly assumed [7][8] that a large database of examples is (unevenly) partitioned among the devices, under non-Independent and Identically Distributed (non-IID) assumptions. Examples are organized as input-output tuples, where the input represents the data while the output is the desired model response. The set of examples, or training data, available at each device has a size that differs from device to device under the non-IID assumption. The training data on a given device is thus not representative of the full population distribution. In practical setups (see Sect. V), data is collected individually by the devices based on their local/partial observations of the given phenomenon.
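As an illustration of the non-IID partitioning described above, the sketch below splits a labeled dataset unevenly across devices, giving each device examples drawn from a random subset of the classes only; the number of devices, classes per device and shard sizes are arbitrary assumptions.

```python
import numpy as np

def split_non_iid(x, y, num_devices=10, classes_per_device=3, seed=0):
    """Assign each device an uneven shard drawn from a few classes only."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    shards = {}
    for k in range(num_devices):
        own = rng.choice(classes, size=classes_per_device, replace=False)
        idx = np.flatnonzero(np.isin(y, own))
        # uneven sizes: each device keeps a random fraction of the matching examples
        keep = rng.choice(idx, size=rng.integers(len(idx) // 4, len(idx)), replace=False)
        shards[k] = (x[keep], y[keep])
    return shards

# Example with synthetic data: 1000 examples, 8 classes
x = np.random.randn(1000, 16)
y = np.random.randint(0, 8, size=1000)
shards = split_non_iid(x, y)
print({k: len(v[1]) for k, v in shards.items()})
```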

Unlike incremental gradient algorithms [30][31], FL is applicable to any finite-sum objective of the general form

(3)

where each term is the loss, or cost, associated with the corresponding device

(4)

i.e., the loss of the predicted model over the examples observed by that device, assuming the current model parameters hold.

In conventional centralized ML (i.e., learning without federation), used here as a benchmark, the server collects all local training data from the devices and optimizes the model parameters by applying an incremental gradient method over a number of batches from the training dataset. On each iteration, the model parameters are thus updated by the server according to

(5)

where the SGD step size scales the gradient of the loss in (3), computed over the assigned batches and w.r.t. the model parameters. Backpropagation is used here for gradient computation. The model estimate obtained at convergence is used in the following as the benchmark solution.

Rather than sharing the training data with the server, in FL the model parameters are optimized collectively by interconnected devices, acting as local learners. On each round of communication, the server distributes the current global model to a subset of devices. The devices independently update the model using the gradients (SGD) from local training data as

(6)

where the gradient of the loss (4) is computed by each device w.r.t. the received model. The updates (6), or local models, are sent back to the server after quantization, anonymization [7] and compression stages. A global model update is obtained by the server through aggregation according to

(7)

Convergence towards the centralized approach (5) is achieved when the aggregated model approaches the centralized estimate. Notice that the learning rate is typically kept smaller [9] compared with centralized learning (5) on large datasets. The aggregation model (7) is referred to as Federated Averaging (FA) [7][15]. As far as convergence is concerned, for strongly convex objectives and generic local solvers, a general upper bound on the number of global iterations is given in [24] and relates to both the global and the local accuracy.
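A minimal sketch of one FA round, following (6)-(7), is reported below: each device runs a few local SGD steps and the server fuses the local models with weights proportional to the local dataset sizes. The quadratic toy loss, learning rate and number of rounds are illustrative assumptions, and the quantization/compression operator of (7) is omitted.

```python
import numpy as np

def local_update(W_global, grad_fn, data_k, lr=0.01, steps=1):
    """Device-side update (6): a few SGD steps on the local examples."""
    W = W_global.copy()
    for _ in range(steps):
        W -= lr * grad_fn(W, data_k)          # gradient of the local loss (4)
    return W

def federated_averaging(W_global, device_data, grad_fn, lr=0.01):
    """Server-side aggregation (7): dataset-size weighted average of local models."""
    sizes = np.array([len(d) for d in device_data], dtype=float)
    local_models = [local_update(W_global, grad_fn, d, lr) for d in device_data]
    mix = sizes / sizes.sum()
    return sum(a * Wk for a, Wk in zip(mix, local_models))

# Toy usage: least-squares loss on synthetic per-device data (x, y pairs)
rng = np.random.default_rng(1)
device_data = [list(zip(rng.normal(size=(50, 4)), rng.normal(size=50)))
               for _ in range(5)]

def grad_fn(W, data_k):
    X = np.stack([x for x, _ in data_k])
    t = np.array([y for _, y in data_k])
    return 2.0 * X.T @ (X @ W - t) / len(data_k)

W = np.zeros(4)
for _ in range(20):                            # communication rounds
    W = federated_averaging(W, device_data, grad_fn)
```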

III A consensus-based approach to in-network FL

The approaches proposed in this section allow the devices to learn the model parameters, solution of (3), by relying solely on local cooperation with neighbors and on local in-network (as opposed to centralized) processing. The interaction topology of the network is modelled as a directed graph with a set of nodes and a set of edges (links). As depicted in Fig. 1, the distributed devices are connected through a decentralized communication architecture based on D2D communications. Each device has a neighbor set of given cardinality; depending on the context, this set may or may not include the device itself. As introduced in the previous section, each device has a database of examples that are used to train a local NN model at each epoch. The model maps the input features into the outputs as in (1). A cost function, generally non-convex, as in (4), is used to optimize the weights of the local model.

The proposed FL approaches exploit both adaptive diffusion [31] and consensus tools [23][27] to optimally leverage the (possibly large) population of federated devices that cooperate for the distributed estimation of the global model, while retaining the training data locally. Convergence is thus obtained when the local models agree on the globally optimal parameters. Distributed in-network model optimization must satisfy convergence time constraints, as well as minimize the number of communication rounds. In what follows, we propose two strategies that differ in the way the model updates and gradients are computed and exchanged.

1:procedure CFA(device k)
2:     initialize the local model of device k
3:     for each round do                                    Main loop
4:         RX model updates from neighbors
5:         set the aggregated model to the current local model
6:         for all devices j in the neighbor set do
7:             add the consensus correction (8) with mixing weight (10)
8:         end for
9:         ModelUpdate(aggregated model) as in (9)
10:         (updated model) TX to neighbors
11:     end for
12:end procedure
13:procedure ModelUpdate(model)                              Local SGD
14:     split the local dataset into mini-batches
15:     for each batch do                                    Local model update
16:         run one SGD step of (9) on the batch
17:     end for
18:     store the updated local model
19:     return(updated local model)
20:end procedure
Algorithm 1 Consensus-based Federated Averaging

III-A Consensus-based Federated Averaging (CFA)

The first strategy extends the centralized FA and it is described in the pseudocode fragment of Algorithm 1. It is referred to as Consensus-based Federated Averaging (CFA).

After initialization (each device hosts a model of the same architecture, initialized similarly), on each communication round every device sends its model update (once per round) and receives the model weights from its neighbors. Based on the received data, the device updates its model sequentially to obtain the aggregated model

(8)

where the consensus step-size and the mixing weights control how the neighbor models are combined. Next, a gradient update is performed using the aggregated model as

(9)

by running SGD over a number of mini-batches. Model aggregation (8) is similar to the sum-weight gossip protocol [17, 18] for a unit consensus step-size. However, mixing weights are used here to combine the model innovations, i.e., the differences between the neighbor models and the local one. In addition, the step-size controls the consensus stability.

Inspired by FA approaches (Sect. II), the mixing weights are chosen as

(10)

Other choices are based on weighted consensus strategies [23], where the mixing weights are adapted at each epoch based on the current validation accuracy or loss metrics. The consensus step-size can be chosen as the inverse of the maximum degree of the graph [28] that models the interaction topology of the network. Notice that the adjacency matrix of this graph has unit entries for connected node pairs and zero entries otherwise. Besides the consensus step-size, it is additionally assumed that the SGD step-size is optimized for convergence, namely that the objective function value decreases with each iteration of gradient descent, at least after some threshold. Convergence is further analyzed in Sect. V with experimental data.

Defining the set of parameters to be exchanged among neighbors, CFA requires only the iterative exchange of the model updates, therefore

(11)
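For illustration, a minimal Python sketch of one CFA round on a single device is given below, following the structure of (8)-(10): neighbor model innovations are combined with size-proportional mixing weights and a consensus step-size, and a local SGD pass is then run on the mini-batches. The weight normalization, step-size and learning-rate values are assumptions in the spirit of the text, not the paper's exact expressions.

```python
import numpy as np

def cfa_round(W_k, neighbor_models, neighbor_sizes, local_size,
              grad_fn, local_batches, eps=0.5, lr=0.01):
    """One CFA round on device k (Algorithm 1, eqs. (8)-(10)).

    W_k             : current local model (np.ndarray)
    neighbor_models : models received from the neighbors
    neighbor_sizes  : neighbor dataset sizes, used for the mixing weights
    grad_fn         : grad_fn(W, batch) -> gradient of the local loss (4)
    local_batches   : local mini-batches for the SGD pass (9)
    """
    # Consensus aggregation (8): combine the model innovations W_j - W_k
    total = float(local_size + sum(neighbor_sizes))
    psi = W_k.copy()
    for W_j, n_j in zip(neighbor_models, neighbor_sizes):
        a_kj = n_j / total                    # FA-inspired mixing weight, cf. (10)
        psi += eps * a_kj * (W_j - W_k)
    # Local SGD pass (9) starting from the aggregated model
    W_new = psi
    for batch in local_batches:
        W_new = W_new - lr * grad_fn(W_new, batch)
    return W_new                              # transmitted to the neighbors
```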

III-B Consensus-based Federated Averaging with Gradients Exchange (CFA-GE)

Fig. 2: From top to bottom: a) CFA-GE; b) CFA-GE with two-stage negotiation and implementation (left) compared against CFA w/o gradient exchange (right).

The second strategy proposes the joint exchange of local gradients and model updates by following the four-stage iterative procedure illustrated in Fig. 2.a for a generic epoch. The new algorithm is referred to as Consensus-based Federated Averaging with Gradients Exchange (CFA-GE). The first stage (step #1) is similar to CFA and obtains the aggregated model by consensus-based aggregation as in (8). Before using this model for the local update, it is fed back to the same neighbors (“negotiation” stage in step #2 of Fig. 2.b). The aggregated model is then used by the neighbors to compute the gradients

(12)

using their local data. Notice that all gradients are computed over a single batch (or mini-batch) of local data; sending multiple gradients (corresponding to multiple mini-batches) is an alternative option, not considered here due to bandwidth limitations. The chosen batch/mini-batch can change on consecutive communication rounds. The gradients are sent back to the device in step #3. Compared with CFA, this step allows every device to exploit additional gradients computed on neighbor data, and makes the learning much faster. On each device, the local model is thus updated using the received gradients (12) according to

(13)

where the mixing weights now combine the gradients. Finally, as done for CFA in (9), a gradient update is performed using the aggregated model (13) and the local data mini-batches

(14)

To summarize, for each device, CFA-GE combines the gradients computed over the local data with the gradients obtained by the neighbors over their batches. The negotiation stage (13)-(14) is similar to the diffusion strategy proposed in [30][31]. In particular, we aggregate the model first (8), then we run one gradient descent round using the received gradients (13), and finally a number of SGD rounds (14) using local mini-batches. As revealed in Sect. V, optimization of the mixing weights for the gradients is critical for convergence. Considering that the gradients in (12), obtained from the neighbors, are computed over a single batch of data, as opposed to the local data mini-batches, the mixing weights should be chosen accordingly; this aspect is further discussed in Sect. V.
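The four CFA-GE stages can be sketched as follows, starting from an already aggregated model as in (8): the neighbor-side gradient computation (12), the gradient-mixing update (13) and the final local SGD pass (14) are emulated on a single device. The uniform mixing weights and learning rates are illustrative assumptions, and the neighbor gradients are computed synchronously here for simplicity (the two-stage implementation of Sect. III-C relaxes this).

```python
def cfage_round(psi_k, neighbor_batches, local_batches, grad_fn,
                mu=0.05, lr=0.01, grad_weights=None):
    """One CFA-GE update on device k, after the consensus aggregation psi_k (8).

    neighbor_batches : one single batch per neighbor; each neighbor evaluates the
                       gradient of its own loss at psi_k (steps #2-#3, eq. (12))
    local_batches    : local mini-batches for the final SGD pass (14)
    """
    # Steps #2-#3: neighbors compute their gradients (12) at the aggregated model
    neighbor_grads = [grad_fn(psi_k, b) for b in neighbor_batches]
    # Step #4a: model update (13) with the received gradients and mixing weights
    if grad_weights is None:
        grad_weights = [1.0 / len(neighbor_grads)] * len(neighbor_grads)
    W = psi_k.copy()
    for b_j, g_j in zip(grad_weights, neighbor_grads):
        W = W - mu * b_j * g_j
    # Step #4b: local SGD pass (14) on the device's own mini-batches
    for batch in local_batches:
        W = W - lr * grad_fn(W, batch)
    return W
```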

1:procedure CFA-GE(device k)
2:     initialize the local model of device k
3:     for each round do                                    Main loop
4:         RX models and gradients from neighbors
5:         set the aggregated model to the current local model
6:         for all devices j in the neighbor set do
7:             add the consensus correction as in (18)
8:             compute the gradients of the local loss at the neighbor models
9:             update the predicted gradients in the MEWMA update (17)
10:         end for
11:         start the model update from the aggregated model
12:         for all devices j in the neighbor set do
13:             run one gradient descent round per received gradient (19)
14:         end for
15:         run local SGD over the mini-batches as in (14)
16:         store the updated local model
17:         store the gradients to be shared
18:         (model, gradients) TX to neighbors
19:     end for
20:end procedure
Algorithm 2 CFA with gradients exchange

III-C Two-stage negotiation and implementation aspects

Unlike CFA, CFA-GE requires a larger use of the bandwidth and more communication rounds for the synchronous exchange of the gradients. More specifically, it requires a more intensive use of the D2D wireless links, first for sharing the models during the negotiation (step #2) and then for forwarding the gradients (step #3). In addition, each device should wait for the neighbor gradients before applying any model update. The proposed implementation simplifies the negotiation procedure to improve convergence time (and latency). In particular, it resorts to a two-stage scheme where, as in CFA, each device can perform its updates without waiting for a reply from the neighbors. The pseudocode is highlighted in Algorithm 2. Communication rounds vs. epochs for CFA-GE are detailed in Fig. 2.b and compared with CFA. Considering a generic device, with straightforward generalization, the following parameters are exchanged with the neighbors at each epoch:

(15)

namely the model aggregations and the gradients, organized as

(16)

In the proposed two-stage implementation, the negotiation step (step #2 in Fig. 2.a) is not implemented, as it would require synchronous model sharing. Therefore, the up-to-date model aggregations of the neighbors are not available at the current epoch or, equivalently, the device does not wait for such information from the neighbors. The gradients are instead predicted using the past (outdated) models received from the neighbors. In line with momentum-based techniques (see [32] and also Appendix B), for the predictions we use a multivariate exponentially weighted moving average (MEWMA) of the past gradients

(17)

ruled by a forgetting hyper-parameter. Setting this parameter to one, the gradient is estimated using the last available model only. A smaller value introduces a memory, with the prediction also depending on the past models. This is shown, in Sect. V, to be beneficial on real data.
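A minimal sketch of the MEWMA gradient prediction (17): each newly computed gradient is blended with the running estimate through a forgetting factor beta, so that beta = 1 keeps only the latest gradient while smaller values retain memory of the gradients obtained from past (outdated) models. The default value of beta used here is an arbitrary placeholder.

```python
import numpy as np

def mewma_update(g_hat_prev, g_new, beta=0.5):
    """Multivariate EWMA of the gradients, as in (17)."""
    if g_hat_prev is None:                    # no gradient received so far
        return np.array(g_new, dtype=float)
    return beta * np.asarray(g_new) + (1.0 - beta) * g_hat_prev

# Keep one running prediction per neighbor (illustrative usage)
g_hat = {}
def on_gradient(neighbor_id, g_j, beta=0.5):
    g_hat[neighbor_id] = mewma_update(g_hat.get(neighbor_id), g_j, beta)
    return g_hat[neighbor_id]
```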

Assuming that the device is able to correctly receive and decode the messages from its neighbors at each epoch, the model aggregation step changes from (8) to

(18)

while the model update step using the received gradients is now

(19)

and replaces (13). Finally, a gradient update on local data is done as in (14). Notice that Algorithm 2 implements (19) by running one gradient descent round per received gradient, to allow for asynchronous updates. In Appendix B, we discuss the application of CFA and CFA-GE to advanced SGD strategies designed to leverage momentum information [10][32].

III-D Communication overhead and complexity analysis

With respect to FA, the proposed decentralized algorithms take some load off the server, at the cost of additional in-network operations and increased D2D communication overhead. The overhead is quantified here for both CFA and CFA-GE in terms of the size of the parameters that need to be exchanged among neighbors. CFA extends FA and, similarly, requires each node to exchange only local model updates, at most once per round. The overhead thus corresponds to the model size (11). For a generic DNN, the model size scales with the number of layers and with the layer input/output dimensions; this is several orders of magnitude lower than the size of the input training data in typical ML and deep ML problems. As in (15), CFA-GE requires the exchange of the local model aggregations and of one gradient per neighbor: the overhead now scales with the number of neighbors as well. This is still considerably lower than the training dataset size, provided that the number of participating neighbors is limited. In the examples of Sect. V, we show that a small number of neighbors is sufficient, in practice, to achieve convergence; notice that the number of active neighbors is also typically kept small to avoid traffic issues [37]. Finally, quantization of the parameters can also be applied to limit the transmission payload, with the side effect of also improving global model generalization [38].
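As a rough illustration of the overhead analysis above, the following sketch counts the trainable parameters of a layered model and the resulting per-round D2D payload for CFA (model only, as in (11)) and CFA-GE (model plus one gradient per neighbor, as in (15)). The layer sizes, float32 encoding and absence of quantization are assumptions for illustration only.

```python
def payload_per_round(layer_dims, num_neighbors=2, scheme="CFA", bytes_per_param=4):
    """Rough per-round D2D payload (bytes) for the exchanged parameters.

    layer_dims : list of (n_in, n_out) pairs; weights and biases are counted.
    Assumes uncompressed float32 parameters (no quantization).
    """
    model_params = sum(n_in * n_out + n_out for n_in, n_out in layer_dims)
    if scheme == "CFA":
        factor = 1                   # model update only, once per round, cf. (11)
    elif scheme == "CFA-GE":
        factor = 1 + num_neighbors   # model aggregation plus one gradient per neighbor
    else:
        raise ValueError(scheme)
    return factor * model_params * bytes_per_param

dims = [(512, 16), (16, 8)]          # assumed toy layer sizes
print(payload_per_round(dims, scheme="CFA"))
print(payload_per_round(dims, num_neighbors=2, scheme="CFA-GE"))
```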

Besides the overhead, the computational complexity of CFA and CFA-GE scales with the global model size and is ruled by the number of local SGD rounds. However, unlike FA, model aggregations and local SGD are both implemented on the device. With respect to CFA, CFA-GE computes up to one additional gradient per neighbor model and runs up to one additional gradient descent round (19) per received gradient for the local model update. A quantitative evaluation of the overhead and of the execution time of the local computations is proposed in Sect. V by comparing FA, CFA and CFA-GE using real data and low-power System-on-Chip (SoC) devices.

Considering now the networking aspects, the cost of a D2D communication is much lower than the cost of a server connection, which is typically long-range. D2D links cover shorter ranges and require lower transmit power: the communication cost is thus ruled by the energy spent during receiving operations (radio wake-up, synchronization, decoding). Besides, in large-scale and massive IIoT networks, sending model updates to the server, as done in conventional FA, might require several D2D communication rounds, as the information is relayed via neighbor devices. D2D communications can serve as an underlay to the infrastructure (server) network and can thus exploit the same radio resources. Such two-tier networks are a key concept in next-generation IoT and 5G [39] scenarios.

Finally, an optimal trade-off between in-network and server-side operations is also possible by alternating rounds of FA with rounds of in-network consensus (CFA or CFA-GE). This corresponds to a real-world scenario where communication with the server is available, but intermittent, sporadic, unreliable or too costly. During initialization, the devices might use the last available global model received from the server after the communication rounds of the previous FA phase, and obtain a local update via SGD. This update is fed back to the neighbors to start the CFA or CFA-GE iterations.

IV Consensus-based FL: an introductory example

In this section, we give an introductory example of consensus-based FL approaches, comparing their performance to conventional FL methods. We resort here to a small network of wireless devices communicating via multihop links as depicted in Fig. 3, without any central coordination. Although simple, the proposed topology is still useful in practice to validate the performance of FL under the assumption that no device has a direct (i.e., single-hop) connection with all nodes in the network. More practical usage scenarios are considered in Sect. V. Without affecting generality, the devices collaboratively learn a global model that is simplified here as a NN model with only one fully connected layer:

(20)

Considering the network layout of Fig. 3, each device is connected to a few neighbors only. Each device has a database of local training data taken from the MNIST (Modified National Institute of Standards and Technology) image database [40] of handwritten digits. Output labels take 10 different values (digits 0 up to 9), model inputs have size 784 (each image is represented by 28x28 grayscale pixels), while the output has dimension equal to the number of classes. In Fig. 3, each device obtains the same number of training images, taken randomly (IID) from the MNIST training database. Non-IID data distribution is investigated in Fig. 4.

We assume that each device has prior knowledge of the model (20) structure at the initial stage, namely the input/output size and the non-linear activation. Moreover, each of the local models starts from the same random initialization [15]. On every new epoch, the devices perform consensus iterations using the model parameters received from the available neighbors during the previous epoch. Local model updates for CFA (9) and CFA-GE (14) use the cross-entropy loss for gradient computation

(21)

where the sum is computed over mini-batches of fixed size. The devices thus make one training pass over their local dataset, consisting of a few mini-batches. For CFA, we choose the consensus step-size and the mixing parameters as in (10). For CFA-GE, fixed mixing parameters are used for the gradients (13), while the MEWMA hyper-parameter is set to a fixed value.

On every epoch, performance is analyzed in terms of the validation loss (21) for all models. For testing, we considered the full MNIST validation dataset consisting of 10,000 images. The loss decreases over consecutive epochs as the model updates converge to the true global model.
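For reference, a minimal TensorFlow sketch of the single fully-connected-layer model (20) with the cross-entropy loss (21) and a consensus-style mixing step is reported below; the learning rate, consensus step-size and mixing weight are placeholder values, not the hyper-parameters used in Figs. 3-4.

```python
import tensorflow as tf

# Single fully connected layer mapping 784-pixel MNIST images to 10 classes, cf. (20)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()   # cross-entropy loss (21)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.025)    # assumed step size

@tf.function
def local_sgd_step(x_batch, y_batch):
    """One local mini-batch SGD step, as used in the CFA/CFA-GE updates."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

def mix_with_neighbor(neighbor_weights, eps=0.5, a_kj=0.5):
    """Consensus-style mixing (8) of the layer weights received from one neighbor."""
    mixed = [w + eps * a_kj * (nw - w)
             for w, nw in zip(model.get_weights(), neighbor_weights)]
    model.set_weights(mixed)
```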

Fig. 3: Comparison of FL methods over a multihop wireless network of devices. Validation loss vs. iterations over the full MNIST dataset for all devices: CFA (circle markers), CFA-GE (solid lines without markers), FA (red line), isolated model training (diamond markers), and centralized ML without federation (dashed line). Iterations correspond to epochs when running inside the server, or communication rounds when running consensus or FA.
Fig. 4: Effect of non-IID unbalanced training data on two selected devices (red lines) of a multihop wireless network. Validation loss over the full MNIST dataset vs. epochs (or consensus iterations). Comparison between CFA (solid lines), isolated model training (diamond markers), and centralized ML without federation (dashed line) is also presented. The non-IID data distribution is shown visually on top, for each case.

In Figs. 3-4, we validate the performance of the CFA algorithms in the case of uniform (Fig. 3) and uneven (Fig. 4) data distribution among the devices. More specifically, Fig. 3 compares CFA and CFA-GE, with CFA-GE switching to the two-stage negotiation algorithm of Sect. III-C after a few initial epochs during which the negotiation algorithm described in Sect. III-B is used. On the other hand, in Fig. 4, we consider the general case where the data is unevenly distributed, and the partitioning among devices is also non-IID. Herein, we compare two cases. In the first one, the device considered in Fig. 4 on the left is located at the edge of the network and connected to one neighbor only; it obtains images from only a subset of the available classes, corresponding to a small percentage of the training database. The other device, connected with more neighbors, retains a larger database covering a larger share of the training data. In the second case (on the right), the situation is reversed. As expected, compared with the first case, convergence is more penalized in the second case, although CFA running on the highlighted device (red lines) can still converge. As shown in Fig. 3, CFA-GE (solid lines without markers) further reduces the loss compared with CFA (circle markers). The effect of an unbalanced database for CFA-GE is also considered in Sect. V.

Fig. 5: From left to right: a) experimental setup: network and sensing model for in-network federated learning with Convolutional Neural Network (CNN, top-left corner) examples whose parameters are shown in Table I; b) industrial scenario (CNR-STIIMA de-manufacturing pilot plant) and deployed radars; c) examples of the measured beat signal spectrum (FFT) for selected classes.

FL and consensus schemes have been implemented using the TensorFlow library [33], while real-time D2D connectivity is simulated by a Monte Carlo approach. All simulations run for a fixed maximum number of epochs. Besides the proposed consensus strategies, the validation loss is also computed for three additional scenarios. The first one is labelled as “isolated training” and is evaluated in Fig. 3 for IID and in Fig. 4 for non-IID data. In this scenario, models are trained without any cooperation with neighbors (or server) by using the locally available training data only. This use case is useful to highlight the benefits of mutual cooperation, which become significant after a few epochs for IID data and somewhat later for non-IID data, according to the considered network layout. Notice that isolated training is also limited by overfitting effects after the first iterations, as clearly observable in Fig. 3 and in Fig. 4, since local/isolated model optimization is based on only a few training images compared with the validation database. Consensus and mutual cooperation among devices thus prevent such overfitting. The second scenario, “centralized ML without federation” (dashed lines in Figs. 3 and 4), corresponds to (5) and gives the validation loss obtained when all nodes send all their locally collected training data directly to the server. It serves as a benchmark for convergence analysis, as it provides the optimal parameter set when the whole training database is available. Notice that the CFA-GE method closely approaches the optimal parameter set and converges faster than CFA. The third scenario implements the FA strategy (see Sect. II), which relies on server coordination, while cooperation among devices through D2D links is not enabled. As depicted in Fig. 3, the convergence of the FA validation loss is similar to that of the best-connected devices, although the convergence speed is slightly faster after the first epochs. In fact, for the considered network layout, the best-connected devices can be considered a good replacement of the server, being directly connected with most of the other devices. In the next section, we consider a more complex device deployment in a challenging IIoT scenario.

V Validation in an experimental IIoT scenario

The proposed in-network FL approaches of Sect. III are validated here on a real-world IIoT use case. Data are partitioned over the IIoT devices, and D2D connectivity [14] is used here as a replacement for a centralized communication infrastructure [34]. As depicted in Fig. 5, the reference scenario consists of a large-scale and dense network [35] of autonomous IIoT devices that sense their surroundings using Frequency Modulated Continuous Wave (FMCW) radars [45] working in the sub-THz band. Radars in the mmWave (or sub-THz) bands are very effective in industrial production lines (or robotic cells, as in Fig. 5.b) for environment/obstacle detection [36], velocity/distance measurement and virtual reality applications [43]. In addition, mmWave radios have also been considered as candidates for 5G new radio (NR) allocation. They thus represent promising solutions towards the convergence of dense communications and advanced sensing technologies [3].

In the proposed setup, the above-cited devices are employed to monitor a shared industrial workspace during Human-Robot Collaboration (HRC) tasks, to detect and track the position of the human operators (i.e., the range distance of the individuals) moving near a robotic manipulator inside a fenceless space [42]. In industrial shared workplaces, measuring positions and distances is mandatory to enforce a worker protection policy, i.e., to implement collision avoidance, speed reduction, anticipation of contacts of limited entity, etc. In addition, it is highly desirable that operators are set free from wearable devices used to generate location information [3]. Tracking of body movements must also not depend on active human involvement. For a static background, the problem of passive body detection and ranging can be solved via background subtraction methods and ML tools (see [44] and references therein). The presence of the robot, often characterized by a massive metallic body, moving inside the shared workplace poses additional remarkable challenges in ranging and positioning, because robots induce large, non-stationary, and fast RF perturbation effects [42].

The radars collect a large amount of data that cannot be shipped back to the server for training and inference, due to the latency constraints imposed by the worker safety policies. In addition, direct communication with the server is available but reserved for monitoring the robot activities (and re-planning robotic tasks in case of dangerous situations) [42], and should not be used for data distribution. Therefore, to solve the scalability challenge while addressing latency, reliability and bandwidth efficiency, we let the devices perform model training without any server coordination, using only mutual cooperation with neighbors. We thus adopt the proposed in-network FL algorithms relying solely on local model exchanges over the active D2D links.

In what follows, we first describe the dataset and the ML model adopted for body motion recognition. Next, we investigate the convergence properties of the CFA and CFA-GE solutions, namely the required number of communication rounds (i.e., latency) for varying connectivity layouts, network sizes and hyper-parameter choices, such as mixing weights and step sizes. Finally, we provide a quantitative evaluation of the communication overhead and of the local computational complexity, comparing all the proposed algorithms.

V-A Data collection and processing

In the proposed setup, the radar (see [43] for a review) transmitting antennas radiate a swept, modulated waveform [45] with a bandwidth of a few GHz, a carrier frequency in the sub-THz range, and a fixed ramp (pulse) duration. The radar echoes reflected by moving objects are mixed at the receiver with the transmitted signal to obtain the beat signal. Beat signals are then converted to the frequency domain (i.e., the beat signal spectrum) by using a Fast Fourier Transform (FFT) and averaged over consecutive frames (i.e., frequency sweeps or ramps). The FFT samples are used as model inputs and serve as the training data collected by the individual devices. The network of radars is designed to discriminate body movements from those of the robots and, in turn, to detect the distance of the worker from the robot with the purpose of identifying potentially unsafe conditions [36]. The ML model is here trained to classify potential HR collaborative situations characterized by different HR distances corresponding to safe or unsafe conditions. In particular, the first class corresponds to the robot and the worker cooperating at a safe distance, a second class identifies the human operator as working close to the robot, and the remaining classes correspond to intermediate distance ranges. The FFT range measurements (i.e., the beat signal spectrum) and the corresponding true labels in Fig. 5.c are collected independently by the individual devices and stored locally. During the initial FL stage, each device independently obtains its own FFT range measurements. The data distribution is also non-IID: in other words, most of the devices have local examples only for a (random) subset of the classes of the global model. However, we assume that there are sufficient examples for each class when considering the data stored on all devices. Local datasets correspond to a small percentage of the full training database. Mini-batches for local gradients have a small size, and training passes thus consist of few mini-batches, for fast model update. On the contrary, the validation data consists of range measurements collected inside the industrial plant.
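A minimal sketch of the beat-signal pre-processing described above is given below: the sampled beat signal of consecutive FMCW ramps is windowed, transformed with an FFT and averaged over the frames to obtain the model input. The FFT size, window choice and frame count are illustrative assumptions, as the actual radar parameters are not reported here.

```python
import numpy as np

def beat_spectrum(frames, n_fft=512):
    """Average beat-signal spectrum over consecutive frames (frequency sweeps).

    frames : array of shape (num_frames, samples_per_ramp) with the sampled beat
             signal of each FMCW ramp; n_fft is an assumed FFT size.
    Returns the magnitude spectrum averaged over the frames, used as model input.
    """
    window = np.hanning(frames.shape[1])              # reduce spectral leakage
    spectra = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    return spectra.mean(axis=0)

# Toy usage with synthetic frames (the real data comes from the deployed radars)
frames = np.random.randn(8, 256)
x_input = beat_spectrum(frames)
print(x_input.shape)                                  # (n_fft // 2 + 1,) bins
```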

Fig. 6: From left to right: a) FL for the CNN model, b) the same for a larger number of devices, and c) for the 2NN model. The network is characterized by a smaller (solid lines) or larger (dotted lines) number of neighbors per node. Comparative analysis of CFA, CFA-GE, FA (red lines), and centralized ML, i.e., learning w/o federation (dashed lines). The CNN and 2NN parameters are described in Table I.
TABLE I: NN models (CNN and 2-NN) and trainable parameters (weights and biases) per layer for the considered classes.

Unlike the previous section, we now choose an ML global model characterized by a NN with two trainable layers. In particular, two networks are considered, with hyper-parameters and corresponding dimensions for weights and biases detailed in Table I. The first, convolutional NN model (CNN) consists of a 1D convolutional layer (a small number of filters with a few taps each) followed by max-pooling (non-trainable) and a fully connected (FC) output layer. The second model (2NN) replaces the convolutional layer with an FC layer of hidden nodes followed by a ReLU activation and a second FC output layer. The examples are useful to assess the convergence properties of the proposed distributed strategies for different layer types, dimensions and numbers of trainable parameters. As before, we further assume that, during the initial stage, each device has knowledge of the ML global model structure (see layers and dimensions in Table I). At each new communication round, the model parameters of each layer are multiplexed and propagated simultaneously by using a Time Division Multiple Access (TDMA) scheme [14].

FL has been simulated in a virtual environment, but using real data from the plant. The virtual environment creates an arbitrary number of virtual devices, each configured to process an assigned training dataset and to exchange parameters that are saved in real time on temporary cache files. Files may be saved on RAM disks to speed up the simulation. The software is written in Python and uses the TensorFlow and multiprocessing modules; simplified configurations for testing both the CFA and CFA-GE setups are also provided in the repository [46]. The code script examples are available as open source and show the application of CFA and CFA-GE to different NN models. Hyper-parameters such as the learning rates for weights and gradients, the number of neighbors and the MEWMA parameter in (17) are fully configurable. The datasets obtained in the scenario of Fig. 5.b are also available in the same repository. Finally, examples have been provided for the implementation and the analysis of the execution time on low-power devices (Sect. V-C). The current optimization toolkit does not simulate, or account for, packet losses during communication: these are considered negligible for short-range connections. However, the network and connectivity can be time-varying and arbitrarily defined.

TABLE II: Optimized hyper-parameters for CFA and CFA-GE under three network configurations.

V-B Gradient exchange optimization for NN

In what follows, FL is verified by varying the number of devices and the number of neighbors, to test different D2D connectivity layouts. In Figure 6, we validate the consensus-based FL tools, for both the CNN and 2NN models, over networks with an increasing number of devices. To simplify the analysis of different connectivity scenarios, the network is simulated as k-regular (i.e., all network devices have the same number of neighbors, or degree), while we verify realistic topologies characterized by a few neighbors per node. First, in Fig. 6, we compare the decentralized CFA and CFA-GE with FA and with conventional centralized ML without federation in (5). The chosen optimization hyper-parameters are summarized in Table II. For all FL cases (CFA, CFA-GE and FA), we plot the validation loss vs. the communication rounds (i.e., epochs) averaged over all devices. For centralized ML (dashed lines), the validation loss is analyzed over epochs, now running inside the server. The CFA plots (circle markers) slowly approach the curves corresponding to FA and centralized ML, while performance improves in dense networks (dotted lines). The CFA-GE curves (solid and dotted lines without markers) are comparable with FA and converge after a limited number of communication rounds. A small number of neighbors (solid lines) is sufficient to approach the FA performance, while increasing the number of neighbors (dotted lines) makes the validation loss comparable with that of centralized ML without federation. Running local SGD on the received gradients as in eq. (19) causes some fluctuations of the validation loss when approaching convergence. Fluctuations are due to the (large) step size used to combine the gradients on every communication round: learning rate adaptation techniques [10] can be applied for fine tuning. Considering the 2NN model, the validation loss is larger in all cases as a result of the larger number of model parameters to train, compared with the CNN. CFA-GE is still comparable with FA after the first rounds and converges towards centralized ML after more rounds. In all cases, a few neighbors are sufficient to approach the FA results. More neighbors provide performance improvements mostly in small networks, while they are still useful to match the centralized ML performance.
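A k-regular topology of the kind used in these simulations can be generated, for instance, with the networkx package as sketched below; the number of devices and the degree are assumed example values.

```python
import networkx as nx

def make_topology(num_devices=30, degree=4, seed=0):
    """k-regular D2D topology: every device has the same number of neighbors."""
    g = nx.random_regular_graph(degree, num_devices, seed=seed)
    return {k: list(g.neighbors(k)) for k in g.nodes}   # neighbor set per device

neighbors = make_topology(num_devices=30, degree=4)
print(neighbors[0])       # neighbor list of device 0
```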

Fig. 7: From top to bottom: validation loss for varying learning rates used to combine the neighbor gradients, for a fixed number of devices and k-regular networks with increasing connectivity degree, from a) the lowest to c) the highest.

In Fig. 7, we consider the CFA-GE method and analyze more deeply the effect of the hyper-parameter choice on convergence, for a fixed number of devices and varying network degrees. The first case is representative of a multihop wireless network, while networks with larger degrees are useful to verify the performance of FL over denser networks. Line plots with bars in Fig. 7 are used to graphically represent the variability of the validation loss observed by the devices. We analyze the learning rate used to combine the gradients received from the neighbors in (19); the other hyper-parameters are selected as in Table II. As expected, increasing the network degree helps convergence and makes the validation loss decrease faster, since fewer communication rounds are required. However, while for low-degree networks the learning rates can be chosen rather freely without affecting performance, denser networks, i.e., with large degree, require the optimization of the learning rate: smaller rates improve convergence.

In Table III, we analyze the latency of the FL process, measured here in terms of the number of communication rounds. CFA-GE is considered in detail, while the performance of CFA can be inferred from Fig. 6. Table III reports the number of communication rounds (or epochs) required to achieve a target validation loss on all devices. For the considered case, the chosen validation loss corresponds to a given (global) accuracy. Focusing on the CNN layers, a network with few neighbors per device requires a bounded number of communication rounds to achieve the target loss. This is in line with the theoretical bound [24] for the local accuracy (obtained by isolated training). Considering FA (not shown in the Table), the number of required rounds is again comparable with that of the decentralized optimization. Increasing the number of neighbors further reduces the required communication rounds. For the 2NN layers, the required number of epochs increases due to the smaller local accuracy as well as the larger number of parameters to be trained for each NN layer. Finally, for the proposed setup, we noticed that performance improves by keeping the learning rate for the hidden-layer parameters slightly larger than the rate for the output-layer parameters. This is particularly evident when convolutional layers are used.

TABLE III: Number of communication rounds required to achieve the target validation loss, for the CNN and 2NN models.