Recent advances in machine learning (ML) have revolutionized many domains and industrial scenarios. However, such improvements have been achieved at the cost of large computational and communication resources, resulting in significant energy and CO2 (carbon) footprints. Traditional centralized learning (CL) requires all training procedures to be conducted inside data centers that are in charge of collecting training data from data producers (e.g., sensors, machines and personal devices), fusing large datasets, and continuously learning from them. Data centers are thus energy-hungry and responsible for a considerable share of the global emissions of the entire Information and Communication Technology (ICT) ecosystem.
In Federated Learning (FL), the parameters of an ML model, typically the weights and biases of Deep Neural Networks (DNN), are collectively optimized across several resource-constrained edge/fog devices, which act as both data producers and local learners. FL distributes the computing task across many devices characterized by low-power consumption profiles, compared with data centers, and owning small datasets.
As shown in Fig. 1, using FL policies, such as federated averaging, allows devices to learn a local model under the orchestration of a centralized parameter server (PS). The PS fuses the received local models to obtain a global model that is fed back to the devices. PS functions are substantially less energy-hungry than CL and can be implemented at the network edge. This suggests that FL could bring a significant reduction in energy footprint, as the consumption is distributed across devices, obviating the need for a large infrastructure for cooling or power delivery. However, vanilla FL architectures still leverage a server-client architecture, which not only represents a single point of failure, but also lacks scalability and, if not optimized, can further increase the energy footprint. To tackle these drawbacks, recent developments in FL architectures target fully decentralized solutions relying solely on in-network processing, thus replacing PS functions with a consensus-based federation model. In consensus-based FL (CFL), the participating devices mutually exchange their local ML model parameters, possibly via mesh or device-to-device (D2D) communication links, and implement distributed weighted averaging [6, 7, 8]. Devices might be either co-located in the same geographic area or distributed.
Contributions: the paper develops a novel framework for the analysis of energy and carbon footprints in distributed ML, including, for the first time, comparisons and trade-off considerations covering vanilla FL, consensus-based FL (CFL) and data-center-based centralized learning. Despite an initial attempt to assess the carbon footprint of FL, an end-to-end analysis of the energy footprint remains unexplored. To fill this void, we develop an end-to-end framework and validate it using real-world data.
The paper is organized as follows: Sections II and III describe the framework for the energy consumption and carbon footprint evaluation of different FL strategies, and the impact of energy efficiency in terms of communication and computing costs. In Section IV, we consider a case study in a real-world industrial workplace, targeting the training of an ML model to localize human operators in a human-robot cooperative manufacturing plant. Carbon emissions are quantified and discussed for continuous industrial workflow applications requiring periodic model training updates.
II Energy footprint modeling framework
The proposed framework provides insights into how the different components of the FL architecture, i.e., the local learners, the core network and the PS, contribute to the energy bill. The learning system consists of multiple devices and one data center. Each device has a local dataset of (labeled) examples that are typically collected independently. The objective of the learning system is to train a DNN model that transforms the input data into the desired output classes. Model parameters are specified by a weight matrix. The training system uses the examples in the local datasets to minimize the loss function iteratively, over a pre-defined number of learning rounds.
Considering an individual device, the total amount of energy consumed by the learning process can be broken down into computing and communication components. The energy cost is thus modelled as a function of the energy due to computing per learning round and the energy per correctly received/transmitted bit over the wireless link. The latter can be further broken down into uplink (UL) communication with the data center (or the PS) and downlink (DL) communication from the PS to the device. The energy cost for communication includes the power dissipated in the RF front-end and in the conversion, baseband processing and transceiver stages. We neglect the cost of on-off radio switching. In addition, communication energy costs are quantified on average, as routing through the radio access and the core network can vary (but might be assumed as stationary apart from failures or replacements). Finally, the energy for computing includes the cost of the learning round, namely the local gradient-based optimizer and data storage. In what follows, we quantify the energy cost of model training implemented either inside the data center (CL) or distributed across multiple devices (FL). Numerical examples are given in Table I and in the case study in Section IV.
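The decomposition above can be sketched as a simple accounting function; all parameter names are illustrative assumptions, not symbols from the paper:

```python
def device_energy(rounds, e_comp_per_round, model_bits, e_ul_per_bit, e_dl_per_bit):
    """Total energy [J] consumed by one device over `rounds` learning rounds."""
    # Computing term: local gradient-based optimizer and storage, per round.
    computing = rounds * e_comp_per_round
    # Communication term: per round, the model is sent over UL and received over DL,
    # each at a given energy cost per correctly transmitted/received bit.
    communication = rounds * model_bits * (e_ul_per_bit + e_dl_per_bit)
    return computing + communication
```

For instance, a device paying 2 J of computing per round and exchanging a 1 Mbit model over links costing 100 nJ/bit (UL) and 50 nJ/bit (DL) accumulates both terms linearly over the rounds.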
II-A Centralized Learning (CL)
Under CL, model training is carried out inside the data center, and the energy cost per round depends on the GPU/CPU power consumption, the time span required for processing an individual batch of data, i.e., minimizing the loss, and the number of batches per round. We neglect here the cost of the initial dataset loading since it is a one-step process. For a given number of rounds and a target loss, the total, end-to-end, energy in Joule [J] is given by (1), where the computing cost is scaled by the Power Usage Effectiveness (PUE) of the considered data center [10, 11]. The cost for UL communication for data fusion scales with the data size of each local database and the number of devices. PUE accounts for the additional power consumed by the data center infrastructure for data storage, power delivery and cooling; by definition, its values are larger than one.
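A hedged sketch of this cost structure follows; it is a simplification with assumed parameter names, not the paper's exact expression. The data-center computing energy is scaled by the PUE, and a one-off UL cost is paid to upload the devices' raw datasets:

```python
def cl_energy(rounds, pue, p_server, t_batch, n_batches,
              raw_bits_per_device, n_devices, e_ul_per_bit):
    """End-to-end CL energy [J]: PUE-scaled computing plus one-off data fusion."""
    # Computing: server power [W] * batch time [s] * batches per round * rounds,
    # inflated by the PUE to account for cooling and power delivery.
    computing = pue * p_server * t_batch * n_batches * rounds
    # Data fusion: each device uploads its raw dataset once, at e_ul_per_bit [J/bit].
    data_fusion = n_devices * raw_bits_per_device * e_ul_per_bit
    return computing + data_fusion
```

Note that the upload term is paid only once, so its relative weight shrinks as the number of training rounds grows.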
|Parameters|Data center/PS|Devices|
|Batch time [ms]|||
|Raw data size [MB]|||
|Utilization||(model averaging)|
|ML model|DeepMind, layers. Optimizer: Adam||
|Comm. EE|Downlink (DL)|Uplink (UL)|
|Mesh or D2D (M)|||
|CL (data center)|||
|FL (with PS)|||
|CFL|||
II-B Federated Learning (FL)
Unlike CL, FL distributes the learning process across a selected subset of active devices, as shown in Fig. 1. At each round, the local dataset is used to train a local model that minimizes the local loss. The local model is then forwarded to the PS over the UL. The PS is in charge of updating the global model for the following round through a weighted aggregation of the received local models, with weights given by the ratio between the number of local examples and the global number of examples. The new model is finally sent back to the devices over the DL; other aggregation strategies are also possible. Notice that, while the active devices run the local optimizer and share the local model with the PS on the assigned round, the remaining devices have their computing hardware turned off, while the communication interface is powered on to decode the updated global model.
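The PS aggregation step described above reduces to an average of the received models weighted by the local dataset sizes. A minimal pure-Python sketch (function and variable names are our own):

```python
def federated_average(local_models, local_sizes):
    """Weighted average of local models (lists of floats of equal length).

    Weights are the local dataset sizes divided by the global number of examples.
    """
    total = float(sum(local_sizes))
    dim = len(local_models[0])
    return [sum(n / total * m[i] for n, m in zip(local_sizes, local_models))
            for i in range(dim)]
```

A device holding three times as many examples as its peer contributes three times as strongly to each global parameter.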
For a given number of rounds, now consisting of both learning and communication tasks, the total end-to-end energy in (2) includes both device and PS consumption.
PS energy depends on the time needed for model averaging, which is considerably smaller than the batch processing time at the data center. The energy cost per round for each device is due to the local optimization over its data batches. Notice that, while data centers employ high-performance CPUs, GPUs or other specialized hardware (e.g., NPUs or TPUs), the devices are usually equipped with embedded low-consumption CPUs or microcontrollers; it is thus reasonable to assume a much lower device power consumption. The model size quantifies the size in bits of the model parameters to be exchanged, which is typically much smaller than the raw data. In addition, the parameter size is roughly the same for each device, unless lossy/lossless compression is implemented. Sending data regularly in small batches simplifies medium access control resource allocation and frame aggregation operations. The PUE for the devices is set to one, as no dedicated cooling or power-delivery infrastructure is involved.
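The per-round accounting described in this subsection can be sketched as follows, with all parameter names as illustrative assumptions: active devices pay a local-optimization cost plus the UL cost of their model update, every device pays the DL cost of decoding the global model, and the PS pays a (small) averaging cost:

```python
def fl_energy_per_round(n_active, n_total, e_local_opt, model_bits,
                        e_ul_per_bit, e_dl_per_bit, e_ps_round):
    """FL energy [J] for one round, summing device and PS contributions."""
    devices = (
        # Active devices: local optimizer plus UL transfer of the local model.
        n_active * (e_local_opt + model_bits * e_ul_per_bit)
        # All devices keep the radio on to decode the global model over the DL.
        + n_total * model_bits * e_dl_per_bit
    )
    return devices + e_ps_round
```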
II-C Consensus-driven Federated Learning (CFL)
In decentralized FL driven by consensus, devices mutually exchange their local model parameters using a low-power distributed mesh network as backbone [2, 7, 12]. As shown in the example of Fig. 1, devices exchange a compressed version [12, 13, 14] of their local models following an assigned graph connecting the learners, and update them by distributed weighted averaging [7, 8]. In every new round, each device updates its local model using the parameters obtained from its chosen neighbor device(s); the mixing weights can be chosen proportionally to the neighbors' dataset sizes. Averaging is followed by gradient-based model optimization on the local data.
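The consensus update can be sketched as a simple gossip-style mixing rule; the single mixing coefficient `eps` is an illustrative simplification of the per-neighbor weights discussed above:

```python
def consensus_update(own_model, neighbor_models, eps=0.5):
    """Mix the local model (list of floats) with the neighbors' average.

    With no neighbors the model is returned unchanged; 0 < eps <= 1 controls
    how strongly the device moves toward its neighborhood.
    """
    if not neighbor_models:
        return list(own_model)
    k = len(neighbor_models)
    return [(1.0 - eps) * w + eps * sum(nb[i] for nb in neighbor_models) / k
            for i, w in enumerate(own_model)]
```

Each device would call this before its local gradient step, so the models diffuse across the mesh graph over the rounds without any PS.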
For the set of active devices and a given number of rounds, the energy footprint in (3) is captured by the device consumption alone. The sum in (3) models the total energy spent by each device to diffuse its local model parameters to the selected neighbors at each round.
III Carbon footprint assessment
The carbon footprint evaluation assumes that each device, including the server, is located in a specific geographical region characterized by a known carbon intensity (CI) of electricity generation. CI is measured in kg CO2-equivalent emissions per kWh (kgCO2-eq/kWh), which quantifies how much carbon is emitted per kilowatt hour of generated electricity. In the following, we consider the CI figures reported for the EU in 2019. Considering the energy models (1)-(3), the carbon emission is evaluated by multiplying each individual energy contribution by the corresponding carbon intensity value. The carbon footprints and the proposed framework are summarized in Table II for the CL, FL and CFL policies.
To analyze the main factors that impact the estimated carbon emissions, a few simplifications to the energy models (1)-(3) are introduced in the following. Communication and computing costs are quantified on average, in terms of the corresponding energy efficiencies (EE). The communication EEs for DL, UL and mesh networking are measured in bit/Joule [bit/J] and describe how much energy is consumed per correctly received information bit. The efficiencies depend on the device/server power consumption for communication and on the net UL/DL or mesh throughput. Depending on the network implementation, we consider different choices of these efficiencies. The computing efficiency quantifies the number of rounds per Joule [round/J], namely how much energy per learning round is consumed at the data center (or PS). Devices equipped with embedded low-consumption CPUs typically experience a larger time span to process an individual batch of data; on the other hand, they use much lower power. The device computing time per round is typically larger and is modeled here as a multiple of the data center batch time. Typical values for communication and computing are in Table I.
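Under these definitions, the carbon assessment of this section amounts to converting each energy term from Joule to kWh and scaling it by the CI of the region where it is consumed. A minimal sketch (input values are illustrative):

```python
J_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ, standard unit conversion

def carbon_footprint(energies_joule, carbon_intensities):
    """Total emissions [kgCO2-eq] for parallel lists of per-location
    energies [J] and carbon intensities [kgCO2-eq/kWh]."""
    return sum(e / J_PER_KWH * ci
               for e, ci in zip(energies_joule, carbon_intensities))
```

Because each term carries its own CI, the same energy budget yields very different emissions depending on where devices and server are located.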
In the proposed FL implementation, the set of active FL devices changes according to a round-robin scheduling; other options have been proposed in the literature. Considering typical CFL implementations, such as gossip, we let each device choose a limited number of neighbors per round. When ad-hoc mesh, or D2D, communication interfaces are not available, the energy cost to implement a generic peer-to-peer link roughly corresponds to a UL transmission from the source to the core network access point (i.e., router), followed by a DL communication from the router(s) to the destination device; the mesh energy per bit is thus the sum of the UL and DL contributions. The router can be a host or a base station. In mesh networks, further optimization via power control may also be possible, depending on the node deployment. Since devices do not need the router to relay information to the PS, which may be located in a different country, substantial energy savings are expected.
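Since a relayed peer-to-peer exchange pays the UL energy per bit plus the DL energy per bit, the resulting mesh EE combines the two efficiencies harmonically. A small sketch under this ideal-relaying assumption:

```python
def mesh_efficiency(ee_ul, ee_dl):
    """Effective mesh EE [bit/J] of a router-relayed UL+DL link.

    Energies per bit (the reciprocals of the EEs) add, so the combined
    efficiency is the harmonic combination of the UL and DL efficiencies.
    """
    return 1.0 / (1.0 / ee_ul + 1.0 / ee_dl)
```

The combined EE is always below the smaller of the two link efficiencies, which is why native mesh or D2D interfaces, when available, can only improve on the relayed case.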
IV Industry 4.0 robotized environment
According to recent estimates, in 2019 industry was responsible for a substantial share of the world's greenhouse gas emissions. To counter this impact, Industry 4.0 (I4.0) and other mitigation policies have been recently introduced. In line with the I4.0 paradigm, we resort to a common Industrial Internet of Things (IIoT) scenario where AI-based sensors and machines are interconnected and co-located in the same plant. These sensors interact within an industrial workspace where human workers are co-present. Devices are served by a WiFi (IEEE 802.11ac) network, and a router is in charge of orchestrating the mesh communication or forwarding to the data center, or PS.
IV-A Case study: scenario-dependent setup
The goal of the training task is to learn an ML model for the detection (classification) of the position of the human operators sharing the workspace, namely the human-robot distance and the direction of arrival (DOA). Further details about the robotic manipulators, the industrial environment and the deployed sensors are given in the cited works. The input data, available online, are range-azimuth maps obtained from time-division multiple-input-multiple-output (TD-MIMO) frequency-modulated continuous-wave (FMCW) radars working in the mmWave band. During the on-line workflow, position (distance, DOA) information is obtained from the trained ML model and sent to a programmable logic controller for robot safety control (e.g., emergency stop or replanning tasks). The ML model adopted for the classification of the operator location is a simplified version of the DeepMind network. It consists of several trainable layers, a compressed subset of whose parameters is encoded and exchanged during FL. The model outputs are reduced to the classes needed for the detection of the subject locations around the robot. Batch times and the size of the exchanged model parameters are reported in Table I. The Adam optimizer is used with a Huber loss. The number of devices varies across setups, and the data can be identically distributed (IID) or non-IID.
Energy and carbon footprints are influenced by the data center and device hardware configurations. The data center hardware consumption is reported in Table I: it uses a CPU (Intel i7 8700K) and a GPU (Nvidia GeForce GTX 1060). For the FL devices, we use Raspberry Pi 4 boards based on a low-power CPU (ARM Cortex-A72, SoC type BCM2711). These devices can be viewed as a realistic pool of FL learners embedded in various IIoT applications. FL is implemented on a TensorFlow backend (sample code is also available online). In what follows, rather than choosing a specific communication protocol, we follow a what-if analysis approach, and thus quantify the estimated carbon emissions under the assumption of different DL/UL communication efficiencies. Since actual emissions may be larger than the estimated ones, depending on the specific protocol overhead and implementation, we highlight relative comparisons.
IV-B Case study: carbon footprint analysis
Fig. 2 provides an estimate of the carbon footprint under varying settings, as detailed in Table I. Fig. 2(a) shows the carbon footprint for a varying number of learning rounds, comparing CL and FL over the same pool of devices. For CL (dashed line), an initial energy cost must be paid for the UL raw-data transmission, which depends on the data size and the communication EE; afterwards, the energy cost is only due to computing, unless new labelled data are produced by the devices before the learning process ends on the data center. In contrast to CL, the FL footprint depends on the communication and computing energy costs per round. CFL (cross markers) has a smaller cost per round than FL (circle markers), as no PS is required. Notice that mesh communication is here replaced by UL and DL WiFi transmissions to/from a router.
Energy and accuracy loss can be traded off to optimize efficiency. For example, CL reaches the target loss after a given number of rounds at the data center, with a corresponding carbon footprint. Model training should typically be repeated every few hours to track modifications of the robotic cell layout, which translates into an equivalent yearly carbon emission. CFL trains for more rounds and achieves a slightly larger loss, but reduces the emissions considerably. Finally, FL achieves a similar footprint; however, this comes in exchange for a larger validation loss due to the communication with the PS. Although not considered here, tuning the model, as well as changing the aggregation strategy at the PS, would reduce the training time and thus the emissions.
The end-to-end energy cost is investigated in Figs. 2(b) and 2(c). The energy vs. loss trade-off is first analyzed in Fig. 2(b). We consider setups where the data center and the devices are placed in different geographical areas featuring different carbon intensities (CIs). In particular, the first scenario (max EU, red) is characterized by devices located in a region with a high emission rate, corresponding to the maximum in the EU. In the second (IT, black) and third (FI, blue) scenarios, the devices and data center are located in Italy and Finland, respectively. When the availability of green energy is small (i.e., the max EU scenario), the learning loss and accuracy must be traded against carbon emissions. For example, for an amount of emissions equal to, or lower than, CL, the learning loss of CFL must be increased, at the cost of a lower average accuracy; for FL, the loss penalty is even larger. For smaller carbon intensities, i.e., the IT and FI scenarios, the cost per round decreases; therefore, FL can train for all the required rounds and experience the same loss as CL, with considerable emission savings, largest for Finland. A promising roadmap for FL optimization is to let local learners contribute to the training process if, or when, green energy, namely a small CI, is made available.
In Fig. 2(c) we quantify the carbon emissions of CL, FL and CFL for varying communication EE and number of devices. An increase of the network size, or a decrease of the communication efficiency, causes communication to emit much more CO2 than training. Since FL is more communication-efficient, as only (compressed) model parameters are exchanged, its best operational condition is under limited communication regimes. For the considered scenario, there exists a break-even EE below which FL, or CFL, leaves a smaller carbon footprint than CL, with the threshold depending on the policy and the number of devices. Finally, notice that in all cases FL can efficiently operate at the efficiencies typically observed in low-power communications and 4G/5G NB-IoT.
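The break-even behaviour discussed above can be illustrated with a toy what-if model; all numbers and the cost split are illustrative assumptions. CL pays a one-off raw-data upload, while FL pays per-round UL/DL model exchanges, so FL wins when the communication EE is low and the raw data is large relative to the model:

```python
def footprint_vs_ee(ee_bit_per_joule, raw_bits, model_bits, rounds,
                    e_comp_cl, e_comp_fl):
    """Return (CL, FL) total energy [J] as a function of the communication EE."""
    # CL: fixed computing cost plus a one-off raw-data upload.
    cl = e_comp_cl + raw_bits / ee_bit_per_joule
    # FL: fixed computing cost plus per-round UL+DL model exchanges.
    fl = e_comp_fl + rounds * 2 * model_bits / ee_bit_per_joule
    return cl, fl
```

Sweeping `ee_bit_per_joule` reproduces the qualitative crossing of Fig. 2(c): at low EE the raw-data upload dominates and FL is cheaper, while at high EE the communication terms vanish and CL's smaller computing cost prevails.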
Table III compares the energy and carbon footprints for IID and non-IID data distributions. The computing and communication energy costs, and the corresponding carbon emissions for different target losses, are evaluated with respect to the max EU scenario. Considering FL and CFL, the federated computations are now distributed across the devices, and therefore larger computing costs are needed. Non-IID data generally penalizes both FL and CFL, as the energy consumption increases in some cases. For example, while CFL with IID data limits the number of rounds required for the target loss, it is less effective for non-IID distributions, as the required rounds increase for some devices. CFL and FL thus experience an increase in energy costs, but CFL still produces lower carbon emissions. More advanced gradient-based CFL methods might be considered when the data distributions across devices are extremely unbalanced.
V Conclusions
This work developed a framework for the analysis of the energy and carbon footprints of distributed and federated learning (FL). It provides, for the first time, a trade-off analysis between vanilla FL and consensus-based FL on local datasets, and centralized learning inside the data center. A simulation framework has been developed for performance analysis over arbitrarily complex wireless network structures. Carbon-equivalent emissions are quantified and discussed for a continual industrial workflow monitoring application that tracks the movements of workers inside human-robot shared workspaces, where the ML model is periodically (re)trained to track changes in the data distributions. In many cases, energy and accuracy should be traded to optimize FL energy efficiency. Furthermore, eliminating the parameter server, as made possible by emerging decentralized FL architectures, is a viable solution to further reduce the energy footprint. Novel opportunities for energy-aware optimization are also highlighted; these target the migration of on-device computations to where the availability of green energy is larger. Finally, FL requires a frequent and intensive use of the communication interfaces, which mandates a co-design of the federation policy and the communication architecture, rooted in the novel 6G paradigms.
-  M. Dayarathna, et al., “Data Center Energy Consumption Modeling: A Survey,” IEEE Communications Surveys & Tutorials, vol. 18, no. 1, pp. 732-794, First quarter 2016.
-  S. Savazzi et al. “Opportunities of Federated Learning in Connected, Cooperative and Automated Industrial Systems,” IEEE Communications Magazine, vol. 52, no. 2, February 2021. [Online]. Available: https://arxiv.org/abs/2101.03367.
-  X. Qiu, et al., “Can Federated Learning Save the Planet?” NeurIPS - Tackling Climate Change with Machine Learning, Dec. 2020, Vancouver, Canada. [Online]. Available: https://arxiv.org/abs/2010.06537
-  J. Konečný, et al., “Federated optimization: Distributed machine learning for on-device intelligence,” CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1610.02527.
-  P. Kairouz, et al., “Advances and open problems in federated learning,” [Online]. Available: https://arxiv.org/abs/1912.04977.
-  M. Blot, et al., “Gossip training for deep learning,” 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016. [Online]. Available: https://arxiv.org/abs/1611.09726.
-  S. Savazzi, et al. “Federated Learning with Cooperating Devices: A Consensus Approach for Massive IoT Networks,” IEEE Internet of Things Journal, vol. 7, no. 5, pp. 4641-4654, May 2020.
-  Z. Chen, et al., “Consensus-Based Distributed Computation of Link-Based Network Metrics,” IEEE Signal Processing Letters, vol. 28, pp. 249-253, 2021.
-  V. Mnih, K. Kavukcuoglu, D. Silver, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, 2015.
-  E. Masanet, et al., “Characteristics of low-carbon data centres,” Nature Climate Change, vol. 3, no. 7, pp. 627–630, 2013.
-  A. Capozzoli, et al. “Cooling systems in data centers: state of art and emerging technologies,” Energy Procedia, vol. 83, pp. 484–493, 2015.
-  H. Xing, et al., “Decentralized Federated Learning via SGD over Wireless D2D Networks,” Proc. IEEE 21st Int. Workshop on Signal Processing Advances in Wireless Comm. (SPAWC), Atlanta, GA, USA, pp. 1–5, 2020.
-  N. Shlezinger, et al. “Federated Learning with Quantization Constraints,” Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 8851–8855, 2020.
-  A. Elgabli, et al., “GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning,” Journal of Machine Learning Research, vol. 21, no. 76, pp. 1–39, 2020.
-  L.F.W. Anthony, et al., “Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models,” Proc. of ICML Workshop on Challenges in Deploying and Monitoring Machine Learning Systems, 2020.
-  European Environment Agency, Data and maps: “Greenhouse gas emission intensity of electricity generation,” Dec. 2020. [Online]. Available: https://tinyurl.com/36l5v5ht
-  E. Björnson and E. G. Larsson, “How Energy-Efficient Can a Wireless Communication System Become?,” Proc. 52nd Asilomar Conf. on Sig., Syst., and Comp., Pacific Grove, CA, USA, 2018, pp. 1252–1256.
-  S. Savazzi, et al. “A Joint Decentralized Federated Learning and Communications Framework for Industrial Networks,” Proc. of IEEE CAMAD, Pisa, Italy, pp. 1–7, 2020.
-  M. M. Amiri, et al., “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, May 2020.
-  J. G. J. Olivier, et al., “Trend in global CO2 and total greenhouse gas emissions,” 2020 Report, PBL Netherland Environmental Assessment Agency, Dec. 2020. [Online]. Available: https://tinyurl.com/xzz7btj6
-  H. Fekete, et al. “A review of successful climate change mitigation policies in major emitting economies and the potential of global replication,” Renewable and Sustainable Energy Reviews, vol. 137, art. 110602, 2021.
-  S. Kianoush, et al., “A Multisensory Edge-Cloud Platform for Opportunistic Radio Sensing in Cobot Environments,” IEEE Internet of Things Journal, vol. 8, no. 2, pp. 1154–1168, 2021.
-  The Power Consumption Database: WiFi routers. [Online]. Available: http://www.tpcdb.com/. Accessed: 08/03/2021.
-  Dataset: “Federated Learning: mmWave MIMO radar dataset for testing,” IEEE Dataport, 2020. [Online]. Available: http://dx.doi.org/10.21227/0wmc-hq36. Accessed: Mar. 2021.
-  X. Vilajosana, et al., “IETF 6TiSCH: A Tutorial,” IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 595–615, Firstquarter 2020.
-  S. Zhang, et al., “Energy efficiency for NPUSCH in NB-IoT with guard band,” ZTE Commun., vol. 16, no. 4, pp. 46–51, Dec. 2018.