1 Introduction
Over the last years, the power system has undergone considerable changes, and this will continue over the upcoming years [cahllenges_power_system]. There are three main trends: decarbonization, decentralization, and digitalization [3_ds]. However, the core challenge of the power system is still to balance supply and demand (load) of electricity at every single point in time, a task made harder by the power volatility of variable energy resources (VERs). The growing share of VERs in the power system adds to the balancing challenge. Forecasting is necessary to ensure balance, maintain quality, secure electricity supply, and operate the power system at lower costs. Amongst various types of forecasting, load forecasting is crucial for energy actors, such as market agents, because it allows for a better understanding of consumption and pricing patterns [forecasting_importance]. Within load forecasting, short-term load forecasting (STLF) focuses on the load side with a time window from a few minutes or hours to one day ahead or a week [etnso_forecasting_def]. STLF is vital for many operational processes in the power system, such as planning, operating, and scheduling [inbook]. Residential STLF aims to forecast electrical household consumption [kWh] and to assist, for example, market agents in tackling energy deviations. These energy deviations impact the energy price, which, in turn, has a direct impact on the electricity costs that customers face. Therefore, the importance of load forecasting and its categories increases with demand, not only from an economic point of view but, as previously stated, also from an operational point of view.
According to data provided by the International Energy Agency (IEA) [iea_consumption], electricity consumption increased by 47% worldwide between 2010 and 2018. In line with the IEA data, residential electricity consumption follows the global upward trend. The European Union (EU) is an exception, as residential electricity consumption remained stable during the same period due to efficient policies. Still, electricity consumption in the EU could increase as a result of electrification processes [electrification]. For example, the electrification of private transportation (e.g. electric vehicles) will further affect electricity consumption patterns. Additionally, residential electricity consumption has recently increased due to the COVID-19 pandemic and the consequent adoption of remote work [en14040980]. While COVID-19 is a historical exception, this general shift towards increased residential electricity consumption might not be an anomaly in the future. Thus, residential STLF in particular is expected to gain importance in the overall power system.
There are traditional methods for STLF, but these methods build on limiting assumptions. Early techniques use statistical time series models relying on seasonal autoregressive integrated moving average (ARIMA) [kaur2017timeARIMA], exponential smoothing for double seasonality, or linear transfer functions. These techniques fall short because they assume linearity of the data. Hence, there is a need for models capable of coping with nonlinear dependencies, an area where Artificial Intelligence (AI) methods are gaining momentum.
For STLF, the authors in [inbook, s20051399, forecasting_approaches, Electric_load_forecasting, 910780, ardabili2019advances] successfully deployed AI methods such as centralized machine learning (ML), deep learning (DL) and hybrid models (HM). For example, ML techniques include, but are not limited to, Support Vector Regression [CHEN2017659SVR], Random Forest regressors [bogomolov2016energyRFR], and boosting algorithms. Likewise, DL techniques rely on various architectures used when creating Neural Networks (NN). For example, the primary use of Convolutional Neural Networks (CNN) is spatial recognition, owing to the similarity between their pattern extraction and that of the human retina. Areas such as time-series analysis, natural language processing, and speech recognition use Recurrent Neural Networks (RNN) due to their capacity to remember previous data patterns. However, the challenges of STLF are not exclusively computer-vision or pattern-based issues. Finding models that can deal with both spatial and sequential problems requires more complex approaches, which offer more accurate forecasts at a higher computational cost.
Whether statistical or AI-based, forecasting techniques need data, and data scarcity drastically reduces their accuracy [problem_data_forecasting]. Residential STLF is no exception, and digitalization already offers a way to reduce data scarcity: the push for advanced metering infrastructure (AMI) through smart meters increases the frequency and granularity of data collection [smart_meters_survey]. As a result, STLF can utilize the available data aggregated from smart meters.
However, the aggregation of smart meter data faces a twofold problem. First, using smart meter data raises considerable privacy challenges due to the sensitivity and correlatability of granular data. Data collected from smart meters installed in residences are granular enough that one can extract an individual customer's behaviour [hinterstocker2017disaggregation]. Second, the transfer and aggregation of smart meter data is challenging under current regulatory regimes such as the EU's General Data Protection Regulation (GDPR), the framework introducing a set of guidelines for the collection and processing of personal information from European citizens.
Despite the virtues of smart meters for forecasting techniques, their usability is limited by restrictions on data use [smart_meters] and device ownership [smart_meter_ownership]. As a result of these two constraints, STLF systems have limited access to smart meter data.
Nonetheless, there are several approaches to tackle data scarcity. First, it is possible to solve data scarcity by simply scaling the number of devices that collect data. Yet, scaling the number of devices is not a trivial endeavour, as technical factors (reliability, computational resources and manageability) [Potenciano_Menci_2020] and economic factors [Cossent2020] play an important role. Second, another solution is to combine data using either centralized or decentralized collaborative approaches. Centralized collaborative approaches such as Belgium's Atrias [atrias_2021] or Norway's Elhub [elhub_2021] provide so-called data lakes. However, as previously mentioned, centralized collaborative approaches are not possible in every market and jurisdiction.
Decentralized approaches tackle the previously stated issues by connecting different, distributed entities rather than creating a central data pool such as a data lake. A recent collaborative approach for forecasting is Federated Learning (FL) [mcmahan2017communicationefficient, konecny2016federated]. It offers a collaboration framework to share prediction models instead of raw data. However, FL as a standalone is not de facto private. While FL partially addresses the data scarcity, data ownership, and device ownership issues, it does not offer a viable solution to privacy concerns. For example, AI scholars proved that it is possible to reconstruct the original raw data out of the resulting models, both in the context of DL [zhu2020deep] and FL [geiping2020inverting]. Therefore, the standalone implementation of FL must be extended to ensure full privacy of the connected entities, using additional privacy-preserving techniques such as differential privacy (DP) and secure aggregation (SecAgg) when, respectively, computing and communicating model updates or gradient updates.
Residential STLF can benefit from the decentralized collaborative approach offered by FL, extended by privacy-preserving techniques, as it would overcome the previously exposed issues. In the literature, both DP and FL were tested in isolation for STLF [dp_smart_meter, dp_imperial_colleague, BARBOSA2016355, dp_sm_guido, briggs2021federated, taik2020electrical]. However, the combined use of FL with privacy-preserving techniques for residential STLF using smart meter data has not been tested. Therefore, in this paper, we investigate the combined use of FL with privacy-preserving techniques for residential STLF using smart meter data. We provide a holistic view of the STLF challenge and analyze whether state-of-the-art STLF methods applied under distributed conditions behave on par with existing solutions. The following research questions guide our investigation:

How do centrally trained STLF neural network models from the literature perform in a decentralized approach such as FL?

Does the inclusion of privacy preserving techniques imply a substantial drop in the residential STLF forecasting accuracy?

What are the main constraints for secure and non-secure FL applied to residential STLF?
This paper is structured as follows. Section 2 provides the related work of our main conceptual pillars; that is, federated learning techniques, privacy-preserving techniques, and forecasting techniques. There, we further review several applications of centralized DL techniques for STLF. Section 3 describes our secure federated model used for residential STLF. The structure of this section is based on the description of (1) the dataset, (2) the architecture, (3) the deep learning technique, (4) the metrics used for evaluation and, finally, (5) the model operation. In Section 4, we simulate and evaluate the proposed forecasting system by means of five scenarios. These scenarios cover classic FL (no privacy techniques), classic FL with correlated data, a different DL architecture, and secure FL through the implementation of two privacy-preserving techniques. Finally, Section 5 provides the conclusion and the potential future research directions the authors consider.
2 Related work
2.1 Federated Learning
In most fields, AI has already proven its value, though the performance of models is highly dependent on the quantity and quality of data. Generally speaking, the design of high-performing AI is hindered by problems related to data fragmentation and isolation, mostly due to competitive pressure and tight regulatory frameworks (related to data privacy and security). The authors in [mcmahan2017communicationefficient, konecny2016federated] proposed a fundamentally new method, FL. The main idea of FL is to allow the training of ML models between multiple disconnected entities without physically moving raw data or explicitly exposing local raw data to each other in any way. In other words, FL allows competing entities (e.g. companies) to leverage each other's datasets without revealing their individual dataset. In doing so, models trained with FL have the potential to obtain a more accurate forecasting output than when each entity independently trains a model. To date, there are two different training approaches and three different configurations for the distribution of data and errors.
The two main approaches to train FL models are federated stochastic gradient descent (FedSGD) and federated averaging (FedAvg) [mcmahan2017communicationefficient]. Although both rely on similar functionalities, there are differences between the two approaches.
FedSGD works by averaging the clients' gradients (direction of learning) at every step in the learning phase. A client can be thought of as one disconnected entity within FL. More specifically, clients locally compute the gradients of their loss (a measure of the difference between prediction and ground truth). The clients subsequently send their locally computed gradients to a central server. The central server aggregates the locally computed gradients by applying a (weighted) average over the clients' updates.
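The FedSGD step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the linear model, the learning rate, the synthetic client data, and the function names `fedsgd_step` and `local_gradient` are all hypothetical, and the weighting by local dataset size is one common choice for the weighted average.

```python
import numpy as np

def fedsgd_step(weights, client_grads, client_sizes, lr=0.1):
    """One FedSGD step: weighted-average the client gradients, then descend."""
    sizes = np.asarray(client_sizes, dtype=float)
    grads = np.stack(client_grads)                      # (n_clients, n_params)
    avg_grad = (sizes[:, None] * grads).sum(axis=0) / sizes.sum()
    return weights - lr * avg_grad

def local_gradient(weights, X, y):
    """Gradient of the mean squared error of a linear model y ~ X @ weights."""
    residual = X @ weights - y
    return 2.0 * X.T @ residual / len(y)

# Hypothetical clients: each holds private samples of the same linear relation.
rng = np.random.default_rng(0)
true_w = np.array([1.5, -0.5])
clients = []
for n in (40, 60, 100):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(200):
    # Only gradients leave the clients; the raw (X, y) pairs stay local.
    grads = [local_gradient(w, X, y) for X, y in clients]
    w = fedsgd_step(w, grads, [len(y) for _, y in clients])
```

In this toy setting the server communicates with every client at every gradient step, which is the communication cost that FedAvg reduces.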
In FedAvg, the central server averages the models' updates when all the clients have finished computing their local models. In other words, FedAvg modifies the FedSGD algorithm by letting each client compute its own model weights (based on its own gradients). Each client trains its own model in parallel. In doing so, the impact is twofold: there is a reduction in the number of communication rounds (per batch in FedSGD versus per epoch in FedAvg) and an improvement in forecasting performance [mcmahan2017communicationefficient]. Each client uses the received model weights as its base model for the next iteration round. This is repeated until the end of the prescribed rounds.
As mentioned above, FedSGD and FedAvg are two approaches to FL, but there are also multiple configurations. The configurations depend on how the feature space $\mathcal{X}$, the label space $\mathcal{Y}$, and the space formed by the identifiers $\mathcal{I}$ are distributed. Different setups of the triplet ($\mathcal{X}$, $\mathcal{Y}$, $\mathcal{I}$) can be classified as Horizontal, Vertical, and Assisted Federated Learning [yang2019federated]. Take for instance two clients $i$ and $j$.
Horizontal Federated Learning is when $i$ and $j$ share feature space such that $\mathcal{X}_i = \mathcal{X}_j$ but label space is different such that $\mathcal{Y}_i \neq \mathcal{Y}_j$.

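Vertical Federated Learning is when the two clients share the identifier space, $\mathcal{I}_i = \mathcal{I}_j$, but their feature spaces differ, $\mathcal{X}_i \neq \mathcal{X}_j$.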

Assisted Learning (AL) is done through collided data between clients. The authors in [xian2020assisted] define collision as clients holding the same data entries of a dataset but differing in feature space. AL leverages the sharing of error terms, as clients share errors with each other. One client may use the errors of another to increase its own training performance.
Regardless of the approach and configuration, FL is prone to moral hazard issues [DBLP:journals/corr/abs191204977] or so-called 'soft' attacks on the contextual integrity of the data shared between federated clients. The moral hazard issue arises because FL is by nature collaborative [DBLP:journals/corr/McMahanMRA16]. Multiple clients come together to train models iteratively using the data at their disposal. Therefore, the involved clients need to trust each other's data and behaviour to train and use the final model.
Furthermore, clients do not exchange raw data with each other; instead, they exchange information inferred from the raw data. FL as a standalone does not guarantee data privacy because of this information exchange between clients. In [zhu2019deep], researchers found a way to use gradient updates to retrieve the original raw data of a client. Such possibilities stand in conflict with requirements such as, for instance, the European Union's GDPR. Therefore, FL requires additional complementary adjustments when applied in a realistic environment where data privacy of clients is sought after.
There is a limited amount of literature which directly addresses the implementation of FL for STLF. For example, the authors in [briggs2021federated] measured the performance of an FL model under clustering. The reported results are on average 10% better than those of centralized learning techniques. Building on their work, the authors in [9469923] applied the k-means algorithm to group users according to socioeconomic factors. This results in an improvement compared with the default separation through ACORN. The authors in [taik2020electrical] followed a similar approach to demonstrate the application of FL over a dataset of over 200 households. The testing phase used a set of four scenarios in which the authors presented the utility of FL applied to STLF with handcrafted models. The number of clients analyzed in their scenarios ranges from 5 to 20, and the number of local epochs from 1 to 5. The authors in [FEKRI2021107669] applied FL to a total of 5 smart meters, comparing the performance of models trained using two FL techniques and for two different time resolutions. While the authors in [briggs2021federated, taik2020electrical, 9469923, FEKRI2021107669] use FL, they do not use any privacy techniques in their papers.
2.2 Privacy Preserving Techniques
Over the last years, the increase of data usage has led to new techniques aiming to extract every last drop out of "statistical information" [dwork2014algorithmic]. This form of data extraction is often at odds with the subject's privacy. However, the privacy-preserving techniques which extend FL cover not only the privacy of statistical information but also the means of communicating it. Hence, within this section we describe DP as a way to protect an individual's information and SecAgg as a mechanism to protect the communication channel and the exchange of updates among clients and a central server.
The seminal work of [Dwork06] introduced differential privacy (DP) as a new method to solve adversarial attacks without auxiliary information. As described in [dwork2014algorithmic], "differential privacy addresses the paradox of knowing nothing about an individual while learning useful information about a population." In other words, DP hides individual data trends using noise. Notably, [Dwork06] proposes the concept of $\epsilon$-differential privacy ($\epsilon$-DP), as follows: "For every pair of inputs $x$ and $y$ that differ in one row, for every output in S, an adversary should not be able to use the output in S to distinguish between any $x$ and $y$".
The privacy budget ($\epsilon$) determines how much of an individual's privacy a query may use, or to what extent it may increase the risk of breaching an individual's privacy. For instance, a value of $\epsilon = 0$ reflects perfect privacy, which means any analysis done will not affect an individual's privacy at all [DP_non_tech]. The authors in [jayaraman2019evaluating] extended the concept of $\epsilon$-DP to ($\epsilon$, $\delta$)-DP, where $\delta$ is the failure probability.
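For reference, the verbal definition above corresponds to the standard formal statement found in [dwork2014algorithmic]. The symbols are assumptions of this note rather than notation fixed by the text: $\mathcal{M}$ denotes the randomized mechanism answering the query, $x$ and $y$ are two inputs that differ in one row, and $S$ is any set of possible outputs.

```latex
\Pr\left[\mathcal{M}(x) \in S\right] \;\le\; e^{\epsilon}\,\Pr\left[\mathcal{M}(y) \in S\right] + \delta
\qquad \text{for all } S \subseteq \mathrm{Range}(\mathcal{M})
```

Setting $\delta = 0$ recovers pure $\epsilon$-DP, and $\epsilon = 0$ with $\delta = 0$ forces the two output distributions to coincide, which is the perfect-privacy case mentioned above.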
DP is accomplished by adding random noise to the answer of a data query so that the presence or absence of a single entry is obscured. Laplacian and Gaussian noise are the base methods for this approach [dwork2014algorithmic]. Finding an adequate trade-off between the noise and the utility of the resulting model is crucial.
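As an illustration of noise addition, the sketch below applies the Laplace mechanism to a sum query, with noise scale equal to the query sensitivity divided by $\epsilon$. The readings, the assumed cap of 5 kWh per reading, and the function name `laplace_mechanism` are hypothetical and only serve to show the noise versus utility trade-off.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(42)

# Hypothetical hourly readings [kWh], one per household, assumed to lie in [0, 5].
readings = np.array([0.4, 1.2, 0.9, 3.1, 0.7])
upper_bound = 5.0                  # assumed cap on a single reading
true_sum = readings.sum()

# Adding or removing one household changes the sum by at most upper_bound,
# so upper_bound is the sensitivity of the sum query.
noisy_sums = {
    eps: laplace_mechanism(true_sum, upper_bound, eps, rng)
    for eps in (0.1, 1.0, 10.0)
}
```

A smaller $\epsilon$ means a larger noise scale: stronger privacy, but a less useful answer.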
SecAgg uses a secure communication channel and cryptographic primitives to perform secure training without the need for additive noise [SecAgg]. SecAgg is a secure multiparty computation (SMPC) protocol which computes the clients' averages without revealing their data. The protocol allows a set of distributed, unknown clients to aggregate a value without revealing the value to the rest of the participants. The backbone of SecAgg is Shamir's t-out-of-n Secret Sharing, which enables a user to split a secret $s$ into $n$ shares [shamir1979share]. At least $t$ shares are needed to reconstruct the original secret $s$. Any allocation with fewer than $t$ shares provides no information about the original secret.
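The t-out-of-n scheme can be sketched as polynomial evaluation and Lagrange interpolation over a prime field. This is a minimal illustration, not the SecAgg implementation: the field size, the function names `split_secret` and `reconstruct`, and the use of Python's `random` module (which is not cryptographically secure) are assumptions made for the demo.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; the field size is an assumption of this sketch

def split_secret(secret, n, t, rng):
    """Split `secret` into n shares so that any t of them reconstruct it."""
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):          # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

rng = random.Random(7)   # demo only; a real deployment needs a secure RNG
shares = split_secret(secret=123456789, n=5, t=3, rng=rng)
recovered = reconstruct(shares[:3])
```

Any three of the five shares recover the secret, while interpolating from only two shares yields an unrelated field element.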
Even though SecAgg provides a secure setting for the training of models, the models and their latent patterns could still point towards the original data owner. More specifically, Model Inversion (MI) attacks aim to reconstruct the original training data from the model parameters [fredrikson2015model]. Under a SecAgg setting, these parameters are not secure against an MI attack.
As stated in the introduction, privacy-preserving techniques in STLF help forecasting systems to comply with current, country-specific legislation. From the energy business perspective, privacy-preserving techniques enable competitive agents such as energy vendors to cooperate and integrate with system operators such as distribution system operators (DSOs). Furthermore, the use of privacy-preserving techniques might facilitate the proliferation of local markets that ultimately help the energy transition.
Specifically, in the context of hiding smart metering data, different privacy-preserving methods such as DP have been used. One general approach is to add Gaussian noise to each smart meter's data to prevent adversarial attacks [wang2012randomized]. A similar approach adds adaptive noise to the Battery-based Load Hiding (BLH) values [6847974]. Another approach to the problem of hiding smart metering data is to look at spatially aggregated profiles from smart meters. The authors in [eibl2017differential] apply DP by adding Laplacian noise to the aggregated dataset, which allows them to empirically show the privacy and utility of the results on a real dataset.
Within the literature of STLF [dp_imperial_colleague, eibl2017differential, 6847974], authors consistently perturb the datasets by adding noise drawn from either a Gaussian or Laplacian distribution. Even though the referenced papers use this technique, we could not find any mathematical proof that ensures privacy. Hence, in this paper, we follow the approach proposed by [mcmahan2018] and the moments accountant by [mcmahan2017communicationefficient].
2.3 Forecasting Techniques in STLF
In this section, we review different DL architectures (Table 1) taken from STLF literature to further select a suitable base model for our FL model.
Table 1: DL architectures taken from the STLF literature.

| Method | Dataset | Deep Learning Architecture | Year |
| --- | --- | --- | --- |
| Marino et al. [marino2016building] | UCI - Individual household electric power consumption | Seq2Seq Encoder-Decoder 2x LSTM | 2016 |
| Kong et al. [8039509] | Australia SGDS Smart Grid Dataset | Stacked LSTM + FCL (1) | 2017 |
| Li et al. [en10101525] | Fremont, CA 15min Retail building electricity load | Missing architecture description | 2017 |
| Shi et al. [7885096] | Irish CBTs - Residential and SMEs | Stacked LSTM + Pooling mechanism | 2018 |
| Yan et al. [en11113089] | UK-DALE Domestic Appliance-Level Electricity dataset | 2x Conv + 1x LSTM + FCL | 2018 |
| Kim and Cho [en12040739] | UCI - Individual household electric power consumption | Missing architecture description | 2019 |
| Kim and Cho [kim2019predicting] | UCI - Individual household electric power consumption | 2x Conv + LSTM + 2x FCL | 2019 |
| Le et al. [app9204237] | UCI - Individual household electric power consumption | 2x Conv + Bi + LSTM + 2x FCL | 2019 |
| Khan et al. [s20051399] | UCI - Individual household electric power consumption | 2x Conv + 2x LSTM (Encoder) + 2x LSTM (Decoder) + 2x FCL | 2020 |

(1) Fully connected layer: layer where all the neurons are connected to the previous layer's neurons.
Analyzing the collected literature, we observed two main trends. First, we recognized the repeated use of the UCI dataset [UCI]. We understand its use, as it is large enough to cover the variability of real-world scenarios. Second, we detected an extensive usage of long short-term memory (LSTM) layers. LSTMs [hochreiter1997long] are RNNs that, contrary to feedforward neural networks (FFNN), have feedback connections. These connections give LSTMs the ability to capture the dependencies between items in a sequence. LSTMs also differ from plain RNNs through the addition of the Constant Error Carousel (CEC), which copes with the vanishing gradient problem [hochreiter1998vanishing] by adding three new gates (input, output, and forget gate). In this context, [marino2016building] used a Sequence-to-Sequence (Seq2Seq) architecture, which means the model takes a sequence (a vector) as input and maps it to another sequence. This primarily reduces the importance of outliers, because the encoder compresses the original input space into the encoded vector [sutskever2014sequence, cho2014properties]. Seq2Seq is used mainly in applications like language translation due to the intrinsic continuous nature of the data.
In [s20051399, en12040739, kim2019predicting, app9204237] we can see an increase in the depth of the neural networks through the years. Different approaches have tried to cope with the STLF problems (image- or pattern-based recognition).
CNNs are known to perform well with spatial patterns, while LSTMs excel at finding temporal ones. The combination of CNN and LSTM, as described in [s20051399], displays the potential performance increase for STLF. However, the authors in [7885096] took a different path. Besides adding more layers until reaching the desired performance, they also cluster and pool the data to increase variability, which reduces the impact of a single element.
3 Secure Federated Model Design
This section defines our data, metrics, model parameters, and how our FL model operates. This FL model predicts the next hour's consumption [kWh] based on the consumption data of the last 12 h. This model uses a standardized dataset. Beforehand, we evaluate selected literature models from Table 1 on the same dataset to select the best-performing one based on the devised statistical metrics. The overall objective of the models' evaluation is to explore their performance in a decentralized environment, as their authors implemented them in a centralized manner.
3.1 Dataset
This analysis uses two main datasets. The first dataset is taken from [uk_data]. Its data were collected in the Low Carbon London project, led by UK Power Networks, between November 2011 and February 2014 in London, United Kingdom. It contains the electrical consumption [kWh] of 5567 households at half-hourly resolution. This dataset also contains the ACORN (A Classification Of Residential Neighbourhoods) classification [acorn]. The dataset is divided into individual household entries known as LCLid (Low Carbon London id). The second dataset is composed of daily and weekly weather profiles from the Greater London area. Consequently, all customers have the same weather profile, although their location might differ within the Greater London area.
The pipeline used for dataset treatment consists of 3 main steps. First, we modify the time window of our data. Initially, the data has half-hourly timestamps, and we resample it to hourly data. This modification reduces the computational burden of our analysis. Each hourly value is the sum of two subsequent half-hours. Because of the short sampling interval, sensors might fail at certain measurements, so abnormal or null values are to be expected. During this first step of the pipeline, we trimmed these values. Afterwards, we rescaled all variables to the same range using a standard scaler, as in [s20051399]. Finally, we combine the datasets to create our own standard dataset for the analysis. In Figure 1 we provide an example of the pipeline process result. It represents the electricity consumption [kWh] of 5 randomly selected LCLid for a 2-day period using 1 h timestamps.
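The first two pipeline steps, together with the 12 h input window of the forecasting model, can be sketched with NumPy. This is a simplified illustration under assumptions, not the pipeline code of this paper: the synthetic readings and the helper names `to_hourly`, `standardize` and `make_windows` are hypothetical, missing readings are represented as NaN, and dropping incomplete hours is one possible reading of "trimming".

```python
import numpy as np

def to_hourly(half_hourly):
    """Sum consecutive half-hour readings [kWh]; hours with a missing half are dropped."""
    pairs = np.asarray(half_hourly, dtype=float)
    pairs = pairs[: len(pairs) // 2 * 2].reshape(-1, 2)
    hourly = pairs.sum(axis=1)              # NaN if either half-hour is NaN
    return hourly[~np.isnan(hourly)]

def standardize(values):
    """Standard scaling: zero mean, unit standard deviation."""
    mean, std = values.mean(), values.std()
    return (values - mean) / std, mean, std

def make_windows(series, lookback=12):
    """Pair each run of `lookback` hours with the hour that follows it."""
    X = np.stack([series[i : i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

# Hypothetical half-hourly readings for one household (3 days) with one gap.
rng = np.random.default_rng(1)
raw = rng.uniform(0.05, 1.0, size=48 * 3)
raw[10] = np.nan

hourly = to_hourly(raw)
scaled, mean, std = standardize(hourly)
X, y = make_windows(scaled, lookback=12)
```

Keeping `mean` and `std` allows forecasts to be mapped back to kWh. Note that dropping an hour shortens the series by one step, so a production pipeline would likely handle gaps per window rather than globally.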
3.2 Metrics
Metrics are an important part of the development and testing of forecasting models. Indeed, some metrics could give a misleading picture of the performance of the model. FL models are known to converge to a middle point [Li_2020]. AI models work by optimizing the error of prediction with respect to reality. In a mixed environment where there are many realities, the models tend to minimize the mean of the loss across datasets. This tendency could lead an FL model to predict the average of each of the datasets and hence to report promising but spurious values of the mean squared error (MSE), Equation 1, and the mean absolute error (MAE), Equation 2.
Given this, in this paper, we would like to analyze not only MSE or MAE but also the mean absolute percentage error (MAPE), Equation 4 and root mean square error (RMSE), Equation 3. This effort is to increase objectivity of the results. The formal equations are as follows:
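$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{1}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{3}$$

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{4}$$

where $y_i$ is the measured value, $\hat{y}_i$ the forecast, and $n$ the number of samples.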
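The four metrics can be computed together as in the following sketch, which uses their standard definitions. The function name `forecast_metrics` and the toy values are hypothetical, and MAPE is expressed in percent.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and MAPE (in percent) for two equal-length series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(mse),
        # MAPE divides by the true value, so it is undefined when y_true contains zeros.
        "MAPE": 100.0 * np.mean(np.abs(err / y_true)),
    }

metrics = forecast_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```

Because MAPE is relative to the true value, it is one way to expose a model that merely predicts the average consumption of low-consuming households, even when its MSE and MAE look promising.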

3.3 Evaluation and Selection of Centralized Deep Learning Models
The base of any AI model is the underlying learning method. Thus, we evaluate the performance of the formerly reviewed methods (Table 1) to compare them using the metrics previously stated in subsection 3.2. The comparison allows for the later selection of the model our scenarios will use. Each evaluation uses the dataset explained in subsection 3.1. The models (implemented following their corresponding articles or the code the authors have provided) were trained over 100 epochs.
Accordingly, Figure 2 illustrates the performance results of the models tested. Some models behave worse on our dataset than what was claimed by the authors. In [kim2019predicting, app9204237], the authors calculated the metrics on a non-scaled dataset. This matters because transforming the dataset before or after computing a metric can bias the results. For instance, standard scaling divides every error by the standard deviation $\sigma$ of the dataset prior to scaling, so the scaled MSE is equal to the non-scaled MSE multiplied by $1/\sigma^2$. In other words, all calculated metrics have to be either scaled or not to offer a fair comparison. A corresponding factor appears when measuring RMSE. Consequently, from now on we calculate all the displayed metrics with scaled data to standardize our analysis. The results from [marino2016building, s20051399] are similar, although the two models differ remarkably in their number of network parameters. Training FL models is an intricate and expensive endeavour. From now on, in this paper, we will use the model proposed by [marino2016building] as the foundation model for our scenarios.
3.4 Our model parameters: Advancing A Secure, Decentralized Architecture for Federated Learning
Once we have selected a model, subsection 3.3, we need to (1) select the implementation of privacypreserving techniques; (2) select the FL technique and (3) define our model parameters for the secure FL model. We mainly follow the steps provided by the authors in [mcmahan2018].
For the implementation of DP as an extension of our FL model, we chose to follow the steps provided by [mcmahan2018] rather than those in [dp_imperial_colleague, lu2019blockchain], where the authors add noise to the training dataset prior to training. In our DP implementation, we avoid modifying the entire original dataset and only obfuscate models per query. Additionally, we consider SecAgg following [SecAgg]. In our model, we choose to implement the FedAvg technique mainly for the reduced number of communication rounds needed. However, the later results are generalizable to FedSGD because both operate in the same way.
The nature of FedAvg dictates that the central server will average the different client model updates at the end of a certain number of rounds, normally one. This average defines an estimator, which bounds the function's sensitivity [dwork2014algorithmic]. Concerning the estimator, $\tilde{f}$, we consider it as the average of a set of clients' updates $\Delta_k$. It implies, in our case, that all clients in the federation weigh the same. Hence, this is an arithmetic mean. Otherwise, the estimator would need to change to a weighted average in case different clients weigh differently.
The next parameter to define is the sensitivity of the query function (Euclidean distance), taken from [dwork2014algorithmic], being $\mathbb{S}(\tilde{f}) = \max_{C,k} \lVert \tilde{f}(C \cup \{k\}) - \tilde{f}(C) \rVert_2$, where the data of the added user $k$ could be arbitrary. Considering the first lemma from [mcmahan2018], the sensitivity is bounded as $\mathbb{S}(\tilde{f}) \leq S/(qW)$, $W$ being the number of clients, $S$ the bound on the $\ell_2$ norm of each client update, and $q$ the client sampling probability. The vectors $\Delta_k$ include the different model updates computed among the clients.
A further consideration is to keep the gradient updates in a known range. To do so, a standard solution is to clip them by a defined value before averaging. There are two different strategies for clipping the values of a neural network: per-layer clipping and flat clipping [mcmahan2018]. In our case, to reduce complexity, we use flat clipping, with $S$ being the overall clipping parameter. Both strategies rely on the same principle: by layer or by network, the updates are projected onto an $\ell_2$ ball whose radius is the provided clipping value.
Initially, clipping applied a fixed norm, which might not be the best solution; the authors in [andrew2021differentially] found an innovative strategy to adapt the clipping values based on a quantile of the distribution, known as adaptive clipping.
Another necessary parameter defines how the noise added to the model updates scales with the sensitivity to obtain a privacy guarantee. In our model, we add Gaussian noise. The Gaussian noise is defined by $\mathcal{N}(0, \sigma^2)$ for $\sigma = z \cdot \mathbb{S}$, where $z$ is the noise scale and $\mathbb{S}$ is the sensitivity of the query. In each query, all rows are selected ($q = 1$ in the first theorem of [mcmahan2018]).
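Flat clipping, equal-weight averaging and Gaussian noise addition can be combined into one server-side aggregation step, sketched below. This is an illustration under assumptions rather than the implementation used here: the function names `flat_clip` and `dp_average` are hypothetical, every client is assumed to participate in each round, and the noise standard deviation is taken as the noise scale times the sensitivity bound (clipping norm divided by the number of clients).

```python
import numpy as np

def flat_clip(update, clip_norm):
    """Scale the whole update vector so that its l2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    if norm == 0.0:
        return update
    return update * min(1.0, clip_norm / norm)

def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Clip each client update, average with equal weights, add Gaussian noise."""
    clipped = np.stack([flat_clip(u, clip_norm) for u in updates])
    n_clients = len(updates)
    # With all clients selected and equal weights, one client can move the
    # average by at most clip_norm / n_clients.
    sensitivity = clip_norm / n_clients
    sigma = noise_multiplier * sensitivity
    noise = rng.normal(0.0, sigma, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise

rng = np.random.default_rng(3)
# Hypothetical flattened model updates from three clients.
updates = [rng.normal(size=8) * scale for scale in (0.5, 2.0, 10.0)]
new_update = dp_average(updates, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping is what makes the sensitivity bound hold: without it, a single client with an arbitrarily large update could dominate the average, and no finite amount of noise would hide its contribution.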
Finally, we need to estimate the privacy loss (privacy budget $\varepsilon$) generated by the addition of noise. For this calculation, we use the accountant provided by Rényi Differential Privacy (RDP) [Mironov_2017].
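To make the accounting step more concrete, the sketch below converts RDP guarantees of a repeated Gaussian mechanism into an $(\varepsilon, \delta)$ statement. It is a simplified illustration under stated assumptions, not the accountant used in the simulations: every client participates in every round, the noise multiplier is the noise standard deviation divided by the sensitivity, the conversion is the standard one from [Mironov_2017], and the value of `delta` and the grid of orders are hypothetical choices. It is not expected to reproduce the privacy guarantees reported later.

```python
import math


def gaussian_rdp_epsilon(noise_multiplier, rounds, delta, orders=None):
    """Upper bound on epsilon for `rounds` compositions of a Gaussian
    mechanism, without subsampling.

    One round satisfies (alpha, alpha / (2 * z**2))-RDP for every order
    alpha > 1, RDP composes additively over rounds, and an (alpha, rho)-RDP
    guarantee implies (rho + log(1 / delta) / (alpha - 1), delta)-DP.
    """
    if orders is None:
        # Hypothetical grid of orders: 1.1, 1.2, ..., 100.9.
        orders = [1.0 + step / 10.0 for step in range(1, 1000)]
    best = float("inf")
    for alpha in orders:
        rdp = rounds * alpha / (2.0 * noise_multiplier ** 2)
        epsilon = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, epsilon)
    return best


# Illustrative call: 300 rounds, noise multiplier 8, delta chosen arbitrarily.
eps = gaussian_rdp_epsilon(noise_multiplier=8.0, rounds=300, delta=1e-5)
```

A larger noise multiplier lowers the resulting $\varepsilon$, which is the trade-off explored in Scenario D.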
Based on the parameters defined above, the query is secured, and the FL model using FedAvg can start training shared models. DP secures the model's updates by adding random noise. This method prevents any malicious agent from reconstructing the original data out of the model updates. DP, however, does not protect the communication channel between the clients and the central server. To do so, the authors in [SecAgg] leverage the work of [damgaard2012multiparty], allowing "a group of mutually distrustful parties each hold a private value and collaborate to compute an aggregate value". In SecAgg, the model knows that at least a minimum number of users participated, but not which users. SecAgg involves two main algorithms: sharing and reconstruction.
The sharing algorithm transforms a secret into a set of shares of the secret, associated with different clients. These shares follow [shamir1979share]; hence, collusion between participants is insufficient to disclose other clients' private information. The reconstruction algorithm works in the opposite direction: it takes the shares from the clients and reconstructs the secret. In other words, clients share their secrets (model updates) with the server through a secure channel without the server being able to reconstruct the individual secrets. The central server forwards the received encrypted shares (encrypted with the other clients' public keys). Each client, upon reception, masks its input and sends it back to the server. Finally, the central server asks for the shares of a client. It reconstructs the aggregated value (the secure sum of the different clients' model updates) without knowing the individual secrets or the participating clients.
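The sharing and reconstruction algorithms can be illustrated with a minimal sketch of Shamir's scheme [shamir1979share] over a prime field. This is not the SecAgg protocol itself: the prime, the function names, and the use of integer secrets are assumptions made for illustration (real model updates would first need to be quantized to integers), and a deployment would use a cryptographically secure random source instead of `random`.

```python
import random

PRIME = 2 ** 61 - 1  # a Mersenne prime, used here as the field modulus


def make_shares(secret, threshold, num_shares, rng=random):
    """Split `secret` into `num_shares` points of a random polynomial of
    degree `threshold - 1` whose constant term is the secret."""
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for coeff in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + coeff) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


def add_shares(shares_a, shares_b):
    """Add two sharings point by point; the result shares the sum of the secrets."""
    return [(xa, (ya + yb) % PRIME) for (xa, ya), (_, yb) in zip(shares_a, shares_b)]
```

Because shares can be added point by point, any `threshold` of the summed shares reconstruct the sum of the secrets without revealing the individual values, which is the property a secure sum relies on.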
3.5 Model operation
Having covered all the parameters necessary to secure an FL model, it remains to describe how the model operates. This section describes the six main steps the model follows to compute the forecasts, while Figure 3 illustrates the entire process.
Firstly, the central server has to select a model architecture and initialize it. Here, we initialize the model using Glorot initialization [glorot2010understanding]. Secondly, the central server shares the model with the respective clients. Thirdly, at each client, the local models start training on their local data. Fourthly, clients send their updates (w.r.t. the initial model) to the central server after each epoch. Fifthly, the central server averages these model updates and, in the case of DP, adds noise drawn from a Gaussian distribution. The noise multiplier is the ratio of the standard deviation of the noise to the query sensitivity. Sixthly and finally, the server returns the model (perturbed with noise if DP is considered) to the clients. The clients continue this training process, sending and receiving updates until they reach their common goal.
If we apply SecAgg, the main steps remain the same. However, the protocol requires a set of cryptographic primitives to secure the communication between the clients and the server. Additionally, the protocol requires further communication rounds for sharing the clients' secrets and public keys.
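The server-side part of one communication round can be sketched as follows. This is a simplified illustration rather than the TFF implementation used in the simulations: model weights and updates are plain lists of floats, all clients weigh the same, and the function name `fedavg_round` and its parameters are hypothetical.

```python
import math
import random


def fedavg_round(global_weights, client_updates, clip_norm=None,
                 noise_multiplier=0.0, rng=None):
    """Average the clients' updates and apply them to the global model.

    If `clip_norm` is given, each update is first clipped to that L2 norm.
    If `noise_multiplier` is also positive, Gaussian noise with standard
    deviation noise_multiplier * clip_norm / number_of_clients is added.
    """
    rng = rng or random.Random()
    num_clients = len(client_updates)
    processed = []
    for update in client_updates:
        if clip_norm is not None:
            norm = math.sqrt(sum(value * value for value in update))
            scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
            update = [value * scale for value in update]
        processed.append(update)
    # Arithmetic mean over clients, coordinate by coordinate.
    average = [sum(column) / num_clients for column in zip(*processed)]
    if clip_norm is not None and noise_multiplier > 0:
        sigma = noise_multiplier * clip_norm / num_clients
        average = [value + rng.gauss(0.0, sigma) for value in average]
    return [weight + delta for weight, delta in zip(global_weights, average)]


# Without DP: plain averaging of two hypothetical client updates.
new_weights = fedavg_round([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
```

With `clip_norm` and `noise_multiplier` set, the same function performs the perturbed aggregation of the fifth step; the clients then continue training from the returned weights.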
4 Simulation and evaluation of the scenarios
4.1 Scenarios Design
We design a set of scenarios to analyze the performance that FL could bring to STLF. These scenarios enable us to systematically evaluate a classical FL setup, data correlation, the FL architecture, and FL with privacy-preserving techniques. In total, we designed five scenarios (lettered from A to E), summarized in Table 2. For the first three scenarios (A, B, and C), we took a performance perspective and assessed, in a metrics-oriented way, how the models described in the literature behave in a decentralized environment. The following two scenarios (D and E) complement these models with privacy-preserving techniques (DP and SecAgg). In each scenario, we run eight simulations. Each simulation assesses the behaviour of the models in different settings where we scale (increase) the number of clients. The scaling process uses eight buckets containing 2, 5, 8, 11, 14, 17, 20, and 23 clients, respectively. Each client contains data from exactly one LCLid. We limit the number of clients (LCLids) to 23 due to the severe computational burden of training the federated models. For instance, the model trained with [s20051399] as baseline model needed around eight days to finish.
Scenarios  PrivacyPreserving Method  Baseline Model  Imposed Correlation 

A    Marino et al.  
B    Marino et al.  
C    Khan et al.  
D  Differential Privacy  Marino et al.  
E  Secure Aggregation  Marino et al. 
4.2 Simulation Environment
The simulations were conducted using the high performance computing (HPC) facilities of the University of Luxembourg [HPC] within the IRIS Cluster. Depending on the availability of the graphics processing units (GPUs), the federations run in an environment with 32 Intel Skylake cores and two NVIDIA Tesla V100 GPUs with 16 GB or 32 GB of memory. The federation code is written in Python, based on the framework provided by TensorFlow Federated (TFF, https://github.com/tensorflow/federated), whereas the DL models are written in Keras [chollet2015keras]. Concerning the timeline, the simulations lasted around 800 hours, distributed over June, July, and August 2021.
4.3 Simulation Results
4.3.1 Scenario A
This scenario analyzes the scaling performance of the decentralized DL architecture designed by [marino2016building] in a classical FL setup (decentralized). We scale the number of clients in the federation as follows: 2, 5, 8, 11, 14, 17, 20, and 23 clients. We use an FL architecture without any privacy-preserving techniques. Each client uses data from one random LCLid taken from the final dataset (Section 3.1). Furthermore, in this scenario we impose no data correlation among the clients. From an FL training perspective, we trained the FL model over 300 communication rounds. Each client uses an SGD optimizer whose learning rate follows the original code of [mcmahan2018].
Table 3 collects the metric results, expressed in absolute values, and the training time needed per round, expressed in seconds [s].
Federation size  MSE  RMSE  MAE  MAPE  Time per round [s] 

2  0.0085  0.0923  0.0468  0.3095  42 
5  0.0104  0.1022  0.0340  0.2252  168 
8  0.0153  0.1239  0.0448  0.3302  179 
11  0.0115  0.1076  0.0334  0.3302  246 
14  0.0109  0.1045  0.0353  0.2657  323 
17  0.0114  0.1070  0.0315  0.2409  390 
20  0.0124  0.1114  0.0327  0.2086  476 
23  0.0119  0.1092  0.0304  0.2049  564 
The MAE and MAPE metrics improve as the number of clients increases, as displayed in Table 3. However, the MSE and RMSE metrics remain almost constant. These results underline the importance of collecting several metrics when analyzing any forecasting model. The increase in data quantity when scaling the number of LCLids corroborates the known fact that more data benefits DL training, as shown in [hussein2019impact]. However, FL is not a categorical example of this behaviour, because different clients might have opposing data that drag down the performance of the models. In FL, what matters is not the amount of data but rather its quality. Nevertheless, it is clear that in our FL setup the computational time also increases with the number of clients, potentially creating time constraints for our FL model.
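For reference, the four metrics reported in the tables can be computed as in the sketch below. It assumes the standard definitions and reports MAPE as a fraction rather than a percentage, which matches the magnitude of the values in the tables; the function name `forecast_metrics` is hypothetical.

```python
import math


def forecast_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and MAPE (as a fraction) for two equal-length series.

    MAPE divides by the actual values, so it is undefined when an actual
    value is zero.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape}


# Hypothetical consumption values and forecasts.
metrics = forecast_metrics([1.0, 2.0, 4.0], [1.0, 1.0, 5.0])
```

MSE and RMSE penalize large errors quadratically, whereas MAE and MAPE weigh all errors linearly, which is one reason the two pairs can move differently as the federation grows.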
Additionally, Figure 4 shows the MAPE of the models during the different training rounds for the eight simulations. We can observe a quasi-logarithmic decrease in the MAPE over the 300 rounds. We investigated the spikes, which are due to the data itself: there is a significant difference in the consumption data input (batch). Nevertheless, throughout the 300 rounds, the MAPE obtained is between 0.20 and 0.35, which can be considered a reasonable forecast based on [lewis1982industrial].
4.3.2 Scenario B
In this scenario, we analyze the effect of data correlation in a classical FL setup. Hence, we build on top of Scenario A (4.3.1). The method we use for the correlation is the Pearson correlation, as in [en13174408]. Scenario B considers only data from specific ACORN groups (H and L), which serves as a correlation filter. Then, for each federation size ([2, 5, 8, 11, 14, 17, 20, 23] clients), we calculate all possible non-repeated combinations and compute their correlation, from which we select the combination with the highest correlation.
As in Scenario A (4.3.1), we collect the results in Table 4, where we display the metric values and the correlation rate, both expressed in absolute values. The computation time is the same as in Scenario A (4.3.1).
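The selection of the most correlated combination can be sketched as follows. The text does not state how the correlation of a whole combination is computed, so this example assumes the mean of the pairwise Pearson correlations; the function names and the toy series are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def group_correlation(series_by_id, ids):
    """Assumed group score: mean Pearson correlation over all pairs in `ids`."""
    pairs = list(combinations(ids, 2))
    total = sum(pearson(series_by_id[a], series_by_id[b]) for a, b in pairs)
    return total / len(pairs)


def most_correlated_group(series_by_id, size):
    """Exhaustively search all combinations of `size` clients."""
    candidates = combinations(sorted(series_by_id), size)
    return max(candidates, key=lambda ids: group_correlation(series_by_id, ids))


# Toy consumption series for four hypothetical households.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [1.0, 3.0, 2.0, 4.0],
}
best_pair = most_correlated_group(toy, 2)
```

An exhaustive search is only practical for a small pool of candidate households, since the number of combinations grows quickly with the pool and the federation size.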
Federation size  MSE  RMSE  MAE  MAPE  Correlation rate 

2  0.0033  0.0582  0.0357  0.2898  0.62 
5  0.0100  0.1001  0.0464  0.3201  0.51 
8  0.0097  0.0989  0.0333  0.2607  0.49 
11  0.0111  0.1056  0.0330  0.2289  0.45 
14  0.0110  0.1049  0.0302  0.2196  0.42 
17  0.0121  0.1102  0.0339  0.2276  0.37 
20  0.0113  0.1065  0.0316  0.2085  0.34 
23  0.0121  0.1103  0.0311  0.2124  0.31 
With correlation applied, the results are better than in the previous scenario for almost every metric. There are only 12 cases where a metric decreases in performance, and only one case (federation size 23) where the results are consistently worse. Even there, the performance decrease is between 1% and 3.6%; the correlation rate of federation size 23 is too low to produce a change in performance. Other buckets, in contrast, significantly increased their performance, by up to 60% (federation size 2, MSE). Overall, the average metric performance increases, and thus the error decreases: the MSE improves by 12.98%, the RMSE by 7.62%, the MAE by 2.769%, and the MAPE by 4.39%. These results are in accordance with similar simulations in [9469923], where applying K-means to cluster customers offers a performance gain between 10% and 15%. Hence, just by exploiting correlations among the data used, metrics tend to improve. From an energy point of view, the increase in performance can potentially further reduce the imbalance costs caused by forecasting errors. From a general power system perspective, the increase in forecasting accuracy is a positive outcome, as the system operator could also plan assets and calculate potential congestions better.
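The per-bucket comparison behind these percentages can be reproduced with a short calculation. The sketch below uses the MSE values of Tables 3 and 4 and assumes that improvement is measured as the relative error reduction with respect to Scenario A; how the reported averages were aggregated is not stated, so the sketch does not attempt to reproduce them.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage by which the error dropped (negative means it got worse)."""
    return (baseline_error - new_error) / baseline_error * 100.0


# MSE per federation size, copied from Table 3 (Scenario A) and Table 4 (Scenario B).
mse_a = {2: 0.0085, 5: 0.0104, 8: 0.0153, 11: 0.0115,
         14: 0.0109, 17: 0.0114, 20: 0.0124, 23: 0.0119}
mse_b = {2: 0.0033, 5: 0.0100, 8: 0.0097, 11: 0.0111,
         14: 0.0110, 17: 0.0121, 20: 0.0113, 23: 0.0121}

for size in mse_a:
    change = relative_improvement(mse_a[size], mse_b[size])
    print(f"federation size {size}: {change:+.1f}% MSE improvement")
```

For federation size 2 the MSE drops by roughly 60%, while for federation size 23 it rises by under 2%, in line with the observations above.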
4.3.3 Scenario C
In this scenario, we explore how a bigger DL architecture, in terms of parameters, impacts the metrics. The motivation for Scenario C comes from our preliminary conceptual review presented in Section 2. We noticed that the model of [s20051399] behaves similarly when computed on our dataset and in their presented results. Their DL architecture has a higher complexity, as it includes more parameters, and its size is almost five times that of the baseline model used in Scenarios A and B. Given its size and the potential computational burden, we implement three modifications. The first modification concerns the GPUs, where we modify the settings of each of the two NVIDIA Tesla GPUs allocated on the HPC. For each of them, we create two virtual cards, resulting in four cards for the FL model to train on. The second modification is the batch size, which we increase from 100 to 200. Ideally, the batch size increase should prevent overfitting, since more data entries are available to compute the loss of the model. Finally, we modify the NN implementation of the DL architecture proposed by [s20051399]: we transform the initially proposed LSTM layers into CuDNNLSTM layers [appleyard2016optimizing]. The transformation enables the LSTMs to use the Compute Unified Device Architecture (CUDA) kernels of our Tesla GPUs to reduce the computation time.
As in the previous scenarios, Table 5 collects the metric results obtained from our simulations, in absolute values, and the average computation time, expressed in seconds [s].
Federation size  MSE  RMSE  MAE  MAPE  Time per round [s] 

2  0.0026  0.0516  0.0215  0.1356  281 
5  0.0070  0.0841  0.0450  0.3266  775 
8  0.0116  0.1081  0.0599  0.5774  848 
11  0.0122  0.1107  0.0654  0.6155  1026 
14  0.0116  0.1049  0.1078  0.5826  1377 
17  0.0161  0.1269  0.0757  0.7452  1615 
20  0.0212  0.1458  0.0818  0.6920  2005 
23  0.0245  0.1567  0.0884  0.7471  2500 
The results of Scenario C clearly show the computational burden of training a complex FL model. The computational time recorded is almost five times that recorded in the previous scenarios. Concerning the metrics, performance increases for only 8 out of 32 metrics when compared to Scenario A. These are federation size 2 (all metrics) and federation sizes 5 and 8 (MSE and RMSE). Their performance increase ranges from 13% up to 68%. Conversely, performance decreases for the remaining 24 metrics, by between 0.39% and 264%.
These results point to a clear case of overfitting, where the FL model's performance decreases dramatically as the number of clients increases. It is known that models in FL tend to converge to a middle point [Li_2020] where all the different clients find their local minimum. Overfitting is usually defined as the lack of generalization of a model. An overfitted model has crossed the line between learning tendencies or patterns and memorizing the data received as input. During FedAvg, the models are averaged at every communication round. Averaging models that have understood patterns results in new models that can devise shared patterns. When overfitted models are averaged, the result does not differ from blatant noise. Furthermore, when exploring the MAPE results over the 300 rounds, depicted in Figure 5, the overfitting is clearly visible. The green line, for two LCLids, shows a sharp slope within the first 40 rounds.
4.3.4 Scenario D
Scenario D focuses on implementing DP as a privacy-preserving technique and analyzing its impact on our FL model. The addition of noise in DP is not a trivial pursuit, as it is crucial to scale the noise adequately for a given dataset to maintain both privacy and performance. Furthermore, within flat clipping, explained in section 2.2, we implement and compare two approaches, fixed clipping and adaptive clipping, explained in section 3.4. We only consider one federation size, 17, in this scenario, since all federation sizes follow the same logic for implementing DP.
The first technique implemented is fixed clipping, following the steps in [mcmahan2018]. There are two main steps. The first is to identify the lowest possible clipping value ($S$). We treat $S$ as a hyperparameter of our model. Clipping could negatively affect the convergence rate of any model, as it clips all values bigger than $S$. The second is to identify a tolerable level of noise for our simulations. These values allow us to compute the privacy guarantee of our model.
The identification of the lowest clipping value follows an iterative calculation of all metrics for our FL model, starting with $S = 0.01$ until $S = 0.5$. The iteration uses 0.05 steps, where the selection of the starting point follows the recommendations of [mcmahan2018]. To select the lowest clipping value, we use the results for k=17 in Scenario A as the metric benchmark and look for a clipping value with similar performance. We collected the iteration results in Table 6 and selected $S = 0.4$ as our fixed clipping value.
S  MSE  RMSE  MAE  MAPE 

0.01  0.1000  0.3163  0.1683  1.3331 
0.05  0.0548  0.2342  0.1172  0.9546 
0.1  0.0414  0.2036  0.0915  0.7180 
0.2  0.0239  0.1547  0.0610  0.4728 
0.25  0.0225  0.1503  0.0544  0.4086 
0.3  0.0200  0.1415  0.0483  0.3514 
0.35  0.0182  0.1350  0.0441  0.3218 
0.4  0.0153  0.1237  0.0349  0.2477 
0.45  0.0148  0.1218  0.0340  0.2417 
0.5  0.0146  0.1211  0.0352  0.2554 
Once the lowest clipping parameter is identified, we can compute the standard deviation of the noise. With $S$ and the expected number of clients, we obtain the sensitivity of the query and multiply it by the noise scale $z$ to calculate the standard deviation of the noise, $\sigma$. As in the previous step, we treat $z$ as a hyperparameter and proceed in iterations. This gives us different $\sigma$ values to add to the training process and to calculate the privacy guarantee. We calculated the privacy guarantee using the Rényi Differential Privacy Accountant [Mironov_2017]. Table 7 collects the metric results and the calculated privacy guarantee obtained in the iterative process.
Clients  Clipping ($S$)  Sensitivity  Noise scale ($z$)  Stddev noise ($\sigma$)

17  0.4  0.023  1  0.023
17  0.4  0.023  2  0.04 
17  0.4  0.023  4  0.09 
17  0.4  0.023  8  0.18 
17  0.4  0.023  16  0.36 
17  0.4  0.023  32  0.72 
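The relation between the clipping value, the number of clients, and the noise level can be written as a short calculation. The sketch assumes that the sensitivity is the clipping value divided by the number of clients and that the standard deviation is the noise scale times this sensitivity; the function name `noise_stddev` is hypothetical.

```python
def noise_stddev(clip_norm, num_clients, noise_multiplier):
    """Standard deviation of the Gaussian noise added to the averaged update."""
    sensitivity = clip_norm / num_clients  # bound on the sensitivity of the average
    return noise_multiplier * sensitivity


# Clipping value 0.4 and 17 clients, for the noise scales used in the iteration.
for z in (1, 2, 4, 8, 16, 32):
    print(z, round(noise_stddev(0.4, 17, z), 3))
```

The printed values are close to, but not identical with, the standard deviations listed in the table, presumably because of rounding in the reported figures.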
Considering the obtained values, the addition of DP distorts the values by adding noise, which reduces performance. Technically, when comparing the results obtained in Table 8 with the previous results for the respective federation size (17) in Scenario A (Table 3), the performance decreases on average. The decrease grows as the noise scale ($z$) increases, resulting at the end of our simulations in an average performance decrease in the metrics of 27%. However, we should contextualize the results. We can consider the measurements obtained as accurate results in themselves. The metrics are relatively low, displaying a decent forecasting performance, although with a higher error than Scenario A (no DP). Yet, it is necessary to remark that DP provides a privacy guarantee of $(1.39, \delta)$-DP, where the lower the score, the better. Our privacy results are therefore close to perfect privacy.
Noise scale ($z$)  Stddev noise ($\sigma$)  Privacy guarantee ($\varepsilon$, $\delta$)  MSE  RMSE  MAE  MAPE 

1  0.023  (92.9,)  0.0163  0.1280  0.0376  0.2618 
2  0.04  (33.9,)  0.0165  0.1288  0.0393  0.2797 
4  0.09  (13.8,)  0.0161  0.1270  0.0373  0.2663 
8  0.18  (6.1,)  0.0159  0.1262  0.0360  0.2542 
16  0.36  (2.8,)  0.0176  0.1326  0.0398  0.2808 
32  0.72  (1.39,)  0.0196  0.1401  0.0445  0.3148 
The second technique implemented for our analysis is adaptive clipping. For our FL model, we consider the implementation in [andrew2021differentially], whose algorithm iteratively adjusts the clipping norm, trying to approximate a fixed quantile. Hence, contrary to fixed clipping, there is no need to search for the lowest clipping value and noise. The clipping adapts per round. Figure 6 represents the evolution and adjustment of the clipping value over the training rounds. The spike at the very beginning is due to the low initial clipping value. With such a low value, few data points participate in selecting the following clipping values during the initial rounds. The fewer the data points, the more difficult it is to estimate the optimal value. Consequently, the size of the steps required to make this estimate increases exponentially until it reaches the real quantile, resulting in an even steeper growth.
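The per-round adjustment can be illustrated with the geometric update rule described in [andrew2021differentially]. The sketch below omits the noise that the private version adds to the estimated fraction, and the target quantile, learning rate, starting value, and update norms are illustrative values, not those used in the simulations.

```python
import math


def update_clip_norm(clip_norm, update_norms, target_quantile=0.5,
                     learning_rate=0.2):
    """One geometric update of the clipping norm.

    The norm grows when fewer than `target_quantile` of the updates fall
    below it and shrinks when more do.
    """
    below = sum(1 for norm in update_norms if norm <= clip_norm) / len(update_norms)
    return clip_norm * math.exp(-learning_rate * (below - target_quantile))


# Starting far below the update norms, the clip norm climbs multiplicatively
# and then settles around the median of the norms.
clip = 0.01
for _ in range(200):
    clip = update_clip_norm(clip, [0.1, 0.2, 0.3, 0.4, 0.5])
```

Because the update is multiplicative, a clipping value that starts far below the typical update norm grows exponentially during the first rounds, which is consistent with the initial spike described above.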
We collect the metrics obtained when implementing adaptive clipping in Table 9. Similar to the previous results with fixed clipping, there is a decrease in performance compared to Scenario A; although it is lower than with fixed clipping, it is still around 21%. Hence, adaptive clipping does improve performance relative to fixed clipping. Nonetheless, the same contextualization made for fixed clipping applies to these results. They are technically worse than an FL setup with no DP, yet we can justify the performance decrease because privacy is guaranteed. The privacy guarantee obtained in this scenario is $(2.01, \delta)$-DP. Even though it is slightly worse than the one obtained for fixed clipping, it is still a remarkable privacy guarantee.
Noise scale ($z$)  Stddev noise ($\sigma$)  Privacy guarantee ($\varepsilon$, $\delta$)  MSE  RMSE  MAE  MAPE 

1.35  0.023  (59.2,)  0.0158  0.1259  0.0364  0.2588 
2.5  0.04  (25.16,)  0.0167  0.1295  0.0383  0.2729 
5.6  0.09  (9.25,)  0.0166  0.1290  0.0367  0.2576 
11.2  0.18  (4.23,)  0.0163  0.1278  0.0362  0.2526 
22.4  0.36  (2.01,)  0.0166  0.1289  0.0359  0.2515 
44  0.72  (1.00,)  0.0632  0.2515  0.1294  1.0386 
4.3.5 Scenario E
In this scenario, the focus is on SecAgg as a different method to bring privacy into an FL model. Contrary to the previous scenario, where DP adds random noise drawn from a Gaussian distribution, SecAgg targets the communication and aggregation of the clients. Hence, there is no trade-off as in Scenario D, where it is mandatory to find a suitable noise level. As for Scenarios A, B, and C, we collect the eight simulation results for each federation size in Table 10. We express the metrics in absolute values and the average computation time in seconds [s]. Furthermore, we complement the results with Figure 7, where the error in terms of MAPE decreases in an almost logarithmic manner and stabilizes at the end of the 300 rounds.
As the results in Table 10 show, applying SecAgg affects the computation time of the model only negligibly compared with Scenario A and even B. Moreover, SecAgg provides a better metric performance than DP.
Federation size  MSE  RMSE  MAE  MAPE  Time per round [s] 

2  0.0076  0.0875  0.0431  0.2989  43 
5  0.0106  0.1032  0.0348  0.2332  173 
8  0.0153  0.1237  0.0427  0.3136  179 
11  0.0121  0.1104  0.0341  0.2589  246 
14  0.0110  0.1049  0.0330  0.2403  313 
17  0.0114  0.1071  0.0309  0.2383  388 
20  0.0128  0.1133  0.0334  0.2111  490 
23  0.0102  0.1012  0.0298  0.2030  570 
5 Conclusions
This work discusses the application of a collaborative short term load forecasting technique, federated learning, using centralized deep learning models and adding privacy-preserving techniques to the federated learning model.
This paper first examined the most relevant short term load forecasting literature against our dataset to choose a deep learning architecture. The chosen one, a Sequence-to-Sequence Encoder-Decoder and two long short-term memory layers, served as a foundation for our federated model. The stepwise analysis over different considerations (size, correlation, and even a different deep learning architecture) let us achieve promising results for the federated learning application for residential short term load forecasting. The results align with the ones proposed by [9469923] and [FEKRI2021107669]: (1) performance increases when a federation uses highly correlated data to train, (2) bigger models tend to overfit, which affects performance, and (3) the deep learning architecture has a high impact on computation time. Concerning the metrics themselves, the application of federated learning for residential short term load forecasting is encouraging, as the obtained metrics show low error.
Later, this paper covers the application of privacy-preserving techniques for short term load forecasting in a federated learning setting. To our knowledge, this is the first paper to cover them. We introduced Differential Privacy and Secure Aggregation to obtain a secure setting in an FL context and to explore the performance decrease they create when applied. Their application results in a minimal error increase, although every error increase is a step in the opposite direction. However, the inclusion of privacy is a feature that residential short term load forecasting needs in some jurisdictions. Either way, the privacy guarantee obtained is remarkably close to a secure theoretical setting. We obtained $(1.39, \delta)$ and $(2.01, \delta)$ as the best privacy budgets in $(\varepsilon, \delta)$ terms for the fixed and adaptive clipping cases of differential privacy, respectively. Furthermore, our analysis also examined a fundamental parameter, clipping, within the DP case. The FL models behave consistently with both fixed and adaptive clipping. However, adaptive clipping performs marginally better, as expected from [andrew2021differentially].
The main lesson learned from our analysis is that finding an adequate trade-off between noise, performance, and utility is not a trivial endeavour. Nonetheless, after the analysis, we can posit: (1) An initial scaling of the data positively affects the privacy budget because it reduces the sensitivity of the query. (2) There is no significant difference between fixed and adaptive clipping. Still, the iterative nature of the latter is more suitable for real-world scenarios because of its rapid convergence rate and because it removes the need to search for a clipping value beforehand. (3) The addition of secure aggregation barely affects the performance; hence, the time required to create a secure configuration is worth the service it provides. (4) Despite the various constraints of our environment, such as the maximum of 23 smart meters, we can discuss the benefits that adding a privacy layer to residential short-term load forecasting could provide. We are able to protect the customers' privacy and provide acceptable forecasting performance because the collaborative approach, federated learning, already improves forecasting performance.
Finally, the next steps in this research are (1) to assess bigger (scaled) setups with additional correlation indicators, such as the existence of distributed energy resources (e.g., photovoltaics, electric vehicles, or home energy management systems), in order to improve correlation, and (2) to investigate data input disruptions produced by a hostile agent or a mistake caused by a smart metering device malfunction.
CRediT authorship contribution statement
Conceptualization, J.D.F, S.P.M, C.L; Methodology, J.D.F, S.P.M, C.L; Data Curation, J.D.F, S.P.M; Writing  Original Draft, J.D.F, S.P.M, C.L; Software J.D.F; Supervision G.F.; Writing  Review & Editing, J.D.F, S.P.M, C.L, G.F.; Visualization, J.D.F, S.P.M.; Funding acquisition, G.F. All authors have read and agreed to the published version of the manuscript.
Declaration of Competing Interest
The authors declare no conflict of interest.
Acknowledgements
This work has been supported by funding from the European Union (EU) within its Horizon 2020 programme, project MDOT (Medical Device Obligations Taskforce), Grant agreement 814654, and from the Kopernikusproject “SynErgie” by the German Federal Ministry of Education and Research (BMBF).