Over the last years, the power system has undergone considerable changes, and this will continue over the upcoming years [cahllenges_power_system]. There are three main trends: decarbonization, decentralization, and digitalization [3_ds]. However, the core challenge of the power system is still to balance supply and demand (load) of electricity at every single point in time due to the power volatility of variable energy resources (VERs). The incremental share of VERs in the power system contributes to the balancing challenge. Forecasting is necessary to ensure balance, maintain quality, secure electricity supply, and operate the power system at lower costs. Amongst various types of forecasting, load forecasting is crucial for energy actors, such as market agents, because it allows for a better understanding of consumption and pricing patterns [forecasting_importance]. Within load forecasting, short term load forecasting (STLF) focuses on the load side with a time window from a few minutes or hours to one day-ahead or a week [etnso_forecasting_def]. STLF is vital for many operational processes in the power system, such as planning, operating, and scheduling [inbook]. Concerning residential STLF, it aims to forecast electrical household consumption [kWh] and to assist for example market agents in tackling energy deviations. These energy deviations impact the energy price, which in turn, has a direct impact on the electricity costs that customers face. Therefore, the importance of load forecasting and its categories increases with demand, not only from an economical point of view but, as previously stated, also from an operational point of view.
According to data provided by the International Energy Agency (IEA) [iea_consumption], electricity consumption increased by 47% worldwide between 2010 and 2018. In line with the IEA data, residential electricity consumption follows the global upward trend. The European Union (EU) is an exception as residential electricity consumption remained stable during the same period due to efficient policies. Still, electricity consumption in the EU could increase from electrification processes [electrification]. For example, the electrification of private transportation (e.g. electric vehicles) will further affect electricity consumption patterns. Additionally, there is a recent shift towards an increase in residential electricity consumption due to the recent COVID-19 pandemic and its consequent increase in adoption of remote work [en14040980]. While COVID-19 is a historical exception, this general shift towards increased residential electricity consumption might not be an anomaly in the future. Thus, it is expected that residential STLF particularly gains more importance in the overall power system.
There are traditional methods for STLF, but these methods build on limiting assumptions. Early techniques use statistical time series models relying on seasonal autoregressive integrated moving average (ARIMA) [kaur2017timeARIMA], exponential smoothing for double seasonality, or linear transfer functions. These techniques fall short by assuming linearity of data. Hence, there is a need for models capable of coping with non-linear dependencies, an area where Artificial Intelligence (AI) methods are gaining momentum.
For STLF, the literature [inbook, s20051399, forecasting_approaches, Electric_load_forecasting, 910780, ardabili2019advances] reports machine learning (ML) techniques such as support vector regression [CHEN2017659SVR], random forest regressors [bogomolov2016energyRFR], or boosting algorithms. Likewise, deep learning (DL) techniques also rely on various architectures used when creating Neural Networks (NN). For example, the primary use of Convolutional Neural Networks (CNN) is spatial recognition due to its similarity to human retinas’ ability to extract patterns. Areas such as time-series analysis and natural language processing or speech recognition use Recurrent Neural Networks (RNN) due to their capacity to remember previous data patterns. However, the challenges to STLF are not exclusively computer vision or pattern-based issues. Finding models that can deal with spatial and sequential problems requires more complex approaches that offer more accurate results when forecasting, despite the higher computational cost.
Whether statistical or AI based, forecasting techniques need data and data scarcity drastically reduces their accuracy [problem_data_forecasting]. Residential STLF is not an exception and there is already a solution which reduces data scarcity through digitalization: the push for advanced metering infrastructure (AMI) through smart meters increases the frequency and granularity when collecting data [smart_meters_survey]. As a result, STLF can utilize the available data aggregated from smart meters.
However, the aggregation of smart meter data faces a twofold problem. First, there are considerable privacy challenges concerning smart meter data use due to the sensitivity and correlatability of granular data. Data collected from smart meters installed in residences are granular enough that one can extract an individual customer’s behaviour [hinterstocker2017disaggregation]. Second, the transfer and aggregation of smart meter data is challenging under current regulatory regimes such as the EU’s General Data Protection Regulation (GDPR), the framework introducing a set of guidelines for collecting and processing personal information from European citizens.
According to [smart_meters] and [smart_meter_ownership], despite smart meters’ virtues for forecasting techniques, their usability is limited by restrictions on data use and on device ownership, respectively. As a result of these two constraints, STLF systems have limited access to smart meter data.
Nonetheless, there are several approaches to tackle data scarcity. First, it is possible to solve data scarcity by simply scaling the number of devices that collect data. Yet, the scaling of devices is not a trivial endeavour as technical (reliability, computational resources and manageability) [Potenciano_Menci_2020] and economic factors play an important role [Cossent2020]. Second, another solution is to combine data using decentralized collaborative approaches or centralized ones. Centralized collaborative approaches such as Belgium’s Atrias [atrias_2021], or Norway’s Elhub [elhub_2021] provide so-called data lakes. However, centralized collaborative approaches are not possible in every market and jurisdiction as previously mentioned.
Decentralized approaches tackle the previously stated issues by connecting different and distributed entities rather than creating a central data pool such as a data lake. A recent divergent and collaborative approach for forecasting is Federated Learning (FL) [mcmahan2017communicationefficient, konecny2016federated]. It offers a collaboration framework to share prediction models instead of raw data. However, standalone FL is not private by default. While FL partially addresses the data scarcity, data ownership, and device ownership issues, it does not offer a viable solution to privacy concerns. For example, AI scholars showed that it is possible to reconstruct the original raw data out of the resulting models, both in the context of DL [zhu2020deep] and FL [geiping2020inverting]. Therefore, the standalone implementation of FL needs to be extended with additional privacy-preserving techniques, such as differential privacy (DP) and secure aggregation (SecAgg), to ensure full privacy of connected entities when computing and communicating model updates or gradient updates, respectively.
Residential STLF can benefit from the decentralized collaborative approach offered by FL, extended by privacy-preserving techniques, as it would overcome the previously exposed issues. When reviewing the literature, both DP and FL were tested in isolation for STLF [dp_smart_meter, dp_imperial_colleague, BARBOSA2016355, dp_sm_guido, briggs2021federated, taik2020electrical]. However, the combined use of FL with privacy-preserving techniques for residential STLF using smart meter data has not been tested. Therefore, in this paper, we investigate the combined use of FL with privacy-preserving techniques for residential STLF using smart meter data. We provide a holistic view of the STLF challenge and analyze whether state-of-the-art STLF methods applied under distributed conditions behave on par with existing solutions. The following research questions guide our investigation:
How do STLF neural network models from the literature, originally trained centrally, perform in a decentralized approach such as FL?
Does the inclusion of privacy preserving techniques imply a substantial drop in the residential STLF forecasting accuracy?
Which are the main constraints for secure and non-secure FL applied to residential STLF?
This paper is structured as follows. Section 2 provides the related work of our main conceptual pillars; that is, federated learning techniques, privacy-preserving techniques, and forecasting techniques. There, we further review several applications of centralized DL techniques for STLF. Section 3 describes our secure federated model used for residential STLF. This section describes (1) the dataset, (2) the architecture, (3) the deep learning technique, (4) the metrics used for evaluation and, finally, (5) the model operation. In Section 4, we simulate and evaluate the proposed forecasting system by means of five scenarios. These scenarios cover classic FL (no privacy techniques), classic FL with correlated data, a different DL architecture, and secure FL through the implementation of two privacy-preserving techniques. Finally, Section 5 provides the conclusion and the potential future research directions the authors consider.
2 Related work
2.1 Federated Learning
In most fields, AI has already proven its value, though the performance of models is highly dependent on the quantity and quality of data. Generally speaking, the challenge of designing high-performing AI is hindered by problems related to data fragmentation and isolation – mostly due to concerns of competitive pressure and tight regulatory frameworks (related to data privacy and security). The authors in [mcmahan2017communicationefficient, konecny2016federated] proposed a fundamentally new method, FL. The main idea of FL is to allow the training of ML models between multiple disconnected entities without the physical moving of raw data, nor explicitly exposing local raw data in any way to each other. In other words, FL allows competing entities (e.g. companies) to leverage each others’ datasets without revealing their individual dataset. In doing so, there is potential for models trained with FL to obtain a more accurate forecasting output than when each entity independently trained a model. To date, there are two different training approaches and three different configurations for the distribution of data and errors.
The two main approaches to train FL models are federated stochastic gradient descent (Fed-SGD) and federated averaging (Fed-Avg) [mcmahan2017communicationefficient]. Although both rely on similar functionalities, there are differences between the two approaches.
Fed-SGD works by averaging the clients’ gradients (direction of learning) at every step in the learning phase. A client can be thought of as one disconnected entity within FL. More specifically, clients locally compute the gradients of their loss (difference between prediction and ground truth). The clients subsequently send their locally computed gradients to a central server. The central server aggregates the locally computed gradients by applying a (weighted) average over the clients’ updates.
In Fed-Avg, the central server averages the models’ updates when all the clients have finished computing their local models. In other words, Fed-Avg modifies the Fed-SGD algorithm by letting each client compute its own model weights (based on its own gradients). Each client trains its own model in parallel. The impact is twofold: a reduction in the number of communication rounds (per batch in Fed-SGD versus per epoch in Fed-Avg) and an improvement in forecasting performance [mcmahan2017communicationefficient]. Each client uses the received model weights as the base model for the next iteration round. This is repeated until the end of the prescribed rounds.
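The Fed-Avg loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the "model" is a single scalar weight, `local_update` is a hypothetical stand-in for local training, and the names `fed_avg` and `run_rounds` are chosen here for illustration only.

```python
from typing import List

Weights = List[float]  # a flattened model, kept as a plain list for illustration


def local_update(weights: Weights, data: List[float], lr: float = 0.1) -> Weights:
    """Hypothetical local training: one pass of SGD fitting a constant w to the
    client's observations by minimising (w - y)^2."""
    w = list(weights)
    for y in data:
        grad = 2.0 * (w[0] - y)
        w[0] -= lr * grad
    return w


def fed_avg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Server step: average the client models, weighted by local sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def run_rounds(global_w: Weights, clients: List[List[float]], rounds: int = 5) -> Weights:
    """Each round: broadcast the global model, train locally, average on the server."""
    for _ in range(rounds):
        local_models = [local_update(global_w, data) for data in clients]
        global_w = fed_avg(local_models, [len(data) for data in clients])
    return global_w
```

With equal `client_sizes`, `fed_avg` reduces to an arithmetic mean. In Fed-SGD the server would instead average one gradient per step rather than the locally trained weights.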
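As mentioned above, Fed-SGD and Fed-Avg are two approaches to FL, but there are also multiple configurations. The configurations depend on how the feature space $\mathcal{X}$, the label space $\mathcal{Y}$, and the space formed by the identifiers $\mathcal{I}$ are distributed. Different setups of the triplet ($\mathcal{X}$, $\mathcal{Y}$, $\mathcal{I}$) can be classified as Horizontal, Vertical, and Assisted Federated Learning [yang2019federated]. Take for instance two clients $A$ and $B$.
Horizontal Federated Learning is when $A$ and $B$ share feature space $\mathcal{X}$ such that $\mathcal{X}_A = \mathcal{X}_B$ but label space $\mathcal{Y}$ is different such that $\mathcal{Y}_A \neq \mathcal{Y}_B$.
Vertical Federated Learning is when $\mathcal{I}_A = \mathcal{I}_B$, but $\mathcal{X}_A \neq \mathcal{X}_B$ and $\mathcal{Y}_A \neq \mathcal{Y}_B$.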
Assisted Learning (AL) is done through collided data between clients. The authors in [xian2020assisted] define collision as clients holding the same data entries of a dataset but differing in feature space $\mathcal{X}$. AL leverages the sharing of error terms between clients: one client may use the errors of another to improve its own training performance.
Regardless of the approach and configuration, FL is prone to moral hazard issues [DBLP:journals/corr/abs-1912-04977] or so-called ’soft’ attacks on the contextual integrity of the shared data between federated clients. The moral hazard issue arises because FL is by nature collaborative [DBLP:journals/corr/McMahanMRA16]. Multiple clients come together to train models iteratively using the data at their disposal. Therefore, the involved clients need to trust each other’s data and behaviour in order to train and use the final model.
Furthermore, clients do not exchange raw data with each other; instead, they exchange information inferred from raw data. FL as a standalone does not guarantee data privacy because of this information exchange between clients. In [zhu2019deep], researchers found a way to use gradient updates to retrieve the original raw data of a client. Such possibilities stand in conflict with requirements such as the European Union’s GDPR. Therefore, FL requires additional complementary adjustments when applied in a realistic environment where data privacy of clients is sought after.
There is a limited amount of literature which directly addresses the implementation of FL for STLF. For example, the authors in [briggs2021federated] measured the performance of an FL model under clustering. The reported results are on average 10% better relative to centralized learning techniques. Building on their work, a follow-up study applied k-means algorithms to group users according to socio-economic factors. This results in an improvement compared with the default separation through ACORN groups. The authors in [taik2020electrical] followed a similar approach to demonstrate the application of FL over a dataset of over 200 households. The testing phase used a set of four scenarios in which the authors presented the utility of FL applied to STLF with handcrafted models. The number of clients analyzed in their scenarios ranged from 5 to 20, and the number of local epochs from 1 to 5. The authors in [FEKRI2021107669] applied FL to a total of 5 smart meters, comparing the performance of models trained using two FL techniques and for two different time resolutions. While the authors in [briggs2021federated, taik2020electrical, 9469923, FEKRI2021107669] use FL, they do not use any privacy techniques in their papers.
2.2 Privacy Preserving Techniques
Over the last years, the increase of data usage has led to new techniques aiming to extract every last drop out of "statistical information" [dwork2014algorithmic]. This form of data extraction is often at odds with the subject’s privacy. However, privacy-preserving techniques which extend FL cover not only the privacy of statistical information but also the means of communicating it. Hence, within this section we describe DP as a way to protect an individual’s information and SecAgg as a mechanism to protect the communication channel and the exchange of updates among clients and a central server.
The seminal work of [Dwork06] introduced differential privacy (DP) as a new method to solve adversarial attacks without auxiliary information. As described in [dwork2014algorithmic], "differential privacy addresses the paradox of knowing nothing about an individual while learning useful information about a population." In other words, DP hides individual data trends using noise. Notably, [Dwork06] proposes the concept of epsilon differential privacy ($\epsilon$-DP), as follows: "For every pair of inputs $D$ and $D'$ that differ in one row, for every output in S, an adversary should not be able to use the output in S to distinguish between any $D$ and $D'$".
The privacy budget ($\epsilon$) determines how much of an individual’s privacy a query may use, or to what extent it may increase the risk of breaching an individual’s privacy. For instance, a value of $\epsilon = 0$ reflects perfect privacy, which means any analysis done will not affect an individual’s privacy at all [DP_non_tech]. The authors in [jayaraman2019evaluating] extended the concept of $\epsilon$-DP to ($\epsilon$, $\delta$)-DP, where $\delta$ is the failure probability.
DP is accomplished by adding random noise to the data query so that the lack of a single entry is obscured. Laplacian or Gaussian noise are the base methods for this approach [dwork2014algorithmic]. Finding an adequate trade-off between the noise and utility of the remaining model is crucial.
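As a minimal sketch of the noise-addition idea, the snippet below applies the Laplace mechanism to a bounded mean query. It is an illustration under stated assumptions rather than the mechanism used later in this paper: the function names, the clipping bounds `lower` and `upper`, and the replace-one-entry notion of neighbouring datasets are all choices made here.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Epsilon-DP estimate of the mean of values clipped to [lower, upper].

    Assumption: neighbouring datasets differ by replacing one entry, so the
    mean changes by at most (upper - lower) / n, which is the sensitivity.
    """
    rng = rng or random.Random()
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy but lower utility, which is the trade-off referred to above.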
Concerning SecAgg, it uses a secure communication channel to perform secure training without the need for additive noise [SecAgg], thanks to cryptographic primitives. SecAgg is a secure multi-party computation (SMPC) protocol which computes clients’ averages without revealing their data. The protocol allows a set of distributed, unknown clients to aggregate a value without revealing the value to the rest of the participants. The backbone of SecAgg uses Shamir’s t-out-of-n Secret Sharing, which enables a user to split a secret $s$ into $n$ shares [shamir1979share]. To reconstruct the secret, at least $t$ shares are needed to retrieve the original secret $s$. Any allocation with fewer than $t$ shares will provide no information about the original secret.
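The t-out-of-n property can be sketched with a small Shamir secret sharing example over a prime field. The prime `PRIME` and the function names are assumptions made for this illustration; production protocols rely on vetted cryptographic libraries and parameters.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime, chosen here only for illustration


def split_secret(secret: int, n: int, t: int, rng=None):
    """Split `secret` into n shares such that any t of them reconstruct it."""
    rng = rng or random.SystemRandom()
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(t - 1)]

    def poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, n + 1)]


def reconstruct(shares) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * (-xm)) % PRIME
                den = (den * (xj - xm)) % PRIME
        inv_den = pow(den, PRIME - 2, PRIME)  # modular inverse via Fermat
        secret = (secret + yj * num * inv_den) % PRIME
    return secret
```

For example, `reconstruct(split_secret(123456, n=5, t=3)[:3])` returns the original secret, while any two shares interpolate to an unrelated value.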
Even though SecAgg provides a secure setting for the training of models, the models and their latent patterns could still point towards the original data owner. More specifically, Model Inversion (MI) attacks aim to reconstruct the original training data from the model parameters [fredrikson2015model]. Under a SecAgg setting, these parameters are not protected against an MI attack.
As stated in the introduction, privacy-preserving techniques in STLF help forecasting systems to cope with current country-specific legislation. From the energy business perspective, privacy-preserving techniques enable competitive agents such as energy vendors to cooperate and integrate with system operators such as distribution system operators (DSOs). Furthermore, the use of privacy-preserving techniques might facilitate the proliferation of local markets that ultimately help the energy transition.
Specifically, in the context of hiding smart metering data, different privacy-preserving methods such as DP were used. One general approach is to add Gaussian noise to each smart meter’s data to prevent adversarial attacks [wang2012randomized]. Similarly, adaptive noise can be added to the Battery-based Load Hiding (BLH) values. Another approach to the problem of hiding smart metering data is to look at spatially aggregated profiles from smart meters. The authors in [eibl2017differential] apply DP by adding Laplacian noise to the aggregated dataset, which allows them to empirically demonstrate the privacy and utility of the results on a real dataset.
Within the literature of STLF [dp_imperial_colleague, eibl2017differential, 6847974], authors consistently perturb the datasets by adding noise drawn from either a Gaussian or Laplacian distribution. Even though the referenced papers use this technique, we could not find any mathematical proof that ensures privacy. Hence, in this paper, we follow the approach proposed by [mcmahan2018] and the moments’ accountant by [mcmahan2017communicationefficient].
2.3 Forecasting Techniques in STLF
In this section, we review different DL architectures (Table 1) taken from STLF literature to further select a suitable base model for our FL model.
|Method||Dataset||Deep Learning Architecture||Year|
|Marino et al. [marino2016building]||UCI - Individual household electric power consumption||Seq2Seq Encoder-Decoder 2x LSTM||2016|
|Kong et al.||Australia SGDS Smart Grid Dataset||Stacked LSTM + FCL (fully connected layer: a layer where all the neurons are connected to the previous layer’s neurons)|| |
|Li et al. [en10101525]||Fremont, CA 15min Retail building electricity load||Missing architecture description||2017|
|Shi et al. ||Irish CBTs - Residential and SMEs||Stacked LSTM + Pooling mechanism||2018|
|Yan et al. [en11113089]||UK-DALE Domestic Appliance-Level Electricity dataset||2x Conv + 1x LSTM + FCL||2018|
|Kim and Cho [en12040739]||UCI - Individual household electric power consumption||Missing architecture description||2019|
|Kim and Cho [kim2019predicting]||UCI - Individual household electric power consumption||2x Conv + LSTM + 2x FCL||2019|
|Le et al. [app9204237]||UCI - Individual household electric power consumption||2x Conv + Bi + LSTM + 2x FCL||2019|
|Khan et al. [s20051399]||UCI - Individual household electric power consumption||2x Conv + 2x LSTM (Encoder) + 2x LSTM (Decoder) + 2x FCL||2020|
Analyzing the collected literature, we observed two main trends. First, we recognized the repeated use of the UCI dataset [UCI]. We understand its use as it is large enough to cover the variability of real-world scenarios. Second, we detected an extensive usage of long short-term memory (LSTM) layers. LSTMs [hochreiter1997long] are RNNs that, contrary to feedforward neural networks (FFNN), have feedback connections. These connections give LSTMs the ability to understand the dependencies between items in a sequence. LSTMs also differ from plain RNNs by adding the Constant Error Carousel (CEC) and three new gates (input, output, and forget gate) to cope with the vanishing gradient problem [hochreiter1998vanishing]. In this context, [marino2016building] used a Sequence-to-Sequence (Seq2Seq) architecture, which means the model takes a sequence (a vector) as input and maps it to another sequence. This primarily reduces the importance of outliers, due to the compression that transposes the original input space into the encoded vector [sutskever2014sequence, cho2014properties]. Seq2Seq is used mainly in applications like language translation due to the intrinsic continuous nature of the data.
In [s20051399, en12040739, kim2019predicting, app9204237] we can see an increase in the depth of the neural networks through the years. Different approaches have tried to cope with the STLF problems (image or pattern-based recognition).
CNNs are known to perform well with spatial patterns, while LSTMs excel at finding temporal ones. The combination of CNN and LSTM, as described in [s20051399], displays the potential performance increase for STLF. However, Shi et al. (Table 1) took a different path. Besides adding more layers until reaching the desired performance, they also cluster and pool the data to increase variability, hence reducing the impact of a single element.
3 Secure Federated Model Design
This section defines our data, metrics, model parameters, and how our FL model operates. This FL model predicts the next hour’s consumption [kWh] based on the consumption data of the last 12 h, using a standardized dataset. Before that, we evaluate selected literature models from Table 1 on the same dataset to select the best performing one based on the devised statistical metrics. The overall objective of this evaluation is to explore the models’ performance in a decentralized environment, as their authors implemented them in a centralized manner.
This analysis uses two main datasets. The first dataset is taken from [uk_data]. Its data collection occurred in the UK Power Networks led Low Carbon London project between November 2011 and February 2014 in London, United Kingdom. It contains the electrical consumption [kWh] of 5567 households at a half-hourly resolution. This dataset also contains the ACORN (A Classification Of Residential Neighbourhoods) classification [acorn]. The dataset is divided into individual household entries known as LCLid (Low Carbon London id). The second dataset is composed of daily and weekly weather profiles from the Greater London area. Consequently, all customers have the same weather profile although their location might differ within the Greater London area.
The pipeline used for dataset treatment consists of 3 main steps. First, we modify the time window of our data. Initially, the data is at a half-hourly resolution, and we resample it to hourly data by summing each pair of consecutive half-hour readings. This modification reduces the computational burden of our analysis. Because of the short measurement interval, sensors might fail at certain measurements, so abnormal or null values are to be expected. During this first step of the pipeline, we trimmed these values. Afterwards, we rescaled all variables to have the same range using a standard-scaler as in [s20051399]. Finally, we combine the datasets to create our own standard dataset for the analysis. In Figure 1 we provide an example of the pipeline process result. It represents the electricity consumption [kWh] of 5 randomly selected LCLid for a 2-day period using 1 h timestamps.
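The preprocessing steps above can be sketched as follows. This is a simplified illustration with plain Python lists, not the pipeline actually used: the function names are hypothetical, missing readings are represented as `None`, and simply dropping incomplete hours (which breaks time continuity) is an assumption made to keep the sketch short.

```python
from statistics import mean, pstdev


def to_hourly(half_hourly):
    """Sum each pair of consecutive half-hour readings [kWh] into one hourly value.
    Hours with a missing (None) reading are dropped."""
    hourly = []
    for i in range(0, len(half_hourly) - 1, 2):
        first, second = half_hourly[i], half_hourly[i + 1]
        if first is None or second is None:
            continue
        hourly.append(first + second)
    return hourly


def standard_scale(values):
    """Standard scaling: subtract the mean and divide by the population
    standard deviation. Assumes the series is not constant."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


def make_windows(series, lookback=12):
    """Build (last `lookback` hours, next hour) pairs for supervised training."""
    return [
        (series[i:i + lookback], series[i + lookback])
        for i in range(len(series) - lookback)
    ]
```

The default `lookback=12` mirrors the 12 h input window used by the forecasting model in this section.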
Metrics are an important characteristic for the development and testing of forecasting models. Indeed, some metrics could give a misleading picture of the performance of the model. FL models are known to converge to a middle point [Li_2020]. AI models work by minimizing the error of prediction with respect to reality. In a mixed environment where there are many realities, the models tend to minimize the mean of the loss across datasets. This tendency could lead an FL model to predict the average of each of the datasets and hence to offer promising mean squared error (MSE), Equation 1, and mean absolute error (MAE), Equation 2, which would be spurious.
Given this, in this paper, we would like to analyze not only MSE or MAE but also the mean absolute percentage error (MAPE), Equation 4 and root mean square error (RMSE), Equation 3. This effort is to increase objectivity of the results. The formal equations are as follows:
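Assuming the standard definitions of these metrics, with $y_i$ the observed consumption, $\hat{y}_i$ the forecast, and $n$ the number of samples (symbols chosen here for illustration), the four equations can be written as:

```latex
\begin{align}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2 \tag{1}\\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right| \tag{2}\\
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2} \tag{3}\\
\mathrm{MAPE} &= \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \tag{4}
\end{align}
```

Note that MAPE is undefined whenever an observed value $y_i$ equals zero, which can occur with household consumption data.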
3.3 Evaluating and selection of Centralized Deep Learning Models
The base of any AI model is the underlying learning method. Thus, we evaluate the performance of the formerly reviewed methods (Table 1) to compare them using the metrics previously stated in subsection 3.2. The comparison allows for the later selection of the model our scenarios will use. Each evaluation uses the dataset explained in subsection 3.1. The models, implemented following their corresponding articles or the code the authors provided, were trained over 100 epochs.
Accordingly, Figure 2 illustrates the performance results of the models tested. Some models perform worse on our dataset than what was claimed by the authors. In [kim2019predicting, app9204237], the authors calculated the metrics on a non-scaled dataset. This means that transforming the dataset before or after computing the metric could bias the results. For instance, the scaled MSE is equal to the non-scaled MSE multiplied by $1/\sigma^2$, $\sigma$ being the standard deviation of the dataset prior to standard-scaling. In other words, all calculated metrics have to be either scaled or not to offer a fair comparison. An analogous factor, $1/\sigma$, appears when measuring RMSE. Consequently, from now on we calculate all the displayed metrics with scaled data to standardize our analysis. The results from [marino2016building, s20051399] are similar, despite a remarkable difference in the number of network parameters. Training FL models is an intricate and expensive endeavour. Therefore, from now on in this paper, we use the model proposed by [marino2016building] as the foundation model for our scenarios.
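The effect of scaling on the error metrics can be checked numerically. The sketch below uses made-up numbers and hypothetical helper names. It standardizes both targets and predictions with the same mean and standard deviation and confirms that the MSE shrinks by a factor of $\sigma^2$ and the RMSE by a factor of $\sigma$.

```python
from math import sqrt
from statistics import mean, pstdev


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return mean([(a - b) ** 2 for a, b in zip(y_true, y_pred)])


def standardize(values, mu, sigma):
    """Apply a standard-scaler with a given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


# Illustrative values only; not taken from the dataset used in this paper.
y_true = [0.2, 0.5, 1.1, 0.9]
y_pred = [0.3, 0.4, 1.0, 1.2]

mu, sigma = mean(y_true), pstdev(y_true)
raw_mse = mse(y_true, y_pred)
scaled_mse = mse(standardize(y_true, mu, sigma), standardize(y_pred, mu, sigma))

# The shift by mu cancels in the difference; the division by sigma is squared.
assert abs(scaled_mse - raw_mse / sigma**2) < 1e-12
assert abs(sqrt(scaled_mse) - sqrt(raw_mse) / sigma) < 1e-12
```

Because the mean cancels in every difference, only the standard deviation affects the comparison between scaled and non-scaled errors.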
3.4 Our model parameters: Advancing A Secure, Decentralized Architecture for Federated Learning
Once we have selected a model, subsection 3.3, we need to (1) select the implementation of privacy-preserving techniques; (2) select the FL technique and (3) define our model parameters for the secure FL model. We mainly follow the steps provided by the authors in [mcmahan2018].
For the implementation of DP as an extension of our FL model, we chose to follow the steps provided by [mcmahan2018] rather than those in [dp_imperial_colleague, lu2019blockchain], where noise is added to the training dataset prior to training. In our DP implementation, we avoid modifying the entire original dataset and only obfuscate models per query. Additionally, we consider SecAgg following [SecAgg]. In our model, we choose to implement the Fed-Avg technique mainly for the reduced number of communication rounds needed. However, the later results are generalizable to Fed-SGD, as both operate in the same way.
The nature of Fed-Avg dictates that the central server will average the different client model updates at the end of a certain number of rounds, normally one. This average will define an estimator, which will bound the function’s sensitivity [dwork2014algorithmic]. Concerning the estimator, $\tilde{f}$, we consider it as the average of a set of clients’ updates $\Delta_k$. It implies, in our case, that all clients in the federation weigh the same. Hence, this is an arithmetic mean. Conversely, the estimator would need to change to a weighted average in case different clients weigh differently.
The next parameter to define is the sensitivity of the query function (Euclidean distance), taken from [dwork2014algorithmic], being $\mathbb{S}(\tilde{f}) = \max_{C,k} \lVert \tilde{f}(C \cup \{k\}) - \tilde{f}(C) \rVert_2$, where the user $k$ could have arbitrary data. Considering the first lemma from [mcmahan2018], the sensitivity is bounded as $\mathbb{S}(\tilde{f}) \leq S/(qW)$, where $S$ is the clipping bound on each client update, $q$ is the client sampling probability, and $W$ is the number of clients. The vectors $\Delta_k$ include the different model updates computed among the clients.
A further consideration is to maintain the gradient’s updates in a known range. To do so, a standard solution is to clip them by a defined value before averaging. There are two different strategies for clipping the values of a neural network: per-layer clipping and Flat Clipping [mcmahan2018]. In our case, to reduce complexity, we use Flat Clipping, with $S$ being the overall clipping parameter. Both strategies rely on the same principle; by layer or by network, the updates are projected onto an $\ell_2$ ball whose radius is the provided clipping value.
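Flat clipping can be sketched as a single rescaling of the whole update vector. The function name `flat_clip` and the representation of a model update as one flat list of floats are assumptions made for this illustration.

```python
import math


def flat_clip(update, clip_norm):
    """Project a flattened model update onto the l2 ball of radius `clip_norm`.

    Updates whose norm is already within the bound are left unchanged;
    larger updates are rescaled so that their norm equals `clip_norm`.
    """
    norm = math.sqrt(sum(v * v for v in update))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * factor for v in update]
```

Per-layer clipping would apply the same projection to each layer's parameters separately, with one clipping value per layer.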
Initially, clipping applied a fixed norm, which may not be the best solution; the authors in [andrew2021differentially] found an innovative strategy to adapt the clipping values based on a quantile of the distribution, known as adaptive clipping.
Another necessary parameter defines how the noise added to the model updates scales with the sensitivity to obtain a privacy guarantee. In our model, we add Gaussian noise. The Gaussian noise is defined by $\mathcal{N}(0, \sigma^2 I)$ for $\sigma = z \cdot \mathbb{S}$, where $z$ is the noise scale and $\mathbb{S}$ is the sensitivity of the query. In each query, all rows are selected ($q = 1$ in the first theorem of [mcmahan2018]).
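Putting clipping, averaging, and noise addition together, one noisy aggregation step might look like the sketch below. It is an illustration under stated assumptions, not the implementation used here: every client participates and has unit weight, so the sensitivity of the mean is taken as `clip_norm / n`, and the names `dp_average` and `noise_multiplier` are hypothetical.

```python
import math
import random


def dp_average(updates, clip_norm, noise_multiplier, rng=None):
    """Clip each client update, average them, and add Gaussian noise.

    Assumption: unweighted mean over all n clients, so the sensitivity is
    bounded by clip_norm / n and the noise standard deviation is
    noise_multiplier * clip_norm / n.
    """
    rng = rng or random.Random()
    n = len(updates)
    clipped = []
    for update in updates:
        norm = math.sqrt(sum(v * v for v in update))
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([v * factor for v in update])
    sigma = noise_multiplier * clip_norm / n
    dim = len(updates[0])
    return [
        sum(c[i] for c in clipped) / n + rng.gauss(0.0, sigma)
        for i in range(dim)
    ]
```

Setting `noise_multiplier=0.0` recovers plain clipped averaging, which is useful for checking the aggregation logic separately from the privacy noise.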
Finally, we need to estimate the privacy loss (privacy budget: $\epsilon$) generated by the addition of noise. For our calculation, we use the accountant provided by Rényi Differential Privacy (RDP) [Mironov_2017].
With the parameters defined above, the query is secured, and the FL model using Fed-Avg can start training shared models. DP secures the model updates by adding random noise, which prevents a malicious agent from reconstructing the original data from the model updates. DP, however, does not protect the communication channel between the clients and the central server. To do so, the authors of [SecAgg] build on the work of [damgaard2012multiparty], allowing "a group of mutually distrustful parties each hold a private value and collaborate to compute an aggregate value". In SecAgg, the model only learns that at least a threshold number of users participated, but not which users. SecAgg relies on two main algorithms: sharing and reconstruction.
The sharing algorithm transforms a secret into a set of shares of the secret, associated with different clients. These shares follow [shamir1979share]; hence, collusion between participants is insufficient to disclose other clients' private information. The reconstruction algorithm works in the opposite direction: it takes the shares from the clients and reconstructs the secret. In other words, clients share their secrets (model updates) with the server through a secure channel without the server being able to reconstruct the secrets. The central server forwards the received shares, encrypted with the other clients' public keys. Each client, upon reception, masks its input and sends it back to the server. Finally, the central server asks for the shares of a client and reconstructs the aggregated value (the secure sum of the clients' model updates) without knowing the individual secrets or the participating clients.
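The two algorithms, sharing and reconstruction, can be illustrated with a toy version of Shamir's scheme over a prime field. This is only a sketch under stated assumptions: the prime, the threshold, and the function names are hypothetical, the secret is a single integer rather than a model update, and the full SecAgg protocol (key agreement, masking, dropout handling) is not shown.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; all arithmetic is done modulo PRIME


def make_shares(secret, threshold, n_shares, rng):
    """Split secret into n_shares points of a random polynomial of degree
    threshold - 1 whose constant term is the secret."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# random.Random is used only for reproducibility of the sketch; a real
# deployment would need a cryptographically secure source of randomness.
rng = random.Random(0)
shares = make_shares(123456, threshold=3, n_shares=5, rng=rng)
recovered = reconstruct(shares[:3])
```

Any three of the five shares recover the secret, while fewer than three do not, which is the property that makes collusion below the threshold insufficient.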
3.5 Model operation
Having covered all the parameters necessary to secure an FL model, it remains to describe how the model operates. This section describes the six main steps the model follows to compute the forecasts, and Figure 3 illustrates the entire process.
Firstly, the central server selects a model architecture and initializes it. Here, we initialize the model using Glorot initialization [glorot2010understanding]. Secondly, the central server shares the model with the respective clients. Thirdly, each client trains its local model on its local data. Fourthly, clients send their updates (with respect to the initial model) to the central server after each epoch. Fifthly, the central server averages these model updates and, in the case of DP, adds noise drawn from a Gaussian distribution. The noise multiplier is the ratio of the standard deviation of the noise to the query sensitivity. Sixthly and finally, the server returns the model (perturbed with noise if DP is considered) to the clients. The clients continue this training process, sending and receiving updates, until they reach their common goal.
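The server-side part of one communication round (averaging, optional noise, returning the model) can be sketched as follows. This is a simplified illustration and not the TFF code used in the simulations: weights and updates are flat lists, the function name `dp_fedavg_round` is hypothetical, and the sensitivity is assumed to be the clipping value divided by the number of clients, which holds for an unweighted mean when every client participates.

```python
import math
import random


def clip(update, clip_norm):
    """Scale an update so that its l2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_fedavg_round(global_weights, client_updates, clip_norm,
                    noise_multiplier, rng):
    """One server step: clip, average, add Gaussian noise, apply."""
    n = len(client_updates)
    clipped = [clip(u, clip_norm) for u in client_updates]
    mean = [sum(vals) / n for vals in zip(*clipped)]
    # Assumption: sensitivity of the unweighted mean is clip_norm / n.
    stddev = noise_multiplier * clip_norm / n
    noisy = [m + rng.gauss(0.0, stddev) for m in mean]
    return [w + d for w, d in zip(global_weights, noisy)]


# Illustrative values only: three clients, two parameters.
weights = [0.0, 0.0]
updates = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
no_dp = dp_fedavg_round(weights, updates, 1.0, 0.0, random.Random(0))
with_dp = dp_fedavg_round(weights, updates, 1.0, 1.0, random.Random(0))
```

Setting the noise multiplier to zero recovers plain clipped Fed-Avg, which makes the effect of the noise easy to isolate when comparing runs.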
If we apply SecAgg, the main steps remain the same. However, the protocol requires a set of cryptographic primitives to secure the communication between the clients and the server. Additionally, the protocol requires further communication rounds for sharing the clients' secrets and public keys.
4 Simulation and evaluation of the scenarios
4.1 Scenarios Design
We design a set of scenarios to analyze the performance that FL could bring to STLF. These scenarios enable us to systematically evaluate a classical FL setup, data correlation, the FL architecture, and FL with privacy-preserving techniques. In total, we designed five scenarios (lettered A to E), summarized in Table 2. For the first three scenarios (A, B and C), we took a performance perspective and assessed, in terms of metrics, how the models described in the literature behave in a decentralized environment. The following two scenarios (D and E) complement these models with privacy-preserving techniques (DP and SecAgg). In each scenario, we run eight simulations. Each simulation assesses the behaviour of the models in a different setting, where we scale (increase) the number of clients. The scaling process uses eight buckets, which contain 2, 5, 8, 11, 14, 17, 20, and 23 clients respectively. Each client contains data from exactly one LCLid. We limit the number of clients (LCLids) to 23 due to the severe computational burden of training the federated models. For instance, the model trained with [s20051399] as baseline model needed around eight days to finish.
|Scenarios||Privacy-Preserving Method||Baseline Model||Imposed Correlation|
|A||-||Marino et al.|
|B||-||Marino et al.|
|C||-||Khan et al.|
|D||Differential Privacy||Marino et al.|
|E||Secure Aggregation||Marino et al.|
4.2 Simulation Environment
The simulations were conducted using the high performance computing (HPC) facilities of the University of Luxembourg [HPC] within the IRIS Cluster. Depending on the availability of the graphics processing units (GPUs), the federations run in an environment with 32 Intel Skylake cores and two NVIDIA Tesla V100 cards with 16 GB or 32 GB. The federation code is written in Python, based on the framework provided by TensorFlow Federated (TFF, https://github.com/tensorflow/federated), whereas the DL models are written in Keras [chollet2015keras]. Concerning the timeline, the simulations lasted around 800 hours, distributed over June, July, and August 2021.
4.3 Simulation Results
4.3.1 Scenario A
This scenario analyzes the scaling performance of the DL architecture designed by [marino2016building] in a classical (decentralized) FL setup. We scale the number of clients in the federation as follows: 2, 5, 8, 11, 14, 17, 20, and 23 clients. We use an FL architecture without any privacy-preserving techniques. Each client uses data from one random LCLid taken from the final dataset (Section 3.1). Furthermore, in this scenario we impose no data correlation among the clients. From an FL training perspective, we trained the FL model over 300 communication rounds. Each client uses an SGD optimizer and its learning rate as considered in the original code of [mcmahan2018].
In Table 3 we collect the metrics results expressed in absolute values and the training time per round needed expressed in seconds [s].
|Federation size||MSE||RMSE||MAE||MAPE||Time per round [s]|
The MAE and MAPE improve as the number of clients increases, as displayed in Table 3, whereas the MSE and RMSE remain almost constant. These results underline the importance of collecting several metrics when analyzing any forecasting model. The improvement obtained from the larger quantity of data when scaling the number of LCLids corroborates the known fact that DL training benefits from more data, as shown in [hussein2019impact]. However, FL is not a categorical example of this behaviour, because different clients might hold conflicting data that drag down the performance of the models. In FL, it is not about the amount of data but rather about the quality of data. Nevertheless, in our FL setup the computational time also increases with the number of clients, potentially creating time constraints for our FL model.
Additionally, Figure 4 shows the MAPE of the models over the training rounds for the eight simulations. We observe a quasi-logarithmic decrease in the MAPE over the 300 rounds. We investigated the spikes and found that they are due to the data itself, where there are significant differences in the consumption data of an input batch. Nevertheless, throughout the 300 rounds, the MAPE obtained lies between 0.20 and 0.35, which can be considered a reasonable forecast according to [lewis1982industrial].
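For reference, the four metrics reported in the result tables can be computed as in the sketch below. It uses the standard textbook definitions and returns the MAPE as a fraction, which matches the 0.20 to 0.35 range quoted above; the function name and the sample values are illustrative assumptions, not data from the simulations.

```python
import math


def forecast_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and MAPE (as a fraction) for two sequences.

    MAPE is undefined when a true value is zero, so this sketch assumes
    strictly non-zero consumption values.
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape}


# Hypothetical consumption values in kWh and their forecasts.
metrics = forecast_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```

Because the MSE and RMSE square the errors, they are dominated by a few large deviations, whereas the MAE and MAPE weigh all errors linearly. This is one reason why the two groups of metrics can move differently as the federation grows.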
4.3.2 Scenario B
In this scenario, we analyze the effect of data correlation on performance in a classical FL setup, building on Scenario A (Section 4.3.1). We measure correlation with the Pearson correlation coefficient, as in [en13174408]. Scenario B considers only data from specific ACORN groups (H and L), which serves as a first correlation filter. Then, for each federation size (2, 5, 8, 11, 14, 17, 20, and 23 clients), we calculate all possible non-repeated combinations of clients, compute their correlation, and select the combination with the highest correlation.
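The selection procedure can be sketched as follows. The text does not state how the correlation of a whole combination is aggregated, so this sketch assumes the mean pairwise Pearson coefficient; the function names and the four toy series are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def most_correlated_group(series_by_id, size):
    """Among all non-repeated combinations of `size` clients (size >= 2),
    return the one with the highest mean pairwise correlation."""
    best_ids, best_score = None, -math.inf
    for ids in combinations(sorted(series_by_id), size):
        pairs = list(combinations(ids, 2))
        score = sum(pearson(series_by_id[a], series_by_id[b])
                    for a, b in pairs) / len(pairs)
        if score > best_score:
            best_ids, best_score = ids, score
    return best_ids, best_score


# Toy consumption series keyed by made-up client identifiers.
series = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.5],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, 3.0, 2.0, 5.0],
}
best_ids, best_score = most_correlated_group(series, size=2)
```

The number of combinations grows quickly with the pool of candidate clients, so an exhaustive search of this kind is only practical for a limited pool, such as the one left after the group filter.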
As in Scenario A (Section 4.3.1), we collect the results in Table 4, where we display the metric values and the correlation rate, both expressed in absolute values. The computation time is the same as in Scenario A (Section 4.3.1).
|Federation size||MSE||RMSE||MAE||MAPE||Correlation rate|
With correlation applied, the results are better than in the previous scenario in almost every metric. There are only 12 cases where some of the metrics decrease in performance, and only one federation size for which the metrics are consistently worse. However, the performance decrease is between 1% and 3.6%. For federation size 23, the correlation rate is too low to produce a change in performance. Other buckets, in contrast, significantly increased their performance, by up to 60% in MSE. Overall, the average metric performance increases, and thus the error decreases: the MSE improves by 12.98%, the RMSE by 7.62%, the MAE by 2.769%, and the MAPE by 4.39%. These results are in accordance with similar simulations in which the application of K-means to cluster customers offers a performance gain between 10% and 15%. Hence, merely using correlated data tends to improve the metrics. From an energy point of view, the increase in performance can potentially further reduce the imbalance costs caused by forecasting errors. From a general power system perspective, the increase in forecasting accuracy is a positive outcome, as the system operator could also plan assets and calculate potential congestions better.
4.3.3 Scenario C
In this scenario, we explore how a bigger DL architecture, in terms of parameters, impacts the metrics. The motivation for Scenario C comes from our preliminary conceptual review presented in Section 2. We noticed that the model of [s20051399] behaves similarly on our dataset and in their reported results. Their DL architecture is more complex, as it includes more parameters, and its size is almost five times that of the baseline model used in Scenarios A and B. Given its size and the potential computational burden, we implement three modifications. The first modification concerns the GPUs, where we modify the settings of each of the two NVIDIA Tesla cards allocated on the HPC. For each of them, we create two virtual cards, resulting in four cards for the FL model to train on. The second modification is the batch size, which we increase from 100 to 200. Ideally, the larger batch size should help prevent overfitting, since more data entries are available to compute the loss of the model. Finally, we modify the NN implementation of the DL architecture proposed by [s20051399]: we replace the initially proposed LSTM layers with CuDNNLSTM layers [appleyard2016optimizing]. This replacement enables the LSTMs to use the Compute Unified Device Architecture (CUDA) kernels of our Tesla GPUs to reduce the computation time.
As in the previous scenarios, we collect the metric results obtained from our simulations in Table 5, in absolute values, together with the average computational time expressed in seconds [s].
|Federation size||MSE||RMSE||MAE||MAPE||Time per round [s]|
The results of Scenario C clearly show the computational burden of training a complex FL model. The recorded computational time is almost five times that of the previous scenarios. Concerning the metrics, performance increases for only 8 out of 32 metrics when compared to Scenario A, in some buckets for all metrics and in others only for the MSE and RMSE. These increases range from 13% up to 68%. In contrast, performance decreases for the remaining 24 metrics, by between 0.39% and 264%.
These results point to a clear case of overfitting, where the FL model's performance decreases dramatically as the number of clients increases. It is known that models in FL tend to converge to a middle point [Li_2020] where all the different clients find their local minimum. Overfitting is usually defined as a model's lack of generalization. An overfitted model has crossed the line between learning tendencies or patterns and memorizing the data received as input. During Fed-Avg, the models are averaged at every communication round. Averaging models that have learned patterns results in new models that capture shared patterns. When overfitted models are averaged, the result hardly differs from noise. Furthermore, the overfitting is clearly visible in the MAPE results over the 300 rounds, depicted in Figure 5. The green line, corresponding to two LCLids, shows a sharp slope within the first 40 rounds.
4.3.4 Scenario D
Scenario D focuses on implementing DP as a privacy-preserving technique and analyzing its impact on our FL model. Adding noise in DP is not a trivial pursuit, as the noise must be scaled adequately for a given dataset to maintain both privacy and performance. Furthermore, within flat clipping (explained in Section 2.2), we implement and compare two approaches, fixed clipping and adaptive clipping (explained in Section 3.4). We only consider one federation size, 17, in this scenario, since all federation sizes follow the same logic for implementing DP.
The first technique implemented is fixed clipping, following the steps in [mcmahan2018]. There are two main steps. The first is to identify the lowest possible clipping value, which we treat as a hyper-parameter of our model. Clipping can negatively affect the convergence rate of any model, as it clips all values larger than the clipping value. The second is to identify a tolerable level of noise for our simulations. These values allow us to compute the privacy guarantee of our model.
We identify the lowest clipping value by iteratively calculating all metrics for our FL model over a range of clipping values in steps of 0.05, where the starting point follows the recommendations of [mcmahan2018]. To select the lowest clipping value, we use the results for k=17 in Scenario A as a metric benchmark and look for a clipping value with similar performance. We collect the iteration results in Table 6 and select our fixed clipping value from them.
Once the lowest clipping parameter is identified, we can compute the standard deviation of the noise from the clipping value, the expected number of clients, and the noise scale. As in the previous step, we treat the noise scale as a hyper-parameter and proceed in iterations. This lets us obtain different noise levels to add to the training process and calculate the privacy guarantee, which we compute using the Rényi Differential Privacy Accountant [Mironov_2017]. Table 7 collects the metric results and the calculated privacy guarantee obtained in the iterative process.
Considering the obtained values, the addition of DP distorts the updates by adding noise, which reduces performance. When comparing the results in Table 8 with the results for the respective federation size (17) in Scenario A (Table 3), the performance decreases on average. The decrease grows with the noise scale, resulting at the end of our simulations in an average performance decrease in the metrics of 27%. However, we should contextualize the results. The measurements obtained can be considered accurate results in themselves: the metrics are relatively low, displaying a decent forecasting performance, although with a higher error than in Scenario A (no DP). In return, DP provides a privacy guarantee of (1.39, δ)-DP, where lower is better. Our privacy results are thus close to perfect privacy (ε = 0).
|Noise scale||Stddev Noise||Privacy Guarantee (ε)||MSE||RMSE||MAE||MAPE|
The second technique implemented for our analysis is adaptive clipping. For our FL model, we consider the implementation of [andrew2021differentially], whose algorithm iteratively adjusts the clipping norm, trying to approximate a fixed quantile. Hence, contrary to fixed clipping, there is no need to search for the lowest clipping value and noise. The clipping adapts per round. Figure 6 shows the evolution and adjustment of the clipping value over the training rounds. The spike at the very beginning is caused by the low initial clipping value. With such a low value, few data points participate in selecting the following clipping values in the initial rounds. The fewer the data points, the more difficult it is to estimate the optimal value. Consequently, the size of the steps required to make this estimate increases exponentially until it reaches the real quantile, resulting in even steeper growth.
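The per-round adjustment can be sketched with the geometric update rule proposed in [andrew2021differentially], in which the clipping norm is multiplied by an exponential of the gap between the observed fraction of updates below the norm and the target quantile. The sketch omits the noise that the differentially private version adds to this fraction, and the update norms, target quantile, and learning rate below are made-up values.

```python
import math


def update_clip_norm(clip_norm, update_norms, target_quantile, learning_rate):
    """One adaptive step: grow the norm if too few updates fall below it,
    shrink it if too many do."""
    below = sum(1 for n in update_norms if n <= clip_norm) / len(update_norms)
    return clip_norm * math.exp(-learning_rate * (below - target_quantile))


# Hypothetical client update norms, kept constant across rounds.
norms = [0.8, 1.0, 1.2, 1.5, 2.0]
clip_norm = 0.1  # deliberately low starting value
history = [clip_norm]
for _ in range(50):
    clip_norm = update_clip_norm(clip_norm, norms,
                                 target_quantile=0.5, learning_rate=0.2)
    history.append(clip_norm)
```

Starting from a low value, no update norm falls below the clipping norm, so the norm grows by a constant factor every round. This geometric growth is consistent with the steep initial rise described above, after which the norm settles around the target quantile of the update norms.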
We collect the metrics obtained with adaptive clipping in Table 9. As with fixed clipping, there is a decrease in performance compared to Scenario A; although smaller than with fixed clipping, it is still around 21%. Hence, adaptive clipping does improve performance relative to fixed clipping. Nonetheless, the same contextualization as for fixed clipping applies. The results are technically worse than an FL setup with no DP, yet the performance decrease is justified because privacy is guaranteed. The privacy guarantee obtained in this scenario is (2.01, δ)-DP. Even though it is slightly worse than the one obtained for fixed clipping, it is still a remarkable privacy guarantee.
|Noise scale||Stddev Noise||Privacy Guarantee (ε)||MSE||RMSE||MAE||MAPE|
4.3.5 Scenario E
In this scenario, the focus is on SecAgg as a different method to bring privacy into an FL model. Contrary to the previous scenario, where DP adds random noise drawn from a Gaussian distribution, SecAgg targets the communication and aggregation of the clients. Hence, there is no trade-off as in Scenario D, where it is mandatory to find a suitable noise level. As for Scenarios A, B and C, we collect the eight simulation results, one per federation size, in Table 10. We express the metrics in absolute values and the average computation time in seconds [s]. Furthermore, we complement the results with Figure 7, where the error in terms of MAPE decreases almost logarithmically and stabilizes at the end of the 300 rounds.
According to the results in Table 10, the application of SecAgg has a negligible effect on the computation time of the model compared with Scenario A and even Scenario B. Moreover, since it adds no noise, SecAgg provides better metric performance than DP.
|Federation size||MSE||RMSE||MAE||MAPE||Time per round [s]|
This work discusses the application of federated learning, a collaborative technique, to short term load forecasting, using deep learning models originally designed for centralized settings and adding privacy-preserving techniques to the federated learning model.
This paper first examined the most relevant short term load forecasting literature against our dataset to choose a deep learning architecture. The chosen one, a Sequence-to-Sequence Encoder-Decoder with two long short-term memory layers, served as the foundation for our federated model. The stepwise analysis over different considerations (size, correlation, and even a different deep learning architecture) let us achieve promising results for the application of federated learning to residential short term load forecasting. The results align with those reported in prior work such as [FEKRI2021107669]: (1) performance increases when a federation trains on highly correlated data, (2) bigger models tend to overfit, which affects performance, and (3) the deep learning architecture strongly impacts computation time. Concerning the metrics themselves, the application of federated learning to residential short term load forecasting is encouraging, as the obtained metrics show low error.
This paper then covered the application of privacy-preserving techniques to short term load forecasting in a federated learning setting. To our knowledge, this is the first paper to cover them. We introduced Differential Privacy and Secure Aggregation to obtain a secure setting in an FL context and to explore the performance decrease they cause when applied. Their application results in a minimal error increase, although every error increase is a step in the opposite direction. However, the inclusion of privacy is a feature that residential short term load forecasting needs in some jurisdictions. Either way, the privacy guarantee obtained is remarkably close to a secure theoretical setting (ε = 0). We obtained (1.39, δ) and (2.01, δ) as the best privacy budgets in (ε, δ) terms for the fixed and adaptive clipping cases, respectively. Furthermore, our analysis also examined a fundamental parameter, clipping, within the DP case. The FL models behave consistently with both fixed and adaptive clipping. However, adaptive clipping performs marginally better, as expected from [andrew2021differentially].
The main lesson learned from our analysis is that finding an adequate trade-off between noise, performance, and utility is not a trivial endeavour. Nonetheless, after the analysis, we can posit the following. (1) An initial scaling of the data positively affects the privacy budget because it reduces the sensitivity of the query. (2) There is no significant difference between fixed and adaptive clipping. Still, the iterative nature of the latter is more suitable for real-world scenarios due to its rapid convergence rate and because it removes the need to search for a clipping value beforehand. (3) The addition of secure aggregation barely affects performance; hence, the time required to create a secure configuration is worth the service it provides. (4) Although our environment has various constraints, such as the maximum of 23 smart meters, we can discuss the benefits that adding a privacy layer to residential short-term load forecasting could provide. We are able to protect the clients' privacy and provide acceptable forecasting performance because the collaborative approach, federated learning, already improves forecasting performance.
Finally, the next steps in this research are (1) to assess bigger (scaled) setups with additional correlation indicators, such as the existence of distributed energy resources (e.g., photovoltaics, electric vehicles, or home energy management systems), in order to improve correlation, and (2) to investigate data input disruptions produced by a hostile agent or by a malfunctioning smart metering device.
CRediT authorship contribution statement
Conceptualization, J.D.F, S.P.M, C.L; Methodology, J.D.F, S.P.M, C.L; Data Curation, J.D.F, S.P.M; Writing - Original Draft, J.D.F, S.P.M, C.L; Software J.D.F; Supervision G.F.; Writing - Review & Editing, J.D.F, S.P.M, C.L, G.F.; Visualization, J.D.F, S.P.M.; Funding acquisition, G.F. All authors have read and agreed to the published version of the manuscript.
Declaration of Competing Interest
The authors declare no conflict of interest.
This work has been supported by funding from the European Union (EU) within its Horizon 2020 programme, project MDOT (Medical Device Obligations Taskforce), Grant agreement 814654, and from the Kopernikus-project “SynErgie” by the German Federal Ministry of Education and Research (BMBF).