1 Introduction
Smart grids aim at creating automated and efficient energy delivery networks which improve power delivery reliability and quality, along with network security, energy efficiency, and demand-side management [1]. Modern power distribution systems are supported by advanced monitoring infrastructures that produce immense amounts of data, thus enabling fine-grained analytics and improved forecasting performance. In particular, electric load forecasting emerges as a critical task in the energy field, as it provides valuable support for decision-making, enabling optimal pricing strategies, seamless integration of renewables and maintenance cost reductions. Load forecasting is carried out at different time horizons, ranging from milliseconds to years, depending on the specific problem at hand.
In this work we focus on the day-ahead prediction problem, also referred to in the literature as short-term load forecasting (STLF) [2]. Since the deregulation of electric energy distribution and the wide adoption of renewables strongly affect daily market prices, STLF is of fundamental importance for an efficient power supply [3]. Furthermore, we differentiate forecasting by the granularity level at which it is applied. For instance, in the individual household scenario, load prediction is rather difficult, as power consumption patterns are highly volatile. On the contrary, aggregated load consumption, i.e., that associated with a neighborhood, a region, or even an entire state, is normally easier to predict, as the resulting signal exhibits slower dynamics.
Historical power loads are time series affected by several external time-variant factors, such as weather conditions, human activities, and temporal and seasonal characteristics, which make their prediction a challenging problem. A large variety of prediction methods has been proposed for electric load forecasting over the years; only the most relevant ones are reviewed in this section. Autoregressive moving average (ARMA) models were among the first model families used in short-term load forecasting [4, 5]. Soon they were replaced by ARIMA and seasonal ARIMA models [6] to cope with the time variance often exhibited by load profiles. In order to include exogenous variables like temperature in the forecasting method, these model families were extended to ARMAX [7, 8] and ARIMAX [9]. The main shortcoming of these system identification families is the linearity assumption on the system being observed, a hypothesis that does not generally hold. To overcome this limitation, nonlinear models like Feed Forward Neural Networks were proposed and became attractive for scenarios exhibiting significant nonlinearity, as in load forecasting tasks [10, 11, 12, 13, 3]. The intrinsic sequential nature of time series data was then exploited by considering more sophisticated techniques, ranging from advanced feed-forward architectures with residual connections [14] to convolutional approaches [15, 16] and Recurrent Neural Networks [17, 18], along with their many variants such as Echo State Networks [19, 20, 18], LSTMs [21, 22, 23, 18] and GRUs [24, 18]. Moreover, some hybrid architectures have also been proposed, aiming to capture the temporal dependencies in the data with recurrent networks while performing a more general feature extraction with convolutional layers
[25, 26].

Reference | Predictive Family of Models | Time Horizon | Exogenous Variables | Dataset (Location)
[18] | LSTM, GRU, ERNN, NARX, ESN | D | - | Rome, Italy
[18] | LSTM, GRU, ERNN, NARX, ESN | D | T | New England [27]
[28] | ERNN | H | T, H, P, other | Palermo, Italy
[17] | ERNN | H | T, W, H | Hubli, India
[20] | ESN | 15 min to 1 Y | - | Sceaux, France [29]
[21] | LSTM, NARX | D | - | Unknown
[22] | LSTM | D(?) | C, TI | Australia [30]
[23] | LSTM | 2 W to 4 M | T, W, H, C, TI | France
[31] | LSTM | 2 D | T, P, H, C, TI | Unknown [32]
[33] | LSTM, seq2seq-LSTM | 60 H | C, TI | Sceaux, France [29]
[34] | LSTM, seq2seq-LSTM | 12 H | T, C, TI | New England [35]
[24] | GRU | D | T, C, other | Dongguan, China
[15] | CNN | D | C, TI | USA
[16] | CNN | D | C, TI | Sceaux, France
[25] | CNN + LSTM | D | T, C, TI | North China
[26] | CNN + LSTM | D | - | North Italy
Different reviews address the load forecasting topic by means of (not necessarily deep) neural networks. In [36] the authors focus on the use of some deep learning architectures for load forecasting. However, this review lacks a comprehensive comparative study of performance verified on common load forecasting benchmarks; the absence of a valid cost-performance metric does not allow the report to make conclusive statements. In [18] an exhaustive overview of recurrent neural networks for short-term load forecasting is presented; that very detailed work, however, considers only one-layer (not deep) recurrent networks. A comprehensive summary of the most relevant research works dealing with STLF by means of recurrent neural networks, convolutional neural networks and seq2seq models is presented in Table . It emerges that most of the works have been performed on different datasets, making it rather difficult, if not impossible, to assess their absolute performance and, consequently, to recommend the best state-of-the-art solutions for load forecasting.
In this survey we consider the most relevant and recent deep architectures and contrast them in terms of prediction accuracy on open-source benchmarks. The considered architectures include recurrent neural networks, sequence-to-sequence models and temporal convolutional neural networks. The experimental comparison is performed on two different real-world datasets which are representative of two distinct scenarios. The first one considers power consumption at the individual household level, with a signal characterized by high-frequency components, while the second one takes into account the aggregation of several consumers. Our contributions are:

A comprehensive review. The survey provides a comprehensive investigation of the deep learning architectures known to the smart grid literature, as well as novel recent ones suitable for electric load forecasting.

A multi-step prediction strategy comparison for recurrent neural networks. We study and compare how different prediction strategies can be applied to recurrent neural networks. To the best of our knowledge, such a comparison has not yet been carried out for deep recurrent neural networks.

A relevant performance assessment. To the best of our knowledge, the present work provides the first systematic experimental comparison of the most relevant deep learning architectures for the forecasting of individual and aggregated electric demand. It should be noted that the envisaged architectures are domain-independent and, as such, can be applied in different forecasting scenarios.
The rest of this paper is organized as follows.
In Section we formally introduce the forecasting problem along with the notation used throughout this work.
In Section we introduce Feed Forward Neural Networks (FNNs) and the main concepts relevant to the learning task. We also provide a short review of the literature regarding the use of FNNs for the load forecasting problem.
In Section we provide a general overview of Recurrent Neural Networks (RNNs) and their most advanced architectures: Long Short-Term Memory and Gated Recurrent Unit networks.
In Section , Sequence-to-Sequence architectures (seq2seq) are discussed as a general improvement over recurrent neural networks. We present both simple and advanced models built on the sequence-to-sequence paradigm.
In Section , Convolutional Neural Networks are introduced and one of their most recent variants, the Temporal Convolutional Network (TCN), is presented as a state-of-the-art method for univariate time-series prediction.
In Section the real-world datasets used for model comparison are presented. For each dataset, we provide a description of the preprocessing operations and of the techniques used to validate the models' performance.
Finally, in Section we draw conclusions based on the performed assessments.
2 Problem Description
In basic multi-step ahead electric load forecasting a univariate time series $y_1, y_2, \dots$ spanning several years is given. In this work, input data are presented to the different predictive families of models as a regressor vector composed of fixed time-lagged data associated with a window of length $m$ which slides over the time series. Given this fixed-length view of past values, a predictor aims at forecasting the next $k$ values of the time series. In this work the forecasting problem is studied as a supervised learning problem. As such, given the input vector at discrete time $t$, defined as $\mathbf{x}_t = [y_{t-m+1}, \dots, y_t]^\top$, the forecasting problem requires inferring the next $k$ measurements or a subset of them. To ease the notation we express the input and output vectors in the reference system of the time window instead of that of the time series. Following this approach, the input vector at discrete time $t$ becomes $\mathbf{x} = [x_1, \dots, x_m]^\top$ and the corresponding output vector is $\mathbf{y} = [y_{m+1}, \dots, y_{m+k}]^\top$, which characterizes the real output values. Similarly, we denote by $\hat{\mathbf{y}} = [\hat{y}_{m+1}, \dots, \hat{y}_{m+k}]^\top$ the prediction vector provided by a predictive model whose parameter vector $\boldsymbol{\theta}$ has been estimated by optimizing a performance function.
Without loss of generality, in the remainder of the paper we drop the subscript $t$ from the inner elements of $\mathbf{x}$ and $\mathbf{y}$. The introduced notation, along with the sliding window approach, is depicted in Figure .
In certain applications we will additionally be provided with $d$ exogenous variables (e.g., temperatures), each of which represents a univariate time series aligned in time with the electricity demand data. In this scenario the components of the regressor vector become vectors themselves, i.e., $\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_m]$. Indeed, each element of the input sequence is represented as $\mathbf{x}_i = [y_i, z_i^{(1)}, \dots, z_i^{(d)}]^\top$, where $y_i$ is the scalar load measurement at time $i$, while $z_i^{(j)}$ is the scalar value of the $j$-th exogenous feature.
The nomenclature used in this work is given in Table .
Notation | Description
$m$ | window size of the regressor vector
$k$ | time horizon of the forecast
$d$ | number of exogenous variables
$x$ | scalar value
$\mathbf{x}$, $\mathbf{X}$ | vector/matrix
$\mathbf{x}^\top$ | vector/matrix transposed
$\odot$ | element-wise product
$*$ | convolution operator
$*_d$ | dilated convolution operator
$l$ | index of the layer
$\mathbf{h}$ | hidden state vector
$\boldsymbol{\theta}$ | model's vector of parameters
$n_h$ | number of hidden neurons
3 Feed Forward Neural Networks
Feed Forward Neural Networks (FNNs) are parametric model families characterized by the universal function approximation property
[37]. Their computational architectures are composed of a layered structure consisting of three main building blocks: the input layer, the hidden layer(s) and the output layer. The number of hidden layers $L$ determines the depth of the network, while the size of each layer, i.e., the number of hidden units, defines its complexity in terms of neurons. FNNs provide only direct forward connections between two consecutive layers, each connection associated with a trainable parameter; note that, given the feed-forward nature of the computation, no recursive feedback is allowed. More in detail, given a vector $\mathbf{x}$ fed to the network input, the FNN's computation can be expressed as:

$$\mathbf{h}^{(l)} = g\left(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right) \quad (1)$$
$$\hat{\mathbf{y}} = \mathbf{W}^{(L+1)}\mathbf{h}^{(L)} + \mathbf{b}^{(L+1)} \quad (2)$$

where $\mathbf{h}^{(0)} = \mathbf{x}$, $l = 1, \dots, L$ and $g(\cdot)$ is a nonlinear activation function.
Each layer $l$ is characterized by its own parameter matrix $\mathbf{W}^{(l)}$ and bias vector $\mathbf{b}^{(l)}$. Hereafter, in order to ease the notation, we incorporate the bias term in the weight matrix; $\boldsymbol{\theta}$ groups all the network's parameters. Given a training set of $N$ input-output vectors in the $(\mathbf{x}_i, \mathbf{y}_i)$ form, $i = 1, \dots, N$, the learning procedure aims at identifying a suitable configuration of parameters $\hat{\boldsymbol{\theta}}$ that minimizes a loss function $\mathcal{L}(\boldsymbol{\theta})$ evaluating the discrepancy between the estimated values $\hat{\mathbf{y}}_i$ and the measurements $\mathbf{y}_i$. The mean squared error:

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \left\| \hat{\mathbf{y}}_i - \mathbf{y}_i \right\|_2^2 \quad (3)$$

is a very popular loss function for time series prediction and, not rarely, a regularization penalty term is introduced to prevent overfitting and improve the generalization capabilities of the model:

$$\mathcal{L}_{reg}(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) + \lambda\,\Omega(\boldsymbol{\theta}) \quad (4)$$

The most used regularization scheme controlling model complexity is the L2 regularization $\Omega(\boldsymbol{\theta}) = \|\boldsymbol{\theta}\|_2^2$, $\lambda$ being a suitable hyperparameter controlling the regularization strength.
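A minimal NumPy sketch of the forward pass and of the regularized loss defined by Equations (1)-(4); the toy network sizes, initialization and random data are our own choices, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def fnn_forward(x, params):
    """Forward pass: tanh hidden layers followed by a linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(W @ h + b)           # hidden layer: affine map + nonlinearity
    W, b = params[-1]
    return W @ h + b                     # linear output layer

def regularized_mse(params, X, Y, lam):
    """Mean squared error plus an L2 penalty on the weights."""
    preds = np.array([fnn_forward(x, params) for x in X])
    mse = np.mean(np.sum((preds - Y) ** 2, axis=1))
    l2 = sum(np.sum(W ** 2) for W, _ in params)
    return mse + lam * l2

# toy network: 4 lagged inputs -> 8 hidden units -> 2 forecast steps
params = [(rng.normal(scale=0.1, size=(8, 4)), np.zeros(8)),
          (rng.normal(scale=0.1, size=(2, 8)), np.zeros(2))]
X = rng.normal(size=(16, 4))             # 16 training windows
Y = rng.normal(size=(16, 2))
loss = regularized_mse(params, X, Y, lam=1e-3)
```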
As the loss in Equation (4) is not convex w.r.t. $\boldsymbol{\theta}$, the solution cannot be obtained in closed form with linear equation solvers or convex optimization techniques. Parameter estimation (the learning procedure) operates iteratively, e.g., by leveraging the gradient descent approach:

$$\boldsymbol{\theta}_{j+1} = \boldsymbol{\theta}_j - \eta\,\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_j) \quad (5)$$

where $\eta$ is the learning rate and $\nabla_{\boldsymbol{\theta}}\mathcal{L}$ the gradient of the loss w.r.t. $\boldsymbol{\theta}$. Stochastic Gradient Descent (SGD), RMSProp [38], Adagrad [39] and Adam [40] are popular learning procedures. The learning procedure yields the estimate $\hat{\boldsymbol{\theta}}$ associated with the predictive model. In our work, deep FNNs are the baseline model architectures.
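As a concrete illustration of the update rule in Equation (5), a toy one-parameter loss can be minimized with plain gradient descent; the quadratic loss and all values below are our own example, not taken from the paper:

```python
# minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta, eta = 0.0, 0.1                    # initial parameter and learning rate
for _ in range(200):
    grad = 2.0 * (theta - 3.0)           # dL/dtheta at the current iterate
    theta -= eta * grad                  # theta_{j+1} = theta_j - eta * grad
# theta has converged close to the minimizer 3.0
```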
In multi-step ahead prediction the output layer dimension coincides with the forecasting horizon $k$. The dimension of the input vector depends also on the presence of exogenous variables; this aspect is further discussed in Section .
3.1 Related Work
The use of Feed Forward Neural Networks in short-term load forecasting dates back to the 90s. The authors in [11] propose a shallow neural network with a single hidden layer to provide a 24-hour forecast using both load and temperature information. In [10] a one-day-ahead forecast is implemented using two different prediction strategies: one network provides all 24 forecast values in a single shot (MIMO strategy), while another single-output network provides the day-ahead prediction by recursively feeding back its latest estimate (recurrent strategy). The recurrent strategy proves to be more efficient in terms of both training time and forecasting accuracy. In [41] the authors present a feed-forward neural network to forecast electric loads on a weekly basis. The sparsely connected feed-forward architecture receives the load time series, temperature readings, as well as the time and day of the week. It is shown that the extra information improves the forecast accuracy compared to an ARIMA model trained on the same task. [12] presents one of the first multilayer FNNs to forecast the hourly load of a power system.
A detailed review concerning applications of artificial neural networks in short-term load forecasting can be found in [3]. However, this survey dates back to the early 2000s and does not discuss deep models. More recently, architectural variants of feed-forward neural networks have been used; for example, in [14] a ResNet [42] inspired model is used to provide a day-ahead forecast by leveraging a very deep architecture. The article shows a significant improvement on aggregated load forecasting when compared to other (non-neural) regression models on different datasets.
4 Recurrent Neural Networks
In this section we overview recurrent neural networks and, in particular, the Elman network architecture [43], Long Short-Term Memory [44] and Gated Recurrent Unit [45] networks. Afterwards, we introduce deep recurrent neural networks and discuss different strategies to perform multi-step ahead forecasting. Finally, we present related work in short-term load forecasting that leverages recurrent networks.
4.1 Elman RNNs (ERNN)
Elman Recurrent Neural Networks (ERNN) were proposed in [43] to generalize feed-forward neural networks for better handling ordered data sequences like time series.
The reason behind the effectiveness of RNNs in dealing with sequences of data comes from their ability to learn a compact representation of the input sequence by means of a recurrent function that implements the following mapping:
$$\mathbf{h}_t = f(\mathbf{h}_{t-1}, \mathbf{x}_t) \quad (6)$$

By expanding Equation (6) and given a sequence of inputs $\mathbf{x}_1, \dots, \mathbf{x}_T$, the computation becomes:

$$\mathbf{a}_t = \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h \quad (7)$$
$$\mathbf{h}_t = g(\mathbf{a}_t) \quad (8)$$
$$\hat{\mathbf{y}}_t = g_o(\mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y) \quad (9)$$

where $\mathbf{W}_{hh}$, $\mathbf{W}_{xh}$, $\mathbf{W}_{hy}$ are the weight matrices for hidden-hidden, input-hidden and hidden-output connections respectively, $g$ is an activation function (generally the hyperbolic tangent) and $g_o$ is normally a linear function. The computation of a single module in an Elman recurrent neural network is depicted in Figure . It can be noted that an ERNN processes one element of the sequence at a time, preserving its inherent temporal order. After reading an element from the input sequence, the network updates its internal state using both (a transformation of) the latest state and (a transformation of) the current input (Equations (7)-(8)). The described process can be better visualized as an acyclic graph obtained from the original cyclic graph (left side of Figure ) via an operation known as time unfolding (right side of Figure ). It is of fundamental importance to point out that all nodes in the unfolded network share the same parameters, as they are just replicas distributed over time.
The parameters of the network $\boldsymbol{\theta}$ are usually learned via Backpropagation Through Time (BPTT) [46, 47], a generalized version of standard Backpropagation. In order to apply gradient-based optimization, the recurrent neural network has to be transformed through the unfolding procedure shown in Figure . In this way, the network is converted into a FNN having as many layers as time intervals in the input sequence, with each layer constrained to have the same weight matrices. In practice, Truncated Backpropagation Through Time [48], TBPTT($k_1$, $k_2$), is used: the method processes an input window of length $T$ one timestep at a time and runs BPTT for $k_2$ timesteps every $k_1$ steps. Notice that having $k_2 < T$ does not limit the memory capacity of the network, as the hidden state incorporates information taken from the whole sequence; despite that, setting $k_2$ to a very low number may result in poor performance. In this work we used epoch-wise Truncated BPTT, i.e., TBPTT($T$, $T$), to indicate that the weight update is performed once a whole sequence has been processed. Despite the model's simplicity, Elman RNNs are hard to train due to the ineffectiveness of gradient (back)propagation. In fact, it emerges that the propagation of the gradient is effective for short-term connections but is very likely to fail for long-term ones, when the gradient norm usually shrinks to zero or diverges. These two behaviours are known as the vanishing gradient and the exploding gradient problems [49, 50] and have been extensively studied in the machine learning community.

4.2 Long Short-Term Memory (LSTM)
Recurrent neural networks with Long Short-Term Memory (LSTM) were introduced to cope with the vanishing and exploding gradient problems occurring in ERNNs and, more in general, in standard RNNs [44]. LSTM networks maintain the same topological structure as ERNNs but differ in the composition of the inner module, or cell.
Each LSTM cell has the same input and output as an ordinary ERNN cell but, internally, it implements a gated system that controls the neural information processing (see Figures and ). The key feature of gated networks is their ability to control the gradient flow by acting on the gate values; this allows tackling the vanishing gradient problem, as an LSTM can keep its internal memory unaltered for long time intervals. Notice from the equations below that the inner state of the network results from a linear combination of the old state and the new candidate state (Equation (14)): part of the old state is preserved and flows forward, while in the ERNN the state value is completely replaced at each timestep (Equation (8)). In detail, the neural computation is:

$$\mathbf{f}_t = \sigma(\mathbf{W}_f\mathbf{x}_t + \mathbf{U}_f\mathbf{h}_{t-1} + \mathbf{b}_f) \quad (10)$$
$$\mathbf{i}_t = \sigma(\mathbf{W}_i\mathbf{x}_t + \mathbf{U}_i\mathbf{h}_{t-1} + \mathbf{b}_i) \quad (11)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o\mathbf{x}_t + \mathbf{U}_o\mathbf{h}_{t-1} + \mathbf{b}_o) \quad (12)$$
$$\tilde{\mathbf{c}}_t = g(\mathbf{W}_c\mathbf{x}_t + \mathbf{U}_c\mathbf{h}_{t-1} + \mathbf{b}_c) \quad (13)$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad (14)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot g(\mathbf{c}_t) \quad (15)$$

where the matrices $\mathbf{W}_\ast$, $\mathbf{U}_\ast$ and the vectors $\mathbf{b}_\ast$ are parameters to be learned, $\odot$ is the Hadamard product, $\sigma$ is generally a sigmoid activation while $g$ can be any nonlinear one (the hyperbolic tangent in the original paper). The cell state $\mathbf{c}_t$ encodes the information learned so far from the input sequence. At timestep $t$ the flow of information within the unit is controlled by three elements called gates: the forget gate $\mathbf{f}_t$ controls the cell state's content and changes it when obsolete, the input gate $\mathbf{i}_t$ controls which state values will be updated and by how much, and, finally, the output gate $\mathbf{o}_t$ produces a filtered version of the cell state and serves it as the network's output [51].
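The cell equations (10)-(15) translate almost line by line into code; the following NumPy sketch uses our own toy sizes and random initialization, so it is illustrative rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step following Equations (10)-(15)."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde         # linear blend of old and candidate state
    h = o * np.tanh(c)                   # filtered version of the cell state
    return h, c

n_h, n_x = 8, 1                          # hidden units, input size
p = {}
for gate in "fioc":
    p["W" + gate] = rng.normal(scale=0.1, size=(n_h, n_x))
    p["U" + gate] = rng.normal(scale=0.1, size=(n_h, n_h))
    p["b" + gate] = np.zeros(n_h)

h, c = np.zeros(n_h), np.zeros(n_h)
for y in [0.2, 0.5, 0.1]:                # three toy load readings
    h, c = lstm_cell(np.array([y]), h, c, p)
```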
4.3 Gated Recurrent Units (GRU)
First introduced in [45], GRUs are a simplified variant of LSTM and, as such, belong to the family of gated RNNs. GRUs distinguish themselves from LSTMs by merging into a single gate the functionalities controlled by the forget and input gates. This kind of cell ends up having just two gates, which results in a more parsimonious architecture compared to the LSTM that, instead, has three gates.
The basic components of a GRU cell are outlined in Figure , whereas the neural computation is controlled by:

$$\mathbf{z}_t = \sigma(\mathbf{W}_z\mathbf{x}_t + \mathbf{U}_z\mathbf{h}_{t-1} + \mathbf{b}_z) \quad (16)$$
$$\mathbf{r}_t = \sigma(\mathbf{W}_r\mathbf{x}_t + \mathbf{U}_r\mathbf{h}_{t-1} + \mathbf{b}_r) \quad (17)$$
$$\tilde{\mathbf{h}}_t = g(\mathbf{W}_h\mathbf{x}_t + \mathbf{U}_h(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h) \quad (18)$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t \quad (19)$$

where the matrices $\mathbf{W}_\ast$, $\mathbf{U}_\ast$ and the vectors $\mathbf{b}_\ast$ are the parameters to be learned, $\sigma$ is generally a sigmoid activation while $g$ can be any kind of nonlinearity (in the original work, a hyperbolic tangent). $\mathbf{z}_t$ and $\mathbf{r}_t$ are the update and the reset gates, respectively. Several works in the natural language processing community show that GRUs perform comparably to LSTMs but generally train faster due to the lighter computation [52, 53].

4.4 Deep Recurrent Neural Networks
All recurrent architectures presented so far are characterized by a single layer; in turn, this implies that the computation is composed of an affine transformation followed by a nonlinearity. That said, the concept of depth in RNNs is less straightforward than in feed-forward architectures. Indeed, the latter become deep when the input is processed by a large number of nonlinear transformations before generating the output values; according to this definition, an unfolded RNN is already a deep model, given its multiple nonlinear processing layers. Still, deep multi-level processing can be applied to each of the transition functions (input-hidden, hidden-hidden, hidden-output), as there are no intermediate layers involved in these computations [54]. Depth can also be introduced in recurrent neural networks by stacking recurrent layers one on top of the other [55]. As this deep architecture is the more interesting one, in this work we refer to it as a Deep RNN. By iterating the RNN computation, the function implemented by the deep architecture can be represented as:

$$\mathbf{h}_t^{(l)} = f\left(\mathbf{h}_{t-1}^{(l)}, \mathbf{h}_t^{(l-1)}\right) \quad (20)$$

where $\mathbf{h}_t^{(l)}$ is the hidden state at timestep $t$ for layer $l$; notice that $\mathbf{h}_t^{(0)} = \mathbf{x}_t$. It has been empirically shown in several works that Deep RNNs capture the temporal hierarchy exhibited by time series better than their shallow counterparts [54, 56, 57]. Of course, hybrid architectures having different layers, recurrent or not, can be considered as well.
4.5 Multi-Step Prediction Schemes
There are five different architecture-independent strategies for multi-step ahead forecasting [58]:
Recursive strategy (Rec)
A single model is trained to perform a one-step ahead forecast given the input sequence. Subsequently, during the operational phase, the forecasted output is recursively fed back and treated as if it were a correct measurement. By iterating this procedure $k$ times we generate the forecast values up to time $t+k$. The procedure is described in Algorithm , where at each iteration the input vector is deprived of its first element and the scalar output is concatenated to the exogenous input variables.
To summarize, the predictor receives as input a vector of length $m$ and outputs a scalar value.
Direct strategy
A set of $k$ independent predictors is designed, each providing the forecast at one specific time instant of the horizon. Similarly to the recursive strategy, each predictor outputs a scalar value, but the input vector is the same for all the predictors. Algorithm details the procedure.
DirRec strategy
[59] is a combination of the two strategies above. As in the direct approach, $k$ models are used but, here, each predictor leverages an enlarged input set, obtained by adding the forecast produced at the previous timestep. The procedure is detailed in Algorithm .
MIMO strategy
(Multiple Input, Multiple Output) [60]. A single predictor is trained to forecast the whole output sequence of length $k$ in one shot, i.e., differently from the previous cases, the output of the model is not a scalar but a vector.
DIRMO strategy
[61] represents a trade-off between the Direct and the MIMO strategies. It divides the $k$-step forecast into smaller forecasting problems, each of length $s$, so that $\lceil k/s \rceil$ predictors are used to solve the problem.
Given the considerable computational demand of RNNs during training, we focus on the multi-step forecasting strategies that are computationally cheaper, namely the Recursive and MIMO strategies [58]. We will call them RNN-Rec and RNN-MIMO.
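The two retained strategies can be contrasted with a minimal sketch; the "models" below are persistence-style stand-ins of our own making, not trained networks, so only the feedback mechanics are meaningful:

```python
import numpy as np

def forecast_recursive(one_step_model, x, k):
    """Rec strategy: iterate a one-step model k times, feeding each
    prediction back into the window as if it were an observation."""
    window = list(x)
    preds = []
    for _ in range(k):
        y_hat = one_step_model(np.array(window))   # scalar one-step forecast
        preds.append(y_hat)
        window = window[1:] + [y_hat]              # slide the input window
    return np.array(preds)

def forecast_mimo(mimo_model, x, k):
    """MIMO strategy: a single model emits the whole k-step vector."""
    return np.asarray(mimo_model(np.array(x)))[:k]

# hypothetical stand-in models (persistence: repeat the last observed value)
persistence = lambda w: w[-1]
persistence_mimo = lambda w: np.repeat(w[-1], 3)

x = [1.0, 2.0, 3.0]
rec = forecast_recursive(persistence, x, k=3)      # -> [3.0, 3.0, 3.0]
mimo = forecast_mimo(persistence_mimo, x, k=3)     # -> [3.0, 3.0, 3.0]
```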
Given the hidden state $\mathbf{h}_t$ at timestep $t$, the hidden-output mapping is obtained through a fully connected layer on top of the recurrent neural network. The objective of this dense layer is to learn the mapping between the last state of the recurrent network, which represents a kind of lossy summary of the task-relevant aspects of the input sequence, and the output domain. This holds for all the presented recurrent networks and is consistent with Equation (9). In this work, RNN-Rec and RNN-MIMO differ in the cardinality of the output domain, which is 1 for the former and $k$ for the latter. The objective function is:

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \left\| \hat{\mathbf{y}}_i - \mathbf{y}_i \right\|_2^2 + \lambda\,\Omega(\boldsymbol{\theta}) \quad (21)$$
4.6 Related work
In [17] an Elman recurrent neural network is considered to provide hourly load forecasts. The study also compares the performance of the network when additional weather information, such as temperature and humidity, is fed to the model. The authors conclude that, as expected, the recurrent network benefits from multi-input data and, in particular, from weather data. [28] makes use of an ERNN to forecast the household electric consumption of a suburban area near Palermo (Italy). In addition to the historical load measurements, the authors introduce several features to enhance the model's predictive capabilities. Besides the weather and calendar information, a specific ad-hoc index was created to assess the influence of air-conditioning equipment on the electricity demand. In recent years, LSTMs have been adopted in short-term load forecasting, proving to be more effective than traditional time-series analysis methods. In [21] an LSTM is shown to outperform traditional forecasting methods, being able to exploit the long-term dependencies in the time series to forecast the day-ahead load consumption. Several works have successfully enhanced recurrent neural network capabilities by employing multivariate input data. In [22] the authors propose a deep, LSTM-based architecture that uses past measurements of the whole household consumption, along with measurements from selected appliances, to forecast the consumption of the subsequent time interval (i.e., a one-step prediction). In [23] an LSTM-based network is trained using a multivariate input which includes temperature, holiday/working day information, and date and time information. Similarly, in [31] a power demand forecasting model based on LSTM shows an accuracy improvement compared to more traditional machine learning techniques such as Gradient Boosting Trees and Support Vector Regression.
GRUs have not been used as much in the literature, as LSTM networks are often preferred. That said, the use of GRU-based networks is reported in [18], while a more recent study [24] uses GRUs for the daily consumption forecast of individual customers. Investigating deep GRU-based architectures thus remains a relevant scientific topic, also thanks to their faster convergence and simpler structure compared to LSTM [52].
Despite all these promising results, an extensive study of recurrent neural networks [18], and in particular of ERNN, LSTM, GRU, ESN [62] and NARX, concludes that none of the investigated recurrent architectures manages to outperform the others in all considered experiments. Moreover, the authors noticed that recurrent cells with gating mechanisms, like LSTM and GRU, perform comparably to the much simpler ERNN. This may indicate that in short-term load forecasting gating mechanisms may be unnecessary; this issue is further investigated in the present work, where supporting evidence is found.
5 Sequence To Sequence models
Sequence To Sequence (seq2seq) architectures [63], or encoder-decoder models [45], were initially designed to overcome the inability of RNNs to produce output sequences of arbitrary length. The architecture was first used in neural machine translation [64, 65, 45] but has emerged as the gold standard in different fields such as speech recognition [66, 67, 68] and image captioning [69]. The core idea of this general framework is to employ two networks in an encoder-decoder arrangement. The first (possibly deep) neural network, the encoder, reads the input sequence of length $m$ one timestep at a time and generates a, generally lossy, fixed-dimensional vector representation of it. This embedded representation is usually called the context in the literature and can be the last hidden state of the encoder or a function of it. A second neural network, the decoder, then learns how to produce the output sequence given the context vector. The schematics of the whole architecture is depicted in Figure .
The encoder and the decoder modules are generally two recurrent neural networks trained end-to-end to minimize the objective function:

$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{k}\sum_{t=1}^{k}\left(\hat{y}_t - y_t\right)^2 + \lambda\,\Omega(\boldsymbol{\theta}) \quad (22)$$
$$\hat{y}_t = f_{dec}\left(y_{t-1}, \mathbf{h}_{t-1}, \mathbf{c}\right) \quad (23)$$

where $\hat{y}_t$ is the decoder's estimate at time $t$, $y_t$ is the real measurement, $\mathbf{h}_{t-1}$ is the decoder's last state, $\mathbf{c}$ is the context vector computed by the encoder from the input sequence $\mathbf{x}$, and $\Omega$ is the regularization term. The training procedure for this type of architecture is called teacher forcing [70]. As shown in Figure and expressed in Equation (23), during training the decoder's input at time $t$ is the ground-truth value $y_{t-1}$, which is then used to generate the next state and, in turn, the estimate $\hat{y}_t$. During inference the true values are unavailable and are replaced by the estimates:

$$\hat{y}_t = f_{dec}\left(\hat{y}_{t-1}, \mathbf{h}_{t-1}, \mathbf{c}\right) \quad (24)$$

This discrepancy between training and testing results in errors accumulating over time during inference; in the literature this problem is often referred to as exposure bias [71]. Several solutions have been proposed to address it. In [72] the authors present scheduled sampling, a curriculum learning strategy that gradually changes the training process by switching the decoder's inputs from ground-truth values to the model's predictions. The professor forcing algorithm, introduced in [73], uses an adversarial framework to encourage the dynamics of the recurrent network to be the same at training and operational (test) time. Finally, in recent years, reinforcement learning methods have been adopted to train sequence to sequence models; a comprehensive review is presented in [74].
In this work we investigate two sequence to sequence architectures, one trained via teacher forcing (TF) and one using self-generated (SG) samples. The former follows Equation (23) during training while Equation (24) is used during prediction; the latter adopts Equation (24) both for training and prediction. The decoder's dynamics are summarized in Figure . The two training procedures differ in the decoder's input source: ground-truth values in teacher forcing, estimated values in self-generated training.
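The difference between the two regimes reduces to where the decoder's next input comes from, as the following sketch shows; the decoder cell here is a hypothetical stand-in of our own making, not a trained network:

```python
import numpy as np

def decode(step_fn, context, y_true, teacher_forcing):
    """Unroll a decoder for len(y_true) steps. Under teacher forcing the
    ground truth feeds the next step (Eq. 23); otherwise the model's own
    estimate is fed back (Eq. 24), as done at inference time."""
    y_prev, h = 0.0, context
    preds = []
    for t in range(len(y_true)):
        y_hat, h = step_fn(y_prev, h)          # one decoder step
        preds.append(y_hat)
        y_prev = y_true[t] if teacher_forcing else y_hat
    return np.array(preds)

# toy decoder cell: output mixes the previous input with a decaying state
step = lambda y_prev, h: (0.5 * y_prev + h, 0.9 * h)

y_true = [1.0, 2.0, 3.0]
tf_preds = decode(step, context=1.0, y_true=y_true, teacher_forcing=True)
sg_preds = decode(step, context=1.0, y_true=y_true, teacher_forcing=False)
# the two roll-outs diverge as soon as a prediction differs from the truth
```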
5.1 Related Work
Seq2seq models have been adopted in short-term load forecasting only recently. In [33] an LSTM-based encoder-decoder model is shown to produce superior performance compared to a standard LSTM. In [75] the authors introduce an adaptation of RNN-based sequence-to-sequence architectures for the time-series forecasting of electrical loads, demonstrating better performance with respect to a suite of models ranging from standard RNNs to classical time-series techniques.
6 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) [76] are a family of neural networks designed to work with data that can be structured in a gridlike topology. CNNs were originally used on two dimensional and threedimensional images, but they are also suitable for onedimensional data such as univariate timeseries. Once recognized as a very efficient solution for image recognition and classification [77, 78, 79, 42]
, CNNs have experienced wide adoption in many different computer vision tasks
[80, 81, 82, 83, 84]. Moreover, sequence modeling tasks, like short term electric load forecasting, have been mainly addressed with recurrent neural networks, but recent research indicates that convolutional networks can also attain stateoftheartperformance in several applications including audio generation [85], machine translation [86] and timeseries prediction [87].As the name suggests, these kind of networks are based on a discrete convolution operator that produces an output feature map by sliding a kernel over the input . Each element in the output feature map is obtained by summing up the result of the elementwise multiplication between the input patch (i.e., a slice of the input having the same dimensionality of the kernel) and the kernel. The number of kernels (filters)
used in a convolutional layer determines the depth of the output volume (i.e., the number of output feature maps). Two hyperparameters control the other spatial dimensions of the output feature maps: stride and padding. Stride represents the distance between two consecutive input patches and can be defined for each direction of motion. Padding refers to the possibility of implicitly enlarging the input by adding (usually) zeros at the borders to control the output size w.r.t. the input. Indeed, without padding, the dimensionality of the output is reduced after each convolutional layer.
Considering a 1D time-series $x \in \mathbb{R}^{T}$ and a one-dimensional kernel $w \in \mathbb{R}^{k}$, the element $t$ of the convolution between $x$ and $w$ is:

$$(x * w)(t) = \sum_{i=0}^{k-1} w(i)\, x(t-i) \qquad (25)$$

with $t \in \{k, \dots, T\}$ if no zero-padding is used; otherwise, with padding, the output matches the input dimensionality, i.e., $t \in \{1, \dots, T\}$. Equation (25) refers to the one-dimensional input case but can be easily extended to multidimensional inputs (e.g., images) [88]. The reason behind the success of these networks can be summarized in the following three points:
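The operation can be sketched in a few lines of NumPy. This naive implementation (technically a cross-correlation, as is customary in deep learning) makes the roles of stride and padding explicit:

```python
import numpy as np

def conv1d(x, w, stride=1, padding=0):
    """Naive 1D convolution as used in deep learning (technically
    cross-correlation): each output element is the sum of the
    element-wise product between the kernel and an input patch."""
    if padding > 0:
        x = np.pad(x, padding)  # zero-pad both borders
    k = len(w)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w)
                     for i in range(out_len)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
y_valid = conv1d(x, w)             # no padding: output shorter than input
y_same = conv1d(x, w, padding=1)   # padding (k-1)/2 preserves the length
```

Without padding the output has `T - k + 1` elements; with one zero added at each border it keeps the input length, matching the two cases discussed around Equation (25).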

local connectivity: each hidden neuron is connected to a subset of input neurons that are close to each other (according to a specific spatio-temporal metric). This property allows the network to drastically reduce the number of parameters to learn (w.r.t. a fully connected network) and facilitates computations.

parameter sharing: the weights used to compute the output neurons in a feature map are the same, so that the same kernel is used for each location. This further reduces the number of parameters to learn.

translation equivariance: a shift of the input produces a corresponding shift of the output feature maps, which makes the network robust to translations of its input.
In our work we focus on a convolutional architecture inspired by WaveNet [85], a fully probabilistic and autoregressive model originally used for generating raw audio waveforms and later extended to time-series prediction tasks [87]. To the best of the authors' knowledge, this architecture has never been proposed to forecast the electric load. A recent empirical comparison between temporal convolutional networks and recurrent networks has been carried out in [89] on tasks such as polyphonic music and character-level sequence modelling. The authors were the first to use the name Temporal Convolutional Networks (TCNs) to indicate convolutional networks which are autoregressive, able to process sequences of arbitrary length, and output a sequence of the same length. To achieve this, the network has to employ causal (dilated) convolutions, and residual connections should be used to handle a very long history size.

Dilated Causal Convolution (DCC)
Since TCNs are a family of autoregressive models, the estimated value at time $t$ must depend only on past samples and not on future ones (Figure ). To achieve this behavior in a Convolutional Neural Network, the standard convolution operator is replaced by causal convolution. Moreover, zero-padding of length (filter size $-$ 1) is added to ensure that each layer has the same length as the input layer. To further enhance the network capabilities, dilated causal convolutions are used, which increase the receptive field of the network (i.e., the number of input neurons to which the filter is applied) and its ability to learn long-term dependencies in the time-series. Given a one-dimensional input $x \in \mathbb{R}^{T}$ and a kernel $w \in \mathbb{R}^{k}$, the dilated convolution output with dilation factor $d$ becomes:

$$(x *_{d} w)(t) = \sum_{i=0}^{k-1} w(i)\, x(t - d \cdot i) \qquad (26)$$

This is a major advantage w.r.t. simple causal convolutions: in the latter case the receptive field grows only linearly with the depth of the network, while with dilated convolutions the dependence is exponential, ensuring that a much larger history size is used by the network.
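A minimal sketch of the dilated causal convolution and of the receptive-field growth (assuming the common choice of doubling the dilation at each layer; function names are illustrative):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Causal convolution with dilation d: the output at time t depends
    only on x[t], x[t-d], x[t-2d], ...  Left zero-padding of length
    (k - 1) * d keeps the output as long as the input."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([sum(w[i] * xp[pad + t - i * dilation] for i in range(k))
                     for t in range(len(x))])

def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked dilated causal convolutions with the
    dilation doubling at each layer (d = 2**l): exponential in depth."""
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(num_layers))

y = dilated_causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)
# y[t] = x[t] + x[t-2]  ->  [1, 2, 4, 6]
```

With a kernel of size 2, three stacked layers with dilations 1, 2, 4 already cover 8 past samples, whereas three plain causal layers would cover only 4.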
Residual Connections
Despite the use of dilated convolutions, the CNN still needs a large number of layers to learn the dynamics of the input. Moreover, performance often degrades as the network depth increases. This degradation problem was first addressed in [42], where the authors propose a deep residual learning framework. The authors observe that, for a network with a given number of layers and a given training error, inserting extra layers on top of it should either leave the error unchanged or improve it: in the worst-case scenario, the newly stacked non-linear layers should learn the identity mapping between the output of the shallower network and that of the deeper one. Although almost trivial, in practice neural networks experience problems in learning this identity mapping. The proposed solution lets these stacked layers fit a residual mapping instead of the desired one: the original mapping is recast as the sum of the input and a residual function, realized by feed-forward neural networks with shortcut connections. In this way the identity mapping is learned by simply driving the weights of the stacked layers to zero.
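The key property (zero residual weights yield exactly the identity mapping) can be verified with a toy 2-shortcut block; the weight shapes here are illustrative:

```python
import numpy as np

def residual_block(x, W1, W2):
    """Two stacked layers learn the residual F(x); a shortcut connection
    adds the input back, so the block outputs F(x) + x."""
    h = np.maximum(0.0, x @ W1)   # first affine transformation + ReLU
    f = h @ W2                    # second affine transformation: F(x)
    return x + f                  # shortcut (identity) connection

d = 4
x = np.ones(d)
# Driving the stacked layers' weights to zero makes F(x) = 0,
# so the block reduces exactly to the identity mapping.
zeros = np.zeros((d, d))
out = residual_block(x, zeros, zeros)
```

The shortcut thus guarantees that adding blocks can never make representing the shallower network harder, which is the intuition behind residual learning.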
By means of the two aforementioned principles, the temporal convolutional network is able to exploit a large history size in an efficient manner. Indeed, as observed in [89], these models present several computational advantages compared to RNNs: they have lower memory requirements during training, and the predictions for later timesteps are not computed sequentially but can be computed in parallel by exploiting parameter sharing. Moreover, TCN training is much more stable than RNN training, avoiding the exploding/vanishing gradient problem. For all the above reasons, TCNs have proven to be a promising area of research for time-series prediction problems, and here we aim to assess their forecasting performance w.r.t. state-of-the-art models in short-term load forecasting. The architecture used in our work is depicted in Figure ; except for some minor modifications, it is the network structure detailed in [87]. In the first layer of the network we process separately the load information and, when available, the exogenous information such as temperature readings. The results are then concatenated together and processed by a deep stack of residual layers. Each layer consists of a residual block with a 1D dilated causal convolution, a rectified linear unit (ReLU) activation and, finally, dropout to prevent overfitting. The output layer consists of a 1x1 convolution which allows the network to output a one-dimensional vector having the same dimensionality as the input vector. To approach multi-step forecasting, we adopt a MIMO strategy.

6.1 Related Work
CNNs have not been studied to a large extent in the short-term load forecasting literature. Indeed, until recently, these models were rarely considered for time-series related problems. Still, several works have addressed the topic. In [15] a deep convolutional neural network named DeepEnergy is presented. The proposed network is inspired by the first architectures used in the ImageNet challenge (e.g., [77]), alternating convolutional and pooling layers and halving the width of the feature map after each step. According to the provided experimental results, DeepEnergy can precisely predict the energy load over the next three days, outperforming five other machine learning algorithms including LSTM and FNN. In [16] a CNN is compared to recurrent and feed-forward approaches, showing promising results on a benchmark dataset. In [25] a hybrid approach involving both convolutional and recurrent architectures is presented: the authors integrate different input sources and use convolutional layers to extract meaningful features from the historic load, while the recurrent network's main task is to learn the system's dynamics. The model is evaluated on a large dataset containing hourly loads from a city in North China and is compared with a three-layer feed-forward neural network. A different hybrid approach is presented in [26]: the authors process the load information in parallel with a CNN and an LSTM. The features generated by the two networks are then used as input for a final (fully connected) prediction network in charge of forecasting the day-ahead load.

7 Performance Assessment
In this section we evaluate and assess all the presented architectures. The testing is carried out by means of three use cases based on two different benchmark datasets. We first introduce the performance metrics considered for both network optimization and testing, then describe the datasets, and finally discuss the results.
7.1 Performance Metrics
The performance of the considered architectures has been measured and quantified using widely adopted error metrics. Specifically, we adopted the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \frac{1}{T}\left\lVert \mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)} \right\rVert_2^2}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{T}\left\lVert \mathbf{y}^{(i)} - \hat{\mathbf{y}}^{(i)} \right\rVert_1,$$

where $N$ is the number of input-output pairs provided to the model in the course of testing, $T$ is the prediction horizon, $\mathbf{y}^{(i)}$ and $\hat{\mathbf{y}}^{(i)}$ are respectively the real and the estimated load values for sample $i$ (i.e., the $i$-th time window), $\lVert\cdot\rVert_2$ is the Euclidean L2 norm and $\lVert\cdot\rVert_1$ is the L1 norm. A more intuitive and indicative interpretation of the prediction quality of the estimators is given by the normalized root mean squared error which, differently from the two metrics above, is independent of the scale of the data:

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}},$$

where $y_{\max}$ and $y_{\min}$ are the maximum and minimum values of the training dataset, respectively. In order to quantify the proportion of variance in the target that is explained by the forecasting methods we also consider the $R^2$ index:

$$R^2 = 1 - \frac{\sum_{i,t} \bigl( y_t^{(i)} - \hat{y}_t^{(i)} \bigr)^2}{\sum_{i,t} \bigl( y_t^{(i)} - \bar{y} \bigr)^2},$$

where $\bar{y}$ is the mean of the observed load values.
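Under the standard definitions of these four metrics, they can be computed in a few lines of NumPy (a sketch; variable names are illustrative):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error over all predicted values."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean absolute error over all predicted values."""
    return np.mean(np.abs(y - y_hat))

def nrmse(y, y_hat, y_min, y_max):
    """RMSE normalized by the range of the training data, making the
    score independent of the scale of the series."""
    return rmse(y, y_hat) / (y_max - y_min)

def r2(y, y_hat):
    """Proportion of target variance explained by the forecast."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect forecast gives RMSE = MAE = NRMSE = 0 and R² = 1; a forecast no better than predicting the mean gives R² = 0.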
All considered models have been implemented in Keras 2.12 [90] with TensorFlow [91] as backend. The experiments are executed on a Linux cluster with an Intel(R) Xeon(R) Silver CPU and an Nvidia Titan XP.

7.2 Use Case I
The first use case considers the Individual Household Electric Power Consumption (IHEPC) dataset, which contains 2.07M measurements of electric power consumption for a single house located in Sceaux (7 km from Paris, France). Measurements are collected every minute between December 2006 and November 2010 (47 months) [29]. In this study we focus on predicting the "Global active power" variable. Nearly 1.25% of the measurements are missing, but all the available ones come with timestamps. We reconstruct the missing values using the mean power consumption for the corresponding time slot across the different years of measurements. In order to have a unified approach we resample the dataset at a 15-minute rate, a widely adopted standard in modern smart meter technologies. Table outlines the sample sizes for each dataset.
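The imputation-then-resampling step can be sketched with pandas. The series below is synthetic and stands in for the "Global active power" column; for brevity the toy imputes each gap with the same time slot's mean across days, whereas the study averages across years:

```python
import numpy as np
import pandas as pd

# Toy minute-level series standing in for IHEPC's "Global active power"
# (values and dates are illustrative, not the real dataset).
idx = pd.date_range("2007-01-01", periods=4 * 24 * 60, freq="min")
load = pd.Series(np.random.default_rng(0).random(len(idx)), index=idx)
load.iloc[100:160] = np.nan  # simulate an hour of missing readings

# Impute each gap with the mean consumption of the same time slot
# elsewhere in the data (the paper averages across years).
slot = load.index.strftime("%H:%M")
load = load.fillna(load.groupby(slot).transform("mean"))

# Resample to the 15-minute rate adopted throughout the study.
load_15m = load.resample("15min").mean()
```

After these two steps the series is gap-free and on a uniform 15-minute grid, ready to be windowed into input-output pairs.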
In this use case we perform the forecasting using only historical load values. The left side of Figure depicts the average weekly electric consumption. As expected, the highest consumption is registered in the morning and evening periods of the day, when the occupancy of residential houses is high. Moreover, the average load profile over a week clearly shows that weekdays are similar to each other, while weekends present a different consumption trend.
The figure shows that the data are characterized by high variance. The prediction task consists in forecasting the electric load for the next day, i.e., 96 timesteps ahead.
Dataset  Train  Test 

IHEPC  103301  35040 
GEFCom2014  44640  8928 
In order to assess the performance of the architectures we hold out a portion of the data as our test set, comprising the last year of measurements. The remaining measurements are repeatedly divided in two sets, setting aside one month of data out of every five. This process yields a training set and a validation set on which different hyperparameter configurations can be evaluated. Only the best performing configuration is then evaluated on the test set.
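One possible reading of this split, sketched with pandas on a synthetic 15-minute index spanning the IHEPC period (the exact month-selection rule is an assumption):

```python
import numpy as np
import pandas as pd

# Toy 15-minute index covering the IHEPC measurement period.
idx = pd.date_range("2006-12-01", "2010-11-30 23:45", freq="15min")
data = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# The last year of measurements is held out as the test set.
cutoff = data.index.max() - pd.DateOffset(years=1)
test = data[data.index > cutoff]
rest = data[data.index <= cutoff]

# Of the remaining months, one out of every five is set aside
# for validation; the rest forms the training set.
month_id = rest.index.year * 12 + rest.index.month
month_rank = month_id - month_id.min()
val = rest[month_rank % 5 == 4]
train = rest[month_rank % 5 != 4]
```

Splitting by whole months (rather than random samples) keeps validation windows contiguous in time, which avoids leakage between adjacent timesteps of the same day.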
7.3 Use Case II and III
The other two use cases are based on the GEFCom2014 dataset [35], which was made available for an online forecasting competition that ran between August 2015 and December 2015. The dataset contains 60.6k hourly measurements of (aggregated) electric power consumption collected by ISO New England between January 2005 and December 2011. Differently from the IHEPC dataset, temperature values are also available and are used by the different architectures to enhance their prediction performance. In particular, the input variables used for forecasting include several previous load measurements, the temperature measurements for the previous timesteps registered by 25 different stations, and the hour, day, month and year of the measurements. We apply standard normalization to load and temperature measurements, while to the other variables we simply apply one-hot encoding, i.e., each is mapped to a vector in which one element equals 1 and all remaining elements equal 0 [92]. On the right side of Figure we observe the average load and the data dispersion on a weekly basis. Compared to IHEPC, the load profiles look much more regular. This meets intuitive expectations: the load measurements in the first dataset come from a single household, so the randomness introduced by user behaviour has a more remarkable impact on the results. On the contrary, the load information in GEFCom2014 comes from the aggregation of the data provided by several different smart meters, and aggregated data exhibit a more stable and regular pattern. The main task of these use cases, as well as of the previous one, consists in forecasting the electric load for the next day, i.e., 24 timesteps ahead. The hyperparameter optimization and the final scores for the models follow the same guidelines provided for IHEPC; the number of points in each subset is reported in Table .

Figure: Weekly statistics for the electric load in the whole IHEPC (left) and GEFCom2014 (right) datasets. The bold line is the mean curve, the dotted line is the median and the green area covers one standard deviation from the mean.
7.4 Results
The compared architectures are the ones presented in the previous sections, with one exception: we additionally consider a deeper variant of a feed-forward neural network with residual connections, named DFNN in the remainder of the work. In accordance with the findings of [93], we employ a 2-shortcut network, i.e., the input undergoes two affine transformations, each followed by a non-linearity, before being summed to its original values. For regularization purposes we include Dropout and Batch Normalization [94] in each residual block. We add this model to the comparison because it represents an evolution of standard feed-forward neural networks which is expected to better handle highly complex time-series data.

Table summarizes the best configurations found through grid search for each model and use case. For both datasets we experimented with input sequences of different lengths. Finally, we used a window size of four days, which represents the best trade-off between performance and memory requirements. The output sequence length is fixed to one day. For each model we identified the optimal number of stacked layers in the network, the number of hidden units per layer, the regularization coefficient (L2 regularization) and the dropout rate. Moreover, for the TCN we additionally tuned the width of the convolutional kernel and the number of filters applied at each layer (i.e., the depth of each output volume after the convolution operation). The dilation factor is increased exponentially with the depth of the network, i.e., $d = 2^{l}$ with $l$ being the layer index.
Hyperparameters  Dataset  FNN  DFNN  TCN  ERNN  LSTM  GRU  seq2seq  
Rec  MIMO  Rec  MIMO  Rec  MIMO  TF  SG  
L  IHEPC  3  6  8  3  1  2  1  2  1  1  1 
GEFCOM  1  6  6  4  4  4  2  4  1  2  1  
1  6  8  2  1  4  1  2  2  2  3  
IHEPC  50  10  30  20  20  10  50  30  50  
GEFCOM  60  30  20  50  15  20  30  20  10  50  
60  30  10  30  30  50  10  15  20  201510  
IHEPC  0.001  0.0005  0.005  0.001  0.001  0.001  0.001  0.001  0.0005  0.01  0.01  
GEFCOM  0.01  0.0005  0.01  0.01  0.0005  0.001  0.001  0.01  0.0005  0.01  0.01  
0.005  0.0005  0.005  0.0005  0.001  0.0005  0.0005  0.001  0.01  0.001  0.01  
IHEPC  0.1  0.1  0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.1  0.2  
GEFCOM  0.1  0.1  0.1  0.1  0.0  0.1  0.0  0.1  0.0  0.1  0.1  
0.1  0.1  0.1  0.0  0.0  0.0  0.0  0.1  0.0  0.1  0.0  
M  IHEPC  2, 32  
GEFCOM  2, 16  
2, 64 
Table summarizes the test scores obtained by the presented architectures on the IHEPC dataset. Certain similarities among networks trained for the different use cases can be spotted already at this stage. In particular, we observe that all models exploit a small number of neurons. This is not usual in deep learning but (at least for recurrent architectures) is consistent with [18]. With some exceptions, recurrent networks benefit from less strict regularization: dropout is almost always set to zero and the regularization coefficients are small.
Among recurrent neural networks we observe that, in general, the MIMO strategy outperforms the recursive one in this multi-step prediction task. This is reasonable in such a scenario: the recursive strategy, differently from the MIMO one, is highly sensitive to error accumulation, which, in a highly volatile time series such as the one addressed here, results in very inaccurate forecasts. Among the MIMO models we observe that gated networks perform significantly better than the simple Elman network. This suggests that gated systems effectively learn to better exploit the temporal dependencies in the data. In general we notice that all the models, except the RNNs trained with the recursive strategy, achieve comparable performance and none really stands out. It is interesting to note that GRU-MIMO and LSTM-MIMO outperform sequence-to-sequence architectures, which are supposed to better model complex temporal dynamics like those exhibited by the residential load curve. Nevertheless, considering the performance of recurrent networks trained with the recursive strategy, this behaviour is less surprising. In fact, compared with the aggregated load profiles, the load curve belonging to a single smart meter is far more volatile and sensitive to customer behaviour. For this reason, leveraging geographical and socio-economic features that characterize the area where the user lives may allow deep networks to generate better predictions.
For visualization purposes we compare the performance of all models on a single-day prediction scenario on the left side of Figure . On the right side of Figure we quantify the differences between the best predictor (the GRU-MIMO) and the actual measurements; the thinner the line, the closer the prediction to the true data. Furthermore, in this figure we concatenate multiple day-ahead predictions to cover a wider time span and evaluate the model's predictive capabilities. We observe that the model generates predictions that correctly capture the general trend of the load curve but fail to predict steep peaks. This might derive from the design choice of using MSE as the optimization metric: since large errors are heavily penalized, deep models may be discouraged from predicting high peaks, and predicting a lower and smoother function results in better performance according to this metric. Alternatively, some of the peaks may simply represent noise due to particular user behaviour and are thus unpredictable by definition.
RMSE  MAE  NRMSE  

FNN  
DFNN  
TCN  
ERNN  MIMO  
Rec  
LSTM  MIMO  
Rec  
GRU  MIMO  
Rec  
seq2seq  TF  
SG 
The load curve of the second dataset (GEFCom2014) results from the aggregation of several different load profiles, producing a smoother curve compared with the individual load case. The hyperparameter optimization and the final scores for the models can be found in Table .
Table and Table show the experimental results obtained by the models in two different scenarios. In the former, only load values were provided to the models, while in the latter the input vector was augmented with the exogenous features described before. Compared to the previous dataset, this time series exhibits a much more regular pattern; as such, we expect the prediction task to be easier. Indeed, we can observe a major improvement in performance across all the models. As already noted in [95, 22], the prediction accuracy increases significantly when the forecasting task is carried out on a smooth load curve (resulting from the aggregation of many individual consumers).
We can observe that, in general, all models except plain FNNs benefit from the presence of exogenous variables.
When exogenous variables are adopted, we notice a major improvement for RNNs trained with the recursive strategy, which now outperform the MIMO ones. This increase in accuracy can be attributed to a better capacity to leverage the exogenous time series of temperatures to yield a better load forecast. Moreover, RNNs with the MIMO strategy gain negligible improvements compared to their performance when no extra feature is provided. This kind of architecture uses a feed-forward neural network to map its final hidden state to a sequence of values, i.e., the estimates. Exogenous variables are processed directly by this FNN which, as observed above, has problems handling both load data and extra information. Consequently, a better way of injecting exogenous variables into MIMO recurrent networks needs to be found in order to provide a boost in prediction performance comparable to the one achieved by the recursive strategy.
For similar reasons, sequence-to-sequence models trained via teacher forcing (seq2seq-TF) experience an improvement when exogenous features are used. Still, the seq2seq trained in free-running mode (seq2seq-SG) proves to be a valid alternative to the standard seq2seq-TF, producing high-quality predictions in all use cases. The absence of a discrepancy between training and inference in terms of data-generating distribution proves to be an advantage, as seq2seq-SG is less sensitive to noise and error propagation.
Finally, we notice that TCNs perform well in all the presented use cases. Considering their lower memory requirements during training along with their inherent parallelism, this type of network represents a promising alternative to recurrent neural networks for short-term load forecasting.
The prediction results are presented in the same fashion as for the previous use case in Figure . Observe that, in general, all the considered models are able to produce reasonable estimates, and sudden peaks in consumption are smoothed. Overall, the predictors greatly improve their accuracy when predicting day-ahead values for aggregated load curves with respect to the individual household scenario.
RMSE  MAE  NRMSE  

FNN  
DFNN  
TCN  
ERNN  MIMO  
Rec  
LSTM  MIMO  
Rec  
GRU  MIMO  
Rec  
seq2seq  TF  
SG 
RMSE  MAE  NRMSE  

FNN  
DFNN  
TCN  
ERNN  MIMO  
Rec  
LSTM  MIMO  
Rec  
GRU  MIMO  
Rec  
seq2seq  TF  
SG 
8 Conclusions
In this work we have surveyed and experimentally evaluated the most relevant deep learning models applied to the short-term load forecasting problem, paving the way for a standardized assessment and the identification of optimal solutions in this field. The focus has been on three main families of models, namely Recurrent Neural Networks, Sequence-to-Sequence Architectures and the recently developed Temporal Convolutional Neural Networks. An architectural description, along with a technical discussion on how multi-step ahead forecasting is achieved, has been provided for each considered model. Moreover, different forecasting strategies have been discussed and evaluated, identifying advantages and drawbacks for each of them. The evaluation has been carried out on three real-world use cases that refer to two distinct load forecasting scenarios: one use case deals with a dataset coming from a single household, while the other two tackle the prediction of a load curve that represents several aggregated meters dispersed over a wide area. Our findings concerning the application of recurrent neural networks to short-term load forecasting show that the simple ERNN performs comparably to gated networks such as GRU and LSTM when adopted for aggregated load forecasting. Thus, the less costly alternative provided by the ERNN may represent the most effective solution in this scenario, as it reduces the training time without a remarkable impact on prediction accuracy. On the contrary, a significant difference exists for single-household electric load forecasting, where gated networks prove superior to Elman ones, suggesting that the gating mechanism allows irregular time series to be handled better. Sequence-to-sequence models have demonstrated to be quite effective in load forecasting tasks, even though they fail to outperform RNNs.
In general, we can claim that seq2seq architectures do not represent the gold standard in load forecasting that they are in other domains like natural language processing. Regarding this family of architectures, we have also observed that teacher forcing may not be the best solution for training seq2seq models on short-term load forecasting tasks. Despite being harder to train in terms of convergence, free-running models learn to handle their own errors, avoiding the discrepancy between training and testing that is a well-known issue of teacher forcing. It would be worth further investigating the capabilities of seq2seq models trained with intermediate solutions such as professor forcing. Finally, we evaluated the recently developed Temporal Convolutional Neural Networks, which demonstrated convincing performance when applied to load forecasting tasks. We therefore believe that the adoption of these networks for sequence modelling in the considered field is very promising and might even introduce a significant advance in an area that is emerging as key for future Smart Grid developments.
Acknowledgment
This project is carried out within the frame of the Swiss Centre for Competence in Energy Research on the Future Swiss Electrical Infrastructure (SCCERFURIES) with the financial support of the Swiss Innovation Agency (Innosuisse  SCCER program).
References
 [1] X. Fang, S. Misra, G. Xue, and D. Yang. Smart grid — the new and improved power grid: A survey. IEEE Communications Surveys Tutorials, 14(4):944–980, Fourth 2012.
 [2] Eisa Almeshaiei and Hassan Soltan. A methodology for electric power load forecasting. Alexandria Engineering Journal, 50(2):137 – 144, 2011.
 [3] H. S. Hippert, C. E. Pedreira, and R. C. Souza. Neural networks for shortterm load forecasting: a review and evaluation. IEEE Transactions on Power Systems, 16(1):44–55, Feb 2001.
 [4] JiannFuh Chen, WeiMing Wang, and ChaoMing Huang. Analysis of an adaptive timeseries autoregressive movingaverage (arma) model for shortterm load forecasting. Electric Power Systems Research, 34(3):187–196, 1995.
 [5] ShyhJier Huang and KuangRong Shih. Shortterm load forecasting via arma model identification including nongaussian process considerations. IEEE Transactions on power systems, 18(2):673–679, 2003.
 [6] Martin T Hagan and Suzanne M Behr. The time series approach to short term load forecasting. IEEE Transactions on Power Systems, 2(3):785–791, 1987.

 [7] Chao-Ming Huang, Chi-Jen Huang, and Ming-Li Wang. A particle swarm optimization to identifying the armax model for short-term load forecasting. IEEE Transactions on Power Systems, 20(2):1126–1133, 2005.
 [8] Hong-Tzer Yang, Chao-Ming Huang, and Ching-Lien Huang. Identification of armax model for short term load forecasting: An evolutionary programming approach. In Power Industry Computer Application Conference, 1995. Conference Proceedings., 1995 IEEE, pages 325–330. IEEE, 1995.
 [9] Guy R Newsham and Benjamin J Birt. Buildinglevel occupancy data to improve arimabased electricity use forecasts. In Proceedings of the 2nd ACM workshop on embedded sensing systems for energyefficiency in building, pages 13–18. ACM, 2010.
 [10] K. Y. Lee, Y. T. Cha, and J. H. Park. Shortterm load forecasting using an artificial neural network. IEEE Transactions on Power Systems, 7(1):124–132, Feb 1992.
 [11] D. C. Park, M. A. ElSharkawi, R. J. Marks, L. E. Atlas, and M. J. Damborg. Electric load forecasting using an artificial neural network. IEEE Transactions on Power Systems, 6(2):442–449, May 1991.
 [12] Dipti Srinivasan, A.C. Liew, and C.S. Chang. A neural network shortterm load forecaster. Electric Power Systems Research, 28(3):227 – 234, 1994.
 [13] I. Drezga and S. Rahman. Shortterm load forecasting with local ann predictors. IEEE Transactions on Power Systems, 14(3):844–850, Aug 1999.
 [14] K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He. Shortterm load forecasting with deep residual networks. IEEE Transactions on Smart Grid, pages 1–1, 2018.
 [15] PingHuan Kuo and ChiouJye Huang. A high precision artificial neural networks model for shortterm energy load forecasting. Energies, 11(1), 2018.
 [16] K. Amarasinghe, D. L. Marino, and M. Manic. Deep neural networks for energy load forecasting. In 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), pages 1483–1488, June 2017.
 [17] Siddarameshwara Nayaka, Anup Yelamali, and Kshitiz Byahatti. Electricity short term load forecasting using elman recurrent neural network. pages 351 – 354, 11 2010.
 [18] Filippo Maria Bianchi, Enrico Maiorino, Michael C. Kampffmeyer, Antonello Rizzi, and Robert Jenssen. An overview and comparative analysis of recurrent neural networks for short term load forecasting. CoRR, abs/1705.04378, 2017.
 [19] Filippo Maria Bianchi, Enrico De Santis, Antonello Rizzi, and Alireza Sadeghian. Shortterm electric load forecasting using echo state networks and pca decomposition. IEEE Access, 3:1931–1943, 2015.
 [20] Elena Mocanu, Phuong H Nguyen, Madeleine Gibescu, and Wil L Kling. Deep learning for estimating building energy consumption. Sustainable Energy, Grids and Networks, 6:91–99, 2016.
 [21] Jian Zheng, Cencen Xu, Ziang Zhang, and Xiaohua Li. Electric load forecasting in smart grids using longshorttermmemory based recurrent neural network. In Information Sciences and Systems (CISS), 2017 51st Annual Conference on, pages 1–6. IEEE, 2017.
 [22] Weicong Kong, Zhao Yang Dong, Youwei Jia, David J Hill, Yan Xu, and Yuan Zhang. Shortterm residential load forecasting based on lstm recurrent neural network. IEEE Transactions on Smart Grid, 2017.

 [23] Salah Bouktif, Ali Fiaz, Ali Ouni, and Mohamed Serhani. Optimal deep learning lstm model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches. Energies, 11(7):1636, 2018.
 [24] Yixing Wang, Meiqin Liu, Zhejing Bao, and Senlin Zhang. Short-term load forecasting with multi-source data using gated recurrent unit neural networks. Energies, 11:1138, 05 2018.
 [25] Wan He. Load forecasting via deep neural networks. Procedia Computer Science, 122:308 – 314, 2017. 5th International Conference on Information Technology and Quantitative Management, ITQM 2017.
 [26] Chujie Tian, Jian Ma, Chunhong Zhang, and Panpan Zhan. A deep neural network model for shortterm load forecast based on long shortterm memory network and convolutional neural network. Energies, 11:3493, 12 2018.
 [27] Tao Hong, Pierre Pinson, and Shu Fan. Global energy forecasting competition 2012. International Journal of Forecasting, 30(2):357 – 363, 2014.
 [28] Antonino Marvuglia and Antonio Messineo. Using recurrent artificial neural networks to forecast household electricity consumption. Energy Procedia, 14:45 – 55, 2012. 2011 2nd International Conference on Advances in Energy Engineering (ICAEE).
 [29] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
 [30] Smart grid, smart city. Australian Government, Canberra, Australia.
 [31] Yao Cheng, Chang Xu, Daisuke Mashima, Vrizlynn L. L. Thing, and Yongdong Wu. PowerLSTM: Power demand forecasting using long short-term memory neural network. In Gao Cong, Wen-Chih Peng, Wei Emma Zhang, Chengliang Li, and Aixin Sun, editors, Advanced Data Mining and Applications, pages 727–740, Cham, 2017. Springer International Publishing.
 [32] UMass Smart Dataset. http://traces.cs.umass.edu/index.php/Smart/Smart, 2017.
 [33] Daniel L. Marino, Kasun Amarasinghe, and Milos Manic. Building energy load forecasting using deep neural networks. In IECON 2016 - 42nd Annual Conference of the IEEE Industrial Electronics Society, pages 7046–7051. IEEE, 2016.
 [34] Henning Wilms, Marco Cupelli, and Antonello Monti. Combining autoregression with exogenous variables in sequence-to-sequence recurrent neural networks for short-term load forecasting. In 2018 IEEE 16th International Conference on Industrial Informatics (INDIN), pages 673–679. IEEE, 2018.
 [35] Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli, and Rob J. Hyndman. Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. International Journal of Forecasting, 32(3):896–913, 2016.
 [36] A. Almalaq and G. Edwards. A review of deep learning methods applied on load forecasting. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 511–516, Dec 2017.
 [37] Balázs Csanád Csáji. Approximation with artificial neural networks. MSc thesis, Faculty of Sciences, Eötvös Loránd University, Budapest, Hungary, 2001.
 [38] Geoffrey Hinton. Neural networks for machine learning, lecture 6 (RMSProp). http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
 [39] Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.
 [40] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [41] S.T. Chen, D.C. Yu, and A.R. Moghaddamjo. Weather sensitive short-term load forecasting using non-fully connected artificial neural network. IEEE Transactions on Power Systems, 7(3), 1992.
 [42] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
 [43] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
 [44] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [45] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734, 2014.
 [46] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 [47] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
 [48] Ronald J. Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4):490–501, 1990.
 [49] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
 [50] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, pages III-1310–III-1318. JMLR.org, 2013.
 [51] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, Oct 2017.
 [52] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
 [53] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
 [54] Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. CoRR, abs/1312.6026, 2013.
 [55] Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Comput., 4(2):234–242, March 1992.
 [56] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
 [57] Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 190–198. Curran Associates, Inc., 2013.
 [58] Souhaib Ben Taieb, Gianluca Bontempi, Amir F. Atiya, and Antti Sorjamaa. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications, 39(8):7067–7083, 2012.
 [59] Antti Sorjamaa and Amaury Lendasse. Time series prediction using DirRec strategy. In Proceedings of the European Symposium on Artificial Neural Networks (ESANN), pages 143–148, 2006.
 [60] Gianluca Bontempi. Long term time series prediction with multi-input multi-output local learning. In Proceedings of the 2nd European Symposium on Time Series Prediction (ESTSP'08), 2008.
 [61] Souhaib Ben Taieb, Gianluca Bontempi, Antti Sorjamaa, and Amaury Lendasse. Long-term prediction of time series by combining direct and MIMO strategies. In 2009 International Joint Conference on Neural Networks, pages 3054–3061, 2009.
 [62] F. M. Bianchi, E. De Santis, A. Rizzi, and A. Sadeghian. Short-term electric load forecasting using echo state networks and PCA decomposition. IEEE Access, 3:1931–1943, 2015.
 [63] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 3104–3112, 2014.
 [64] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
 [65] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
 [66] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649, May 2013.
 [67] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.
 [68] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949, March 2016.
 [69] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France, 07–09 Jul 2015. PMLR.
 [70] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280, 1989.
 [71] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.
 [72] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1171–1179, Cambridge, MA, USA, 2015. MIT Press.
 [73] Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
 [74] Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. Deep reinforcement learning for sequence to sequence models. CoRR, abs/1805.09461, 2018.
 [75] Henning Wilms, Marco Cupelli, and Antonello Monti. Combining autoregression with exogenous variables in sequence-to-sequence recurrent neural networks for short-term load forecasting. In 2018 IEEE 16th International Conference on Industrial Informatics (INDIN), pages 673–679, 2018.
 [76] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 [77] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [78] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [79] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [80] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
 [81] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
 [82] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
 [83] Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. arXiv preprint arXiv:1702.00783, 2017.
 [84] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114. IEEE, 2017.
 [85] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 [86] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
 [87] Anastasia Borovykh, Sander Bohte, and Kees Oosterlee. Conditional time series forecasting with convolutional neural networks. In Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence, pages 729–730, September 2017.
 [88] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. CoRR, abs/1603.07285, 2016.
 [89] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
 [90] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
 [91] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Largescale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [92] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
 [93] Sihan Li, Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Demystifying ResNet. arXiv preprint arXiv:1611.01186, 2016.
 [94] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 448–456, 2015.
 [95] A. Marinescu, C. Harris, I. Dusparic, S. Clarke, and V. Cahill. Residential electrical demand forecasting in very small scale: An evaluation of forecasting methods. In 2013 2nd International Workshop on Software Engineering Challenges for the Smart Grid (SE4SG), pages 25–32, May 2013.