Smart grids aim at creating automated and efficient energy delivery networks which improve power delivery reliability and quality, along with network security, energy efficiency, and demand-side management aspects. Modern power distribution systems are supported by advanced monitoring infrastructures that produce immense amounts of data, thus enabling fine-grained analytics and improved forecasting performance. In particular, electric load forecasting emerges as a critical task in the energy field, as it supports decision making, optimal pricing strategies, seamless integration of renewables and maintenance cost reductions. Load forecasting is carried out at different time horizons, ranging from milliseconds to years, depending on the specific problem at hand.
In this work we focus on the day-ahead prediction problem, also referred to in the literature as short term load forecasting (STLF). Since deregulation of electric energy distribution and wide adoption of renewables strongly affect daily market prices, STLF is of fundamental importance for efficient power supply. Furthermore, we differentiate forecasting by the granularity level at which it is applied. For instance, in the individual household scenario, load prediction is rather difficult as power consumption patterns are highly volatile. On the contrary, aggregated load consumption, i.e., that associated with a neighborhood, a region, or even an entire state, is normally easier to predict as the resulting signal exhibits slower dynamics.
Historical power loads are time-series affected by several external time-variant factors, such as weather conditions, human activities, and temporal and seasonal characteristics, that make their prediction a challenging problem. A large variety of prediction methods has been proposed for electric load forecasting over the years, and only the most relevant ones are reviewed in this section. Autoregressive moving average models (ARMA) were among the first model families used in short-term load forecasting [4, 5]. Soon they were replaced by ARIMA and seasonal ARIMA models to cope with the time variance often exhibited by load profiles. In order to include exogenous variables like temperature in the forecasting method, model families were extended to ARMAX [7, 8] and ARIMAX. The main shortcoming of these system identification families is the linearity assumption for the system being observed, a hypothesis that does not generally hold. To overcome this limitation, nonlinear models like Feed Forward Neural Networks were proposed and became attractive for scenarios exhibiting significant nonlinearity, as in load forecasting tasks [10, 11, 12, 13, 3]. The intrinsic sequential nature of time series data was then exploited by considering sophisticated techniques ranging from advanced feed forward architectures with residual connections to convolutional approaches [15, 16] and Recurrent Neural Networks [17, 18], along with their many variants such as Echo-state Networks [19, 20, 18], Long Short-Term Memory networks [21, 22, 23, 18] and Gated Recurrent Units [24, 18]. Moreover, some hybrid architectures have also been proposed, aiming to capture the temporal dependencies in the data with recurrent networks while performing a more general feature extraction operation with convolutional layers [25, 26].
| Reference | Predictive Family of Models | Time Horizon | Exogenous Variables | Dataset (Location) |
|---|---|---|---|---|
| | LSTM, GRU, ERNN, NARX, ESN | D | - | Rome, Italy |
| | LSTM, GRU, ERNN, NARX, ESN | D | T | New England |
| | ERNN | H | T, H, P, other | Palermo, Italy |
| | ERNN | H | T, W, H | Hubli, India |
| | ESN | 15min to 1Y | - | Sceaux, France |
| | LSTM | D(?) | C, TI | Australia |
| | LSTM | 2W to 4M | T, W, H, C, TI | France |
| | LSTM | 2D | T, P, H, C, TI | Unknown |
| | LSTM, seq2seq-LSTM | 60 H | C, TI | Sceaux, France |
| | LSTM, seq2seq-LSTM | 12 H | T, C, TI | New England |
| | GRU | D | T, C, other | Dongguan, China |
| | CNN | D | C, TI | Sceaux, France |
| | CNN + LSTM | D | T, C, TI | North-China |
| | CNN + LSTM | D | - | North-Italy |
Different reviews address the load forecasting topic by means of (not necessarily deep) neural networks. One of them focuses on the use of some deep learning architectures for load forecasting. However, that review lacks a comprehensive comparative study of performance verified on common load forecasting benchmarks, and the absence of a valid cost-performance metric does not allow it to make conclusive statements. Another presents an exhaustive overview of recurrent neural networks for short term load forecasting; this very detailed work considers one-layer (not deep) recurrent networks only. A comprehensive summary of the most relevant research dealing with STLF employing recurrent neural networks, convolutional neural networks and seq2seq models is presented in the table above. It emerges that most of the works have been performed on different datasets, making it rather difficult, if not impossible, to assess their absolute performance and, consequently, to recommend the best state-of-the-art solutions for load forecasting.
In this survey we consider the most relevant and recent deep architectures and contrast them in terms of accuracy on open-source benchmarks. The considered architectures include recurrent neural networks, sequence to sequence models and temporal convolutional neural networks. The experimental comparison is performed on two different real-world datasets which are representative of two distinct scenarios. The first one considers power consumption at the individual household level, with a signal characterized by high-frequency components, while the second one takes into account the aggregation of several consumers. Our contributions consist in:
A comprehensive review. The survey provides a comprehensive investigation of deep learning architectures known to the smart grid literature as well as novel recent ones suitable for electric load forecasting.
A multi-step prediction strategy comparison for recurrent neural networks: we study and compare how different prediction strategies can be applied to recurrent neural networks. To the best of our knowledge this work has not been done yet for deep recurrent neural networks.
A relevant performance assessment. To the best of our knowledge, the present work provides the first systematic experimental comparison of the most relevant deep learning architectures for the electric load forecasting problems of individual and aggregated electric demand. It should be noted that envisaged architectures are domain independent and, as such, can be applied in different forecasting scenarios.
The rest of this paper is organized as follows.
In Section 2 we formally introduce the forecasting problems along with the notation that will be used in this work. In Section 3 we introduce Feed Forward Neural Networks (FNNs) and the main concepts relevant to the learning task. We also provide a short review of the literature regarding the use of FNNs for the load forecasting problem.
In Section 4 we provide a general overview of Recurrent Neural Networks (RNNs) and their most advanced architectures: Long Short-Term Memory and Gated Recurrent Unit networks.
In Section 5 Sequence To Sequence architectures (seq2seq) are discussed as a general improvement over recurrent neural networks. We present both simple and advanced models built on the sequence to sequence paradigm.
In Section 6 Convolutional Neural Networks are introduced and one of their most recent variants, the temporal convolutional network (TCN), is presented as the state-of-the-art method for univariate time-series prediction.
The real-world datasets used for model comparison are then presented. For each dataset, we provide a description of the preprocessing operations and of the techniques that have been used to validate the models’ performance.
Finally, we draw conclusions based on the performed assessments.
2 Problem Description
In basic multi-step ahead electric load forecasting, a univariate time series that spans several years is given. In this work, input data are presented to the different predictive families of models as a regressor vector composed of fixed time-lagged data, obtained from a fixed-length window which slides over the time series. Given this fixed-length view of past values, a predictor aims at forecasting the next values of the time series. In this work the forecasting problem is studied as a supervised learning problem. As such, given the input vector of lagged values available at a discrete time step, the forecasting problem requires inferring the measurements that follow it, or a subset of them. To ease the notation we express the input and output vectors in the reference system of the time window instead of that of the time series. The output vector collects the real values to be predicted, while the prediction vector is the one provided by a predictive model whose parameter vector has been estimated by optimizing a performance function.
Without loss of generality, in the remainder of the paper we drop the subscript from the inner elements of the input and output vectors. The introduced notation, along with the sliding window approach, is depicted in Figure .
In certain applications we are additionally provided with exogenous variables (e.g., the temperature), each representing a univariate time series aligned in time with the electricity demand data. In this scenario the components of the regressor vector become vectors themselves: each element of the input sequence groups the scalar load measurement at a given time step with the scalar values of the exogenous features at that same time step.
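The sliding window construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sliding_windows` and the parameter names `window`, `horizon` and `exog` are assumptions introduced here, not notation from this work, and the toy numbers are made up.

```python
def sliding_windows(series, window, horizon, exog=None):
    """Build (input, target) pairs from a load series.

    series:  list of load measurements
    window:  number of past values in each regressor vector (assumed name)
    horizon: number of future values to predict (assumed name)
    exog:    optional list of exogenous series, each aligned with `series`
    """
    exog = exog or []
    inputs, targets = [], []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        stop = start + window
        if exog:
            # each input element becomes [load, exogenous_1, ..., exogenous_k]
            x = [[series[t]] + [e[t] for e in exog] for t in range(start, stop)]
        else:
            x = series[start:stop]
        inputs.append(x)
        targets.append(series[stop:stop + horizon])
    return inputs, targets


# Toy data, purely illustrative.
load = [10, 12, 11, 13, 15, 14, 16]
temperature = [20, 21, 19, 18, 22, 23, 21]

X, Y = sliding_windows(load, window=3, horizon=2)
X_exog, _ = sliding_windows(load, window=3, horizon=2, exog=[temperature])
```

Each entry of `X` is one position of the window over the series and the matching entry of `Y` holds the values that immediately follow it; with exogenous data every input element is itself a small vector.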
The nomenclature used in this work is given in Table .
|window size of the regressor vector|
|time horizon of the forecast|
|number of exogenous variables|
|dilated convolution operation|
|index for the layer|
|hidden state vector|
|model’s vector of parameters|
|number of hidden neurons|
3 Feed Forward Neural Networks
Feed Forward Neural Networks (FNNs) are parametric model families characterized by the universal function approximation property. Their computational architectures are composed of a layered structure consisting of three main building blocks: the input layer, the hidden layer(s) and the output layer. The number of hidden layers determines the depth of the network, while the size of each layer, i.e., the number of hidden units it contains, defines its complexity in terms of neurons. FNNs provide only direct forward connections between two consecutive layers, each connection being associated with a trainable parameter; note that, given the feedforward nature of the computation, no recursive feedback is allowed. More in detail, given a vector fed at the network input, each layer applies an affine transformation to the output of the previous layer, followed by a non-linear activation function, and the output of the last layer is the network’s estimate.
Each layer is characterized by its own parameter matrix and bias vector. Hereafter, in order to ease the notation, we incorporate the bias term in the weight matrix, and we group all the network’s parameters in a single parameter vector.
Given a training set of input-output vector pairs, the learning procedure aims at identifying a suitable configuration of parameters that minimizes a loss function evaluating the discrepancy between the estimated values and the measurements. The mean squared error is a very popular loss function for time series prediction and, not rarely, a regularization penalty term is added to it to prevent overfitting and improve the generalization capabilities of the model. The most used regularization scheme controlling model complexity is L2 regularization, whose strength is controlled by a suitable hyper-parameter.
As the resulting optimization problem is not convex, the solution cannot be obtained in closed form with linear equation solvers or convex optimization techniques. Parameter estimation (the learning procedure) operates iteratively, e.g., by leveraging the gradient descent approach: at each iteration the parameters are moved in the direction opposite to the gradient of the loss with respect to them, with a step size set by the learning rate. Popular learning procedures include the method of [38], Adagrad and Adam. The learning procedure yields the parameter estimate associated with the resulting predictive model.
In our work, deep FNNs are the baseline model architectures.
In multi-step ahead prediction the output layer dimension coincides with the forecasting horizon . The dimension of the input vector depends also on the presence of exogenous variables; this aspect is further discussed in Section .
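To make the learning procedure concrete, the following is a minimal sketch of a one-hidden-layer FNN trained by plain gradient descent on a mean squared error with an L2 penalty. All names (`forward`, `loss`, `gradient_step`, `LAM`, `LR`) and the toy data are assumptions made for illustration. The gradient is estimated by finite differences only to keep the sketch short; practical implementations rely on backpropagation and on optimizers such as those mentioned above.

```python
import math
import random


def forward(theta, x, n_hidden, n_out):
    """One-hidden-layer FNN; biases are folded into the weights via a constant input of 1."""
    xb = list(x) + [1.0]
    n_in = len(xb)
    w1, w2 = theta[:n_hidden * n_in], theta[n_hidden * n_in:]
    hidden = [math.tanh(sum(w1[j * n_in + i] * xb[i] for i in range(n_in)))
              for j in range(n_hidden)]
    hb = hidden + [1.0]
    return [sum(w2[k * len(hb) + j] * hb[j] for j in range(len(hb)))
            for k in range(n_out)]


def loss(theta, data, n_hidden, n_out, lam):
    """Mean squared error plus an L2 penalty weighted by `lam`."""
    sq_err, count = 0.0, 0
    for x, y in data:
        y_hat = forward(theta, x, n_hidden, n_out)
        sq_err += sum((a - b) ** 2 for a, b in zip(y_hat, y))
        count += len(y)
    return sq_err / count + lam * sum(p * p for p in theta)


def gradient_step(theta, data, n_hidden, n_out, lam, lr, eps=1e-5):
    """One gradient descent update; the gradient is estimated by central differences."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, data, n_hidden, n_out, lam)
                     - loss(down, data, n_hidden, n_out, lam)) / (2 * eps))
    return [p - lr * g for p, g in zip(theta, grad)]


# Toy data: windows of 3 past values, 2 future values predicted at once.
data = [([0.1, 0.2, 0.3], [0.4, 0.5]),
        ([0.2, 0.3, 0.4], [0.5, 0.6]),
        ([0.3, 0.4, 0.5], [0.6, 0.7])]
N_HIDDEN, N_OUT, LAM, LR = 4, 2, 1e-4, 0.05

random.seed(0)
n_params = N_HIDDEN * (3 + 1) + N_OUT * (N_HIDDEN + 1)
theta = [random.uniform(-0.5, 0.5) for _ in range(n_params)]

initial_loss = loss(theta, data, N_HIDDEN, N_OUT, LAM)
for _ in range(100):
    theta = gradient_step(theta, data, N_HIDDEN, N_OUT, LAM, LR)
final_loss = loss(theta, data, N_HIDDEN, N_OUT, LAM)
```

Note that the output layer has as many units as values to forecast, matching the remark that the output dimension coincides with the forecasting horizon.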
3.1 Related Work
The use of Feed Forward Neural Networks in short term load forecasting dates back to the 1990s. One early work proposes a shallow neural network with a single hidden layer to provide a 24-hour forecast using both load and temperature information. In another, the one-day-ahead forecast is implemented using two different prediction strategies: one network provides all 24 forecast values in a single shot (MIMO strategy), while another single-output network provides the day-ahead prediction by recursively feeding back its last value estimate (recurrent strategy). The recurrent strategy proves to be more efficient in terms of both training time and forecasting accuracy. A further work presents a feed forward neural network to forecast electric loads on a weekly basis. The sparsely connected feed forward architecture receives the load time-series, temperature readings, as well as the time and day of the week. It is shown that the extra information improves the forecast accuracy compared to an ARIMA model trained on the same task. One of the first multi-layer FNNs to forecast the hourly load of a power system has also been presented.
A detailed review concerning applications of artificial neural networks in short-term load forecasting is available in the literature. However, this survey dates back to the early 2000s and does not discuss deep models. More recently, architectural variants of feed forward neural networks have been used; for example, a ResNet-inspired model has been used to provide day-ahead forecasts by leveraging a very deep architecture. That article shows a significant improvement on aggregated load forecasting when compared to other (non-neural) regression models on different datasets.
4 Recurrent Neural Networks
In this section we overview recurrent neural networks and, in particular, the Elman network architecture, Long Short-Term Memory and Gated Recurrent Unit networks. Afterwards, we introduce deep recurrent neural networks and discuss different strategies to perform multi-step ahead forecasting. Finally, we present related work in short-term load forecasting that leverages recurrent networks.
4.1 Elman RNNs (ERNN)
Elman Recurrent Neural Networks (ERNN) were proposed to generalize feedforward neural networks for better handling ordered data sequences like time-series.
The reason behind the effectiveness of RNNs in dealing with sequences of data comes from their ability to learn a compact representation of the input sequence by means of a recurrent function that maps the previous hidden state and the current input to the new hidden state.
Given a sequence of inputs, at each timestep the new hidden state is obtained by applying an activation function (generally the hyperbolic tangent) to the sum of a linear transformation of the previous hidden state and a linear transformation of the current input, while the output is obtained from the hidden state through a function that is normally linear. Three weight matrices are involved, for the hidden-hidden, input-hidden and hidden-output connections respectively. The computation of a single module in an Elman recurrent neural network is depicted in Figure .
It can be noted that an ERNN processes one element of the sequence at a time, preserving its inherent temporal order. After reading an element from the input sequence the network updates its internal state using both (a transformation of) the latest state and (a transformation of) the current input (Equation ). The described process can be better visualized as an acyclic graph obtained from the original cyclic graph (left side of Figure ) via an operation known as time unfolding (right side of Figure ). It is of fundamental importance to point out that all nodes in the unfolded network share the same parameters, as they are just replicas distributed over time.
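The recurrent update just described can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ernn_forward`, `w_hh`, `w_xh` and `w_hy` are chosen here for the hidden-hidden, input-hidden and hidden-output matrices, bias terms are omitted, and the weights are arbitrary toy values.

```python
import math


def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def ernn_forward(inputs, w_hh, w_xh, w_hy, h0):
    """Run an Elman RNN over a sequence and return (outputs, final hidden state).

    inputs: list of input vectors, one per timestep
    w_hh, w_xh, w_hy: hidden-hidden, input-hidden, hidden-output weight matrices
    """
    h = list(h0)
    outputs = []
    for x in inputs:  # the same weights are reused at every timestep
        pre = [a + b for a, b in zip(matvec(w_hh, h), matvec(w_xh, x))]
        h = [math.tanh(p) for p in pre]   # new state from old state and current input
        outputs.append(matvec(w_hy, h))   # linear read-out of the hidden state
    return outputs, h


# Toy weights: hidden size 2, input size 1, output size 1.
w_hh = [[0.5, 0.0], [0.0, 0.5]]
w_xh = [[1.0], [-1.0]]
w_hy = [[1.0, 0.0]]

outputs, h_last = ernn_forward([[0.1], [0.2], [0.3]], w_hh, w_xh, w_hy, [0.0, 0.0])
_, h_one = ernn_forward([[0.1]], w_hh, w_xh, w_hy, [0.0, 0.0])
```

The loop is the unfolded network: each iteration corresponds to one replica in the time-unfolded graph, and all replicas share `w_hh`, `w_xh` and `w_hy`.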
The parameters of the network are usually learned via Backpropagation Through Time (BPTT) [46, 47], a generalized version of standard Backpropagation. In order to apply gradient-based optimization, the recurrent neural network has to be transformed through the unfolding procedure shown in Figure . In this way, the network is converted into an FNN having as many layers as time intervals in the input sequence, with each layer constrained to have the same weight matrices. In practice, Truncated Backpropagation Through Time (TBPTT) is used. The method, which is characterized by two parameters, processes the input window one timestep at a time and runs BPTT for a fixed number of timesteps every fixed number of steps. Notice that backpropagating through fewer timesteps than the window length does not limit the memory capacity of the network, as the hidden state incorporates information taken from the whole sequence. Despite that, setting the number of backpropagated timesteps to a very low value may result in poor performance. In the literature, BPTT is regarded as a special case of TBPTT. In this work we used epoch-wise Truncated BPTT, meaning that the weight update is performed once a whole sequence has been processed.
Despite the model’s simplicity, Elman RNNs are hard to train due to the ineffectiveness of gradient (back)propagation. In fact, the propagation of the gradient is effective for short-term connections but is very likely to fail for long-term ones, where the gradient norm usually shrinks to zero or diverges. These two behaviours are known as the vanishing gradient and the exploding gradient problems [49, 50] and have been extensively studied in the machine learning community.
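A scalar toy recurrence is enough to see both behaviours. The sketch below is not the analysis of the cited works: it assumes a one-dimensional state with a single recurrent weight `w`, and the function name `state_gradient` is hypothetical. It tracks how the derivative of the final state with respect to the initial one is a product of one local factor per timestep.

```python
import math


def state_gradient(w, h0, steps, linear=False):
    """Derivative of the final state w.r.t. the initial state for a scalar recurrence.

    linear=False: h_t = tanh(w * h_{t-1});  linear=True: h_t = w * h_{t-1}
    """
    h, grad = h0, 1.0
    for _ in range(steps):
        pre = w * h
        if linear:
            h, local = pre, w
        else:
            h = math.tanh(pre)
            local = w * (1.0 - h * h)   # chain rule through tanh
        grad *= local                   # one multiplicative factor per timestep
    return grad


vanishing = state_gradient(0.5, 0.1, steps=50)               # every factor is at most 0.5
exploding = state_gradient(1.5, 0.1, steps=50, linear=True)  # every factor is 1.5
```

Since the factors multiply, a recurrent weight with magnitude below one drives the gradient towards zero over long spans, while a magnitude above one (without a saturating activation) makes it grow geometrically.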
4.2 Long Short-Term Memory (LSTM)
Recurrent neural networks with Long Short-Term Memory (LSTM) were introduced to cope with the vanishing and exploding gradient problems occurring in ERNNs and, more generally, in standard RNNs. LSTM networks maintain the same topological structure as ERNNs but differ in the composition of the inner module, or cell.
Each LSTM cell has the same input and output as an ordinary ERNN cell but, internally, it implements a gated system that controls the neural information processing (see Figure and ). The key feature of gated networks is their ability to control the gradient flow by acting on the gate values; this makes it possible to tackle the vanishing gradient problem, as an LSTM can maintain its internal memory unaltered for long time intervals. Notice that the inner state of the network is a linear combination of the old state and the new candidate state: part of the old state is preserved and flows forward, whereas in the ERNN the state value is completely replaced at each timestep.
In detail, the neural computation relies on weight matrices and bias vectors that are parameters to be learned; gate values and states are combined through the Hadamard product, the gate activation is generally a sigmoid, while the remaining activation can be any non-linear one (hyperbolic tangent in the original paper). The cell state encodes the information learned so far from the input sequence. At each timestep the flow of information within the unit is controlled by three elements called gates: the forget gate controls the cell state’s content and changes it when obsolete, the input gate controls which state values will be updated and by how much, and finally the output gate produces a filtered version of the cell state and serves it as the network’s output.
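As an illustration of the gating mechanism, here is a minimal sketch of a single LSTM step in its commonly used formulation. The symbols and the parameter layout (one weight matrix and bias per gate, applied to the concatenation of the current input and the previous hidden state) are assumptions of this sketch and may differ from the notation of this work.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, b, z):
    """Compute w @ z + b on plain Python lists."""
    return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to a (weights, bias) pair."""
    z = list(x) + list(h_prev)                        # current input and previous output
    f = [sigmoid(v) for v in affine(*params["forget"], z)]
    i = [sigmoid(v) for v in affine(*params["input"], z)]
    o = [sigmoid(v) for v in affine(*params["output"], z)]
    g = [math.tanh(v) for v in affine(*params["candidate"], z)]
    # new cell state: gated mix of the old state and the candidate (element-wise products)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]  # filtered view of the cell state
    return h, c


# Toy parameters: forget gate saturated open, input gate saturated closed.
params = {
    "forget": ([[0.0, 0.0]], [50.0]),
    "input": ([[0.0, 0.0]], [-50.0]),
    "output": ([[0.0, 0.0]], [0.0]),
    "candidate": ([[1.0, 1.0]], [0.0]),
}
h_new, c_new = lstm_step([0.3], [0.0], [0.7], params)
```

With the forget gate open and the input gate closed, the cell state passes through the step unchanged, which is the mechanism that lets the memory (and the gradient) survive over long intervals.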
4.3 Gated Recurrent Units (GRU)
Gated Recurrent Units (GRUs) are a simplified variant of LSTM and, as such, belong to the family of gated RNNs. GRUs distinguish themselves from LSTMs by merging into a single gate the functionalities controlled by the forget gate and the input gate. This kind of cell ends up having just two gates, which results in a more parsimonious architecture compared to LSTM which, instead, has three gates.
The basic components of a GRU cell are outlined in Figure . The neural computation relies on weight matrices and bias vectors that are the parameters to be learned; the gate activation is generally a sigmoid, while the remaining activation can be any kind of non-linearity (in the original work it was a hyperbolic tangent). The two gates are the update gate and the reset gate. Several works in the natural language processing community show that GRUs perform comparably to LSTM but generally train faster due to the lighter computation [52, 53].
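A single GRU step can be sketched in the same spirit. The formulation below is one common convention and is an assumption of this sketch: the update gate weights the candidate state and its complement weights the previous state (some references use the opposite assignment), and all names are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, b, z):
    """Compute w @ z + b on plain Python lists."""
    return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]


def gru_step(x, h_prev, params):
    """One GRU step; params maps a name to a (weights, bias) pair."""
    z_in = list(x) + list(h_prev)
    update = [sigmoid(v) for v in affine(*params["update"], z_in)]
    reset = [sigmoid(v) for v in affine(*params["reset"], z_in)]
    # the reset gate decides how much of the old state enters the candidate
    gated = list(x) + [r * h for r, h in zip(reset, h_prev)]
    cand = [math.tanh(v) for v in affine(*params["candidate"], gated)]
    # a single update gate plays the role of both the forget and the input gate
    return [(1.0 - u) * h + u * c for u, h, c in zip(update, h_prev, cand)]


base = {"reset": ([[0.0, 0.0]], [0.0]), "candidate": ([[1.0, 1.0]], [0.0])}
keep = dict(base, update=([[0.0, 0.0]], [-50.0]))       # update gate closed
overwrite = dict(base, update=([[0.0, 0.0]], [50.0]))   # update gate open

h_keep = gru_step([0.3], [0.4], keep)
h_over = gru_step([0.3], [0.4], overwrite)
```

With the update gate closed the previous state is carried over untouched; with it open the state is replaced by the candidate, here computed from the input and half of the old state because the reset gate sits at 0.5.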
4.4 Deep Recurrent Neural Networks
All recurrent architectures presented so far are characterized by a single layer. In turn, this implies that the computation is composed of an affine transformation followed by a non-linearity. That said, the concept of depth in RNNs is less straightforward than in feed-forward architectures. Indeed, the latter become deep when the input is processed by a large number of non-linear transformations before generating the output values. According to this definition, however, an unfolded RNN is already a deep model given its multiple non-linear processing layers. Moreover, deep multi-level processing can be applied to all the transition functions (input-hidden, hidden-hidden, hidden-output), as there are no intermediate layers involved in these computations. Depth can also be introduced in recurrent neural networks by stacking recurrent layers one on top of the other. As this deep architecture is more intriguing, in this work we refer to it as a Deep RNN. By iterating the RNN computation across layers, each recurrent layer keeps its own hidden state at every timestep and takes as input the hidden state of the layer below, the first layer reading the input sequence itself. It has been empirically shown in several works that Deep RNNs are better at capturing the temporal hierarchy exhibited by time-series than their shallow counterparts [54, 56, 57]. Of course, hybrid architectures having different layers, recurrent or not, can be considered as well.
4.5 Multi-Step Prediction Schemes
There are five different architecture-independent strategies for multi-step ahead forecasting:
Recursive strategy (Rec)
a single model is trained to perform a one-step ahead forecast given the input sequence. Subsequently, during the operational phase, the forecasted output is recursively fed back and treated as if it were the correct one. By iterating this procedure once per step of the forecast horizon, we generate all the required forecast values. At each iteration the first element of the input vector is dropped and the latest scalar output is appended, concatenated with the exogenous input variables when these are used. The procedure is described in Algorithm .
To summarize, the predictor receives as input a vector whose length equals the window size and outputs a scalar value.
Direct strategy
design a set of independent predictors, one per step of the forecast horizon, each providing the forecast for its own time step. Similarly to the recursive strategy, each predictor outputs a scalar value, but the input vector is the same for all the predictors. Algorithm details the procedure.
DirRec strategy
a combination of the above two strategies. Similarly to the direct approach, one model per forecast step is used, but here each predictor leverages an enlarged input set, obtained by adding the forecasts of the previous timesteps. The procedure is detailed in Algorithm .
MIMO strategy (Multiple input - Multiple output)
a single predictor is trained to forecast a whole output sequence in one shot, i.e., differently from the previous cases, the output of the model is not a scalar but a vector whose length equals the forecast horizon.
DIRMO strategy
a trade-off between the Direct strategy and the MIMO strategy. It divides the multi-step forecast into smaller forecasting problems of equal length, each solved by its own multi-output predictor.
Given the considerable computational demand required by RNNs during training, we focus on the multi-step forecasting strategies that are computationally cheaper, specifically the Recursive and MIMO strategies. We will call them RNN-Rec and RNN-MIMO.
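The difference between the two retained strategies can be sketched independently of the underlying network. In the sketch below the predictors are plain Python callables standing in for trained models; the function names and the toy linear-extrapolation models are hypothetical and serve only to show the data flow.

```python
def recursive_forecast(one_step_model, window, horizon):
    """Recursive strategy: each one-step forecast is fed back as if it were observed."""
    window = list(window)
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step_model(window)   # scalar output
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]    # drop the oldest value, append the forecast
    return forecasts


def mimo_forecast(multi_output_model, window, horizon):
    """MIMO strategy: one call returns the whole forecast vector."""
    y_hat = list(multi_output_model(window))
    if len(y_hat) != horizon:
        raise ValueError("the model must output one value per forecast step")
    return y_hat


# Toy stand-ins for trained models: linear extrapolation of the last two values.
def one_step(w):
    return w[-1] + (w[-1] - w[-2])


def three_steps(w):
    return [w[-1] + k * (w[-1] - w[-2]) for k in (1, 2, 3)]


rec = recursive_forecast(one_step, [1, 2, 3], horizon=3)
mimo = mimo_forecast(three_steps, [1, 2, 3], horizon=3)
```

The recursive variant calls the model once per forecast step and can therefore propagate its own errors, whereas the MIMO variant needs a single call but a model with a vector-valued output.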
Given the hidden state of the network, the hidden-output mapping is obtained through a fully connected layer on top of the recurrent neural network. The objective of this dense network is to learn the mapping between the last state of the recurrent network, which represents a kind of lossy summary of the task-relevant aspects of the input sequence, and the output domain. This holds for all the presented recurrent networks and is consistent with Equation . In this work RNN-Rec and RNN-MIMO differ in the cardinality of the output domain, which is one for the former and equal to the forecast horizon for the latter. The objective function is:
4.6 Related Work
An Elman recurrent neural network has been considered to provide hourly load forecasts. The study also compares the performance of the network when additional weather information, such as temperature and humidity, is fed to the model. The authors conclude that, as expected, the recurrent network benefits from multi-input data and, in particular, from weather data. Another work makes use of an ERNN to forecast household electric consumption obtained from a suburban area in the neighbourhood of Palermo (Italy). In addition to the historical load measurements, the authors introduce several features to enhance the model’s predictive capabilities. Besides the weather and the calendar information, a specific ad-hoc index was created to assess the influence of the use of air-conditioning equipment on the electricity demand. In recent years, LSTMs have been adopted in short term load forecasting, proving to be more effective than traditional time-series analysis methods. LSTM has been shown to outperform traditional forecasting methods, being able to exploit the long-term dependencies in the time series to forecast the day-ahead load consumption. Several works proved to be successful in enhancing the recurrent neural network capabilities by employing multivariate input data. One proposes a deep, LSTM-based architecture that uses past measurements of the whole household consumption along with some measurements from selected appliances to forecast the consumption of the subsequent time interval (i.e., a one-step prediction). In another, an LSTM-based network is trained using a multivariate input which includes temperature, holiday/working day information, and date and time information. Similarly, a power demand forecasting model based on LSTM shows an accuracy improvement compared to more traditional machine learning techniques such as Gradient Boosting Trees and Support Vector Regression.
GRUs have not been used much in the literature, as LSTM networks are often preferred. That said, the use of GRU-based networks has been reported, and a more recent study uses GRUs for the daily consumption forecast of individual customers. Investigating deep GRU-based architectures is thus a relevant scientific topic, also thanks to their faster convergence and simpler structure compared to LSTM.
Despite all these promising results, an extensive study of recurrent neural networks, and in particular of ERNN, LSTM, GRU, ESN and NARX, concludes that none of the investigated recurrent architectures manages to outperform the others in all considered experiments. Moreover, the authors noticed that recurrent cells with gating mechanisms like LSTM and GRU perform comparably to the much simpler ERNN. This may indicate that in short-term load forecasting the gating mechanism may be unnecessary; this issue is further investigated, and evidence found, in the present work.
5 Sequence To Sequence models
Sequence to sequence (seq2seq) architectures were initially designed to overcome the inability of RNNs to produce output sequences of arbitrary length. The architecture was first used in neural machine translation [64, 65, 45] but has emerged as the gold standard in different fields such as speech recognition [66, 67, 68] and image captioning.
The core idea of this general framework is to employ two networks, resulting in an encoder-decoder architecture. The first neural network (possibly deep), the encoder, reads the input sequence one timestep at a time; the computation generates a fixed-dimensional, generally lossy, vector representation of it. This embedded representation is usually called context in the literature and can be the last hidden state of the encoder or a function of it. Then, a second neural network, the decoder, learns how to produce the output sequence given the context vector. The schematic of the whole architecture is depicted in Figure .
The encoder and the decoder modules are generally two recurrent neural networks trained end-to-end to minimize an objective function that combines the discrepancy between the decoder’s estimates and the real measurements with a regularization term. At each timestep the decoder’s estimate depends on the decoder’s last state, on the context vector produced by the encoder from the input sequence, and on the decoder’s input. The training procedure for this type of architecture is called teacher forcing. During training, the decoder’s input at each timestep is a ground-truth value rather than a previous estimate; it is used to generate the next state and, then, the next estimate. During inference the true values are unavailable and are replaced by the decoder’s own estimates.
This discrepancy between training and testing results in errors accumulating over time during inference. In the literature this problem is often referred to as exposure bias. Several solutions have been proposed to address it: scheduled sampling is a curriculum learning strategy that gradually changes the training process by switching the decoder’s inputs from ground-truth values to the model’s predictions. The professor forcing algorithm uses an adversarial framework to encourage the dynamics of the recurrent network to be the same both at training and at operational (test) time. Finally, in recent years, reinforcement learning methods have been adopted to train sequence to sequence models; a comprehensive review is available in the literature.
In this work we investigate two sequence to sequence architectures, one trained via teacher forcing (TF) and one using self-generated (SG) samples. The former feeds the decoder with ground-truth values during training and with its own estimates during prediction. The latter feeds the decoder with its own estimates both during training and during prediction. The decoder’s dynamics are summarized in Figure . It is clear that the two training procedures differ in the decoder’s input source: ground-truth values in teacher forcing, estimated values in self-generated training.
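The only difference between the two schemes is where the decoder's next input comes from, which the following sketch isolates. The `step` callable is a hypothetical stand-in for one decoder update (it is not an interface defined in this work), and the toy decoder and numbers are made up for illustration.

```python
def run_decoder(step, context, first_input, horizon, targets=None):
    """Unroll a decoder for `horizon` steps.

    step:    callable (state, decoder_input) -> (new_state, estimate); hypothetical interface
    context: initial decoder state, e.g. the encoder's summary of the input sequence
    targets: ground-truth values; when given, teacher forcing is used
    """
    state, dec_input = context, first_input
    estimates = []
    for t in range(horizon):
        state, y_hat = step(state, dec_input)
        estimates.append(y_hat)
        # teacher forcing: the next input is the true value;
        # self-generated: the next input is the decoder's own estimate
        dec_input = targets[t] if targets is not None else y_hat
    return estimates


# Toy decoder: constant state, estimate is half the input plus the state.
def toy_step(state, u):
    return state, 0.5 * u + state


teacher_forced = run_decoder(toy_step, 1.0, 2.0, horizon=3, targets=[4.0, 4.0, 4.0])
self_generated = run_decoder(toy_step, 1.0, 2.0, horizon=3)
```

At prediction time no targets exist, so both architectures run the loop in its self-generated form; they differ only in which form was used during training.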
5.1 Related Work
Only recently have seq2seq models been adopted in short term load forecasting. An LSTM-based encoder-decoder model has been shown to produce superior performance compared to a standard LSTM. Another work introduces an adaptation of RNN-based sequence-to-sequence architectures for time-series forecasting of electrical loads and demonstrates its better performance with respect to a suite of models ranging from standard RNNs to classical time series techniques.
6 Convolutional Neural Networks
Convolutional Neural Networks (CNNs)  are a family of neural networks designed to work with data that can be structured in a grid-like topology. CNNs were originally used on two dimensional and three-dimensional images, but they are also suitable for one-dimensional data such as univariate time-series. Once recognized as a very efficient solution for image recognition and classification [77, 78, 79, 42]
, CNNs have experienced wide adoption in many different computer vision tasks[80, 81, 82, 83, 84]. Moreover, sequence modeling tasks, like short term electric load forecasting, have been mainly addressed with recurrent neural networks, but recent research indicates that convolutional networks can also attain state-of-the-art-performance in several applications including audio generation , machine translation  and time-series prediction .
As the name suggests, these kind of networks are based on a discrete convolution operator that produces an output feature map by sliding a kernel over the input . Each element in the output feature map is obtained by summing up the result of the element-wise multiplication between the input patch (i.e., a slice of the input having the same dimensionality of the kernel) and the kernel. The number of kernels (filters)
used in a convolutional layer determines the depth of the output volume (i.e., the number of output feature maps). To control the other spatial dimensions of the output feature maps two hyper-parameters are used: stride and padding. Stride represents the distance between two consecutive input patches and can be defined for each direction of motion. Padding refers to the possibility of implicitly enlarging the inputs by adding (usually) zeros at the borders to control the output size w.r.t the input one. Indeed, without padding, the dimensionality of the output would be reduced after each convolutional layer.
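Considering a 1D time-series $\mathbf{x} \in \mathbb{R}^{T}$ and a one-dimensional kernel $\mathbf{w} \in \mathbb{R}^{k}$, the $i$-th element of the convolution between $\mathbf{x}$ and $\mathbf{w}$ is:
$$(\mathbf{x} * \mathbf{w})(i) = \sum_{j=0}^{k-1} x(i-j)\, w(j)$$
with $(\mathbf{x} * \mathbf{w}) \in \mathbb{R}^{T-k+1}$ if no zero-padding is used; otherwise padding matches the input dimensionality, i.e., $(\mathbf{x} * \mathbf{w}) \in \mathbb{R}^{T}$. The equation above refers to the one-dimensional input case but can be easily extended to multi-dimensional inputs (e.g., images, where the input has two or three dimensions) . The reason behind the success of these networks can be summarized in the following three points: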
local connectivity: each hidden neuron is connected to a subset of input neurons that are close to each other (according to a specific spatio-temporal metric). This property allows the network to drastically reduce the number of parameters to learn (w.r.t. a fully connected network) and facilitates computations.
parameter sharing: the weights used to compute the output neurons in a feature map are the same, so that the same kernel is used at each location. This further reduces the number of parameters to learn.
translation equivariance: a shift of the input produces a corresponding shift of the output feature map, so that a pattern is detected regardless of where it occurs in the input.
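The convolution, stride and padding described above can be illustrated with a minimal NumPy sketch. The function name `conv1d` and its arguments are hypothetical, and the choice of placing all zero-padding before the first sample (rather than on both borders) is an assumption made here for simplicity.

```python
import numpy as np

def conv1d(x, w, stride=1, zero_pad=False):
    """Discrete 1D convolution: y[i] = sum_j x[i - j] * w[j].

    With zero_pad=True, k - 1 zeros are prepended (an assumption of this
    sketch) so that, for stride 1, the output is as long as the input.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    if zero_pad:
        x = np.concatenate([np.zeros(k - 1), x])
    # Each output element multiplies one input patch by the flipped kernel.
    # The same weights are reused at every location (parameter sharing).
    starts = range(0, len(x) - k + 1, stride)
    return np.array([np.dot(x[s:s + k], w[::-1]) for s in starts])

x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
valid = conv1d(x, w)                      # length T - k + 1 = 3
padded = conv1d(x, w, zero_pad=True)      # length T = 5
# Translation equivariance: delaying the input by one step delays the output.
shifted = conv1d([0] + x, w, zero_pad=True)
```

In this toy case `shifted[1:]` coincides with `padded`, which is the equivariance property listed above, and the kernel holds only three weights regardless of the input length.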
In our work we focus on a convolutional architecture inspired by Wavenet , a fully probabilistic and autoregressive model used for generating raw audio wave-forms and extended to time-series prediction tasks. To the best of the authors' knowledge, this architecture has never been proposed to forecast the electric load. A recent empirical comparison between temporal convolutional networks and recurrent networks has been carried out in  on tasks such as polyphonic music and character-level sequence modelling. The authors were the first to use the name Temporal Convolutional Networks (TCNs) to indicate convolutional networks which are autoregressive, able to process sequences of arbitrary length and output a sequence of the same length. To achieve this, the network has to employ causal (dilated) convolutions, and residual connections should be used to handle a very long history size.
Dilated Causal Convolution (DCC)
Since TCNs are a family of autoregressive models, the estimated value at time $t$ must depend only on past samples and not on future ones (Figure ). To achieve this behavior in a Convolutional Neural Network the standard convolution operator is replaced by a causal convolution. Moreover, zero-padding of length (filter size - 1) is added to ensure that each layer has the same length as the input layer. To further enhance the network capabilities, dilated causal convolutions are used, which increase the receptive field of the network (i.e., the number of input neurons to which the filter is applied) and its ability to learn long-term dependencies in the time-series. Given a one-dimensional input $\mathbf{x} \in \mathbb{R}^{T}$ and a kernel $\mathbf{w} \in \mathbb{R}^{k}$, the $i$-th element of a dilated convolution output using a dilation factor $d$ becomes:
$$(\mathbf{x} *_{d} \mathbf{w})(i) = \sum_{j=0}^{k-1} x(i - d\,j)\, w(j)$$
This is a major advantage w.r.t. simple causal convolutions, as in the latter case the receptive field grows linearly with the depth of the network while with dilated convolutions the dependence is exponential , ensuring that a much larger history size is used by the network.
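A minimal NumPy sketch of a dilated causal convolution, together with the receptive field of a stack of such layers, may help to make the linear versus exponential growth concrete. The function names are hypothetical; the sketch assumes one convolution per layer and left zero-padding of length $d(k-1)$, which reduces to (filter size - 1) when $d = 1$.

```python
import numpy as np

def dilated_causal_conv1d(x, w, d=1):
    """y[i] = sum_j w[j] * x[i - d*j], with zeros assumed before the series starts."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    pad = d * (k - 1)                     # left padding keeps len(y) == len(x)
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            y[i] += w[j] * xp[pad + i - d * j]
    return y

def receptive_field(kernel_size, dilations):
    """Input samples influencing one output of the stacked layers (one conv per layer)."""
    return 1 + (kernel_size - 1) * sum(dilations)

y = dilated_causal_conv1d(np.arange(8), [1.0, 1.0], d=2)   # y[i] = x[i] + x[i-2]
rf_plain = receptive_field(2, [1, 1, 1, 1])     # 5: grows linearly with depth
rf_dilated = receptive_field(2, [1, 2, 4, 8])   # 16: doubles with every layer
```

Because only indices $i - d\,j \le i$ are read, modifying a future sample never changes an earlier output, which is the causality requirement stated above.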
Despite the implementation of dilated convolution, the CNN still needs a large number of layers to learn the dynamics of the inputs. Moreover, performance often degrades as the network depth increases. The degradation problem was first addressed in  where the authors propose a deep residual learning framework. The authors observe that for an $L$-layer network with a training error $\epsilon$, inserting $m$ extra layers on top of it should either leave the error unchanged or improve it. Indeed, in the worst case scenario, the $m$ new stacked non-linear layers should learn the identity mapping $\mathbf{y} = \mathcal{H}(\mathbf{x}) = \mathbf{x}$, where $\mathbf{x}$ is the output of the network having $L$ layers and $\mathbf{y}$ is the output of the network with $L + m$ layers. Although almost trivial, in practice, neural networks experience problems in learning this identity mapping. The proposed solution suggests that these stacked layers fit a residual mapping $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$ instead of the desired one, $\mathcal{H}(\mathbf{x})$. The original mapping is recast into $\mathcal{F}(\mathbf{x}) + \mathbf{x}$, which is realized by feed forward neural networks with shortcut connections; in this way the identity mapping is learned by simply driving the weights of the stacked layers to zero.
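The role of the shortcut connection can be seen in a small NumPy sketch. The block below is a hypothetical two-layer residual block without biases, chosen for illustration only and not the exact block used in the cited work.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Return F(x) + x, where F is two stacked layers (biases omitted)."""
    f = W2 @ relu(W1 @ x)      # residual mapping F(x)
    return f + x               # shortcut connection adds the input back

x = np.array([0.5, -1.0, 2.0])
zeros = np.zeros((3, 3))
out = residual_block(x, zeros, zeros)   # zero weights give F(x) = 0, so out == x
```

With all weights at zero the block reproduces its input exactly, whereas a plain stack of the same two layers would have to learn the identity through its non-linearities.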
By means of the two aforementioned principles, the temporal convolutional network is able to exploit a large history size in an efficient manner. Indeed, as observed in , these models present several computational advantages compared to RNNs. In fact, they have lower memory requirements during training, and the predictions for later timesteps are not computed sequentially but in parallel, exploiting parameter sharing. Moreover, TCN training is much more stable than RNN training, as it avoids the exploding/vanishing gradient problem. For all the above reasons, TCNs have proven to be a promising area of research for time series prediction problems and here we aim to assess their forecasting performance w.r.t. state-of-the-art models in short-term load forecasting. The architecture used in our work is depicted in Figure , which is, except for some minor modifications, the network structure detailed in . In the first layer of the network we process separately the load information and, when available, the exogenous information such as temperature readings. The results are then concatenated and processed by a deep residual network with $L$ layers. Each layer consists of a residual block with a 1D dilated causal convolution, a rectified linear unit (ReLU) activation and finally dropout to prevent overfitting. The output layer consists of a 1x1 convolution, which allows the network to output a one-dimensional vector having the same dimensionality as the input vector. To approach multi-step forecasting, we adopt a MIMO strategy.
6.1 Related Work
In the literature on short-term load forecasting, CNNs have not been studied to a large extent. Indeed, until recently, these models were not considered for any time-series related problem. Still, several works tried to address the topic; in  a deep convolutional neural network model named DeepEnergy is presented. The proposed network is inspired by the first architectures used in the ImageNet challenge, alternating convolutional and pooling layers and halving the width of the feature map after each step. According to the provided experimental results, DeepEnergy can precisely predict the energy load over the next three days, outperforming five other machine learning algorithms including LSTM and FNN. In  a CNN is compared to recurrent and feed forward approaches, showing promising results on a benchmark dataset. In  a hybrid approach involving both convolutional and recurrent architectures is presented. The authors integrate different input sources and use convolutional layers to extract meaningful features from the historic load, while the main task of the recurrent network is to learn the system's dynamics. The model is evaluated on a large dataset containing hourly loads from a city in North China and is compared with a three-layer feed forward neural network. A different hybrid approach is presented in , where the authors process the load information in parallel with a CNN and an LSTM. The features generated by the two networks are then used as input for a final (fully connected) prediction network in charge of forecasting the day-ahead load.
7 Performance Assessment
In this section we perform evaluation and assessment of all the presented architectures. The testing is carried out by means of three use cases that are based on two different datasets used as benchmarks. We first introduce the performance metrics that we considered for both network optimization and testing, then describe the datasets that have been used and finally we discuss results.
7.1 Performance Metrics
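The efficiency of the considered architectures has been measured and quantified using widely adopted error metrics. Specifically, we adopted the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1}\frac{1}{n_O}\sum_{t=0}^{n_O-1}\left(\hat{y}_i[t]-y_i[t]\right)^2} = \sqrt{\left\langle \frac{1}{n_O}\lVert \hat{\mathbf{y}}-\mathbf{y}\rVert_2^2 \right\rangle}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=0}^{N-1}\frac{1}{n_O}\sum_{t=0}^{n_O-1}\left|\hat{y}_i[t]-y_i[t]\right| = \left\langle \frac{1}{n_O}\lVert \hat{\mathbf{y}}-\mathbf{y}\rVert_1 \right\rangle$$
where $N$ is the number of input-output pairs provided to the model in the course of testing, $n_O$ is the number of timesteps in each forecast, and $y_i[t]$ and $\hat{y}_i[t]$ are respectively the real load values and the estimated load values at time $t$ for sample $i$ (i.e., the $i$-th time window). $\langle \cdot \rangle$ is the mean operator, $\lVert \cdot \rVert_2$ is the Euclidean L2 norm, while $\lVert \cdot \rVert_1$ is the L1 norm. $\mathbf{y}$ and $\hat{\mathbf{y}}$ are the real load values and the estimated load values for one sample, respectively. Still, a more intuitive and indicative interpretation of the prediction efficiency of the estimators can be expressed by the normalized root mean squared error (NRMSE) which, differently from the two above metrics, is independent of the scale of the data:
$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{max} - y_{min}}$$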
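where $y_{max}$ and $y_{min}$ are the maximum and minimum value of the training dataset, respectively. In order to quantify the proportion of variance in the target that is explained by the forecasting methods we also consider the $R^2$ index:
$$R^2 = 1 - \frac{\sum_{i}\sum_{t}\left(y_i[t]-\hat{y}_i[t]\right)^2}{\sum_{i}\sum_{t}\left(y_i[t]-\bar{y}\right)^2}$$
with $\bar{y}$ being the mean of the real load values.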
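As an illustration, the four metrics can be computed with a few lines of NumPy. This is a sketch under stated assumptions: the function names are hypothetical, the arrays are assumed to have one row per test sample and one column per forecast timestep, and $R^2$ is computed in its standard pooled form over all samples.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_true)))

def nrmse(y_true, y_pred, y_min, y_max):
    # y_min and y_max are taken from the training set, not from the test set.
    return rmse(y_true, y_pred) / (y_max - y_min)

def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

# Two toy samples with two forecast timesteps each.
y_true = [[1.0, 2.0], [3.0, 4.0]]
y_pred = [[1.0, 2.0], [3.0, 6.0]]
scores = (rmse(y_true, y_pred), mae(y_true, y_pred),
          nrmse(y_true, y_pred, y_min=0.0, y_max=10.0), r2(y_true, y_pred))
```

Since every sample has the same number of timesteps, averaging over all entries at once is equivalent to averaging first over time and then over samples.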
7.2 Use Case I
The first use case considers the Individual household electric power consumption data set (IHEPC), which contains 2.07M measurements of electric power consumption for a single house located in Sceaux (7 km from Paris, France). Measurements were collected every minute between December 2006 and November 2010 (47 months) . In this study we focus on the prediction of the "Global active power" parameter. Nearly 1.25% of the measurements are missing; still, all the available ones come with timestamps. We reconstruct the missing values using the mean power consumption for the corresponding time slot across the different years of measurements. In order to have a unified approach we have decided to resample the dataset using a sampling period of 15 minutes, which is a widely adopted standard in modern smart meter technologies. In Table the sample sizes are outlined for each dataset.
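The two preprocessing steps can be sketched with pandas as follows. This is only an illustration under assumptions: the function names are hypothetical, a "time slot" is taken to mean the same month, day, hour and minute in the other years, and the resampled value is assumed to be the mean of the readings in each 15-minute interval.

```python
import numpy as np
import pandas as pd

def fill_with_slot_mean(load):
    """Replace NaNs with the mean of the same (month, day, hour, minute) slot."""
    idx = load.index
    slot_mean = load.groupby(
        [idx.month, idx.day, idx.hour, idx.minute]
    ).transform("mean")                    # NaNs are skipped when averaging
    return load.fillna(slot_mean)

def to_15_minutes(load):
    """Average the one-minute readings over 15-minute intervals."""
    return load.resample("15min").mean()

# Toy data: the same 30 minutes in three different years, one reading missing.
index = pd.DatetimeIndex(np.concatenate([
    pd.date_range(f"{year}-01-01 00:00", periods=30, freq="min").values
    for year in (2007, 2008, 2009)
]))
load = pd.Series(np.repeat([1.0, 2.0, 6.0], 30), index=index)
load.iloc[5] = np.nan                               # missing reading in 2007
filled = fill_with_slot_mean(load)                  # uses the 2008 and 2009 values
quarter_hourly = to_15_minutes(filled.iloc[:30])    # two averages for 2007
```

A slot that is missing in every year would remain empty under this scheme and would need a separate fallback.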
In this use case we performed the forecasting using only historical load values. The right side of Figure depicts the average weekly electric consumption. As expected, it can be observed that the highest consumption is registered in the morning and evening periods of day when the occupancy of resident houses is high. Moreover, the average load profile over a week clearly shows that weekdays are similar while weekends present a different trend of consumption.
The figure shows that the data are characterized by high variance. The prediction task consists in forecasting the electric load for the next day, i.e., 96 timesteps ahead.
In order to assess the performance of the architectures we hold out a portion of the data as our test set, comprising the last year of measurements. The remaining measurements are repeatedly divided into two sets, keeping aside one month of data out of every five. This process allows us to build a training set and a validation set on which different hyper-parameter configurations can be evaluated. Only the best performing configuration is later evaluated on the test set.
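One possible reading of this splitting scheme is sketched below with pandas. The function name is hypothetical, and the choice of holding out the last month of each group of five for validation is an assumption, since the text does not specify which month is kept aside.

```python
import numpy as np
import pandas as pd

def split_by_month(index, test_start, holdout_every=5):
    """Return boolean masks (train, validation, test) for a DatetimeIndex."""
    test = np.asarray(index >= pd.Timestamp(test_start))
    # Consecutive month counter, 0 for the first month found in the data.
    month_id = np.asarray(
        (index.year - index.year[0]) * 12 + (index.month - index.month[0])
    )
    # Assumption: the last month of every group of five goes to validation.
    val = ~test & (month_id % holdout_every == holdout_every - 1)
    train = ~test & ~val
    return train, val, test

# Toy example: two years of daily timestamps, the last year used as test set.
index = pd.date_range("2007-01-01", "2008-12-31", freq="D")
train, val, test = split_by_month(index, test_start="2008-01-01")
```

In the toy example the validation set consists of May and October 2007, and every timestamp belongs to exactly one of the three sets.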
7.3 Use Case II and III
The other two use cases are based on the GEFCom2014 dataset , which was made available for an online forecasting competition that lasted between August 2015 and December 2015. The dataset contains 60.6k hourly measurements of (aggregated) electric power consumption collected by ISO New England between January 2005 and December 2011. Differently from the IHEPC dataset, temperature values are also available and are used by the different architectures to enhance their prediction performance. In particular, the input variables used for forecasting the subsequent load values at timestep $t$ include: several previous load measurements, the temperature measurements for the previous timesteps registered by 25 different stations, and the hour, day, month and year of the measurements. We apply standard normalization to load and temperature measurements, while for the other variables we simply apply one-hot encoding, i.e., each value is mapped to a vector in which one of the elements equals 1 and all remaining elements equal 0 . On the right side of Figure we observe the average load and the data dispersion on a weekly basis. Compared to IHEPC, the load profiles look much more regular. This meets intuitive expectations, as the load measurements in the first dataset come from a single household, and thus the randomness introduced by user behaviour has a more marked impact on the results. Conversely, the load information in GEFCom2014 comes from the aggregation of the data provided by several different smart meters; clustered data exhibits a more stable and regular pattern. The main task of these use cases, as in the previous one, consists in forecasting the electric load for the next day, i.e., 24 timesteps ahead. The hyper-parameter optimization and the final score for the models follow the same guidelines provided for IHEPC; the number of points for each subset is described in Table .
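The two encodings can be sketched in NumPy as follows. The helper names are hypothetical, and computing the normalization statistics on the training portion only is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np

def standardize(values, mean, std):
    """Standard normalization: zero mean and unit variance w.r.t. the given statistics."""
    return (np.asarray(values, dtype=float) - mean) / std

def one_hot(labels, num_classes):
    """Map integer labels to vectors with a single 1 and zeros elsewhere."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

train_load = np.array([10.0, 12.0, 14.0])
mean, std = train_load.mean(), train_load.std()    # statistics from training data
scaled = standardize([12.0, 16.0], mean, std)      # applied to unseen values
hours = one_hot([0, 13, 23], num_classes=24)       # hour of day as 24-dim vectors
```

One-hot encoding avoids imposing an artificial ordering or distance on calendar variables such as the hour or the month.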
Weekly statistics for the electric load in the whole IHEPC (left) and GEFCom2014 (right) datasets. The bold line is the mean curve, the dotted line is the median and the green area covers one standard deviation from the mean.
The compared architectures are the ones presented in the previous sections, with one exception. We have additionally considered a deeper variant of a feed forward neural network with residual connections, which is named DFNN in the remainder of the work. In accordance with the findings of  we have employed a 2-shortcut network, i.e., the input undergoes two affine transformations, each followed by a non-linearity, before being summed to its original values. For regularization purposes we have included Dropout and Batch Normalization in each residual block. We have included this model in the comparison because it represents an evolution of standard feed forward neural networks which is expected to better handle highly complex time-series data.
Table summarizes the best configurations found through grid search for each model and use case. For both datasets we experimented with input sequences of different lengths. In the end, we used a window size of four days, which represents the best trade-off between performance and memory requirements. The output sequence length is fixed to one day. For each model we identified the optimal number of stacked layers in the network, the number of hidden units per layer, the regularization coefficient (L2 regularization) and the dropout rate. Moreover, for TCN we additionally tuned the width of the convolutional kernel and the number of filters applied at each layer (i.e., the depth of each output volume after the convolution operation). The dilation factor is increased exponentially with the depth of the network, i.e., $d = 2^{\ell}$ with $\ell$ being the index of the layer.
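The four-day input window and one-day output sequence translate into input-output pairs as in the following NumPy sketch. The function name `make_windows` is hypothetical, and sliding the window by one timestep at a time is an assumption of this sketch.

```python
import numpy as np

def make_windows(series, n_in, n_out, step=1):
    """Slice a series into (input window, output window) pairs."""
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    for start in range(0, len(series) - n_in - n_out + 1, step):
        X.append(series[start:start + n_in])                   # past values
        Y.append(series[start + n_in:start + n_in + n_out])    # values to predict
    return np.array(X), np.array(Y)

# Hourly data: four days of history (96 values) to predict the next day (24 values).
hourly = np.arange(240)                       # ten days of toy hourly readings
X, Y = make_windows(hourly, n_in=4 * 24, n_out=24)
```

For the 15-minute IHEPC data the same call would use `n_in=4 * 96` and `n_out=96`.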
Table summarizes the test scores of the presented architectures obtained for the IHEPC dataset. Certain similarities among networks trained for different use cases can already be spotted at this stage. In particular, we observe that all models exploit a small number of neurons. This is not usual in deep learning but, at least for recurrent architectures, is consistent with . With some exceptions, recurrent networks benefit from a less strict regularization; dropout is almost always set to zero and the L2 regularization coefficients are small.
Among Recurrent Neural Networks we observe that, in general, the MIMO strategy outperforms the recursive one in this multi-step prediction task. This is reasonable in such a scenario. Indeed, the recursive strategy, differently from the MIMO one, is highly sensitive to error accumulation which, in a highly volatile time series such as the one addressed here, results in a very inaccurate forecast. Among the MIMO models we observe that gated networks perform significantly better than the simple Elman network. This suggests that gated systems are effectively learning to better exploit the temporal dependency in the data. In general we notice that all the models, except the RNNs trained with the recursive strategy, achieve comparable performance and none really stands out. It is interesting to note that GRU-MIMO and LSTM-MIMO outperform sequence to sequence architectures, which are supposed to better model complex temporal dynamics like the one exhibited by the residential load curve. Nevertheless, by observing the performance of recurrent networks trained with the recursive strategy, this behaviour is less surprising. In fact, compared with the aggregated load profiles, the load curve belonging to a single smart meter is far more volatile and sensitive to customer behaviour. For this reason, leveraging geographical and socio-economic features that characterize the area where the user lives may allow deep networks to generate better predictions.
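The difference between the two strategies, and the reason why errors accumulate in the recursive one, can be seen in a small sketch. The function names and the toy models are hypothetical: any trained one-step or multi-step predictor could be plugged in as a callable.

```python
import numpy as np

def forecast_recursive(one_step_model, history, n_out):
    """Predict n_out steps by feeding each prediction back as an input."""
    window = list(history)
    preds = []
    for _ in range(n_out):
        y_hat = float(one_step_model(np.array(window)))
        preds.append(y_hat)
        window = window[1:] + [y_hat]    # the prediction replaces the oldest input
    return np.array(preds)

def forecast_mimo(multi_step_model, history):
    """Predict the whole output sequence in a single model call."""
    return np.asarray(multi_step_model(np.array(history)), dtype=float)

# Toy models for a constant load of 1.0, both overestimating by 10%.
history = [1.0, 1.0, 1.0]
recursive = forecast_recursive(lambda w: 1.1 * w[-1], history, n_out=3)
mimo = forecast_mimo(lambda w: np.full(3, 1.1 * w[-1]), history)
```

In this toy setting the recursive forecast drifts further from the true value at every step because each biased prediction becomes an input to the next one, while the MIMO forecast keeps the same error across the whole horizon.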
For visualization purposes we compare the performance of all the models on a single-day prediction scenario on the left side of Figure . On the right side of Figure we quantify the differences between the best predictor (the GRU-MIMO) and the actual measurements; the thinner the line, the closer the prediction to the true data. Furthermore, in this Figure, we concatenate multiple day predictions to have a wider time span and evaluate the predictive capabilities of the model. We observe that the model is able to generate a prediction that correctly follows the general trend of the load curve but fails to predict steep peaks. This might come from the design choice of using MSE as the optimization metric, which could discourage deep models from predicting high peaks: since large errors are heavily penalized, predicting a lower and smoother function results in better performance according to this metric. Alternatively, some of the peaks may simply represent noise due to particular user behaviour and thus be unpredictable by definition.
The load curve of the second dataset (GEFCom2014) results from the aggregation of several different load profiles producing a smoother load curve when compared with the individual load case. Hyper-parameters optimization and the final score for the models can be found in Table .
Table and Table show the experimental results obtained by the models in two different scenarios. In the former case, only load values were provided to the models while in the latter scenario the input vector has been augmented with the exogenous features described before. Compared to the previous dataset this time series exhibits a much more regular pattern; as such we expect the prediction task to be easier. Indeed, we can observe a major improvement in terms of performance across all the models. As already noted in [95, 22] the prediction accuracy increases significantly when the forecasting task is carried out on a smooth load curve (resulting from the aggregation of many individual consumers).
We can observe that, in general, all models except plain FNNs benefit from the presence of exogenous variables.
When exogenous variables are adopted, we notice a major improvement for RNNs trained with the recursive strategy, which outperform the MIMO ones. This increase in accuracy can be attributed to a better capacity to leverage the exogenous time series of temperatures to yield a better load forecast. Moreover, RNNs with the MIMO strategy gain negligible improvements compared to their performance when no extra feature is provided. This kind of architecture uses a feedforward neural network to map its final hidden state to a sequence of values, i.e., the estimates. Exogenous variables are processed directly by this FNN which, as observed above, has problems in handling both load data and extra information. Consequently, a better way of injecting exogenous variables into MIMO recurrent networks needs to be found in order to provide a boost in prediction performance comparable to the one achieved by employing the recursive strategy.
For reasons that are similar to those discussed above, sequence to sequence models trained via teacher forcing (seq2seq-TF) experienced an improvement when exogenous features are used. Still, seq2seq trained in free-running mode (seq2seq-SG) proves to be a valid alternative to standard seq2seq-TF producing high quality predictions in all use cases. The absence of a discrepancy between training and inference in terms of data generating distribution shows to be an advantage as seq2seq-SG is less sensitive to noise and error propagation.
Finally, we notice that TCNs perform well in all the presented use cases. Considering their lower memory requirements in the training process along with their inherent parallelism this type of networks represents a promising alternative to recurrent neural networks for short-term load forecasting.
The prediction results are presented in the same fashion as for the previous use case in Figure . Observe that, in general, all the considered models are able to produce reasonable estimates, as sudden peaks in consumption are smoothed out. Therefore, predictors greatly improve their accuracy when predicting day-ahead values for the aggregated load curves with respect to the individual household scenario.
In this work we have surveyed and experimentally evaluated the most relevant deep learning models applied to the short-term load forecasting problem, paving the way for standardized assessment and identification of the best solutions in this field. The focus has been on the three main families of models, namely, Recurrent Neural Networks, Sequence to Sequence Architectures and the recently developed Temporal Convolutional Neural Networks. An architectural description, along with a technical discussion on how multi-step ahead forecasting is achieved, has been provided for each considered model. Moreover, different forecasting strategies have been discussed and evaluated, identifying advantages and drawbacks for each of them. The evaluation has been carried out on three real-world use cases that refer to two distinct scenarios for load forecasting. Indeed, one use case deals with a dataset coming from a single household while the other two tackle the prediction of a load curve that represents several aggregated meters dispersed over a wide area. Our findings concerning the application of recurrent neural networks to short-term load forecasting show that the simple ERNN performs comparably to gated networks such as GRU and LSTM when adopted in aggregated load forecasting. Thus, the less costly alternative provided by ERNN may represent the most effective solution in this scenario, as it reduces the training time without a remarkable impact on prediction accuracy. On the contrary, a significant difference exists for single house electric load forecasting, where the gated networks prove to be superior to Elman ones, suggesting that the gating mechanism allows irregular time series to be handled better. Sequence to Sequence models have proven to be quite efficient in load forecasting tasks even though they fail to outperform RNNs.
In general we can claim that seq2seq architectures do not represent a gold standard in load forecasting as they do in other domains like natural language processing. In addition, regarding this family of architectures, we have observed that teacher forcing may not represent the best solution for training seq2seq models on short-term load forecasting tasks. Despite being harder to train in terms of convergence, free-running models learn to handle their own errors, avoiding the discrepancy between training and testing that is a well known issue for teacher forcing. It is worth further investigating the capabilities of seq2seq models trained with intermediate solutions such as professor forcing. Finally, we evaluated the recently developed Temporal Convolutional Neural Networks, which demonstrated convincing performance when applied to load forecasting tasks. Therefore, we strongly believe that the adoption of these networks for sequence modelling in the considered field is very promising and might even introduce a significant advance in this area, which is emerging as being of key importance for future Smart Grid developments.
This project is carried out within the frame of the Swiss Centre for Competence in Energy Research on the Future Swiss Electrical Infrastructure (SCCER-FURIES) with the financial support of the Swiss Innovation Agency (Innosuisse - SCCER program).
-  X. Fang, S. Misra, G. Xue, and D. Yang. Smart grid — the new and improved power grid: A survey. IEEE Communications Surveys Tutorials, 14(4):944–980, Fourth 2012.
-  Eisa Almeshaiei and Hassan Soltan. A methodology for electric power load forecasting. Alexandria Engineering Journal, 50(2):137 – 144, 2011.
-  H. S. Hippert, C. E. Pedreira, and R. C. Souza. Neural networks for short-term load forecasting: a review and evaluation. IEEE Transactions on Power Systems, 16(1):44–55, Feb 2001.
-  Jiann-Fuh Chen, Wei-Ming Wang, and Chao-Ming Huang. Analysis of an adaptive time-series autoregressive moving-average (arma) model for short-term load forecasting. Electric Power Systems Research, 34(3):187–196, 1995.
-  Shyh-Jier Huang and Kuang-Rong Shih. Short-term load forecasting via arma model identification including non-gaussian process considerations. IEEE Transactions on power systems, 18(2):673–679, 2003.
-  Martin T Hagan and Suzanne M Behr. The time series approach to short term load forecasting. IEEE Transactions on Power Systems, 2(3):785–791, 1987.
-  Chao-Ming Huang, Chi-Jen Huang, and Ming-Li Wang. A particle swarm optimization to identifying the armax model for short-term load forecasting. IEEE Transactions on Power Systems, 20(2):1126–1133, 2005.
-  Hong-Tzer Yang, Chao-Ming Huang, and Ching-Lien Huang. Identification of armax model for short term load forecasting: An evolutionary programming approach. In Power Industry Computer Application Conference, 1995. Conference Proceedings., 1995 IEEE, pages 325–330. IEEE, 1995.
-  Guy R Newsham and Benjamin J Birt. Building-level occupancy data to improve arima-based electricity use forecasts. In Proceedings of the 2nd ACM workshop on embedded sensing systems for energy-efficiency in building, pages 13–18. ACM, 2010.
-  K. Y. Lee, Y. T. Cha, and J. H. Park. Short-term load forecasting using an artificial neural network. IEEE Transactions on Power Systems, 7(1):124–132, Feb 1992.
-  D. C. Park, M. A. El-Sharkawi, R. J. Marks, L. E. Atlas, and M. J. Damborg. Electric load forecasting using an artificial neural network. IEEE Transactions on Power Systems, 6(2):442–449, May 1991.
-  Dipti Srinivasan, A.C. Liew, and C.S. Chang. A neural network short-term load forecaster. Electric Power Systems Research, 28(3):227 – 234, 1994.
-  I. Drezga and S. Rahman. Short-term load forecasting with local ann predictors. IEEE Transactions on Power Systems, 14(3):844–850, Aug 1999.
-  K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He. Short-term load forecasting with deep residual networks. IEEE Transactions on Smart Grid, pages 1–1, 2018.
-  Ping-Huan Kuo and Chiou-Jye Huang. A high precision artificial neural networks model for short-term energy load forecasting. Energies, 11(1), 2018.
-  K. Amarasinghe, D. L. Marino, and M. Manic. Deep neural networks for energy load forecasting. In 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), pages 1483–1488, June 2017.
-  Siddarameshwara Nayaka, Anup Yelamali, and Kshitiz Byahatti. Electricity short term load forecasting using elman recurrent neural network. pages 351 – 354, 11 2010.
-  Filippo Maria Bianchi, Enrico Maiorino, Michael C. Kampffmeyer, Antonello Rizzi, and Robert Jenssen. An overview and comparative analysis of recurrent neural networks for short term load forecasting. CoRR, abs/1705.04378, 2017.
-  Filippo Maria Bianchi, Enrico De Santis, Antonello Rizzi, and Alireza Sadeghian. Short-term electric load forecasting using echo state networks and pca decomposition. IEEE Access, 3:1931–1943, 2015.
-  Elena Mocanu, Phuong H Nguyen, Madeleine Gibescu, and Wil L Kling. Deep learning for estimating building energy consumption. Sustainable Energy, Grids and Networks, 6:91–99, 2016.
-  Jian Zheng, Cencen Xu, Ziang Zhang, and Xiaohua Li. Electric load forecasting in smart grids using long-short-term-memory based recurrent neural network. In Information Sciences and Systems (CISS), 2017 51st Annual Conference on, pages 1–6. IEEE, 2017.
-  Weicong Kong, Zhao Yang Dong, Youwei Jia, David J Hill, Yan Xu, and Yuan Zhang. Short-term residential load forecasting based on lstm recurrent neural network. IEEE Transactions on Smart Grid, 2017.
-  Salah Bouktif, Ali Fiaz, Ali Ouni, and Mohamed Serhani. Energies, 11(7):1636, 2018.
-  Yixing Wang, Meiqin Liu, Zhejing Bao, and Senlin Zhang. Short-term load forecasting with multi-source data using gated recurrent unit neural networks. Energies, 11:1138, 05 2018.
-  Wan He. Load forecasting via deep neural networks. Procedia Computer Science, 122:308 – 314, 2017. 5th International Conference on Information Technology and Quantitative Management, ITQM 2017.
-  Chujie Tian, Jian Ma, Chunhong Zhang, and Panpan Zhan. A deep neural network model for short-term load forecast based on long short-term memory network and convolutional neural network. Energies, 11:3493, 12 2018.
-  Tao Hong, Pierre Pinson, and Shu Fan. Global energy forecasting competition 2012. International Journal of Forecasting, 30(2):357 – 363, 2014.
-  Antonino Marvuglia and Antonio Messineo. Using recurrent artificial neural networks to forecast household electricity consumption. Energy Procedia, 14:45 – 55, 2012. 2011 2nd International Conference on Advances in Energy Engineering (ICAEE).
-  Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
-  Smart grid, smart city, australian govern., australia, canberray.
-  Yao Cheng, Chang Xu, Daisuke Mashima, Vrizlynn L. L. Thing, and Yongdong Wu. Powerlstm: Power demand forecasting using long short-term memory neural network. In Gao Cong, Wen-Chih Peng, Wei Emma Zhang, Chengliang Li, and Aixin Sun, editors, Advanced Data Mining and Applications, pages 727–740, Cham, 2017. Springer International Publishing.
-  Umass smart dataset. http://traces.cs.umass.edu/index.php/Smart/Smart, 2017.
-  Daniel L Marino, Kasun Amarasinghe, and Milos Manic. Building energy load forecasting using deep neural networks. In Industrial Electronics Society, IECON 2016-42nd Annual Conference of the IEEE, pages 7046–7051. IEEE, 2016.
-  Henning Wilms, Marco Cupelli, and Antonello Monti. Combining auto-regression with exogenous variables in sequence-to-sequence recurrent neural networks for short-term load forecasting. In 2018 IEEE 16th International Conference on Industrial Informatics (INDIN), pages 673–679. IEEE, 2018.
-  Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli, and Rob J. Hyndman. Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. International Journal of Forecasting, 32(3):896 – 913, 2016.
-  A. Almalaq and G. Edwards. A review of deep learning methods applied on load forecasting. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 511–516, Dec 2017.
-  Balázs Csanád Csáji. Approximation with artificial neural networks.
-  G. Hinton. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
-  Matthew D. Zeiler. Adadelta: An adaptive learning rate method. 1212, 12 2012.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  S.T. Chen, D.C. Yu, and A.R. Moghaddamjo. Weather sensitive short-term load forecasting using nonfully connected artificial neural network. IEEE Transactions on Power Systems (Institute of Electrical and Electronics Engineers); (United States), (3), 8 1992.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
-  Jeffrey L. Elman. Finding structure in time. COGNITIVE SCIENCE, 14(2):179–211, 1990.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734, 2014.
-  Paul J. Werbos. Backpropagation through time: What it does and how to do it. 1990.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Learning Internal Representations by Error Propagation, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
-  Ronald J. Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 09 1998.
-  Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw., 5(2):157–166, March 1994.
-  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, pages III–1310–III–1318. JMLR.org, 2013.
-  K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, Oct 2017.
-  Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
-  Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
-  Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. CoRR, abs/1312.6026, 2013.
-  Jürgen Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Comput., 4(2):234–242, March 1992.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
-  Michiel Hermans and Benjamin Schrauwen. Training and analysing deep recurrent neural networks. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 190–198. Curran Associates, Inc., 2013.
-  Souhaib Ben Taieb, Gianluca Bontempi, Amir F. Atiya, and Antti Sorjamaa. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications, 39(8):7067–7083, 2012.
-  Antti Sorjamaa and Amaury Lendasse. Time series prediction using dirrec strategy. volume 6, pages 143–148, 01 2006.
-  Gianluca Bontempi. Long term time series prediction with multi-input multi-output local learning. Proceedings of the 2nd European Symposium on Time Series Prediction (TSP), ESTSP08, 01 2008.
-  Souhaib Ben Taieb, Gianluca Bontempi, Antti Sorjamaa, and Amaury Lendasse. Long-term prediction of time series by combining direct and mimo strategies. 2009 International Joint Conference on Neural Networks, pages 3054–3061, 2009.
-  F. M. Bianchi, E. De Santis, A. Rizzi, and A. Sadeghian. Short-term electric load forecasting using echo state networks and pca decomposition. IEEE Access, 3:1931–1943, 2015.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112, 2014.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
-  Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. 09 2016.
-  A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649, May 2013.
-  Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 577–585. Curran Associates, Inc., 2015.
-  D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949, March 2016.
-  Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2048–2057, Lille, France, 07–09 Jul 2015. PMLR.
-  Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270–280, 1989.
-  Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.
-  Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1171–1179, Cambridge, MA, USA, 2015. MIT Press.
-  Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In NIPS, 2016.
-  Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K. Reddy. Deep reinforcement learning for sequence to sequence models. CoRR, abs/1805.09461, 2018.
-  Henning Wilms, Marco Cupelli, and A Monti. Combining auto-regression with exogenous variables in sequence-to-sequence recurrent neural networks for short-term load forecasting. pages 673–679, 07 2018.
-  Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
-  Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
-  Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
-  Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. arXiv preprint arXiv:1702.00783, 2017.
-  Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 105–114. IEEE, 2017.
-  Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In Arxiv, 2016.
-  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
-  Anastasia Borovykh, Sander Bohte, and Kees Oosterlee. Conditional time series forecasting with convolutional neural networks. Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence, pages 729–730, September 2017.
-  Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. CoRR, abs/1603.07285, 2016.
-  Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
-  François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
-  Sihan Li, Jiantao Jiao, Yanjun Han, and Tsachy Weissman. Demystifying resnet. arXiv preprint arXiv:1611.01186, 2016.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. pages 448–456, 2015.
-  A. Marinescu, C. Harris, I. Dusparic, S. Clarke, and V. Cahill. Residential electrical demand forecasting in very small scale: An evaluation of forecasting methods. In 2013 2nd International Workshop on Software Engineering Challenges for the Smart Grid (SE4SG), pages 25–32, May 2013.