1 Introduction
Forecasting the demand of resources within a distribution network of energy, telecommunication or transportation is of fundamental importance for managing the limited availability of the assets. An accurate Short Term Load Forecast (STLF) system Gooijer and Hyndman (2006) can reduce high cost of over and undercontracts on balancing markets due to load prediction errors. Moreover, it keeps power markets efficient and provides a better understanding of the dynamics of the monitored system SimchiLevi et al. (1999). On the other hand, a wrong prediction could cause either a load overestimation, which leads to the excess of supply and consequently more costs and contract curtailments for market participants, or a load underestimation, resulting in failures in gathering enough provisions, thereby more costly supplementary services Bunn (2000); Ruiz and Gross (2008). These reasons motivated the research of forecasting models capable of reducing this financial distress, by increasing the load forecasting accuracy even by a small percent Deihimi and Showkati (2012); Peng et al. (2014); Shen and Huang (2008); Bianchi et al. (2015a, b).
The load profile generally follows cyclic and seasonal patterns related to human activities and can be represented by a realvalued time series. The dynamics of the system generating the load time series can vary significantly during the observation period, depending on the nature of the system and on latent, external influences. For this reason, the forecasting accuracy can change considerably among different samples even when using the same prediction model Deihimi et al. (2013). Over the past years, the STLF problem has been tackled in several research areas Jan van Oldenborgh et al. (2005) by means of many different modelbased approaches, each one characterized by different advantages and drawbacks in terms of prediction accuracy, complexity in training, sensitivity to the parameters and limitations in the tractable forecasting horizon DangHa et al. (2017).
Autoregressive and exponential smoothing models represented for many years the baseline among systems for time series prediction Hyndman et al. (2008). Such models require to properly select the lagged inputs to identify the correct model orders, a procedure which demands a certain amount of skill and expertise Box et al. (2011)
. Moreover, autoregressive models make explicit assumptions about the nature of system under exam. Therefore, their use is limited to those settings in which such assumptions hold and where
apriori knowledge on the system is available Box and Cox (1964). Taylor (2008) showed that for long forecasting horizons a very basic averaging model, like AutoRegressive Integrated Moving Average or Triple Exponential Smoothing, can outperform more sophisticated alternatives. However, in many complicated systems the properties of linearity and even stationarity of the analyzed time series are not guaranteed. Nonetheless, given their simplicity, autoregressive models have been largely employed as practical implementations of forecast systems.The problem of time series prediction has been approached within a function approximation framework, by relying on the embedding procedure proposed by Takens (1981)
. Takens’ theorem transforms the prediction problem from time extrapolation to phase space interpolation. In particular, by properly sampling a time dependent quantity
, it is possible to predict the value of the th sample from the previous samples, given an appropriate choice of the sampling frequency and the number of samples :. Through the application of phasespace embedding, regression methods, such as Support Vector Regression (an extension of Support Vector Machines in the continuum) have been applied in time series prediction
Sapankevych and Sankar (2009), either by representing the sequential input as a static domain, described by frequency and phase, or by embedding sequential input values in time windows of fixed length. The approach can only succeed if there are no critical temporal dependencies exceeding the windows length, making the SVM unable to learn an internal state representation for sequence learning tasks involving time lags of arbitrary length. Other universal function approximators such as FeedForward Artificial Neural Networks Hornik et al. (1989) and ANFIS (Adaptive NetworkBased Fuzzy Inference System) Jang (1993) have been employed in time series prediction tasks by selecting a suitable interval of past values from the time series as the inputs and by training the network to forecast one or a fixed number of future values Zhang et al. (1998); Hippert et al. (2001); Law (2000); Tsaur et al. (2002); Kon and Turner (2005); Palmer et al. (2006); Claveria and Torra (2014). The operation is repeated to forecast next values by translating the time window of the considered inputs Kourentzes (2013). While this approach proved to be effective in many circumstances DíazRobles et al. (2008); Plummer (2000); Teixeira and Fernandes (2012); Claveria et al. (2015), it does not treat temporal ordering as an explicit feature of the time series and, in general, is not suitable in cases where the time series have significantly different lengths. On this account, a Recurrent Neural Network (RNN) is a more flexible model, since it encodes the temporal context in its feedback connections, which are capable of capturing the time varying dynamics of the underlying system Schäfer and Zimmermann (2007); Bianchi et al. (2017).RNNs are a special class of Neural Networks characterized by internal selfconnections, which can, in principle, any nonlinear dynamical system, up to a given degree of accuracy Schäfer and Zimmermann (2007). RNNs and their variants have been used in many contexts where the temporal dependency in the data is an important implicit feature in the model design. Noteworthy applications of RNNs include sequence transduction Graves (2012), language modeling Graves (2013); Pascanu et al. (2013a); Mikolov (2012); Sutskever et al. (2011), speech recognition Graves (2011), learning word embeddings Mikolov et al. (2013), audio modeling Oord et al. (2016), handwriting recognition Graves and Schmidhuber (2009); Graves et al. (2008), and image generation Gregor et al. (2015). In many of these works a popular variant of RNN was used, called LongShort Term Memory Hochreiter and Schmidhuber (1997). This latter has recently earned significant attention due to its capability of storing information for very long periods of time.
As an RNN processes sequential information, it performs the same operations on every element of the input sequence. Its output, at each time step, depends on previous inputs and past computations. This allows the network to develop a memory of previous events, which is implicitly encoded in its hidden state variables. This is certainly different from traditional feedforward neural networks, where it is assumed that all inputs (and outputs) are independent of each other. Theoretically, RNNs can remember arbitrarily long sequences. However, their memory is in practice limited by their finite size and, more critically, by the suboptimal training of their parameters. To overcome memory limitations, recent research efforts have led to the design of novel RNN architectures, which are equipped with an external, permanent memory capable of storing information for indefinitely long amount of time Weston et al. (2014); Graves et al. (2014).
Contrarily to other linear models adopted for prediction, RNNs can learn functions of arbitrary complexity and they can deal with time series data possessing properties such as saturation or exponential effects and nonlinear interactions between latent variables. However, if the temporal dependencies of data are prevalently contained in a finite and small time interval, the use of RNNs can be unnecessary. In these cases performances, both in terms of computational resources required and accuracy, are generally lower than the ones of timewindow approaches, like ARIMA, SVM, MultiLayer Perceptron and ANFIS. On the other hand, in many load forecasting problems the time series to be predicted are characterized by long temporal dependencies, whose extent may vary in time or be unknown in advance. In all these situations, the use of RNNs may turn out to be the best solution.
Despite the STLF problem has been one of the most important applications for both early RNNs models Gers et al. (2001) and most recent ones Flunkert et al. (2017), an uptodate and comprehensive analysis of the modern RNN architectures applied to the STLF problem is still lacking. In several recent works on STFL, NARX networks (see Sec. 4.1) or Echo State Networks (see Sec. 4.2) are adopted for time series prediction and their performance is usually compared with standard static models, rather than with other RNN architectures. With this paper, we aim to fill these gaps by performing a comparative study on the problem of STLF using different classes of stateoftheart RNNs. We provide an introduction to the RNN framework, presenting the most important architectures and their properties. We also furnish the guidelines for configuring and training the different RNN models to predict realvalued time series. In practice, we formulate the STLF problem as the prediction of a realvalued univariate time series, given its past values as input. In some cases, beside the time series of past target values, additional “context” time series are fed to the network in order to provide exogenous information related to the environment in which the system to be modeled operates.
The paper is structured as follows.
In Sec. 2
we provide a general overview of a standard RNN architecture and we discuss its general properties. We also discuss the main issues encountered in the training phase, the most common methodologies for learning the model parameters and common ways of defining the loss function to be optimized during the training.
In Sec. 3, we present the most basic architecture, called Elman RNN, and then we analyze two important variants, namely the LongShort Term Memory and Gated Recurrent Units networks. Despite the recent popularity of these architectures Greff et al. (2015), their application to prediction of realvalued time series has been limited so far Malhotra et al. (2015). For each RNN, we provide a brief review, explaining its main features, the approaches followed in the training stage and a short list of the main works concerning time series prediction in which the specific network has been applied.
Successively, in Sec. 4 we illustrate two particular RNN architectures, which differ from the previous ones, mainly due to their training procedure. In particular, we analyze the Nonlinear AutoRegressive with eXogenous inputs (NARX) neural network and the Echo State Network (ESN). These architectures have been successfully applied in the literature of time series prediction and they provide important advantages with respect to traditional models, due to their easy applicability and fast training procedures.
In Sec. 5 we describe three synthetic datasets, used to test and to compare the computational capabilities of the five RNN architectures in a controlled environment.
In Sec. 6, we present three realworld datasets of time series relative to the load profile in energy distribution and telecommunication networks. For each dataset, we perform a series of analysis with the purpose of choosing a suitable preprocessing for the data.
Sec. 7 is dedicated to the experiments and to the discussion of the performance of the RNN models. The first part of the experimental section focuses on the benchmark tests, while in the second part we employ the RNNs to solve STLF tasks on realworld time series.
Finally, in Sec. 8 we discuss our conclusions.
2 Properties and Training in Recurrent Neural Networks
RNNs are learning machines that recursively compute new states by applying transfer functions to previous states and inputs. Typical transfer functions are composed by an affine transformation followed by a nonlinear function, which are chosen depending on the nature of the particular problem at hand. It has been shown by Maass et al. (2007) that RNNs possess the socalled universal approximation property, that is, they are capable of approximating arbitrary nonlinear dynamical systems (under loose regularity conditions) with arbitrary precision, by realizing complex mappings from input sequences to output sequences Siegelmann and Sontag (1991)
. However, the particular architecture of an RNN determines how information flows between different neurons and its correct design is crucial in the realization of a robust learning system. In the context of prediction, an RNN is trained on input temporal data
in order to reproduce a desired temporal output . can be any time series related to the input and even a temporal shift of itself. The most common training procedures are gradientbased, but other techniques have been proposed, based on derivativefree approaches or convex optimization Schmidhuber et al. (2007); Jaeger (2001). The objective function to be minimized is a loss function, which depends on the error between the estimated output
and the actual output of the network . An interesting aspect of RNNs is that, upon suitable training, they can also be executed in generative mode, as they are capable of reproducing temporal patterns similar to those they have been trained on Gregor et al. (2015).The architecture of a simple RNN is depicted in Fig. 1.
In its most general form an RNN can be seen as a weighted, directed and cyclic graph that contains three different kinds of nodes, namely the input, hidden and output nodes Zhang et al. (2016). Input nodes do not have incoming connections, output nodes do not have outgoing connections, hidden nodes have both. An edge can connect two different nodes which are at the same or at different time instants. In this paper, we adopt the timeshift operator to represent a time delay of time steps between a source and a destination node. Usually , but also lower values are admitted and they represent the so called skip connections Koutník et al. (2014). Selfconnecting edges always implement a lag operator with . In some particular cases, the argument of the timeshift operator is positive and it represents a forwardshift in time Sutskever and Hinton (2010). This means that a node receives as input the content of a source node in a future time interval. Networks with those kind of connections are called bidirectional RNNs and are based on the idea that the output at a given time may not only depend on the previous elements in the sequence, but also on future ones Schuster and Paliwal (1997). These architectures, however, are not reviewed in this work as we only focus on RNNs with .
While, in theory, an RNN architecture can model any given dynamical system, practical problems arise during the training procedure, when model parameters must be learned from data in order to solve a target task. Part of the difficulty is due to a lack of well established methodologies for training different types of models. This is also because a general theory that might guide designer decisions has lagged behind the feverish pace of novel architecture designs Schoenholz et al. (2016); Lipton (2015)
. A large variety of novel strategies and heuristics have arisen from the literature in the past the years
Montavon et al. (2012); Scardapane et al. (2017) and, in many cases, they may require a considerable amount of expertise from the user to be correctly applied. While the standard learning procedure is based on gradient optimization, in some RNN architectures the weights are trained following different approaches Scardapane and Wang (2017); Jaeger (2002a), such as realtime recurrent learning Williams and Zipser (1989), extended Kalman filters
Haykin et al. (2001)John (1992), and in some cases they are not learned at all Lukoševičius and Jaeger (2009).2.1 Backpropagation Through Time
Gradientbased learning requires a closedform relation between the model parameters and the loss function. This relation allows to propagate the gradient information calculated on the loss function back to the model parameters, in order to modify them accordingly. While this operation is straightforward in models represented by a directed acyclic graph, such as a FeedForward Neural Network (FFNN), some caution must be taken when this reasoning is applied to RNNs, whose corresponding graph is cyclic. Indeed, in order to find a direct relation between the loss function and the network weights, the RNN has to be represented as an equivalent infinite, acyclic and directed graph. The procedure is called unfolding and consists in replicating the network’s hidden layer structure for each time interval, obtaining a particular kind of FFNN. The key difference of an unfolded RNN with respect to a standard FFNN is that the weight matrices are constrained to assume the same values in all replicas of the layers, since they represent the recursive application of the same operation.
Fig. 2 depicts the unfolding of the RNN, previously reported in Fig. 1. Through this transformation the network can be trained with standard learning algorithms, originally conceived for feedforward architectures. This learning procedure is called Back Propagation Through Time (BPTT) Rumelhart et al. (1985) and is one of the most successful techniques adopted for training RNNs. However, while the network structure could in principle be replicated an infinite number of times, in practice the unfolding is always truncated after a finite number of time instants. This maintains the complexity (depth) of the network treatable and limits the issue of the vanishing gradient (as discussed later). In this learning procedure called Truncated BPPT Williams and Peng (1990), the folded architecture is repeated up to a given number of steps , with upperbounded by the time series length . The size of the truncation depends on the available computational resources, as the network grows deeper by repeating the unfolding, and on the expected maximum extent of time dependencies in data. For example, in a periodic time series with period it may be unnecessary, or even detrimental, to set .
Another variable we consider is the frequency
at which the BPTT calculates the backpropagated gradients. In particular, let us define with
the truncated backpropagation that processes the sequence one time step at a time, and every time steps, it runs BPTT for time steps Sutskever (2013). Very often the term is omitted in the literature, as it is assumed equal to 1, and only the value for is specified. We refer to the case and as true BPTT, or .In order to improve the computational efficiency of the BPTT, the ratio can be decremented, effectively reducing the frequency of gradients evaluation. An example, is the socalled epochwise BPTT or , where Williams and Zipser (1995). In this case, the ratio . However, the learning procedure is in general much less accurate than , since the gradient is truncated too early for many values on the boundary of the backpropagation window.
A better approximation of the true BPTT is reached by taking a large difference , since no error in the gradient is injected for the earliest time steps in the buffer. A good tradeoff between accuracy and performance is , which keeps the ratio sufficiently close to 1 and the difference is large as in the true BPTT Williams and Peng (1990). Through preliminary experiments, we observed that achieves comparable performance to , in a significantly reduced training time. Therefore, we followed this procedure in all our experiments.
2.2 Gradient descent and loss function
Training a neural network commonly consists in modifying its parameters through a gradient descent optimization, which minimizes a given loss function that quantifies the accuracy of the network in performing the desired task. The gradient descent procedure consists in repeating two basic steps until convergence is reached. First, the loss function is evaluated on the RNN configured with weights , when a set of input data are processed (forward pass). Note that with we refer to all network parameters, while the index
identifies their values at epoch
, as they are updated during the optimization procedure. In the second step, the gradient is backpropagated through the network in order to update its parameters (backward pass).In a time series prediction problem, the loss function evaluates the dissimilarity between the predicted values and the actual future values of the time series, which is the ground truth. The loss function can be defined as
(1) 
where is a function that evaluates the prediction error of the network when it is fed with inputs in , in respect to a desired response .
is a regularization function that depends on a hyperparameter
, which weights the contribution of the regularization in the total loss.The error function that we adopt in this work is Mean Square Error (MSE). It is defined as
(2) 
where is the output of the RNN (configured with parameters ) when the input is processed and is the groundtruth value that the network must learn to reproduce.
The regularization term introduces a bias that improves the generalization capabilities of the RNN, by reducing overfitting on the training data. In this work, we consider four types of regularization:

: the regularization term in Eq. 1 has the form .
regularization enforces sparsity in the network parameters, is robust to noisy outliers and it can possibly deliver multiple optimal solutions. However, this regularization can produce unstable results, in the sense that a small variation in the training data can yield very different outcomes.

: in this case, . This function penalizes large magnitudes in the parameters, favouring dense weight matrices with low values. This procedure is more sensitive to outliers, but is more stable than
. Usually, if one is not concerned with explicit features selection, the use of
is preferred. 
Elastic net penalty: combines the two regularizations above, by joining both and terms as . This regularization method overcomes the shortcomings of the regularization, which selects a limited number of variables before it saturates and, in case of highly correlated variables, tends to pick only one and ignore the others. Elastic net penalty generalizes the and regularization, which can be obtained by setting and , respectively.

Dropout: rather than defining an explicit regularization function
, dropout is implemented by keeping a neuron active during each forward pass in the training phase with some probability. Specifically, one applies a randomly generated mask to the output of the neurons in the hidden layer. The probability of each mask element to be 0 or 1 is defined by a hyperparameter
. Once the training is over, the activations are scaled by in order to maintain the same expected output. Contrarily to feedforward architectures, a naive dropout in recurrent layers generally produces bad performance and, therefore, it has usually been applied only to input and output layers of the RNN Pham et al. (2014a). However, in a recent work, Gal and Ghahramani (2015) shown that this shortcoming can be circumvented by dropping the same network units in each epoch of the gradient descent. Even if this formulation yields a slightly reduced regularization, nowadays this approach is becoming popular Zilly et al. (2016); Che et al. (2016) and is the one followed in this paper.
2.3 Parameters update strategies
Rather than evaluating the loss function over the entire training set to perform a single update of the network parameters, a very common approach consists in computing the gradient over minibatches of the training data. The size of the batch is usually set by following rules of thumb Bengio (2012).
This gradientupdate method is called Stochastic Gradient Descent (SGD) and, in presence of a nonconvex function, its convergence to a local minimum is guaranteed (under some mild assumptions) if the learning rate is sufficiently small
Bottou (2004). The update equation reads(3) 
where is the learning rate, an important hyperparameter that must be carefully tuned to achieve an effective training Bottou (2012a). In fact, a large learning rate provides a high amount of kinetic energy in the gradient descent, which causes the parameter vector to bounce, preventing the access to narrow area of the search space, where the loss function is lower. On the other hand, a strong decay can excessively slow the training procedure, resulting in a waste of computational time.
Several solutions have been proposed over the years, to improve the convergence to the optimal solution Bottou (2012b). During the training phase it is usually helpful to anneal over time or when the performance stops increasing. A method called step decay reduces the learning rate by a factor , if after a given number of epochs the loss has not decreased. The exponential decay and the fractional decay instead, have mathematical forms and , respectively. Here and are hyperparameters, while is the current optimization epoch. In our experiments, we opted for the step decay annealing, when we train the networks with SGD.
Even if SGD usually represents a safe optimization procedure, its rate of convergence is slow and the gradient descent is likely to get stuck in a saddle point of the loss function landscape Dauphin et al. (2014). Those issues have been addressed by several alternative strategies proposed in the literature for updating the network parameters. In the following we describe the most commonly used ones.
Momentum
In this firstorder method, the weights are updated according to a linear combination of the current gradient and the previous update , which is scaled by a hyperparameter :
(4)  
With this approach, the updates will build up velocity toward a direction that shows a consistent gradient Sutskever (2013). A common choice is to set .
A variant of the original formulation is the Nesterov momentum, which often achieves a better convergence rate, especially for smoother loss functions Nesterov (1983). Contrarily to the original momentum, the gradient is evaluated at an approximated future location, rather than at the current position. The update equations are
(5)  
Adaptive learning rate
The first adaptive learning rate method, proposed by Duchi et al. (2011), is Adagrad. Unlike the previously discussed approaches, Adagrad maintains a different learning rate for each parameter. Given the update information from all previous iterations , with , a different update is specified for each parameter of the weight matrix:
(6) 
where is a small term used to avoid division by . A major drawback with Adagrad is the unconstrained growth of the accumulated gradients over time. This can cause diminishing learning rates that may stop the gradient descent prematurely.
A procedure called RMSprop
Tieleman and Hinton (2012) attempts to solve this issue by using an exponential decaying average of square gradients, which discourages an excessive shrinkage of the learning rates:(7)  
According to the update formula, if there are oscillation in gradient updates, the learning rate is reduced by , otherwise it is increased by . Usually the decay rate is set to .
Another approach called Adam and proposed by Kingma and Ba (2014)
, combines the principles of Adagrad and momentum update strategies. Usually, Adam is the adaptive learning method that yields better results and, therefore, is the gradient descent strategy most used in practice. Like RMSprop, Adam stores an exponentially decaying average of gradients squared, but it also keeps an exponentially decaying average of the moments of the gradients. The update difference equations of Adam are
(8)  
corresponds to the first moment and is the second moment. However, since both and are initialized as zerovectors, they are biased towards during the first epochs. To avoid this effect, the two terms are corrected as and . Default values of the hyperparameters are , and .
Secondorder methods
The methods discussed so far only consider firstorder derivatives of the loss function. Due to this approximation, the landscape of the loss function locally looks and behaves like a plane. Ignoring the curvature of the surface may lead the optimization astray and it could cause the training to progress very slowly. However, secondorder methods involve the computation of the Hessian, which is expensive and usually untreatable even in networks of medium size. A HessianFree (HF) method that considers derivatives of the second order, without explicitly computing the Hessian, has been proposed by Martens (2010). This latter, unlike other existing HF methods, makes use of the positive semidefinite GaussNewton curvature matrix and it introduces a damping factor based on the LevenbergMarquardt heuristic, which permits to train networks more effectively. However, Sutskever et al. (2013) showed that HF obtains similar performance to SGD with Nesterov momentum. Despite being a firstorder approach, Nestrov momentum is capable of accelerating directions of lowcurvature just like a HF method and, therefore, is preferred due to its lower computational complexity.
2.4 Vanishing and exploding gradient
Increasing the depth in an RNN, in general, improves the memory capacity of the network and its modeling capabilities Pascanu et al. (2013b). For example, stacked RNNs do outperform shallow ones with the same hidden size on problems where it is necessary to store more information throughout the hidden states between the input and output layer Sutskever et al. (2014). One of the principal drawback of early RNN architectures was their limited memory capacity, caused by the vanishing or exploding gradient problem El Hihi and Bengio (1995), which becomes evident when the information contained in past inputs must be retrieved after a long time interval Hochreiter et al. (2001). To illustrate the issue of vanishing gradient, one can consider the influence of the loss function (that depends on the network inputs and on its parameters) on the network parameters , when its gradient is backpropagated through the unfolded The network Jacobian reads as
(9) 
In the previous equation, the partial derivatives of the states with respect to their previous values can be factorized as
(10) 
To ensure local stability, the network must operate in a ordered regime Bianchi et al. (2016a), a property ensured by the condition . However, in this case the product expanded Eq. 10 rapidly (exponentially) converges to 0, when increases. Consequently, the sum in Eq. 9 becomes dominated by terms corresponding to shortterm dependencies and the vanishing gradient effect occurs. As principal side effect, the weights are less and less updated as the gradient flows backward through the layers of the network. On the other hand, the phenomenon of exploding gradient appears when and the network becomes locally unstable. Even if global stability can still be obtained under certain conditions, in general the network enters into a chaotic regime, where its computational capability is hindered Livi et al. (2017).
Models with large recurrent depths exacerbate these gradientrelated issues, since they posses more nonlinearities and the gradients are more likely to explode or vanish. A common way to handle the exploding gradient problem, is to clip the norm of the gradient if it grows above a certain threshold. This procedure relies on the assumption that exploding gradients only occur in contained regions of the parameters space. Therefore, clipping avoids extreme parameter changes without overturning the general descent direction
Pascanu et al. (2012).On the other hand, different solutions have been proposed to tackle the vanishing gradient issue. A simple, yet effective approach consists in initializing the weights to maintain the same variance withing the activations and backpropagated gradients, as one moves along the network depth. This is obtained with a random initialization that guarantees the variance of the components of the weight matrix in layer
to be , and being the number of units in the previous and the next layer respectively Glorot and Bengio (2010). He et al. (2015)proposed to initialize the network weights by sampling them from an uniform distribution in
and then rescaling their values by ,being the total number of hidden neurons in the network. Another option, popular in deep FFNN, consists in using ReLU
Nair and Hinton (2010)as activation function, whose derivative is
or , and it does not cause the gradient to vanish or explode. Regularization, besides preventing unwanted overfitting in the training phase, proved to be useful in dealing with exploding gradients. In particular, and regularizations constrain the growth of the components of the weight matrices and consequently limit the values assumed by the propagated gradient Pascanu et al. (2013a). Another popular solution is adopting gated architectures, like Long ShortTerm Memory (LSTM) or Gated Recurrent Unit (GRU), which have been specifically designed to deal with vanishing gradients and allow the network to learn much longerrange dependencies. Srivastava et al. (2015) proposed an architecture called Highway Network, which allows information to flow across several layers without attenuation. Each layer can smoothly vary its behavior between that of a plain layer, implementing an affine transform followed by a nonlinear activation, and that of a layer which simply passes its input through. Optimization in highway networks is virtually independent of depth, as information can be routed (unchanged) through the layers. The Highway architecture, initially applied to deep FFNN He et al. (2015), has recently been extended to RNN where it dealt with several modeling and optimization issues Zilly et al. (2016).Finally, gradientrelated problems can be avoided by repeatedly selecting new weight parameters using random guess or evolutionary approaches John (1992); Gomez and Miikkulainen (2003); in this way the network is less likely to get stuck in local minima. However, convergence time of these procedures is timeconsuming and can be impractical in many realworld applications. A solution proposed by Schmidhuber et al. (2007), consists in evolving only the weights of nonlinear hidden units, while linear mappings from hidden to output units are tuned using fast algorithms for convex problem optimization.
3 Recurrent Neural Networks Architectures
In this section, we present three different RNN architectures trainable through the BPPT procedure, which we employ to predict realvalued time series. First, in Sec. 3.1 we present the most basic version of RNN, called Elman RNN. In Sec. 3.2 and 3.3 we discuss two gated architectures, which are LSTM and GRU. For each RNN model, we provide a quick overview of the main applications in time series forecasting and we discuss its principal features.
3.1 Elman Recurrent Neural Network
The Elman Recurrent Neural Network (ERNN), also known as Simple RNN or Vanillan RNN, is depicted in Fig. 1 and is usually considered to be the most basic version of RNN. Most of the more complex RNN architectures, such as LSTM and GRU, can be interpreted as a variation or as an extension of ERNNs.
ERNN have been applied in many different contexts. In natural language processing applications, ERNN demonstrated to be capable of learning grammar using a training set of unannotated sentences to predict successive words in the sentence
Elman (1995); Ogata et al. (2007). Mori and Ogasawara (1993) studied ERNN performance in shortterm load forecasting and proposed a learning method, called “diffusion learning” (a sort of momentumbased gradient descent), to avoid local minima during the optimization procedure. Cai et al. (2007)trained a ERNN with a hybrid algorithm that combines particle swarm optimization and evolutionary computation to overcome the local minima issues of gradientbased methods. Furthermore, ERNNs have been employed by
Cho (2003) in tourist arrival forecasting and by Mandal et al. (2006) to predict electric load time series. Due to the critical dependence of electric power usage on the day of the week or month of the year, a preprocessing step is performed to cluster similar days according to their load profile characteristics. Chitsaz et al. (2015) proposes a variant of ERNN called SelfRecurrent Wavelet Neural Network, where the ordinary nonlinear activation functions of the hidden layer are replaced with wavelet functions. This leads to a sparser representation of the load profile, which demonstrated to be helpful for tackling the forecast task through smaller and more easily trainable networks.The layers in a RNN can be divided in input layers, hidden layers and the output layers (see Fig. 1). While input and output layers are characterized by feedforward connections, the hidden layers contain recurrent ones. At each time step , the input layer process the component of a serial input . The time series has length and it can contain real values, discrete values, onehot vectors, and so on. In the input layer, each component
is summed with a bias vector
( is the number of nodes in the hidden layer) and then is multiplied with the input weight matrix . Analogously, the internal state of the network from the previous time interval is first summed with a bias vector and then multiplied by the weight matrix of the recurrent connections. The transformed current input and past network state are then combined and processed by the neurons in the hidden layers, which apply a nonlinear transformation. The difference equations for the update of the internal state and the output of the network at a time step are:(11)  
where is the activation function of the neurons, usually implemented by a sigmoid or by a hyperbolic tangent. The hidden state conveys the content of the memory of the network at time step , is typically initialized with a vector of zeros and it depends on past inputs and network states. The output is computed through a transformation , usually linear, on the matrix of the output weights applied to the sum of the current state and the bias vector . All the weight matrices and biases can be trained through gradient descent, according to the BPPT procedure. Unless differently specified, in the following to compact the notation we omit the bias terms by assuming , , and by augmenting , , with an additional column.
3.2 Long ShortTerm Memory
The Long ShortTerm Memory (LSTM) architecture was originally proposed by Hochreiter and Schmidhuber (1997)
and is widely used nowadays due to its superior performance in accurately modeling both short and long term dependencies in data. LSTM tries to solve the vanishing gradient problem by not imposing any bias towards recent observations, but it keeps constant error flowing back through time. LSTM works essentially in the same way as the ERNN architecture, with the difference that it implements a more elaborated internal processing unit called
cell.LSTM has been employed in numerous sequence learning applications, especially in the field of natural language processing. Outstanding results with LSTM have been reached by Graves and Schmidhuber (2009) in unsegmented connected handwriting recognition, by Graves et al. (2013) in automatic speech recognition, by Eck and Schmidhuber (2002) in music composition and by Gers and Schmidhuber (2001)
in grammar learning. Further successful results have been achieved in the context of image tagging, where LSTM have been paired with convolutional neural network, to provide annotations on images automatically
Vinyals et al. (2017).However, few works exist where LSTM has been applied to prediction of realvalued time series. Ma et al. (2015) evaluated the performances of several kinds of RNNs in shortterm traffic speed prediction and compared them with other common methods like SVMs, ARIMA, and Kalman filters, finding that LSTM networks are nearly always the best approach. Pawlowski and Kurach (2015)
utilized ensembles of LSTM and feedforward architectures to classify the danger from concentration level of methane in a coal mine, by predicting future concentration values. By following a hybrid approach,
Felder et al. (2010)trains a LSTM network to output the parameter of a Gaussian mixture model that best fits a wind power temporal profile.
While an ERNN neuron implements a single nonlinearity (see Eq. 11), a LSTM cell is composed of 5 different nonlinear components, interacting with each other in a particular way. The internal state of a cell is modified by the LSTM only through linear interactions. This permits information to backpropagate smoothly across time, with a consequent enhancement of the memory capacity of the cell. LSTM protects and controls the information in the cell through three gates, which are implemented by a sigmoid and a pointwise multiplication. To control the behavior of each gate, a set of parameters are trained with gradient descent, in order to solve a target task.
Since its initial definition Hochreiter and Schmidhuber (1997), several variants of the original LSTM unit have been proposed in the literature. In the following, we refer to the commonly used architecture proposed by Graves and Schmidhuber (2005). A schema of the LSTM cell is depicted in Fig. 3.
The difference equations that define the forward pass to update the cell state and to compute the output are listed below.
(12)  
is the input vector at time . , , , and are rectangular weight matrices, that are applied to the input of the LSTM cell. , , , and are square matrices that define the weights of the recurrent connections, while , , , and are bias vectors. The function is a sigmoid ^{1}^{1}1the logistic sigmoid is defined as , while and are pointwise nonlinear activation functions, usually implemented as hyperbolic tangents that squash the values in . Finally, is the entrywise multiplication between two vectors (Hadamard product).
Each gate in the cell has a specific and unique functionality. The forget gate decides what information should be discarded from the previous cell state . The input gate operates on the previous state , after having been modified by the forget gate, and it decides how much the new state should be updated with a new candidate . To produce the output , first the cell filters its current state with a nonlinearity . Then, the output gate selects the part of the state to be returned as output. Each gate depends on the current external input and the previous cells output .
As we can see from the Fig. 3 and from the forwardstep equations, when and , the current state of a cell is transferred to the next time interval exactly as it is. By referring back to Eq. 10, it is possible to observe that in LSTM the issue of vanishing gradient does not occur, due to the absence of nonlinear transfer functions applied to the cell state. Since in this case the transfer function in Eq. 10 applied to the internal states is an identity function, the contribution from past states remains unchanged over time. However, in practice, the update and forget gates are never completely open or closed due to the functional form of the sigmoid, which saturates only for infinitely large values. As a result, even if long term memory in LSTM is greatly enhanced with respect to ERNN architectures, the content of the cell cannot be kept completely unchanged over time.
3.3 Gated Recurrent Unit
The Gated Recurrent Unit (GRU) is another notorious gated architecture, originally proposed by Cho et al. (2014), which adaptively captures dependencies at different time scales. In GRU, forget and input gates are combined into a single update gate, which adaptively controls how much each hidden unit can remember or forget. The internal state in GRU is always fully exposed in output, due to the lack of a control mechanism, like the output gate in LSTM.
GRU were firstly tested by Cho et al. (2014) on a statistical machine translation task and reported mixed results. In an empirical comparison of GRU and LSTM, configured with the same amount of parameters, Chung et al. (2014) concluded that on some datasets GRU can outperform LSTM, both in terms of generalization capabilities and in terms of time required to reach convergence and to update parameters. In an extended experimental evaluation, Zaremba (2015) employed GRU to (i) compute the digits of the sum or difference of two input numbers, (ii) predict the next character in a synthetic XML dataset and in the large words dataset Penn TreeBank, (iii) predict polyphonic music. The results showed that the GRU outperformed the LSTM on nearly all tasks except language modeling when using a naive initialization. Bianchi et al. (2017) compared GRU with other recurrent networks on the prediction of superimposed oscillators. However, to the best of author’s knowledge, at the moment there are no researches where the standard GRU architecture has been applied in STLF problems.
A schematic depiction of the GRU cell is reported in Fig. 4.
GRU makes use of two gates. The first is the update gate, which controls how much the current content of the cell should be updated with the new candidate state. The second is the reset gate that, if closed (value near to 0), can effectively reset the memory of the cell and make the unit act as if the next processed input was the first in the sequence. The state equations of the GRU are the following:
(13)  
Here, is a nonlinear function usually implemented by a hyperbolic tangent.
In a GRU cell, the number of parameters is larger than in the an ERNN unit, but smaller than in a LSTM cell. The parameters to be learned are the rectangular matrices , , , the square matrices , , , and the bias vectors , , .
4 Other Recurrent Neural Networks Models
In this section we describe two different types of RNNs, which are the Nonlinear AutoRegressive eXogenous inputs neural network (NARX) and the Echo State Network (ESN). Both of them have been largely employed in STLF. These two RNNs differ from the models described in Sec. 3, both in terms of their architecture and in the training procedure, which is not implemented as a BPPT. Therefore, some of the properties and training approaches discussed in Sec. 2 do not hold for these models.
4.1 NARX Network
NARX networks are recurrent dynamic architectures with several hidden layers and they are inspired by discretetime nonlinear models called Nonlinear AutoRegressive with eXogenous inputs Leontaritis and Billings (1985). Differently from other RNNs, the recurrence in the NARX network is given only by the feedback on the output, rather than from the whole internal state.
NARX networks have been employed in many different applicative contexts, to forecast future values of the input signal Diaconescu (2008); Lin et al. (1997). Menezes and Barreto (2008) showed that NARX networks perform better on predictions involving longterm dependencies. Xie et al. (2009)
used NARX in conjunction with an input embedded according to Takens method, to predict highly nonlinear time series. NARX are also employed as a nonlinear filter, whose target output is trained by using the noisefree version of the input signal
Napoli and Piroddi (2010). NARX networks have also been adopted by Plett (2003) in a graybox approach for nonlinear system identification.A NARX network can be implemented with a MultiLayer Perceptron (MLP), where the next value of the output signal
is regressed on previous values of the output signal and on previous values of an independent, exogenous input signal Billings (2013). The output equation reads(14) 
where is the nonlinear mapping function performed by the MLP, are the trainable network parameters, and are the input and the output time delays. Even if the numbers of delays and
is a finite (often small) number, it has been proven that NARX networks are at least as powerful as Turing machines, and thus they are universal computation devices
Siegelmann et al. (1997).The input of the NARX network has components, which correspond to a set of two TappedDelay Lines (TDLs), and it reads
(15) 
The structure of a MLP network consists of a set of source nodes forming the input layer, layers of hidden nodes, and an output layer of nodes. The output of the network is governed by the following difference equations
(16)  
(17)  
(18) 
where is the output of the ^{th} hidden layer at time , is a linear function and is the transfer function of the neuron, usually implemented as a sigmoid or tanh function.
The weights of the neurons connections are defined by the parameters . In particular, are the parameters that determine the weights in the input layer, are the parameters of the output layer and are the parameters of the ^{th} hidden layer. A schematic depiction of a NARX network is reported in Fig. 5.
Due to the architecture of the network, it is possible to exploit a particular strategy to learn the parameters . Specifically, during the training phase the time series relative to the desired output is fed into the network along with the input time series . At this stage, the output feedback is disconnected and the network has a purely feedforward architecture, whose parameters can be trained with one of the several, wellestablished standard backpropagation techniques. Notice that this operation is not possible in other recurrent networks such as ERNN, since the state of the hidden layer depends on the previous hidden state, whose ideal value is not retrievable from the training set. Once the training stage is over, the teacher signal of the desired output is disconnected and is replaced with the feedback of the predicted output computed by the network. The procedure is depicted in Fig. 6.
Similar to what discussed in Sec. 2.2 for the previous RNN architectures, the loss function employed in the gradient descent is defined as
(19) 
where is the error term defined in Eq. 2 and is the hyperparameter that weights the importance of the regularization term in the loss function. Due to the initial transient phase of the network, when the estimated output is initially fed back as network input, the first initial outputs are discarded.
Even if it reduces to a feedforward network in the training phase, NARX network is not immune to the problem of vanishing and exploding gradients. This can be seen by looking at the Jacobian of the statespace map at time expanded for
time step. In order to guarantee network stability, the Jacobian must have all of its eigenvalues inside the unit circle at each time step. However, this results in
, which implies that NARX networks suffer from vanishing gradients, like the other RNNs Lin et al. (1996).4.2 Echo State Network
While most hard computing approaches and ANNs demand long training procedures to tune the parameters through an optimization algorithm Huang et al. (2005), recently proposed architectures such as Extreme Learning Machines Cambria et al. (2013); Scardapane et al. (2015) and ESNs are characterized by a very fast learning procedure, which usually consists in solving a convex optimization problem. ESNs, along with Liquid State Machines Maass et al. (2002), belong to the class of computational dynamical systems implemented according to the socalled reservoir computing framework Lukoševičius and Jaeger (2009).
ESN have been applied in a variety of different contexts, such as static classification Alexandre et al. (2009), speech recognition Skowronski and Harris (2007), intrusion detection Haiyan et al. (2005), adaptive control Han and Lee (2014a), detrending of nonstationary time series Maiorino et al. (2017), harmonic distortion measurements Mazumdar and Harley (2008) and, in general, for modeling of various kinds of nonlinear dynamical systems Han and Lee (2014b).
ESNs have been extensively employed to forecast real valued time series. Niu et al. (2012) trained an ESN to perform multivariate time series prediction by applying a Bayesian regularization technique to the reservoir and by pruning redundant connections from the reservoir to avoid overfitting. Superior prediction capabilities have been achieved by projecting the highdimensional output of the ESN recurrent layer into a suitable subspace of reduced dimension Løkse et al. (2017). An important context of application with real valued time series is the prediction of telephonic or electricity load, usually performed 1hour and a 24hours ahead Deihimi and Showkati (2012); Deihimi et al. (2013); Bianchi et al. (2015b); Varshney and Verma (2014); Bianchi et al. (2015a). Deihimi et al. (2013) and Peng et al. (2014) decomposed the time series in wavelet components, which are predicted separately using distinct ESN and ARIMA models, whose outputs are combined to produce the final result. Important results have been achieved in the prediction of chaotic time series by Li et al. (2012a). They proposed an alternative to the Bayesian regression for estimating the regularization parameter and a Laplacian likelihood function, more robust to noise and outliers than a Gaussian likelihood. Jaeger and Haas (2004) applied an ESNbased predictor on both benchmark and real dataset, highlighting the capability of these networks to learn amazingly accurate models to forecast a chaotic process from almost noisefree training data.
An ESN consists of a large, sparsely connected, untrained recurrent layer of nonlinear units and a linear, memoryless readout layer, which is trained according to the task that the ESN is demanded to solve. A visual representation of an ESN is shown in Fig. 7
The difference equations describing the ESN stateupdate and output are, respectively, defined as follows:
(20)  
(21) 
where is a small noise term. The reservoir contains neurons whose transfer/activation function is typically implemented by a hyperbolic tangent. The readout instead, is implemented usually by a linear function . At time instant , the network is driven by the input signal and produces the output , and being the dimensionality of input and output, respectively. The vector has components and it describes the ESN (instantaneous) state. The weight matrices (reservoir connections), (inputtoreservoir), and (outputtoreservoir feedback) contain real values in the interval drawn from a uniform distribution and are left untrained. Alternative options have been explored recently by Rodan and Tiňo (2011) and Appeltant et al. (2011) to generate the connection weights. The sparsity of the reservoir is controlled by a hyperparameter , which determines the number of nonzero elements in . According to the ESN theory, the reservoir must satisfy the socalled “echo state property” (ESP) Lukoševičius and Jaeger (2009). This means that the effect of a given input on the state of the reservoir must vanish in a finite number of timeinstants. A widely used ruleofthumb to obtain this property suggests to rescale the matrix in order to have , where denotes the spectral radius. However, several theoretical approaches have been proposed in the literature to tune more accurately, depending on the problem at hand Boedecker et al. (2012); Bianchi et al. (2016a); Verstraeten and Schrauwen (2009); Bianchi Filippo Maria et al. (2017).
On the other hand, the weight matrices and are optimized for the target task. To determine them, let us consider the training sequence of desired inputoutputs pairs given by:
(22) 
where is the length of the training sequence. In the initial phase of training, called state harvesting, the inputs are fed to the reservoir, producing a sequence of internal states , as defined in Eq. (20). The states are stacked in a matrix and the desired outputs in a vector :
(23)  
(24) 
The initial rows in (and ) are discarded, since they refer to a transient phase in the ESN’s behavior.
The training of the readout consists in learning the weights in and so that the output of the ESN matches the desired output . This procedure is termed teacher forcing and can be accomplished by solving a convex optimization problem, for which several closed form solution exist in the literature. The standard approach, originally proposed by Jaeger (2001), consists in applying a leastsquare regression, defined by the following regularized leastsquare problem:
(25) 
where and is the regularization factor.
5 Synthetic time series
We consider three different synthetically generated time series in order to provide controlled and easily replicable benchmarks for the architectures under analysis. The three forecasting exercises that we study have a different level of difficulty, given by the nature of the signal and the complexity of the task to be solved by the RNN. In order to obtain a prediction problem that is not too simple, it is reasonable to select as forecast horizon a time interval that guarantees the measurements in the time series to become decorrelated. Hence, we consider the first zero of the autocorrelation function of the time series. Alternatively, the first minimum of the average mutual information Fraser and Swinney (1986) or of the correlation sum Liebert and Schuster (1989) could be chosen to select a where the signal shows a moregeneral form of independence. All the time series introduced in the following consist of time steps. We use the first 60% of the time series as training set, to learn the parameters of the RNN models. The next 20% of the data are used as validation set and the prediction accuracy achieved by the RNNs on this second dataset is used to tune the hyperparameters of the models. The final model performance is evaluated on a test set, corresponding to the last 20% of the values in the time series.
MackeyGlass time series
The MackeyGlass (MG) system is commonly used as benchmark for prediction of chaotic time series. The input signal is generated from the MG timedelay differential system, described by the following equation:
(28) 
For this prediction task, we set , initial condition , 0.1 as integration step for (28) and the forecast horizon .
NARMA signal
The NonLinear AutoRegressive Moving Average (NARMA) task, originally proposed by Jaeger (2002b), consists in modeling the output of the following order system:
(29) 
The input to the system is uniform random noise in [0, 1], and the model is trained to reproduce . The NARMA task is known to require a memory of at least past timesteps, since the output is determined by input and outputs from the last timesteps. For this prediction task we set and the forecast step in our experiments.
Multiple superimposed oscillator
The prediction of a sinusoidal signal is a relatively simple task, which demands a minimum amount of memory to determine the next network output. However, superimposed sine waves with incommensurable frequencies are extremely difficult to predict, since the periodicity of the resulting signal is extremely long. The time series we consider is the Multiple Superimposed Oscillator (MSO) introduced by Jaeger and Haas (2004), and it is defined as
(30) 
This academic, yet important task, is particularly useful to test the memory capacity of a recurrent neural network and has been studied in detail by Xue et al. (2007) in a dedicated work. Indeed, to accurately predict the unseen values of the time series, the network requires a large amount of memory to simultaneously implement multiple decoupled internal dynamics Wierstra et al. (2005). For this last prediction task, we chose a forecast step .
6 Realworld load time series
In this section, we present three different realworld dataset, where the time series to be predicted contain measurements of electricity and telephonic activity load. Two of the dataset contain exogenous variables, which are used to provide additional context information to support the prediction task. For each dataset, we perform a preanalysis to study the nature of the time series and to find the most suitable data preprocessing. In fact, forecast accuracy in several prediction models, among which neural networks, can be considerably improved by applying a meaningful preprocessing Zhang and Qi (2005).
6.1 Orange dataset – telephonic activity load
The first realworld dataset that we analyze is relative to the load of phone calls registered over a mobile network. Data come from the Orange telephone dataset Orange (2013), published in the Data for Development (D4D) challenge Blondel et al. (2012). D4D is a collection of call data records, containing anonymized events of Orange’s mobile phone users in Ivory Coast, in a period spanning from December 1, 2011 to April 28, 2012. More detailed information on the data are available in Ref. Bianchi et al. (2016b). The time series we consider are relative to antennatoantenna traffic. In particular, we selected a specific antenna, retrieved all the records in the dataset relative to the telephone activity issued each hour in the area covered by the antenna and generated 6 time series:

ts1: number of incoming calls in the area covered by the antenna;

ts2: volume in minutes of the incoming calls in the area covered by the antenna;

ts3: number of outgoing calls in the area covered by the antenna;

ts4: volume in minutes of the outgoing calls in the area covered by the antenna;

ts5: hour when the telephonic activity is registered;

ts6: day when the telephonic activity is registered.
In this work, we focus on predicting the volume (in minutes) of the incoming calls in ts1 of the next day. Due to the hourly resolution of the data, the STFL problem consists of a 24 stepahead prediction. The profile of ts1 for 300 hours is depicted in Fig. 8(a).
The remaining time series are treated as exogenous variables and, according to a common practice in time series forecasting Franses (1991), they are fed into the network to provide the model with additional information for improving the prediction of the target time series. Each time series contain measurements, hourly sampled. We used the first 70% as training set, the successive 15% as validation set and the remaining 15% as test set. The accuracy of each RNN model is evaluated on this last set.
In each time series there is a (small) fraction of missing values. In fact, if in a given hour no activities are registered in the area covered by the considered antenna, the relative entries do not appear in the database. As we require the target time series and the exogenous ones to have same lengths and to contain a value in each time interval, we inserted an entry with value “0” in the dataset to fill the missing values. Another issue is the presence of corrupted data, marked by a “1” in the dataset, which are relative to periods when the telephone activity is not registered correctly. To address this problem, we followed the procedure described by Shen and Huang (2005) and we replaced the corrupted entries with the average value of the corresponding periods (same weekday and hour) from the two adjacent weeks. Contrarily to some other works on STLF Ibrahim and L’Ecuyer (2013); Shen and Huang (2008); Andrews and Cunningham (1995), we decided to not discard outliers, such as holidays or days with an anomalous number of calls, nor we modeled them as separate variables.
As next step in our preanalysis, we identify the main seasonality in the data. We analyze ts1, but similar considerations hold also for the remaining time series. Through frequency analysis and by inspecting the autocorrelation function, depicted as a gray line in Fig. 8(b), it emerges a strong seasonal pattern every hours. As expected, data experience regular and predictable daily changes, due to the nature of the telephonic traffic. This cycle represents the main seasonality and we filter it out by applying a seasonal differencing with lag 24. In this way, the RNNs focus on learning to predict the series of changes in each seasonal cycle. The practice of removing the seasonal effect from the time series, demonstrated to improve the prediction accuracy of models based on neural networks Zhang and Kline (2007); Claveria et al. . The black line in Fig. 8(b) depicts the autocorrelation of the time series after seasonal differentiation. Except from the high anticorrelation at lag , introduced by the differentiation, the time series appears to be uncorrelated elsewhere and, therefore, we can exclude the presence of a second, less obvious seasonality.
Due to the nature of the seasonality in the data, we expect a strong relationship between the time series of the loads (ts1  ts4) and ts5, which is relative to the hour of the day. On the other hand, we envisage a lower dependency of the loads with ts6, the time series of the week days, since we did not notice the presence of a second seasonal cycle after the differentiation at lag . To confirm our hypothesis, we computed the mutual information between the time series, which are reported in the Hinton diagram in Fig. 9.
The size of the blocks is proportional to the degree of mutual information among the time series. Due to absence of strong relationships, we decided to discard ts6 to reduce the complexity of the model by excluding a variable with potentially low impact in the prediction task. We also discarded ts5 because the presence of the cyclic daily pattern is already accounted by doing the seasonal differencing at lag . Therefore, there is not need to provide daily hours as an additional exogenous input.
Beside differentiation, a common practice in STLF is to apply some form of normalization to the data. We applied a standardization (zscore), but rescaling into the interval
or are other viable options. Additionally, a nonlinear transformation of the data by means of a nonlinear function (e.g., squareroot or logarithm) can remove some kinds of trend and stabilize the variance in the data, without altering too much their underlying structure Weinberg et al. (2007); Shen and Huang (2008); Ibrahim and L’Ecuyer (2013). In particular, a logtransform is suitable for a set of random variables characterized by a high variability in their statistical dispersion (heteroscedasticity), or for a process whose fluctuation of the variance is larger than the fluctuation of the mean (overdispersion). To check those properties, we analyze the mean and the variance of the telephonic traffic within the main seasonal cycle across the whole dataset.
Average weekly load (solid black line) and the standard deviation (shaded gray area) of the telephonic activity in the whole dataset.
The solid black line in Fig. 10(a), represents the mean load of ts1, while the shaded gray area illustrates the variance. As we can see, the data are not characterized by overdispersion, since the fluctuations of the mean are greater than the ones of the variance. However, we notice the presence of heteroscedasticity, since the amount of variance changes in different hours of the day. In fact, the central hours where the amount of telephonic activity is higher, are characterized by a greater standard deviation in the load. In Fig. 10(b), we observe that by applying a logtransform we significantly reduce the amount of variance in the periods characterized by a larger traffic load. However, after the logtransformation the mean value of the load become more flattened and the variance relative to periods with lower telephonic activity is enhanced. This could cause issues during the training of the RNN, hence in the experiments we evaluate the prediction accuracy both with and without applying the logtransformation to the data.
Preprocessing transformations are applied in this order: (i) logtransform, (ii) seasonal differencing at lag 24, (iii) standardization. Each preprocessing operation is successively reversed to evaluate the forecast produced by each RNN.
6.2 ACEA dataset – electricity load
The second time series we analyze is relative to the electricity consumption registered by ACEA (Azienda Comunale Energia e Ambiente), the company which provides the electricity to Rome and some neighbouring regions. The ACEA power grid in Rome consists of 10.490 km of medium voltage lines, while the low voltage section covers 11.120 km. The distribution network is constituted of backbones of uniform section, exerting radially and with the possibility of countersupply if a branch is out of order. Each backbone is fed by two distinct primary stations and each halfline is protected against faults through the breakers. Additional details can be found in Ref. Santis et al. (2015). The time series we consider concerns the amount of supplied electricity, measured on a medium voltage feeder from the distribution network of Rome. Data are collected every minutes for days of activity (almost 3 years), spanning from 2009 to 2011, for a total of measurements. Also in this case, we train the RNNs to predict the electricity load 24h ahead, which corresponds to 144 time step ahead prediction. For this forecast task we do not provide any exogenous time series to the RNNs. In the hyperparameter optimization, we use the load relative to the first 3 months as training set and the load of the 4^{th} month as validation set. Once the best hyperparameter configuration is identified, we finetune each RNN on the first 4 months and we use the 5^{th} month as test set to evaluate and to compare the accuracy of each network.
A profile of the electric consumption over one week (1008 measurements), is depicted in Fig. 11(a).
In the ACEA time series there are no missing values, but 742 measurements (which represent 0.54% of the whole dataset) are corrupted. The consumption profile is more irregular in this time series, with respect to the telephonic data from the Orange dataset. Therefore, rather than replacing the corrupted values with an average load, we used a form of imputation with a less strong bias. Specifically, we first fit a cubic spline to the whole dataset and then we replaced the corrupted entries with the corresponding values from the fitted spline. In this way, the imputation better accounts for the local variations of the load.
Also in this case, we perform a preemptive analysis in order to understand the nature of the seasonality, to detect the presence of hidden cyclic patterns, and to evaluate the amount of variance in the time series. By computing the autocorrelation function up to a sufficient number of lags, depicted as a gray line in Fig. 11(b), it emerges a strong seasonality pattern every 144 time intervals. As expected, this corresponds exactly to the number of measurements in one day. By differencing the time series at lag 144, we remove the main seasonal pattern and the trend. Also in this case, the negative peak at lag 144 is introduced by the differentiation. If we observe the autocorrelation plot of the time series after seasonal differencing (black line in 11(b)), a second strong correlation appears each 1008 lags. This second seasonal pattern represents a weekly cycle, that was not clearly visible before the differentiation. Due to the long periodicity of the time cycle, to account this second seasonality a predictive model would require a large amount of memory to store information for a longer time interval. While a second differentiation can remove this second seasonal pattern, we would have to discard the values relative to the last week of measurements. Most importantly, the models we train could not learn the similarities in consecutive days at a particular time, since they would be trained on the residuals of the load at the same time and day in two consecutive weeks. Therefore, we decided to apply only the seasonal differentiation at lag 144.
To study the variance in the time series, we consider the average daily load over the main seasonal cycle of 144 time intervals. As we can see from Fig. 12(a), data appear to be affected by overdispersion, as the standard deviation (gray shaded areas) fluctuates more than the mean. Furthermore, the mean load value (black solid line) seems to not change much across the different hours, while it is reasonable to expect significant differences in the load between night and day. However, we remind that the Acea time series spans a long time lapse (almost 3 years) and that the electric consumption is highly related to external factors such as temperature, daylight saving time, holidays and other seasonal events that change over time. Therefore, in different periods the load profile may vary significantly. For example, in Fig. 12(b) we report the load profile relative to the month of January, when temperatures are lower and there is a high consumption of electricity, also in the evening, due to the usage of heating. In June instead (Fig. 12(c)), the overall electricity consumption is lower and mainly concentrated on the central hours of the day. Also, it is possible to notice that the load profile is shifted due to the daylight saving time. As we can see, the daily averages within a single month are characterized by a much lower standard deviation (especially in the summer months, with lower overall load consumption) and the mean consumption is less flat. Henceforth, a nonlinear transformation for stabilizing the variance is not required and, also in this case, standardization is suitable for normalizing the values in the time series. Since we focus on a short term forecast, having a high variance in loads relative to very distant periods is not an issue, since the model prediction will depends mostly on the most recently seen values.
To summarize, as preprocessing operation we apply: (i) seasonal differencing at lag 144, (ii) standardization. As before, the transformations are reverted to estimate the forecast.
6.3 GEFCom2012 dataset – electricity load
The last real world dataset that we study is the time series of electricity consumption from the Global Energy Forecasting Competition 2012 (GEFCom2012) Kaggle (2012). The GEFCom 2012 dataset consists of 4 years (2004 – 2007) of hourly electricity load collected from a US energy supplier. The dataset comprehends time series of consumption measurements, from 20 different feeders in the same geographical area. The values in each time series represent the average hourly load, which varies from kWh to kWh. The dataset also includes time series of the temperatures registered in the area where the electricity consumption is measured.
The forecast task that we tackle is the hours ahead prediction of the aggregated electricity consumption, which is the sum of the 20 different load time series in year 2006. The measurements relative to the to first 10 months of the 2006 are used as training set, while the 11^{th} month is used as validation set for guiding the hyperparameters optimization. The time series of the temperature in the area is also provided to the RNNs as an exogenous input. The prediction accuracy of the optimized RNNs is then evaluated on the last month of the 2006. A depiction of the load profile of the aggregated load time series is reported in Fig. 13(a).
We can observe a trend in the time series, which indicates a decrement in the energy demand over time. This can be related to climate conditions since, as the temperature becomes warmer during the year, the electricity consumption for the heating decreases.
To study the seasonality in the aggregated time series, we evaluate the autocorrelation function, which is depicted as the gray line in Fig. 13(b). From the small subplot in topright part of the figure, relative to a small segment of the time series, it emerges a strong seasonal pattern every 24 hours. By applying a seasonal differentiation with lag 24 the main seasonal pattern is removed, as we can see from the autocorrelation function of the differentiated time series, depicted as a black line in the figure. After differentiation, the autocorrelation becomes close to zero after the first lags and, therefore, we can exclude the presence of a second, strong seasonal pattern (e.g. a weekly pattern).
Similarly to what we did previously, we analyze the average load of the electricity consumption during one week. As for the ACEA dataset, rather than considering the whole dataset, we analyze separately the load in one month of winter and one month in summer. In Fig. 14(a), we report the mean load (black line) and standard deviation (gray area) in January. Fig. 14(b) instead, depicts the measurements for May. It is possible to notice a decrement of the load during the spring period, due to the reduced usage of heating. It is also possible to observe a shift in the consumption profile to later hours in the day, due to the time change. By analyzing the amount of variance and the fluctuations of the mean load, we can exclude the presence of overdispersion and heteroscedasticity phenomena in the data.
To improve the forecasting accuracy of the electricity consumption, a common practice is to provide to the prediction system the time series of the temperature as an exogenous variable. In general, the load and the temperature are highly related, since both in the coldest and in the warmest months electricity demand increases, due to the usage of heating and air conditioning, respectively. However, the relationship between temperature and load cannot be captured by the linear correlation, since the consumption increases both when temperatures are too low or too high. Indeed, the estimated correlation between the aggregated load time series of interest and the time series of the temperature in the area yields only a value of 0.2. However, their relationship is evidenced by computing a 2dimensional histogram of the two variables, proportional to their estimated joint distribution, which is reported in Fig
15. The Vshape, denotes an increment of the electricity consumption for low and high temperatures with respect to a mean value of about C.The preprocessing operations we apply on the GEFCom dataset are: (i) seasonal differencing at lag 24, (ii) standardization. Also in this case, these transformations are reverted to estimate the forecast.
7 Experiments
In this section we compare the prediction performance achieved by the network architectures presented in Sec. 3 and 4 on different time series. For each architecture, we describe the validation procedure we follow to tune the hyperparameters and to find an optimal learning strategy for training the weights. During the validation phase, different configurations are randomly selected from admissible intervals and, once the training is over, their optimality is evaluated as the prediction accuracy achieved on the validation set. We opted for a random search as it can find more accurate results than a grid search, when the same number of configurations are evaluated Bergstra and Bengio (2012). Once the (sub)optimal configuration is identified, we train each model times on the training and validation data, using random and independent initializations of the network parameters, and we report the highest prediction accuracy obtained on the unseen values of the test set.
To compare the forecast capability of each model, we evaluate the prediction accuracy as . NRMSE is the Normalized Root Mean Squared Error that reads
(31) 
where computes the mean, are the RNN outputs and are the groundtruth values.
In the following, we present two types of experiments. The first experiment consists in the prediction of the synthetic time series presented in Sec. 5, commonly considered as benchmarks in forecast applications, and the results are discussed in Sec. 7.2. In the second experiment we forecast the realworld telephonic and electricity load time series, presented in Sec. 6. The results of this second experiment are discussed in Sec. 7.3.
7.1 Experimental settings
ERNN, LSTM and GRU
The three RNNs described in Sec. 3
have been implemented in Python, using Keras library with Theano
Theano Development Team (2016) as backend^{2}^{2}2Keras library is available at https://github.com/fchollet/keras. Theano library is available at http://deeplearning.net/software/theano/.To identify an optimal configuration for the specific task at hand, we evaluate for each RNN different values of the hyperparameters and training procedures. The configurations are selected randomly and their performances are evaluated on the validation set, after having trained the network for epochs. To get rid of the initial transient phase, we drop the first outputs of the network. A total of random configurations for each RNN are evaluated and, once the optimal configuration is found, we compute the prediction accuracy on the test set. In the test phase, each network is trained for epochs.
The optimization is performed by assigning to each hyperparameter a value uniformly sampled from a given interval, which can be continuous or discrete. The gradient descent strategies are selected from a set of possible alternatives, which are SGD, Nesterov momentum and Adam. For SGD and Nesterov, we anneal the learning rate with a step decay of in each epoch. The learning rate is sampled from different intervals, depending on the strategy selected. Specifically, for SGD we set , with uniformly sampled in . For Nesterov and Adam, since they benefit from a smaller initial value of the learning rate, we sample uniformly in . The remaining hyperparameters used in the optimization strategies are kept fixed to their default values (see Sec. 2.3). Regarding the number of hidden units in the recurrent hidden layer, we randomly chose for each architecture four possible configurations that yield an amount of trainable parameters approximately equal to , , , and . This corresponds to in ERNN, in LSTM and in GRU. For each RNNs, is randomly selected from these sets. To deal with the problem of vanishing gradient discussed in Sec. 2.4, we initialize the RNN weights by sampling them from an uniform distribution in and then rescaling their values by . For the and regularization terms, we sample independently and from , an interval containing values commonly assigned to these hyperparameters in RNNs Zeyer et al. (2016). We apply the same regularization to input, recurrent and output weights. As suggested by Gal and Ghahramani (2015), we drop the same input and recurrent connections at each time step in the BPTT, with a dropout probability drawn from , which are commonly used values Pham et al. (2014b). If , we also apply a regularization. This combination usually yields a lowest generalization error than dropout alone Srivastava et al. (2014). Note that another possible approach combines dropout with the maxnorm constraint, where the norm of the weights is clipped whenever it grows beyond a given constant, which, however, introduces another hyperparameter.
For the training we consider the backpropagation through time procedure with . The parameter is randomly selected from the set . As we discussed in Sec. 2.1, this procedure differs from both the true BPTT and the epochwise BPTT Williams and Zipser (1995)
, which is implemented as default by popular deep learning libraries such as TensorFlow
Abadi et al. (2015).Narx
This RNN is implemented using the Matlab Neural Network toolbox^{3}^{3}3https://se.mathworks.com/help/nnet/ref/narxnet.html. We configured NARX network with an equal number of input and output lags on the TDLs () and with the same number of neurons in each one of the hidden layers. Parameters relative to weight matrices and bias values are trained with a variant of the quasi Newton search, called LevenbergMarquardt optimization algorithm. This is an algorithm for error backpropagation that provides a good tradeoff between the speed of the Newton algorithm and the stability of the steepest descent method Battiti (1992). The loss function to be minimized is defined in Eq. 19.
NARX requires the specification of hyperparameters, which are uniformly drawn from different intervals. Specifically, TDL lags are drawn from ; the number of hidden layers is drawn from ; the number of neurons in each layer is drawn from ; the regularization hyperparameter in the loss function is randomly selected from ; the initial value of learning rate is randomly selected from .
A total of random configurations for NARX are evaluated and, for each hyperparameters setting, the network is trained for epochs in the validation. In the test phase, the network configured with the optimal hyperparameters is trained for epochs. Also in this case, we discard the first network outputs to get rid of the initial transient phase of the network.
Esn
For the ESN, we used a modified version of the Python implementation^{4}^{4}4https://github.com/siloekse/PythonESN, provided by Løkse et al. (2017)
. Learning in ESN is fast, as the readout is trained by means of a linear regression. However, the training does not influence the internal dynamics of the random reservoir, which can be controlled only through the ESN hyperparameters. This means that a more accurate (and computationally intensive) search of the optimal hyperparametyers is required with respect to the other RNN architectures. In RNNs, the precise, yet slow gradientbased training procedure is mainly responsible for learning the necessary dynamics and it can compensate a suboptimal choice of the hyperparameters.
Therefore, in the ESN validation phase we evaluate a larger number of configurations (), by uniformly drawing different hyperparameters from specific intervals. In particular, the number of neurons in the reservoir, , is drawn from ; the reservoir spectral radius, , is drawn in the interval ; the reservoir connectivity is drawn from ; the noise term in Eq. (20
) comes from a Gaussian distribution with zero mean and variance drawn from
; scaling of input signal and desired response are drawn from ; scaling of output feedback is drawn from ; the linear regression regularization parameter is drawn from . Also in this case, we discarded the first ESN outputs relative to the initial transient phase.7.2 Results on synthetic dataset
In Fig. 16 we report the prediction accuracy obtained by the RNNs on the test set of the three synthetic problems. The best configurations of the architectures identified for each task through random search are reported in Tab. 1.
Network  Task  RNN Configuration  
Narx  TDL  
5pt.  MG  15  2  6  3.8E6  
NARMA  17  2  10  2.4E4  
MSO  12  5  2  
ERNN  OPT  
5pt.  MG  20  10  80  Adam  0.00026  0  0  0.00037 
NARMA  50  25  80  Nesterov  0.00056  0  0  1E5  
MSO  50  25  60  Adam  0.00041  0  0  0.00258  
LSTM  OPT  
5pt.  MG  50  25  40  Adam  0.00051  0  0  0.00065 
NARMA  40  20  40  Adam  0.00719  0  0  0.00087  
MSO  50  25  20  Adam  0.00091  0  0  0.0012  
GRU  OPT  
5pt.  MG  40  20  46  SGD  0.02253  0  0  6.88E6 
NARMA  40  20  46  Adam  0.00025  0  0  0.00378  
MSO  50  25  35  Adam  0.00333  0  0  0.00126  
ESN 