1 Background
Recurrent neural networks (RNNs) are the building blocks of modern sequential learning. RNNs use recurrent layers to capture nonlinear temporal dependencies with a relatively small number of parameters (Graves2013). They learn temporal dynamics by mapping an input sequence to a hidden state sequence and outputs, via a recurrent layer and a feedforward layer. However, despite the success of RNNs and their successors, there appears to be a chasm between the statistical modeling literature (see e.g. boxjen76; UBMA_276336909; GVK126800421) and the machine learning literature (see e.g. bayer; pascanu2012difficulty; Hochreiter1997; schmidhuber1997long).
While there have been many recent theoretical developments in recurrent neural networks from a dynamical systems perspective (lim2019predicting; chen2019symplectic; niu2019recurrent), there are still open fundamental questions as to how the type and properties of the time series data inform the choice of architectures. For example, mcdermott2018deep find empirical evidence that Echo State Networks are well suited for spatiotemporal modeling, and Uber (SMYL202075) won the 2019 M4 forecasting competition with a hybrid exponential smoothing-RNN method, which exponentiates the output of a neural network and combines it with a past observed time series level. There have been exhaustive empirical studies on the application of recurrent neural networks to prediction from financial time series data such as historical limit order book and price history (Dixon2017b; doi:10.1002/for.2585; borovykh2017conditional; doi:10.1080/14697688.2019.1622295; doi:10.1080/14697688.2019.1622287; doi:10.1080/14697688.2019.1634277). doi:10.1080/14697688.2019.1622295 find evidence that stacking networks leads to superior performance on intraday stock data combined with technical indicators, whereas 10.1371/journal.pone.0180944 combine wavelet transforms and stacked autoencoders with LSTMs on OHLC bars and technical indicators.
borovykh2017conditional find evidence that dilated convolutional networks outperform LSTMs on various indices. Dixon2017b demonstrate that RNNs outperform feedforward networks with lagged features on limit order book data. One of the main contributions of this paper is to introduce a new class of RNNs, with supporting theoretical justification for architectural design choices.
A second challenge with applying neural networks to time series data is the overreliance on extensive hyperparameter tuning, a computationally complex global optimization problem whose outcomes depend on the initial parameter choices. Moreover, little is known about how often the hyperparameters should be retuned, or how the length of the historical data used to train the model affects the tuning.
A third challenge is how to combine more traditional and informative diagnostics of time series data, such as stationarity and memory cutoff, into the model selection procedure. Such an approach is standard in the time series modeling literature (boxjen76) but absent in the machine learning literature. Finally, most of the aforementioned studies are engineering oriented and do not provide substantive insight into why the architectural choice is well suited to the dataset.
One of the main contributions of this paper is to cast RNNs into a time series modeling framework and rely on statistical diagnostics in combination with cross-validation to tune the architecture. The statistical tests characterize stationarity and the memory cutoff length, and provide insight into whether the data is suitable for longer-term forecasting and whether the model must be nonstationary.
It is well known that plain RNNs have difficulty in learning long-term dynamics, due in part to the vanishing and exploding gradients that can result from backpropagating the gradients down through the many unfolded layers of the network (bayer; pascanu2012difficulty). A particular type of RNN, called a Long Short-Term Memory (LSTM) (Hochreiter1997; schmidhuber1997long), was proposed to address this issue of vanishing or exploding gradients, which essentially arises due to the shape of the activation function. A memory unit used in an LSTM allows the network to learn which previous states can be forgotten and alleviates the vanishing gradient. Partly for this reason, LSTMs have demonstrated much empirical success in the literature
(Gers2001; 7926112). The inclusion of forget, reset and update gates in the LSTM, and in a slightly simpler variant, Gated Recurrent Units (GRUs) (DBLP:journals/corr/ChungGCB14), provides a switching mechanism to forget memory while simultaneously updating a hidden state vector. These units do not, however, provide an easily accessible mathematical structure from which to study their time series modeling properties. As such, there is much opaqueness about the types of architectures which are best suited to prediction tasks based on data properties such as wide-sense nonstationarity (see Appendix A). However, we shall show that exponential smoothing not only alleviates the gradient problem but also characterizes the time series modeling properties of these architectures. The main outcome of applying this approach is that it partially identifies the choice of architecture based on both its time series properties and those of the data. The approach is general and easily extensible to GRUs and LSTMs with the inclusion of the reset gate and cell memory. In this paper, the input sequence is assumed to be of fixed length, although the methodology could be extended to variable length as in sequence-to-sequence learning models (NIPS2014_5346). A summary of the main contributions of this paper follows:

We show how plain RNNs, with short-term memory, can be generalized to exhibit long-term memory with a smoothing scalar α; smoothing also helps offset the infamous vanishing gradient problem in plain RNNs with only one extra parameter;

We show how time series analysis guides the architectural parameters: the sequence length of the RNN can be determined from the estimated partial autocorrelogram, and tanh activation of the recurrent layer is needed for stability; and

We demonstrate how a dynamic α_t-RNN model for nonstationary data (Dixon2021), a lightweight version of GRUs and LSTMs, has the ability to model complex nonstationary time series data with comparable performance.
The remainder of this paper is outlined as follows. Section 2 introduces the α-RNN and Section 3 applies time series analysis to guide the architectural properties. Section 4 introduces a dynamic version of the model and illustrates the dynamical behavior of the smoothing. Section 5 relates the models to GRUs and LSTMs. Details of the training, implementation and experiments together with the results are presented in Section 6. Finally, Section 7 concludes with directions for future research.
2 α-RNNs
Given autocorrelated observations of covariates or predictors, x_t, and responses, y_t, at times t = 1, …, N, in the time series data D = {(x_t, y_t)}_{t=1}^{N}, our goal is to construct an m-step (m ≥ 1) ahead time series predictor, ŷ_{t+m} = F(X_t), of an observed target, y_{t+m}, from a p-length input sequence X_t = (x_{t−p+1}, …, x_t), where x_{t−j} is a lagged observation of x_t, for j = 0, …, p−1, and ε_{t+m} is the homoscedastic model error at time t + m. We introduce the α-RNN:

(1) ŷ_{t+m} = F^{(α)}_{W,b}(X_t),

where F^{(α)}_{W,b} is an α-smoothed RNN with weight matrices W = (W_h, U_h, W_y) and bias vectors b = (b_h, b_y). For each time step s = t − p + 1, …, t, forward passes separately update a hidden internal state h_s, using the recurrence relations:

h_s = σ(W_h x_s + U_h h̃_{s−1} + b_h),
h̃_s = α h_s + (1 − α) h̃_{s−1},

where the input weight matrix W_h ∈ ℝ^{H×d} and the recurrence weight matrix U_h ∈ ℝ^{H×H}; H is the number of hidden units and d is the dimensionality of the input space. h̃_s is an exponentially smoothed version of the hidden state h_s. When the output is continuous, the output from the final hidden state is given by:

(2) ŷ_{t+m} = W_y h̃_t + b_y,

with the starting condition in each sequence, h̃_{t−p+1} = h_{t−p+1}.
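As a concrete sketch, the forward pass above can be written in a few lines of NumPy (a minimal illustration with a zero initial smoothed state rather than the per-sequence starting condition; the function and variable names are ours, not from any library):

```python
import numpy as np

def alpha_rnn_forecast(x, W_h, U_h, b_h, W_y, b_y, alpha):
    """One forward pass of an alpha-RNN over a p-length input sequence.

    x has shape (p, d). The smoothed state is a convex combination of
    the current hidden state and the previous smoothed state
    (exponential smoothing); the output is a final feedforward layer.
    """
    H = b_h.shape[0]
    h_tilde = np.zeros(H)                               # smoothed hidden state
    for s in range(x.shape[0]):
        h = np.tanh(W_h @ x[s] + U_h @ h_tilde + b_h)   # recurrence
        h_tilde = alpha * h + (1.0 - alpha) * h_tilde   # exponential smoothing
    return W_y @ h_tilde + b_y                          # feedforward output layer
```

With α = 1 the smoothing is a no-op and the function reduces exactly to a plain RNN forward pass.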
3 Time Series Modeling
This section bridges the time series modeling literature (boxjen76; UBMA_276336909; doi:10.1111/jtsa.12464) and the machine learning literature. We shall assume here for ease of exposition that the time series data is univariate (d = 1), x_t = y_t, and thus the predictor is endogenous (the sequence of features is drawn from the same time series as the predictor).
Since autoregressive AR(p) models are well known in time series modeling, we find it instructive to show that plain RNNs are nonlinear AR(p) models. For ease of exposition, consider the simplest case of a RNN with one hidden unit, H = 1. Without loss of generality, we set W_y = 1, b_h = 0, and b_y = 0. Then we can show by backward substitution that a plain RNN with sequence length p is a nonlinear autoregressive model of order p:

(3) ŷ_{t+m} = σ(W_h y_t + U_h σ(W_h y_{t−1} + U_h σ(W_h y_{t−2} + … + U_h σ(W_h y_{t−p+1}) … ))).

When the activation is the identity function, σ(x) := x, then we recover the AR(p) model

(4) ŷ_{t+m} = Σ_{i=0}^{p−1} φ_i y_{t−i},  φ_i := (U_h)^i W_h,

with geometrically decaying autoregressive coefficients when |U_h| < 1.
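This equivalence is easy to check numerically. The sketch below (pure Python; the function names are ours) unrolls the single-unit, identity-activation RNN and compares it with the geometric-coefficient AR(p) form:

```python
def plain_rnn_identity(y_lags, W_h, U_h):
    """Unrolled single-unit plain RNN with identity activation over the
    sequence y_lags = [y_{t-p+1}, ..., y_t] (oldest lag first)."""
    h = 0.0
    for y in y_lags:
        h = W_h * y + U_h * h   # identity activation: no nonlinearity
    return h

def ar_p_geometric(y_lags, W_h, U_h):
    """Equivalent AR(p) form: sum_i (U_h ** i) * W_h * y_{t-i}."""
    return sum((U_h ** i) * W_h * y
               for i, y in enumerate(reversed(y_lags)))
```

Both functions return the same forecast for any lag sequence, confirming that the unactivated plain RNN is an AR(p) model with coefficients (U_h)^i W_h.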
α-RNNs
The α-RNN(p) is almost identical to a plain RNN, but with an additional scalar smoothing parameter, α, which provides the recurrent network with "long memory" (that is, autoregressive memory beyond the sequence length; this is also sometimes referred to as "stateful". For the avoidance of doubt, we are not suggesting that the α-RNN has an additional cellular memory, as in LSTMs). To see this, let us consider a one-step ahead univariate α-RNN(p) in which the smoothing parameter is fixed. For each time step s:

h_s = σ(W_h y_s + U_h h̃_{s−1} + b_h),
h̃_s = α h_s + (1 − α) h̃_{s−1}.

This model augments the plain RNN by replacing h_{s−1} in the hidden layer with an exponentially smoothed hidden state h̃_{s−1}. The effect of the smoothing is to provide infinite memory when α ≠ 1. For the special case when α = 1, we recover the plain RNN with short memory of length p.
We can easily see this informally by simplifying the parameterization and considering the unactivated case. Setting b_h = b_y = 0 and U_h = W_h = W_y = 1:

(5) ŷ_{t+1} = h̃_t,

(6) h̃_t = α h_t + (1 − α) h̃_{t−1},

(7) h_t = y_t + h̃_{t−1},

with the starting condition in each sequence, h̃_{t−p+1} = h_{t−p+1}. Without loss of generality, consider p = 2 lags in the model so that h̃_{t−1} = α y_{t−1} + h̃_{t−2}. Then

(8) ŷ_{t+1} = α (y_t + h̃_{t−1}) + (1 − α) h̃_{t−1}

and the model can be written in the simpler form

(9) ŷ_{t+1} = φ_1 y_t + φ_2 y_{t−1} + h̃_{t−2}
with autoregressive weights φ_1 = α and φ_2 = α. We now see that there is a third term, h̃_{t−2}, on the RHS of Eq. 9, which vanishes when α = 1 (the plain RNN re-initializes its hidden state at the start of each sequence) but provides infinite memory to the model, since h̃_{t−2} depends on h_1, the first observation in the whole time series, not just the first observation in the sequence. To see this, we unroll the recursion relation in the exponential smoother:

(10) h̃_{t−2} = α Σ_{i=0}^{t−4} (1 − α)^i h_{t−2−i} + (1 − α)^{t−3} h_1,

where we used the property that h̃_1 = h_1. It is often convenient to characterize exponential smoothing by the half-life: the number of lags n needed for the weight (1 − α)^n to decay to a half, which is n = −1/log₂(1 − α). To gain further insight we use partial autocorrelations to characterize the memory of the model.
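Both the unrolled smoother and the half-life formula are easy to verify numerically (a pure-Python sketch; the function names are ours):

```python
import math

def exp_smooth(h_seq, alpha):
    """Exponentially smooth a sequence of hidden states, with the
    starting condition h_tilde_1 = h_1 used in the text."""
    h_tilde = h_seq[0]
    out = [h_tilde]
    for h in h_seq[1:]:
        h_tilde = alpha * h + (1.0 - alpha) * h_tilde
        out.append(h_tilde)
    return out

def half_life(alpha):
    """Number of lags n for the weight (1 - alpha)**n to decay to 1/2."""
    return -1.0 / math.log2(1.0 - alpha)
```

For example, α = 0.5 gives a half-life of exactly one lag, and smaller α gives a longer half-life (longer memory).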
3.1 Partial Autocorrelation
We consider autoregressive time series models, with additive white noise ε_t (the assumption of Gaussian white noise is convenient but not necessary; any type of white noise is possible), of the form

y_t = Σ_{i=1}^{p} φ_i y_{t−i} + ε_t,

which carry a signature allowing the order, p, to be determined from "covariance stationary" time series data (see Appendix A for a definition of covariance stationarity). This signature encodes the memory in the model and is given by "partial autocorrelations". Informally, each partial autocorrelation measures the correlation of a random variable, y_t, with its lag, y_{t−h}, while controlling for intermediate lags. The partial autocorrelation must be nonzero to be able to predict y_t from y_{t−h}. The sign of the partial autocorrelation is also useful for interpretability and describes the directional relationship between the random variables. The formal definition of the partial autocorrelation is now given.

Definition 3.1 (Partial Autocorrelation).
A partial autocorrelation at lag h ≥ 2 is a conditional autocorrelation between a variable, y_t, and its h-th lag, y_{t−h}, under the assumption that the values of the intermediate lags, y_{t−1}, …, y_{t−h+1}, are controlled:

τ̃_h := γ̃_h / (σ_{t|h} σ_{t−h|h}),

where γ̃_h is the lag-h partial autocovariance, σ²_{t|h} and σ²_{t−h|h} are the variances of the corresponding projection residuals, P(W | Z) is an orthogonal projection of W onto the set Z, and

(11) γ̃_h := Cov( y_t − P(y_t | y_{t−1}, …, y_{t−h+1}), y_{t−h} − P(y_{t−h} | y_{t−1}, …, y_{t−h+1}) ).

The partial autocorrelation function (PACF) is the map h ↦ τ̃_h. The plot of τ̃_h against h is referred to as the partial correlogram.
The PACF of the α-RNN(p) can be used to determine a lower bound on the sequence length. To see this, we first show that the partial autocorrelation of the α-RNN(p) is time independent and has a sharp cutoff after p lags if α = 1.

Theorem 1 (Partial autocorrelation of an α-RNN(p)).

The partial autocorrelation of the α-RNN(p) is time independent and exhibits a cutoff after p lags, τ̃_h = 0 for h > p, if α = 1. If α ∈ (0, 1), the α-RNN(p) has nonzero partial autocorrelations at lags beyond the sequence length.

See Appendix B for a proof. For α ∈ (0, 1), the α-RNN has nonzero partial autocorrelations at lags h > p. It is easy to see this from the additional term containing h̃_{t−2} in Equation 9. Further insight can be gained from Figure 1, which shows the fitted partial correlogram from data generated by an α-RNN(3) with additive white noise. We observe that the memory is always longer than the sequence length of 3 when α < 1. As α approaches zero, the model has increasing memory (although the model has no memory in the limit α = 0). The theorem and the properties of the α-RNN(p) suggest determining the sequence length from the fitted PACF of each covariate. Moreover, the prediction horizon should not exceed that suggested by the maximum order of statistically significant partial autocorrelations.
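In practice the PACF is estimated from data. A minimal sketch follows (NumPy, using the Durbin-Levinson recursion on the sample autocorrelations; the function names and the 2/√N significance rule of thumb are ours, and a library such as statsmodels provides equivalent functionality):

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_0..rho_max_lag."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.dot(y, y)
    return np.array([1.0] + [np.dot(y[k:], y[:-k]) / denom
                             for k in range(1, max_lag + 1)])

def pacf_durbin_levinson(y, max_lag):
    """Partial autocorrelations tau_1..tau_max_lag via Durbin-Levinson."""
    rho = sample_acf(y, max_lag)
    phi = np.zeros((max_lag + 1, max_lag + 1))
    pacf = np.zeros(max_lag)
    phi[1, 1] = rho[1]
    pacf[0] = rho[1]
    for k in range(2, max_lag + 1):
        num = rho[k] - np.dot(phi[k - 1, 1:k], rho[1:k][::-1])
        den = 1.0 - np.dot(phi[k - 1, 1:k], rho[1:k])
        phi[k, k] = num / den
        phi[k, 1:k] = phi[k - 1, 1:k] - phi[k, k] * phi[k - 1, 1:k][::-1]
        pacf[k - 1] = phi[k, k]
    return pacf

def choose_sequence_length(y, max_lag):
    """Largest lag with a significant PACF value (|tau_h| > 2/sqrt(N)),
    used as a lower bound on the RNN sequence length."""
    tau = pacf_durbin_levinson(y, max_lag)
    sig = np.nonzero(np.abs(tau) > 2.0 / np.sqrt(len(y)))[0]
    return int(sig[-1]) + 1 if sig.size else 0
```

On AR(1) data the estimated PACF is large at lag one and negligible thereafter, so the rule selects a short sequence length, as the theorem suggests.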
3.2 Stability
Time series modeling also places emphasis on model "stability": this is the model attribute that past random disturbances decay in the model, so that the effect of lagged data becomes less relevant to the model output with increasing lag. We present the following theorem, which shows that the stability of the α-RNN model is determined solely by the hidden state's activation function.
Theorem 2 (Stability of α-RNN(p) models).

Suppose that the α-RNN(p) can be written as an invertible nonlinear function of the lag operator, where, without loss of generality, we have again conveniently set W_y = 1, b_h = 0, and b_y = 0. Then the α-RNN is stable if and only if |σ(x)| < 1 for all finite x.

See Appendix C for a proof. In particular, the theorem justifies the common choice of σ = tanh; while the sigmoid is another viable choice, it is too restrictive as it prohibits negative partial autocorrelations. We shall see in Section 6 that negative partial autocorrelations arise in time series data.
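A quick numerical illustration of the role of the activation (a sketch with arbitrary parameter values, names ours): with σ = tanh the response to a shock remains bounded even when the recurrence weight exceeds one in magnitude, whereas an identity activation lets it grow without bound.

```python
import numpy as np

def iterate_hidden(sigma, U_h, shock, n_steps):
    """Propagate a single shock through the unforced recurrence
    h_s = sigma(U_h * h_{s-1}) and return |h| at each step."""
    h = shock
    path = []
    for _ in range(n_steps):
        h = sigma(U_h * h)
        path.append(abs(h))
    return path

# Same recurrence weight |U_h| = 1.5 > 1, two activations:
tanh_path = iterate_hidden(np.tanh, 1.5, shock=2.0, n_steps=50)       # stays bounded
identity_path = iterate_hidden(lambda x: x, 1.5, shock=2.0, n_steps=50)  # explodes
```

The bounded tanh response is consistent with the theorem; the identity activation, which violates |σ(x)| < 1, produces an unstable model.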
3.3 Vanishing gradient
It is well known that plain RNNs exhibit a vanishing gradient problem (pascanu2012difficulty; bayer). Following pascanu2012difficulty, we extend some of the analysis of BPTT to α-RNNs. The α-RNN(p) BPTT can be written as:

(12) ∂L/∂h̃_s = ∂L/∂h̃_t · Π_{s<r≤t} ∂h̃_r/∂h̃_{r−1},

(13) ∂h̃_r/∂h̃_{r−1} = α ∂h_r/∂h̃_{r−1} + (1 − α) I,

for some generic loss function L, where ∂h_r/∂h̃_{r−1} = diag(σ′(·)) U_h. Substituting the expression for ∂h̃_r/∂h̃_{r−1} into Equation 12 gives an expression proportional to

Π_{s<r≤t} ( α diag(σ′(·)) U_h + (1 − α) I ).

When α = 1, this expression is Π_{s<r≤t} diag(σ′(·)) U_h. This product goes to zero with increasing function compositions, due to the gradient of tanh vanishing for large arguments and the product of small factors. However, when α < 1, the additional term (1 − α) I provides an additional contribution to the gradient which is nontrivial for small α and independent of the input.

Thus the α-RNN not only alleviates the vanishing gradient problem and has guaranteed stability if we choose a tanh activation function, but also exhibits nonzero partial autocorrelations up to at least lag p for α ∈ (0, 1).
The α-RNN model can be trained by treating α as an additional parameter to be fitted with stochastic gradient descent. The choice to predetermine that α is independent of time is obviously limited to stationary time series. While this is restrictive, it suggests that a simple statistical test of the time series can predetermine the efficacy of the approach (for multivariate data, the test should be applied to each covariate). Moreover, if the data is covariance stationary, then the α-RNN will preserve the stationarity of the partial autocorrelation structure, eliminating the need for more complex architectures such as GRUs and LSTMs (which are often motivated purely on the basis of the vanishing gradient problem). Such a procedure shall be demonstrated in Section 6. We can extend the model to nonstationary time series, which exhibit dynamic partial autocorrelation, by using a dynamic version of exponential smoothing.

4 Dynamic α-RNNs
The extension of α-RNNs to dynamical time series models, suitable for nonstationary time series data, relies on dynamic exponential smoothing: a time-dependent, convex combination of the smoothed output, h̃_{t−1}, and the hidden state, h_t:

(14) h̃_t = α_t h_t + (1 − α_t) h̃_{t−1},

where α_t denotes the dynamic smoothing factor, which can be equivalently written in the one-step-ahead forecast form

(15) h̃_t = h̃_{t−1} + α_t (h_t − h̃_{t−1}).

Hence the smoothing can be viewed as a form of dynamic forecast error correction. When α_t = 0, the forecast error is ignored and the smoothing merely repeats the previous smoothed hidden state, h̃_{t−1}, to the effect of the model ignoring the current hidden state. When α_t = 1, the forecast error overwrites the previous smoothed hidden state, h̃_t = h_t. The smoothing can also be viewed as a weighted sum of the lagged hidden states, with lower or equal weights at the more distant lagged hidden states:

h̃_t = α_t h_t + (1 − α_t) α_{t−1} h_{t−1} + (1 − α_t)(1 − α_{t−1}) α_{t−2} h_{t−2} + … ,

where the weights are products of the smoothing factors. Note that for any α_s = 1, s ≤ t, the smoothed hidden state h̃_t will have no dependency on all lagged hidden states h_r, r < s. The model simply forgets the hidden states at or beyond that lag.
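The error-correction form of Equation 15 can be sketched directly (pure Python; the function name is ours):

```python
def dynamic_exp_smooth(h_seq, alpha_seq):
    """Dynamic exponential smoothing:
    h_tilde_t = h_tilde_{t-1} + alpha_t * (h_t - h_tilde_{t-1}),
    i.e. a time-varying forecast-error correction. The first element
    of h_seq is used as the starting condition."""
    h_tilde = h_seq[0]
    out = [h_tilde]
    for h, a in zip(h_seq[1:], alpha_seq[1:]):
        h_tilde = h_tilde + a * (h - h_tilde)
        out.append(h_tilde)
    return out
```

With α_t ≡ 1 the smoother reproduces the hidden states exactly (no memory beyond the current state); with α_t ≡ 0 it freezes at the starting value and ignores all new hidden states.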
4.1 Neural network exponential smoothing
While the class of α-RNN models under consideration is free to define how α_t is updated (including changing the frequency of the update) based on the hidden state and input, a convenient choice is to use a recurrent layer. Returning again to the more general setup with a hidden state vector h_t ∈ ℝ^H, let us model the smoothing parameter α_t to give a filtered time series h̃_t:

(16) h̃_t = α_t ∘ h_t + (1 − α_t) ∘ h̃_{t−1},

where ∘ denotes the Hadamard (elementwise) product between vectors. This smoothing is a vectorized form of the above classical setting, only here we note that when (α_t)_i = 1, the i-th component of the hidden variable is unmodified and the past filtered hidden variable is forgotten. On the other hand, when (α_t)_i = 0, the i-th component of the hidden variable is obsolete, instead setting the current filtered hidden variable to its past value. The smoothing in Equation 16 can then be viewed as updating long-term memory, maintaining a smoothed hidden state variable as the memory through a convex combination of the current hidden variable and the previous smoothed hidden variable.
The hidden variable is given by the semi-affine transformation:

(17) h_t = σ(U_h h̃_{t−1} + W_h x_t + b_h),

which in turn depends on the previous smoothed hidden variable. Substituting Equation 17 into Equation 16 gives a function of h̃_{t−1} and x_t:

(18) h̃_t = α_t ∘ h_t + (1 − α_t) ∘ h̃_{t−1}

(19)     = α_t ∘ σ(U_h h̃_{t−1} + W_h x_t + b_h) + (1 − α_t) ∘ h̃_{t−1}.

We see that when (α_t)_i = 0, the i-th component of the smoothed hidden variable is not updated by the input x_t. Conversely, when (α_t)_i = 1, we observe that the i-th component of the hidden variable locally behaves like a nonlinear autoregressive series. Thus the smoothing parameter α_t can be viewed as the sensitivity of the smoothed hidden state to the input x_t.
The challenge becomes how to determine dynamically how much error correction is needed. As in GRUs and LSTMs, we can address this problem by learning α_t from the input variables with a recurrent layer parameterized by weights W_α, U_α and biases b_α. The one-step-ahead forecast of the smoothed hidden state, h̃_t, is then the filtered output of another plain RNN with weights (W_h, U_h) and biases b_h.
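One step of such a cell can be sketched as follows (NumPy; the gate parameterization mirrors Equations 16-17, but the weight names W_a, U_a, b_a and the function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alpha_t_rnn_step(x, h_tilde, params):
    """One step of an alpha_t-RNN cell: a plain recurrent layer
    proposes a candidate hidden state, a second recurrent layer
    outputs the dynamic smoothing vector alpha_t in (0, 1), and the
    new smoothed state is their componentwise convex combination."""
    W_h, U_h, b_h, W_a, U_a, b_a = params
    h = np.tanh(W_h @ x + U_h @ h_tilde + b_h)         # candidate hidden state
    alpha_t = sigmoid(W_a @ x + U_a @ h_tilde + b_a)   # dynamic smoothing
    return alpha_t * h + (1.0 - alpha_t) * h_tilde     # Hadamard convex combination
```

Driving the α-layer bias strongly positive (α_t ≈ 1) recovers a plain RNN update; strongly negative (α_t ≈ 0) freezes the smoothed state.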
Comparison with α-RNNs

Figure 1(a) shows the response of a univariate α-RNN model when the input consists of two unit impulses and zeros otherwise. For simplicity, the sequence length is assumed to be three (i.e. the RNN has a memory of 3 lags), the biases are set to zero and all the weights are set to one (the weights have not been fitted here; we are merely observing the effect of smoothing on the hidden state for the simplest choice of parameter values). The RNN loses memory of the unit impulse after three lags, whereas the α-RNNs with smoothed hidden states maintain memory of the first unit impulse even when the second unit impulse arrives. The figure also shows an α_t-RNN model, which, although it appears insignificant in this example, allows the model to fit nonstationary time series data. This is because the time-dependent α_t results in dynamic partial autocorrelations. The response of α_t to shocks can be seen in Figure 1(b) (the value of α_t is initially set to one since no long-term memory is needed; hence the response to the second shock is more insightful).

See Section 5 for a discussion of the differences between the α-RNN and the GRU and LSTM.
5 Relationship to GRUs and LSTMs
5.1 GRUs
The α-RNN model has no means to entirely reset its memory and become a feedforward network (FFN). This is because the hidden variable's update equation always depends on the previous smoothed hidden state, unless the recurrence weights vanish. By adding a reset layer, we recover a GRU (writing the standard GRU update in our notation, with the update gate playing the role of α_t and σ_s denoting the sigmoid):

r_t = σ_s(U_r h̃_{t−1} + W_r x_t + b_r),
h_t = σ(U_h (r_t ∘ h̃_{t−1}) + W_h x_t + b_h),
h̃_t = α_t ∘ h_t + (1 − α_t) ∘ h̃_{t−1}.

When viewed as an extension of our α-RNN model, we observe that the effect of introducing a reset, or switch, r_t, is to forget the dependence of h_t on the smoothed hidden state. Effectively, we turn the update for h_t from a plain RNN into an FFN and entirely neglect the recurrence. The recurrence in the update of h_t is thus dynamic. It may appear that the combination of a reset and adaptive smoothing is redundant, but remember that α_t affects the level of error correction in the update of the smoothed hidden state, h̃_t, whereas r_t adjusts the level of recurrence in the unsmoothed hidden state h_t. Put differently, α_t by itself cannot disable the memory in the smoothed hidden state (internal memory), whereas in combination with r_t it can. More precisely, when r_t = 0 and α_t = 1, h̃_t = h_t, which is reset to the latest input, x_t, and the GRU is just an FFN. Also, when r_t = 1 and α_t = 1, a GRU acts like a plain RNN. Thus a GRU can be seen as a more general architecture which is capable of being an FFN or a plain RNN under certain parameter values.
These additional layers (or cells) enable a GRU to learn extremely complex long-term temporal dynamics that a vanilla RNN is not capable of. Lastly, we comment in passing that in the GRU, as in an α-RNN, there is a final feedforward layer to transform the (smoothed) hidden state to a response:

(20) ŷ_{t+m} = W_y h̃_t + b_y.
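Written out, the GRU step is the dynamically smoothed update plus a reset gate that can sever the recurrence in the candidate state (a NumPy sketch; the weight names are ours and the gates use the standard sigmoid parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_tilde, params):
    """One GRU step: reset gate r_t is applied to the previous smoothed
    state before the recurrence; the update gate alpha_t controls the
    convex combination with the previous smoothed state."""
    W_h, U_h, b_h, W_r, U_r, b_r, W_a, U_a, b_a = params
    r_t = sigmoid(W_r @ x + U_r @ h_tilde + b_r)             # reset gate
    h = np.tanh(W_h @ x + U_h @ (r_t * h_tilde) + b_h)       # candidate state
    alpha_t = sigmoid(W_a @ x + U_a @ h_tilde + b_a)         # update (smoothing) gate
    return alpha_t * h + (1.0 - alpha_t) * h_tilde
```

Driving r_t toward 0 and α_t toward 1 disables the recurrence entirely, so the cell behaves like a feedforward layer of x alone, illustrating the FFN limit discussed above.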
5.2 LSTMs
The α-RNN model, like the GRU, provides a mechanism for propagating a smoothed hidden state: a long-term memory which can be overridden, turning the network into a plain RNN (with short memory) or even a memoryless FFN. More complex models using hidden units with varying connections within the memory unit have been proposed in the engineering literature with empirical success (Hochreiter1997; Gers2001; 7926112). LSTMs are similar to GRUs but have a separate (cell) memory, c_t, in addition to a hidden state. LSTMs also do not require that the memory updates are a convex combination; hence they are more general than exponential smoothing. The mathematical description of LSTMs is rarely given in an intuitive form, but the model can be found in, for example, Hochreiter1997.
The cell memory is updated by the following expression involving a forget gate, α_t, an input gate, z_t, and a cell gate, ĉ_t:

(21) c_t = α_t ∘ c_{t−1} + z_t ∘ ĉ_t.

In the terminology of LSTMs, the triple (α_t, r_t, z_t) are respectively referred to as the forget gate, output gate, and input gate. Our change of terminology is deliberate and designed to provide more intuition and continuity with α-RNNs and the statistics literature. We note that in the special case when z_t = 1 − α_t we obtain a similar exponential smoothing expression to that used in our α-RNN. Beyond that, the role of the input gate appears superfluous and difficult to reason with using time series analysis.

When the forget gate α_t = 0, the cell memory depends solely on the cell gate update z_t ∘ ĉ_t. Through the term α_t ∘ c_{t−1}, the cell memory has long-term memory which is only forgotten beyond lag s if α_s = 0. Thus the cell memory has an adaptive autoregressive structure.
The extra "memory" h̃_t, treated as a hidden state and separate from the cell memory, is nothing more than a Hadamard product:

(22) h̃_t = r_t ∘ c_t,

which is reset if r_t = 0. If r_t = 1, then the cell memory directly determines the hidden state.

Thus the reset gate can entirely override the effect of the cell memory's autoregressive structure, without erasing it. In contrast, the α-RNN and the GRU have one memory, which serves as the hidden state, and it is directly affected by the reset gate.

The reset, forget, input and cell memory gates are updated by plain RNNs, all depending on the hidden state h̃_{t−1}.

Like the α-RNN, the LSTM can function as a short-memory, plain RNN: just set the forget gate α_t = 0 in Equation 21. However, the LSTM can also function as a coupling of FFNs: just set the recurrence weights in each gate to zero, so that there is no recurrence structure in any of the gates. For the avoidance of doubt, since the nomenclature doesn't suggest it, all models in this paper can model long- and short-term autoregressive memory. The α-RNN couples these memories through a smoothed hidden state variable. The LSTM separates out the long memory, stored in the cellular memory, but uses a copy of it, which may additionally be reset. Strictly speaking, the cellular memory has a long-short autoregressive memory structure, so it would be misleading in the context of time series analysis to strictly discern the two memories as long and short (as the nomenclature suggests). The latter can be thought of as a truncated version of the former.
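For concreteness, a textbook LSTM step in this spirit (a NumPy sketch; the gate roles follow the discussion above, the weight names are ours, and note that the standard LSTM also squashes the cell memory with tanh before the output gate, a detail Equation 22 elides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. The forget gate f plays the role of alpha_t and
    the output gate o acts like the reset r_t applied to a squashed
    copy of the cell memory. With i = 1 - f the cell update is exactly
    exponential smoothing of the candidate cell state."""
    (W_f, U_f, b_f, W_i, U_i, b_i,
     W_o, U_o, b_o, W_c, U_c, b_c) = params
    f = sigmoid(W_f @ x + U_f @ h_prev + b_f)       # forget gate
    i = sigmoid(W_i @ x + U_i @ h_prev + b_i)       # input gate
    o = sigmoid(W_o @ x + U_o @ h_prev + b_o)       # output/reset gate
    c_hat = np.tanh(W_c @ x + U_c @ h_prev + b_c)   # cell gate
    c = f * c_prev + i * c_hat                      # cell memory (Eq. 21)
    h = o * np.tanh(c)                              # reset applied to cell memory
    return h, c
```

Shutting the output gate (o ≈ 0) zeroes the hidden state without erasing the cell memory, illustrating how the reset overrides, but does not destroy, the autoregressive structure.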
6 Numerical Experiments
This section describes numerical experiments using simulated and real time series data to evaluate the various RNN models. All models are implemented in v1.15.0 of TensorFlow (abadi2016tensorflow). Time series cross-validation is performed using separate training, validation and test sets. To preserve the time structure of the data and avoid look-ahead bias, each set represents a contiguous sampling period, with the test set containing the most recent observations. To prepare the training, validation and testing sets for m-step-ahead prediction, we set the target variables (responses) to the observation y_{t+m} and use the p lags up to and including time t for each input sequence. This is repeated by incrementing t until the end of each set. In our experiments, each element in the input sequence is either a scalar or a vector and the target variables are scalar.

We use the SimpleRNN Keras method with the default settings to implement a fully connected RNN. Tanh activation functions are used for the hidden layer, with the number of units H and the L1 regularization penalty found by time series cross-validation with five folds. The Glorot and Bengio uniform method (Glorot2010) is used to initialize the non-recurrent weight matrices, and an orthogonal method is used to initialize the recurrence weights as a random orthogonal matrix. Keras's GRU method is implemented using version 1406.1078v3, which applies the reset gate to the hidden state before matrix multiplication. Similarly, the LSTM method in Keras is used. Tanh activation functions are used for the recurrence layer and sigmoid activation functions are used for all other gates. The AlphaRNN and AlphatRNN classes are implemented by the authors for use in Keras. Statefulness is always disabled.
Each architecture is trained for up to 2000 epochs with the Adam optimization algorithm, with default parameter values and a mini-batch size of 1000 drawn from the training set. Early stopping is used with a patience of 50 to 100 epochs and a small minimum delta. No randomization is used in the mini-batch sampling, in order to preserve the ordering of the data. To evaluate the forecasting accuracy, we set the forecast horizon to up to ten steps ahead, instead of the usual one-step-ahead forecasts often presented in the machine learning literature: longer forecasting horizons are often more relevant due to operational constraints in industry applications, and are more challenging when the data is nonstationary, since the network's fixed partial autocorrelation will not adequately capture the observed changing autoregressive structure. The numerical results are separated into two areas of interest: (i) properties of recurrent architectures on archetypal data properties such as seasonality and regime switching; and (ii) evaluation on real data to demonstrate their utility in industrial forecasting.

6.1 LLM Seasonality DGP
To characterize the ability of exponentially smoothed RNNs to directly capture seasonality from noisy data, without the need to separately deseasonalize the data, we generate hourly data from an additive local level model with daily seasonality and i.i.d. noise (harvey1990forecasting). We simulate observations under a range of noise variances; the first portion of the observations is used for training and the remainder for testing.

The data is nonstationary: we accept the Null hypothesis of the augmented Dickey-Fuller test, which states that the AR model contains a unit root (the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567). We choose the sequence length based on the PACF (not shown as the DGP is known).

Figure 3 compares the PACFs of various architectures on the test set. Figure 3a shows the PACF of the generated data: the positive lags at 24, 48, 72 and 96, due to the daily seasonality, are clearly shown. Figures 3b and 3c show the PACF of the plain RNN and the α-RNN, both stationary architectures; neither is able to capture the seasonality. On the other hand, the α_t-RNN, the GRU and the LSTM, shown in Figures 3(d-f), adequately capture the seasonality. This is an important observation, as typically deseasonalization techniques such as differencing or convolutional filtering are needed. For completeness, the cross-validated parameters, using five folds, are shown in Table 1.
Architecture  Parameters  L1  H

RNN  41  0  5
α-RNN  132  0  10
α_t-RNN  86  0  5
GRU  1,341  0  20
LSTM  1,781  0  20
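A data-generating process of this kind can be sketched as follows (NumPy; the sinusoidal seasonal pattern and all parameter values are illustrative assumptions, not the paper's exact local level specification):

```python
import numpy as np

def simulate_llm_seasonal(n, sigma_eps, sigma_level, period=24, seed=0):
    """Simulate an additive local level model with daily seasonality:
    y_t = mu_t + gamma_t + eps_t, where mu_t = mu_{t-1} + xi_t is a
    random-walk level and gamma_t is a fixed daily cycle (hourly data)."""
    rng = np.random.default_rng(seed)
    level = np.cumsum(rng.normal(0.0, sigma_level, n))   # random-walk level
    t = np.arange(n)
    seasonal = np.sin(2.0 * np.pi * t / period)          # daily seasonal component
    noise = rng.normal(0.0, sigma_eps, n)                # i.i.d. observation noise
    return level + seasonal + noise
```

The random-walk level is what makes the series fail the augmented Dickey-Fuller test (unit root), while the period-24 component produces the PACF spikes at lags 24, 48, 72 and 96 discussed above.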
6.2 Multi-Regime DGP
We evaluate the various time series forecasting models on a simulated univariate nonstationary dataset. Specifically, we assume the existence of a discrete latent variable on which a conditionally stationary random distribution exists. Such a model is dynamically representative of the types of noisy time series generated by many applications, such as electricity demand, motor neuron events from electroencephalography (EEG) data (10.3389/fnins.2018.00024) and channel detection in cognitive radio (7009404). Indeed, regimes are ubiquitous in industrial applications, such as recession periods in economic data and on-peak and off-peak periods in traffic and electricity demand. Observations of the univariate time series are generated from a two-regime AR(30) model, a linear model of the form given in Appendix D.

To test that the time series is indeed nonstationary, we accept the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level, which states that the AR model contains a unit root and is thus nonstationary (the critical values are 1%: -3.434, 5%: -2.863, and 10%: -2.568). We choose the sequence length based on the PACF (not shown as the DGP is known).
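A minimal two-regime autoregressive generator can be sketched as follows (NumPy; the AR(1) coefficients, regime lengths and noise level are illustrative assumptions, whereas the paper's two-regime AR(30) specification is given in its Appendix D):

```python
import numpy as np

def simulate_two_regime_ar(n, phi_low, phi_high, switch_every=500,
                           sigma=1.0, seed=0):
    """Simulate a univariate series that alternates between two
    conditionally stationary AR regimes every switch_every steps."""
    rng = np.random.default_rng(seed)
    p = max(len(phi_low), len(phi_high))
    y = np.zeros(n)
    for t in range(p, n):
        phi = phi_high if (t // switch_every) % 2 else phi_low
        lags = y[t - len(phi):t][::-1]        # [y_{t-1}, y_{t-2}, ...]
        y[t] = np.dot(phi, lags) + rng.normal(0.0, sigma)
    return y
```

The high-persistence regime has a much larger unconditional variance than the low-persistence regime, which is the archetypal behavior the forecasting networks must adapt to at each transition.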
Figure 12 compares the performance of the various forecasting networks. The ten-step-ahead forecasts are compared in ascending order of the number of hidden layers: plain RNNs, α-RNNs, α_t-RNNs, GRUs and LSTMs. The figure shows that stationary models such as the plain RNN and the α-RNN inadequately capture the regime change, with the peaks being severely underestimated. The latter effect is symptomatic of a model which has learned the autoregressive structure in the lower regime, to the detriment of the other. All methods produce significant forecast error at the regime transition. We further observe minor differences in the performance of the GRU versus the α_t-RNN model, suggesting that the reset gate provides marginal effect for this dataset. We also observe no additional benefit in maintaining an additional cell memory. See Section 5 for further discussion on the cellular memory.
Architecture  Parameters  L1  H  MSE (test)

RNN  461  0.001  20  (0.0649)
α-RNN  462  0.001  20  (0.0636)
α_t-RNN  941  0  20  (0.0360)
GRU  371  0  10  (0.0347)
LSTM  146  0.001  5  (0.0288)
Table 2 compares the average MSEs and their standard deviations for each ten-step-ahead forecasting model, using time series cross-validation with ten folds. To assess overfitting, the MSEs over the training periods are also provided for comparison of in- and out-of-distribution model performance. The stationary models are observed to exhibit the worst MSEs: the plain RNN and the α-RNN are similar in performance, although the latter alleviates the vanishing gradient problem. The α_t-RNN and the GRU are comparable in performance, suggesting that the effect of the reset gate in the latter is marginal.

6.3 Short-term climate forecasting
The Jena climate modeling dataset was recorded at the weather station operated by the Max Planck Institute for Biogeochemistry in Jena, Germany. 14 different quantities (such as air temperature, atmospheric pressure, humidity and wind direction) were recorded every 10 minutes over several years. The dataset contains 420,551 observations covering the period 2009-2016 (jena). We demonstrate how the different networks forecast the temperature using all lagged observed variables. Each covariate in the training and test sets is normalized using the moments of the training data only, so as to avoid look-ahead bias or introducing a bias into the test data.
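The train-only standardization described above can be sketched as follows; `normalize_train_test` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def normalize_train_test(train, test):
    """Standardize both splits using moments of the training data only,
    so that no information from the test period leaks into preprocessing."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    sigma = np.where(sigma == 0.0, 1.0, sigma)  # guard against constant columns
    return (train - mu) / sigma, (test - mu) / sigma
```

Note that the test split generally has nonzero mean and non-unit variance after this transform; that is expected, and is the price of avoiding look-ahead bias.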
We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level for each covariate, in favor of the alternative hypothesis that the data is stationary (contains no unit roots). The largest test statistic is and the p-value is (the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567). However, we observe some evidence of cyclical memory in some of the covariates, as seen in Figure 5.
We choose a sequence length of based on the PACF and perform a ten-step ahead forecast. Figure 6 compares the performance of the various forecasting networks and shows that stationary models such as the plain RNN and the α-RNN adequately capture the temperature dynamics, even when the forecasting date moves further from the training period. This is expected because the partial autocorrelation is stationary and there is hence no secular drift in the error.
| Architecture | Parameters | λ | H | MSE (test) |
| RNN | 261 | 0 | 10 | (0.176) |
| α-RNN | 107 | 0.001 | 5 | (0.217) |
| αt-RNN | 216 | 0 | 5 | (0.190) |
| GRU | 761 | 0.001 | 10 | (0.205) |
| LSTM | 406 | 0 | 5 | (0.218) |
Viewing the results of time series cross-validation, using the first 23,971 observations, in Table 3, we observe minor differences in the performance of the GRU versus the αt-RNN, suggesting that the reset gate provides no benefit for this dataset. In this case, we also observe no additional benefit from the LSTM.
6.4 Electricity consumption
Observations of hourly power system loads are provided over the period from January 1st, 2008 to June 20th, 2011 for the DK2 bidding zone, collected in the open power systems time series data (dk2). The consumption time series is chosen as it exhibits both short-term cyclical patterns and longer-term trends. However, while these cyclical patterns correspond to peak and off-peak consumption periods, the transitions are gradual. Without attempting to deseasonalize or detrend the data, we apply the various architectures as in the previous experiment.
We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level in favor of the alternative hypothesis that the data is stationary (contains no unit roots). The test statistic is and the p-value is (the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567).
The PACF in Figure 7 is observed to exhibit seasonality. We choose a sequence length of and perform a ten-step ahead forecast to highlight the limitations of not including higher order lags.
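The multi-step forecasts used throughout these experiments can be produced by iterating a one-step-ahead model and feeding each prediction back into the input window. A generic sketch, with a hypothetical `one_step` callable standing in for any trained network, is:

```python
import numpy as np

def iterate_forecast(one_step, history, horizon):
    """Multi-step forecast by iterating a one-step-ahead model:
    each prediction is appended to the input window and fed back in.
    `one_step` maps a length-p window to the next value; `history`
    holds the last p observed values."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        y = one_step(np.asarray(window))
        preds.append(y)
        window = window[1:] + [y]   # slide the window forward one step
    return np.array(preds)
```

A limitation illustrated by the electricity experiment applies here too: if the window length p omits significant high-order (e.g. seasonal) lags, the iterated forecasts cannot recover those dynamics.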
Figure 8 compares the performance of the various networks and shows that the plain RNN performs poorly, whereas the α-RNN and the αt-RNN better capture the load dynamics. From Table 4, we further observe relatively minor differences in the performance of the GRU versus the αt-RNN, suggesting that the reset gate provides no benefit. We also observe no additional benefit from the LSTM.
| Architecture | Parameters | λ | H | MSE (test) |
| RNN | 461 | 0 | 20 | 0.1062 (0.326) |
| α-RNN | 462 | 0 | 20 | (0.258) |
| αt-RNN | 941 | 0 | 20 | (0.179) |
| GRU | 1,341 | 0 | 20 | (0.218) |
| LSTM | 1,781 | 0 | 20 | (0.207) |
6.5 Bitcoin forecasting
One-minute snapshots of USD-denominated Bitcoin mid-prices are captured from Coinbase over the period from January 1st to November 10th, 2018. We demonstrate how the different networks forecast Bitcoin prices using lagged observations of prices. The predictor in the training and test sets is normalized using the moments of the training data only, so as to avoid look-ahead bias or introducing a bias into the test data. We fail to reject the Null hypothesis of the augmented Dickey-Fuller test at even the 90% confidence level. The data is therefore nonstationary (contains a unit root). The test statistic is and the p-value is (the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567). While the partial autocovariance structure is expected to be time dependent, we observe a short memory of only four lags by estimating the PACF over the entire history (see Figure 9).
We choose a sequence length of based on the PACF and perform a four-step ahead forecast. We comment in passing that there is little, if any, merit in forecasting beyond this horizon, given the largest significant lag indicated by the PACF. Figure 10 compares the performance of the various forecasting networks and shows that stationary models such as the plain RNN and the α-RNN are least able to capture the price dynamics. This is expected because the partial autocorrelation is nonstationary.
| Architecture | Parameters | λ | H | MSE (test) |
| RNN | 461 | 0 | 20 | |
| α-RNN | 132 | 0 | 10 | |
| αt-RNN | 86 | 0 | 5 | |
| GRU | 371 | 0 | 10 | |
| LSTM | 491 | 0 | 10 | |
Viewing the results of time series cross-validation, using the first 30,000 observations, in Table 5, we observe minor differences in the performance of the LSTM and GRU versus the αt-RNN, suggesting that the reset gate and the extra cellular memory in the LSTM provide negligible benefit for this dataset. In this case, we observe very marginal additional benefit from the LSTM, yet the complexity of the latter is approximately 50x that of the RNN.
6.6 High Frequency Trading Data
Our dataset consists of observations of tick-by-tick Volume Weighted Average Prices (VWAPs) of CME-listed ESU6 level II data over the month of August 2016 (doi:10.1002/asmb.2399; Dixon2017b).
We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level in favor of the alternative hypothesis that the data is stationary (contains no unit roots). The test statistic is and the p-value is (the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567).
The PACF in Figure 11 is observed to exhibit a cutoff at approximately 23 lags. We therefore choose a sequence length of and perform a ten-step ahead forecast. Note that the timestamps of the tick data are not uniform, so a step here refers to a tick.
Figure 12 compares the performance of the various networks and shows that the plain RNN performs poorly, whereas the α-RNN and the αt-RNN better capture the VWAP dynamics. From Table 6, we further observe relatively minor differences in the performance of the GRU versus the αt-RNN, again suggesting that the reset gate and the extra cellular memory in the LSTM provide no benefit. In this case, we find that the GRU is approximately 10x the complexity of the α-RNN, with very marginal benefit.
| Architecture | Parameters | λ | H | MSE (test) |
| RNN | 41 | 0 | 5 | |
| α-RNN | 132 | 0 | 10 | |
| αt-RNN | 86 | 0 | 5 | |
| GRU | 1,341 | 0 | 20 | |
| LSTM | 491 | 0 | 10 | |
7 Conclusion
Industrial forecasting has entered an era of unprecedented growth in the size and complexity of data, which requires new modeling methodologies. This paper presented a general class of exponentially smoothed recurrent neural networks (α-RNNs) which are well suited to modeling nonstationary dynamical systems arising in industrial applications such as electricity load management and financial risk and trading. In particular, we demonstrated how they characterize the nonlinear partial autocorrelation structure of time series and directly capture dynamic effects such as seasonality and regime changes. Application of exponentially smoothed RNNs to electricity load forecasting, weather data and financial time series, such as minute-level Bitcoin prices and CME futures tick data, highlighted the efficacy of exponential smoothing for multi-step time series forecasting. In all of these examples, we showed that exponentially smoothed RNNs are well suited to forecasting, being much simpler than more complex architectures such as GRUs and LSTMs, yet retaining the most important aspects needed for forecasting nonstationary series. These methods scale to large numbers of covariates and complex data. The experimental design and architectural parameters, such as the predictive horizon and model parameters, can be determined by simple statistical tests and diagnostics, without the need for extensive parameter optimization. Moreover, unlike traditional time series methods such as ARIMA models, these methods are shown to be unconditionally stable without the need to preprocess the data.
References
Appendix A Time Series Modeling Definitions
Definition A.1 (Time series).
A time series is one observation of a stochastic process, over a specific interval: $\{x_t\}_{t=1}^{T}$.
Definition A.2 (Autocovariance).
The autocovariance of a time series $\{x_t\}$ is $\gamma_t(h) = \mathrm{Cov}(x_t, x_{t-h}) = \mathbb{E}[(x_t - \mu_t)(x_{t-h} - \mu_{t-h})]$, where $\mu_t = \mathbb{E}[x_t]$.
Definition A.3 (Covariance (weak) stationarity).
A time series $\{x_t\}$ is weakly (or wide-sense) covariance stationary if it has a time-constant mean and autocovariances of all orders:
$$\mathbb{E}[x_t] = \mu \quad \text{and} \quad \mathrm{Cov}(x_t, x_{t-h}) = \gamma(h), \quad \forall t, h.$$
As we have seen, this implies that $\gamma_t(h) = \gamma(h)$: the autocovariances depend only on the interval between observations, but not the time of the observations.
Definition A.4 (Autocorrelation).
The autocorrelation, $\tau(h)$, is just the autocovariance divided by the variance:
$$\tau(h) = \frac{\gamma(h)}{\gamma(0)}. \quad (23)$$
Definition A.5 (White noise).
White noise, $\epsilon_t$, is i.i.d. error which satisfies all three conditions:
1. $\mathbb{E}[\epsilon_t] = 0$, $\forall t$;
2. $\mathrm{Var}(\epsilon_t) = \sigma^2$, $\forall t$; and
3. $\epsilon_t$ and $\epsilon_s$ are independent, $t \neq s$.
Gaussian white noise just adds a normality assumption to the error. White noise error is often referred to as a “disturbance”, “shock” or “innovation” in the time series literature.
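These definitions can be checked numerically. The sketch below estimates the sample autocorrelation of Definition A.4 for simulated Gaussian white noise; the sample size and seed are illustrative.

```python
import numpy as np

def sample_autocorr(x, h):
    """Sample autocorrelation tau(h) = gamma(h) / gamma(0) (Definition A.4),
    using the biased sample autocovariance estimator."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    gamma0 = np.dot(x, x) / len(x)
    gammah = np.dot(x[:len(x) - h], x[h:]) / len(x)
    return gammah / gamma0

rng = np.random.default_rng(3)
eps = rng.standard_normal(20_000)                       # Gaussian white noise
rho = [sample_autocorr(eps, h) for h in range(1, 11)]   # all near zero
```

For white noise every nonzero-lag autocorrelation is zero in expectation, so the estimates fluctuate within roughly $\pm 1.96/\sqrt{n}$ of zero.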
Appendix B Proof of Partial Autocovariance Property of RNNs
Proof.
Let’s first consider a RNN(1) process, i.e. $p=1$. The lag-1 partial autocovariance is
$$\tilde{\gamma}(1) = \gamma(1) = \mathrm{Cov}(x_{t+1}, x_t) \quad (24)$$
and using the RNN(1) model with, for simplicity, a single recurrence weight, $\phi$:
$$x_{t+1} = \sigma(\phi x_t) + \epsilon_{t+1} \quad (25)$$
gives
$$\gamma(1) = \mathrm{Cov}(\sigma(\phi x_t) + \epsilon_{t+1}, x_t) = \mathrm{Cov}(\sigma(\phi x_t), x_t), \quad (26)$$
where we have assumed $\mathrm{Cov}(\epsilon_{t+1}, x_t) = 0$ in the second part of the expression.
Continuing with the lag-2 autocovariance gives:
$$\gamma(2) = \mathrm{Cov}(x_{t+2}, x_t) \quad (27)$$
and $x_{t+2}$ is approximated by the RNN(1):
$$x_{t+2} = \sigma(\phi x_{t+1}) + \epsilon_{t+2} \quad (28)$$
and $x_t$ is approximated by the backward RNN(1):
$$x_t = \sigma(\phi' x_{t+1}) + \epsilon'_t \quad (29)$$
so that we see, crucially, that $x_t$ depends on $x_{t+1}$ but not on $x_{t+2}$. Substituting the backward RNN(1) and Equation 28 into Equation 27 gives
$$\gamma(2) = \mathrm{Cov}(\sigma(\phi x_{t+1}) + \epsilon_{t+2}, \sigma(\phi' x_{t+1}) + \epsilon'_t) = \mathrm{Cov}(\sigma(\phi x_{t+1}), \sigma(\phi' x_{t+1})) \quad (30)$$
and hence $\gamma(2)$ depends on $x_{t+1}$ only. Thus we have that $\tilde{\gamma}(2) = 0$.
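The vanishing lag-2 partial autocovariance of an RNN(1) process can be sanity-checked by simulation. Below, a nonlinear autoregressive series of order one, a stand-in for the RNN(1) data-generating process with a tanh activation and a single recurrence weight, is generated, and the lag-2 partial correlation is estimated by linearly regressing out the intermediate value; this linear projection is only a proxy for the conditioning in the proof, and all parameters are illustrative.

```python
import numpy as np

def lag2_partial_corr(x):
    """Sample correlation of x_{t+2} and x_t after linearly regressing
    out the intermediate value x_{t+1} from both."""
    x0, x1, x2 = x[:-2], x[1:-1], x[2:]
    x0c, x1c, x2c = x0 - x0.mean(), x1 - x1.mean(), x2 - x2.mean()
    r2 = x2c - (x2c @ x1c) / (x1c @ x1c) * x1c  # residual of x_{t+2} on x_{t+1}
    r0 = x0c - (x0c @ x1c) / (x1c @ x1c) * x1c  # residual of x_t on x_{t+1}
    return float(np.corrcoef(r2, r0)[0, 1])

# simulate a nonlinear AR process of order one: the next value
# depends on the previous value only, plus white noise
rng = np.random.default_rng(4)
n, phi = 20_000, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = np.tanh(phi * x[t - 1]) + 0.1 * rng.standard_normal()
```

The lag-1 correlation is large while the lag-2 partial correlation is statistically indistinguishable from zero, consistent with the cutoff property proved above.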
Now suppose $p=2$. Repeating the above, we have the backward RNN:
$$x_t = \sigma(\phi'_1 x_{t+1} + \phi'_2 x_{t+2}) + \epsilon'_t \quad (31)$$
and we see that, by virtue of the dependency of $x_t$ on $x_{t+2}$, and hence $\epsilon_{t+2}$, that the lag-2 autocovariance is no longer zero.
Now consider the lag-2 partial autocovariance of the RNN(2) process, again with a single recurrence weight per lag. Using the backward RNN(2) model:
$$x_t = \sigma(\phi'_1 x_{t+1} + \phi'_2 x_{t+2}) + \epsilon'_t, \quad (32)$$
which depends on $x_{t+2}$, and hence the lag-2 partial autocovariance:
$$\tilde{\gamma}(2) = \mathrm{Cov}(x_{t+2}, x_t \mid x_{t+1}) \quad (33)$$
is not zero.
is not zero. It follows by induction that lags partial autocorrelations
(34) 
since