Industrial Forecasting with Exponentially Smoothed Recurrent Neural Networks

04/09/2020 ∙ by Matthew F. Dixon, et al. ∙ Illinois Institute of Technology

Industrial forecasting has entered an era of unprecedented growth in the size and complexity of data which require new modeling methodologies. While many new general purpose machine learning approaches have emerged, they remain poorly understood and difficult to reconcile with more traditional statistical modeling approaches. We present a general class of exponentially smoothed recurrent neural networks (RNNs) which are well suited to modeling non-stationary dynamical systems arising in industrial applications such as electricity load management and financial risk and trading. In particular, we analyze their capacity to characterize the non-linear partial autocorrelation structure of time series and directly capture dynamic effects such as seasonality and regime changes. Applications of exponentially smoothed RNNs to electricity load forecasting, weather data and financial time series, such as minute-level Bitcoin prices and CME futures tick data, highlight the efficacy of exponential smoothing for multi-step time series forecasting. The results also suggest that popular, but more complicated neural network architectures originally designed for speech processing, such as LSTMs and GRUs, are likely over-engineered for industrial forecasting: light-weight exponentially smoothed architectures capture the salient features while being superior to, and more robust than, simple RNNs.

1 Background

Recurrent neural networks (RNNs) are the building blocks of modern sequential learning. RNNs use recurrent layers to capture non-linear temporal dependencies with a relatively small number of parameters (Graves2013). They learn temporal dynamics by mapping an input sequence to a hidden state sequence and outputs, via a recurrent layer and a feedforward layer. However, despite the success of RNNs and their successors, there appears to be a chasm between the statistical modeling literature (see e.g. boxjen76; UBMA_276336909; GVK126800421) and the machine learning literature (see e.g. bayer; pascanu2012difficulty; Hochreiter1997; schmidhuber1997long).

While there have been many recent theoretical developments in recurrent neural networks from a dynamical systems perspective (lim2019predicting; chen2019symplectic; niu2019recurrent), there are still open fundamental questions as to how the type and properties of the time series data inform the choice of architecture. For example, mcdermott2018deep find empirical evidence that Echo State Networks are well suited for spatio-temporal modeling, and Uber (SMYL202075) won the M4 forecasting competition with a hybrid exponential smoothing-RNN method, which exponentiates the output of a neural network and combines it with a past observed time series level. There have been exhaustive empirical studies on the application of recurrent neural networks to prediction from financial time series data such as historical limit order book and price history (Dixon2017b; doi:10.1002/for.2585; borovykh2017conditional; doi:10.1080/14697688.2019.1622295; doi:10.1080/14697688.2019.1622287; doi:10.1080/14697688.2019.1634277). doi:10.1080/14697688.2019.1622295 find evidence that stacking networks leads to superior performance on intra-day stock data combined with technical indicators, whereas 10.1371/journal.pone.0180944 combine wavelet transforms and stacked autoencoders with LSTMs on OHLC bars and technical indicators. borovykh2017conditional find evidence that dilated convolutional networks outperform LSTMs on various indices, and Dixon2017b demonstrate that RNNs outperform feed-forward networks with lagged features on limit order book data.

These open questions about how the type and properties of the time series data inform the choice of architecture motivate one of the main contributions of this paper: a new class of RNNs, with supporting theoretical justification for the architectural design choices.

A second challenge with applying neural networks to time series data is the over-reliance on extensive hyper-parameter tuning, a computationally complex global optimization problem whose outcome depends on the initial parameter choices. Moreover, little is known about how often the hyper-parameters should be re-tuned, and the length of historical data used to train the model can itself affect the tuning.

A third challenge is how to combine more traditional and informative diagnostics of time series data, such as stationarity and memory cut-off, into the model selection procedure. Such an approach is standard in the time series modeling literature (boxjen76) but absent in the machine learning literature. Finally, most of the aforementioned studies are engineering oriented and do not provide substantive insight to justify why the architectural choice is well suited to the dataset.

One of the main contributions of this paper is to cast RNNs into a time series modeling framework and rely on statistical diagnostics, in combination with cross-validation, to tune the architecture. The statistical tests characterize stationarity and memory cut-off length and provide insight into whether the data are suitable for longer-term forecasting and whether the model must be non-stationary.

It is well known that plain RNNs have difficulty in learning long-term dynamics, due in part to the vanishing and exploding gradients that can result from back-propagating the gradients down through the many unfolded layers of the network (bayer; pascanu2012difficulty). A particular type of RNN, the Long Short-Term Memory (LSTM) (Hochreiter1997; schmidhuber1997long), was proposed to address this issue of vanishing or exploding gradients, which essentially arises due to the shape of the activation function. The memory unit used in an LSTM allows the network to learn which previous states can be forgotten and alleviates the vanishing gradient. Partly for this reason, LSTMs have demonstrated much empirical success in the literature (Gers2001; 7926112).

The inclusion of forget, reset and update gates in the LSTM, and in a slightly simpler variant, the Gated Recurrent Unit (GRU) (DBLP:journals/corr/ChungGCB14), provides a switching mechanism to forget memory while simultaneously updating a hidden state vector. These units do not, however, provide an easily accessible mathematical structure from which to study their time series modeling properties. As such, there is much opaqueness about the types of architectures which are best suited to prediction tasks based on data properties such as wide-sense non-stationarity (see Appendix A). However, we shall show that exponential smoothing not only alleviates the gradient problem but also characterizes the time series modeling properties of these architectures.

The main outcome of applying this approach is that it partially identifies the choice of architecture based on both its time series properties and those of the data. The approach is general and easily extensible to GRUs and LSTMs with the inclusion of the reset gate and cell memory. In this paper, the input sequence is assumed to be of fixed length, although the methodology could be extended to variable length as in sequence-to-sequence learning models (NIPS2014_5346). The main contributions of this paper are:

  • We show how plain RNNs, with short-term memory, can be generalized to exhibit long-term memory with a smoothing scalar α — smoothing also helps offset the infamous vanishing gradient problem in plain RNNs with only one extra parameter;

  • We show how time series analysis guides the architectural parameters — the sequence length of the RNN can be determined from the estimated partial autocorrelogram, and tanh activation of the recurrent layer is needed for stability; and

  • We demonstrate how a dynamic α_t-RNN model for non-stationary data (Dixon2021), a lightweight version of GRUs and LSTMs, has the ability to model complex non-stationary time series data with comparable performance.

The remainder of this paper is outlined as follows. Section 2 introduces the α-RNN and Section 3 applies time series analysis to guide the architectural properties. Section 4 introduces a dynamic version of the model and illustrates the dynamical behavior of α_t. Section 5 relates the model to GRUs and LSTMs. Details of the training, implementation and experiments, together with the results, are presented in Section 6. Finally, Section 7 concludes with directions for future research.

2 α-RNNs

Given auto-correlated observations of covariates or predictors, $x_t$, and responses, $y_t$, at times $t = 1, \dots, N$ in the time series data $\mathcal{D} = \{(x_t, y_t)\}_{t=1}^{N}$, our goal is to construct an m-step ($m \geq 1$) ahead time series predictor, $\hat{y}_{t+m}$, of an observed target, $y_{t+m}$, from a p-length input sequence $\mathbf{x}_t = \{x_{t-p+1}, \dots, x_t\}$, where $x_{t-j}$ is a lagged observation of $x_t$, for $j = 0, \dots, p-1$, and $\epsilon_{t+m}$ is the homoscedastic model error at time $t+m$. We introduce the α-RNN:

$$y_{t+m} = F_{W,b,\alpha}(\mathbf{x}_t) + \epsilon_{t+m}, \qquad (1)$$

where $F_{W,b,\alpha}$ is an exponentially smoothed RNN with weight matrices $W = (W_h, U_h, W_y)$, bias vectors $b = (b_h, b_y)$ and smoothing parameter $\alpha$.

For each time step $t_j$, $j = 1, \dots, p$, in the sequence, the forward pass separately updates a hidden internal state $h_{t_j}$ and its exponentially smoothed counterpart $\tilde{h}_{t_j}$, using the recurrence relations:

$$h_{t_j} = \sigma(W_h x_{t_j} + U_h \tilde{h}_{t_{j-1}} + b_h), \qquad \tilde{h}_{t_j} = \alpha h_{t_j} + (1 - \alpha)\tilde{h}_{t_{j-1}},$$

where the input weight matrix $W_h \in \mathbb{R}^{H \times d}$ and the recurrence weight matrix $U_h \in \mathbb{R}^{H \times H}$; $H$ is the number of hidden units and $d$ is the dimensionality of the input space. $\tilde{h}_{t_j}$ is an exponentially smoothed version of the hidden state $h_{t_j}$. When the output is continuous, the output from the final hidden state is given by:

$$\hat{y}_{t+m} = W_y \tilde{h}_{t_p} + b_y, \qquad (2)$$

with the starting condition in each sequence, $\tilde{h}_{t_1} = h_{t_1}$.
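To make the recursion concrete, the following is a minimal NumPy sketch of the forward pass in Equations (1)-(2) — not the authors' implementation — assuming a tanh activation (anticipating the stability discussion in Section 3) and the weight shapes given above.

```python
import numpy as np

def alpha_rnn_forward(x_seq, W_h, U_h, b_h, W_y, b_y, alpha):
    """Forward pass of a static alpha-RNN over one p-length input sequence.

    x_seq : (p, d) array, the input sequence x_{t-p+1}, ..., x_t
    W_h   : (H, d) input weights;   U_h : (H, H) recurrence weights
    W_y   : (1, H) output weights;  alpha : scalar smoothing parameter in (0, 1]
    """
    h_tilde = np.tanh(W_h @ x_seq[0] + b_h)             # starting condition: h_tilde_1 = h_1
    for x in x_seq[1:]:
        h = np.tanh(W_h @ x + U_h @ h_tilde + b_h)      # hidden state update
        h_tilde = alpha * h + (1.0 - alpha) * h_tilde   # exponential smoothing
    return W_y @ h_tilde + b_y                          # m-step ahead point forecast (Eq. 2)
```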

3 Time Series Modeling

This section bridges the time series modeling literature (boxjen76; UBMA_276336909; doi:10.1111/jtsa.12464) and the machine learning literature. We shall assume here, for ease of exposition, that the time series data is univariate ($d = 1$) and that $x_t = y_t$, so that the predictor is endogenous (the sequence of features is drawn from the same time series as the predictor).

Since autoregressive, AR(p), models are well known in time series modeling, we find it instructive to show that plain RNNs are non-linear AR(p) models. For ease of exposition, consider the simplest case of a RNN with one hidden unit, $H = 1$. Without loss of generality, we set $W_y = 1$ and $b_h = b_y = 0$, and write $w := W_h$ and $\phi := U_h$ for the scalar input and recurrence weights. Then we can show by backward substitution that a plain RNN with sequence length $p$ is a non-linear auto-regressive model of order $p$:

$$\hat{y}_{t+1} = \sigma\big(w\, y_t + \phi\, \sigma(w\, y_{t-1} + \phi\, \sigma(w\, y_{t-2} + \dots))\big). \qquad (3)$$

When the activation is the identity function, $\sigma(x) = x$, we recover the AR(p) model

$$\hat{y}_{t+1} = w \sum_{j=0}^{p-1} \phi^{j}\, y_{t-j}, \qquad (4)$$

with geometrically decaying autoregressive coefficients when $|\phi| < 1$.

α-RNNs

The α-RNN(p) is almost identical to a plain RNN(p), but with an additional scalar smoothing parameter, α, which provides the recurrent network with "long memory" (long memory here refers to autoregressive memory beyond the sequence length, sometimes also referred to as "stateful"; for the avoidance of doubt, we are not suggesting that the α-RNN has an additional cellular memory, as in LSTMs). To see this, let us consider a one-step ahead univariate α-RNN(p) in which the smoothing parameter is fixed. For each time step $t_j$:

$$h_{t_j} = \sigma(w\, y_{t_j} + \phi\, \tilde{h}_{t_{j-1}}), \qquad \tilde{h}_{t_j} = \alpha h_{t_j} + (1 - \alpha)\tilde{h}_{t_{j-1}}.$$

This model augments the plain RNN by replacing $h_{t_{j-1}}$ in the hidden layer with an exponentially smoothed hidden state $\tilde{h}_{t_{j-1}}$. The effect of the smoothing is to provide infinite memory when $0 < \alpha < 1$. For the special case when $\alpha = 1$, we recover the plain RNN with short memory of length $p$.

We can easily see this informally by simplifying the parameterization and considering the unactivated case. Setting $\phi = w = 1$ and $b = 0$:

$$\hat{y}_{t+1} = h_t, \qquad (5)$$
$$h_t = y_t + \tilde{h}_{t-1}, \qquad (6)$$
$$\tilde{h}_{t-1} = \alpha h_{t-1} + (1 - \alpha)\tilde{h}_{t-2}, \qquad (7)$$

with the starting condition in each sequence, $\tilde{h}_{t_1} = h_{t_1}$. Without loss of generality, consider $p = 2$ lags in the model so that $h_{t-1} = y_{t-1}$. Then

$$\hat{y}_{t+1} = y_t + \alpha y_{t-1} + (1 - \alpha)\tilde{h}_{t-2}, \qquad (8)$$

and the model can be written in the simpler form

$$\hat{y}_{t+1} = \phi_1 y_t + \phi_2 y_{t-1} + (1 - \alpha)\tilde{h}_{t-2}, \qquad (9)$$

with auto-regressive weights $\phi_1 = 1$ and $\phi_2 = \alpha$. We now see that there is a third term on the RHS of Eq. 9 which vanishes when $\alpha = 1$ but provides infinite memory to the model, since $\tilde{h}_{t-2}$ depends on $y_1$, the first observation in the whole time series, not just the first observation in the sequence. To see this, we unroll the recursion relation in the exponential smoother:

$$\tilde{h}_{t-2} = \alpha \sum_{s=0}^{t-4} (1 - \alpha)^{s}\, h_{t-2-s} + (1 - \alpha)^{t-3}\, h_1, \qquad (10)$$

where we used the property that $\tilde{h}_1 = h_1$. It is often convenient to characterize exponential smoothing by the half-life — the number of lags needed for the smoothing weight $(1-\alpha)^{n}$ to decay to one half, which is $n = \ln(1/2)/\ln(1-\alpha)$. To gain further insight we use partial auto-correlations to characterize the memory of the model.
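As a quick aid, the half-life can be computed directly from a fitted α; the helper below is a minimal sketch assuming the convention that the half-life n solves (1-α)^n = 1/2.

```python
import math

def half_life(alpha: float) -> float:
    """Number of lags n at which the smoothing weight (1 - alpha)**n decays to 1/2.

    Assumes 0 < alpha < 1; as alpha approaches 1 the half-life shrinks towards zero (short memory).
    """
    return math.log(0.5) / math.log(1.0 - alpha)

# Example: alpha = 0.25 gives a half-life of roughly 2.4 lags.
print(half_life(0.25))
```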

3.1 Partial Autocorrelation

We consider autoregressive time series models with additive white noise (the assumption of Gaussian white noise is convenient but not necessary — any type of white noise is possible), of the form

$$y_t = f(y_{t-1}, \dots, y_{t-p}) + \epsilon_t,$$

which carry a signature allowing the order, $p$, to be determined from "covariance stationary" time series data (see Appendix A for a definition of covariance stationarity). This signature encodes the memory in the model and is given by "partial autocorrelations". Informally, each partial autocorrelation measures the correlation of a random variable, $y_t$, with its lag, $y_{t-h}$, while controlling for intermediate lags. The partial autocorrelation at lag $h$ must be non-zero to be able to predict $y_t$ from $y_{t-h}$. The sign of the partial autocorrelation is also useful for interpretability and describes the directional relationship between the random variables. The formal definition of the partial autocorrelation is now given.

Definition 3.1 (Partial Autocorrelation).

A partial autocorrelation at lag $h$ is a conditional autocorrelation between a variable, $y_t$, and its lag, $y_{t-h}$, under the assumption that the values of the intermediate lags, $y_{t-1}, \dots, y_{t-h+1}$, are controlled:

$$\tilde{\tau}_h := \frac{\tilde{\gamma}_{t,h}}{\sqrt{\tilde{\gamma}_{t,h}(y_t)\,\tilde{\gamma}_{t,h}(y_{t-h})}},$$

where

$$\tilde{\gamma}_{t,h} := \mathrm{Cov}\big(y_t - P(y_t \mid y_{t-1}, \dots, y_{t-h+1}),\; y_{t-h} - P(y_{t-h} \mid y_{t-1}, \dots, y_{t-h+1})\big)$$

is the lag-$h$ partial autocovariance, $P(W \mid \mathcal{B})$ is an orthogonal projection of $W$ onto the set $\mathcal{B}$, and

$$\tilde{\gamma}_{t,h}(x) := \mathrm{Var}\big(x - P(x \mid y_{t-1}, \dots, y_{t-h+1})\big). \qquad (11)$$

The partial autocorrelation function (PACF) is a map $h \mapsto \tilde{\tau}_h$. The plot of $\tilde{\tau}_h$ against $h$ is referred to as the partial correlogram.

The PACF of the RNN(p) can be used to determine a lower bound on the sequence length in an α-RNN(p). To see this, we first show that the partial autocorrelation of the α-RNN(p) is time independent and has a sharp cut-off after $p$ lags if $\alpha = 1$.

Theorem 1 (Partial autocorrelation of an α-RNN(p)).

The partial autocorrelation of the α-RNN(p) is time independent and exhibits a cut-off after $p$ lags: $\tilde{\tau}_h = 0$ for $h > p$ if $\alpha = 1$. If $\alpha < 1$, the α-RNN(p) has non-zero partial autocorrelations at lags beyond the sequence length.

See Appendix B for a proof. For $\alpha < 1$, the α-RNN has non-zero partial autocorrelations at lags beyond $p$. It is easy to see this from the additional term containing $\tilde{h}_{t-2}$ in Equation 9. Further insight can be gained from Figure 1, which shows the fitted partial correlogram from data generated by an α-RNN(3) with additive white noise. We observe that the memory is always longer than the sequence length of 3 when $\alpha < 1$. As $\alpha$ approaches zero, the model has increasing memory (although the model has no memory in the limit $\alpha = 0$). The theorem and the properties of the α-RNN(p) suggest determining the sequence length from the fitted PACF of each covariate. Moreover, the prediction horizon should not exceed that suggested by the maximum order of statistically significant partial autocorrelations.
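In practice, the sequence length p can be read directly off the estimated partial correlogram. The sketch below uses statsmodels; the 30-lag window and 5% significance band are illustrative choices, not values from the paper.

```python
import numpy as np
from statsmodels.tsa.stattools import pacf

def choose_sequence_length(y, max_lag=30, significance=0.05):
    """Return the largest lag with a statistically significant partial
    autocorrelation; this lower-bounds the RNN sequence length p."""
    values, conf_int = pacf(y, nlags=max_lag, alpha=significance)
    half_width = conf_int[:, 1] - values              # half-width of the confidence band
    significant = np.abs(values[1:]) > half_width[1:]  # ignore lag 0
    lags = np.where(significant)[0] + 1
    return int(lags.max()) if lags.size else 1
```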

Figure 1: The fitted partial correlogram of univariate data generated by various additive noise α-RNN(3) models.

3.2 Stability

Time series modeling also places emphasis on model "stability" — this is the model attribute that past random disturbances decay in the model and the effect of lagged data becomes less relevant to the model output with increasing lag. We present the following theorem, which shows that the stability of the α-RNN model is determined solely by the hidden state's activation function.

Theorem 2 (Stability of RNN(p) models).

Suppose that there exists an invertible nonlinear function of the lag operator of the form:

where, without loss of generality, we have again conveniently set , and . Then the RNN is stable if and only if for all finite .

See Appendix C for a proof. In particular, the theorem justifies the common choice of tanh activation; while the sigmoid is another viable choice, it is too restrictive as it prohibits negative partial auto-correlations. We shall see in Section 6 that negative partial auto-correlations arise in time series data.

3.3 Vanishing gradient

It is well known that plain RNNs exhibit a vanishing gradient problem (pascanu2012difficulty; bayer). Following pascanu2012difficulty, we extend some of the analysis of backpropagation through time (BPTT) to α-RNNs. The α-RNN(p) BPTT can be written as:

(12)
(13)

for some generic loss function, where the gradient of the smoothed hidden state with respect to its lagged value follows from the recurrence relations. Substituting this expression into Equation 12 gives an expression proportional to a product of these terms over the unfolded time steps.

When $\alpha = 1$ this expression is that of a plain RNN, and the product goes to zero with increasing function compositions due to the gradient of tanh vanishing for large arguments and the product of small terms. However, when $\alpha < 1$, the additional term $(1 - \alpha)$ provides an additional contribution to the gradient which is non-trivial for small $\alpha$ and independent of the input.

Thus the α-RNN not only alleviates the vanishing gradient problem and has guaranteed stability if we choose a tanh activation function, but also exhibits non-zero partial auto-correlations up to at least lag $p$ for $\alpha > 0$.

The α-RNN model can be trained by treating α as an additional parameter to be fitted with stochastic gradient descent. The choice to pre-determine that α is independent of time is obviously limited to stationary time series. While this is restrictive, it suggests that a simple statistical test of the time series can pre-determine the efficacy of the approach (for multivariate data, the test should be applied to each covariate). Moreover, if the data is covariance stationary, then the α-RNN will preserve the stationarity of the partial auto-correlation structure, eliminating the need for more complex architectures such as GRUs and LSTMs (which are often motivated purely on the basis of the vanishing gradient problem). Such a procedure shall be demonstrated in Section 6. We can extend the model to non-stationary time series, which exhibit dynamic partial autocorrelation, by using a dynamic version of exponential smoothing.

4 Dynamic α_t-RNNs

The extension of RNNs to dynamical time series models, suitable for non-stationary time series data, relies on dynamic exponential smoothing — a time-dependent, convex combination of the smoothed output, $\tilde{h}_{t-1}$, and the hidden state $h_t$:

$$\tilde{h}_t = \alpha_t h_t + (1 - \alpha_t)\tilde{h}_{t-1}, \qquad (14)$$

where $\alpha_t \in [0, 1]$ denotes the dynamic smoothing factor, which can be equivalently written in the one-step-ahead forecast error correction form

$$\tilde{h}_t = h_t + (1 - \alpha_t)(\tilde{h}_{t-1} - h_t). \qquad (15)$$

Hence the smoothing can be viewed as a form of dynamic forecast error correction. When $\alpha_t = 1$, the forecast error is ignored and the smoothing merely repeats the current hidden state $h_t$, to the effect of the model losing its memory. When $\alpha_t = 0$, the forecast error overwrites the current hidden state $h_t$. The smoothing can also be viewed as a weighted sum of the lagged hidden states, $\tilde{h}_t = \sum_{s \leq t} w_{t,s}\, h_s$, with lower or equal weights at the more distant lagged hidden states. Note that if $\alpha_\tau = 1$ for any $\tau \leq t$, the smoothed hidden state will have no dependency on the lagged hidden states $h_s$, $s < \tau$. The model simply forgets the hidden states at or beyond that lag.

4.1 Neural network exponential smoothing

While the class of α_t-RNN models under consideration is free to define how α_t is updated (including changing the frequency of the update) based on the hidden state and input, a convenient choice is to use a recurrent layer. Returning again to the more general setup with a hidden state vector $h_t$, let us model the smoothing parameter $\alpha_t$ to give a filtered time series

$$\tilde{h}_t = \alpha_t \circ h_t + (1 - \alpha_t) \circ \tilde{h}_{t-1}, \qquad (16)$$

where $\circ$ denotes the Hadamard product between vectors. This smoothing is a vectorized form of the above classical setting, only here we note that when $\alpha_{t,i} = 1$, the $i$-th component of the hidden variable is unmodified and the past filtered hidden variable is forgotten. On the other hand, when $\alpha_{t,i} = 0$, the $i$-th component of the hidden variable is obsolete, instead setting the current filtered hidden variable to its past value. The smoothing in Equation 16 can then be viewed as updating long-term memory, maintaining a smoothed hidden state variable as the memory through a convex combination of the current hidden variable and the previous smoothed hidden variable.

The hidden variable is given by the semi-affine transformation:

$$h_t = \sigma(W_h x_t + U_h \tilde{h}_{t-1} + b_h), \qquad (17)$$

which in turn depends on the previous smoothed hidden variable. Substituting Equation 17 into Equation 16 gives a function of $\tilde{h}_{t-1}$ and $x_t$:

$$\tilde{h}_t = \alpha_t \circ \sigma(W_h x_t + U_h \tilde{h}_{t-1} + b_h) + (1 - \alpha_t) \circ \tilde{h}_{t-1} \qquad (18)$$
$$\phantom{\tilde{h}_t} = \tilde{h}_{t-1} + \alpha_t \circ \big(\sigma(W_h x_t + U_h \tilde{h}_{t-1} + b_h) - \tilde{h}_{t-1}\big). \qquad (19)$$

We see that when $\alpha_t = 0$, the smoothed hidden variable is not updated by the input $x_t$. Conversely, when $\alpha_t = 1$, we observe that the hidden variable locally behaves like a non-linear autoregressive series. Thus the smoothing parameter $\alpha_t$ can be viewed as the sensitivity of the smoothed hidden state to the input $x_t$.

The challenge becomes how to determine dynamically how much error correction is needed. As in GRUs and LSTMs, we can address this problem by learning $\alpha_t$ from the input and the previous smoothed hidden state with a recurrent layer parameterized by its own weights and biases. The one-step ahead forecast of the smoothed hidden state, $\tilde{h}_{t+1}$, is then the filtered output of another plain RNN with separate weights and biases.
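The update just described can be packaged as a custom recurrent cell. The sketch below is not the authors' AlphatRNN class, only an illustration of the same idea: a sigmoid recurrent layer produces the smoothing vector α_t, a tanh recurrent layer produces the hidden state h_t (Equation 17), and the two are combined by Hadamard-product exponential smoothing (Equation 16).

```python
import tensorflow as tf

class AlphaTRNNCell(tf.keras.layers.Layer):
    """Minimal sketch of a dynamic alpha_t-RNN cell (illustrative, not the paper's code).

    The state carried between time steps is the smoothed hidden state h_tilde.
    """

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = units
        self.output_size = units

    def build(self, input_shape):
        d = input_shape[-1]
        glorot = tf.keras.initializers.GlorotUniform()
        ortho = tf.keras.initializers.Orthogonal()
        # hidden-state recurrence (Equation 17)
        self.W_h = self.add_weight(shape=(d, self.units), initializer=glorot, name="W_h")
        self.U_h = self.add_weight(shape=(self.units, self.units), initializer=ortho, name="U_h")
        self.b_h = self.add_weight(shape=(self.units,), initializer="zeros", name="b_h")
        # smoothing-vector recurrence (sigmoid keeps alpha_t in [0, 1])
        self.W_a = self.add_weight(shape=(d, self.units), initializer=glorot, name="W_a")
        self.U_a = self.add_weight(shape=(self.units, self.units), initializer=ortho, name="U_a")
        self.b_a = self.add_weight(shape=(self.units,), initializer="zeros", name="b_a")

    def call(self, inputs, states):
        h_tilde_prev = states[0]
        alpha_t = tf.sigmoid(tf.matmul(inputs, self.W_a)
                             + tf.matmul(h_tilde_prev, self.U_a) + self.b_a)
        h_t = tf.tanh(tf.matmul(inputs, self.W_h)
                      + tf.matmul(h_tilde_prev, self.U_h) + self.b_h)
        h_tilde = alpha_t * h_t + (1.0 - alpha_t) * h_tilde_prev   # Equation 16
        return h_tilde, [h_tilde]

# Wrap the cell in a recurrent layer and add the final feedforward output layer.
model = tf.keras.Sequential([
    tf.keras.layers.RNN(AlphaTRNNCell(10), input_shape=(None, 1)),
    tf.keras.layers.Dense(1),
])
```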

Comparison with α-RNNs

Figure 2(a) shows the response of a univariate α-RNN model when the input consists of two unit impulses and zeros otherwise. For simplicity, the sequence length is assumed to be 3 (i.e. the RNN has a memory of 3 lags), the biases are set to zero and all the weights are set to one (the weights have not been fitted here; we are merely observing the effect of smoothing on the hidden state for the simplest choice of parameter values). The plain RNN loses memory of the unit impulse after three lags, whereas the RNNs with smoothed hidden states maintain memory of the first unit impulse even when the second unit impulse arrives. The figure also shows an α_t-RNN model which, although it appears insignificant in this example, allows the model to fit non-stationary time series data. This is because the time-dependent α_t results in dynamic partial autocorrelations. The response of α_t to shocks can be seen in Figure 2(b) (note that the value of α_t is initially set to one since no long-term memory is needed; hence the response to the second shock is more insightful).

(a) Comparison of model responses in the presence of shocks to the inputs.
(b) Response of α_t to shocks in the input.
Figure 2: An illustrative example of the response of an α-RNN in comparison with a plain RNN. The RNN model is chosen for illustrative purposes to be an RNN(3) model, i.e. with a sequence length of 3.

See Section 5 for a discussion of the differences between the α_t-RNN and the GRU and LSTM.

5 Relationship to GRUs and LSTMs

5.1 GRUs

The α_t-RNN model has no means to entirely reset its memory and become a feed-forward network (FFN). This is because the hidden variable's update equation always depends on the previous smoothed hidden state, unless $U_h = 0$. By adding a reset layer, we recover a GRU:

$$r_t = \sigma_{\mathrm{sig}}(W_r x_t + U_r \tilde{h}_{t-1} + b_r), \qquad \alpha_t = \sigma_{\mathrm{sig}}(W_\alpha x_t + U_\alpha \tilde{h}_{t-1} + b_\alpha),$$
$$h_t = \sigma(W_h x_t + U_h (r_t \circ \tilde{h}_{t-1}) + b_h), \qquad \tilde{h}_t = \alpha_t \circ h_t + (1 - \alpha_t) \circ \tilde{h}_{t-1}.$$

When viewed as an extension of our α_t-RNN model, we observe that the effect of introducing a reset, or switch, $r_t$, is to forget the dependence of $h_t$ on the smoothed hidden state. Effectively, we turn the update for $h_t$ from a plain RNN into a FFN and entirely neglect the recurrence. The recurrence in the update of $h_t$ is thus dynamic. It may appear that the combination of a reset and adaptive smoothing is redundant. But remember that $\alpha_t$ affects the level of error correction in the update of the smoothed hidden state, $\tilde{h}_t$, whereas $r_t$ adjusts the level of recurrence in the unsmoothed hidden state $h_t$. Put differently, $\alpha_t$ by itself cannot disable the memory in the smoothed hidden state (internal memory), whereas in combination with $r_t$ it can. More precisely, when $\alpha_t = 1$ and $r_t = 0$, the smoothed hidden state is reset to a function of the latest input, $x_t$, and the GRU is just a FFN. Also, when $\alpha_t = 1$ and $r_t = 1$, a GRU acts like a plain RNN. Thus a GRU can be seen as a more general architecture which is capable of being a FFN or a plain RNN under certain parameter values.

These additional layers (or cells) enable a GRU to learn extremely complex long-term temporal dynamics that a vanilla RNN is not capable of. Lastly, we comment in passing that in the GRU, as in an RNN, there is a final feedforward layer to transform the (smoothed) hidden state to a response:

$$\hat{y}_{t+m} = W_y \tilde{h}_{t_p} + b_y. \qquad (20)$$

5.2 LSTMs

The α_t-RNN model, like the GRU, provides a mechanism for propagating a smoothed hidden state — a long-term memory which can be overridden, turning the network into a plain RNN (with short memory) or even a memoryless FFN. More complex models using hidden units with varying connections within the memory unit have been proposed in the engineering literature with empirical success (Hochreiter1997; Gers2001; 7926112). LSTMs are similar to GRUs but have a separate (cell) memory, $c_t$, in addition to a hidden state $h_t$. LSTMs also do not require that the memory updates are a convex combination. Hence they are more general than exponential smoothing. The mathematical description of LSTMs is rarely given in an intuitive form, but the model can be found in, for example, Hochreiter1997.

The cell memory is updated by the following expression involving a forget gate, $f_t$, an input gate $i_t$ and a cell gate $\tilde{c}_t$:

$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t. \qquad (21)$$

In the terminology of LSTMs, the triple $(f_t, o_t, i_t)$ are respectively referred to as the forget gate, output gate, and input gate. Our change of terminology is deliberate and designed to provide more intuition and continuity with RNNs and the statistics literature. We note that in the special case when $i_t = 1 - f_t$ we obtain a similar exponential smoothing expression to that used in our α_t-RNN. Beyond that, the role of the input gate appears superfluous and difficult to reason with using time series analysis.

When the forget gate $f_t = 0$, the cell memory depends solely on the cell memory gate update $i_t \circ \tilde{c}_t$. Through the term $f_t \circ c_{t-1}$, the cell memory has long-term memory which is only forgotten beyond lag $s$ if $f_s = 0$. Thus the cell memory has an adaptive autoregressive structure.

The extra "memory", treated as a hidden state and separate from the cell memory, is nothing more than a Hadamard product:

$$h_t = o_t \circ \tanh(c_t), \qquad (22)$$

which is reset if $o_t = 0$. If $o_t = 1$, then the cell memory directly determines the hidden state.

Thus the reset gate can entirely override the effect of the cell memory's autoregressive structure, without erasing it. In contrast, the α_t-RNN and the GRU have one memory, which serves as the hidden state, and it is directly affected by the reset gate.

The reset, forget, input and cell memory gates are updated by plain RNNs, all depending on the hidden state $h_{t-1}$.

Like the α_t-RNN, the LSTM can function as a short-memory, plain RNN; just set $f_t = 0$ in Equation 21. However, the LSTM can also function as a coupling of FFNs; just set $o_t = 0$ so that $h_t = 0$ and hence there is no recurrence structure in any of the gates. For the avoidance of doubt, since the nomenclature doesn't suggest it, all models in this paper can model long and short-term autoregressive memory. The α_t-RNN couples these memories through a smoothed hidden state variable. The LSTM separates out the long memory, stored in the cellular memory, but uses a copy of it, which may additionally be reset. Strictly speaking, the cellular memory has a long-short autoregressive memory structure, so it would be misleading in the context of time series analysis to strictly discern the two memories as long and short (as the nomenclature suggests). The latter can be thought of as a truncated version of the former.

6 Numerical Experiments

This section describes numerical experiments using financial time series data to evaluate the various RNN models. All models are implemented in v1.15.0 of TensorFlow (abadi2016tensorflow). Time series cross-validation is performed using separate training, validation and test sets. To preserve the time structure of the data and avoid look-ahead bias, each set represents a contiguous sampling period, with the test set containing the most recent observations. To prepare the training, validation and testing sets for m-step ahead prediction, we set the target variables (responses) to the observation $y_{t+m}$ and use the lags $\{x_{t-p+1}, \dots, x_t\}$ for each input sequence. This is repeated by incrementing $t$ until the end of each set. In our experiments, each element in the input sequence is either a scalar or a vector and the target variables are scalar.
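A sketch of the windowing just described — the function name and indexing convention are ours; the target is taken from the first column of the array, m steps after the end of each p-length window.

```python
import numpy as np

def make_windows(series, p, m):
    """Build (p-length input sequence, m-step-ahead target) pairs.

    series : (T, d) array of observations ordered in time.
    Returns X of shape (n, p, d) and y of shape (n,).
    """
    X, y = [], []
    for t in range(p, series.shape[0] - m + 1):
        X.append(series[t - p:t])          # lags t-p, ..., t-1
        y.append(series[t + m - 1, 0])     # observation m steps after the last lag
    return np.array(X), np.array(y)
```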

We use the Keras SimpleRNN method with the default settings to implement a fully connected RNN. Tanh activation functions are used for the hidden layer, with the number of units and the L2 regularization found by time series cross-validation with five folds. The Glorot and Bengio uniform method (Glorot2010) is used to initialize the non-recurrent weight matrices, and an orthogonal method is used to initialize the recurrence weights as a random orthogonal matrix. Keras's GRU method is implemented using the formulation of version 1406.1078v of the GRU paper, which applies the reset gate to the hidden state before matrix multiplication. Similarly, the LSTM method in Keras is used. Tanh activation functions are used for the recurrence layer and sigmoid activation functions are used for all other gates. The AlphaRNN and AlphatRNN classes are implemented by the authors for use in Keras. Statefulness is always disabled.
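A sketch of the baseline configurations with the stated settings (tanh recurrence, Glorot-uniform non-recurrent weights, orthogonal recurrence weights, statefulness disabled); the hidden-unit count H and the L2 penalty stand in for the cross-validated values reported in the tables.

```python
import tensorflow as tf

def build_baseline(cell="rnn", H=20, l2=0.0, input_dim=1):
    """SimpleRNN / GRU / LSTM baseline with the initializations described above."""
    common = dict(
        units=H,
        activation="tanh",
        kernel_initializer="glorot_uniform",    # Glorot and Bengio uniform
        recurrent_initializer="orthogonal",     # random orthogonal recurrence weights
        kernel_regularizer=tf.keras.regularizers.l2(l2),
        stateful=False,
    )
    recurrent = {"rnn": tf.keras.layers.SimpleRNN,
                 "gru": tf.keras.layers.GRU,
                 "lstm": tf.keras.layers.LSTM}[cell](**common)
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, input_dim)),
        recurrent,
        tf.keras.layers.Dense(1),
    ])
```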

Each architecture is trained for up to 2000 epochs with the Adam optimization algorithm, using default parameter values and a mini-batch size of 1000 drawn from the training set. Early stopping is used with a patience of 50 to 100 epochs and a small minimum delta. No randomization is used in the mini-batch sampling in order to preserve the ordering of the data. To evaluate the forecasting accuracy, we set the forecast horizon to up to ten steps ahead instead of the usual one-step ahead forecasts often presented in the machine learning literature — longer forecasting horizons are often more relevant due to operational constraints in industrial applications and are more challenging when the data is non-stationary, since the network's fixed partial auto-correlation will not adequately capture the observed changing autoregressive structure. The numerical results are separated into two areas of interest: (i) properties of recurrent architectures on archetypal data properties such as seasonality and regime switching; and (ii) evaluation on real data to demonstrate their utility in industrial forecasting.
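Continuing the sketches above, training follows the stated protocol (Adam with default parameters, mini-batches of 1000, up to 2000 epochs, early stopping, no shuffling); the min_delta tolerance is a placeholder since the exact value is not reproduced here.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=100, min_delta=1e-6,   # tolerance is a placeholder
    restore_best_weights=True)

model = build_baseline(cell="rnn", H=20)                 # from the sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit(X_train, y_train,                              # windows built with make_windows above
          validation_data=(X_val, y_val),
          epochs=2000, batch_size=1000,
          shuffle=False,                                  # preserve the ordering of the data
          callbacks=[early_stop])
```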

6.1 LLM Seasonality DGP

To characterize the ability of exponentially smoothed RNNs to directly capture seasonality from noisy data, without the need to separately deseasonalize the data, we generate hourly data from an additive local level model with daily seasonality and i.i.d. noise (harvey1990forecasting). We simulate the observations under several noise variances; the first portion of the observations is used for training and the remainder for testing.
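Since the exact parameter values of the data generating process are not reproduced above, the following simulation is only an illustrative sketch of an additive local level (random-walk) model with period-24 seasonality and i.i.d. Gaussian noise; all variances are placeholders.

```python
import numpy as np

def simulate_llm_seasonality(T=10_000, period=24, sigma_level=0.1, sigma_eps=1.0, seed=0):
    """Additive local level model with daily seasonality: y_t = mu_t + gamma_t + eps_t."""
    rng = np.random.default_rng(seed)
    level = np.cumsum(sigma_level * rng.standard_normal(T))   # random-walk level mu_t
    season = np.sin(2.0 * np.pi * np.arange(T) / period)      # daily seasonal pattern gamma_t
    noise = sigma_eps * rng.standard_normal(T)                # i.i.d. noise eps_t
    return level + season + noise
```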

The data is non-stationary — we accept the Null hypothesis of the augmented Dickey-Fuller test, which states that the AR model contains a unit root (the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567). We choose the sequence length based on the PACF (not shown as the DGP is known).

Figure 3 compares the PACFs of various architectures on the test set. Figure 3a shows the PACF of the generated data — the positive lags at 24, 48, 72 and 96, due to the seasonality, are clearly visible. Figures 3b and 3c show the PACF of the RNN and the α-RNN, both stationary architectures. Neither is able to capture the seasonality. On the other hand, the α_t-RNN, the GRU and the LSTM, shown in Figure 3(d-f), adequately capture the seasonality. This is an important observation, as typically deseasonalization techniques such as differencing or convolutional filtering are needed. For completeness, the cross-validated parameters, using five folds, are shown in Table 1.

(a) Observed
(b) Simple RNN
(c) α-RNN
(d) α_t-RNN
(e) GRU
(f) LSTM
Figure 3: The PACFs are compared for various recurrent architectures over a local level seasonality dataset with a periodicity of 24 hours.
Architecture Parameters L2 H
RNN 41 0 5
α-RNN 132 0 10
α_t-RNN 86 0 5
GRU 1,341 0 20
LSTM 1,781 0 20
Table 1: The cross-validated parameters of the ten-step ahead forecasting models for the llm+seasonality data generation process.

6.2 Multi-Regime DGP

We evaluate the various time series forecasting models on a simulated univariate non-stationary dataset. Specifically, we assume the existence of a discrete latent variable on which a conditionally stationary random distribution exists. Such a model is dynamically representative of the types of noisy time series generated by many applications, such as electricity demand, motor neuron events from electroencephalography (EEG) data (10.3389/fnins.2018.00024) and channel detection in cognitive radio (7009404). Indeed, regimes are ubiquitous in industrial applications, such as recession periods in economic data and on-peak and off-peak periods in traffic and electricity demand.

Observations of the univariate time series are generated from a two-regime AR(30) model, a linear model of the form given in Appendix D.

To test that the time series is indeed non-stationary, we accept the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level, which states that the AR model contains a unit root and is thus non-stationary (the critical values are 1%: -3.434, 5%: -2.863, and 10%: -2.568). We choose the sequence length based on the PACF (not shown as the DGP is known).

Figure 4 compares the performance of the various forecasting networks. The ten-step ahead forecasts are compared in ascending order of the number of hidden layers — plain RNNs, α-RNNs, α_t-RNNs, GRUs and LSTMs. The figure shows that stationary models such as the plain RNN and the α-RNN inadequately capture the regime change, with the peaks being severely underestimated. The latter effect is symptomatic of a model which has learned the auto-regressive structure in the lower regime, to the detriment of the other. All methods produce significant forecast error at the regime transition. We further observe minor differences in the performance of the GRU versus the α_t-RNN model, suggesting that the reset gate provides a marginal effect for this dataset. We also observe no additional benefit in maintaining an additional cell memory. See Section 5 for further discussion on the cellular memory.

Architecture Parameters L2 H MSE (test)
RNN 461 0.001 20 (0.0649)
α-RNN 462 0.001 20 (0.0636)
α_t-RNN 941 0 20 (0.0360)
GRU 371 0 10 (0.0347)
LSTM 146 0.001 5 (0.0288)
Table 2: The ten-step ahead forecasting models for regime switching data are compared for various architectures using time series cross-validation.

Table 2 compares the average MSEs and their standard deviations for each ten-step ahead forecasting model, using time series cross-validation with ten folds. To assess overfitting, the MSEs over the training periods are also provided for comparison of in- and out-of-distribution model performance. The stationary models are observed to exhibit the worst MSE — the plain RNN and the α-RNN are similar in performance; the latter alleviates the vanishing gradient problem. The α_t-RNN and the GRU are comparable in performance, suggesting that the effect of the reset gate in the latter is marginal.


(a) The forecasts for each architecture and the observed out-of-sample time series.
(b) The errors for each architecture over the same test period.
Figure 4: The ten-step ahead forecasts are compared for various recurrent architectures over a regime switching dataset whose regime alternates every 100 observations.

6.3 Short-term climate forecasting

The Jena climate modeling dataset was recorded at the weather station operated by the Max Planck Institute for Biogeochemistry in Jena, Germany. 14 different quantities (such as air temperature, atmospheric pressure, humidity and wind direction) were recorded every 10 minutes over several years. The dataset contains 420,551 observations covering the period 2009-2016 (jena). We demonstrate how the different networks forecast the temperature using all lagged observed variables. Each covariate in the training and test sets is normalized using the moments of the training data only, so as to avoid look-ahead bias or introduce a bias into the test data.
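A sketch of the normalization step, fitting the moments on the training split only (names are ours):

```python
import numpy as np

def normalize_splits(train, test):
    """Standardize each covariate with the training-set mean and standard deviation,
    so that no information from the test period leaks into the scaling."""
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma
```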

We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level for each covariate, in favor of the alternative hypothesis that the data is stationary (contains no unit root). The critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567. However, we observe some evidence of cyclical memory in some of the covariates, as seen in Figure 5.

Figure 5: The partial autocorrelogram (PACF) for each of the covariates (features) used in the model. Some of the covariates exhibit a monotonically decaying PACF while others exhibit oscillatory decay, with positive and negative partial autocorrelations.

We choose the sequence length based on the PACF and perform a ten-step ahead forecast. Figure 6 compares the performance of the various forecasting networks and shows that stationary models such as the plain RNN and the α-RNN adequately capture the temperature dynamics, even as the forecasting date moves further from the training period — this is expected because the partial autocorrelation is stationary and there is hence no secular drift in the error.

Architecture Parameters L2 H MSE (test)
RNN 261 0 10 (0.176)
α-RNN 107 0.001 5 (0.217)
α_t-RNN 216 0 5 (0.190)
GRU 761 0.001 10 (0.205)
LSTM 406 0 5 (0.218)
Table 3: The ten-step ahead climate forecasts are compared for various architectures using time series cross-validation.

Viewing the results of time series cross-validation, using the first 23,971 observations, in Table 3, we observe minor differences in the performance of the GRU versus the α_t-RNN, suggesting that the reset gate provides no benefit for this dataset. In this case, we also observe no additional benefit from the LSTM.


(a) The forecasts for each architecture and the observed out-of-sample time series.
(b) The errors for each architecture over the same test period.
Figure 6: The ten-step ahead forecasts of temperature using the Jena weather data with MSEs shown in parentheses.

6.4 Electricity consumption

Observations of hourly power system loads are provided over the period from January 1st, 2008 to June 20th, 2011 for the DK2 bidding zone, collected from the open power systems time series data (dk2). The consumption time series is chosen as it exhibits both short-term cyclical patterns and longer-term trends. However, while these cyclical patterns correspond to peak and off-peak consumption periods, the transitions are gradual. Without attempting to deseasonalize or de-trend the data, we apply the various architectures as in the previous experiment.

We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level in favor of the alternative hypothesis that the data is stationary (contains no unit root). The critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567.

Figure 7: The PACF of the electricity load data exhibits seasonality

The PACF in Figure 7 is observed to exhibit seasonality. We choose a short sequence length and perform a ten-step ahead forecast to highlight the limitations of not including higher-order lags.

Figure 8 compares the performance of the various networks and shows that the plain RNN performs poorly, whereas the α_t-RNN better captures the load dynamics. From Table 4, we further observe relatively minor differences in the performance of the GRU versus the α_t-RNN, suggesting that the reset gate provides no benefit. We also observe no additional benefit from the LSTM.


(a) The forecasts for each architecture and the observed out-of-sample time series.
(b) The errors for each architecture over the same test period.
Figure 8: The ten-step ahead forecasts are compared for various architectures over the hourly electricity consumption dataset.
Architecture Parameters L2 H MSE (test)
RNN 461 0 20 0.1062 (0.326)
α-RNN 462 0 20 (0.258)
α_t-RNN 941 0 20 (0.179)
GRU 1,341 0 20 (0.218)
LSTM 1,781 0 20 (0.207)
Table 4: The ten-step ahead forecasting models for electricity load are compared for various architectures using time series cross-validation.

6.5 Bitcoin forecasting

One-minute snapshots of USD-denominated Bitcoin mid-prices are captured from Coinbase over the period from January 1st to November 10th, 2018. We demonstrate how the different networks forecast Bitcoin prices using lagged observations of prices. The predictor in the training and test sets is normalized using the moments of the training data only, so as to avoid look-ahead bias or introduce a bias into the test data. We accept the Null hypothesis of the augmented Dickey-Fuller test, as we cannot reject it at even the 90% confidence level. The data is therefore non-stationary (contains a unit root); the critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567. While the partial autocovariance structure is expected to be time dependent, we observe a short memory of only four lags by estimating the PACF over the entire history (see Figure 9).

Figure 9: The partial autocorrelogram (PACF) for 1 minute snapshots of Bitcoin mid-prices (USD) over the period 2018-01-01 to 2018-11-10.

We choose a sequence length of four based on the PACF and perform a four-step ahead forecast. We comment in passing that there is little, if any, merit in forecasting beyond this horizon given the largest significant lag indicated by the PACF. Figure 10 compares the performance of the various forecasting networks and shows that stationary models such as the plain RNN and the α-RNN least capture the price dynamics — this is expected because the partial autocorrelation is non-stationary.

Architecture Parameters L2 H MSE (test)
RNN 461 0 20
α-RNN 132 0 10
α_t-RNN 86 0 5
GRU 371 0 10
LSTM 491 0 10
Table 5: The four-step ahead Bitcoin forecasts are compared for various architectures using time series cross-validation. The half-life of the α-RNN is found to be 1.077 minutes.

Viewing the results of time series cross-validation, using the first 30,000 observations, in Table 5, we observe minor differences in the performance of the LSTM and GRU versus the α_t-RNN, suggesting that the reset gate and the extra cellular memory in the LSTM provide negligible benefit for this dataset. In this case, we observe very marginal additional benefit in the LSTM, yet its complexity is approximately 50x that of the α_t-RNN.


(a) The forecasts for each architecture and the observed out-of-sample time series.
(b) The errors for each architecture over the same test period.
Figure 10: The four-step ahead forecasts of the minute-snapshot Bitcoin prices (USD) with MSEs shown in parentheses. Note that the prices have been standardized.

6.6 High Frequency Trading Data

Our dataset consists of observations of tick-by-tick Volume Weighted Average Prices (VWAPs) of the CME-listed ESU6 futures contract, constructed from level II data over the month of August 2016 (doi:10.1002/asmb.2399; Dixon2017b).

We reject the Null hypothesis of the augmented Dickey-Fuller test at the 99% confidence level in favor of the alternative hypothesis that the data is stationary (contains no unit root). The critical values are 1%: -3.431, 5%: -2.862, and 10%: -2.567.

Figure 11: The PACF of the tick-by-tick VWAP of ESU6 over the month of August 2016.

The PACF in Figure 11 is observed to exhibit a cut-off at approximately 23 lags. We therefore choose a sequence length of 23 and perform a ten-step ahead forecast. Note that the time-stamps of the tick data are not uniform, and hence a step refers to a tick.

Figure 12 compares the performance of the various networks and shows that the plain RNN performs poorly, whereas the α_t-RNN better captures the VWAP dynamics. From Table 6, we further observe relatively minor differences in the performance of the GRU versus the α_t-RNN, again suggesting that the reset gate and the extra cellular memory in the LSTM provide no benefit. In this case, we find that the GRU is approximately 10x the complexity of the α_t-RNN with very marginal benefit.


(a) The forecasts for each architecture and the observed out-of-sample time series.
(b) The errors for each architecture over the same test period.
Figure 12: The ten-step ahead forecasts of VWAPs are compared for various architectures using the tick-by-tick dataset.
Architecture Parameters L2 H MSE (test)
RNN 41 0 5
α-RNN 132 0 10
α_t-RNN 86 0 5 1
GRU 1,341 0 20
LSTM 491 0 10
Table 6: The ten-step ahead forecasting models for VWAPs are compared for various architectures using time series cross-validation. The half-life of the α-RNN is found to be 2.398 periods.

7 Conclusion

Industrial forecasting has entered an era of unprecedented growth in the size and complexity of data which require new modeling methodologies. This paper presented a general class of exponentially smoothed recurrent neural networks (RNNs) which are well suited to modeling non-stationary dynamical systems arising in industrial applications such as electricity load management and financial risk and trading. In particular, we demonstrated how they characterize the non-linear partial autocorrelation structure of time series and directly capture dynamic effects such as seasonality and regime changes. Application of exponentially smoothed RNNs to electricity load forecasting, weather data and financial time series, such as minute-level Bitcoin prices and CME futures tick data, highlighted the efficacy of exponential smoothing for multi-step time series forecasting. In all of these examples, we showed that exponentially smoothed RNNs are well suited to forecasting, being much simpler than more complex architectures such as GRUs and LSTMs, yet retaining the most important aspects needed for forecasting non-stationary series. These methods scale to large numbers of covariates and complex data. The experimental design and architectural parameters, such as the predictive horizon and model parameters, can be determined by simple statistical tests and diagnostics, without the need for extensive parameter optimization. Moreover, unlike traditional time series methods such as ARIMA models, these methods are shown to be unconditionally stable without the need to pre-process the data.

References

Appendix A Time Series Modeling Definitions

Definition A.1 (Time series).

A time series is one observation of a stochastic process over a specific interval: $\{y_t\}_{t=1}^{N}$.

Definition A.2 (Autocovariance).

The autocovariance of a time series is $\gamma_{t,s} = \mathrm{Cov}(y_t, y_s) = \mathbb{E}[(y_t - \mu_t)(y_s - \mu_s)]$, where $\mu_t = \mathbb{E}[y_t]$.

Definition A.3 (Covariance (weak) stationarity).

A time series is weakly (or wide-sense) covariance stationary if it has a time-constant mean and autocovariances of all orders:

$$\mathbb{E}[y_t] = \mu \;\; \forall t, \qquad \mathrm{Cov}(y_t, y_{t-\tau}) = \gamma_\tau \;\; \forall t, \tau.$$

As we have seen, this implies that $\gamma_{t,s} = \gamma_{t-s}$: the autocovariances depend only on the interval between observations, but not the time of the observations.

Definition A.4 (Autocorrelation).

The autocorrelation, $\rho_\tau$, is just the autocovariance divided by the variance:

$$\rho_\tau = \frac{\gamma_\tau}{\gamma_0}. \qquad (23)$$
Definition A.5 (White noise).

White noise, $\epsilon_t$, is i.i.d. error which satisfies all three conditions:

  1. $\mathbb{E}[\epsilon_t] = 0$, $\forall t$;

  2. $\mathrm{Var}(\epsilon_t) = \sigma^2$, $\forall t$; and

  3. $\epsilon_t$ and $\epsilon_s$ are independent for $t \neq s$, $\forall t, s$.

Gaussian white noise just adds a normality assumption to the error. White noise error is often referred to as a “disturbance”, “shock” or “innovation” in the time series literature.

Appendix B Proof of Partial Autocovariance Property of RNNs

Proof.

Let us first consider a RNN(1) process, i.e. $p = 1$. The lag-1 partial autocovariance is

(24)

and using the RNN(1) model with, for simplicity, a single recurrence weight, :

(25)

gives

(26)

where we have assumed in the second part of the expression.

Continuing with the lag-2 autocovariance gives:

(27)

and is approximated by the RNN(1):

(28)

and is approximated by the backward RNN(1):

(29)

so that we see, crucially, that depends on but not on . Substituting the backward RNN(1) and into Equation 27 gives

(30)

and hence depends on . Thus we have that .

Now suppose $\alpha < 1$. Repeating the above, we have the backward α-RNN:

(31)

and we see that, by virtue of the dependency of on and hence , that the lag-2 autocovariance is no longer zero.

Now consider the lag-2 partial autocovariance of the RNN(2) process, again with . Using the backward RNN(2) model:

(32)

which depends on and hence the lag-2 partial autocovariance:

(33)

is not zero. It follows by induction that lag-s partial autocorrelations

(34)

since