Log In Sign Up

Stanza: A Nonlinear State Space Model for Probabilistic Inference in Non-Stationary Time Series

Time series with long-term structure arise in a variety of contexts and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting settings. Traditionally, state space models have been successful in providing uncertainty estimates of trajectories in the latent space. More recently, deep learning, attention-based approaches have achieved state of the art performance for sequence modeling, though often require large amounts of data and parameters to do so. We propose Stanza, a nonlinear, non-stationary state space model as an intermediate approach to fill the gap between traditional models and modern deep learning approaches for complex time series. Stanza strikes a balance between competitive forecasting accuracy and probabilistic, interpretable inference for highly structured time series. In particular, Stanza achieves forecasting accuracy competitive with deep LSTMs on real-world datasets, especially for multi-step ahead forecasting.


DynaConF: Dynamic Forecasting of Non-Stationary Time-Series

Deep learning models have shown impressive results in a variety of time ...

Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series

Integrating deep learning with latent state space models has the potenti...

SSDNet: State Space Decomposition Neural Network for Time Series Forecasting

In this paper, we present SSDNet, a novel deep learning approach for tim...

RNN with Particle Flow for Probabilistic Spatio-temporal Forecasting

Spatio-temporal forecasting has numerous applications in analyzing wirel...

Probabilistic Time Series Forecasting with Structured Shape and Temporal Diversity

Probabilistic forecasting consists in predicting a distribution of possi...

Analysis and modeling to forecast in time series: a systematic review

This paper surveys state-of-the-art methods and models dedicated to time...

A Case-Study of Sample-Based Bayesian Forecasting Algorithms

For a Bayesian, real-time forecasting with the posterior predictive dist...

1 Introduction

Time series with long-term structure arise in a variety of contexts and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting. Traditionally, state space models like Kalman Filters

(Kalman, 1960), Dynamic Linear Models (DLMs) (West and Harrison, 1997) and their numerous extensions (Ljung, 1979; Eric A. Wan and Rudolph van der Merwe, 2000) have been utilized for probabilistic inference in time series analysis. This general class of state space models has been very successful for modeling systems where the dynamics are approximately linear and relatively simple. Alternatively, deep learning, attention-based models (Vaswani et al., 2017; Chung et al., 2015) have achieved state-of-the-art forecasting performance in modeling longer-term structure in time series and sequences, especially for applications relating to language. However, these methods often require large amounts of training data.

Traditional Kalman Filters and DLMs are very good for inference tasks, especially in settings where probabilistic filtering and smoothing of the state space is important. However, Kalman Filters often are not competitive for forecasting, especially for data that does not meet the model specifications. On the other hand, deep learning models like Long Short-Term Memory (LSTM) models

(Hochreiter and Schmidhuber, 1997) achieve excellent forecasting accuracy in many settings, but cannot be used for inference tasks, as filtering and smoothing is done implicitly and deterministically in the model. Thus, there still remains a gap between time series models designed for inference and those that achieve competitive forecasting accuracy.

We propose Stanza, a State Space Model with Nonlinear Activations and Dynamically Weighted Latent (Z) States, as an intermediate approach to alleviate this trade-off between inference and forecasting. Our motivation is to fill the gap between large deep learning, attention-based models and traditional state space models. Stanza achieves forecasting accuracy comparable to deep LSTMs on three real-world datasets and is able to perform inference for complex, nonlinear, non-stationary time series. Furthermore, Stanza is fully probabilistic in the state space, allowing for straightforward multi-step ahead forecasts out to long time horizons and provides forecast coverage estimates which are not possible in traditional LSTMs. Stanza incorporates ideas from deep learning into traditional state space models, resulting in a nonlinear, non-stationary model capable of performing inference in more general and complex settings, as compared to classical state space models. We design Stanza specifically for settings where there is not enough data for a full attention-based model (i.e. the small to medium data regime) and/or where the model needs to be interpretable and uncertainty in the inferential task needs to be quantified.

The optimal setting for Stanza is quite general and occurs in many different application areas. For example, in many medical settings, the quantity of data may be limited, yet it is critical to have an interpretable and probabilistic model for inference tasks and simple DLMs may not be rich enough to capture the complexity in the data, for example, with EEG signals (West et al., 1999). In retail sales settings, there may be an abundance of data, yet the goal may be to understand seasonality patterns in consumer behavior, necessitating a model with interpretable parameters that can either uncover structure in the data or specifically incorporate known structure into the model (Berry and West, 2019). Finally, even in financial settings where forecasting is more important than inference, it may be the case that there is not enough data to train a full deep learning, attention-based model, yet competitive forecasting accuracy is still required (for example, in the case of the Exchange Rate data considered in (Lai et al., 2017)). Stanza is motivated by a need for powerful inference and forecasting.

Our contributions are as follows:

  1. Stanza: We propose Stanza as a nonlinear, non-stationary, interpretable time series model that gives full uncertainty estimates for parameters.

  2. Versatility: We demonstrate Stanza’s ability to model complex and highly structured time series and provide interpretable parameters with uncertainty estimates for inference. We also demonstrate that Stanza is competitive with deep LSTMs in terms of forecasting accuracy on real-world datasets, particularly for multi-step ahead forecasts.

  3. Intermediate Approach: Stanza incorporates ideas from deep learning based models directly into the state space model framework.

2 Stanza: a State Space Model with Nonlinear Activations and Dynamically Weighted Latent (Z) States

Stanza is designed for competitive performance on both inference and forecasting tasks, specifically for settings with limited data and/or where parameter uncertainty estimates and interpretable models are critical. Building off ideas from traditional state space models, particularly Time-Varying Autoregressive (TVAR) models (West et al., 1999), Stanza combines ideas from deep learning approaches into the state space model framework for an intermediate approach that balances advantages of probabilistic inference and competitive forecasting accuracy.

2.1 Model

Let be a univariate or multivariate time series of length centered about 0, and a sequence of real-valued latent states. The Stanza model is specified in Equation 1:


where . The dynamic weights, , allow for non-stationarity in the modeling of the time series and can directly capture structure in the observed time series.111Note: for clarity of notation, when

, there are 0s in the weight vector

for the elements corresponding to times . For example, if and , then . The hyper-parameter determines the order of the lag dependence in the state space and can be specified directly with knowledge about the application. For example, for daily time series data, if there is known weekly seasonality, then can be set to 7 and the dynamic weights, are expected to play an important role in the modeling of the time series. The lag dependence can also be specified via cross-validation, allowing for structure to be learned from the time series. The lagged dependence enables series with longer term structure to be modeled as compared to the standard DLM or Kalman Filter, which utilize only a first-order Markov assumption in the latent space. The function is the hyperbolic tangent function, which helps prevent the values of from increasing without bound and also aids in constraining the model for identifiability. This specification extends the basic TVAR framework of (West et al., 1999) with ideas from attention-based models, such as (Alaa and van der Schaar, 2019), where the complexity of the model is captured in the state space, rather than the observed space, via dynamic weights. In the case of Stanza, these dynamic weights, can be directly interpreted in terms of structure in the observed time series.

Similar to deep models, Stanza specifies the model complexity in the latent space, rather than the emission distribution, which is kept simple. This setup allows for flexibility to model either univariate or multivariate series

within the same framework due to the simplicity of the emission distribution. This is in contrast to TVAR models, which would require matrix-normal distributions to model multivariate sequences. Additionally, in contrast to LSTMs, Stanza allows for probabilistic, interpretable inference, while still allowing for long-term dependencies, specified by

. An external latent sequence can optionally be added to the state space to serve as an external input signal, as often used in traditional Kalman Filters (Dan Simon, 2006) or to be used as a multi-scale latent factor in a decouple-recouple setting (Berry and West, 2019; Ferreira et al., 2006; West and Harrison, 1997).

2.2 Inference

Filtering and smoothing of the latent space are critical to inference in DLMs and Kalman filters, and inference in Stanza proceeds similarly. The Extended Kalman Filter (Ljung, 1979) and Unscented Kalman Filter (Eric A. Wan and Rudolph van der Merwe, 2000)

are extensions to the forward filtering algorithms in the standard Kalman Filter for estimation in the nonlinear setting. We utilize both approaches in the filtering of parameters for Stanza. The Unscented Kalman Filter uses a deterministic sampling approach to approximate the mean and variance of the state variables and can capture the mean and variance of the filtered latent state variable out to the third order

(Eric A. Wan and Rudolph van der Merwe, 2000). The Extended Kalman Filter uses a first order Taylor approximation of the nonlinearity to update the traditional Kalman Filter (Ljung, 1979). Stanza is motivated as a competitive approach to existing deep and state space models and we thus want inference to be relatively fast and efficient. Thus, we avoid the use of particle filters for inference due to their computational cost. A comparison of the Unscented and Extended Kalman Filters is discussed in (Gustafsson and Hendeby, 2012). Smoothing results in posterior distributions for each time .

The inference procedure for Stanza is outlined in Algorithm 1

and mainly consists of nonlinear filtering and smoothing of the latent variables, followed by EM algorithm updates for the remaining parameters. First, we perform Unscented Filtering

(Eric A. Wan and Rudolph van der Merwe, 2000; Dan Simon, 2006) and Smoothing (SÄrkkÄ, 2008; Dan Simon, 2006) for the latent sequence . Specifically, we specify Sigma Points, a minimal set of sample points, for all lagged values for use in the Unscented Filter and Smoother to increase the fidelity of the approximation. We approximate the covariance matrix of these lagged values with a diagonal matrix of the filtered variances at previous time points. Then, we perform Extended Kalman Filtering (Ljung, 1979; Dan Simon, 2006) and regular RTS Smoothing (Rauch et al., 1965) for the latent weight vectors . There are no nonlinearities in the state equation for , so the Unscented Kalman Filter results in degenerate cross-covariance matrices, as the Sigma Points used in inference expect nonlinearities in the state space and observation space. Thus, we use the Extended Kalman Filter combined with the TVAR inference algorithm from (West et al., 1999), with corresponding first derivative corrections due to the nonlinearity.

We initialize these latent variables via a “warm start” approach, which ignores the nonlinearity in the state space. As filtering and smoothing depend on both and , it is important to start with reasonable values of the latent variables for fast and accurate inference in the Stanza model. We first initialize the sequence by the traditional, linear Kalman Filter and RTS Smoother with randomly sampled, time-invariant parameters. Then, we sample a latent trajectory from the smoothed estimates to initialize the sequence via the TVAR inference algorithm (West et al., 1999), which is similar and fully linear. This warm start approach gives reasonable trajectories of the state variables and to start the Stanza inference procedure.

As it can be difficult to learn or to fully specify time-varying variances like and , we use discount factors (West and Harrison, 1997) to aid in the specification of these latent variances. For a discount factor , between 0 and 1 (but for most practical models, ), we specify , where is the variance of from forward filtering (West and Harrison, 1997). The discount factor thus inflates the variance , relative to the filtered uncertainty and also represents a decay of information over time. A similar discount factor setup can be specified for . Lower discount factors allow for more volatility and adaptability of the filtered parameters (and hence forecast distributions), while higher discount factors result in much less inflation of the variance and less increase in the uncertainty from one time step to the next. Discount factors are thus directly interpretable in terms of the increase in the uncertainty and adaptability of the parameters for filtering from one time step to the next, and can thus either be specified by domain knowledge or by cross-validation. Discount factors are very general and can be extended to many settings; for a full discussion, see (West and Harrison, 1997). We specify discount factors and for and , respectively.

Finally, we learn the parameters and via the EM algorithm, following (Khan and Dutt, 2007), where these updates have closed forms and use the mean and variance of the smoothed from the Unscented Kalman Smoother. Alternatively, priors could be specified for these parameters and and could be learned via Gibbs Sampling. The simplicity of the emission distribution in Equation 1 allows for simple updates of and and also enables equally simple inference for these parameters whether is a univariate or multivariate sequence. The convergence criteria of Algorithm 1 can either be on the log-likelihood of the model or on the change in one of the EM parameters, or . Inference in Stanza is relatively fast and efficient; a full timing comparison with related approaches is discussed in Section 4 and full inference details are given in the Supplementary Materials.

There are three hyper-parameters to specify for Stanza: , the lagged dependence in the latent space, , the discount factor for and , the discount factor for . All of these parameters are directly interpretable and can either be specified directly via domain knowledge or selected using cross validation.

Result: Posterior distributions for and , estimates , , ,
Specify , and ;
Warm Start: Filter and smooth via the Kalman Filter and RTS Smoother; filter and smooth via the TVAR inference algorithm;
while While not converged do
       Filter via the Unscented Kalman Filter;
       Smooth via the Unscented Kalman Smoother;
       Filter via the Extended Kalman Filter; update and using discount factors;
       Smooth via the RTS Smoother;
       Update and via the EM Algorithm.  
end while
Algorithm 1 Stanza Inference

2.3 Forecasting

As in the basic DLM setting, forecasting in Stanza proceeds very naturally via simulation. At a high-level, forecasting via simulation proceeds similarly to the regular filtering procedure in Algorithm 1, but where values and are sampled from the one step-ahead forecast distribution and used as the “observed” values in propagating the state variables forwards in time. Specifically, for one-step ahead forecasting at time , we first propagate the estimate for through the state update equation via the Extended Kalman Filter, where the value for at time t is sampled from the prior distribution. Then, after completing the update from time to time for , we can sample a value of from the Extended Kalman Filter update, that is used to update the estimate for via the state update equation and the Unscented Kalman Filter. After propagating both of the state space sequences forwards in time, a value of can easily be sampled from the measurement update equations in the Unscented Kalman Filter. This facilitates probabilistic forecasting of and can be repeated to forecast steps ahead in time (West and Harrison, 1997). It is also possible to propagate only the forecast mean, for example, rather than the full forecast distribution, for computational efficiency.

3 Related Work

Long Short-Term Memory (LSTM) models (Hochreiter and Schmidhuber, 1997)

are widely used deep learning models for sequences or time series and address several of the training issues associated with recurrent neural networks

(Goodfellow et al., 2016), and have been applied in many different applications. However, LSTMs can still struggle to capture long-term structure in sequences and time series (Greaves-Tunnell and Harchaoui, 2019) and more recently, attention-based approaches (Vaswani et al., 2017; Chung et al., 2015)

have achieved state-of-the-art results for sequence modeling. While these models are capable of capturing long-term structure in time series, they require large amounts of data to train and are not necessarily suited to settings where probabilistic, interpretable inference is important or for settings in the small to medium data regime. Autoregressive models such as

(Sen et al., 2019) are additionally well suited to large amounts of data, but not necessarily for settings where interpretable, probabilistic inference is required.

Dynamic Linear Models (DLMs) (West and Harrison, 1997), which include Kalman Filters (Kalman, 1960), are linear models that apply probabilistic filtering and smoothing. Extensions include Dynamic Generalized Linear Models, which model from exponential family distributions (West and Harrison, 1997; Berry and West, 2019), and to include nonlinearities in the state space and observation equations (Dan Simon, 2006; Eric A. Wan and Rudolph van der Merwe, 2000; Valpola and Karhunen, 2002). Stanza extends ideas from TVAR models, in particular (West et al., 1999). DLMs and various extensions are discussed in detail in (Roweis and Ghahramani, 1999) and in a Bayesian time series setting in (West and Harrison, 1997).

Finally, there are several related approaches that try to address the range between full deep learning approaches and linear state space models, either by including probabilistic inference in deep learning approaches or by extending state space models to more complex settings. Deep learning focused models include Deep Kalman Filters (Krishnan et al., 2015), which use neural networks for the emission and transition distributions in the DLM specification, probabilistic recurrent state space models (Doerr et al., 2018), which combine RNN training with Gaussian processes and Gaussian process dynamic models (Wang et al., 2006) for motion capture data. Additional prior work for probabilistic time series focuses on different aspects of improving inference in complex models (Archer et al., 2015; Karl et al., 2017; Krishnan et al., 2017). Finally, there is recent work focusing specifically on improving forecasting using deep models to extend state space models, such as (Rangapuram et al., 2018; Salinas et al., 2019).

Related approaches that extend state space models directly to nonlinear or more complex settings include (Raiko and Tornio, 2009), which applies the variational inference approach proposed in (Valpola and Karhunen, 2002) to learn nonlinear state space models for model predictive control, rather than inference or forecasting, and (Alaa and van der Schaar, 2019), which uses a discrete state space and an attention mechanism learned via a sequence-to-sequence model to model disease progression.

Stanza, however, is distinct from these previously proposed approaches. Unlike (Raiko and Tornio, 2009; Rangapuram et al., 2018), Stanza is motivated by a desire to perform competitively for both inference and forecasting settings. Thus, Stanza is designed to be a versatile model that can perform well under a variety of different circumstances. Additionally, rather than applying state space ideas to deep learning approaches (Krishnan et al., 2015; Doerr et al., 2018; Rangapuram et al., 2018; Archer et al., 2015), Stanza incorporates ideas from deep learning into a state space framework. Similar to (Alaa and van der Schaar, 2019), Stanza creates dynamic dependencies in the latent space using dynamic weighting, rather than an attention mechanism with a discrete state space (Alaa and van der Schaar, 2019), to do so. Additionally, Stanza specifies complexity in the latent space and adds nonlinearities to improve performance.

4 Experiments

Stanza is designed for competitive performance in inference and forecasting settings. We demonstrate Stanza’s inference utility on two simulation experiments, both of which consider settings where Stanza is able to capture relevant features in the data, while traditional DLMs are not sufficiently rich to do so. We also demonstrate Stanza’s competitive forecasting accuracy on three real-world datasets.

4.1 Simulations

We first demonstrate the ability of Stanza to recover structure in time series. We consider a highly structured series (Figure 0(a) top with autocorrelation function (ACF) and partial autocorrelation function (PACF) plots Figure 0(b)). There is clear seasonality out to lag 6, so we can specify directly in the Stanza model. The resulting posterior mean weight matrix, is shown in the bottom of Figure 0(a). The weight corresponding to times is the largest across time, reflecting the importance of the seasonality evident in the original series. Additionally, the structure of the time series changes slightly between times 150-250 and 300-400, and this change is evident in decreased values of the weights at these same times (Figure 0(a) bottom). Additionally, we reproduce this structure when we simulate from from the learned Stanza model, see the ACF/PACF plots in Figure 0(b) right. Thus, for a highly structured time series, Stanza is able to recover key features of the structure and dependence of the series over time in the dynamic weight matrix , which can be directly interpreted in the context of the application.

(a) and .
(b) ACF and PACF plots.
Figure 1: For a highly structured, periodic time series (a, upper) with strong seasonality, as evidenced in the ACF/PACF plots (b, left), Stanza is able to recover this structure in the parameters. i.e. is important and changes when the pattern in changes (a, lower). Additionally, when generating from the learned Stanza model, the generated series is able to exhibit the same structure as the original series, in terms of the ACF and PACF (b, right).
(a) Posterior predictive mean.
(b) Coverage plots.
Figure 2: (a) Posterior predictive distribution for a Stanza(5) model fit to the trajectory for a Lorenz attractor. The posterior predictive mean, , agrees very well with the observed trajectory. (b) Forecast distribution coverage for step-ahead forecasts for the Stanza(5) model fit to the Lorenz attractor trajectory.

To demonstrate Stanza’s ability to model complex, nonlinear time series, we generate data from a Lorenz attractor (Lorenz, 1963), following (Garcia Satorras et al., 2019). A Lorenz attractor is defined by a system of three differential equations and is an example of a chaotic system. We simulate 1000 time points from a Lorenz attractor with a time delta of 0.01, and consider the trajectory of the first variable, . We then fit a Stanza(5) (i.e. ) model with high discount factors of 0.99 for and , indicating little volatility. The posterior predictive distribution for the learned Stanza model is shown in Figure 1(a). There is excellent correspondence between the posterior predictive mean and the observed Lorenz attractor trajectory, as well as good coverage for multi-step ahead forecasting (Figure 1(b)

). As Stanza performs probabilistic forecasting, we can calculate the coverage of the forecast distribution as the percentage of observed values that fall within specific percentiles of the forecast distribution. That is, we expect approximately 95% of the observed values to fall within the 95% credible intervals for the forecast distribution.

4.2 Forecasting

We additionally demonstrate Stanza’s competitive forecasting performance on three real-world datasets. Following (Salinas et al., 2019; Lai et al., 2017; Sen et al., 2019), we compare Stanza’s forecasting performance on the following publicly available datasets: Exchange Rate (Lai et al., 2017), the daily exchange rate for 8 different currencies from 1990-2011, and Electricity222, hourly electricity consumption for 370 customers (subset to the first 50 customers and the last 10000 time points). We additionally analyze the Weather dataset from (Chollet, 2017

; TensorFlow Tutorials,


, which contains 13 different weather measurements. We subset the Weather dataset to analyze the data from 10/01/2016 to 12/31/2016. For each forecasting experiment considered below, the last 1000 time points of each series are treated as test data and the preceding points as the training data. All series are individually normalized using the mean and standard deviation of the training set.

(a) Exchange Rate.
(b) Electricity.
Figure 3: Coverage for step-ahead forecasting on the (a) Exchange Rate and (b) Electricity datasets. Stanza achieves excellent coverage on both datasets.

We fit several models independently to each series in each dataset and compare forecasting accuracy with the root-mean squared error (RMSE) for , and step-ahead forecasting, where is the forecast horizon. We compare Stanza to two baseline models, a normal DLM with time-invariant parameters (West and Harrison, 1997) and a TVAR model (West et al., 1999; Prado and West, 2010), as well as two LSTMs, one with 2 hidden layers (LSTM(2)) and one with 5 hidden layers (LSTM(5)). For a direct comparison across series, we use cross-validation to tune the three hyper-parameters for Stanza (using one-step ahead forecasting RMSE as a metric) on each dataset and use the same hyper-parameters for all series for both the Stanza and TVAR models for consistency. As Exchange Rate is a daily dataset, the optimal value of for Stanza represents weekly seasonality, and Stanza is able to capture relevant structure in the data. To train the LSTMs, we use a rolling window approach, where the previous 50 data points are used to predict the next 10. The LSTMs are trained with the Adam optimizer on MSE loss. Additionally, as all models except for the LSTMs make probabilistic forecasts, we compare the 95% forecast coverage of the baseline models to Stanza.

The forecasting results for the individual series are given in Table 1. For all three datasets and across all three forecast horizons, Stanza achieves a good balance of high forecast accuracy and good empirical coverage (Figure 3). As compared to competing approaches, Stanza’s advantages are the ability to model multivariate time series (unlike TVAR), to produce probabilistic forecasts and thus coverage estimates (unlike deep LSTMs) and to model non-stationary time series (unlike DLMs). The forecasting accuracy of Stanza does not decay over longer forecasting horizons, and is competitive with the performance of the deep LSTMs across all forecast horizons. Additionally, LSTMs do not provide probabilistic forecast distributions and thus cannot be used to calculate coverage of the observed time series, while Stanza achieves excellent coverage (Figure 3), even compared to the DLM and TVAR models.

RMSE 95% Coverage
Dataset Model
Exchange Rate DLM
Electricity DLM
Weather DLM 97 97 97
TVAR(6) 95 96 96
Stanza(6) 99 98 97
Table 1: Forecast performance of Stanza as compared to several baseline models. Average RMSE and empirical 95% coverage, where the results are averaged across series. denotes the forecast horizon. The numbers in parentheses refer to the order of the model; Stanza(7) is a Stanza model where and LSTM(2) is an LSTM with two layers.

The TVAR model achieves very good forecasting accuracy, across the datasets. However, the TVAR model is not designed for multivariate series, while Stanza is easily extended to multivariate sequences. We fit Stanza to the full multivariate Exchange Rate and Weather datasets, due to their relatively low dimension. These results are given in Table 2. This extension to the full multivariate sequence further improves the forecasting ability of Stanza, especially compared to the TVAR models and the deep LSTMs in Table 1, on the Exchange Rate data.

Finally, we can look at the computational efficiency of the Stanza model. For the Exchange Rate data, the average time in seconds for training Stanza is 41 seconds, while for the LSTM(5) model is 222 seconds (averaged across series, run on a single core). While forecasting via simulation for Stanza is much slower than for the LSTM, this computational time could be reduced via propagating the mean only or via parallelization.

Dataset Model
Exchange Rate Stanza(7) 0.09 0.09 0.09
Weather Stanza(6) 0.81 0.91 0.92
Table 2: Forecast performance of Stanza on multivariate sequences.

5 Discussion

Stanza is motivated by a general goal of integrating modern machine learning approaches with traditional stochastic models. Specifically, we propose Stanza as a versatile, non-stationary state space model for competitive performance on inference and forecasting tasks. Stanza is able to achieve competitive forecasting accuracy with deep LSTMs, while still maintaining probabilistic estimates of the state space and interpretability of parameters. Stanza is suitable for settings where parameter interpretability is important for modeling complex time series and is suitable for either univariate or multivariate modeling. There is a long history of simple machine learning models, like random forests, outperforming deep models on certain tasks, and Stanza fits into this paradigm. Future directions include extending these ideas further in the direction of deep learning models, for example by adding additional depth to the latent space in the proposed Stanza model or extensions for higher-dimensional time series, for example with ideas from

(Sen et al., 2019; Salinas et al., 2019). Stanza is a step towards a powerful, general framework of incorporating deep learning ideas directly into traditional probabilistic models.


  • A. M. Alaa and M. van der Schaar (2019) Attentive state-space modeling of disease progression. In Advances in Neural Information Processing Systems 32, pp. 11334–11344. Cited by: §2.1, §3, §3.
  • E. Archer, I. M. Park, L. Buesing, J. Cunningham, and L. Paninski (2015) Black box variational inference for state space models. CoRR abs/1511.07367. External Links: 1511.07367, Link Cited by: §3, §3.
  • L. R. Berry and M. West (2019) Bayesian forecasting of many count-valued time series. Journal of Business and Economic Statistics. Cited by: §1, §2.1, §3.
  • F. Chollet (2017) Deep learning with python. Manning Publications. Cited by: Table 3, §4.2.
  • J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems 28, pp. 2980–2988. Cited by: §1, §3.
  • Dan Simon (2006) Optimal State Estimation. Wiley. Cited by: Appendix A, §B.1, §2.1, §2.2, §3.
  • A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe (2018) Probabilistic recurrent state-space models. CoRR abs/1801.10395. External Links: 1801.10395, Link Cited by: §3, §3.
  • Eric A. Wan and Rudolph van der Merwe (2000) The unscented kalman filter for nonlinear estimation. Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications and Control Symposium, pp. 153–158. Cited by: Appendix A, §1, §2.2, §2.2, §3.
  • M. A. R. Ferreira, M. West, H. K. H. Lee, and D. M. Higdon (2006) Multiscale and hidden resolution time series models. Bayesian Analysis 2, pp. 294–314. Cited by: §2.1.
  • V. Garcia Satorras, Z. Akata, and M. Welling (2019) Combining generative and discriminative models for hybrid inference. In Advances in Neural Information Processing Systems 32, pp. 13802–13812. Cited by: §B.3, §4.1.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §3.
  • A. Greaves-Tunnell and Z. Harchaoui (2019) A statistical investigation of long memory in language and music. CoRR abs/1904.03834. External Links: 1904.03834, Link Cited by: §3.
  • F. Gustafsson and G. Hendeby (2012) Some relations between extended and unscented kalman filters. IEEE Transactions on Signal Processing 60 (2), pp. 545–555. Cited by: §2.2.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1, §3.
  • R. E. Kalman (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45. Cited by: §B.1, §1, §3.
  • M. Karl, M. Sölch, J. Bayer, and P. van der Smagt (2017)

    Deep variational bayes filters: unsupervised learning of state space models from raw data

    In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §3.
  • M. E. Khan and D. N. Dutt (2007)

    An expectation-maximization algorithm based kalman smoother approach for event-related desynchronization (ERD) estimation from EEG

    IEEE Transactions on Biomedical Engineering 54 (7), pp. 1191 – 1198. Cited by: §2.2.
  • R. G. Krishnan, U. Shalit, and D. Sontag (2015) Deep kalman filters. CoRR abs/1511.05121. External Links: 1511.05121, Link Cited by: §3, §3.
  • R. G. Krishnan, U. Shalit, and D. Sontag (2017) Structured inference networks for nonlinear state space models. In

    Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence

    AAAI’17, pp. 2101–2109. Cited by: §3.
  • G. Lai, W. Chang, Y. Yang, and H. Liu (2017) Modeling long- and short-term temporal patterns with deep neural networks. CoRR abs/1703.07015. External Links: 1703.07015, Link Cited by: §C.1, Table 3, §1, §4.2.
  • L. Ljung (1979) Asymptotic behavior of the extended kalman filter as a parameter estimate for linear systems. IEEE Transactions on Automatic Control 24 (1), pp. 36–50. Cited by: Appendix A, §1, §2.2, §2.2.
  • E. N. Lorenz (1963) Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20, pp. 130–141. Cited by: §B.3, §4.1.
  • R. Prado and M. West (2010) Time Series: Modeling, Computation and Inference. CRC Press. Cited by: §4.2.
  • T. Raiko and M. Tornio (2009) Variational Bayesian learning of nonlinear hidden state-space models for model predictive control. Neurocomputing 72 (16-18), pp. 3704–3712. Cited by: §3, §3.
  • S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski (2018) Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems 31, pp. 7785–7794. Cited by: §3, §3.
  • H. E. Rauch, C. Striebel, and F. Tung (1965) Maximum likelihood estimates of linear dynamic systems. AIAA Journal 3 (8), pp. 1445–1450. Cited by: Appendix A, §2.2.
  • S. Roweis and Z. Ghahramani (1999) A unifying review of linear gaussian models. Neural Computation 11, pp. 305–345. Cited by: §3.
  • D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, and J. Gasthaus (2019) High-dimensional multivariate forecasting with low-rank gaussian copula processes. In Advances in Neural Information Processing Systems 32, pp. 6824–6834. Cited by: §C.1, §3, §4.2, §5.
  • S. SÄrkkÄ (2008) Unscented rauch–tung–striebel smoother. IEEE Transactions on Automatic Control 53 (3), pp. 845–849. Cited by: Appendix A, §2.2.
  • R. Sen, H. Yu, and I. Dhillon (2019) Think globally, act locally: a deep neural network approach to high-dimensional time series forecasting. CoRR abs/1905.03806. External Links: 1905.03806, Link Cited by: §C.1, §3, §4.2, §5.
  • TensorFlow Tutorials (2020) Time series forecasting. Note: Cited by: Table 3, §4.2.
  • H. Valpola and J. Karhunen (2002) An unsupervised ensemble learning method for nonlinear dynamic state-space models. Neural Computation 14 (11), pp. 2647–2692. Cited by: §3, §3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008. Cited by: §1, §3.
  • J. Wang, A. Hertzmann, and D. J. Fleet (2006) Gaussian process dynamical models. In Advances in Neural Information Processing Systems 18, pp. 1441–1448. Cited by: §3.
  • M. West and J. Harrison (1997) Bayesian Forecasting and Dynamic Models. Springer. Cited by: Appendix A, §1, §2.1, §2.2, §2.3, §3, §4.2.
  • M. West, R. Prado, and A. D. Krystal (1999) Evaluation and comparison of EEG Traces: latent structure in nonstationary time series. Journal of the American Statistical Association 94 (448), pp. 1083–1095. Cited by: Appendix A, §1, §2.1, §2.2, §2.2, §2, §3, §4.2.

Appendix A Inference

Let be a univariate or multivariate time series of length centered about 0, and a sequence of real-valued latent states. The Stanza model is specified in Equation 2:


where . Additionally, and , where we specify , where

is the identity matrix of dimension

. Additionally, in the TVAR model framework [West et al., 1999], we specify for the volatility for ; see [West et al., 1999, West and Harrison, 1997] for details.

Filtering and smoothing of proceeds via the Unscented Kalman Filter [Eric A. Wan and Rudolph van der Merwe, 2000] and Unscented Smoother [SÄrkkÄ, 2008, Dan Simon, 2006]. Inference for proceeds via the Extended Kalman Filtering [Ljung, 1979, Dan Simon, 2006] and regular RTS Smoothing [Rauch et al., 1965]. The EM updates for the parameters and are given in Equation 3 for univariate and Equation 4 for multivariate , where the expectations are obtained via the smoothing relationships for , as smoothing results in posterior distributions for each time .


Appendix B Simulation Experiments

Additional details about the Simulation Experiments in the main paper are discussed in this section.

b.1 Kalman Filter Recovery

Stanza can also be fit to time series that follow dynamics that are simpler than those assumed in the Stanza model. We generate data from a regular Kalman Filter with time-invariant parameters and then fit a Stanza(3) model with discount factors . The generated series and the generating latent sequence are shown in Figure 4. A time invariant Kalman Filter can be expressed as in Equation 5 [Kalman, 1960, Dan Simon, 2006]:


The filtered, , and smoothed, , means from Stanza are shown in Figure 5 and correspond well to the generating latent sequence, . Finally, the posterior predictive mean for is shown in Figure 6 with credible intervals. There is good correspondence between the observed, generated sequence and the Stanza model fit. Thus, Stanza is able to model simpler dynamics then specified in the full model well.

Figure 4: Latent, and observed, sequences generated from a time-invariant Kalman Filter. The parameters for this Kalman Filter are , , and .
Figure 5: Filtered () and smoothed () means from Stanza for the Kalman Filter generated data. The generating latent sequence is shown in black. The shaded region shows the 95% credible intervals using the smoothed mean and variance from the Unscented Kalman Filter.
Figure 6: True sequence and posterior predictive mean from the Stanza(3) model on the Kalman Filter Data. The shaded regions represent 95% credible intervals. There is good correspondence and coverage of the Stanza model fit.

b.2 Structure Recovery

A Stanza(6) model with is fit to highly periodic, structured data. The learned model parameters are then used to generate new data, which corresponds well in terms of temporal structure to the original data (Figure 7).

Figure 7: Left, original, highly periodic series plotted with ACF and PACF plots. There is clear seasonal structure, with period 6. Right, time series generated by a Stanza(6) model on the time series on the left. The structure in the ACF and PACF plots is recovered well by the Stanza(6) generated series.

b.3 Lorenz Attractor Details

To demonstrate Stanza’s ability to model complex, nonlinear time series, we generate data from a Lorenz attractor [Lorenz, 1963], following [Garcia Satorras et al., 2019]. A Lorenz attractor is defined by a system of three differential equations and is an example of a chaotic system. The Lorenz equations used to generate the data for this simulation are given in Equation 6. We simulate 1000 time points from a Lorenz attractor with a time delta of 0.01, and consider the trajectory of the first variable, . We then fit a Stanza(5) model with high discount factors of 0.99 for and , indicating little volatility.


Appendix C Forecasting Experiments

Additional details about the forecasting experiments considered in the main paper are included here.

c.1 Datasets

We demonstrate Stanza’s competitive forecasting ability on three real world, publicly available datasets, following [Salinas et al., 2019, Lai et al., 2017, Sen et al., 2019]. Additional details about each dataset are given in Table 3. All subsetting is performed for computational efficiency only. The Electricity333 data is subset to the first 50 customers and the last 10000 time points. The Weather data is subset for the time range 10/01/2016 to 12/31/2016. For all datasets and for each forecasting experiment considered below, the last 1000 time points of each series are treated as test data and the preceding points as the training data. All series are individually normalized using the mean and standard deviation of the training set. For the multivariate modeling, the Weather dataset is subset to 9 series.

Dataset Frequency Dimension Number of Time Points
Exchange Rate [Lai et al., 2017] Daily 8 7588
Electricity Hourly 50 10000
Weather [Chollet, 2017, TensorFlow Tutorials, 2020] 10 mins 14 12804
Table 3: Dataset details for the forecasting experiments.

c.2 Forecasting Experiment Hyper-Parameter Details

For the univariate forecasting experiment, the hyper-parameters for the Stanza models are tuned on an aggregate series for the Electricity and Exchange Rate datasets, and then the same hyper-parameters are used for modeling all the individual series for the Stanza and TVAR models. The hyper-parameters are selected based on which combination gives the best 1-step ahead forecasting RMSE, using a grid of values. The hyper-parameters for the Electricity data are and for the Exchange Rate data are . For the Weather dataset, as an aggregate series does not make sense (each individual series measures a different physical quantity on a different scale), we manually select (corresponding to hourly structure in the data) and , allowing for a relatively large amount of volatility in the measurements.

c.3 Timing

Full timing details for training and forecasting on the Exchange Rate data are given in Table 4.

Dataset Model Training Forecasting
Exchange Rate DLM 1.9 37.4
TVAR(7) 0.57 124.9
LSTM(2) 214.8 1.4
LSTM(5) 222.1 1.4
Stanza(7) 41.8 669.9
Table 4: Time in seconds for the training and forecasting of each model on a single core.