1 Introduction
Time series with longterm structure arise in a variety of contexts and capturing this temporal structure is a critical challenge in time series analysis for both inference and forecasting. Traditionally, state space models like Kalman Filters
(Kalman, 1960), Dynamic Linear Models (DLMs) (West and Harrison, 1997) and their numerous extensions (Ljung, 1979; Eric A. Wan and Rudolph van der Merwe, 2000) have been utilized for probabilistic inference in time series analysis. This general class of state space models has been very successful for modeling systems where the dynamics are approximately linear and relatively simple. Alternatively, deep learning, attentionbased models (Vaswani et al., 2017; Chung et al., 2015) have achieved stateoftheart forecasting performance in modeling longerterm structure in time series and sequences, especially for applications relating to language. However, these methods often require large amounts of training data.Traditional Kalman Filters and DLMs are very good for inference tasks, especially in settings where probabilistic filtering and smoothing of the state space is important. However, Kalman Filters often are not competitive for forecasting, especially for data that does not meet the model specifications. On the other hand, deep learning models like Long ShortTerm Memory (LSTM) models
(Hochreiter and Schmidhuber, 1997) achieve excellent forecasting accuracy in many settings, but cannot be used for inference tasks, as filtering and smoothing is done implicitly and deterministically in the model. Thus, there still remains a gap between time series models designed for inference and those that achieve competitive forecasting accuracy.We propose Stanza, a State Space Model with Nonlinear Activations and Dynamically Weighted Latent (Z) States, as an intermediate approach to alleviate this tradeoff between inference and forecasting. Our motivation is to fill the gap between large deep learning, attentionbased models and traditional state space models. Stanza achieves forecasting accuracy comparable to deep LSTMs on three realworld datasets and is able to perform inference for complex, nonlinear, nonstationary time series. Furthermore, Stanza is fully probabilistic in the state space, allowing for straightforward multistep ahead forecasts out to long time horizons and provides forecast coverage estimates which are not possible in traditional LSTMs. Stanza incorporates ideas from deep learning into traditional state space models, resulting in a nonlinear, nonstationary model capable of performing inference in more general and complex settings, as compared to classical state space models. We design Stanza specifically for settings where there is not enough data for a full attentionbased model (i.e. the small to medium data regime) and/or where the model needs to be interpretable and uncertainty in the inferential task needs to be quantified.
The optimal setting for Stanza is quite general and occurs in many different application areas. For example, in many medical settings, the quantity of data may be limited, yet it is critical to have an interpretable and probabilistic model for inference tasks and simple DLMs may not be rich enough to capture the complexity in the data, for example, with EEG signals (West et al., 1999). In retail sales settings, there may be an abundance of data, yet the goal may be to understand seasonality patterns in consumer behavior, necessitating a model with interpretable parameters that can either uncover structure in the data or specifically incorporate known structure into the model (Berry and West, 2019). Finally, even in financial settings where forecasting is more important than inference, it may be the case that there is not enough data to train a full deep learning, attentionbased model, yet competitive forecasting accuracy is still required (for example, in the case of the Exchange Rate data considered in (Lai et al., 2017)). Stanza is motivated by a need for powerful inference and forecasting.
Our contributions are as follows:

Stanza: We propose Stanza as a nonlinear, nonstationary, interpretable time series model that gives full uncertainty estimates for parameters.

Versatility: We demonstrate Stanza’s ability to model complex and highly structured time series and provide interpretable parameters with uncertainty estimates for inference. We also demonstrate that Stanza is competitive with deep LSTMs in terms of forecasting accuracy on realworld datasets, particularly for multistep ahead forecasts.

Intermediate Approach: Stanza incorporates ideas from deep learning based models directly into the state space model framework.
2 Stanza: a State Space Model with Nonlinear Activations and Dynamically Weighted Latent (Z) States
Stanza is designed for competitive performance on both inference and forecasting tasks, specifically for settings with limited data and/or where parameter uncertainty estimates and interpretable models are critical. Building off ideas from traditional state space models, particularly TimeVarying Autoregressive (TVAR) models (West et al., 1999), Stanza combines ideas from deep learning approaches into the state space model framework for an intermediate approach that balances advantages of probabilistic inference and competitive forecasting accuracy.
2.1 Model
Let be a univariate or multivariate time series of length centered about 0, and a sequence of realvalued latent states. The Stanza model is specified in Equation 1:
(1) 
where . The dynamic weights, , allow for nonstationarity in the modeling of the time series and can directly capture structure in the observed time series.^{1}^{1}1Note: for clarity of notation, when
, there are 0s in the weight vector
for the elements corresponding to times . For example, if and , then . The hyperparameter determines the order of the lag dependence in the state space and can be specified directly with knowledge about the application. For example, for daily time series data, if there is known weekly seasonality, then can be set to 7 and the dynamic weights, are expected to play an important role in the modeling of the time series. The lag dependence can also be specified via crossvalidation, allowing for structure to be learned from the time series. The lagged dependence enables series with longer term structure to be modeled as compared to the standard DLM or Kalman Filter, which utilize only a firstorder Markov assumption in the latent space. The function is the hyperbolic tangent function, which helps prevent the values of from increasing without bound and also aids in constraining the model for identifiability. This specification extends the basic TVAR framework of (West et al., 1999) with ideas from attentionbased models, such as (Alaa and van der Schaar, 2019), where the complexity of the model is captured in the state space, rather than the observed space, via dynamic weights. In the case of Stanza, these dynamic weights, can be directly interpreted in terms of structure in the observed time series.Similar to deep models, Stanza specifies the model complexity in the latent space, rather than the emission distribution, which is kept simple. This setup allows for flexibility to model either univariate or multivariate series
within the same framework due to the simplicity of the emission distribution. This is in contrast to TVAR models, which would require matrixnormal distributions to model multivariate sequences. Additionally, in contrast to LSTMs, Stanza allows for probabilistic, interpretable inference, while still allowing for longterm dependencies, specified by
. An external latent sequence can optionally be added to the state space to serve as an external input signal, as often used in traditional Kalman Filters (Dan Simon, 2006) or to be used as a multiscale latent factor in a decouplerecouple setting (Berry and West, 2019; Ferreira et al., 2006; West and Harrison, 1997).2.2 Inference
Filtering and smoothing of the latent space are critical to inference in DLMs and Kalman filters, and inference in Stanza proceeds similarly. The Extended Kalman Filter (Ljung, 1979) and Unscented Kalman Filter (Eric A. Wan and Rudolph van der Merwe, 2000)
are extensions to the forward filtering algorithms in the standard Kalman Filter for estimation in the nonlinear setting. We utilize both approaches in the filtering of parameters for Stanza. The Unscented Kalman Filter uses a deterministic sampling approach to approximate the mean and variance of the state variables and can capture the mean and variance of the filtered latent state variable out to the third order
(Eric A. Wan and Rudolph van der Merwe, 2000). The Extended Kalman Filter uses a first order Taylor approximation of the nonlinearity to update the traditional Kalman Filter (Ljung, 1979). Stanza is motivated as a competitive approach to existing deep and state space models and we thus want inference to be relatively fast and efficient. Thus, we avoid the use of particle filters for inference due to their computational cost. A comparison of the Unscented and Extended Kalman Filters is discussed in (Gustafsson and Hendeby, 2012). Smoothing results in posterior distributions for each time .The inference procedure for Stanza is outlined in Algorithm 1
and mainly consists of nonlinear filtering and smoothing of the latent variables, followed by EM algorithm updates for the remaining parameters. First, we perform Unscented Filtering
(Eric A. Wan and Rudolph van der Merwe, 2000; Dan Simon, 2006) and Smoothing (SÄrkkÄ, 2008; Dan Simon, 2006) for the latent sequence . Specifically, we specify Sigma Points, a minimal set of sample points, for all lagged values for use in the Unscented Filter and Smoother to increase the fidelity of the approximation. We approximate the covariance matrix of these lagged values with a diagonal matrix of the filtered variances at previous time points. Then, we perform Extended Kalman Filtering (Ljung, 1979; Dan Simon, 2006) and regular RTS Smoothing (Rauch et al., 1965) for the latent weight vectors . There are no nonlinearities in the state equation for , so the Unscented Kalman Filter results in degenerate crosscovariance matrices, as the Sigma Points used in inference expect nonlinearities in the state space and observation space. Thus, we use the Extended Kalman Filter combined with the TVAR inference algorithm from (West et al., 1999), with corresponding first derivative corrections due to the nonlinearity.We initialize these latent variables via a “warm start” approach, which ignores the nonlinearity in the state space. As filtering and smoothing depend on both and , it is important to start with reasonable values of the latent variables for fast and accurate inference in the Stanza model. We first initialize the sequence by the traditional, linear Kalman Filter and RTS Smoother with randomly sampled, timeinvariant parameters. Then, we sample a latent trajectory from the smoothed estimates to initialize the sequence via the TVAR inference algorithm (West et al., 1999), which is similar and fully linear. This warm start approach gives reasonable trajectories of the state variables and to start the Stanza inference procedure.
As it can be difficult to learn or to fully specify timevarying variances like and , we use discount factors (West and Harrison, 1997) to aid in the specification of these latent variances. For a discount factor , between 0 and 1 (but for most practical models, ), we specify , where is the variance of from forward filtering (West and Harrison, 1997). The discount factor thus inflates the variance , relative to the filtered uncertainty and also represents a decay of information over time. A similar discount factor setup can be specified for . Lower discount factors allow for more volatility and adaptability of the filtered parameters (and hence forecast distributions), while higher discount factors result in much less inflation of the variance and less increase in the uncertainty from one time step to the next. Discount factors are thus directly interpretable in terms of the increase in the uncertainty and adaptability of the parameters for filtering from one time step to the next, and can thus either be specified by domain knowledge or by crossvalidation. Discount factors are very general and can be extended to many settings; for a full discussion, see (West and Harrison, 1997). We specify discount factors and for and , respectively.
Finally, we learn the parameters and via the EM algorithm, following (Khan and Dutt, 2007), where these updates have closed forms and use the mean and variance of the smoothed from the Unscented Kalman Smoother. Alternatively, priors could be specified for these parameters and and could be learned via Gibbs Sampling. The simplicity of the emission distribution in Equation 1 allows for simple updates of and and also enables equally simple inference for these parameters whether is a univariate or multivariate sequence. The convergence criteria of Algorithm 1 can either be on the loglikelihood of the model or on the change in one of the EM parameters, or . Inference in Stanza is relatively fast and efficient; a full timing comparison with related approaches is discussed in Section 4 and full inference details are given in the Supplementary Materials.
There are three hyperparameters to specify for Stanza: , the lagged dependence in the latent space, , the discount factor for and , the discount factor for . All of these parameters are directly interpretable and can either be specified directly via domain knowledge or selected using cross validation.
2.3 Forecasting
As in the basic DLM setting, forecasting in Stanza proceeds very naturally via simulation. At a highlevel, forecasting via simulation proceeds similarly to the regular filtering procedure in Algorithm 1, but where values and are sampled from the one stepahead forecast distribution and used as the “observed” values in propagating the state variables forwards in time. Specifically, for onestep ahead forecasting at time , we first propagate the estimate for through the state update equation via the Extended Kalman Filter, where the value for at time t is sampled from the prior distribution. Then, after completing the update from time to time for , we can sample a value of from the Extended Kalman Filter update, that is used to update the estimate for via the state update equation and the Unscented Kalman Filter. After propagating both of the state space sequences forwards in time, a value of can easily be sampled from the measurement update equations in the Unscented Kalman Filter. This facilitates probabilistic forecasting of and can be repeated to forecast steps ahead in time (West and Harrison, 1997). It is also possible to propagate only the forecast mean, for example, rather than the full forecast distribution, for computational efficiency.
3 Related Work
Long ShortTerm Memory (LSTM) models (Hochreiter and Schmidhuber, 1997)
are widely used deep learning models for sequences or time series and address several of the training issues associated with recurrent neural networks
(Goodfellow et al., 2016), and have been applied in many different applications. However, LSTMs can still struggle to capture longterm structure in sequences and time series (GreavesTunnell and Harchaoui, 2019) and more recently, attentionbased approaches (Vaswani et al., 2017; Chung et al., 2015)have achieved stateoftheart results for sequence modeling. While these models are capable of capturing longterm structure in time series, they require large amounts of data to train and are not necessarily suited to settings where probabilistic, interpretable inference is important or for settings in the small to medium data regime. Autoregressive models such as
(Sen et al., 2019) are additionally well suited to large amounts of data, but not necessarily for settings where interpretable, probabilistic inference is required.Dynamic Linear Models (DLMs) (West and Harrison, 1997), which include Kalman Filters (Kalman, 1960), are linear models that apply probabilistic filtering and smoothing. Extensions include Dynamic Generalized Linear Models, which model from exponential family distributions (West and Harrison, 1997; Berry and West, 2019), and to include nonlinearities in the state space and observation equations (Dan Simon, 2006; Eric A. Wan and Rudolph van der Merwe, 2000; Valpola and Karhunen, 2002). Stanza extends ideas from TVAR models, in particular (West et al., 1999). DLMs and various extensions are discussed in detail in (Roweis and Ghahramani, 1999) and in a Bayesian time series setting in (West and Harrison, 1997).
Finally, there are several related approaches that try to address the range between full deep learning approaches and linear state space models, either by including probabilistic inference in deep learning approaches or by extending state space models to more complex settings. Deep learning focused models include Deep Kalman Filters (Krishnan et al., 2015), which use neural networks for the emission and transition distributions in the DLM specification, probabilistic recurrent state space models (Doerr et al., 2018), which combine RNN training with Gaussian processes and Gaussian process dynamic models (Wang et al., 2006) for motion capture data. Additional prior work for probabilistic time series focuses on different aspects of improving inference in complex models (Archer et al., 2015; Karl et al., 2017; Krishnan et al., 2017). Finally, there is recent work focusing specifically on improving forecasting using deep models to extend state space models, such as (Rangapuram et al., 2018; Salinas et al., 2019).
Related approaches that extend state space models directly to nonlinear or more complex settings include (Raiko and Tornio, 2009), which applies the variational inference approach proposed in (Valpola and Karhunen, 2002) to learn nonlinear state space models for model predictive control, rather than inference or forecasting, and (Alaa and van der Schaar, 2019), which uses a discrete state space and an attention mechanism learned via a sequencetosequence model to model disease progression.
Stanza, however, is distinct from these previously proposed approaches. Unlike (Raiko and Tornio, 2009; Rangapuram et al., 2018), Stanza is motivated by a desire to perform competitively for both inference and forecasting settings. Thus, Stanza is designed to be a versatile model that can perform well under a variety of different circumstances. Additionally, rather than applying state space ideas to deep learning approaches (Krishnan et al., 2015; Doerr et al., 2018; Rangapuram et al., 2018; Archer et al., 2015), Stanza incorporates ideas from deep learning into a state space framework. Similar to (Alaa and van der Schaar, 2019), Stanza creates dynamic dependencies in the latent space using dynamic weighting, rather than an attention mechanism with a discrete state space (Alaa and van der Schaar, 2019), to do so. Additionally, Stanza specifies complexity in the latent space and adds nonlinearities to improve performance.
4 Experiments
Stanza is designed for competitive performance in inference and forecasting settings. We demonstrate Stanza’s inference utility on two simulation experiments, both of which consider settings where Stanza is able to capture relevant features in the data, while traditional DLMs are not sufficiently rich to do so. We also demonstrate Stanza’s competitive forecasting accuracy on three realworld datasets.
4.1 Simulations
We first demonstrate the ability of Stanza to recover structure in time series. We consider a highly structured series (Figure 0(a) top with autocorrelation function (ACF) and partial autocorrelation function (PACF) plots Figure 0(b)). There is clear seasonality out to lag 6, so we can specify directly in the Stanza model. The resulting posterior mean weight matrix, is shown in the bottom of Figure 0(a). The weight corresponding to times is the largest across time, reflecting the importance of the seasonality evident in the original series. Additionally, the structure of the time series changes slightly between times 150250 and 300400, and this change is evident in decreased values of the weights at these same times (Figure 0(a) bottom). Additionally, we reproduce this structure when we simulate from from the learned Stanza model, see the ACF/PACF plots in Figure 0(b) right. Thus, for a highly structured time series, Stanza is able to recover key features of the structure and dependence of the series over time in the dynamic weight matrix , which can be directly interpreted in the context of the application.
To demonstrate Stanza’s ability to model complex, nonlinear time series, we generate data from a Lorenz attractor (Lorenz, 1963), following (Garcia Satorras et al., 2019). A Lorenz attractor is defined by a system of three differential equations and is an example of a chaotic system. We simulate 1000 time points from a Lorenz attractor with a time delta of 0.01, and consider the trajectory of the first variable, . We then fit a Stanza(5) (i.e. ) model with high discount factors of 0.99 for and , indicating little volatility. The posterior predictive distribution for the learned Stanza model is shown in Figure 1(a). There is excellent correspondence between the posterior predictive mean and the observed Lorenz attractor trajectory, as well as good coverage for multistep ahead forecasting (Figure 1(b)
). As Stanza performs probabilistic forecasting, we can calculate the coverage of the forecast distribution as the percentage of observed values that fall within specific percentiles of the forecast distribution. That is, we expect approximately 95% of the observed values to fall within the 95% credible intervals for the forecast distribution.
4.2 Forecasting
We additionally demonstrate Stanza’s competitive forecasting performance on three realworld datasets. Following (Salinas et al., 2019; Lai et al., 2017; Sen et al., 2019), we compare Stanza’s forecasting performance on the following publicly available datasets: Exchange Rate (Lai et al., 2017), the daily exchange rate for 8 different currencies from 19902011, and Electricity^{2}^{2}2https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, hourly electricity consumption for 370 customers (subset to the first 50 customers and the last 10000 time points). We additionally analyze the Weather dataset from (Chollet, 2017
; TensorFlow Tutorials,
2020), which contains 13 different weather measurements. We subset the Weather dataset to analyze the data from 10/01/2016 to 12/31/2016. For each forecasting experiment considered below, the last 1000 time points of each series are treated as test data and the preceding points as the training data. All series are individually normalized using the mean and standard deviation of the training set.
We fit several models independently to each series in each dataset and compare forecasting accuracy with the rootmean squared error (RMSE) for , and stepahead forecasting, where is the forecast horizon. We compare Stanza to two baseline models, a normal DLM with timeinvariant parameters (West and Harrison, 1997) and a TVAR model (West et al., 1999; Prado and West, 2010), as well as two LSTMs, one with 2 hidden layers (LSTM(2)) and one with 5 hidden layers (LSTM(5)). For a direct comparison across series, we use crossvalidation to tune the three hyperparameters for Stanza (using onestep ahead forecasting RMSE as a metric) on each dataset and use the same hyperparameters for all series for both the Stanza and TVAR models for consistency. As Exchange Rate is a daily dataset, the optimal value of for Stanza represents weekly seasonality, and Stanza is able to capture relevant structure in the data. To train the LSTMs, we use a rolling window approach, where the previous 50 data points are used to predict the next 10. The LSTMs are trained with the Adam optimizer on MSE loss. Additionally, as all models except for the LSTMs make probabilistic forecasts, we compare the 95% forecast coverage of the baseline models to Stanza.
The forecasting results for the individual series are given in Table 1. For all three datasets and across all three forecast horizons, Stanza achieves a good balance of high forecast accuracy and good empirical coverage (Figure 3). As compared to competing approaches, Stanza’s advantages are the ability to model multivariate time series (unlike TVAR), to produce probabilistic forecasts and thus coverage estimates (unlike deep LSTMs) and to model nonstationary time series (unlike DLMs). The forecasting accuracy of Stanza does not decay over longer forecasting horizons, and is competitive with the performance of the deep LSTMs across all forecast horizons. Additionally, LSTMs do not provide probabilistic forecast distributions and thus cannot be used to calculate coverage of the observed time series, while Stanza achieves excellent coverage (Figure 3), even compared to the DLM and TVAR models.
RMSE  95% Coverage  

Dataset  Model  
Exchange Rate  DLM  
TVAR(7)  
LSTM(2)  —  —  —  
LSTM(5)  —  —  —  
Stanza(7)  
Electricity  DLM  
TVAR(5)  
LSTM(2)  —  —  —  
LSTM(5)  —  —  —  
Stanza(5)  
Weather  DLM  97  97  97  
TVAR(6)  95  96  96  
LSTM(2)  —  —  —  
LSTM(5)  —  —  —  
Stanza(6)  99  98  97 
The TVAR model achieves very good forecasting accuracy, across the datasets. However, the TVAR model is not designed for multivariate series, while Stanza is easily extended to multivariate sequences. We fit Stanza to the full multivariate Exchange Rate and Weather datasets, due to their relatively low dimension. These results are given in Table 2. This extension to the full multivariate sequence further improves the forecasting ability of Stanza, especially compared to the TVAR models and the deep LSTMs in Table 1, on the Exchange Rate data.
Finally, we can look at the computational efficiency of the Stanza model. For the Exchange Rate data, the average time in seconds for training Stanza is 41 seconds, while for the LSTM(5) model is 222 seconds (averaged across series, run on a single core). While forecasting via simulation for Stanza is much slower than for the LSTM, this computational time could be reduced via propagating the mean only or via parallelization.
RMSE  

Dataset  Model  
Exchange Rate  Stanza(7)  0.09  0.09  0.09 
Weather  Stanza(6)  0.81  0.91  0.92 
5 Discussion
Stanza is motivated by a general goal of integrating modern machine learning approaches with traditional stochastic models. Specifically, we propose Stanza as a versatile, nonstationary state space model for competitive performance on inference and forecasting tasks. Stanza is able to achieve competitive forecasting accuracy with deep LSTMs, while still maintaining probabilistic estimates of the state space and interpretability of parameters. Stanza is suitable for settings where parameter interpretability is important for modeling complex time series and is suitable for either univariate or multivariate modeling. There is a long history of simple machine learning models, like random forests, outperforming deep models on certain tasks, and Stanza fits into this paradigm. Future directions include extending these ideas further in the direction of deep learning models, for example by adding additional depth to the latent space in the proposed Stanza model or extensions for higherdimensional time series, for example with ideas from
(Sen et al., 2019; Salinas et al., 2019). Stanza is a step towards a powerful, general framework of incorporating deep learning ideas directly into traditional probabilistic models.References
 Attentive statespace modeling of disease progression. In Advances in Neural Information Processing Systems 32, pp. 11334–11344. Cited by: §2.1, §3, §3.
 Black box variational inference for state space models. CoRR abs/1511.07367. External Links: 1511.07367, Link Cited by: §3, §3.
 Bayesian forecasting of many countvalued time series. Journal of Business and Economic Statistics. Cited by: §1, §2.1, §3.
 Deep learning with python. Manning Publications. Cited by: Table 3, §4.2.
 A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems 28, pp. 2980–2988. Cited by: §1, §3.
 Optimal State Estimation. Wiley. Cited by: Appendix A, §B.1, §2.1, §2.2, §3.
 Probabilistic recurrent statespace models. CoRR abs/1801.10395. External Links: 1801.10395, Link Cited by: §3, §3.
 The unscented kalman filter for nonlinear estimation. Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications and Control Symposium, pp. 153–158. Cited by: Appendix A, §1, §2.2, §2.2, §3.
 Multiscale and hidden resolution time series models. Bayesian Analysis 2, pp. 294–314. Cited by: §2.1.
 Combining generative and discriminative models for hybrid inference. In Advances in Neural Information Processing Systems 32, pp. 13802–13812. Cited by: §B.3, §4.1.
 Deep learning. MIT Press. Cited by: §3.
 A statistical investigation of long memory in language and music. CoRR abs/1904.03834. External Links: 1904.03834, Link Cited by: §3.
 Some relations between extended and unscented kalman filters. IEEE Transactions on Signal Processing 60 (2), pp. 545–555. Cited by: §2.2.
 Long shortterm memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1, §3.
 A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82 (1), pp. 35–45. Cited by: §B.1, §1, §3.

Deep variational bayes filters: unsupervised learning of state space models from raw data
. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 2426, 2017, Conference Track Proceedings, External Links: Link Cited by: §3. 
An expectationmaximization algorithm based kalman smoother approach for eventrelated desynchronization (ERD) estimation from EEG
. IEEE Transactions on Biomedical Engineering 54 (7), pp. 1191 – 1198. Cited by: §2.2.  Deep kalman filters. CoRR abs/1511.05121. External Links: 1511.05121, Link Cited by: §3, §3.

Structured inference networks for nonlinear state space models.
In
Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence
, AAAI’17, pp. 2101–2109. Cited by: §3.  Modeling long and shortterm temporal patterns with deep neural networks. CoRR abs/1703.07015. External Links: 1703.07015, Link Cited by: §C.1, Table 3, §1, §4.2.
 Asymptotic behavior of the extended kalman filter as a parameter estimate for linear systems. IEEE Transactions on Automatic Control 24 (1), pp. 36–50. Cited by: Appendix A, §1, §2.2, §2.2.
 Deterministic nonperiodic flow. Journal of the Atmospheric Sciences 20, pp. 130–141. Cited by: §B.3, §4.1.
 Time Series: Modeling, Computation and Inference. CRC Press. Cited by: §4.2.
 Variational Bayesian learning of nonlinear hidden statespace models for model predictive control. Neurocomputing 72 (1618), pp. 3704–3712. Cited by: §3, §3.
 Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems 31, pp. 7785–7794. Cited by: §3, §3.
 Maximum likelihood estimates of linear dynamic systems. AIAA Journal 3 (8), pp. 1445–1450. Cited by: Appendix A, §2.2.
 A unifying review of linear gaussian models. Neural Computation 11, pp. 305–345. Cited by: §3.
 Highdimensional multivariate forecasting with lowrank gaussian copula processes. In Advances in Neural Information Processing Systems 32, pp. 6824–6834. Cited by: §C.1, §3, §4.2, §5.
 Unscented rauch–tung–striebel smoother. IEEE Transactions on Automatic Control 53 (3), pp. 845–849. Cited by: Appendix A, §2.2.
 Think globally, act locally: a deep neural network approach to highdimensional time series forecasting. CoRR abs/1905.03806. External Links: 1905.03806, Link Cited by: §C.1, §3, §4.2, §5.
 Time series forecasting. Note: https://www.tensorflow.org/tutorials/structured_data/time_series Cited by: Table 3, §4.2.
 An unsupervised ensemble learning method for nonlinear dynamic statespace models. Neural Computation 14 (11), pp. 2647–2692. Cited by: §3, §3.
 Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008. Cited by: §1, §3.
 Gaussian process dynamical models. In Advances in Neural Information Processing Systems 18, pp. 1441–1448. Cited by: §3.
 Bayesian Forecasting and Dynamic Models. Springer. Cited by: Appendix A, §1, §2.1, §2.2, §2.3, §3, §4.2.
 Evaluation and comparison of EEG Traces: latent structure in nonstationary time series. Journal of the American Statistical Association 94 (448), pp. 1083–1095. Cited by: Appendix A, §1, §2.1, §2.2, §2.2, §2, §3, §4.2.
Appendix A Inference
Let be a univariate or multivariate time series of length centered about 0, and a sequence of realvalued latent states. The Stanza model is specified in Equation 2:
(2) 
where . Additionally, and , where we specify , where
is the identity matrix of dimension
. Additionally, in the TVAR model framework [West et al., 1999], we specify for the volatility for ; see [West et al., 1999, West and Harrison, 1997] for details.Filtering and smoothing of proceeds via the Unscented Kalman Filter [Eric A. Wan and Rudolph van der Merwe, 2000] and Unscented Smoother [SÄrkkÄ, 2008, Dan Simon, 2006]. Inference for proceeds via the Extended Kalman Filtering [Ljung, 1979, Dan Simon, 2006] and regular RTS Smoothing [Rauch et al., 1965]. The EM updates for the parameters and are given in Equation 3 for univariate and Equation 4 for multivariate , where the expectations are obtained via the smoothing relationships for , as smoothing results in posterior distributions for each time .
(3) 
(4) 
Appendix B Simulation Experiments
Additional details about the Simulation Experiments in the main paper are discussed in this section.
b.1 Kalman Filter Recovery
Stanza can also be fit to time series that follow dynamics that are simpler than those assumed in the Stanza model. We generate data from a regular Kalman Filter with timeinvariant parameters and then fit a Stanza(3) model with discount factors . The generated series and the generating latent sequence are shown in Figure 4. A time invariant Kalman Filter can be expressed as in Equation 5 [Kalman, 1960, Dan Simon, 2006]:
(5) 
The filtered, , and smoothed, , means from Stanza are shown in Figure 5 and correspond well to the generating latent sequence, . Finally, the posterior predictive mean for is shown in Figure 6 with credible intervals. There is good correspondence between the observed, generated sequence and the Stanza model fit. Thus, Stanza is able to model simpler dynamics then specified in the full model well.
b.2 Structure Recovery
A Stanza(6) model with is fit to highly periodic, structured data. The learned model parameters are then used to generate new data, which corresponds well in terms of temporal structure to the original data (Figure 7).
b.3 Lorenz Attractor Details
To demonstrate Stanza’s ability to model complex, nonlinear time series, we generate data from a Lorenz attractor [Lorenz, 1963], following [Garcia Satorras et al., 2019]. A Lorenz attractor is defined by a system of three differential equations and is an example of a chaotic system. The Lorenz equations used to generate the data for this simulation are given in Equation 6. We simulate 1000 time points from a Lorenz attractor with a time delta of 0.01, and consider the trajectory of the first variable, . We then fit a Stanza(5) model with high discount factors of 0.99 for and , indicating little volatility.
(6) 
Appendix C Forecasting Experiments
Additional details about the forecasting experiments considered in the main paper are included here.
c.1 Datasets
We demonstrate Stanza’s competitive forecasting ability on three real world, publicly available datasets, following [Salinas et al., 2019, Lai et al., 2017, Sen et al., 2019]. Additional details about each dataset are given in Table 3. All subsetting is performed for computational efficiency only. The Electricity^{3}^{3}3https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 data is subset to the first 50 customers and the last 10000 time points. The Weather data is subset for the time range 10/01/2016 to 12/31/2016. For all datasets and for each forecasting experiment considered below, the last 1000 time points of each series are treated as test data and the preceding points as the training data. All series are individually normalized using the mean and standard deviation of the training set. For the multivariate modeling, the Weather dataset is subset to 9 series.
c.2 Forecasting Experiment HyperParameter Details
For the univariate forecasting experiment, the hyperparameters for the Stanza models are tuned on an aggregate series for the Electricity and Exchange Rate datasets, and then the same hyperparameters are used for modeling all the individual series for the Stanza and TVAR models. The hyperparameters are selected based on which combination gives the best 1step ahead forecasting RMSE, using a grid of values. The hyperparameters for the Electricity data are and for the Exchange Rate data are . For the Weather dataset, as an aggregate series does not make sense (each individual series measures a different physical quantity on a different scale), we manually select (corresponding to hourly structure in the data) and , allowing for a relatively large amount of volatility in the measurements.
c.3 Timing
Full timing details for training and forecasting on the Exchange Rate data are given in Table 4.
Dataset  Model  Training  Forecasting 

Exchange Rate  DLM  1.9  37.4 
TVAR(7)  0.57  124.9  
LSTM(2)  214.8  1.4  
LSTM(5)  222.1  1.4  
Stanza(7)  41.8  669.9 