Deep Probabilistic Time Series Forecasting using Augmented Recurrent Input for Dynamic Systems

by   Haitao Liu, et al.
Dalian University of Technology

The demand of probabilistic time series forecasting has been recently raised in various dynamic system scenarios, for example, system identification and prognostic and health management of machines. To this end, we combine the advances in both deep generative models and state space model (SSM) to come up with a novel, data-driven deep probabilistic sequence model. Specially, we follow the popular encoder-decoder generative structure to build the recurrent neural networks (RNN) assisted variational sequence model on an augmented recurrent input space, which could induce rich stochastic sequence dependency. Besides, in order to alleviate the issue of inconsistency between training and predicting as well as improving the mining of dynamic patterns, we (i) propose using a hybrid output as input at next time step, which brings training and predicting into alignment; and (ii) further devise a generalized auto-regressive strategy that encodes all the historical dependencies at current time step. Thereafter, we first investigate the methodological characteristics of the proposed deep probabilistic sequence model on toy cases, and then comprehensively demonstrate the superiority of our model against existing deep probabilistic SSM models through extensive numerical experiments on eight system identification benchmarks from various dynamic systems. Finally, we apply our sequence model to a real-world centrifugal compressor sensor data forecasting problem, and again verify its outstanding performance by quantifying the time series predictive distribution.


page 1

page 2

page 3

page 4


Seasonal Encoder-Decoder Architecture for Forecasting

Deep learning (DL) in general and Recurrent neural networks (RNNs) in pa...

Stochastic Recurrent Neural Network for Multistep Time Series Forecasting

Time series forecasting based on deep architectures has been gaining pop...

Synergetic Learning of Heterogeneous Temporal Sequences for Multi-Horizon Probabilistic Forecasting

Time-series is ubiquitous across applications, such as transportation, f...

A Data-Driven Method for Recognizing Automated Negotiation Strategies

Understanding an opponent agent helps in negotiating with it. Existing w...

Attention-based Neural Load Forecasting: A Dynamic Feature Selection Approach

Encoder-decoder-based recurrent neural network (RNN) has made significan...

Bundle Networks: Fiber Bundles, Local Trivializations, and a Generative Approach to Exploring Many-to-one Maps

Many-to-one maps are ubiquitous in machine learning, from the image reco...

Variational Dynamic Mixtures

Deep probabilistic time series forecasting models have become an integra...

1 Introduction

Time series forecasting is a long standing problem in widespread decision-making scenarios, for example, dynamic system identification aastrom1971system ; verhaegen2007filtering , prognostic and health management (PHM) of machines zhao2019deep ; yu2019remaining ; lei2020applications , and business demand forecasting tadjer2021machine . It thus has gained extensive attention from academic and industrial community over decades.

In order to learn the underlying dynamic and temporal pattern from historical time series data and perform desirable forecasting in the future, classic time series methodologies, for example, exponential smoothing hyndman2008forecasting , and auto-regressive integrated moving average (ARIMA) model box1968some ; durbin2012time , have been studied for a long history. They usually incorporate user’s prior knowledge by decomposing the time series structure into trend, seasonality and so on. Consequently, these methods have high interpretability while suffering from poor capability of learning complex and long time dependency. Moreover, state space models (SSM) hyndman2008forecasting ; schon2011system ; hanachi2016sequential

, which are similar to hidden Markov models 

rabiner1986introduction and could include the above classic methods as special cases durbin2012time , offer a more principled way to learn dynamic patterns from data.

For a forecasting model, an important feature is that it could not only provide the prediction mean but also quantify the associated uncertainty, which is essential for risk management, for instance, the estimation of remaining useful life 


. To this end, an alternative way is interpreting the SSM from the Bayesian view, for example, the well-known Kalman filter 

kalman1960new and dynamic linear models west2006bayesian . Besides, it is popular to combine SSM with Gaussian process (GP) williams2006gaussian , which is theoretically equivalent to Kalman filter when using specific kernels. The Gaussian process state space model (GPSSM) frigola2014variational ; ialongo2019overcoming places GP prior over both the transition and measurement functions of SSM, thus being capable of exporting probabilistic forecasting. The prominent weakness of GPSSM however is the poor scalability to tackle a large amount of time series data, due to the cubic time complexity liu2020gaussian . Though we could resort to the sparse and distributed approximations snelson2006sparse ; titsias2009variational ; hensman2013gaussian ; liu2018generalized to improve the scalability of GPSSM, it on the other hand may limit the model capability.

Alternatively, the deep learning community has followed the SSM framework to extensively exploit the recurrent neural networks (RNNs), for example, gated recurrent unit (GRU) 


, long short term memory (LSTM) 

hochreiter1997long and their variants sutskever2014sequence ; gehring2017convolutional , in order to perform time series point prediction. Due to their high capability of extracting high-dimensional features for learning complicated patterns within time series data, RNNs have reported remarkable and success stories in various domains lipton2015critical ; yu2021analysis . In order to enable probabilistic forecasting using RNN, it is suggested to combine RNN with generative models. For example, we could combine the auto-regressive strategy with RNN and train the model through maximum likelihood salinas2020deepar ; li2021learning . Alternatively, we could adopt an RNN to learn the time-varying coefficients of linear/nonlinear stochastic SSM rangapuram2018deep ; yanchenko2020stanza , which enjoys both probabilistic forecasting and good interpretability. Another line of work builds the generative GP on the top of RNN as measurement function to produce stochastic outputs al2017learning ; she2020neural . Finally, another popular way is combining RNN with variational autoencoder (VAE) kingma2013auto to build connections between SSM and VAE bayer2014learning ; krishnan2015deep ; krishnan2017structured ; fraccaro2018deep ; li2021learning ; gedon2020deep . These models encode the sequence characteristics and dependency on the manifold embedded in a low-dimensional latent space, and are usually trained through variational inference (VI) by deriving the evidence lower bound (ELBO) as objective, thus resulting in the stochastic or variational RNN.

Along the line of deep SSM which gains benefits from two communities, we devise a novel, data-driven deep variational sequence model in an augmented recurrent input space, named VRNNaug, to perform probabilistic time series forecasting. The main contributions of this paper are three-fold:

  • We propose a VAE-type sequence model on an augmented input space, the inputs of which consist of the original input signal and the additional latent input which encodes all the sequence characteristics. The augmented input space could induce rich stochastic sequence dependency.

  • We alleviate the issue of inconsistency between training and predicting as well as improving the modeling of dynamic patterns in the VI framework by (i) feeding the hybrid output as input at next time step, which brings training and predicting into alignment; and (ii) presenting a generalized auto-regressive strategy which encodes all the historical dependencies. This is found to greatly improve the quality of future forecasting.

  • We perform extensive numerical experiments to verify the superiority of our model on eight system identification benchmarks from various dynamic systems, and thereafter apply the proposed sequence model on a real-world centrifugal compressor sensor data forecasting case to illustrate the desirable time series predictive distribution.

Note that the acronyms and notations used in this paper are summarized in Appendices A and B.

2 Preliminaries

It is assumed that the sequence data at discrete and evenly distributed time points are sampled from an unknown dynamic system, and we attempt to model it in order to learn the underlying dynamic and temporal patterns. Thereafter, we use the fitted sequence model to forecast the predictive distribution of future window given the history data and the further input signal (also known as covariates or exogenous variables) .

Specifically, we assume that the target dynamic system could be described by the state space model


where the subscript , which is a non-negative integer, indexes time; is the latent state to be inferred and represents the initial hidden state; the input signal (action) is known at all time points; is the measurement (observation); the transition function maps the hidden state to the next state ;111When the transition is a linear time-invariant mapping, we arrive at the linear SSM. the measurement function (also known as emission function) maps the hidden state to the associated output; and finally, and denote the parameters of and , respectively.

It is observed that when we are describing the functions and using deep mappings, e.g., neural networks, it results in the so-called deep SSM which takes the advances from two fields: the high flexibility from neural networks and the high expressivity via hidden states. When future combining SSM with generative neural networks, for example, VAE, we arrive at the subclass of deep SSM which is capable of preforming probabilistic time series forecasting and will be investigated below in our work. Note that the possible perturbations, for example, modeling error and measurement error, have been absorbed in (1) and (2).

3 Variational sequence model using augmented stochastic inputs

In order to incorporate stochasticity into the RNN, we propose to use additional random variable to augment the input space as

. Consequently, rich stochastic sequence representation could be induced in the augmented input space for improving both the model capability and the quantification of uncertainty.

Model definition. Under the framework of latent variable model, the joint prior with additional latent variables follows the first-order Markov assumption to factorize as


where the latent sequence starts from which is the initial input state and could take for example the zero initialization; the Gaussian state transition starts from and evolves over time for encoding the previous sequence dependency; the Gaussian likelihood plays as the role of decoder for mapping the encoded sequence inputs to the related output at time .

For model inference, we need to maximize the marginal likelihood

which could be estimated by Markov Chain Monte Carlo (MCMC) sampling. When performing prediction, we require the posterior

, the form of which however is unknown. We hence resort to variational inference by introducing Gaussian variational posteriors as an approximation. Specially, similar to the prior

, the joint variational posterior could be factorized as


which describes the generation of augmented latent inputs given observations, and each state transition takes the Gaussian form.

Inconsistency issue. The stochastic sequence model defined in (3) and (4) however may raise inconsistency between training and predicting. That is, different from the training phase, when we are going to predict at time , we have to use the prior rather than the inferred variational posterior , since there is no access to the further observation . The gap raised by this inconsistency might deteriorate the performance.

Inspired by salinas2020deepar , we could use the previous output instead of in the variational posterior with , and sample from the previous conditional output when predicting at time with in order to alleviate the inconsistency issue. The auto-regressive strategy extracts the known previous output as an input at the next time point, which is available for both training and predicting. This strategy however introduces another inconsistency: we use the ground truth during training, but a sample

during predicting. This discrepancy has been found to deteriorate the RNN performance on natural language processing (NLP) tasks 

bengio2015scheduled . Therefore, we decide to use a hybrid output defined as


and feed it as input at next time point during training. The ground truth in the above expression plays as the teaching force to accelerate the model convergence, while the usage of sample reduces the gap between training and predicting.

Furthermore, as shown in Fig. 1, we decide to use a generalized auto-regressive strategy by evolving over all the historical outputs until time through the recurrent mapping as


where . This generalized auto-regressive strategy can also be applied to and in (4) to obtain the recurrent latent states and inputs as


where and . Consequently, the joint prior and the variational posterior now write respectively as


with the state transitions completely conditioned on recurrent inputs. The encoding of all the previous variables enables the learning of long term dependency, the superiority of which has been verified in the ablation study in section 4.3.

Derivation of ELBO. As for the aforementioned variational sequence model, by maximizing the Kullback-Leibler (KL) divergence between the variational posterior and the exact posterior , we derive the following evidence lower bound (ELBO) for the log marginal likelihood as


where the initials take and . Due to the Gaussian variables, the KL term in the right-hand side of (10) has analytical expressions.

The estimation of ELBO  (10), especially the likelihood expectation, could be conducted through the re-parameterization trick kingma2013auto due to the latent Gaussian random variables, therefore resulting in differentiable objective for optimization. For instance, instead of directly sampling from the variational Gaussian posterior , we sample from another random variable without trainable parameters, and then use it to obtain where the symbol represents element-wise product.

Besides, directly evaluating the recurrent ELBO unrolled over a long time series requires high memory consumption due to the many historical states. More efficiently, we employ the shingling technique proposed in leskovec2014mining to convert the long time series into several short time series chunks. Specially, given a long time series with length of , we start from an arbitrary time point and obtain a short time series with sequence length . Consequently, we could obtain short time series chunks in total. This shingling data preprocessing allows modeling many independent time series with varying length simultaneously given the same chunk size . Besides, the time series split as well as the ELBO factorized over time points in (10

) could naturally support an efficient and unbiased estimation on a subset

of the short time series chunks wherein , thus allowing using mini-batch optimizer, e.g., Adam kingma2014adam .

Figure 1:

The inference and generative networks of the proposed VRNNaug model for probabilistic time series forecasting. Note that the diamond nodes represent deterministic variables. Contrarily, the circle nodes indicate random variables. The gray nodes represent input signals and observations. Particularly, the transient gray node

indicates the mixture of observation and sample. Finally, it is found that in comparison to the inference network, the only difference in the generative network is that it is using the inferred previous output variable since we are inaccessible to the true observation when predicting.

Amortized VRNNaug. As depicted in Fig. 1, the whole model follows the encoder-decoder structure. Furthermore, we employ amortized variational inference, i.e., amortizing the parameters over neural networks, to build our proposed VRNNaug model more efficiently. In order to account for the recurrent structure in the state transition

, we build the latent variables upon the outputs of for example RNN models, and map them to the related Gaussian variables through multi-layer perceptron (MLP) as


The above RNN cells, for instance, GRU cho2014learning and LSTM hochreiter1997long , naturally encode the sequence characteristics. It is notable that the sequence modelings , and are not limited to RNN models. The recently developed sequence extractors, for example, temporal convolutional networks (TCN) bai2018empirical and transformer vaswani2017attention , are also available here. Besides, note that different from and , the hidden state in the above expression solely holds for training, see the inference network in Fig. 1. When predicting, it accepts the single sample from previous output distribution as , see the generative network in Fig. 1. Finally, different from the posterior state transition , the latent prior takes the unit Gaussian for simplicity.

As for the likelihood (decoder), we follow the architecture of Dense-Net huang2017densely to extend the original by extracting and concatenating all the previous inputs in Fig. 1 as


where the Gaussian parameters are parameterized through MLPs as


This sort of dense connection between encoder and decoder eases model training because of the block-cross feature sharing. Note that the likelihood is not limited to Gaussian, it can adopt other task-dependent distributions. For instance, we could choose the negative-binomial likelihood for positive count data, and the Bernoulli likelihood for binary data salinas2020deepar .

Prediction. Given the trained deep sequence model, in order to perform recursive prediction over the future time , we first sample from the posterior state transition wherein now is unrolled over the previous output samples since the true output observations are inaccessible when predicting. Thereafter, we pass the latent sample through the decoder to output the prediction sample at current time point. It is notable that although the model employs Gaussian variables, the final predictive distribution is not Gaussian. Hence, we could use MCMC sampling to output samples for empirically quantifying the implicit predictive distribution.

Discussions. So far, we have described the proposed deep sequence model with known input signal at any time point, which includes additional covariate that may affect the output. For time series without additional control signals, one could simply use the time features, for example, hour-of-day and day-of-week, as the additional for the model.

Alternatively, it is known that one could convert this kind of time series into supervised learning fashion through the auto-regressive manner. That is, given a look-back window size

, we use the historical data as the input control signal to predict at future time . This can be naturally done by our model in Fig. 1 by simply dropping out in both inference and generative networks. Note that our model is mainly designed for free forecasting, i.e., multi-step ahead prediction. If one wants to conduct one-step ahead prediction, we could simply feed the ground truth into both the inference and generative networks.

Finally, the relation and difference of our model to existing literature should be highlighted. The most closely related works come from bayer2014learning ; fraccaro2018deep ; li2021learning ; gedon2020deep ; salinas2020deepar . In comparison to the similar VAE-type probabilistic sequence models, the proposed VRNNaug on the augmented inputs brings several improvements: (i) it adopts hybrid output in (5) during training for greatly alleviating the inconsistency issue; and (ii) it employs the generalized auto-regressive strategy in (6) and (7) to form the recurrent inputs for better capturing temporal dependency. The benefits brought by these improvements will be verified by the comparative study in section 4.2 and the ablation study in section 4.3.

4 Numerical experiments

In this section, we first showcase the methodological characteristics of proposed VRNNaug model on two toy cases, followed by the extensive comparison against existing RNN assisted stochastic sequence models on eight system identification benchmarks and a real-world centrifugal compressor sensor data.

We implement our model using PyTorch on a Linux workstation with TITAN RTX GPU. The detailed experimental configurations are provided in Appendix 

C. As for the probabilistic forecasting provided by our model, we quantify the quality of predictive distribution by the 50- and

90-quantile losses 

rangapuram2018deep . Generally, given the -th time series output () and its corresponding -quantile prediction , the -quantile loss is expressed as


where the parameter . In the following experiments, we use to quantify the accuracy of prediction mean, and to represent the coverage of predictive distribution.

4.1 Toy cases

We below study the proposed VRNNaug model on two toy cases (the linear Gaussian system and the heteroscedastic motorcycle case) with different features.

Linear Gaussian system. We first consider a one-dimensional linear Gaussian system gedon2020deep expressed as


where is the process noise, and is the measurement noise. We use this dynamic model to sequentially generate 2000 training points, 2000 validation points and 5000 test points. Note that when generating training and validation signals, an excitation input signal polluted with uniform random noise is employed for the linear system. Differently, the input signal for the test data writes as

Figure 2:

The predictive distribution of VRNNaug on the linear Gaussian system. The red curve represents 50-quantile prediction, while the associated shaded region indicates the interval bounded by 95-quantile and 5-quantile predictions. The blue curve is the true observation with the shaded region representing 95% confidence interval.

The predictive distributions of the proposed VRNNaug together with the true observations are depicted in Fig. 2. It is observed that for this linear system, the flexible, nonlinear VRNNaug successfully captures the underlying linear dynamics. This linear model has also been used in gedon2020deep to test another deep SSM model, named STORN bayer2014learning , and the gray-box model SSEST lennart1999system with two latent states. Though successfully capturing the dynamics, they conservatively overestimate the uncertainty of prediction. Our model however is found to well estimate the uncertainty in Fig. 2.

Heteroscedastic motorcycle data. We next study a more complicated motorcycle time series dataset silverman1985some composed of 133 points with time input (ms) and acceleration output. The motorcycle data takes from an experiment on the efficacy of crash helmets, and yields multiple acceleration observations at some time points. In comparison to the previous linear Gaussian system with homoscedastic noise, this nonlinear dataset contains more challenging time-varying noise.

Figure 3: The predictive distribution of VRNNaug on the heteroscedastic motorcycle dataset. The red curve represents 50-quantile prediction, while the associated shaded region indicates the interval bounded by 95-quantile and 5-quantile predictions. The circles represent 133 data points.

We use all the 133 data points for training, validation and testing, and extract the time feature as input signal for our VRNNaug model. As depicted in Fig. 3, the proposed VRNNaug captures the non-linear dynamics of underlying acceleration curve, for example, the constant behavior before ms and the step behavior around ms. More interestingly, it well describes the time-varying noise, for example, the pretty small perturbation before ms and the large and time-varying noise thereafter.

4.2 System identification benchmarks

This section verifies the performance of proposed deep sequence model on eight popular system identification benchmarks from dynamics systems like hydraulic actuator, electric motors, hair dryer, furnace, tank, cascaded tanks system schoukens2016cascaded , F-16 aircraft ground vibration test noel2017f , and wiener-hammerstein process noise system schoukens2016wiener .222The first five datasets are available at and the last three datasets are collected at Table 1 summarizes the sequence length and dimensions of the considered time series datasets. It is found that all these datasets except f16gvt have a single input signal, and the particular tank dataset has two outputs. The goal of sequence model is to infer the underlying high-dimensional latent state and the relevant nonlinear dynamics from data.

actuator 1024 1 1
cascaded tank (ctank) 2028 1 1
drive 500 1 1
dryer 1000 1 1
f16gvt 32768 2 1
gas furnace (furnace) 296 1 1
tank 2500 1 2
wiener hammerstein (hammerstein) 32768 1 1
Table 1: The system identification benchmarks.

The performance of VRNNaug is compared against state-of-the-art RNN assisted deep stochastic SSM models, including (i) the linear SSM model with the parameters learned through neural networks (Deep-LSSM) rangapuram2018deep ; (ii) the basic VAE structure with the encoder built upon RNN, named VAE-RNN fraccaro2018deep ; (iii) the stochastic RNN (STORN) following encoder-decoder structure bayer2014learning ; (iv) the deep auto-regressive model (Deep-AR) salinas2020deepar ; and finally, (v) the deep SSM for probabilistic forecasting, donated as DSSMF li2021learning .333The DSSMF in li2021learning additionally adopts an automatic relevance determination network to identify the importance of components in input signal , which is not implemented in our comparison. Tables 2 and 3 report the and results of various deep stochastic SSM models over ten runs, respectively. Note that the best and second-best results on the eight benchmarks are marked as gray and light gray, respectively, in order to highlight the winners. We have the following findings from the comparative study.

actuator 0.3866 0.3529 0.3503 0.3839 0.3737 0.3241
ctank 0.2777 0.1558 0.1065 0.0965 0.1028 0.0715
drive 0.3157 0.4925 0.3792 0.5083 0.5056 0.2098
dryer 0.0201 0.0203 0.0162 0.0179 0.0153 0.0158
f16gvt 0.7108 0.0723 0.0735 0.2973 0.3021 0.1540
furnace 0.0238 0.0436 0.0225 0.0231 0.0242 0.0226
tank () 0.1005 0.1885 0.0785 0.0539 0.0529 0.0591
tank () 0.0597 0.0615 0.0559 0.0414 0.0437 0.0429
hammerstein NA 0.9400 0.7364 0.4676 0.5088 0.2261
Table 2: The 50 results of various deep stochastic SSM models on the eight system identification benchmarks. The best and second-best results on each dataset are marked as gray and light gray, respectively.
actuator 0.2059 0.2302 0.1894 0.1965 0.1925 0.1776
ctank 0.2209 0.0979 0.0656 0.0758 0.0611 0.0358
drive 0.1610 0.1966 0.1737 0.2173 0.2245 0.1029
dryer 0.0079 0.0093 0.0060 0.0082 0.0062 0.0067
f16gvt 0.3180 0.0342 0.0353 0.1310 0.1356 0.0733
furnace 0.0210 0.0176 0.0221 0.0229 0.0222 0.0197
tank () 0.0386 0.1219 0.0605 0.0592 0.0566 0.0318
tank () 0.0287 0.0286 0.0344 0.0463 0.0504 0.0285
hammerstein NA 0.3992 0.3689 0.2024 0.2122 0.1443
Table 3: The 90 results of various deep stochastic SSM models on the eight system identification benchmarks. The best and second-best results on each dataset are marked as gray and light gray, respectively.

It is first observed that due to the linear SSM framework, the Deep-LSSM is hard to learn the nonlinear dynamic behaviors from sequence data, even though it attempts to learn the local, time-varying coefficients. For instance, though the Deep-LSSM forecasts the rough profile on the ctank dataset, its prediction however leaves far away from the ground truth in comparison to the others, which is indicated by the poor results in Table 2. Furthermore, it fails on the large and complicated hammerstein dataset. Therefore, in comparison to other competitors, the Deep-LSSM showcases the worst performance in general in terms of both and . Despite the linear model structure, the undesirable performance of Deep-LSSM may be attributed to the type-I maximum likelihood framework wherein the latent variables are not integrated out.

As for the VAE-type deep SSM models, the more complicated STORN is found to outperform the simple VAE-RNN on most cases in terms of both and . This is attributed to the description of the recurrency in previous outputs when determining the posterior . But as has been discussed, the usage of current output at time raises the inconsistency issue: when forecasting time series through generative networks, we have no access to current output and then have to sample from prior. This thus makes STORN’s results in Table 3 slightly worse than Deep-AR and DSSMF which use the auto-regressive strategy to alleviate the issue. Besides, the sampling from prior when generating time series data makes VAE-RNN and STORN tend to underestimate the uncertainty, which will be illustrated on the compressor case in next section.

Finally, for the last three deep SSM models, all of them employ sort of auto-regressive strategy to avoid the inconsistency issue. In comparison to Deep-AR and DSSMF, the proposed VRNNaug adopts (i) the hybrid output to further alleviate the gap between training and predicting; and (ii) the generalized auto-regressive strategy on output , input signal and augmented input for extracting long term historical patterns. As a result, it performs the best on these system identification benchmarks in terms of both and .444The VRNNaug forecasting in future time period on the eight system identification benchmarks is respectively illustrated in Appendix D. Besides, it is found that akin to Deep-LSSM, the Deep-AR also trains the model in the simple type-I maximum-likelihood framework, thus making it perform slightly worse than the variant DSSMF derived through variational inference, especially in terms of the criterion.

4.3 Ablation study

We here perform ablation study of the proposed VRNNaug model on the drive dataset.The ablation study includes two additional VRNNaug variants in order to reveal individual effects of components: (i) the VRNNaug-v1 which resembles Deep-AR by using only the auto-regressive strategy on ; and (ii) the VRNNaug-v2 which uses the generalized auto-regressive strategy in (6) and (7), but simply uses instead of the hybrid in (5) to determine . Differently, the complete VRNNaug model adopts both the previous hybrid outputs and the generalized auto-regressive strategy to alleviate the inconsistency issue and improve the model capability.

Figure 4: Ablation study of the proposed VRNNaug model on the drive dataset.

Fig. 4 depicts the comparative results of three VRNNaug models in terms of both and criteria on the drive dataset. It is observed that:

  • By solely using the auto-regressive strategy akin to Deep-AR and DSSMF, the VRNNaug-v1 is incapable of capturing the underlying nonlinear dynamical behavior on the drive dataset. Contrarily, the generalized auto-regressive strategy summarizes all the long term historical patterns of not only but also and . Consequently, both VRNNaug-v2 and VRNNaug significantly outperform the basic VRNNaug-v1;

  • The hybrid output used in training helps further fill the gap between inference and generation, since now we always consider the prediction samples in the two phases. Hence, it brings considerable improvements for VRNNaug over VRNNaug-v2.

4.4 Centrifugal compressor sensor data forecasting

This section verifies the performance of the proposed VRNNaug on an 11-variable time series data collected from a real-world centrifugal compressor with rotating speed of 5556 RPM xu2016bayesian . As illustrated in Fig. 5, this time series data was recorded every one hour from March to October, 2013, by placing sensors at the inlet, the inlet guide vanes, the exhaust, the bearing, and the axis of centrifugal compressor, thus resulting in 3710 raw data points for each variable.

Figure 5: Illustration of sensors layout for collecting the compressor time series data.

As for this real-world time series task, we perform data pre-processing before applying deep stochastic SSM models on it for forecasting. These 11-variable time series data is collected manually from sensors, thus may containing human errors, outliers, noise and missing values. Hence, we first perform outlier analysis to identify and remove the outliers in each raw time series variable via the box-plot method, thus resulting in 3524 data points. Secondly, in order to eliminate the effect of magnitudes within multivariate time series, we normalize all the variables into the same range

. Thirdly, we conduct discrete wavelet packet transform (DWPT) analysis by the Daubechies wavelet of eight-order daubechies1992ten as well as the Bayesian thresholding technique in order to denoise our time series data.

By processing and denoising the raw data, we finally obtain the compressor time series containing 3524 data points. Following the experimental configurations elaborated in Appendix C, our aim is to predict the axis displacement at the last 10% time points, given the remaining ten variables as covariates.

Deep-LSSM 0.0597 0.0428
VAE-RNN 0.0563 0.0259
STORN 0.0540 0.0270
Deep-AR 0.0674 0.0531
DSSMF 0.0600 0.0581
VRNNaug 0.0488 0.0255
Table 4: The and results of various deep stochastic SSM models on the real-world compressor dataset. Note that the best and second-best results on each dataset are marked as gray and light gray, respectively.

Table 4 reports the forecasting results of VRNNaug against that of other deep stochastic SSM models on the real-world compressor dataset. The superiority of VRNNaug over other competitors has again been observed in terms of both the and criteria for forecasting the future axis displacement. Moreover, Fig. 6 illustrates the forecasting of these deep stochastic SSM models for the future axis displacement. It is observed that VRNNaug provides more accurate prediction mean. Besides, the uncertainty of Deep-AR and DSSMF is overestimated, thus indicated by the poor results in Table 4. Contrarily, VAE-RNN and STORN tend to underestimate the uncertainty, which may be attributed to the sampling from prior rather than the informative posterior in the generative network.555Note that in comparison to the overestimated uncertainty, this conservative uncertainty is favored by the criterion since it might cover the observations. The proposed VRNNaug achieves a trade-off between the two extreme cases, thus providing desirable estimation of uncertainty, which is revealed by the best results.

Figure 6: The future axis displacement of centrifugal compressor forecasted by various deep SSM models. The red curve represents 50-quantile prediction, while the associated shaded region indicates the interval bounded by 95-quantile and 5-quantile predictions.

5 Conclusions

In this paper, we present a variational RNN model on the augmented recurrent space for probabilistic time series forecasting. When using variational inference to derive the ELBO as objective for model training, we figure out the issue of inconsistency that would deteriorate the model performance. Hence, we propose (i) feeding the hybrid output as input at next time step, and (ii) presenting the generalized auto-regressive strategy in order to form recurrent inputs. Through extensive experiments on various time series tasks against existing deep stochastic SSM models, we have showcased the superiority of proposed VRNNaug model by quantifying predictive distribution. Future work would attempt to extend the proposed model to challenging scenarios in dynamic systems with for example missing outputs, irregular time points, and knowledge transfer across multivariate time series.


This work was supported by the National Natural Science Foundation of China (52005074), and the Fundamental Research Funds for the Central Universities (DUT19RC(3)070). Besides, it was partially supported by the Research and Innovation in Science and Technology Major Project of Liaoning Province (2019JH1-10100024), and the MIIT Marine Welfare Project (Z135060009002).

Appendix A Acronyms

Appendix B Notations

Appendix C Experimental details

Toy cases. For the two toy cases, we adopt the same settings for the following system identification benchmarks except that (i) the chunk size is set as and the latent dimensionality takes on the motorcycle

dataset, and (ii) the Adam optimizer is ran over 200 epochs.

System identification benchmarks. The experimental configurations on the eight system identification datasets (, , , , , , , and ) are detailed as below.

Firstly, we perform standardization on the sequence inputs and outputs of time series data to have zero mean and unit variance. We then split the whole data by choosing the first 50% as training set, the middle 20% as validation set, and the final 30% as testing set. Besides, we use the shingling strategy 

leskovec2014mining to further split the training and validation datasets with the chunk size (i.e., sequence length) , thus resulting into many short time series sets.

As for model configuration of the proposed VRNNaug, we build three GRU models in (11)-(13) using a single hidden layer with 100 units. The weight parameters of these GRU models are initialized by the orthogonal initialization saxe2013exact

, and the bias parameters are initialized with zeros. As for the MLPs in VRNNaug, we adopt the fully-connected (FC) neural network with three hidden layers, the ReLU activation, and skip connection 

he2016deep if possible. The number of units for the hidden layers is set as wherein is the number of input features. Finally, for the VAE structure in Fig. 1, we set the latent dimensionality of as .

As for model training, the Adam optimizer is employed with the batch size of and the maximum number of epochs as 100. Particularly, we have a scheduled learning strategy. That is, we start using a learning rate of within the first ten epochs, and then check it every ten epochs: if the validation loss cannot decrease within the last ten epochs, we use a decreasing factor 0.5 to adjust the learning rate. We stop the training when (i) it reaches the maximum number of epochs; or (ii) the adjusted learning rate is below .

As for model forecasting, we start from cold and recursively sample points from the predictive distributions of deep stochastic SSM models at the future time period , and then use the prediction samples to evaluate the related and quantile criteria.

It is finally notable that we repeat the experiments on each time series task ten times by starting with different random seeds in order to comprehensively evaluate the performance of deep sequence models.

Centrifugal compressor sensor data. After obtaining the processed compressor sequence data with and , we adopt the similar experimental settings on the above system identification benchmarks for model training and forecasting. The differences are that we here split the whole data by choosing the first 80% as training set, the middle 10% as validation set, and the final 10% as testing set; and we use the shingling strategy to split the training and validation datasets with the chunk size .

Appendix D Illustration of VRNNaug on benchmarks

Fig. 7 depicts the predictive distributions of the proposed VRNNaug on eight system identification benchmarks, respectively. Note that for the benchmarks with many test time points (e.g., ), we only illustrate the predictions at the first 500 time points. It is observed that the predictions of VRNNaug well describe the sequence behaviors on most benchmarks.

Figure 7: Illustration of VRNNaug forecasting on eight system identification benchmarks. The red curve represents 50-quantile prediction, while the associated shaded region indicates the interval bounded by 95-quantile and 5-quantile predictions.



  • (1) K. J. Åström, P. Eykhoff, System identification—A survey, Automatica 7 (2) (1971) 123–162.
  • (2) M. Verhaegen, V. Verdult, Filtering and system identification: A least squares approach, Cambridge university press, 2007.
  • (3) R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, R. X. Gao, Deep learning and its applications to machine health monitoring, Mechanical Systems and Signal Processing 115 (2019) 213–237.
  • (4)

    W. Yu, I. Y. Kim, C. Mechefske, Remaining useful life estimation using a bidirectional recurrent neural network based autoencoder scheme, Mechanical Systems and Signal Processing 129 (2019) 764–780.

  • (5)

    Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, A. K. Nandi, Applications of machine learning to machine fault diagnosis: A review and roadmap, Mechanical Systems and Signal Processing 138 (2020) 106587.

  • (6) A. Tadjer, A. Hong, R. B. Bratvold, Machine learning based decline curve analysis for short-term oil production forecast, Energy Exploration & Exploitation (2021) 01445987211011784.
  • (7) R. Hyndman, A. B. Koehler, J. K. Ord, R. D. Snyder, Forecasting with exponential smoothing: the state space approach, Springer Science & Business Media, 2008.
  • (8) G. E. Box, G. M. Jenkins, Some recent advances in forecasting and control, Journal of the Royal Statistical Society: Series C (Applied Statistics) 17 (2) (1968) 91–109.
  • (9) J. Durbin, S. J. Koopman, Time series analysis by state space methods, Oxford university press, 2012.
  • (10) T. B. Schön, A. Wills, B. Ninness, System identification of nonlinear state-space models, Automatica 47 (1) (2011) 39–49.
  • (11) H. Hanachi, J. Liu, A. Banerjee, Y. Chen, Sequential state estimation of nonlinear/non-gaussian systems with stochastic input for turbine degradation estimation, Mechanical Systems and Signal Processing 72 (2016) 32–45.
  • (12)

    L. Rabiner, B. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine 3 (1) (1986) 4–16.

  • (13) Y. Liao, L. Zhang, C. Liu, Uncertainty prediction of remaining useful life using long short-term memory network based on bootstrap method, in: IEEE International Conference on Prognostics and Health Management, IEEE, 2018, pp. 1–8.
  • (14) R. E. Kalman, A new approach to linear filtering and prediction problems, Journal of Basic Engineering 82 (1) (1960) 35–45.
  • (15) M. West, J. Harrison, Bayesian forecasting and dynamic models, Springer Science & Business Media, 2006.
  • (16) C. K. Williams, C. E. Rasmussen, Gaussian processes for machine learning, MIT press Cambridge, MA, 2006.
  • (17) R. Frigola, Y. Chen, C. Rasmussen, Variational Gaussian process state-space models, in: Advances in Neural Information Processing Systems, Vol. 27, 2014, pp. 3680–3688.
  • (18) A. D. Ialongo, M. van der Wilk, J. Hensman, C. E. Rasmussen, Overcoming mean-field approximations in recurrent Gaussian process models., in: International Conference on Machine Learning, 2019, pp. 2931–2940.
  • (19) H. Liu, Y.-S. Ong, X. Shen, J. Cai, When gaussian process meets big data: A review of scalable GPs, IEEE Transactions on Neural Networks and Learning Systems 31 (11) (2020) 4405–4423.
  • (20) E. Snelson, Z. Ghahramani, Sparse gaussian processes using pseudo-inputs, in: Advances in Neural Information Processing Systems, MIT Press, 2006, pp. 1257–1264.
  • (21)

    M. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: Artificial Intelligence and Statistics, 2009, pp. 567–574.

  • (22)

    J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: Uncertainty in Artificial Intelligence, Citeseer, 2013, p. 282.

  • (23) H. Liu, J. Cai, Y. Wang, Y.-S. Ong, Generalized robust Bayesian committee machine for large-scale Gaussian process regression, in: International Conference on Machine Learning, 2018, pp. 1–10.
  • (24) K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734.
  • (25) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
  • (26) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, arXiv preprint arXiv:1409.3215.
  • (27) J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: International Conference on Machine Learning, PMLR, 2017, pp. 1243–1252.
  • (28) Z. C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, arXiv preprint arXiv:1506.00019.
  • (29) W. Yu, I. Y. Kim, C. Mechefske, Analysis of different rnn autoencoder variants for time series classification and machine prognostics, Mechanical Systems and Signal Processing 149 (2021) 107322.
  • (30) D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, DeepAR: Probabilistic forecasting with autoregressive recurrent networks, International Journal of Forecasting 36 (3) (2020) 1181–1191.
  • (31) L. Li, J. Yan, X. Yang, Y. Jin, Learning interpretable deep state space model for probabilistic time series forecasting, arXiv preprint arXiv:2102.00397.
  • (32) S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, T. Januschowski, Deep state space models for time series forecasting, Advances in Neural Information Processing Systems 31 (2018) 7785–7794.
  • (33) A. K. Yanchenko, S. Mukherjee, Stanza: A nonlinear state space model for probabilistic inference in non-stationary time series, arXiv preprint arXiv:2006.06553.
  • (34) M. Al-Shedivat, A. G. Wilson, Y. Saatchi, Z. Hu, E. P. Xing, Learning scalable deep kernels with recurrent structure, The Journal of Machine Learning Research 18 (1) (2017) 2850–2886.
  • (35) Q. She, A. Wu, Neural dynamics discovery via Gaussian process recurrent neural networks, in: Uncertainty in Artificial Intelligence, PMLR, 2020, pp. 454–464.
  • (36) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114.
  • (37) J. Bayer, C. Osendorfer, Learning stochastic recurrent networks, arXiv preprint arXiv:1411.7610.
  • (38) R. G. Krishnan, U. Shalit, D. Sontag, Deep kalman filters, arXiv preprint arXiv:1511.05121.
  • (39) R. G. Krishnan, U. Shalit, D. Sontag, Structured inference networks for nonlinear state space models, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, pp. 2101–2109.
  • (40) M. Fraccaro, Deep latent variable models for sequential data, Ph.D. thesis, Department of Applied Mathematics and Computer Science, Technical University of Denmark (2018).
  • (41) D. Gedon, N. Wahlström, T. B. Schön, L. Ljung, Deep state space models for nonlinear system identification., arxiv:eess.SY.
  • (42) S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
  • (43) J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of massive data sets, Cambridge university press, 2014.
  • (44) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
  • (45) S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint arXiv:1803.01271.
  • (46) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
  • (47)

    G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.

  • (48) L. Lennart, System identification: Theory for the user, PTR Prentice Hall, Upper Saddle River, NJ 28.
  • (49) B. W. Silverman, Some aspects of the spline smoothing approach to non-parametric regression curve fitting, Journal of the Royal Statistical Society: Series B (Methodological) 47 (1) (1985) 1–21.
  • (50) M. Schoukens, P. Mattson, T. Wigren, J.-P. Noel, Cascaded tanks benchmark combining soft and hard nonlinearities, in: Workshop on nonlinear system identification benchmarks, 2016, pp. 20–23.
  • (51) J.-P. Noël, M. Schoukens, F-16 aircraft benchmark based on ground vibration test data, in: 2017 Workshop on Nonlinear System Identification Benchmarks, 2017, pp. 19–23.
  • (52) M. Schoukens, J.-P. Noel, Wiener-hammerstein benchmark with process noise, in: Workshop on nonlinear system identification benchmarks, 2016, pp. 15–19.
  • (53) X. Shengli, J. Xiaomo, H. Jinzhi, Y. Shuhua, W. Xiaofang, Bayesian wavelet pca methodology for turbomachinery damage diagnosis under uncertainty, Mechanical Systems and Signal Processing 80 (Dec) (2016) 1–18.
  • (54) I. Daubechies, Ten lectures on wavelets, SIAM, 1992.
  • (55) A. M. Saxe, J. L. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv preprint arXiv:1312.6120.
  • (56) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.