1 Introduction
Time series forecasting is a long standing problem in widespread decisionmaking scenarios, for example, dynamic system identification aastrom1971system ; verhaegen2007filtering , prognostic and health management (PHM) of machines zhao2019deep ; yu2019remaining ; lei2020applications , and business demand forecasting tadjer2021machine . It thus has gained extensive attention from academic and industrial community over decades.
In order to learn the underlying dynamic and temporal pattern from historical time series data and perform desirable forecasting in the future, classic time series methodologies, for example, exponential smoothing hyndman2008forecasting , and autoregressive integrated moving average (ARIMA) model box1968some ; durbin2012time , have been studied for a long history. They usually incorporate user’s prior knowledge by decomposing the time series structure into trend, seasonality and so on. Consequently, these methods have high interpretability while suffering from poor capability of learning complex and long time dependency. Moreover, state space models (SSM) hyndman2008forecasting ; schon2011system ; hanachi2016sequential
, which are similar to hidden Markov models
rabiner1986introduction and could include the above classic methods as special cases durbin2012time , offer a more principled way to learn dynamic patterns from data.For a forecasting model, an important feature is that it could not only provide the prediction mean but also quantify the associated uncertainty, which is essential for risk management, for instance, the estimation of remaining useful life
liao2018uncertainty. To this end, an alternative way is interpreting the SSM from the Bayesian view, for example, the wellknown Kalman filter
kalman1960new and dynamic linear models west2006bayesian . Besides, it is popular to combine SSM with Gaussian process (GP) williams2006gaussian , which is theoretically equivalent to Kalman filter when using specific kernels. The Gaussian process state space model (GPSSM) frigola2014variational ; ialongo2019overcoming places GP prior over both the transition and measurement functions of SSM, thus being capable of exporting probabilistic forecasting. The prominent weakness of GPSSM however is the poor scalability to tackle a large amount of time series data, due to the cubic time complexity liu2020gaussian . Though we could resort to the sparse and distributed approximations snelson2006sparse ; titsias2009variational ; hensman2013gaussian ; liu2018generalized to improve the scalability of GPSSM, it on the other hand may limit the model capability.Alternatively, the deep learning community has followed the SSM framework to extensively exploit the recurrent neural networks (RNNs), for example, gated recurrent unit (GRU)
cho2014learning, long short term memory (LSTM)
hochreiter1997long and their variants sutskever2014sequence ; gehring2017convolutional , in order to perform time series point prediction. Due to their high capability of extracting highdimensional features for learning complicated patterns within time series data, RNNs have reported remarkable and success stories in various domains lipton2015critical ; yu2021analysis . In order to enable probabilistic forecasting using RNN, it is suggested to combine RNN with generative models. For example, we could combine the autoregressive strategy with RNN and train the model through maximum likelihood salinas2020deepar ; li2021learning . Alternatively, we could adopt an RNN to learn the timevarying coefficients of linear/nonlinear stochastic SSM rangapuram2018deep ; yanchenko2020stanza , which enjoys both probabilistic forecasting and good interpretability. Another line of work builds the generative GP on the top of RNN as measurement function to produce stochastic outputs al2017learning ; she2020neural . Finally, another popular way is combining RNN with variational autoencoder (VAE) kingma2013auto to build connections between SSM and VAE bayer2014learning ; krishnan2015deep ; krishnan2017structured ; fraccaro2018deep ; li2021learning ; gedon2020deep . These models encode the sequence characteristics and dependency on the manifold embedded in a lowdimensional latent space, and are usually trained through variational inference (VI) by deriving the evidence lower bound (ELBO) as objective, thus resulting in the stochastic or variational RNN.Along the line of deep SSM which gains benefits from two communities, we devise a novel, datadriven deep variational sequence model in an augmented recurrent input space, named VRNNaug, to perform probabilistic time series forecasting. The main contributions of this paper are threefold:

We propose a VAEtype sequence model on an augmented input space, the inputs of which consist of the original input signal and the additional latent input which encodes all the sequence characteristics. The augmented input space could induce rich stochastic sequence dependency.

We alleviate the issue of inconsistency between training and predicting as well as improving the modeling of dynamic patterns in the VI framework by (i) feeding the hybrid output as input at next time step, which brings training and predicting into alignment; and (ii) presenting a generalized autoregressive strategy which encodes all the historical dependencies. This is found to greatly improve the quality of future forecasting.

We perform extensive numerical experiments to verify the superiority of our model on eight system identification benchmarks from various dynamic systems, and thereafter apply the proposed sequence model on a realworld centrifugal compressor sensor data forecasting case to illustrate the desirable time series predictive distribution.
Note that the acronyms and notations used in this paper are summarized in Appendices A and B.
2 Preliminaries
It is assumed that the sequence data at discrete and evenly distributed time points are sampled from an unknown dynamic system, and we attempt to model it in order to learn the underlying dynamic and temporal patterns. Thereafter, we use the fitted sequence model to forecast the predictive distribution of future window given the history data and the further input signal (also known as covariates or exogenous variables) .
Specifically, we assume that the target dynamic system could be described by the state space model
(1)  
(2) 
where the subscript , which is a nonnegative integer, indexes time; is the latent state to be inferred and represents the initial hidden state; the input signal (action) is known at all time points; is the measurement (observation); the transition function maps the hidden state to the next state ;^{1}^{1}1When the transition is a linear timeinvariant mapping, we arrive at the linear SSM. the measurement function (also known as emission function) maps the hidden state to the associated output; and finally, and denote the parameters of and , respectively.
It is observed that when we are describing the functions and using deep mappings, e.g., neural networks, it results in the socalled deep SSM which takes the advances from two fields: the high flexibility from neural networks and the high expressivity via hidden states. When future combining SSM with generative neural networks, for example, VAE, we arrive at the subclass of deep SSM which is capable of preforming probabilistic time series forecasting and will be investigated below in our work. Note that the possible perturbations, for example, modeling error and measurement error, have been absorbed in (1) and (2).
3 Variational sequence model using augmented stochastic inputs
In order to incorporate stochasticity into the RNN, we propose to use additional random variable to augment the input space as
. Consequently, rich stochastic sequence representation could be induced in the augmented input space for improving both the model capability and the quantification of uncertainty.Model definition. Under the framework of latent variable model, the joint prior with additional latent variables follows the firstorder Markov assumption to factorize as
(3) 
where the latent sequence starts from which is the initial input state and could take for example the zero initialization; the Gaussian state transition starts from and evolves over time for encoding the previous sequence dependency; the Gaussian likelihood plays as the role of decoder for mapping the encoded sequence inputs to the related output at time .
For model inference, we need to maximize the marginal likelihood
which could be estimated by Markov Chain Monte Carlo (MCMC) sampling. When performing prediction, we require the posterior
, the form of which however is unknown. We hence resort to variational inference by introducing Gaussian variational posteriors as an approximation. Specially, similar to the prior, the joint variational posterior could be factorized as
(4) 
which describes the generation of augmented latent inputs given observations, and each state transition takes the Gaussian form.
Inconsistency issue. The stochastic sequence model defined in (3) and (4) however may raise inconsistency between training and predicting. That is, different from the training phase, when we are going to predict at time , we have to use the prior rather than the inferred variational posterior , since there is no access to the further observation . The gap raised by this inconsistency might deteriorate the performance.
Inspired by salinas2020deepar , we could use the previous output instead of in the variational posterior with , and sample from the previous conditional output when predicting at time with in order to alleviate the inconsistency issue. The autoregressive strategy extracts the known previous output as an input at the next time point, which is available for both training and predicting. This strategy however introduces another inconsistency: we use the ground truth during training, but a sample
during predicting. This discrepancy has been found to deteriorate the RNN performance on natural language processing (NLP) tasks
bengio2015scheduled . Therefore, we decide to use a hybrid output defined as(5) 
and feed it as input at next time point during training. The ground truth in the above expression plays as the teaching force to accelerate the model convergence, while the usage of sample reduces the gap between training and predicting.
Furthermore, as shown in Fig. 1, we decide to use a generalized autoregressive strategy by evolving over all the historical outputs until time through the recurrent mapping as
(6) 
where . This generalized autoregressive strategy can also be applied to and in (4) to obtain the recurrent latent states and inputs as
(7) 
where and . Consequently, the joint prior and the variational posterior now write respectively as
(8)  
(9) 
with the state transitions completely conditioned on recurrent inputs. The encoding of all the previous variables enables the learning of long term dependency, the superiority of which has been verified in the ablation study in section 4.3.
Derivation of ELBO. As for the aforementioned variational sequence model, by maximizing the KullbackLeibler (KL) divergence between the variational posterior and the exact posterior , we derive the following evidence lower bound (ELBO) for the log marginal likelihood as
(10) 
where the initials take and . Due to the Gaussian variables, the KL term in the righthand side of (10) has analytical expressions.
The estimation of ELBO (10), especially the likelihood expectation, could be conducted through the reparameterization trick kingma2013auto due to the latent Gaussian random variables, therefore resulting in differentiable objective for optimization. For instance, instead of directly sampling from the variational Gaussian posterior , we sample from another random variable without trainable parameters, and then use it to obtain where the symbol represents elementwise product.
Besides, directly evaluating the recurrent ELBO unrolled over a long time series requires high memory consumption due to the many historical states. More efficiently, we employ the shingling technique proposed in leskovec2014mining to convert the long time series into several short time series chunks. Specially, given a long time series with length of , we start from an arbitrary time point and obtain a short time series with sequence length . Consequently, we could obtain short time series chunks in total. This shingling data preprocessing allows modeling many independent time series with varying length simultaneously given the same chunk size . Besides, the time series split as well as the ELBO factorized over time points in (10
) could naturally support an efficient and unbiased estimation on a subset
of the short time series chunks wherein , thus allowing using minibatch optimizer, e.g., Adam kingma2014adam .Amortized VRNNaug. As depicted in Fig. 1, the whole model follows the encoderdecoder structure. Furthermore, we employ amortized variational inference, i.e., amortizing the parameters over neural networks, to build our proposed VRNNaug model more efficiently. In order to account for the recurrent structure in the state transition
, we build the latent variables upon the outputs of for example RNN models, and map them to the related Gaussian variables through multilayer perceptron (MLP) as
(11)  
(12)  
(13) 
The above RNN cells, for instance, GRU cho2014learning and LSTM hochreiter1997long , naturally encode the sequence characteristics. It is notable that the sequence modelings , and are not limited to RNN models. The recently developed sequence extractors, for example, temporal convolutional networks (TCN) bai2018empirical and transformer vaswani2017attention , are also available here. Besides, note that different from and , the hidden state in the above expression solely holds for training, see the inference network in Fig. 1. When predicting, it accepts the single sample from previous output distribution as , see the generative network in Fig. 1. Finally, different from the posterior state transition , the latent prior takes the unit Gaussian for simplicity.
As for the likelihood (decoder), we follow the architecture of DenseNet huang2017densely to extend the original by extracting and concatenating all the previous inputs in Fig. 1 as
(14) 
where the Gaussian parameters are parameterized through MLPs as
(15)  
(16) 
This sort of dense connection between encoder and decoder eases model training because of the blockcross feature sharing. Note that the likelihood is not limited to Gaussian, it can adopt other taskdependent distributions. For instance, we could choose the negativebinomial likelihood for positive count data, and the Bernoulli likelihood for binary data salinas2020deepar .
Prediction. Given the trained deep sequence model, in order to perform recursive prediction over the future time , we first sample from the posterior state transition wherein now is unrolled over the previous output samples since the true output observations are inaccessible when predicting. Thereafter, we pass the latent sample through the decoder to output the prediction sample at current time point. It is notable that although the model employs Gaussian variables, the final predictive distribution is not Gaussian. Hence, we could use MCMC sampling to output samples for empirically quantifying the implicit predictive distribution.
Discussions. So far, we have described the proposed deep sequence model with known input signal at any time point, which includes additional covariate that may affect the output. For time series without additional control signals, one could simply use the time features, for example, hourofday and dayofweek, as the additional for the model.
Alternatively, it is known that one could convert this kind of time series into supervised learning fashion through the autoregressive manner. That is, given a lookback window size
, we use the historical data as the input control signal to predict at future time . This can be naturally done by our model in Fig. 1 by simply dropping out in both inference and generative networks. Note that our model is mainly designed for free forecasting, i.e., multistep ahead prediction. If one wants to conduct onestep ahead prediction, we could simply feed the ground truth into both the inference and generative networks.Finally, the relation and difference of our model to existing literature should be highlighted. The most closely related works come from bayer2014learning ; fraccaro2018deep ; li2021learning ; gedon2020deep ; salinas2020deepar . In comparison to the similar VAEtype probabilistic sequence models, the proposed VRNNaug on the augmented inputs brings several improvements: (i) it adopts hybrid output in (5) during training for greatly alleviating the inconsistency issue; and (ii) it employs the generalized autoregressive strategy in (6) and (7) to form the recurrent inputs for better capturing temporal dependency. The benefits brought by these improvements will be verified by the comparative study in section 4.2 and the ablation study in section 4.3.
4 Numerical experiments
In this section, we first showcase the methodological characteristics of proposed VRNNaug model on two toy cases, followed by the extensive comparison against existing RNN assisted stochastic sequence models on eight system identification benchmarks and a realworld centrifugal compressor sensor data.
We implement our model using PyTorch on a Linux workstation with TITAN RTX GPU. The detailed experimental configurations are provided in Appendix
C. As for the probabilistic forecasting provided by our model, we quantify the quality of predictive distribution by the 50 and90quantile losses
rangapuram2018deep . Generally, given the th time series output () and its corresponding quantile prediction , the quantile loss is expressed as(17) 
where the parameter . In the following experiments, we use to quantify the accuracy of prediction mean, and to represent the coverage of predictive distribution.
4.1 Toy cases
We below study the proposed VRNNaug model on two toy cases (the linear Gaussian system and the heteroscedastic motorcycle case) with different features.
Linear Gaussian system. We first consider a onedimensional linear Gaussian system gedon2020deep expressed as
(18)  
(19) 
where is the process noise, and is the measurement noise. We use this dynamic model to sequentially generate 2000 training points, 2000 validation points and 5000 test points. Note that when generating training and validation signals, an excitation input signal polluted with uniform random noise is employed for the linear system. Differently, the input signal for the test data writes as
The predictive distributions of the proposed VRNNaug together with the true observations are depicted in Fig. 2. It is observed that for this linear system, the flexible, nonlinear VRNNaug successfully captures the underlying linear dynamics. This linear model has also been used in gedon2020deep to test another deep SSM model, named STORN bayer2014learning , and the graybox model SSEST lennart1999system with two latent states. Though successfully capturing the dynamics, they conservatively overestimate the uncertainty of prediction. Our model however is found to well estimate the uncertainty in Fig. 2.
Heteroscedastic motorcycle data. We next study a more complicated motorcycle time series dataset silverman1985some composed of 133 points with time input (ms) and acceleration output. The motorcycle data takes from an experiment on the efficacy of crash helmets, and yields multiple acceleration observations at some time points. In comparison to the previous linear Gaussian system with homoscedastic noise, this nonlinear dataset contains more challenging timevarying noise.
We use all the 133 data points for training, validation and testing, and extract the time feature as input signal for our VRNNaug model. As depicted in Fig. 3, the proposed VRNNaug captures the nonlinear dynamics of underlying acceleration curve, for example, the constant behavior before ms and the step behavior around ms. More interestingly, it well describes the timevarying noise, for example, the pretty small perturbation before ms and the large and timevarying noise thereafter.
4.2 System identification benchmarks
This section verifies the performance of proposed deep sequence model on eight popular system identification benchmarks from dynamics systems like hydraulic actuator, electric motors, hair dryer, furnace, tank, cascaded tanks system schoukens2016cascaded , F16 aircraft ground vibration test noel2017f , and wienerhammerstein process noise system schoukens2016wiener .^{2}^{2}2The first five datasets are available at https://homes.esat.kuleuven.be/~smc/daisy/daisydata.html and the last three datasets are collected at http://nonlinearbenchmark.org/index.html. Table 1 summarizes the sequence length and dimensions of the considered time series datasets. It is found that all these datasets except f16gvt have a single input signal, and the particular tank dataset has two outputs. The goal of sequence model is to infer the underlying highdimensional latent state and the relevant nonlinear dynamics from data.
dataset  

actuator  1024  1  1 
cascaded tank (ctank)  2028  1  1 
drive  500  1  1 
dryer  1000  1  1 
f16gvt  32768  2  1 
gas furnace (furnace)  296  1  1 
tank  2500  1  2 
wiener hammerstein (hammerstein)  32768  1  1 
The performance of VRNNaug is compared against stateoftheart RNN assisted deep stochastic SSM models, including (i) the linear SSM model with the parameters learned through neural networks (DeepLSSM) rangapuram2018deep ; (ii) the basic VAE structure with the encoder built upon RNN, named VAERNN fraccaro2018deep ; (iii) the stochastic RNN (STORN) following encoderdecoder structure bayer2014learning ; (iv) the deep autoregressive model (DeepAR) salinas2020deepar ; and finally, (v) the deep SSM for probabilistic forecasting, donated as DSSMF li2021learning .^{3}^{3}3The DSSMF in li2021learning additionally adopts an automatic relevance determination network to identify the importance of components in input signal , which is not implemented in our comparison. Tables 2 and 3 report the and results of various deep stochastic SSM models over ten runs, respectively. Note that the best and secondbest results on the eight benchmarks are marked as gray and light gray, respectively, in order to highlight the winners. We have the following findings from the comparative study.
Datasets  DeepLSSM  VAERNN  STORN  DeepAR  DSSMF  VRNNaug 

actuator  0.3866  0.3529  0.3503  0.3839  0.3737  0.3241 
ctank  0.2777  0.1558  0.1065  0.0965  0.1028  0.0715 
drive  0.3157  0.4925  0.3792  0.5083  0.5056  0.2098 
dryer  0.0201  0.0203  0.0162  0.0179  0.0153  0.0158 
f16gvt  0.7108  0.0723  0.0735  0.2973  0.3021  0.1540 
furnace  0.0238  0.0436  0.0225  0.0231  0.0242  0.0226 
tank ()  0.1005  0.1885  0.0785  0.0539  0.0529  0.0591 
tank ()  0.0597  0.0615  0.0559  0.0414  0.0437  0.0429 
hammerstein  NA  0.9400  0.7364  0.4676  0.5088  0.2261 
Datasets  DeepLSSM  VAERNN  STORN  DeepAR  DSSMF  VRNNaug 

actuator  0.2059  0.2302  0.1894  0.1965  0.1925  0.1776 
ctank  0.2209  0.0979  0.0656  0.0758  0.0611  0.0358 
drive  0.1610  0.1966  0.1737  0.2173  0.2245  0.1029 
dryer  0.0079  0.0093  0.0060  0.0082  0.0062  0.0067 
f16gvt  0.3180  0.0342  0.0353  0.1310  0.1356  0.0733 
furnace  0.0210  0.0176  0.0221  0.0229  0.0222  0.0197 
tank ()  0.0386  0.1219  0.0605  0.0592  0.0566  0.0318 
tank ()  0.0287  0.0286  0.0344  0.0463  0.0504  0.0285 
hammerstein  NA  0.3992  0.3689  0.2024  0.2122  0.1443 
It is first observed that due to the linear SSM framework, the DeepLSSM is hard to learn the nonlinear dynamic behaviors from sequence data, even though it attempts to learn the local, timevarying coefficients. For instance, though the DeepLSSM forecasts the rough profile on the ctank dataset, its prediction however leaves far away from the ground truth in comparison to the others, which is indicated by the poor results in Table 2. Furthermore, it fails on the large and complicated hammerstein dataset. Therefore, in comparison to other competitors, the DeepLSSM showcases the worst performance in general in terms of both and . Despite the linear model structure, the undesirable performance of DeepLSSM may be attributed to the typeI maximum likelihood framework wherein the latent variables are not integrated out.
As for the VAEtype deep SSM models, the more complicated STORN is found to outperform the simple VAERNN on most cases in terms of both and . This is attributed to the description of the recurrency in previous outputs when determining the posterior . But as has been discussed, the usage of current output at time raises the inconsistency issue: when forecasting time series through generative networks, we have no access to current output and then have to sample from prior. This thus makes STORN’s results in Table 3 slightly worse than DeepAR and DSSMF which use the autoregressive strategy to alleviate the issue. Besides, the sampling from prior when generating time series data makes VAERNN and STORN tend to underestimate the uncertainty, which will be illustrated on the compressor case in next section.
Finally, for the last three deep SSM models, all of them employ sort of autoregressive strategy to avoid the inconsistency issue. In comparison to DeepAR and DSSMF, the proposed VRNNaug adopts (i) the hybrid output to further alleviate the gap between training and predicting; and (ii) the generalized autoregressive strategy on output , input signal and augmented input for extracting long term historical patterns. As a result, it performs the best on these system identification benchmarks in terms of both and .^{4}^{4}4The VRNNaug forecasting in future time period on the eight system identification benchmarks is respectively illustrated in Appendix D. Besides, it is found that akin to DeepLSSM, the DeepAR also trains the model in the simple typeI maximumlikelihood framework, thus making it perform slightly worse than the variant DSSMF derived through variational inference, especially in terms of the criterion.
4.3 Ablation study
We here perform ablation study of the proposed VRNNaug model on the drive dataset.The ablation study includes two additional VRNNaug variants in order to reveal individual effects of components: (i) the VRNNaugv1 which resembles DeepAR by using only the autoregressive strategy on ; and (ii) the VRNNaugv2 which uses the generalized autoregressive strategy in (6) and (7), but simply uses instead of the hybrid in (5) to determine . Differently, the complete VRNNaug model adopts both the previous hybrid outputs and the generalized autoregressive strategy to alleviate the inconsistency issue and improve the model capability.
Fig. 4 depicts the comparative results of three VRNNaug models in terms of both and criteria on the drive dataset. It is observed that:

By solely using the autoregressive strategy akin to DeepAR and DSSMF, the VRNNaugv1 is incapable of capturing the underlying nonlinear dynamical behavior on the drive dataset. Contrarily, the generalized autoregressive strategy summarizes all the long term historical patterns of not only but also and . Consequently, both VRNNaugv2 and VRNNaug significantly outperform the basic VRNNaugv1;

The hybrid output used in training helps further fill the gap between inference and generation, since now we always consider the prediction samples in the two phases. Hence, it brings considerable improvements for VRNNaug over VRNNaugv2.
4.4 Centrifugal compressor sensor data forecasting
This section verifies the performance of the proposed VRNNaug on an 11variable time series data collected from a realworld centrifugal compressor with rotating speed of 5556 RPM xu2016bayesian . As illustrated in Fig. 5, this time series data was recorded every one hour from March to October, 2013, by placing sensors at the inlet, the inlet guide vanes, the exhaust, the bearing, and the axis of centrifugal compressor, thus resulting in 3710 raw data points for each variable.
As for this realworld time series task, we perform data preprocessing before applying deep stochastic SSM models on it for forecasting. These 11variable time series data is collected manually from sensors, thus may containing human errors, outliers, noise and missing values. Hence, we first perform outlier analysis to identify and remove the outliers in each raw time series variable via the boxplot method, thus resulting in 3524 data points. Secondly, in order to eliminate the effect of magnitudes within multivariate time series, we normalize all the variables into the same range
. Thirdly, we conduct discrete wavelet packet transform (DWPT) analysis by the Daubechies wavelet of eightorder daubechies1992ten as well as the Bayesian thresholding technique in order to denoise our time series data.By processing and denoising the raw data, we finally obtain the compressor time series containing 3524 data points. Following the experimental configurations elaborated in Appendix C, our aim is to predict the axis displacement at the last 10% time points, given the remaining ten variables as covariates.
Method  

DeepLSSM  0.0597  0.0428 
VAERNN  0.0563  0.0259 
STORN  0.0540  0.0270 
DeepAR  0.0674  0.0531 
DSSMF  0.0600  0.0581 
VRNNaug  0.0488  0.0255 
Table 4 reports the forecasting results of VRNNaug against that of other deep stochastic SSM models on the realworld compressor dataset. The superiority of VRNNaug over other competitors has again been observed in terms of both the and criteria for forecasting the future axis displacement. Moreover, Fig. 6 illustrates the forecasting of these deep stochastic SSM models for the future axis displacement. It is observed that VRNNaug provides more accurate prediction mean. Besides, the uncertainty of DeepAR and DSSMF is overestimated, thus indicated by the poor results in Table 4. Contrarily, VAERNN and STORN tend to underestimate the uncertainty, which may be attributed to the sampling from prior rather than the informative posterior in the generative network.^{5}^{5}5Note that in comparison to the overestimated uncertainty, this conservative uncertainty is favored by the criterion since it might cover the observations. The proposed VRNNaug achieves a tradeoff between the two extreme cases, thus providing desirable estimation of uncertainty, which is revealed by the best results.
5 Conclusions
In this paper, we present a variational RNN model on the augmented recurrent space for probabilistic time series forecasting. When using variational inference to derive the ELBO as objective for model training, we figure out the issue of inconsistency that would deteriorate the model performance. Hence, we propose (i) feeding the hybrid output as input at next time step, and (ii) presenting the generalized autoregressive strategy in order to form recurrent inputs. Through extensive experiments on various time series tasks against existing deep stochastic SSM models, we have showcased the superiority of proposed VRNNaug model by quantifying predictive distribution. Future work would attempt to extend the proposed model to challenging scenarios in dynamic systems with for example missing outputs, irregular time points, and knowledge transfer across multivariate time series.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (52005074), and the Fundamental Research Funds for the Central Universities (DUT19RC(3)070). Besides, it was partially supported by the Research and Innovation in Science and Technology Major Project of Liaoning Province (2019JH110100024), and the MIIT Marine Welfare Project (Z135060009002).
Appendix A Acronyms
Appendix B Notations
Appendix C Experimental details
Toy cases. For the two toy cases, we adopt the same settings for the following system identification benchmarks except that (i) the chunk size is set as and the latent dimensionality takes on the motorcycle
dataset, and (ii) the Adam optimizer is ran over 200 epochs.
System identification benchmarks. The experimental configurations on the eight system identification datasets (, , , , , , , and ) are detailed as below.
Firstly, we perform standardization on the sequence inputs and outputs of time series data to have zero mean and unit variance. We then split the whole data by choosing the first 50% as training set, the middle 20% as validation set, and the final 30% as testing set. Besides, we use the shingling strategy
leskovec2014mining to further split the training and validation datasets with the chunk size (i.e., sequence length) , thus resulting into many short time series sets.As for model configuration of the proposed VRNNaug, we build three GRU models in (11)(13) using a single hidden layer with 100 units. The weight parameters of these GRU models are initialized by the orthogonal initialization saxe2013exact
, and the bias parameters are initialized with zeros. As for the MLPs in VRNNaug, we adopt the fullyconnected (FC) neural network with three hidden layers, the ReLU activation, and skip connection
he2016deep if possible. The number of units for the hidden layers is set as wherein is the number of input features. Finally, for the VAE structure in Fig. 1, we set the latent dimensionality of as .As for model training, the Adam optimizer is employed with the batch size of and the maximum number of epochs as 100. Particularly, we have a scheduled learning strategy. That is, we start using a learning rate of within the first ten epochs, and then check it every ten epochs: if the validation loss cannot decrease within the last ten epochs, we use a decreasing factor 0.5 to adjust the learning rate. We stop the training when (i) it reaches the maximum number of epochs; or (ii) the adjusted learning rate is below .
As for model forecasting, we start from cold and recursively sample points from the predictive distributions of deep stochastic SSM models at the future time period , and then use the prediction samples to evaluate the related and quantile criteria.
It is finally notable that we repeat the experiments on each time series task ten times by starting with different random seeds in order to comprehensively evaluate the performance of deep sequence models.
Centrifugal compressor sensor data. After obtaining the processed compressor sequence data with and , we adopt the similar experimental settings on the above system identification benchmarks for model training and forecasting. The differences are that we here split the whole data by choosing the first 80% as training set, the middle 10% as validation set, and the final 10% as testing set; and we use the shingling strategy to split the training and validation datasets with the chunk size .
Appendix D Illustration of VRNNaug on benchmarks
Fig. 7 depicts the predictive distributions of the proposed VRNNaug on eight system identification benchmarks, respectively. Note that for the benchmarks with many test time points (e.g., ), we only illustrate the predictions at the first 500 time points. It is observed that the predictions of VRNNaug well describe the sequence behaviors on most benchmarks.
References
References
 (1) K. J. Åström, P. Eykhoff, System identification—A survey, Automatica 7 (2) (1971) 123–162.
 (2) M. Verhaegen, V. Verdult, Filtering and system identification: A least squares approach, Cambridge university press, 2007.
 (3) R. Zhao, R. Yan, Z. Chen, K. Mao, P. Wang, R. X. Gao, Deep learning and its applications to machine health monitoring, Mechanical Systems and Signal Processing 115 (2019) 213–237.

(4)
W. Yu, I. Y. Kim, C. Mechefske, Remaining useful life estimation using a bidirectional recurrent neural network based autoencoder scheme, Mechanical Systems and Signal Processing 129 (2019) 764–780.

(5)
Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, A. K. Nandi, Applications of machine learning to machine fault diagnosis: A review and roadmap, Mechanical Systems and Signal Processing 138 (2020) 106587.
 (6) A. Tadjer, A. Hong, R. B. Bratvold, Machine learning based decline curve analysis for shortterm oil production forecast, Energy Exploration & Exploitation (2021) 01445987211011784.
 (7) R. Hyndman, A. B. Koehler, J. K. Ord, R. D. Snyder, Forecasting with exponential smoothing: the state space approach, Springer Science & Business Media, 2008.
 (8) G. E. Box, G. M. Jenkins, Some recent advances in forecasting and control, Journal of the Royal Statistical Society: Series C (Applied Statistics) 17 (2) (1968) 91–109.
 (9) J. Durbin, S. J. Koopman, Time series analysis by state space methods, Oxford university press, 2012.
 (10) T. B. Schön, A. Wills, B. Ninness, System identification of nonlinear statespace models, Automatica 47 (1) (2011) 39–49.
 (11) H. Hanachi, J. Liu, A. Banerjee, Y. Chen, Sequential state estimation of nonlinear/nongaussian systems with stochastic input for turbine degradation estimation, Mechanical Systems and Signal Processing 72 (2016) 32–45.

(12)
L. Rabiner, B. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine 3 (1) (1986) 4–16.
 (13) Y. Liao, L. Zhang, C. Liu, Uncertainty prediction of remaining useful life using long shortterm memory network based on bootstrap method, in: IEEE International Conference on Prognostics and Health Management, IEEE, 2018, pp. 1–8.
 (14) R. E. Kalman, A new approach to linear filtering and prediction problems, Journal of Basic Engineering 82 (1) (1960) 35–45.
 (15) M. West, J. Harrison, Bayesian forecasting and dynamic models, Springer Science & Business Media, 2006.
 (16) C. K. Williams, C. E. Rasmussen, Gaussian processes for machine learning, MIT press Cambridge, MA, 2006.
 (17) R. Frigola, Y. Chen, C. Rasmussen, Variational Gaussian process statespace models, in: Advances in Neural Information Processing Systems, Vol. 27, 2014, pp. 3680–3688.
 (18) A. D. Ialongo, M. van der Wilk, J. Hensman, C. E. Rasmussen, Overcoming meanfield approximations in recurrent Gaussian process models., in: International Conference on Machine Learning, 2019, pp. 2931–2940.
 (19) H. Liu, Y.S. Ong, X. Shen, J. Cai, When gaussian process meets big data: A review of scalable GPs, IEEE Transactions on Neural Networks and Learning Systems 31 (11) (2020) 4405–4423.
 (20) E. Snelson, Z. Ghahramani, Sparse gaussian processes using pseudoinputs, in: Advances in Neural Information Processing Systems, MIT Press, 2006, pp. 1257–1264.

(21)
M. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in: Artificial Intelligence and Statistics, 2009, pp. 567–574.

(22)
J. Hensman, N. Fusi, N. D. Lawrence, Gaussian processes for big data, in: Uncertainty in Artificial Intelligence, Citeseer, 2013, p. 282.
 (23) H. Liu, J. Cai, Y. Wang, Y.S. Ong, Generalized robust Bayesian committee machine for largescale Gaussian process regression, in: International Conference on Machine Learning, 2018, pp. 1–10.
 (24) K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734.
 (25) S. Hochreiter, J. Schmidhuber, Long shortterm memory, Neural Computation 9 (8) (1997) 1735–1780.
 (26) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, arXiv preprint arXiv:1409.3215.
 (27) J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: International Conference on Machine Learning, PMLR, 2017, pp. 1243–1252.
 (28) Z. C. Lipton, J. Berkowitz, C. Elkan, A critical review of recurrent neural networks for sequence learning, arXiv preprint arXiv:1506.00019.
 (29) W. Yu, I. Y. Kim, C. Mechefske, Analysis of different rnn autoencoder variants for time series classification and machine prognostics, Mechanical Systems and Signal Processing 149 (2021) 107322.
 (30) D. Salinas, V. Flunkert, J. Gasthaus, T. Januschowski, DeepAR: Probabilistic forecasting with autoregressive recurrent networks, International Journal of Forecasting 36 (3) (2020) 1181–1191.
 (31) L. Li, J. Yan, X. Yang, Y. Jin, Learning interpretable deep state space model for probabilistic time series forecasting, arXiv preprint arXiv:2102.00397.
 (32) S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, T. Januschowski, Deep state space models for time series forecasting, Advances in Neural Information Processing Systems 31 (2018) 7785–7794.
 (33) A. K. Yanchenko, S. Mukherjee, Stanza: A nonlinear state space model for probabilistic inference in nonstationary time series, arXiv preprint arXiv:2006.06553.
 (34) M. AlShedivat, A. G. Wilson, Y. Saatchi, Z. Hu, E. P. Xing, Learning scalable deep kernels with recurrent structure, The Journal of Machine Learning Research 18 (1) (2017) 2850–2886.
 (35) Q. She, A. Wu, Neural dynamics discovery via Gaussian process recurrent neural networks, in: Uncertainty in Artificial Intelligence, PMLR, 2020, pp. 454–464.
 (36) D. P. Kingma, M. Welling, Autoencoding variational Bayes, arXiv preprint arXiv:1312.6114.
 (37) J. Bayer, C. Osendorfer, Learning stochastic recurrent networks, arXiv preprint arXiv:1411.7610.
 (38) R. G. Krishnan, U. Shalit, D. Sontag, Deep kalman filters, arXiv preprint arXiv:1511.05121.
 (39) R. G. Krishnan, U. Shalit, D. Sontag, Structured inference networks for nonlinear state space models, in: Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence, 2017, pp. 2101–2109.
 (40) M. Fraccaro, Deep latent variable models for sequential data, Ph.D. thesis, Department of Applied Mathematics and Computer Science, Technical University of Denmark (2018).
 (41) D. Gedon, N. Wahlström, T. B. Schön, L. Ljung, Deep state space models for nonlinear system identification., arxiv:eess.SY.
 (42) S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
 (43) J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of massive data sets, Cambridge university press, 2014.
 (44) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
 (45) S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, arXiv preprint arXiv:1803.01271.
 (46) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.

(47)
G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
 (48) L. Lennart, System identification: Theory for the user, PTR Prentice Hall, Upper Saddle River, NJ 28.
 (49) B. W. Silverman, Some aspects of the spline smoothing approach to nonparametric regression curve fitting, Journal of the Royal Statistical Society: Series B (Methodological) 47 (1) (1985) 1–21.
 (50) M. Schoukens, P. Mattson, T. Wigren, J.P. Noel, Cascaded tanks benchmark combining soft and hard nonlinearities, in: Workshop on nonlinear system identification benchmarks, 2016, pp. 20–23.
 (51) J.P. Noël, M. Schoukens, F16 aircraft benchmark based on ground vibration test data, in: 2017 Workshop on Nonlinear System Identification Benchmarks, 2017, pp. 19–23.
 (52) M. Schoukens, J.P. Noel, Wienerhammerstein benchmark with process noise, in: Workshop on nonlinear system identification benchmarks, 2016, pp. 15–19.
 (53) X. Shengli, J. Xiaomo, H. Jinzhi, Y. Shuhua, W. Xiaofang, Bayesian wavelet pca methodology for turbomachinery damage diagnosis under uncertainty, Mechanical Systems and Signal Processing 80 (Dec) (2016) 1–18.
 (54) I. Daubechies, Ten lectures on wavelets, SIAM, 1992.
 (55) A. M. Saxe, J. L. McClelland, S. Ganguli, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, arXiv preprint arXiv:1312.6120.
 (56) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.