1 Introduction
As a fundamental problem in machine learning, probabilistic sequence modeling aims at capturing the sequential correlations in both short and long ranges. Among many possible model choices, deep autoregressive models [13, 33] have become one of the most widely adopted solutions. Typically, a deep autoregressive model factorizes the likelihood function of a sequence $\mathbf{x}$ in an autoregressive manner, i.e., $p(\mathbf{x}) = \prod_t p(x_t \mid \mathbf{x}_{<t})$. Then, a neural network (e.g., an RNN) is employed to encode the conditional context $\mathbf{x}_{<t}$ into a compact hidden representation $h_{t-1}$, which is then used to define the output distribution $p(x_t \mid \mathbf{x}_{<t})$. As deep autoregressive models are fully tractable, they can be trained by maximum likelihood estimation (MLE) using back-propagation.
Despite the state-of-the-art (SOTA) performance in many domains [7, 3, 26, 5], the hidden representations of standard autoregressive models are produced in a completely deterministic way. Hence, the stochastic aspects of the observed sequences can only be modeled by the output distribution, which, however, usually has a simple parametric form such as a unimodal distribution or a finite mixture of unimodal distributions. A potential weakness of such simple forms is that they may not be sufficiently expressive for modeling real-world sequential data with complex stochastic dynamics.
Recently, many efforts have been made to enrich the expressive power of autoregressive models by injecting stochastic latent variables into the computation of hidden states. Notably, relying on the variational auto-encoding (VAE) framework [19, 29], stochastic recurrent models (SRNN) have outperformed standard RNN-based autoregressive models by a large margin in modeling raw sound-wave sequences [1, 4, 8, 12, 23].
However, the success of stochastic latent variables does not necessarily generalize to other domains such as text and images. For instance, Goyal et al. [12] report that an SRNN trained by Z-Forcing lags behind a baseline RNN in language modeling. Similarly, for the density estimation of natural images, PixelCNN [25, 34, 31] consistently outperforms generative models with latent variables [14, 15, 28, 6, 20].
To better understand the discrepancy, we perform a re-examination of the role of stochastic variables in SRNN models. By carefully inspecting the previous experimental settings for sound-wave density estimation, and systematically analyzing the properties of SRNN, we identify two potential causes of the performance gap between SRNN and RNN. Controlled experiments are designed to test each hypothesis, where we find that previous evaluations impose an unnecessary restriction of fully factorized output distributions, which has led to an unfair comparison between SRNN and RNN. Specifically, under the factorized parameterization, SRNN can still implicitly leverage the intra-step correlation, i.e., the simultaneity [2], while the RNN baselines are prohibited from doing so. Meanwhile, we also observe that the posterior learned by SRNN can be outperformed by a simple hand-crafted posterior, raising serious doubt about the general effectiveness of injecting latent variables.
To provide a fair comparison, we propose an evaluation setting where both SRNN and RNN can utilize an autoregressive output distribution to model the intra-step correlation explicitly. Under the new setting, we re-evaluate SRNN and RNN on a diverse collection of sequential data, including human speech, MIDI music, handwriting trajectories and frame-permuted speech. Empirically, we find that sequential models with continuous latent variables fail to offer any practical benefits, despite their widely believed theoretical superiority. On the contrary, explicitly capturing the intra-step correlation with an autoregressive output distribution consistently performs better, substantially improving the SOTA performance in modeling speech signals. Overall, these observations show that the previously reported performance "advantage" of SRNN is merely the result of a long-standing experimental bias of using factorized output distributions.
2 Related Work
In the field of probabilistic sequence modeling, many efforts prior to deep learning have been devoted to State Space Models [30], such as the Hidden Markov Model [27] with discrete states and the Kalman Filter [17] whose states are continuous.
Recently, the focus has shifted to deep sequential models, including tractable deep autoregressive models without any latent variables and deep stochastic models that combine the powerful nonlinear computation of neural networks with the stochastic flexibility of latent-variable models. The recurrent temporal RBM [32] and the RNN-RBM [2] are early examples of how latent variables can be incorporated into deep neural networks. After the introduction of the VAE, stochastic back-propagation made it easy to combine deep neural networks and latent-variable models, leading to the stochastic recurrent models introduced in Section 1, temporal sigmoid belief networks [10], deep Kalman Filters [21], deep Markov Models [22], Kalman variational autoencoders [9] and many other variants. Johnson et al. [16] provide a general discussion on how classic graphical models and deep neural networks can be combined.
3 Background
In this section, we briefly review SRNN and RNN for probabilistic sequence modeling. Throughout the paper, we will use a bold font $\mathbf{x}$ to denote a sequence, $\mathbf{x}_{<t}$ and $\mathbf{x}_{\leq t}$ to indicate the subsequences consisting of the first $t-1$ and $t$ elements respectively, and $x_t$ to represent the $t$-th element. Note that $x_t$ can either be a scalar or a vector. In the latter case, $x_{t,i}$ refers to the $i$-th element of the vector $x_t$.
Given a set of sequences $\mathcal{D}$, we are interested in building a density estimation model for sequences. A widely adopted solution is to employ an autoregressive model powered by a neural network, and utilize MLE to perform the training:
$$\max_\theta \sum_{\mathbf{x} \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta\big(x_t \mid \mathbf{x}_{<t}\big), \quad (1)$$
where $T$ is the length of the sequence $\mathbf{x}$.
In practice, the conditional distribution $p_\theta(x_t \mid \mathbf{x}_{<t})$ is usually jointly modeled by three sub-modules:

- The predefined distribution family $\mathcal{P}$ of the output distribution $p_\theta(x_t \mid \mathbf{x}_{<t})$, such as Gaussian or Categorical;
- The sequence model $g$, which encodes the contextual sequence $\mathbf{x}_{<t}$ into a compact hidden vector $h_{t-1} = g(\mathbf{x}_{<t})$;
- The output model $f$, which transforms the hidden vector $h_{t-1}$ into the parameters of a distribution in the chosen family.
For instance, when Gaussian is chosen as the output distribution family, $f$ and $g$ jointly transform $\mathbf{x}_{<t}$ into the predicted "mean" and "variance" of the Gaussian, i.e., $p_\theta(x_t \mid \mathbf{x}_{<t}) = \mathcal{N}\big(x_t; \mu_t, \sigma_t^2\big)$ with $[\mu_t, \sigma_t] = f(g(\mathbf{x}_{<t}))$.
Under this general framework, RNN and SRNN can be seen as two different instantiations of the sequence model.
When an RNN is employed as the sequence model $g$, the hidden states are produced recurrently:
$$h_t = \mathrm{RNN}\big(x_t, h_{t-1}\big). \quad (2)$$
As we have discussed in Section 1, the computation inside the RNN is fully deterministic. Hence, in order to model a complex distribution $p(x_t \mid \mathbf{x}_{<t})$, one has to rely on a sufficiently rich output distribution family.
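To make this deterministic pipeline concrete, here is a minimal numeric sketch of an RNN density model in pure Python, with a unit-variance Gaussian output distribution. The scalar state and the weights `w_x`, `w_h` are made-up simplifications for illustration, not the paper's architecture:

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def rnn_log_likelihood(xs, w_x=0.5, w_h=0.8):
    """log p(x) = sum_t log p(x_t | x_{<t}), with a deterministic recurrence
    h_t = tanh(w_x * x_t + w_h * h_{t-1}) playing the role of Eqn. (2)."""
    h = 0.0          # h_0: fixed initial state, so the first prediction is N(0, 1)
    total = 0.0
    for x_t in xs:
        mu, sigma = h, 1.0           # output model f maps h_{t-1} to Gaussian params
        total += gaussian_logpdf(x_t, mu, sigma)
        h = math.tanh(w_x * x_t + w_h * h)   # deterministic state update
    return total

ll = rnn_log_likelihood([0.1, -0.3, 0.2])
```

Because the recurrence is deterministic, all stochasticity of the data must be absorbed by the per-step Gaussian, which is exactly the limitation discussed above.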
To improve the model expressiveness, SRNN takes an alternative route and incorporates continuous latent variables into the sequence model. Typically, SRNN associates the observed data sequence $\mathbf{x}$ with a sequence of latent variables $\mathbf{z} = [z_1, \dots, z_T]$, one for each step. With latent variables, the internal dynamics of the sequence model is not deterministic any more, offering a theoretical possibility to capture more complex stochastic patterns. However, the improved capacity comes with a computational burden — the log-likelihood is generally intractable due to the integral:
$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p_\theta(\mathbf{z})\, d\mathbf{z}.$$
Hence, standard MLE training cannot be performed.
To handle the intractability, SRNN utilizes the VAE framework and maximizes the evidence lower bound (ELBO) of the log-likelihood in Eqn. (1) for training:
$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\big[\log p_\theta(\mathbf{x} \mid \mathbf{z})\big] - \mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big),$$
where $q_\phi(\mathbf{z} \mid \mathbf{x})$ is the approximate posterior distribution modeled by an encoder network.
Computationally, several SRNN variants have been proposed [1, 4, 8, 12], mostly differing in how the generative distribution and the variational posterior are parameterized. In this work, we consider the following parameterization. Firstly, a forward RNN and a backward RNN are employed to encode $\mathbf{x}$ into two sequences of forward and backward vectors, respectively:
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{RNN}}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{RNN}}\big(x_t, \overleftarrow{h}_{t+1}\big),$$
where $[\cdot\,;\cdot]$ below denotes vector concatenation. Based on the forward and backward vectors, the per-step prior and posterior are constructed:
$$p_\theta\big(z_t \mid \mathbf{x}_{<t}\big) = \mathcal{N}\big(\mu_p(\overrightarrow{h}_{t-1}),\, \sigma_p^2(\overrightarrow{h}_{t-1})\big), \qquad q_\phi\big(z_t \mid \mathbf{x}\big) = \mathcal{N}\big(\mu_q([\overrightarrow{h}_{t-1}; \overleftarrow{h}_t]),\, \sigma_q^2([\overrightarrow{h}_{t-1}; \overleftarrow{h}_t])\big).$$
Then, for each step $t$, a sample $z_t$ is drawn from the posterior $q_\phi$ during training, or from the prior $p_\theta$ during evaluation. Given the sampled values, we employ an additional RNN to merge $\mathbf{z}_{\leq t}$ and $\mathbf{x}_{<t}$ into the hidden state $h_t$, while capturing the dependency among $\mathbf{z}_{\leq t}$:
$$h_t = \mathrm{RNN}_{\mathrm{merge}}\big([z_t; x_{t-1}], h_{t-1}\big). \quad (3)$$
Finally, similar to the RNN case, the hidden state $h_t$ is fed into the output model, producing the output distribution $p_\theta(x_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t})$.
Under this parameterization, the ELBO reduces to
$$\mathcal{L}(\mathbf{x}; \theta, \phi) = \sum_{t=1}^{T} \mathbb{E}_{q_\phi(\mathbf{z}_{\leq t} \mid \mathbf{x})}\big[\log p_\theta\big(x_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big)\big] - \sum_{t=1}^{T} \mathrm{KL}\big(q_\phi(z_t \mid \mathbf{x}) \,\|\, p_\theta(z_t \mid \mathbf{x}_{<t})\big). \quad (4)$$
Compared to previous approaches [4, 8, 12] that employ a recurrent latent-to-latent connection to capture the dependency between latent variables, we instead use a factorized form and then stack another RNN on the sequence of sampled latent variables to model the dependency, as shown in Eqn. (3). We find this simplification yields comparable or even better performance under the same model size but significantly improves the speed. For more details, we refer readers to Supplementary A.
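As a numeric illustration of a single step of the ELBO objective above, the following sketch uses scalar Gaussians; the prior and posterior parameters (`mu_p`, `sig_p`, `mu_q`, `sig_q`) are hand-picked stand-ins for the outputs of the forward/backward encoders, and the output variance is fixed to 1. This is a toy rendering of the training objective, not the paper's implementation:

```python
import math, random

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) )."""
    return (math.log(sig_p / sig_q)
            + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2) - 0.5)

def elbo_one_step(x_t, mu_p, sig_p, mu_q, sig_q, rng):
    # reparameterized one-sample estimate: z_t ~ q, then reconstruction - KL
    z = mu_q + sig_q * rng.gauss(0.0, 1.0)
    recon = gaussian_logpdf(x_t, z, 1.0)   # log p(x_t | z_t, x_{<t})
    return recon - kl_gauss(mu_q, sig_q, mu_p, sig_p)
```

Training maximizes the average of such one-sample estimates over all steps; the KL term is analytic for Gaussians, so only the reconstruction term is sampled.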
4 Revisiting SRNN for Speech Modeling
4.1 Previous Setting for Speech Density Estimation
To compare SRNN and RNN, previous studies largely rely on the density estimation of sound-wave sequences. Usually, a sound-wave dataset consists of a collection of audio sequences with a sample rate of 16kHz, where each frame (element) of the sequence is a scalar in $[-1, 1]$, representing the normalized amplitude of the sound.
Instead of treating each frame as a single step, Chung et al. [4] propose a multi-frame setting, where every $d = 200$ consecutive frames are taken as a single step. Effectively, the data can be viewed as a sequence of 200-dimensional real-valued vectors, i.e., $\mathbf{x} = [x_1, \dots, x_T]$ with $x_t \in \mathbb{R}^{200}$. During training, every 40 steps (8,000 frames) are taken as an i.i.d. sequence to form the training set.
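The multi-frame reshaping can be sketched in a few lines; `to_multiframe` is a hypothetical helper name, and the trailing remainder of the waveform is dropped for simplicity:

```python
def to_multiframe(wave, d=200, steps_per_seq=40):
    """Reshape a raw waveform into i.i.d. training sequences of 40 steps,
    each step being a d = 200 dimensional vector of consecutive frames."""
    frames_per_seq = d * steps_per_seq          # 8,000 frames per sequence
    n_seq = len(wave) // frames_per_seq         # drop the trailing remainder
    seqs = []
    for s in range(n_seq):
        chunk = wave[s * frames_per_seq:(s + 1) * frames_per_seq]
        seqs.append([chunk[i * d:(i + 1) * d] for i in range(steps_per_seq)])
    return seqs
```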
Under this data format, notice that the output distributions $p(x_t \mid \mathbf{x}_{<t})$ and $p(x_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t})$ now correspond to a $d$-dimensional random vector $x_t$. Therefore, how this multivariate distribution is parameterized can largely influence the empirical performance. That said, recent approaches [8, 12] have all followed Chung et al. [4] in employing a fully factorized parametric form which ignores the inner dependency:
$$p_{\mathrm{F}}\big(x_t \mid \mathbf{x}_{<t}\big) = \prod_{i=1}^{d} p\big(x_{t,i} \mid \mathbf{x}_{<t}\big), \quad (5)$$
$$p_{\mathrm{F}}\big(x_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big) = \prod_{i=1}^{d} p\big(x_{t,i} \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big). \quad (6)$$
Here, we have used the subscript "F" to emphasize that this choice effectively poses an independence assumption among the elements within each step. Despite its convenience, note that the restriction to a fully factorized form is not necessary at all. Nevertheless, we will refer to the models in Eqn. (5) and Eqn. (6), respectively, as factorized RNN (FRNN) and factorized SRNN (FSRNN) in the following.
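To see what the factorized restriction costs, consider the following toy comparison (a made-up data-generating process with moment-matched Gaussians, not the paper's models): on strongly correlated two-element "steps", a fully factorized density ignores the intra-step dependency, while conditioning the second element on the first captures it.

```python
import math, random

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

rng = random.Random(0)
# each "step" has two elements; the second nearly copies the first
data = [(a, a + rng.gauss(0.0, 0.1))
        for a in (rng.gauss(0.0, 1.0) for _ in range(2000))]

def std(vals):
    m = sum(vals) / len(vals)
    return max(1e-3, math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals)))

# factorized: independent zero-mean Gaussian per dimension
s1, s2 = std([p[0] for p in data]), std([p[1] for p in data])
ll_fact = sum(gaussian_logpdf(a, 0.0, s1) + gaussian_logpdf(b, 0.0, s2)
              for a, b in data)

# within-step conditional: p(b | a) = N(a, 0.1^2), matching the generator
ll_ar = sum(gaussian_logpdf(a, 0.0, s1) + gaussian_logpdf(b, a, 0.1)
            for a, b in data)
```

Because conditioning on the first element shrinks the predictive standard deviation of the second from roughly 1 to 0.1, `ll_ar` exceeds `ll_fact` by a wide margin — the same mechanism behind the gaps examined below.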
To provide a baseline for further discussion, we replicate the experiments under the setting introduced above on three speech datasets, namely TIMIT, VCTK and Blizzard. Following previous work [4], we choose a Gaussian mixture to model the per-frame distribution of FRNN, which enables basic multi-modality.
We report the averaged test log-likelihood in Table 1. For consistency with previous results in the literature, the results on TIMIT and Blizzard are sequence-level averages, while the result on VCTK is a frame-level average. As we can see, similar to previous observations, FSRNN outperforms FRNN on all three datasets by a dramatic margin.
Table 1: Test log-likelihood of the factorized models on three speech datasets.

Models   TIMIT    VCTK    Blizzard
FRNN     32,745   0.786   7,610
FSRNN    69,223   2.383   15,258
4.2 Decomposing the Advantages of Factorized SRNN
To understand why FSRNN outperforms FRNN by such a large margin, it is helpful to examine the effective output distribution of FSRNN after marginalizing out the latent variables:
$$p_{\mathrm{F}}\big(x_t \mid \mathbf{x}_{<t}\big) = \mathbb{E}_{p(\mathbf{z}_{\leq t} \mid \mathbf{x}_{<t})}\Big[\prod_{i=1}^{d} p\big(x_{t,i} \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big)\Big]$$
$$\neq \prod_{i=1}^{d} p\big(x_{t,i} \mid \mathbf{x}_{<t}\big) \ \text{in general}. \quad (7)$$
From this particular form, we can see two potential causes of the performance gap between FSRNN and FRNN in the multi-frame setting:

- Advantage under High Volatility: By incorporating the continuous latent variables, the distribution of FSRNN essentially forms an infinite mixture of simpler distributions (see the first line of Eqn. 7). As a result, the distribution is significantly more expressive and flexible, and it is believed to be particularly suitable for modeling high-entropy sequential dynamics [4].
The multi-frame setting introduced above matches this description well. Concretely, since the model is required to predict the next $d = 200$ frames all together in this setting, the long prediction horizon will naturally involve higher uncertainty. Therefore, the high volatility of the multi-frame setting may provide a perfect scenario for SRNN to exhibit its theoretical advantage in expressiveness.

- Utilizing the Intra-Step Correlation: From the second line of Eqn. (7), notice that the distribution after marginalization is generally not factorized any more, due to the coupling with $\mathbf{z}_{\leq t}$. In contrast, recall that the same distribution of the FRNN (Eqn. 5) is fully factorized. Therefore, in theory, a factorized SRNN could still model the correlation among the frames within each step, if properly trained, while the factorized RNN has no means to do so at all. Thus, SRNN may also benefit from this difference.
While both advantages could have jointly led to the performance gap in Table 1, their implications are totally different. The first advantage under high volatility is a unique property of latent-variable models that generative models without latent variables can hardly obtain. Therefore, if this property significantly contributes to the superior performance of FSRNN over FRNN, it suggests a more general effectiveness of incorporating stochastic latent variables.
Quite the contrary, being able to utilize the intra-step correlation is more like an unfair benefit to SRNN, since it is the unnecessary restriction to fully factorized output distributions in previous experimental designs that prevents RNNs from modeling the correlation. In practice, one can easily enable RNNs to do so by employing a non-factorized output distribution. In this case, it remains unclear whether this particular advantage will persist.
Motivated by the distinct implications, in the sequel, we will try to figure out how much each of the two hypotheses above actually contributes to the performance gap.
4.3 Advantage under High Volatility
In order to test the advantage of FSRNN in modeling highly volatile data in isolation, the idea is to construct a sequential dataset where each step consists of a single frame (i.e., a univariate variable), while there exists high volatility between every two consecutive steps.
Concretely, for each sequence $\mathbf{x}$, we create a subsequence by selecting one frame from every $s$ consecutive frames, i.e., $\tilde{\mathbf{x}} = [x_1, x_{1+s}, x_{1+2s}, \dots]$ for a stride $s$. Intuitively, a larger stride $s$ leads to a longer horizon between two selected frames and hence higher uncertainty. Moreover, since each step now corresponds to a single scalar, the second advantage (i.e., the potential confounding factor) automatically disappears.
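The stride construction can be sketched as a one-liner; the `offset` argument (an added convenience, not from the original setup) selects which frame of each window is kept:

```python
def stride_subsequence(x, s, offset=0):
    """Keep one frame out of every s consecutive frames."""
    return [x[i] for i in range(offset, len(x), s)]
```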
Following this idea, from the original datasets, we derive stride-TIMIT, stride-VCTK and stride-Blizzard with different stride values $s$, and evaluate RNN and SRNN on each of them. Again, we report the sequence- or frame-averaged test log-likelihood in Table 2.
Table 2: Test log-likelihood on the strided speech datasets.

         Stride = 50                 Stride = 200
Model    TIMIT    VCTK    Blizzard   TIMIT   VCTK    Blizzard
RNN      20,655   0.668   4,607      4,124   0.177   320
SRNN     14,469   0.601   3,603      1,137   0.003   1,231
Surprisingly, RNN consistently achieves better performance than SRNN in this setting. This suggests that the theoretically better expressiveness of SRNN does not help that much in high-volatility scenarios. Hence, this potential advantage does not really contribute to the performance gap observed in Table 1.
4.4 Utilizing the IntraStep Correlation
After ruling out the first hypothesis, it becomes more likely that being able to utilize the intrastep correlation actually leads to the superior performance of FSRNN. However, despite the nonfactorized form in Eqn. (7), it is still not clear how FSRNN computationally captures the correlation in practice. Here, we provide a particular possibility.
To facilitate the discussion, we first rewrite the ELBO in Eqn. (4) in terms of a reconstruction term and a KL term:
$$\mathcal{L}_{\mathrm{recon}} = \sum_{t=1}^{T} \mathbb{E}_{q_\phi(\mathbf{z}_{\leq t} \mid \mathbf{x})}\big[\log p_\theta\big(x_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big)\big], \quad (8)$$
$$\mathcal{L}_{\mathrm{KL}} = -\sum_{t=1}^{T} \mathrm{KL}\big(q_\phi(z_t \mid \mathbf{x}) \,\|\, p_\theta(z_t \mid \mathbf{x}_{<t})\big). \quad (9)$$
From Eqn. (8), notice that the vector $x_t$ we hope to reconstruct at step $t$ is included in the conditional input to the posterior $q_\phi(z_t \mid \mathbf{x})$. With this computational structure, the encoder can theoretically leak a subset of the vector $x_t$ into the latent variable $z_t$, and leverage the leaked subset to predict (reconstruct) the remaining elements of $x_t$. Intuitively, the procedure of using the leaked subset to predict the remaining subset is essentially exploiting the dependency between the two subsets, or in other words, the correlation within $x_t$.
To make this informal description more concrete, we construct a special example. Following the intuition above, we split the elements of the vector $x_t$ into two arbitrary disjoint subsets, the leaked subset $x_t^{(1)}$ and its complement $x_t^{(2)}$. Then, we consider a special posterior:
$$q\big(z_t \mid \mathbf{x}\big) = \delta\big(z_t = x_t^{(1)}\big), \quad (10)$$
where $\delta(z_t = x_t^{(1)})$ denotes a delta function that puts all the probability mass of $z_t$ on the single point $x_t^{(1)}$. Effectively, this posterior simply memorizes the leaked subset $x_t^{(1)}$. Under this delta posterior, if we further assume the output model reconstructs the leaked subset deterministically, i.e., $p(x_t^{(1)} \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}) = \delta(x_t^{(1)} = z_t)$, Eqn. (8) and (9) can be simplified into
$$\mathcal{L}_{\mathrm{recon}} = \sum_{t=1}^{T} \Big[\log \delta\big(x_t^{(1)} = x_t^{(1)}\big) + \log p\big(x_t^{(2)} \mid x_t^{(1)}, \mathbf{x}_{<t}\big)\Big],$$
$$\mathcal{L}_{\mathrm{KL}} = -\sum_{t=1}^{T} \Big[\log \delta\big(x_t^{(1)} = x_t^{(1)}\big) - \log p\big(x_t^{(1)} \mid \mathbf{x}_{<t}\big)\Big].$$
Notice that the term $\log \delta(x_t^{(1)} = x_t^{(1)})$ in $\mathcal{L}_{\mathrm{recon}}$ and the same term in $\mathcal{L}_{\mathrm{KL}}$ cancel out each other, because they are both degenerated delta distributions with the same random variable on both sides of the conditional bar. Thus, after the cancellation, the ELBO further reduces to
$$\mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{KL}} = \sum_{t=1}^{T} \Big[\log p\big(x_t^{(1)} \mid \mathbf{x}_{<t}\big) + \log p\big(x_t^{(2)} \mid x_t^{(1)}, \mathbf{x}_{<t}\big)\Big]. \quad (11)$$
Now, the second term above always conditions on the leaked subset $x_t^{(1)}$ to predict $x_t^{(2)}$, which exactly utilizes the correlation between the two subsets. From another perspective, the form of Eqn. (11) is equivalent to a particular autoregressive factorization of the output distribution:
$$p\big(x_t \mid \mathbf{x}_{<t}\big) = p\big(x_t^{(1)} \mid \mathbf{x}_{<t}\big)\, p\big(x_t^{(2)} \mid x_t^{(1)}, \mathbf{x}_{<t}\big). \quad (12)$$
In other words, with a proper posterior, FSRNN can recover a certain autoregressive parameterization, making it possible to utilize the intra-step correlation. More importantly, the conditioning on the leaked subset is not affected by the choice of fully factorized output distributions, since the information is passed through the posterior.
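The cancellation argument can be checked numerically in a discrete toy case, where the "delta" posterior is simply a point mass with zero entropy. The joint table below is made up; the check verifies that the ELBO under the copying posterior equals the autoregressive log-likelihood $\log p(x^{(1)}) + \log p(x^{(2)} \mid x^{(1)})$, i.e., Eqn. (11)/(12):

```python
import math

# toy joint distribution over a "step" with two binary elements (x1, x2);
# x1 plays the role of the leaked subset, x2 of its complement
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x1 = {v: joint[(v, 0)] + joint[(v, 1)] for v in (0, 1)}   # prior over z = x1
p_x2_given_x1 = {k: joint[k] / p_x1[k[0]] for k in joint}   # conditional

def elbo(x1, x2):
    # the point-mass posterior q(z | x) puts all mass on z = x1,
    # so the expectation over q collapses to a single term
    z = x1
    recon = math.log(1.0) + math.log(p_x2_given_x1[(z, x2)])  # log p(x1|z) + log p(x2|z)
    kl = -math.log(p_x1[z])            # KL(point mass || prior) = -log p(z)
    return recon - kl

for (x1, x2), p in joint.items():
    assert abs(elbo(x1, x2) - math.log(p)) < 1e-9   # ELBO = log p(x1) + log p(x2|x1)
```

In the discrete case the point mass contributes zero entropy, so the "delta vs. delta" terms vanish exactly, mirroring the cancellation in Eqn. (11).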
Although the analysis and construction above provide a theoretical possibility, we still lack concrete evidence to support the hypothesis that FSRNN significantly benefits from modeling the intra-step correlation. While it is difficult to verify this hypothesis in general, we can exploit the equivalence in Eqn. (12) to obtain some empirical evidence. Specifically, we can parameterize an RNN according to Eqn. (12), which is equivalent to an FSRNN with a delta posterior as defined in Eqn. (10). Therefore, by measuring the performance of this special RNN, we can get a conservative estimate of how much modeling the intra-step correlation can contribute to the performance of FSRNN.
To complete the special RNN idea, we still need to specify how $x_t$ is split into $x_t^{(1)}$ and $x_t^{(2)}$. Here, we consider two methods with different intuitions:

- Interleaving: The first method takes one out of every $k$ elements to construct $x_t^{(1)}$. Essentially, this method interleaves the two subsets $x_t^{(1)}$ and $x_t^{(2)}$. As a result, when we condition on $x_t^{(1)}$ to predict $x_t^{(2)}$, each element in $x_t^{(2)}$ will have some nearby elements from $x_t^{(1)}$ to provide information, which eases the prediction. In the extreme case of $k = 2$, $x_t^{(1)}$ includes the odd elements of $x_t$ and $x_t^{(2)}$ the even ones. Hence, when predicting an even element $x_{t,2i}$, the output distribution is conditioned on both the element to its left, $x_{t,2i-1}$, and the element to its right, $x_{t,2i+1}$, making the problem much easier.
- Random: The second method simply selects a uniformly random subset of the elements of $x_t$ to form $x_t^{(1)}$, and leaves the rest for $x_t^{(2)}$. Intuitively, this can be viewed as an informal "lower bound" on the performance gain achievable by modeling the intra-step correlation.
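The two split schemes can be sketched as index selectors over a $d$-dimensional step. Both helpers are hypothetical, and in `random_split` we take $k$ to control the leaked fraction $d/k$, which is an assumption on our part:

```python
import random

def interleave_split(d, k):
    """Leak one out of every k (0-indexed) elements; return (leaked, rest)."""
    leaked = [i for i in range(d) if i % k == 0]
    rest = [i for i in range(d) if i % k != 0]
    return leaked, rest

def random_split(d, k, seed=0):
    """Leak a uniformly random subset of size d // k; return (leaked, rest)."""
    idx = list(range(d))
    random.Random(seed).shuffle(idx)
    m = d // k
    return sorted(idx[:m]), sorted(idx[m:])
```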
Table 3: Test log-likelihood of RNN-δ under different split configurations.

Models      TIMIT    VCTK    Blizzard
FRNN        32,745   0.786   7,610
FSRNN       69,223   2.383   15,258
RNN-δ ( )   70,900   2.027   15,306
RNN-δ ( )   72,067   2.262   15,284
RNN-δ ( )   70,435   2.333   14,934
RNN-δ ( )   56,558   2.243   13,188
RNN-δ ( )   57,216   2.153   12,803
RNN-δ ( )   66,122   2.199   14,389
RNN-δ ( )   66,453   2.120   14,585
RNN-δ ( )   64,382   1.948   13,776
Since the parametric form in Eqn. (12) is derived from a delta posterior, we will refer to the special RNN model as RNN-δ. Based on the two split methods, we train RNN-δ on TIMIT, VCTK and Blizzard with different values of $k$. The results are summarized in Table 3. As we can see, when the interleaving split scheme is used, RNN-δ significantly improves upon FRNN, and becomes very competitive with FSRNN. Specifically, on TIMIT and Blizzard, RNN-δ can even outperform FSRNN in certain cases. More surprisingly, RNN-δ with the random split scheme can also achieve a performance that is very close to that of FSRNN, especially compared to FRNN.
Recall that RNN-δ is equivalent to employing a manually designed delta posterior that can only copy but never compresses (auto-encodes) the information in $x_t^{(1)}$. As a result, compared to a posterior that can learn to compress information, the delta posterior incurs a higher KL cost when leaking information through the posterior. Despite this disadvantage, RNN-δ is still able to match or even surpass the performance of FSRNN, suggesting that the posterior learned by FSRNN is far from satisfactory. Moreover, the limited performance gap between FSRNN and the random-split baseline raises a serious concern about the effectiveness of current variational inference techniques.
Nevertheless, putting the analysis and empirical evidence together, we can conclude that the performance advantage of FSRNN in the multi-frame setting can be entirely attributed to the second cause. That is, under the factorized constraint in previous experiments, FSRNN can still implicitly leverage the intra-step correlation, while FRNN is prohibited from doing so. However, as we discussed earlier in Section 4.2, this is essentially an unfair comparison. More importantly, the claimed superiority of SRNN over RNN may be misleading, as it is unclear whether the performance advantage of SRNN will persist when a non-factorized output distribution is employed to capture the intra-step correlation explicitly.
As far as we know, no previous work has carefully compared the performance of SRNN and RNN when non-factorized output distributions are allowed. On the other hand, as shown in Table 3, by modeling the multivariate simultaneity in the simplest way, RNN-δ achieves a dramatic performance improvement. Motivated by this huge potential as well as the lack of a systematic study, we will next include non-factorized output distributions in our consideration, and properly re-evaluate SRNN and RNN for multivariate sequence modeling.
5 Proper Multivariate Sequence Modeling with or without Latent Variables
5.1 Avoiding the Implicit Data Bias
In this section, we aim to eliminate any experimental bias and provide a proper evaluation of SRNN and RNN for multivariate sequence modeling. Apart from the "model bias" of employing fully factorized output distributions discussed above, another possible source of bias is the experimental data itself. For example, as discussed in Section 4.1, the multi-frame speech sequences are constructed by reshaping consecutive real-valued frames into $d$-dimensional vectors. Consequently, the elements within each step are simply temporally correlated with a natural order, which would favor a model that recurrently processes each element from $x_{t,1}$ to $x_{t,d}$ with parameter sharing.
Thus, to avoid such "data bias", besides speech sequences, we additionally consider three more types of multivariate sequences with different patterns of intra-step correlation:

- The first type is the MIDI sound sequence introduced in [2]. Each step of a MIDI sound sequence is an 88-dimensional binary vector, representing the activated piano notes ranging from A0 to C8. Intuitively, to make the MIDI sound musically plausible, there must be some correlations among the notes within each step. However, different from the multi-frame speech data, the correlation structure is not temporal any more.
To avoid unnecessary complications due to overfitting, we utilize the two relatively larger datasets, namely Muse (orchestral music) and Nottingham (folk tunes). Following earlier work [2], we report the step-averaged log-likelihood for these two MIDI datasets.

- The second type we consider is the widely used handwriting trajectory dataset, IAM-OnDB. Each step of the trajectory is represented by a 3-dimensional vector, where the first dimension is binary, indicating whether the pen is touching the paper, and the second and third dimensions are the coordinates of the pen when it is on the paper. Different from the other datasets, the dimensionality of each step in IAM-OnDB is significantly lower. Hence, it is reasonable to believe the intra-step structure is relatively simpler here. Following earlier work [4], we report the sequence-averaged log-likelihood for IAM-OnDB.

- The last type is a synthetic dataset we derive from TIMIT. Specifically, we maintain the multi-frame structure of the speech sequence, but permute the frames within each step according to a predetermined random order. Intuitively, this can be viewed as an extreme test of a model's capability to discover the underlying correlation between frames. Ideally, an optimal model should be able to discover the correct sequential order and recover the same performance as on the original TIMIT. For convenience, we will call this dataset PermTIMIT.
The detailed statistics of all datasets we will use are summarized in Supplementary B.
5.2 Modeling Simultaneity with Auto-Regressive Decomposition
With proper datasets, we now consider how to construct a family of nonfactorized distributions that (1) can be easily integrated into RNN and SRNN as the output distribution, and (2) are reasonably expressive for modeling multivariate correlations. Among many possible choices, the most straightforward choice would be the autoregressive parameterization. Compared to other options such as the normalizing flow or Markov Random Field (e.g. RBM), the autoregressive structure is conceptually simpler and can be applied to both discrete and continuous data with full tractability. Moreover, various dedicated neural architectures have been developed to support the autoregressive form. In light of these benefits, we choose to follow this simple idea, and decompose the output distribution of the RNN and SRNN, respectively, as
$$p\big(x_t \mid \mathbf{x}_{<t}\big) = \prod_{i=1}^{d} p\big(x_{t,i} \mid x_{t,<i}, \mathbf{x}_{<t}\big), \quad (13)$$
$$p\big(x_t \mid \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big) = \prod_{i=1}^{d} p\big(x_{t,i} \mid x_{t,<i}, \mathbf{z}_{\leq t}, \mathbf{x}_{<t}\big). \quad (14)$$
Notice that although we use the natural decomposition order from the smallest index to the largest, this particular order is generally not optimal for modeling multivariate distributions. A better choice could be adopting the orderless training previously explored in the literature [33]. But for simplicity, we will stick to this simple approach.
Given the autoregressive decomposition, a natural neural instantiation would be a recurrent hierarchical model that utilizes a two-level architecture to process the sequence:

- Firstly, a high-level RNN or SRNN is employed to encode the multivariate steps into a sequence of high-level hidden vectors $[h_1, \dots, h_T]$, following exactly the same computational procedure as used in FRNN and FSRNN (see Eqn. 2 and 3). Recall that, in the case of SRNN, the computation of the high-level vectors involves sampling the latent variables.

- Then, based on the high-level representations, for each multivariate step $x_t$, another neural model takes both the elements of $x_t$ and the high-level vector $h_{t-1}$ as input, and autoregressively produces a sequence of low-level hidden vectors $[l_{t,1}, \dots, l_{t,d}]$:
$$l_{t,i} = g_{\mathrm{low}}\big(x_{t,<i}, h_{t-1}\big).$$

Now, it is easy to verify that the low-level vectors $l_{t,i}$ computationally satisfy the autoregressive (causal) constraint and only depend on valid conditioning factors for the corresponding output distributions, i.e., $(x_{t,<i}, \mathbf{x}_{<t})$ for RNN and $(x_{t,<i}, \mathbf{z}_{\leq t}, \mathbf{x}_{<t})$ for SRNN. Hence, the low-level hidden vectors can then be used to form the per-element output distributions in Eqn. (13) and (14).
In practice, the low-level model could simply be an RNN or a causally masked MLP [11], depending on our prior knowledge about the data. For instance, an RNN is clearly not a suitable choice for the PermTIMIT dataset, since the elements after permutation do not possess any recurrent pattern. Therefore, for our evaluation, an RNN is employed as the low-level neural architecture on all datasets except for PermTIMIT, where we employ a causally masked MLP without parameter sharing, i.e., $l_{t,i} = \mathrm{MLP}_i\big(x_{t,<i}, h_{t-1}\big)$.
For convenience, we will refer to these hierarchical models as RNN-hier and SRNN-hier.
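To make the two-level computation concrete, the following minimal sketch (pure Python with arbitrary scalar weights; `high_level` and `low_level` are made-up stand-ins, not the paper's architectures) mimics the causal dependency structure: each low-level vector $l_{t,i}$ sees only $x_{t,<i}$ and $h_{t-1}$.

```python
import math

def high_level(steps):
    """Deterministic high-level recurrence over steps; returns h_{t-1} per step."""
    h, hs = 0.0, []
    for x_t in steps:
        hs.append(h)                            # h_{t-1}, encodes x_{<t} only
        h = math.tanh(0.3 * sum(x_t) + 0.7 * h)
    return hs

def low_level(x_t, h_prev):
    """Causal within-step map: l_{t,i} depends on x_{t,<i} and h_{t-1} only."""
    ls, acc = [], 0.0
    for i in range(len(x_t)):
        ls.append(math.tanh(acc + h_prev))  # uses x_{t,<i} via acc, never x_{t,i}
        acc += 0.5 * x_t[i]
    return ls

steps = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
hs = high_level(steps)
ls = low_level(steps[1], hs[1])
```

A quick check of the causal constraint: perturbing $x_{t,i}$ never changes $l_{t,j}$ for $j \leq i$, so each $l_{t,i}$ is a valid conditioning vector for the distribution of $x_{t,i}$.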
Table 4: Test log-likelihood on all seven datasets ("–": no previous result available; "N/A": the model is not applicable to the dataset).

Models            TIMIT     VCTK     Blizzard   Muse    Nottingham   IAM-OnDB   PermTIMIT
VRNN              28,982    –        9,392      –       –            1,384      –
SRNN              60,550    –        11,991     6.28    2.94         –          –
Z-Forcing         68,903    –        14,435     –       –            –          –
Z-Forcing + aux   70,469    –        15,430     –       –            –          –
FRNN              32,745    0.786    7,610      6.991   3.400        1,397      25,679
FSRNN             69,223    2.383    15,258     6.438   2.811        1,402      67,613
RNN-δ (random)    66,453    2.199    14,585     6.252   2.834        N/A        61,103
RNN-flat          117,721   3.2173   22,714     5.251   2.180        N/A        15,763
SRNN-flat         109,284   3.2062   22,290     5.616   2.324        N/A        14,278
RNN-hier          109,641   3.1822   21,950     5.161   2.028        1,440      95,161
SRNN-hier         107,912   3.1423   21,845     5.483   2.065        1,395      94,402
In some cases where all the elements within a step share the same statistical type, such as on the speech or MIDI datasets, one may alternatively consider a flat model. As the name suggests, the flat model breaks the boundary between steps and flattens the data into a new univariate sequence, where each step is simply a single element. Then, the new univariate sequence can be directly fed into a standard RNN or SRNN model, producing each conditional factor in Eqn. (13) and (14) in an autoregressive manner. This class of RNN and SRNN models will be referred to as RNN-flat and SRNN-flat, respectively.
Compared to the hierarchical model, the flat variant implicitly assumes a sequential continuity between $x_{t,d}$ and $x_{t+1,1}$, since their computational dependency is the same as that between any two consecutive elements within the same step. Since this inductive bias matches the characteristics of multi-frame speech sequences, we expect the flat model to perform better in this case.
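In code, the flat preprocessing is trivial: concatenate the steps into a single univariate sequence (hypothetical helper), after which the last element of one step and the first element of the next become ordinary neighbors.

```python
def flatten_steps(steps):
    """Drop step boundaries: a sequence of d-dim steps becomes one long
    univariate sequence of length T * d."""
    return [e for x_t in steps for e in x_t]
```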
5.3 Experiment Results
Based on the seven datasets listed in Table 4, we compare the performance of the factorized models, FRNN and FSRNN, with the non-factorized models introduced above. To provide a random baseline, we include RNN-δ with the random split scheme in the comparison. Moreover, previous results, where they exist, are also presented for additional context. For a fair comparison, we make sure all models share the same parameter size. For more implementation details, please refer to Supplementary C as well as the source code (github.com/zihangdai/reexaminesrnn). Finally, the results are summarized in Table 4, where we make several important observations.
Firstly, on the speech and MIDI datasets, models with autoregressive output distributions (lower half of Table 4) obtain a dramatic advantage over models with fully factorized output distributions (upper half), achieving new SOTA results on the three speech datasets. This observation reminds us that, besides capturing the long-term temporal structure across steps, properly modeling the intra-step dependency is equally, if not more, crucial to the practical performance.
Secondly, when the autoregressive output distribution is employed (lower half), the non-stochastic recurrent models consistently outperform their stochastic counterparts across all datasets. In other words, the advantage of SRNN completely disappears once a powerful output distribution is used. Combined with the previous observation, this verifies our earlier concern that the so-called superiority of FSRNN over FRNN is merely a result of the biased experimental design in previous work.
In addition, as expected, when the inductive bias of the flat model matches the characteristics of the speech data, it achieves better performance than the hierarchical model. Inversely, when this prior does not match the data, as on the other datasets, the hierarchical model is always better. In the extreme case of PermTIMIT, the flat model even falls behind the factorized models, while the hierarchical model achieves a very decent performance that is even much better than what FSRNN achieves on the original TIMIT. This shows that the hierarchical model is usually more robust, especially when we do not have a good prior.
Overall, we do not find any advantage in employing stochastic latent variables for multivariate sequence modeling. Instead, relying on a fully autoregressive solution yields better or even state-of-the-art performance. Combined with the observation that RNN-random can often achieve performance competitive with F-SRNN, we believe that the theoretical advantage of latent-variable models in sequence modeling is still far from being fulfilled, if it is ultimately achievable. In addition, we suggest that future development along this line compare against the simple but extremely robust baselines with an autoregressive output distribution.
6 Conclusion and Discussion
In summary, our re-examination reveals a misleading impression of the benefits of latent variables in sequence modeling. From our empirical observations, the main effect of latent variables is to provide a mechanism for leveraging the intra-step correlation, which, however, is not as powerful as the straightforward autoregressive decomposition. It remains unclear what causes the significant gap between the theoretical potential of latent variables and their practical effectiveness, and we believe this deserves more research attention. Meanwhile, given the large gain from modeling simultaneity, using sequential structures to better capture local patterns is another promising future direction in sequence modeling.
Acknowledgment
This work is supported in part by the National Science Foundation (NSF) under grant IIS-1546329 and by the DOE Office of Science under grant ASCR #KJ040201.
References
 Bayer & Osendorfer [2014] Bayer, J. and Osendorfer, C. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 Boulanger-Lewandowski et al. [2012] Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.
 Chen et al. [2017] Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. PixelSNAIL: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.
 Chung et al. [2015] Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
 Dai et al. [2019] Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
 Dinh et al. [2016] Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Flunkert et al. [2017] Flunkert, V., Salinas, D., and Gasthaus, J. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
 Fraccaro et al. [2016] Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pp. 2199–2207, 2016.

 Fraccaro et al. [2017] Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pp. 3601–3610, 2017.
 Gan et al. [2015] Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pp. 2467–2475, 2015.

 Germain et al. [2015] Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
 Goyal et al. [2017] Goyal, A. G. A. P., Sordoni, A., Côté, M.-A., Ke, N. R., and Bengio, Y. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pp. 6713–6723, 2017.
 Graves [2013] Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
 Gregor et al. [2015] Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 Gregor et al. [2016] Gregor, K., Besse, F., Rezende, D. J., Danihelka, I., and Wierstra, D. Towards conceptual compression. In Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.
 Johnson et al. [2016] Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
 Kalman [1960] Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
 Kingma & Ba [2014] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling [2013] Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. [2016] Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751, 2016.
 Krishnan et al. [2015] Krishnan, R. G., Shalit, U., and Sontag, D. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Krishnan et al. [2017] Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.
 Lai et al. [2018] Lai, G., Li, B., Zheng, G., and Yang, Y. Stochastic WaveNet: A generative latent variable model for sequential data. arXiv preprint arXiv:1806.06116, 2018.
 Loshchilov & Hutter [2016] Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 Oord et al. [2016] Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 Parmar et al. [2018] Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., and Ku, A. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
 Rabiner & Juang [1986] Rabiner, L. R. and Juang, B.-H. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
 Rezende & Mohamed [2015] Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 Rezende et al. [2014] Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Roweis & Ghahramani [1999] Roweis, S. and Ghahramani, Z. A unifying review of linear Gaussian models. Neural Computation, 11(2):305–345, 1999.
 Salimans et al. [2017] Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

 Sutskever et al. [2009] Sutskever, I., Hinton, G. E., and Taylor, G. W. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pp. 1601–1608, 2009.
 Uria et al. [2016] Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
 van den Oord et al. [2016] van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.
Appendix A Different Variants of Stochastic Recurrent Neural Networks
As stated in Section 3, we employ a simplified version of the stochastic recurrent neural network in our study to evaluate the effectiveness of latent variables in sequence modeling. This section details the connections and differences between our parameterization and the stochastic recurrent network models proposed in previous publications [1, 4, 8, 12].

For stochastic recurrent neural networks, the generic decomposition of the generative distribution shared by previous methods has the form
$p_\theta(x_{1:T}, z_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z_{\leq t}) \, p_\theta(z_t \mid x_{<t}, z_{<t}),$
where each new step depends on the entire history of the observations $x_{<t}$ and the latent variables $z_{<t}$. Similarly, for the approximate posterior distribution, all previous approaches can be unified under the form
$q_\phi(z_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid x_{1:T}, z_{<t}).$
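All of the variants below are trained by maximizing the standard evidence lower bound (ELBO) induced by these two generic forms, written here in our notation for reference:

```latex
\log p_\theta(x_{1:T})
  \;\ge\;
  \mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})}
    \big[ \log p_\theta(x_{1:T}, z_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T}) \big].
```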
Given the generic forms, various parameterizations with different independence assumptions have been introduced:


STORN [1]: This parameterization makes two simplifications. Firstly, the prior distribution is assumed to be context independent, i.e.,
$p_\theta(z_t \mid x_{<t}, z_{<t}) = p(z_t) = \mathcal{N}(0, I).$
Secondly, the posterior distribution is simplified as
$q_\phi(z_t \mid x_{1:T}, z_{<t}) = q_\phi(z_t \mid x_{\leq t}),$
which drops both the dependence on the future information $x_{>t}$ and that on the subsequence of previous latent variables $z_{<t}$.
Despite the simplification in the prior, STORN imposes no independence assumption on the output distribution $p_\theta(x_t \mid x_{<t}, z_{\leq t})$. Specifically, an RNN is used to capture the two conditional factors $x_{<t}$ and $z_{\leq t}$:
$h_t = \text{RNN}(h_{t-1}, [x_{t-1}, z_t]), \quad p_\theta(x_t \mid x_{<t}, z_{\leq t}) = p_\theta(x_t \mid h_t).$
Notice that the RNN is capable of modeling the correlation among the latent variables $z_{\leq t}$ and encodes that information into $h_t$.

VRNN [4]: This parameterization eliminates some independence assumptions in STORN. Firstly, the prior distribution becomes fully context dependent via a context RNN:
$h_t = \text{RNN}(h_{t-1}, [x_{t-1}, z_{t-1}]), \quad p_\theta(z_t \mid x_{<t}, z_{<t}) = p_\theta(z_t \mid h_t).$
Notice that $h_t$ is dependent on all previous latent variables $z_{<t}$. Hence, no independence assumption is involved in the prior distribution. However, the computation of $h_t$ cannot be parallelized due to the dependence on the latent variable $z_{t-1}$ as an input.
Secondly, compared to STORN, the posterior in VRNN additionally depends on the previous latent variables $z_{<t}$:
$q_\phi(z_t \mid x_{1:T}, z_{<t}) = q_\phi(z_t \mid h_t, x_t),$
where $h_t$ is the same forward vector used to construct the prior distribution above. However, the posterior still does not depend on the future observations $x_{>t}$.
Finally, the output distribution is simply constructed as
$p_\theta(x_t \mid x_{<t}, z_{\leq t}) = p_\theta(x_t \mid h_t, z_t).$

SRNN [8]: Compared to VRNN, SRNN (1) introduces a Markov assumption into the latent-to-latent dependence and (2) makes the posterior condition on the future observations $x_{\geq t}$.
Specifically, SRNN employs two RNNs, one forward and the other backward, to consume the observation sequence from the two different directions:
$h_t = \text{RNN}_{\text{fwd}}(h_{t-1}, x_{t-1}), \quad b_t = \text{RNN}_{\text{bwd}}(b_{t+1}, [h_t, x_t]).$
From the parametric form, notice that $b_t$ is always conditioned on the entire observation $x_{1:T}$, while $h_t$ only has access to $x_{<t}$.
Then, the prior and posterior are respectively formed by
$p_\theta(z_t \mid x_{<t}, z_{<t}) = p_\theta(z_t \mid h_t, z_{t-1}), \quad q_\phi(z_t \mid x_{1:T}, z_{<t}) = q_\phi(z_t \mid b_t, z_{t-1}),$
where the $z_{t-1}$ indicates the aforementioned Markov assumption. In other words, given the sampled value of $z_{t-1}$, $z_t$ is independent of $z_{<t-1}$.
Finally, the output distribution of SRNN also involves the same simplification:
$p_\theta(x_t \mid x_{<t}, z_{\leq t}) = p_\theta(x_t \mid h_t, z_t).$

Z-Forcing SRNN [12]: By feeding the latent variable as an additional input into the forward RNN, in an approach similar to VRNN, this parameterization removes the Markov assumption in SRNN.
Specifically, the computation goes as follows:
$h_t = \text{RNN}_{\text{fwd}}(h_{t-1}, [x_{t-1}, z_{t-1}]), \quad b_t = \text{RNN}_{\text{bwd}}(b_{t+1}, x_t),$
where $z_{t-1}$ is sampled from either the prior or the posterior:
$p_\theta(z_t \mid x_{<t}, z_{<t}) = p_\theta(z_t \mid h_t), \quad q_\phi(z_t \mid x_{1:T}, z_{<t}) = q_\phi(z_t \mid h_t, b_t).$
Notice that, since $h_t$ relies on $z_{<t}$ in a deterministic manner, there is no Markov assumption anymore when $h_t$ is used to construct the prior and posterior.
The same property also extends to the output distribution, which has the same parametric form as SRNN, although $h_t$ now contains different information:
$p_\theta(x_t \mid x_{<t}, z_{\leq t}) = p_\theta(x_t \mid h_t, z_t).$
As explained above, as long as the construction of the prior or the posterior conditions on the values of previous latent variables as input, the computation is completely sequential and cannot be parallelized. When the number of steps reaches the thousands, as in the case of the SRNN-flat model introduced in Section 5.2, training becomes unbearably slow.
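The sequential bottleneck can be seen in a minimal numpy sketch (toy dimensions; the tanh cell, unit-variance Gaussian, and all weight names are our assumptions for illustration): once the sampled latent re-enters the recurrent state, each hidden vector must wait for the previous sample, so computing the hidden states is an inherently serial loop.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, Z = 5, 3, 8, 4  # steps, input dim, hidden dim, latent dim

W_x = rng.standard_normal((H, D)) * 0.1
W_h = rng.standard_normal((H, H)) * 0.1
W_z = rng.standard_normal((H, Z)) * 0.1
W_mu = rng.standard_normal((Z, H)) * 0.1

x = rng.standard_normal((T, D))

# Because the prior conditions on previous latents (through the hidden
# state), every step needs the previously *sampled* z before it can run:
h = np.zeros(H)
zs = []
for t in range(T):
    mu_t = W_mu @ h                       # prior mean from the context
    z_t = mu_t + rng.standard_normal(Z)   # sample z_t given the history
    h = np.tanh(W_x @ x[t] + W_h @ h + W_z @ z_t)  # z_t re-enters the state
    zs.append(z_t)
zs = np.stack(zs)
```

No amount of batching removes this loop over t, which is what makes VRNN-style parameterizations slow at thousands of steps.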
Faced with this problem, we remove the dependency between $h_t$ and $z_{<t}$, leading to the prior and posterior introduced in Section 3:
$p_\theta(z_t \mid x_{<t}, z_{<t}) = p_\theta(z_t \mid h_t), \quad q_\phi(z_t \mid x_{1:T}, z_{<t}) = q_\phi(z_t \mid h_t, b_t),$
where the forward and backward vectors are both computed separately in a single pass:
$h_t = \text{RNN}_{\text{fwd}}(h_{t-1}, x_{t-1}), \quad b_t = \text{RNN}_{\text{bwd}}(b_{t+1}, x_t).$
However, this simplification entirely discards the dependency among the latent variables, which could be oversimplified. As compensation, we employ an additional RNN to process the latent variables (Eqn. 3):
$g_t = \text{RNN}_z(g_{t-1}, z_t),$
where $g_t$ can potentially capture the correlation among $z_{\leq t}$. This is similar to the solution in STORN.
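The simplified parameterization can be sketched as follows (a minimal numpy sketch with toy dimensions; the tanh cells, unit-variance Gaussians, and all weight names are our assumptions, not the exact architecture): both deterministic RNNs run before any sampling, all latents are then sampled at once, and only the lightweight latent RNN touches the samples.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, Z = 5, 3, 8, 4  # steps, input dim, hidden dim, latent dim

def rnn(xs, W_in, W_rec):
    """Simple tanh RNN: one full pass over a sequence of row vectors."""
    h, hs = np.zeros(W_rec.shape[0]), []
    for v in xs:
        h = np.tanh(W_in @ v + W_rec @ h)
        hs.append(h)
    return np.stack(hs)

x = rng.standard_normal((T, D))
W = {name: rng.standard_normal(shape) * 0.1 for name, shape in {
    "fx": (H, D), "fh": (H, H),    # forward RNN
    "bx": (H, D), "bh": (H, H),    # backward RNN
    "zx": (H, Z), "zh": (H, H),    # latent RNN (Eqn. 3)
    "mu_p": (Z, H), "mu_q": (Z, 2 * H),
}.items()}

# Forward and backward vectors, each computed in a single pass over x
# (no latent enters either recurrence, so cuDNN-style kernels apply).
h_fwd = rnn(x, W["fx"], W["fh"])
h_bwd = rnn(x[::-1], W["bx"], W["bh"])[::-1]

# Prior sees only the past; posterior additionally sees the future.
mu_prior = h_fwd @ W["mu_p"].T
mu_post = np.concatenate([h_fwd, h_bwd], axis=-1) @ W["mu_q"].T
z = mu_post + rng.standard_normal((T, Z))  # all steps sampled at once

# A separate RNN over the sampled latents restores cross-step latent
# correlation in the output path, similar in spirit to STORN.
g = rnn(z, W["zx"], W["zh"])
```
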
An advantage of this simplified solution is that we can directly utilize cuDNN-accelerated RNN modules instead of relying on a customized for-loop as used in VRNN, SRNN, and Z-Forcing SRNN. However, the speed advantage would be meaningless if performance degraded significantly. To verify this does not happen, we compare the simplified SRNN and our implementation of Z-Forcing SRNN on the TIMIT dataset in Table 5. Specifically, we consider two cases where the output distribution either has a factorized form (F-SRNN) or uses a flat autoregressive parameterization (SRNN-flat).
As we can see, given a similar number of parameters, the simplified version matches or even surpasses the performance of Z-Forcing SRNN without the auxiliary cost. Speed-wise, the simplified version is consistently faster than Z-Forcing SRNN, especially in the flat autoregressive case (3x faster).
Model | Steps | Frames/Step | #Param | LL | Time
Z-Forcing F-SRNN | 40 | 200 | 17.44M | 69,296 | 1.38h
Simplified F-SRNN | 40 | 200 | 17.45M | 69,223 | 0.96h
Z-Forcing SRNN-flat | 1,000 | 1 | 16.98M | 97,194 | 62.86h
Simplified SRNN-flat | 1,000 | 1 | 16.78M | 99,102 | 20.62h
Additionally, in Table 6, we compare the simplified SRNN with the published performance of stochastic recurrent networks on the speech, handwriting-trajectory, and music datasets. The results demonstrate that the simplified SRNN is able to reproduce SOTA-level performance.
Model | TIMIT | Blizzard | Muse | Nottingham | IAM-OnDB
VRNN | 28,982 | 9,392 | – | – | 1384
SRNN | 60,550 | 11,991 | 6.28 | 2.94 | –
Z-Forcing | 68,903 | 14,435 | – | – | –
Z-Forcing + aux | 70,469 | 15,430 | – | – | –
Simplified SRNN | 69,223 | 15,258 | 6.44 | 2.81 | 1402
Due to the comparable performance and improved speed of the simplified SRNN, we choose it as the default parameterization of the stochastic recurrent network in this work.
Appendix B Data Statistics
Dataset | Number of Steps | Frames/Step
TIMIT | 1.54M | 200
VCTK | 12.6M | 200
Blizzard | 90.5M | 200
Perm-TIMIT | 1.54M | 200
Muse | 36.1M | 88
Nottingham | 23.5M | 88
IAM-OnDB | 7.63M | 3
The dataset statistics are summarized in Table 7. “Frames/Step” indicates the dimension of the vector at each time step, and “Number of Steps” is the total length of the multivariate sequence.
Appendix C Experiment Details
Domain | Speech | MIDI | Handwriting
F-RNN | 17.41M | 0.57M | 0.93M
F-SRNN | 17.53M | 2.28M | 1.17M
RNN-random | 18.57M | 0.71M | N/A
RNN-flat | 16.86M | 1.58M | N/A
SRNN-flat | 16.93M | 2.24M | N/A
RNN-hier | 17.28M | 1.87M | 0.97M
SRNN-hier | 17.25M | 3.05M | 1.02M
In the following, we provide more details about our implementation. Firstly, Table 8 reports the parameter sizes of all models compared in Table 4. For data domains with enough data (i.e., speech and handwriting), we ensure the parameter sizes are about the same. On the smaller MIDI datasets, we only make sure the RNN variants do not use more parameters than the SRNNs do.
For all methods, we use the Adam optimizer [18] with learning rate 0.001. The cosine schedule [24] anneals the learning rate from 0.001 to 0.000001 over the course of training. The batch size is set to 32 for TIMIT, 128 for VCTK and Blizzard, 16 for Muse and Nottingham, and 32 for IAM-OnDB. The total number of training steps is 20k for Muse, Nottingham, and IAM-OnDB, 40k for TIMIT, 80k for VCTK, and 160k for Blizzard. For all SRNN variants, we follow previous work and employ the KL annealing strategy, where the coefficient on the KL term is increased from 0.2 to 1.0 by an increment of 0.00005 after each parameter update [12].
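The two schedules above are simple enough to write down explicitly (function names and the exact clamping behavior are ours; a sketch, not the actual training code):

```python
import math

def kl_coefficient(step, start=0.2, end=1.0, increment=5e-5):
    """Linear KL annealing: coefficient on the KL term after `step` updates."""
    return min(end, start + increment * step)

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    """Cosine schedule annealing the learning rate from lr_max to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# With these constants, the KL coefficient saturates at 1.0 after
# (1.0 - 0.2) / 0.00005 = 16,000 updates, i.e. before even the
# shortest (20k-step) training runs finish.
```
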
For architectural details such as the number of layers and hidden dimensions used in this study, we refer readers to the accompanying source code.