As a fundamental problem in machine learning, probabilistic sequence modeling aims at capturing the sequential correlations in both short and long ranges. Among many possible model choices, deep auto-regressive models [13, 33] have become one of the most widely adopted solutions. Typically, a deep auto-regressive model factorizes the likelihood function of sequences in an auto-regressive manner, i.e., $p(\mathbf{x}) = \prod_{t=1}^{|\mathbf{x}|} p(x_t \mid \mathbf{x}_{<t})$. Then, a neural network (e.g. RNN) is employed to encode the conditional context $\mathbf{x}_{<t}$ into a compact hidden representation $h_t$, which is then used to define the output distribution $p(x_t \mid h_t)$. As deep auto-regressive models are fully tractable, they can be trained by maximum likelihood estimation (MLE) using back-propagation.
Despite the state-of-the-art (SOTA) performance in many domains [7, 3, 26, 5], the hidden representations of standard auto-regressive models are produced in a completely deterministic way. Hence, the stochastic aspects of the observed sequences can only be modeled by the output distribution, which, however, usually has a simple parametric form such as a unimodal distribution or a finite mixture of unimodal distributions. A potential weakness of such simple forms is that they may not be sufficiently expressive for modeling real-world sequential data with complex stochastic dynamics.
Recently, many efforts have been made to enrich the expressive power of auto-regressive models by injecting stochastic latent variables into the computation of hidden states. Notably, relying on the variational auto-encoding (VAE) framework [19, 29], stochastic recurrent models (SRNN) have outperformed standard RNN-based auto-regressive models by a large margin in modeling raw sound-wave sequences [1, 4, 8, 12, 23].
However, the success of stochastic latent variables does not necessarily generalize to other domains such as text and images. For instance, Goyal et al.  report that an SRNN trained by Z-Forcing lags behind a baseline RNN in language modeling. Similarly, for the density estimation of natural images, PixelCNN [25, 34, 31] consistently outperforms generative models with latent variables [14, 15, 28, 6, 20].
To better understand the discrepancy, we re-examine the role of stochastic variables in SRNN models. By carefully inspecting the previous experimental settings for sound-wave density estimation and systematically analyzing the properties of SRNN, we identify two potential causes of the performance gap between SRNN and RNN. Controlled experiments are designed to test each hypothesis, where we find that previous evaluations impose an unnecessary restriction of fully factorized output distributions, which has led to an unfair comparison between SRNN and RNN. Specifically, under the factorized parameterization, SRNN can still implicitly leverage the intra-step correlation, i.e., the simultaneity, while the RNN baselines are prohibited from doing so. Meanwhile, we also observe that the posterior learned by SRNN can be outperformed by a simple hand-crafted posterior, raising serious doubt about the general effectiveness of injecting latent variables.
To provide a fair comparison, we propose an evaluation setting where both the SRNN and RNN can utilize an auto-regressive output distribution to model the intra-step correlation explicitly. Under the new setting, we re-evaluate SRNN and RNN on a diverse collection of sequential data, including human speech, MIDI music, handwriting trajectory and frame-permuted speech. Empirically, we find that sequential models with continuous latent variables fail to offer any practical benefits, despite their widely believed theoretical superiority. On the contrary, explicitly capturing the intra-step correlation with an auto-regressive output distribution consistently performs better, substantially improving the SOTA performances in modeling speech signals. Overall, these observations show that the previously reported performance “advantage” of SRNN is merely the result of a long-existing experiment bias of using factorized output distributions.
2 Related Work
In the field of probabilistic sequence modeling, many efforts prior to deep learning have been devoted to State Space Models, such as the Hidden Markov Model with discrete states and the Kalman Filter, whose states are continuous.
Recently, the focus has shifted to deep sequential models, including tractable deep auto-regressive models without any latent variable and deep stochastic models that combine the powerful nonlinear computation of neural networks and the stochastic flexibility of latent-variable models. The recurrent temporal RBM and RNN-RBM are early examples of how latent variables can be incorporated into deep neural networks. After the VAE was introduced, stochastic back-propagation made it easy to combine deep neural networks and latent-variable models, leading to the stochastic recurrent models introduced in Section 1, temporal sigmoid belief networks, deep Kalman Filters, deep Markov Models, Kalman variational auto-encoders and many other variants. Johnson et al. provide a general discussion on how the classic graphical models and deep neural networks can be combined.
In this section, we briefly review SRNN and RNN for probabilistic sequence modeling. Throughout the paper, we will use a bold font $\mathbf{x}$ to denote a sequence, $\mathbf{x}_{<t}$ and $\mathbf{x}_{\le t}$ to indicate the sub-sequences of the first $t-1$ and $t$ elements respectively, and $x_t$ to represent the $t$-th element. Note that $x_t$ can either be a scalar or a vector. In the latter case, $x_{t,i}$ refers to the $i$-th element of the vector $x_t$.
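Given a set of sequences $\mathcal{D} = \{\mathbf{x}^1, \dots, \mathbf{x}^{|\mathcal{D}|}\}$, we are interested in building a density estimation model for sequences. A widely adopted solution is to employ an auto-regressive model powered by a neural network, and utilize MLE to perform the training:
$$\max_{\theta} \; \mathcal{L}_{\mathcal{D}} = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{T_i} \log p_\theta(x^i_t \mid \mathbf{x}^i_{<t}), \tag{1}$$
where $T_i$ is the length of the sequence $\mathbf{x}^i$.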
In practice, the conditional distribution $p_\theta(x_t \mid \mathbf{x}_{<t})$ is usually jointly modeled by three sub-modules:
- The pre-defined distribution family of the output distribution $p_\theta(x_t \mid \mathbf{x}_{<t})$, such as Gaussian or Categorical;
- The sequence model, which encodes the contextual sequence $\mathbf{x}_{<t}$ into a compact hidden vector $h_t$;
- The output model, which transforms the hidden vector $h_t$ into the distribution parameters of the chosen family.
For instance, when Gaussian is chosen as the output distribution family, the sequence model and the output model jointly transform $\mathbf{x}_{<t}$ into the predicted “mean” $\mu_t$ and “variance” $\sigma_t^2$ of the Gaussian, i.e., $p_\theta(x_t \mid \mathbf{x}_{<t}) = \mathcal{N}(x_t; \mu_t, \sigma_t^2)$.
Under this general framework, RNN and SRNN can be seen as two different instantiations of the sequence model.
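To make the division of labor concrete, the Gaussian case can be sketched as follows. This is only an illustration under stated assumptions: the output model is taken to be a single linear layer, and the function names and toy weights are hypothetical rather than taken from the paper.

```python
import math

def gaussian_output_params(h, w_mu, b_mu, w_logvar, b_logvar):
    """Hypothetical linear output model: hidden vector -> (mean, variance)."""
    mu = sum(w * x for w, x in zip(w_mu, h)) + b_mu
    log_var = sum(w * x for w, x in zip(w_logvar, h)) + b_logvar
    return mu, math.exp(log_var)  # exponentiating keeps the variance positive

def gaussian_log_density(x, mu, var):
    """log N(x; mu, var) for a scalar observation x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

# Toy hidden vector standing in for the output of the sequence model.
h_t = [0.5, -1.0]
mu_t, var_t = gaussian_output_params(h_t, [1.0, 0.5], 0.0, [0.0, 0.0], 0.0)
log_p = gaussian_log_density(0.0, mu_t, var_t)
```

Whatever produces `h_t` (an RNN or an SRNN), the likelihood term being trained is the same `log_p`; only the sequence model differs.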
When RNN is employed as the sequence model, the hidden states are produced recurrently:
$$h_t = \mathrm{RNN}(h_{t-1}, x_{t-1}). \tag{2}$$
As we have discussed in Section 1, the computation inside RNN is fully deterministic. Hence, in order to model a complex distribution $p(x_t \mid \mathbf{x}_{<t})$, one has to rely on a rich enough distribution family.
To improve the model expressiveness, SRNN takes an alternative route and incorporates continuous latent variables into the sequence model. Typically, SRNN associates the observed data sequence $\mathbf{x}$ with a sequence of latent variables $\mathbf{z}$, one for each step. With latent variables, the internal dynamics of the sequence model are no longer deterministic, offering a theoretical possibility to capture more complex stochastic patterns. However, the improved capacity comes with a computational burden: the log-likelihood is generally intractable due to the integral
$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}.$$
Hence, standard MLE training cannot be performed.
To handle the intractability, SRNN utilizes the VAE framework and maximizes the evidence lower bound (ELBO) of the log-likelihood (1) for training:
$$\log p(\mathbf{x}) \ge \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z} \mid \mathbf{x}) \big],$$
where $q(\mathbf{z} \mid \mathbf{x})$ is the approximate posterior distribution modeled by an encoder network.
Computationally, several SRNN variants have been proposed [1, 4, 8, 12], mostly differing in how the generative distribution and the variational posterior are parameterized. In this work, we consider the following parameterization. Firstly, a forward RNN and a backward RNN are employed to encode into two sequences of forward and backward vectors, respectively:
where denotes vector concatenation. Based on the forward and backward vectors, the per-step prior and posterior are constructed:
Then, for each step , a sample is drawn from either or , depending on whether it is during training or evaluation. Given the sampled values, we employ an additional RNN to merge and into the hidden state , while capturing the dependency among :
Finally, similar to the RNN case, the hidden state is fed into the output model, producing the output distribution .
Under this parameterization, the ELBO reduces to
Compared to previous approaches [4, 8, 12] that employ a recurrent latent-to-latent connection to capture the dependency between latent variables, we instead use a factorized form and then stack another RNN on the sequence of sampled latent variables to model the dependency, as shown in Eqn. (3). We find this simplification yields comparable or even better performance under the same model size while significantly improving speed. For more details, we refer readers to Supplementary A.
4 Revisiting SRNN for Speech Modeling
4.1 Previous Setting for Speech Density Estimation
To compare SRNN and RNN, previous studies largely rely on the density estimation of sound-wave sequences. Usually, a sound-wave dataset consists of a collection of audio sequences with a sample rate of 16 kHz, where each frame (element) of the sequence is a scalar in $[-1, 1]$, representing the normalized amplitude of the sound.
Instead of treating each frame as a single step, Chung et al. propose a multi-frame setting, where every 200 consecutive frames are taken as a single step. Effectively, the data can be viewed as a sequence of 200-dimensional real-valued vectors, i.e., $x_t \in \mathbb{R}^{L}$ with $L = 200$. During training, every 40 steps (8,000 frames) are taken as an i.i.d. sequence to form the training set.
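The multi-frame format amounts to two reshaping operations, sketched below. The helper names are hypothetical, and dropping any trailing remainder is an assumption made here for simplicity; the paper does not say how incomplete steps are handled.

```python
def to_multiframe(frames, frames_per_step=200):
    """Group consecutive scalar frames into fixed-size steps (vectors)."""
    usable = len(frames) - len(frames) % frames_per_step  # assumption: drop the remainder
    return [frames[i:i + frames_per_step] for i in range(0, usable, frames_per_step)]

def to_training_sequences(steps, steps_per_sequence=40):
    """Chunk the step sequence into training sequences of 40 steps (8,000 frames)."""
    usable = len(steps) - len(steps) % steps_per_sequence
    return [steps[i:i + steps_per_sequence] for i in range(0, usable, steps_per_sequence)]

# One second of (silent) audio at 16 kHz as a stand-in for a real waveform.
waveform = [0.0] * 16000
steps = to_multiframe(waveform)
sequences = to_training_sequences(steps)
```

With these defaults, one second of audio becomes 80 steps of 200 frames each, i.e. two training sequences.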
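Under this data format, notice that the output distributions $p(x_t \mid \mathbf{x}_{<t})$ and $p(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t})$ now correspond to an $L$-dimensional random vector $x_t$. Therefore, how to parameterize this multivariate distribution can largely influence the empirical performance. That said, recent approaches [8, 12] have all followed Chung et al. to employ a fully factorized parametric form which ignores the inner dependency:
$$p(x_t \mid \mathbf{x}_{<t}) = \prod_{i=1}^{L} p(x_{t,i} \mid \mathbf{x}_{<t}), \tag{5}$$
$$p(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) = \prod_{i=1}^{L} p(x_{t,i} \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}). \tag{6}$$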
Here, the product form emphasizes that this choice effectively imposes an independence assumption. Despite this convenience, note that the restriction to a fully factorized form is not necessary at all. Nevertheless, we will refer to the models in Eqn. (5) and Eqn. (6), respectively, as factorized RNN (F-RNN) and factorized SRNN (F-SRNN) in the following.
To provide a baseline for further discussion, we replicate the experiments under the setting introduced above and evaluate them on three speech datasets, namely TIMIT, VCTK and Blizzard. Following the previous work , we choose a Gaussian mixture to model the per-frame distribution of F-RNN, which enables a basic multi-modality.
We report the averaged test log-likelihood in Table 1. For consistency with previous results in literature, the results of TIMIT and Blizzard are based on sequence-level average, while the result of VCTK is frame-level average. As we can see, similar to previous observations, F-SRNN outperforms F-RNN on all three datasets by a dramatic margin.
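The two averaging conventions differ only in the normalizer, as the following sketch shows. The log-likelihood values are made-up toy numbers, not results from Table 1, and the function names are hypothetical.

```python
def sequence_level_average(seq_log_likelihoods):
    """Total log-likelihood divided by the number of sequences."""
    return sum(seq_log_likelihoods) / len(seq_log_likelihoods)

def frame_level_average(seq_log_likelihoods, seq_num_frames):
    """Total log-likelihood divided by the total number of frames."""
    return sum(seq_log_likelihoods) / sum(seq_num_frames)

# Toy values for two test sequences of 8,000 frames each.
log_liks = [16000.0, 24000.0]
num_frames = [8000, 8000]
seq_avg = sequence_level_average(log_liks)
frame_avg = frame_level_average(log_liks, num_frames)
```

When all sequences have the same length, the sequence-level figure is simply the frame-level figure multiplied by the number of frames per sequence, which is why the two kinds of numbers differ by orders of magnitude.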
4.2 Decomposing the Advantages of Factorized SRNN
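To understand why F-SRNN outperforms F-RNN by such a large margin, it is helpful to examine the effective output distribution of F-SRNN after marginalizing out the latent variables:
$$\begin{aligned} p(x_t \mid \mathbf{x}_{<t}) &= \int p(\mathbf{z}_{\le t} \mid \mathbf{x}_{<t}) \, p(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) \, d\mathbf{z}_{\le t} \\ &= \int p(\mathbf{z}_{\le t} \mid \mathbf{x}_{<t}) \prod_{i=1}^{L} p(x_{t,i} \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) \, d\mathbf{z}_{\le t}. \end{aligned} \tag{7}$$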
From this particular form, we can see two potential causes of the performance gap between F-SRNN and F-RNN in the multi-frame setting:
Advantage under High Volatility: By incorporating the continuous latent variable, the distribution of F-SRNN essentially forms an infinite mixture of simpler distributions (see first line of Eqn. 7). As a result, the distribution is significantly more expressive and flexible, and it is believed to be particularly suitable for modeling high-entropy sequential dynamics .
The multi-frame setting introduced above matches this description well. Concretely, since the model is required to predict the next $L$ frames all together in this setting, the long prediction horizon naturally involves higher uncertainty. Therefore, the high volatility of the multi-frame setting may provide a perfect scenario for SRNN to exhibit its theoretical advantage in expressiveness.
Utilizing the Intra-Step Correlation: From the second line of Eqn. (7), notice that the distribution after marginalization is generally no longer factorized, due to the coupling with $\mathbf{z}_{\le t}$. In contrast, recall that the same distribution of the F-RNN (Eqn. 5) is fully factorized. Therefore, in theory, a factorized SRNN could still model the correlation among the frames within each step, if properly trained, while the factorized RNN has no means to do so at all. Thus, SRNN may also benefit from this difference.
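The second point can be checked numerically with a toy model that is not taken from the paper: two elements are drawn independently given a shared latent variable, yet they become correlated once the latent variable is marginalized out. The unit-variance Gaussians and the function names are assumptions chosen purely for illustration.

```python
import random

def sample_step(rng):
    """Draw (x1, x2) that are conditionally independent given a shared latent z."""
    z = rng.gauss(0.0, 1.0)        # latent variable
    x1 = z + rng.gauss(0.0, 1.0)   # factorized output: x1 depends only on z
    x2 = z + rng.gauss(0.0, 1.0)   # factorized output: x2 depends only on z
    return x1, x2

def empirical_covariance(pairs):
    """Plain (biased) sample covariance of a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n

rng = random.Random(0)
pairs = [sample_step(rng) for _ in range(20000)]
cov = empirical_covariance(pairs)  # theoretical value is Var(z) = 1
```

A fully factorized model without latent variables would force this covariance to zero, which is exactly the restriction placed on F-RNN.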
While both advantages could have jointly led to the performance gap in Table 1, the implications are totally different. The first advantage under high volatility is a unique property of latent-variable models that other generative models without latent variables can hardly obtain. Therefore, if this property significantly contributes to the superior performance of F-SRNN over F-RNN, it suggests a more general effectiveness of incorporating stochastic latent variables.
Quite the contrary, being able to utilize the intra-step correlation is more like an unfair benefit to SRNN, since it is the unnecessary restriction of fully factorized output distributions in previous experimental design that prevents RNNs from modeling the correlation. In practice, one can easily enable RNNs to do so by employing a non-factorized output distribution. In this case, it remains unclear whether this particular advantage will persist.
Motivated by the distinct implications, in the sequel, we will try to figure out how much each of the two hypotheses above actually contributes to the performance gap.
4.3 Advantage under High Volatility
In order to test the advantage of F-SRNN in modeling highly volatile data in isolation, the idea is to construct a sequential dataset where each step consists of a single frame (i.e., a uni-variate variable), while there exists high volatility between every two consecutive steps.
Concretely, for each sequence $\mathbf{x}$, we create a sub-sequence by selecting one frame from every $s$ consecutive frames, where $s$ denotes the stride. Intuitively, a larger stride $s$ will lead to a longer horizon between two selected frames and hence a higher uncertainty. Moreover, since each step corresponds to a single scalar, the second advantage (i.e., the potential confounding factor) automatically disappears.
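The stride construction is a plain subsampling operation, sketched below. Which frame inside each window is kept is not recoverable from the text, so the `offset` parameter (defaulting to the first frame) is an assumption, as is the function name.

```python
def stride_subsample(frames, stride, offset=0):
    """Keep one frame out of every `stride` consecutive frames."""
    return frames[offset::stride]

# Frame indices stand in for amplitudes so the selection is easy to read.
frames = list(range(1000))
sub_50 = stride_subsample(frames, 50)
sub_200 = stride_subsample(frames, 200)
```

A larger stride leaves fewer, more widely separated frames, which is what raises the uncertainty between consecutive steps.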
Following this idea, from the original datasets, we derive the stride-TIMIT, stride-VCTK and stride-Blizzard with different stride values $s \in \{50, 200\}$, and evaluate the RNN and SRNN on each of them. Again, we report the sequence- or frame-average test likelihood in Table 2.
|Stride = 50||Stride = 200|
Surprisingly, RNN consistently achieves a better performance than SRNN in this setting. It suggests the theoretically better expressiveness of SRNN does not help that much in high-volatility scenarios. Hence, this potential advantage does not really contribute to the performance gap observed in Table 1.
4.4 Utilizing the Intra-Step Correlation
After ruling out the first hypothesis, it becomes more likely that being able to utilize the intra-step correlation actually leads to the superior performance of F-SRNN. However, despite the non-factorized form in Eqn. (7), it is still not clear how F-SRNN computationally captures the correlation in practice. Here, we provide a particular possibility.
To facilitate the discussion, we first rewrite the ELBO in Eqn. (4) in terms of the reconstruction and the KL term:
From Eqn. (8), notice that the vector $x_t$ we hope to reconstruct at step $t$ is included in the conditional input to the posterior. With this computational structure, the encoder can theoretically leak a subset of the vector $x_t$ into the latent variable $z_t$, and leverage the leaked subset to predict (reconstruct) the remaining elements in $x_t$. Intuitively, the procedure of using the leaked subset to predict the remaining subset is essentially exploiting the dependency between the two subsets, or in other words, the correlation within $x_t$.
To make this informal description more concrete, we construct a special example. Following the intuition above, we split the elements of the vector $x_t$ into two arbitrary disjoint subsets, the leaked subset and its complement. Then, we consider a special posterior:
where $\delta$ denotes a delta function that puts all probability mass of $z_t$ on a single point, namely the leaked subset. Effectively, this posterior simply memorizes the leaked subset. Under this delta posterior, if we further assume , Eqn. (8) and (9) can be simplified into
Notice the term in and the term in can cancel each other out, because they are both degenerate delta distributions with the random variable on both sides of the conditional bar. Thus, after the cancellation, the ELBO further reduces to
Now, the second term above is always conditioned on the leaked subset of $x_t$ to predict its complement, which is exactly utilizing the correlation between the two subsets. From another perspective, the form of Eqn. (11) is equivalent to a particular auto-regressive factorization of the output distribution:
In other words, with a proper posterior, F-SRNN can recover a certain auto-regressive parameterization, making it possible to utilize the intra-step correlation. More importantly, the conditioning on the leaked subset is not affected by the choice of fully factorized output distributions, since the information is passed through the posterior.
Although the analysis and construction above provide a theoretical possibility, we still lack concrete evidence to support the hypothesis that F-SRNN has significantly benefited from modeling the intra-step correlation. While it is difficult to verify this hypothesis in general, we can exploit the equivalence in Eqn. (12) to get some empirical evidence. Specifically, we can parameterize an RNN according to Eqn. (12), which is equivalent to an F-SRNN with a delta posterior as defined in Eqn. (10). Therefore, by measuring the performance of this special RNN, we can get a conservative estimate of how much modeling the intra-step correlation can contribute to the performance of F-SRNN.
To complete the special RNN idea, we still need to specify how $x_t$ is split into the leaked subset and its complement. Here, we consider two methods with different intuitions:
Interleaving: The first method takes one element at a fixed interval to construct the leaked subset. Essentially, this method interleaves the two subsets. As a result, when we condition on the leaked subset to predict its complement, each element in the complement will have some nearby elements from the leaked subset to provide information, which eases the prediction. In the extreme case where the interval is 2, the leaked subset includes the odd elements of $x_t$ and the complement the even ones. Hence, when predicting an even element, the output distribution is conditioned on both the element to its left and the element to its right, making the problem much easier.
Random: The second method simply uniformly selects a fixed number of random elements from $x_t$ to form the leaked subset, and leaves the rest for its complement. Intuitively, this can be viewed as an informal “lower bound” of performance gain through modeling the intra-step correlation.
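The two split schemes can be written as index selections over the positions of a step. In this sketch the function names are hypothetical, positions are 0-indexed (so the paper's "odd elements" are positions 0, 2, 4, ...), and drawing the random subset once from a fixed seed is an assumption about a detail the text does not specify.

```python
import random

def interleaving_split(num_elements, interval):
    """Leak one position out of every `interval`; the rest must be predicted."""
    leaked = list(range(0, num_elements, interval))
    leaked_set = set(leaked)
    rest = [i for i in range(num_elements) if i not in leaked_set]
    return leaked, rest

def random_split(num_elements, num_leaked, seed=0):
    """Leak `num_leaked` positions chosen uniformly at random without replacement."""
    rng = random.Random(seed)
    leaked = sorted(rng.sample(range(num_elements), num_leaked))
    leaked_set = set(leaked)
    rest = [i for i in range(num_elements) if i not in leaked_set]
    return leaked, rest

inter_leaked, inter_rest = interleaving_split(6, 2)
rand_leaked, rand_rest = random_split(200, 50)
```

With an interval of 2, every predicted position has a leaked neighbor on its left, and all but possibly the last have one on the right as well; the random scheme gives no such guarantee.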
Since the parametric form (12) is derived from a delta posterior, we will refer to the special RNN model as $\delta$-RNN. Based on the two split methods, we train $\delta$-RNN on TIMIT, VCTK and Blizzard with different values of the interleaving interval and the number of randomly leaked elements. The results are summarized in Table 3. As we can see, when the interleaving split scheme is used, $\delta$-RNN significantly improves upon F-RNN, and becomes very competitive with F-SRNN. Specifically, on TIMIT and Blizzard, $\delta$-RNN can even outperform F-SRNN in certain cases. More surprisingly, the $\delta$-RNN with the random-copy scheme can also achieve a performance that is very close to that of F-SRNN, especially compared to F-RNN.
Recall that $\delta$-RNN is equivalent to employing a manually designed delta posterior that can only copy but never compress (auto-encode) the information in $x_t$. As a result, compared to a posterior that can learn to compress information, the delta posterior will involve a higher KL cost when leaking information through the posterior. Despite this disadvantage, $\delta$-RNN is still able to match or even surpass the performance of F-SRNN, suggesting the learned posterior in F-SRNN is far from satisfactory. Moreover, the limited performance gap between F-SRNN and the random-copy baseline raises a serious concern about the effectiveness of current variational inference techniques.
Nevertheless, putting the analysis and empirical evidence together, we can conclude that the performance advantage of F-SRNN in the multi-frame setting can be entirely attributed to the second cause. That is, under the factorized constraint in previous experiments, F-SRNN can still implicitly leverage the intra-step correlation, while F-RNN is prohibited from doing so. However, as we have discussed earlier in Section 4.2, this is essentially an unfair comparison. More importantly, the claimed superiority of SRNN over RNN may be misleading, as it is unclear whether the performance advantage of SRNN will persist when a non-factorized output distribution is employed to capture the intra-step correlation explicitly.
As far as we know, no previous work has carefully compared the performance of SRNN and RNN when non-factorized output distributions are allowed. On the other hand, as shown in Table 3, by modeling the multivariate simultaneity in the simplest way, $\delta$-RNN can achieve a dramatic performance improvement. Motivated by the huge potential as well as the lack of a systematic study, we will next include non-factorized output distributions in our consideration, and properly re-evaluate SRNN and RNN for multivariate sequence modeling.
5 Proper Multivariate Sequence Modeling with or without Latent Variables
5.1 Avoiding the Implicit Data Bias
In this section, we aim to eliminate any experimental bias and provide a proper evaluation of SRNN and RNN for multivariate sequence modeling. Apart from the “model bias” of employing fully factorized output distributions we have discussed, another possible source of bias is actually the experimental data. For example, as we discussed in Section 4.1, the multi-frame speech sequences are constructed by reshaping $L$ consecutive real-valued frames into $L$-dimensional vectors. Consequently, elements within each step are simply temporally correlated with a natural order, which would favor a model that recurrently processes each element from $x_{t,1}$ to $x_{t,L}$ with parameter sharing.
Thus, to avoid such “data bias”, besides speech sequences, we additionally consider three more types of multivariate sequences with different patterns of intra-step correlation:
The first type is the MIDI sound sequence introduced in . Each step of the MIDI sound sequence is an 88-dimensional binary vector, representing the activated piano notes ranging from A0 to C8. Intuitively, to make the MIDI sound musically plausible, there must be some correlations among the notes within each step. However, different from the multi-frame speech data, the correlation structure is no longer temporal.
To avoid the unnecessary complication due to overfitting, we utilize the two relatively larger datasets, namely the Muse (orchestral music) and Nottingham (folk tunes). Following earlier work , we report step-averaged log-likelihood for these two MIDI datasets.
The second one we consider is the widely used handwriting trajectory dataset, IAM-OnDB. Each step of the trajectory is represented by a 3-dimensional vector, where the first dimension is binary, indicating whether the pen is touching the paper or not, and the second and third dimensions are the coordinates of the pen given it is on the paper. Different from other datasets, the dimensionality of each step in IAM-OnDB is significantly lower. Hence, it is reasonable to believe the intra-step structure is relatively simpler here. Following earlier work , we report sequence-averaged log-likelihood for the IAM-OnDB dataset.
The last type is actually a synthetic dataset we derive from TIMIT. Specifically, we maintain the multi-frame structure of the speech sequence, but permute the frames in each step with a pre-determined random order. Intuitively, this can be viewed as an extreme test of a model’s capability of discovering the underlying correlation between frames. Ideally, an optimal model should be able to discover the correct sequential order and recover the same performance as the original TIMIT. For convenience, we will call this dataset Perm-TIMIT.
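The construction of Perm-TIMIT can be sketched as applying one fixed random order to the frames of every step. The seed, the helper names and the toy data below are assumptions for illustration; only the idea of a single pre-determined permutation shared by all steps comes from the text.

```python
import random

def make_permutation(num_frames, seed=0):
    """Draw one fixed random order, reused for every step of every sequence."""
    order = list(range(num_frames))
    random.Random(seed).shuffle(order)
    return order

def permute_step(step, order):
    """Reorder the frames of a single step according to the fixed order."""
    return [step[i] for i in order]

order = make_permutation(200)
# Three toy steps whose values are just their global frame indices.
steps = [list(range(t * 200, (t + 1) * 200)) for t in range(3)]
permuted = [permute_step(step, order) for step in steps]
```

Because the same order is used everywhere, the intra-step correlation is fully preserved; it is merely hidden from any model that assumes neighboring positions are neighboring in time.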
The detailed statistics of all datasets we will use are summarized in Supplementary B.
5.2 Modeling Simultaneity with Auto-Regressive Decomposition
With proper datasets, we now consider how to construct a family of non-factorized distributions that (1) can be easily integrated into RNN and SRNN as the output distribution, and (2) are reasonably expressive for modeling multivariate correlations. Among many possible choices, the most straightforward choice would be the auto-regressive parameterization. Compared to other options such as the normalizing flow or Markov Random Field (e.g. RBM), the auto-regressive structure is conceptually simpler and can be applied to both discrete and continuous data with full tractability. Moreover, various dedicated neural architectures have been developed to support the auto-regressive form. In light of these benefits, we choose to follow this simple idea, and decompose the output distribution of the RNN and SRNN, respectively, as
$$p(x_t \mid \mathbf{x}_{<t}) = \prod_{i=1}^{L} p(x_{t,i} \mid x_{t,<i}, \mathbf{x}_{<t}), \tag{13}$$
$$p(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) = \prod_{i=1}^{L} p(x_{t,i} \mid x_{t,<i}, \mathbf{z}_{\le t}, \mathbf{x}_{<t}), \tag{14}$$
where $x_{t,<i}$ denotes the first $i-1$ elements of $x_t$.
Notice that although we use the natural decomposition order from the smallest index to the largest one, this particular order is generally not optimal for modeling multivariate distributions. A better choice could be adopting the orderless training previously explored in the literature. But for simplicity, we will stick to this simple approach.
Given the auto-regressive decomposition, a natural neural instantiation would be a recurrent hierarchical model that utilizes a two-level architecture to process the sequence:
Firstly, a high-level RNN or SRNN is employed to encode the multivariate steps into a sequence of high-level hidden vectors, following exactly the same computational procedure as in F-RNN and F-SRNN (see Eqn. 2 and 3). Recall that, in the case of SRNN, the computation of high-level vectors involves sampling the latent variables.
Based on the high-level representations, for each multivariate step $x_t$, another neural model will take both the elements of $x_t$ and the high-level vector $h_t$ as input, and auto-regressively produce a sequence of low-level hidden vectors:
Now, it is easy to verify that the low-level hidden vectors computationally satisfy the auto-regressive (causal) constraint and only depend on valid conditional factors for constructing the corresponding output distributions, i.e., $(x_{t,<i}, \mathbf{x}_{<t})$ for RNN and $(x_{t,<i}, \mathbf{z}_{\le t}, \mathbf{x}_{<t})$ for SRNN. Hence, the low-level hidden vectors can then be used to form the per-element output distributions in Eqn. (13) and (14).
In practice, the low-level model could simply be an RNN or a causally masked MLP, depending on our prior about the data. For instance, RNN is clearly not a suitable choice for the Perm-TIMIT dataset, since the elements after permutation do not possess any recurrent pattern. Therefore, for our evaluation, RNN is employed as the low-level neural architecture on all datasets except for the Perm-TIMIT, where we employ a causally masked MLP without parameter sharing, i.e.,
For convenience, we will refer to the hierarchical models as RNN-hier and SRNN-hier.
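The causal constraint on the low-level model can be illustrated with a single masked linear layer. This is a minimal sketch rather than the architecture used in the paper: it omits the high-level vector and any nonlinearity, and the function names and toy weights are hypothetical. Each output position has its own weight row, which mirrors the absence of parameter sharing across positions.

```python
def causal_mask(num_elements):
    """mask[i][j] = 1 only if input j strictly precedes output position i."""
    return [[1.0 if j < i else 0.0 for j in range(num_elements)]
            for i in range(num_elements)]

def masked_linear(x, weights, mask):
    """Linear layer whose weights are zeroed wherever the mask forbids a connection."""
    return [sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
            for w_row, m_row in zip(weights, mask)]

num_elements = 4
mask = causal_mask(num_elements)
weights = [[1.0] * num_elements for _ in range(num_elements)]  # one row per position

x = [1.0, 2.0, 3.0, 4.0]
out = masked_linear(x, weights, mask)

# Changing the last element must not affect any output, since no position may see it.
x_changed = [1.0, 2.0, 3.0, 100.0]
out_changed = masked_linear(x_changed, weights, mask)
```

Position $i$ of the output only sees elements with smaller indices, so the parameters of each conditional in the auto-regressive decomposition never depend on the element being predicted.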
|Z-Forcing + aux||70,469||-||15,430||-||-||-||-|
In some cases where all the elements within a step share the same statistical type, such as on the speech or MIDI dataset, one may alternatively consider a flat model. As the name suggests, the flat model will break the boundary between steps and flatten the data into a new uni-variate sequence, where each step is simply a single element. Then, the new uni-variate sequence can be directly fed into a standard RNN or SRNN model, producing each conditional factor in Eqn. (13) and (14) in an auto-regressive manner. Similarly, this class of RNN and SRNN will be referred to as RNN-flat and SRNN-flat, respectively.
Compared to the hierarchical model, the flat variant implicitly assumes a sequential continuity between $x_{t,L}$ and $x_{t+1,1}$, since their computational dependency is the same as that between any two consecutive elements within the same step. Since this inductive bias matches the characteristics of multi-frame speech sequences, we expect the flat model to perform better in this case.
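The flat view is just a concatenation of the steps into one uni-variate sequence, as in the sketch below (the helper name and the toy data are illustrative assumptions).

```python
def flatten_steps(steps):
    """Concatenate multivariate steps into a single uni-variate sequence."""
    return [element for step in steps for element in step]

steps = [[1, 2, 3], [4, 5, 6]]  # two toy steps with three elements each
flat = flatten_steps(steps)
```

In `flat`, the last element of one step (3) is immediately followed by the first element of the next step (4), so a flat model treats the step boundary like any other pair of neighbors.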
5.3 Experiment Results
Based on the seven datasets listed in Table 7, we compare the performance of factorized models, including F-RNN and F-SRNN, and the non-factorized models introduced above. To provide a random baseline, we include the $\delta$-RNN with the random split scheme in the comparison. Moreover, previous results, if they exist, are also presented to provide additional information. For a fair comparison, we make sure all models share the same parameter size. For more implementation details, please refer to Supplementary C as well as the source code (github.com/zihangdai/reexamine-srnn). Finally, the results are summarized in Table 4, where we make several important observations.
Firstly, on the speech and MIDI datasets, models with auto-regressive (lower-half) output distributions obtain a dramatic advantage over models with fully factorized output distributions (upper-half), achieving new SOTA results on three speech datasets. This observation reminds us that, besides capturing the long-term temporal structure across steps, how to properly model the intra-step dependency is equally, if not more, crucial to the practical performance.
Secondly, when the auto-regressive output distribution is employed (lower-half), the non-stochastic recurrent models consistently outperform their stochastic counterparts across all datasets. In other words, the advantage of SRNN completely disappears once a powerful output distribution is used. Combined with the previous observation, it verifies our earlier concern that the so-called superiority of F-SRNN over F-RNN is merely a result of the biased experiment design in previous work.
In addition, as we expected, when the inductive bias of the flat model matches the characteristics of speech data, it achieves better performance than the hierarchical model. Conversely, when the prior does not match the data property on the other datasets, the hierarchical model is always better. In the extreme case of permuted TIMIT, the flat model even falls behind factorized models, while the hierarchical model achieves a very decent performance that is even much better than what F-SRNN can achieve on the original TIMIT. This shows that the hierarchical model is usually more robust, especially when we don’t have a good prior.
Overall, we don’t find any advantage of employing stochastic latent variables for multivariate sequence modeling. Instead, relying on a full auto-regressive solution yields better or even state-of-the-art performance. Combined with the observation that $\delta$-RNN-random can often achieve performance competitive with F-SRNN, we believe that the theoretical advantage of latent-variable models in sequence modeling is still far from fulfilled, if ultimately possible. In addition, we suggest that future developments along this line be compared against the simple but extremely robust baselines with an auto-regressive output distribution.
6 Conclusion and Discussion
In summary, our re-examination reveals a misleading impression of the benefits of latent variables in sequence modeling. From our empirical observation, the main effect of latent variables is only to provide a mechanism to leverage the intra-step correlation, which, however, is not as powerful as employing the straightforward auto-regressive decomposition. It remains unclear what leads to the significant gap between the theoretical potential of latent variables and their practical effectiveness, which we believe deserves more research attention. Meanwhile, given the large gain of modeling simultaneity, using sequential structures to better capture local patterns is another good future direction in sequence modeling.
This work is supported in part by the National Science Foundation (NSF) under grant IIS-1546329 and by DOE-Office of Science under grant ASCR #KJ040201.
- Bayer & Osendorfer  Bayer, J. and Osendorfer, C. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
- Boulanger-Lewandowski et al.  Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.
- Chen et al.  Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.
- Chung et al.  Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
- Dai et al.  Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Dinh et al.  Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Flunkert et al.  Flunkert, V., Salinas, D., and Gasthaus, J. Deepar: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.
- Fraccaro et al.  Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pp. 2199–2207, 2016.
- Fraccaro et al.  Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pp. 3601–3610, 2017.
- Gan et al.  Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pp. 2467–2475, 2015.
- Germain et al.  Germain, M., Gregor, K., Murray, I., and Larochelle, H. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
- Goyal et al.  Goyal, A. G. A. P., Sordoni, A., Côté, M.-A., Ke, N. R., and Bengio, Y. Z-forcing: Training stochastic recurrent networks. In Advances in neural information processing systems, pp. 6713–6723, 2017.
- Graves  Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Gregor et al.  Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
- Gregor et al.  Gregor, K., Besse, F., Rezende, D. J., Danihelka, I., and Wierstra, D. Towards conceptual compression. In Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.
- Johnson et al.  Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pp. 2946–2954, 2016.
- Kalman  Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
- Kingma & Ba  Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma & Welling  Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma et al.  Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pp. 4743–4751, 2016.
- Krishnan et al.  Krishnan, R. G., Shalit, U., and Sontag, D. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
- Krishnan et al.  Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.
- Lai et al.  Lai, G., Li, B., Zheng, G., and Yang, Y. Stochastic wavenet: A generative latent variable model for sequential data. arXiv preprint arXiv:1806.06116, 2018.
- Loshchilov & Hutter  Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Oord et al.  Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- Parmar et al.  Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., and Ku, A. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
- Rabiner & Juang  Rabiner, L. R. and Juang, B.-H. An introduction to hidden markov models. ieee assp magazine, 3(1):4–16, 1986.
- Rezende & Mohamed  Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Rezende et al.  Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- Roweis & Ghahramani  Roweis, S. and Ghahramani, Z. A unifying review of linear gaussian models. Neural computation, 11(2):305–345, 1999.
- Salimans et al.  Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
- Sutskever et al.  Sutskever, I., Hinton, G. E., and Taylor, G. W. The recurrent temporal restricted boltzmann machine. In Advances in neural information processing systems, pp. 1601–1608, 2009.
- Uria et al.  Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
- van den Oord et al.  van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.
Appendix A Different Variants of Stochastic Recurrent Neural Networks
As stated in Section 3, we employ a simplified version of the stochastic recurrent neural network in our study to evaluate the effectiveness of latent variables in sequence modeling. This section details the connections and differences between our parameterization and the stochastic recurrent network models proposed in previous publications [1, 4, 8, 12].
For stochastic recurrent neural networks, the generic decomposition of the generative distribution shared by previous methods has the form:
where each new step depends on the entire history of the observations and the latent variables. Similarly, for the approximate posterior distribution, all previous approaches can be unified under the form
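The two generic forms described in the text can be sketched as follows. The notation is an assumption rather than a copy of the original equations: $x_t$ is the observation and $z_t$ the latent variable at step $t$, $T$ is the sequence length, and subscripts such as $<t$ and $\le t$ denote sub-sequences.

```latex
% Generative distribution: every step conditions on the full history of
% observations and latent variables.
p(x, z) = \prod_{t=1}^{T} p(x_t \mid z_{\le t}, x_{<t}) \, p(z_t \mid z_{<t}, x_{<t})

% Approximate posterior: every latent variable may condition on the previous
% latent variables and on the whole observation sequence, including the future.
q(z \mid x) = \prod_{t=1}^{T} q(z_t \mid z_{<t}, x)
```

Under this reading, each parameterization discussed in this appendix corresponds to dropping or keeping particular conditioning terms in $p(z_t \mid \cdot)$, $q(z_t \mid \cdot)$, and $p(x_t \mid \cdot)$.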
Given the generic forms, various parameterizations with different independence assumptions have been introduced:
STORN : This parameterization makes two simplifications. Firstly, the prior distribution is assumed to be context independent, i.e.,
Secondly, the posterior distribution is simplified as
which drops the dependence both on future information and on the sub-sequence of previous latent variables.
Despite the simplification in the prior, STORN imposes no independence assumption on the output distribution. Specifically, an RNN is used to capture the two conditional factors:
Notice that the RNN is capable of modeling the correlation among the latent variables and encoding this information into its hidden state.
VRNN : This parameterization eliminates some independence assumptions in STORN. Firstly, the prior distribution becomes fully context dependent via a context RNN:
Notice that the state of the context RNN depends on all previous latent variables. Hence, no independence assumptions are involved in the prior distribution. However, the computation of this state cannot be parallelized, because it takes the sampled latent variable as an input.
Secondly, compared to STORN, the posterior in VRNN additionally depends on the previous latent variables:
where the forward vector is the same one used to construct the prior distribution above. However, the posterior still does not depend on the future observations.
Finally, the output distribution is simply constructed as
SRNN : Compared to VRNN, SRNN (1) introduces a Markov assumption into the latent-to-latent dependence and (2) makes the posterior condition on the future observations.
Specifically, SRNN employs two RNNs, one forward and the other backward, to consume the observation sequence from the two different directions:
From the parametric form, notice that is always conditioned on the entire observation , while only has access to .
Then, the prior and posterior are respectively formed by
where the indicates the aforementioned Markov assumption. In other words, given the sampled value of , is independent of .
Finally, the output distribution of SRNN also involves the same simplification:
Z-Forcing SRNN : By feeding the latent variable as an additional input into the forward RNN, an approach similar to the VRNN, this parameterization successfully removes the Markov assumption in SRNN.
Specifically, the computation goes as follows:
where the is sampled from either the prior or posterior:
Notice that, since relies on in a deterministic manner, there is no Markov assumption anymore when is used to construct the prior and posterior.
The same property also extends to the output distribution, which has the same parametric form as SRNN although the contains different information:
As explained above, as long as the construction of the prior or the posterior takes the values of previous latent variables as input, the computation is completely sequential and cannot be parallelized. When the number of steps reaches the thousands, as in the case of the SRNN-flat model introduced in Section 5.2, training becomes prohibitively slow.
To address this problem, we remove the dependency of each latent variable on the previous latent variables, leading to the prior and posterior introduced in Section 3:
where the forward and backward vectors are both computed separately in a single pass:
However, this simplification entirely throws away the dependency among the latent variables, which may be an oversimplification. To compensate, we employ an additional RNN to process the latent variables (Eqn. 3):
where the hidden state of this additional RNN can potentially capture the correlation among the latent variables. This is similar to the solution in STORN.
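The structure of this simplified parameterization can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work. The function names, the weight dictionary, the plain tanh RNN cell, the unit-variance Gaussian posterior, and the choice to build the prior from the forward vector alone and the posterior from both vectors are all assumptions made for illustration. The point it shows is that no sampled latent variable is fed back into the forward or backward pass, so each pass can consume the whole sequence at once.

```python
import numpy as np

def rnn_pass(inputs, W_in, W_h, reverse=False):
    """Run a plain tanh RNN over a (T, D) array and return all (T, H) states."""
    T, H = inputs.shape[0], W_h.shape[0]
    h = np.zeros(H)
    states = np.zeros((T, H))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(inputs[t] @ W_in + h @ W_h)
        states[t] = h
    return states

def init_params(D, H, Z, rng, scale=0.5):
    """Hypothetical random weights, used only to make the sketch runnable."""
    shapes = {
        "Wf_in": (D, H), "Wf_h": (H, H),   # forward RNN over observations
        "Wb_in": (D, H), "Wb_h": (H, H),   # backward RNN over observations
        "Wz_in": (Z, H), "Wz_h": (H, H),   # additional RNN over latent variables
        "W_prior": (H, Z), "W_post": (2 * H, Z),
    }
    return {name: scale * rng.standard_normal(shape) for name, shape in shapes.items()}

def simplified_srnn_latents(x, params, rng):
    """Compute prior/posterior means, sampled latents, and latent-RNN states."""
    T, D = x.shape
    # Shift the inputs by one step so the forward vector at step t sees only x_{<t}.
    x_prev = np.vstack([np.zeros((1, D)), x[:-1]])
    fwd = rnn_pass(x_prev, params["Wf_in"], params["Wf_h"])
    # The backward vector at step t summarizes x_{>=t}.
    bwd = rnn_pass(x, params["Wb_in"], params["Wb_h"], reverse=True)
    # Neither pass above takes a sampled latent variable as input.
    prior_mean = fwd @ params["W_prior"]
    post_mean = np.concatenate([fwd, bwd], axis=1) @ params["W_post"]
    # Assumed unit-variance Gaussian posterior, sampled for all steps at once.
    z = post_mean + rng.standard_normal(post_mean.shape)
    # Additional RNN over the latent variables to recover their correlation.
    latent_states = rnn_pass(z, params["Wz_in"], params["Wz_h"])
    return prior_mean, post_mean, z, latent_states
```

In a deep learning framework, each call to `rnn_pass` would stand in for one call to a fused RNN module over the full sequence, which is what makes this variant compatible with accelerated RNN kernels.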
An advantage of this simplified solution is that we can directly utilize the cuDNN-accelerated RNN module instead of relying on a customized for-loop as used in VRNN, SRNN and Z-Forcing SRNN. However, the speed advantage is meaningless if the performance degrades significantly. To ensure this does not happen, we compare the simplified SRNN with our implementation of Z-Forcing SRNN on the TIMIT dataset in Table 5. Specifically, we consider two cases, where the output distribution either has a factorized form (F-SRNN) or uses a flat auto-regressive parameterization (SRNN-flat).
As we can see, given a similar number of parameters, the simplified version can match or even surpass the performance of Z-Forcing SRNN without the auxiliary cost. Speed-wise, the simplified version is consistently faster than Z-Forcing SRNN, especially in the flat auto-regressive case (3x faster).
Additionally, in Table 6, we compare the simplified SRNN with the published performance of stochastic recurrent networks on the speech, handwriting trajectory, and music datasets. The results demonstrate that the simplified SRNN is able to reproduce SOTA-level performance.
|Z-Forcing + aux||70,469||15,430||-||-||-|
Due to the comparable performance and improved speed of the simplified SRNN, we choose it as the default parameterization of the stochastic recurrent network in this work.
Appendix B Data Statistics
|Datasets||Number of Steps||Frames / Step|
The dataset statistics are summarized in Table 7. “Frames / Step” indicates the dimension of the vector at each time step. “Number of Steps” is the total length of the multivariate sequence.
Appendix C Experiment Details
In the following, we will provide more details about our implementation. Firstly, Table 8 reports the parameter size of all models compared in Table 4. For data domains with enough data (i.e., speech and handwriting), we ensure the parameter size is about the same. On the smaller MIDI dataset, we only make sure the RNN variants do not use more parameters than SRNNs do.
For all methods, we use the Adam algorithm as the optimizer with a learning rate of 0.001. The cosine schedule is used to anneal the learning rate from 0.001 to 0.000001 during training. The batch size is set to 32 for TIMIT, 128 for VCTK and Blizzard, 16 for Muse and Nottingham, and 32 for IAM-OnDB. The total number of training steps is 20k for Muse, Nottingham, and IAM-OnDB, 40k for TIMIT, 80k for VCTK, and 160k for Blizzard. For all SRNN variants, we follow previous work in employing the KL annealing strategy, where the coefficient on the KL term is increased from 0.2 to 1.0 by an increment of 0.00005 after each parameter update.
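The two schedules described above can be written down as a short Python sketch. The function names are hypothetical, and the sketch assumes a single cosine cycle without restarts and a linear KL ramp that is clipped at 1.0; the released source code may differ in these details.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    """Cosine-annealed learning rate, from lr_max at step 0 to lr_min at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def kl_coefficient(num_updates, start=0.2, end=1.0, increment=5e-5):
    """KL weight that grows by `increment` per parameter update, capped at `end`."""
    return min(end, start + increment * num_updates)
```

With these settings the KL coefficient reaches 1.0 after (1.0 - 0.2) / 0.00005 = 16,000 parameter updates, which is before the end of training for every dataset listed above.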
For architectural details such as the number of layers and hidden dimensions used in this study, we refer readers to the accompanying source code.