Re-examination of the Role of Latent Variables in Sequence Modeling

02/04/2019 ∙ by Zihang Dai, et al.

With latent variables, stochastic recurrent models have achieved state-of-the-art performance in modeling sound-wave sequences. However, opposite results are also observed in other domains, where standard recurrent networks often outperform stochastic models. To better understand this discrepancy, we re-examine the roles of latent variables in stochastic recurrent models for speech density estimation. Our analysis reveals that under the restriction of fully factorized output distributions in previous evaluations, the stochastic models were implicitly leveraging intra-step correlation while the standard recurrent baselines were prohibited from doing so, resulting in an unfair comparison. To correct the unfairness, we remove this restriction in our re-examination, where all the models can explicitly leverage intra-step correlation with an auto-regressive structure. Over a diverse set of sequential data, including human speech, MIDI music, handwriting trajectory and frame-permuted speech, our results show that stochastic recurrent models fail to exhibit any practical advantage despite the claimed theoretical superiority. In contrast, standard recurrent models equipped with an auto-regressive output distribution consistently perform better, significantly advancing the state-of-the-art results on three speech datasets.




1 Introduction

As a fundamental problem in machine learning, probabilistic sequence modeling aims at capturing the sequential correlations in both short and long ranges. Among many possible model choices, deep auto-regressive models [13, 33] have become one of the most widely adopted solutions. Typically, a deep auto-regressive model factorizes the likelihood function of sequences in an auto-regressive manner, i.e., $p(\mathbf{x}) = \prod_{t=1}^{|\mathbf{x}|} p(x_t \mid \mathbf{x}_{<t})$. Then, a neural network (e.g. RNN) is employed to encode the conditional context $\mathbf{x}_{<t}$ into a compact hidden representation $h_t$, which is then used to define the output distribution $p(x_t \mid h_t)$. As deep auto-regressive models are fully tractable, they can be trained by maximum likelihood estimation (MLE) using back-propagation.
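The auto-regressive factorization can be illustrated with a minimal sketch. It assumes a toy binary alphabet and a first-order model in which the context is truncated to the previous element (a real model would encode the full context with an RNN); the probabilities `INIT` and `TRANS` are made-up illustrative values.

```python
import itertools
import numpy as np

# Toy first-order auto-regressive model over a binary alphabet (hypothetical numbers).
INIT = np.array([0.6, 0.4])            # P(x_1)
TRANS = np.array([[0.9, 0.1],          # P(x_t | x_{t-1} = 0)
                  [0.3, 0.7]])         # P(x_t | x_{t-1} = 1)

def sequence_log_prob(x):
    """log P(x) = sum_t log P(x_t | x_{<t}); here the context is just x_{t-1}."""
    logp = np.log(INIT[x[0]])
    for prev, cur in zip(x[:-1], x[1:]):
        logp += np.log(TRANS[prev, cur])
    return float(logp)

# Each conditional is normalized, so the probabilities of all length-4
# sequences sum to one: the factorization defines a valid, tractable density.
total = sum(np.exp(sequence_log_prob(s)) for s in itertools.product([0, 1], repeat=4))
```

Because every factor is an explicit conditional, the exact log-likelihood is a plain sum, which is what makes MLE training by back-propagation possible.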

Despite the state-of-the-art (SOTA) performance in many domains [7, 3, 26, 5], the hidden representations of standard auto-regressive models are produced in a completely deterministic way. Hence, the stochastic aspects of the observed sequences can only be modeled by the output distribution, which, however, usually has a simple parametric form such as a unimodal distribution or a finite mixture of unimodal distributions. A potential weakness of such simple forms is that they may not be sufficiently expressive for modeling real-world sequential data with complex stochastic dynamics.

Recently, many efforts have been made to enrich the expressive power of auto-regressive models by injecting stochastic latent variables into the computation of hidden states. Notably, relying on the variational auto-encoding (VAE) framework [19, 29], stochastic recurrent models (SRNN) have outperformed standard RNN-based auto-regressive models by a large margin in modeling raw sound-wave sequences [1, 4, 8, 12, 23].

However, the success of stochastic latent variables does not necessarily generalize to other domains such as text and images. For instance, Goyal et al. [12] report that an SRNN trained by Z-Forcing lags behind a baseline RNN in language modeling. Similarly, for the density estimation of natural images, PixelCNN [25, 34, 31] consistently outperforms generative models with latent variables [14, 15, 28, 6, 20].

To better understand the discrepancy, we re-examine the role of stochastic variables in SRNN models. By carefully inspecting the previous experimental settings for sound-wave density estimation, and systematically analyzing the properties of SRNN, we identify two potential causes of the performance gap between SRNN and RNN. Controlled experiments are designed to test each hypothesis, where we find that previous evaluations impose an unnecessary restriction of fully factorized output distributions, which has led to an unfair comparison between SRNN and RNN. Specifically, under the factorized parameterization, SRNN can still implicitly leverage the intra-step correlation, i.e., the simultaneity [2], while the RNN baselines are prohibited from doing so. Meanwhile, we also observe that the posterior learned by SRNN can be outperformed by a simple hand-crafted posterior, raising a serious doubt about the general effectiveness of injecting latent variables.

To provide a fair comparison, we propose an evaluation setting where both the SRNN and RNN can utilize an auto-regressive output distribution to model the intra-step correlation explicitly. Under the new setting, we re-evaluate SRNN and RNN on a diverse collection of sequential data, including human speech, MIDI music, handwriting trajectory and frame-permuted speech. Empirically, we find that sequential models with continuous latent variables fail to offer any practical benefits, despite their widely believed theoretical superiority. On the contrary, explicitly capturing the intra-step correlation with an auto-regressive output distribution consistently performs better, substantially improving the SOTA performances in modeling speech signals. Overall, these observations show that the previously reported performance “advantage” of SRNN is merely the result of a long-existing experiment bias of using factorized output distributions.

2 Related Work

In the field of probabilistic sequence modeling, many efforts prior to deep learning have been devoted to State Space Models, such as the Hidden Markov Model with discrete states and the Kalman Filter [17], whose states are continuous.

Recently, the focus has shifted to deep sequential models, including tractable deep auto-regressive models without any latent variable and deep stochastic models that combine the powerful nonlinear computation of neural networks and the stochastic flexibility of latent-variable models. The recurrent temporal RBM [32] and RNN-RBM [2] are early examples of how latent variables can be incorporated into deep neural networks. After the VAE was introduced, stochastic back-propagation made it easy to combine deep neural networks and latent-variable models, leading to the stochastic recurrent models introduced in Section 1, temporal sigmoid belief networks [10], deep Kalman Filters [21], deep Markov Models [22], Kalman variational auto-encoders [9] and many other variants. Johnson et al. [16] provide a general discussion on how classic graphical models and deep neural networks can be combined.

3 Background

In this section, we briefly review SRNN and RNN for probabilistic sequence modeling. Throughout the paper, we will use bold font $\mathbf{x}$ to denote a sequence, $\mathbf{x}_{<t}$ and $\mathbf{x}_{\le t}$ to indicate the sub-sequences of the first $t-1$ and $t$ elements respectively, and $x_t$ to represent the $t$-th element. Note that $x_t$ can either be a scalar or a vector. In the latter case, $x_{t,i}$ refers to the $i$-th element of the vector $x_t$.

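Given a set of sequences $\mathcal{D} = \{\mathbf{x}^1, \dots, \mathbf{x}^{|\mathcal{D}|}\}$, we are interested in building a density estimation model for sequences. A widely adopted solution is to employ an auto-regressive model powered by a neural network, and utilize MLE to perform the training:

$$\max_\theta \; \mathcal{L}(\mathcal{D}) = \sum_{i=1}^{|\mathcal{D}|} \log p_\theta(\mathbf{x}^i) = \sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{T_i} \log p_\theta(x^i_t \mid \mathbf{x}^i_{<t}), \tag{1}$$

where $T_i$ is the length of the sequence $\mathbf{x}^i$.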

In practice, the conditional distribution $p_\theta(x_t \mid \mathbf{x}_{<t})$ is usually jointly modeled by three sub-modules:

  • The pre-defined family of the output distribution, such as Gaussian or Categorical;

  • The sequence model $f_\theta$, which encodes the contextual sequence $\mathbf{x}_{<t}$ into a compact hidden vector $h_t$;

  • The output model $g_\theta$, which transforms the hidden vector $h_t$ into the distribution parameters of the chosen family.

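For instance, when Gaussian is chosen as the output distribution family, $f_\theta$ and $g_\theta$ jointly transform $\mathbf{x}_{<t}$ into the predicted “mean” and “variance” of the Gaussian, i.e.,

$$h_t = f_\theta(\mathbf{x}_{<t}), \qquad (\mu_t, \sigma_t^2) = g_\theta(h_t), \qquad p_\theta(x_t \mid \mathbf{x}_{<t}) = \mathcal{N}(x_t; \mu_t, \sigma_t^2).$$

Under this general framework, RNN and SRNN can be seen as two different instantiations of the sequence model.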

When RNN is employed as the sequence model $f_\theta$, the hidden states are produced recurrently:

$$h_t = \mathrm{RNN}_\theta(x_{t-1}, h_{t-1}). \tag{2}$$

As we have discussed in Section 1, the computation inside RNN is fully deterministic. Hence, in order to model a complex distribution $p(x_t \mid \mathbf{x}_{<t})$, one has to rely on a rich enough distribution family.
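The interplay of the three sub-modules can be sketched as follows. This is a minimal illustration under stated assumptions, not the configuration used in the paper: the sequence model is a vanilla tanh RNN, the output model is a single linear map to a mean and a log-variance, the data are scalar, and all weights (`W_x`, `W_h`, `W_out`) are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8  # hidden size (arbitrary)

# Hypothetical parameters of the sequence model f (a vanilla tanh RNN) ...
W_x = rng.normal(scale=0.5, size=(H, 1))
W_h = rng.normal(scale=0.5, size=(H, H))
# ... and of the output model g (a linear map to mean and log-variance).
W_out = rng.normal(scale=0.5, size=(2, H))

def gaussian_log_pdf(x, mean, log_var):
    return -0.5 * (np.log(2 * np.pi) + log_var + (x - mean) ** 2 / np.exp(log_var))

def rnn_log_likelihood(x):
    """Sum over t of log p(x_t | x_{<t}) for a scalar sequence x."""
    h = np.zeros(H)          # the first hidden state encodes the empty context
    total = 0.0
    for x_t in x:
        mean, log_var = W_out @ h                    # output model g
        total += gaussian_log_pdf(x_t, mean, log_var)
        h = np.tanh(W_x[:, 0] * x_t + W_h @ h)       # sequence model f absorbs x_t
    return float(total)

x = rng.normal(size=20)
ll = rnn_log_likelihood(x)
```

The hidden state is a deterministic function of the past, so all randomness in the model sits in the Gaussian output; this is the limitation that motivates richer output families or latent variables.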

To improve the model expressiveness, SRNN takes an alternative route and incorporates continuous latent variables into the sequence model. Typically, SRNN associates the observed data sequence $\mathbf{x}$ with a sequence of latent variables $\mathbf{z} = [z_1, \dots, z_T]$, one for each step. With latent variables, the internal dynamics of the sequence model are no longer deterministic, offering a theoretical possibility to capture more complex stochastic patterns. However, the improved capacity comes with a computational burden: the log-likelihood is generally intractable due to the integral

$$\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}.$$

Hence, standard MLE training cannot be performed.

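To handle the intractability, SRNN utilizes the VAE framework and maximizes the evidence lower bound (ELBO) of the log-likelihood (1) for training:

$$\log p_\theta(\mathbf{x}) \ge \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \left[ \log p_\theta(\mathbf{x}, \mathbf{z}) - \log q_\phi(\mathbf{z} \mid \mathbf{x}) \right],$$

where $q_\phi(\mathbf{z} \mid \mathbf{x})$ is the approximate posterior distribution modeled by an encoder network.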
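The lower-bound property can be checked numerically on a model where everything is available in closed form. The sketch below assumes a one-dimensional toy model that is not from the paper: $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, so that $x \sim \mathcal{N}(0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$. The ELBO is written as the expected reconstruction term minus the KL divergence to the prior, which is equivalent to the expectation of $\log p(x, z) - \log q(z \mid x)$.

```python
import numpy as np

def log_marginal(x):
    """log p(x) for z ~ N(0, 1), x | z ~ N(z, 1), i.e. x ~ N(0, 2)."""
    return -0.5 * (np.log(2 * np.pi * 2.0) + x ** 2 / 2.0)

def elbo(x, m, s2):
    """Closed-form ELBO for the Gaussian posterior q(z | x) = N(m, s2)."""
    recon = -0.5 * (np.log(2 * np.pi) + (x - m) ** 2 + s2)   # E_q[log p(x | z)]
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))              # KL(q || N(0, 1))
    return recon - kl

x = 1.3
loose = elbo(x, m=0.0, s2=1.0)        # posterior set equal to the prior
tight = elbo(x, m=x / 2.0, s2=0.5)    # exact posterior N(x/2, 1/2)
```

With the exact posterior the bound is tight; any other choice of `m` and `s2` leaves a gap equal to the KL divergence between the approximate and the exact posterior.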

Computationally, several SRNN variants have been proposed [1, 4, 8, 12], mostly differing in how the generative distribution and the variational posterior are parameterized. In this work, we consider the following parameterization. Firstly, a forward RNN and a backward RNN are employed to encode $\mathbf{x}$ into two sequences of forward and backward vectors, respectively:

where denotes vector concatenation. Based on the forward and backward vectors, the per-step prior and posterior are constructed:

Then, for each step $t$, a sample of $z_t$ is drawn from the prior or the posterior, depending on whether it is during training or evaluation. Given the sampled values, we employ an additional RNN to merge $\mathbf{z}_{\le t}$ and $\mathbf{x}_{<t}$ into the hidden state $h_t$, while capturing the dependency among $\mathbf{z}_{\le t}$:


Finally, similar to the RNN case, the hidden state $h_t$ is fed into the output model, producing the output distribution $p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t})$.

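Under this parameterization, the ELBO reduces to

$$\mathcal{L}(\mathbf{x}) = \sum_{t=1}^{T} \Big( \mathbb{E}_{q_\phi(\mathbf{z}_{\le t} \mid \mathbf{x})}\big[\log p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t})\big] - \mathrm{KL}\big(q_\phi(z_t \mid \mathbf{x}) \,\|\, p_\theta(z_t \mid \mathbf{x}_{<t})\big) \Big). \tag{4}$$
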
Compared to previous approaches [4, 8, 12] that employ a recurrent latent-to-latent connection to capture the dependency between latent variables, we instead use a factorized form and then stack another RNN on the sequence of sampled latent variables to model the dependency, as shown in Eqn. (3). We find that this simplification yields comparable or even better performance under the same model size while significantly improving the speed. For more details, we refer readers to Supplementary A.

4 Revisiting SRNN for Speech Modeling

4.1 Previous Setting for Speech Density Estimation

To compare SRNN and RNN, previous studies largely rely on the density estimation of sound-wave sequences. Usually, a sound-wave dataset consists of a collection of audio sequences with a sample rate of 16 kHz, where each frame (element) of the sequence is a scalar in $[-1, 1]$, representing the normalized amplitude of the sound.

Instead of treating each frame as a single step, Chung et al. [4] propose a multi-frame setting, where every 200 consecutive frames are taken as a single step. Effectively, the data can be viewed as a sequence of 200-dimensional real-valued vectors, i.e., $x_t \in \mathbb{R}^{L}$ with $L = 200$. During training, every 40 steps (8,000 frames) are taken as an i.i.d. sequence to form the training set.
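The multi-frame data format amounts to a reshape. The sketch below is illustrative only: the function name `to_multi_frame` is hypothetical, the waveform is a synthetic stand-in for normalized audio, and dropping the incomplete tail of the recording is an assumption rather than a documented preprocessing step.

```python
import numpy as np

FRAMES_PER_STEP = 200    # frames grouped into one step
STEPS_PER_SEQUENCE = 40  # 40 steps x 200 frames = 8,000 frames

def to_multi_frame(waveform):
    """Cut a 1-D waveform into training sequences of shape (40, 200)."""
    frames_per_seq = FRAMES_PER_STEP * STEPS_PER_SEQUENCE
    n_seq = len(waveform) // frames_per_seq          # drop the incomplete tail
    trimmed = waveform[: n_seq * frames_per_seq]
    return trimmed.reshape(n_seq, STEPS_PER_SEQUENCE, FRAMES_PER_STEP)

waveform = np.linspace(-1.0, 1.0, 20_000)  # stand-in for normalized audio
batch = to_multi_frame(waveform)           # shape: (sequences, steps, frames)
```

Each row `batch[n, t]` is one 200-dimensional step, and consecutive frames stay adjacent inside a step, which is why the elements of a step are strongly correlated.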

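Under this data format, notice that the output distributions $p_\theta(x_t \mid \mathbf{x}_{<t})$ and $p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t})$ now correspond to an $L$-dimensional random vector $x_t$. Therefore, how to parameterize this multivariate distribution can largely influence the empirical performance. That said, recent approaches [8, 12] have all followed Chung et al. [4] to employ a fully factorized parametric form which ignores the inner dependency:

$$p_\theta(x_t \mid \mathbf{x}_{<t}) \approx \prod_{i=1}^{L} p_\theta(x_{t,i} \mid \mathbf{x}_{<t}), \tag{5}$$

$$p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) \approx \prod_{i=1}^{L} p_\theta(x_{t,i} \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}). \tag{6}$$

Here, we have used $\approx$ to emphasize that this choice effectively imposes an independence assumption. Despite this convenience, note that the restriction to a fully factorized form is not necessary at all. Nevertheless, we will refer to the models in Eqn. (5) and Eqn. (6), respectively, as factorized RNN (F-RNN) and factorized SRNN (F-SRNN) in the following.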

To provide a baseline for further discussion, we replicate the experiments under the setting introduced above and evaluate them on three speech datasets, namely TIMIT, VCTK and Blizzard. Following the previous work [4], we choose a Gaussian mixture to model the per-frame distribution of F-RNN, which enables a basic multi-modality.

We report the averaged test log-likelihood in Table 1. For consistency with previous results in literature, the results of TIMIT and Blizzard are based on sequence-level average, while the result of VCTK is frame-level average. As we can see, similar to previous observations, F-SRNN outperforms F-RNN on all three datasets by a dramatic margin.

Models    TIMIT    VCTK    Blizzard
F-RNN     32,745   0.786   7,610
F-SRNN    69,223   2.383   15,258

Table 1: Performance comparison (averaged test log-likelihood) on three benchmark datasets.

4.2 Decomposing the Advantages of Factorized SRNN

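To understand why the F-SRNN outperforms F-RNN by such a large margin, it is helpful to examine the effective output distribution of F-SRNN after marginalizing out the latent variables:

$$\begin{aligned} p_\theta(x_t \mid \mathbf{x}_{<t}) &= \int p_\theta(\mathbf{z}_{\le t} \mid \mathbf{x}_{<t}) \, p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) \, d\mathbf{z}_{\le t} \\ &= \int p_\theta(\mathbf{z}_{\le t} \mid \mathbf{x}_{<t}) \prod_{i=1}^{L} p_\theta(x_{t,i} \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) \, d\mathbf{z}_{\le t}. \end{aligned} \tag{7}$$
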
From this particular form, we can see two potential causes of the performance gap between F-SRNN and F-RNN in the multi-frame setting:

  • Advantage under High Volatility: By incorporating the continuous latent variable, the distribution $p_\theta(x_t \mid \mathbf{x}_{<t})$ of F-SRNN essentially forms an infinite mixture of simpler distributions (see the first line of Eqn. 7). As a result, the distribution is significantly more expressive and flexible, and it is believed to be particularly suitable for modeling high-entropy sequential dynamics [4].

    The multi-frame setting introduced above matches this description well. Concretely, since the model is required to predict the next $L$ frames all together in this setting, the long prediction horizon will naturally involve a higher uncertainty. Therefore, the high volatility of the multi-frame setting may provide a perfect scenario for SRNN to exhibit its theoretical advantage in expressiveness.

  • Utilizing the Intra-Step Correlation: From the second line of Eqn. (7), notice that the distribution after marginalization is generally no longer factorized, due to the coupling with $\mathbf{z}_{\le t}$. In contrast, recall that the same distribution of the F-RNN (Eqn. 5) is fully factorized. Therefore, in theory, a factorized SRNN could still model the correlation among the frames within each step, if properly trained, while the factorized RNN has no means to do so at all. Thus, SRNN may also benefit from this difference.
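The second point can be made concrete with a tiny discrete example. The sketch below uses a two-component mixture as a finite stand-in for the infinite mixture induced by a continuous latent variable; all probabilities are made-up illustrative values. Each component is fully factorized over $(x_1, x_2)$, yet the marginal obtained by summing out $z$ is not.

```python
import numpy as np

# Hypothetical two-component mixture over a binary pair (x_1, x_2).
p_z = np.array([0.5, 0.5])                 # P(z)
p_x1_given_z = np.array([0.9, 0.1])        # P(x_1 = 1 | z)
p_x2_given_z = np.array([0.9, 0.1])        # P(x_2 = 1 | z)

def bernoulli(p, x):
    return p if x == 1 else 1.0 - p

# Joint after marginalizing z: sum_z P(z) P(x_1 | z) P(x_2 | z).
joint = np.zeros((2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        joint[x1, x2] = sum(
            p_z[z] * bernoulli(p_x1_given_z[z], x1) * bernoulli(p_x2_given_z[z], x2)
            for z in (0, 1)
        )

marg_x1 = joint.sum(axis=1)
marg_x2 = joint.sum(axis=0)
independent = np.outer(marg_x1, marg_x2)   # the best a factorized model without z can express
```

Here `joint[1, 1]` is 0.41 while the product of the marginals gives 0.25: the latent variable couples the two elements even though every conditional is factorized.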

While both advantages could have jointly led to the performance gap in Table 1, the implications are totally different. The first advantage under high volatility is a unique property of latent-variable models that other generative models without latent variables can hardly obtain. Therefore, if this property significantly contributes to the superior performance of F-SRNN over F-RNN, it suggests a more general effectiveness of incorporating stochastic latent variables.

On the contrary, being able to utilize the intra-step correlation is more like an unfair benefit to SRNN, since it is the unnecessary restriction of fully factorized output distributions in the previous experimental design that prevents RNNs from modeling the correlation. In practice, one can easily enable RNNs to do so by employing a non-factorized output distribution. In this case, it remains unclear whether this particular advantage will persist.

Motivated by the distinct implications, in the sequel, we will try to figure out how much each of the two hypotheses above actually contributes to the performance gap.

4.3 Advantage under High Volatility

In order to test the advantage of F-SRNN in modeling highly volatile data in isolation, the idea is to construct a sequential dataset where each step consists of a single frame (i.e., a uni-variate variable), while there exists high volatility between every two consecutive steps.

Concretely, for each sequence $\mathbf{x}$, we create a sub-sequence by selecting one frame from every $S$ consecutive frames, i.e., by subsampling $\mathbf{x}$ with a stride of $S$. Intuitively, a larger stride will lead to a longer horizon between two selected frames and hence a higher uncertainty. Moreover, since each step corresponds to a single scalar, the second advantage (i.e., the potential confounding factor) automatically disappears.

Following this idea, from the original datasets, we derive the stride-TIMIT, stride-VCTK and stride-Blizzard with different stride values $S \in \{50, 200\}$, and evaluate the RNN and SRNN on each of them. Again, we report the sequence- or frame-average test likelihood in Table 2.
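The construction of the stride datasets can be sketched in one line of slicing. The function name `stride_subsample` is hypothetical, and starting at the first frame of the sequence is an assumption; the text does not specify which frame of each block is kept.

```python
import numpy as np

def stride_subsample(x, stride):
    """Keep one frame out of every `stride` consecutive frames."""
    return x[::stride]

x = np.arange(8000)                     # stand-in for one 8,000-frame sequence
sub_50 = stride_subsample(x, 50)        # 160 steps, 50 frames apart
sub_200 = stride_subsample(x, 200)      # 40 steps, 200 frames apart
```

Each resulting step is a single scalar, so there is no intra-step structure left to exploit, while the gap between consecutive steps grows with the stride.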

Model   Stride = 50                  Stride = 200
        TIMIT    VCTK    Blizzard    TIMIT    VCTK     Blizzard
RNN     20,655   0.668   4,607       4,124    0.177    -320
SRNN    14,469   0.601   3,603       -1,137   -0.003   -1,231

Table 2: Performance comparison on high-volatility datasets.

Surprisingly, RNN consistently achieves a better performance than SRNN in this setting. It suggests the theoretically better expressiveness of SRNN does not help that much in high-volatility scenarios. Hence, this potential advantage does not really contribute to the performance gap observed in Table 1.

4.4 Utilizing the Intra-Step Correlation

After ruling out the first hypothesis, it becomes more likely that being able to utilize the intra-step correlation actually leads to the superior performance of F-SRNN. However, despite the non-factorized form in Eqn. (7), it is still not clear how F-SRNN computationally captures the correlation in practice. Here, we provide a particular possibility.

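To facilitate the discussion, we first rewrite the ELBO in Eqn. (4) in terms of the reconstruction and the KL term:

$$\mathcal{L}^{\text{rec}}_t = \mathbb{E}_{q_\phi(\mathbf{z}_{\le t} \mid \mathbf{x})}\big[\log p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t})\big], \tag{8}$$

$$\mathcal{L}^{\text{KL}}_t = -\mathrm{KL}\big(q_\phi(z_t \mid \mathbf{x}) \,\|\, p_\theta(z_t \mid \mathbf{x}_{<t})\big), \tag{9}$$

so that the ELBO is $\sum_t \big(\mathcal{L}^{\text{rec}}_t + \mathcal{L}^{\text{KL}}_t\big)$. From Eqn. (8), notice that the vector $x_t$ we hope to reconstruct at step $t$ is included in the conditional input $\mathbf{x}$ to the posterior $q_\phi(\mathbf{z}_{\le t} \mid \mathbf{x})$. With this computational structure, the encoder can theoretically leak a subset of the vector $x_t$ into the latent variable $z_t$, and leverage the leaked subset to predict (reconstruct) the remaining elements in $x_t$. Intuitively, the procedure of using the leaked subset to predict the remaining subset is essentially exploiting the dependency between the two subsets, or in other words, the correlation within $x_t$.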

To make this informal description more concrete, we construct a special example. Following the intuition above, we split the elements of the vector $x_t$ into two arbitrary disjoint subsets, the leaked subset $x_t^{\text{leak}}$ and its complement $x_t^{\text{rest}}$. Then, we consider a special posterior:

$$q_\phi(z_t \mid \mathbf{x}) = \delta_{z_t = x_t^{\text{leak}}}, \tag{10}$$

where $\delta_{z_t = x_t^{\text{leak}}}$ denotes a delta function that puts all probability mass of $z_t$ on the single point $x_t^{\text{leak}}$. Effectively, this posterior simply memorizes the leaked subset $x_t^{\text{leak}}$. Under this delta posterior, if we further assume , Eqn. (8) and (9) can be simplified into

Notice the term in and the term in

can cancel out each other, because they are both degenerated delta distributions with the random variable

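on both sides of the conditional bar. Thus, after the cancellation, the ELBO further reduces to

$$\mathcal{L}(\mathbf{x}) = \sum_{t=1}^{T} \Big( \log p_\theta(x_t^{\text{leak}} \mid \mathbf{x}_{<t}) + \log p_\theta(x_t^{\text{rest}} \mid x_t^{\text{leak}}, \mathbf{x}_{<t}) \Big). \tag{11}$$

Now, the second term above is always conditioned on the leaked subset $x_t^{\text{leak}}$ of $x_t$ to predict the remaining subset $x_t^{\text{rest}}$, which is exactly utilizing the correlation between the two subsets. From another perspective, the form of Eqn. (11) is equivalent to a particular auto-regressive factorization of the output distribution:

$$p_\theta(x_t \mid \mathbf{x}_{<t}) = p_\theta(x_t^{\text{leak}} \mid \mathbf{x}_{<t}) \, p_\theta(x_t^{\text{rest}} \mid x_t^{\text{leak}}, \mathbf{x}_{<t}). \tag{12}$$
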
In other words, with a proper posterior, F-SRNN can recover a certain auto-regressive parameterization, making it possible to utilize the intra-step correlation. More importantly, the conditioning on the leaked subset is not affected by the choice of fully factorized output distributions, since the information is passed through the posterior.
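How much conditioning on a leaked element can help is easy to quantify in a toy case. The sketch below assumes a two-element step drawn from a standard bivariate Gaussian with a made-up correlation `rho`, and compares the average log-likelihood of a fully factorized model with that of a model that predicts the second element given the first. For this distribution the expected gain equals the mutual information between the two elements, $-\tfrac{1}{2}\log(1 - \rho^2)$.

```python
import numpy as np

rho = 0.9  # hypothetical correlation between the two elements of a step
cov = np.array([[1.0, rho], [rho, 1.0]])
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(2), cov, size=100_000)
x_a, x_b = x[:, 0], x[:, 1]             # "leaked" element and remaining element

def normal_log_pdf(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

# Fully factorized: p(x_a) p(x_b), each a standard normal marginal.
ll_factorized = np.mean(normal_log_pdf(x_a, 0.0, 1.0) + normal_log_pdf(x_b, 0.0, 1.0))

# Auto-regressive: p(x_a) p(x_b | x_a), with x_b | x_a ~ N(rho * x_a, 1 - rho^2).
ll_autoreg = np.mean(
    normal_log_pdf(x_a, 0.0, 1.0) + normal_log_pdf(x_b, rho * x_a, 1.0 - rho ** 2)
)

gain = ll_autoreg - ll_factorized       # approaches -0.5 * log(1 - rho^2)
```

Both models use the correct marginals; the entire gap comes from the conditional term, which is the kind of benefit a fully factorized RNN is barred from obtaining.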

Although the analysis and construction above provide a theoretical possibility, we still lack concrete evidence to support the hypothesis that F-SRNN has significantly benefited from modeling the intra-step correlation. While it is difficult to verify this hypothesis in general, we can exploit the equivalence in Eqn. (12) to get some empirical evidence. Specifically, we can parameterize an RNN according to Eqn. (12), which is equivalent to an F-SRNN with a delta posterior as defined in Eqn. (10). Therefore, by measuring the performance of this special RNN, we can get a conservative estimate of how much modeling the intra-step correlation can contribute to the performance of F-SRNN.

To complete the special RNN idea, we still need to specify how $x_t$ is split into the leaked subset and its complement. Here, we consider two methods with different intuitions:

  • Interleaving: The first method takes one element out of every fixed number of consecutive elements, i.e., with a fixed stride, to construct the leaked subset. Essentially, this method interleaves the two subsets. As a result, when we condition on the leaked subset to predict its complement, each element in the complement will have some nearby leaked elements to provide information, which eases the prediction. In the extreme case of a stride of 2, the leaked subset includes the odd elements of $x_t$ and the complement the even ones. Hence, when predicting an even element $x_{t,2i}$, the output distribution is conditioned on both the element to the left $x_{t,2i-1}$ and the element to the right $x_{t,2i+1}$, making the problem much easier.

  • Random: The second method simply selects a fixed number of elements from $x_t$ uniformly at random to form the leaked subset, and leaves the rest for the complement. Intuitively, this can be viewed as an informal “lower bound” of the performance gain through modeling the intra-step correlation.
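The two split schemes can be sketched as index selections. The function names, the `stride` and `n_leaked` parameters and the seed are hypothetical; whether the random subset is drawn once and shared across all steps, as done here, is an assumption.

```python
import numpy as np

def interleaving_split(dim, stride):
    """Leak one element out of every `stride`; return (leaked, rest) index arrays."""
    leaked = np.arange(0, dim, stride)
    rest = np.setdiff1d(np.arange(dim), leaked)
    return leaked, rest

def random_split(dim, n_leaked, seed=0):
    """Leak `n_leaked` indices chosen uniformly at random without replacement."""
    rng = np.random.default_rng(seed)
    leaked = np.sort(rng.choice(dim, size=n_leaked, replace=False))
    rest = np.setdiff1d(np.arange(dim), leaked)
    return leaked, rest

leak_i, rest_i = interleaving_split(dim=200, stride=2)   # odd/even split (1-indexed)
leak_r, rest_r = random_split(dim=200, n_leaked=50)
```

With `stride=2` the leaked indices are 0, 2, 4, ... in 0-based indexing, i.e., the odd elements in the 1-based convention of the text, so every remaining element has a leaked neighbor on each side except the last one.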

Models     TIMIT    VCTK    Blizzard
F-RNN      32,745   0.786   7,610
F-SRNN     69,223   2.383   15,258
δ-RNN ()   70,900   2.027   15,306
δ-RNN ()   72,067   2.262   15,284
δ-RNN ()   70,435   2.333   14,934
δ-RNN ()   56,558   2.243   13,188
δ-RNN ()   57,216   2.153   12,803
δ-RNN ()   66,122   2.199   14,389
δ-RNN ()   66,453   2.120   14,585
δ-RNN ()   64,382   1.948   13,776

Table 3: Performance comparison between δ-RNN and F-SRNN. Note that a smaller value of the split parameter corresponds to leaking more elements.

Since the parametric form (12) is derived from a delta posterior, we will refer to the special RNN model as δ-RNN. Based on the two split methods, we train δ-RNN on TIMIT, VCTK and Blizzard with different values of the split parameters. The results are summarized in Table 3. As we can see, when the interleaving split scheme is used, δ-RNN significantly improves upon F-RNN, and becomes very competitive with F-SRNN. Specifically, on TIMIT and Blizzard, δ-RNN can even outperform F-SRNN in certain cases. More surprisingly, the δ-RNN with the random-copy scheme can also achieve a performance that is very close to that of F-SRNN, especially compared to F-RNN.

Recall that δ-RNN is equivalent to employing a manually designed delta posterior that can only copy but never compress (auto-encode) the information in $x_t$. As a result, compared to a posterior that can learn to compress information, the delta posterior will incur a higher KL cost when leaking information through the posterior. Despite this disadvantage, δ-RNN is still able to match or even surpass the performance of F-SRNN, suggesting that the learned posterior in F-SRNN is far from satisfactory. Moreover, the limited performance gap between F-SRNN and the random-copy baseline raises a serious concern about the effectiveness of current variational inference techniques.

Nevertheless, putting the analysis and empirical evidence together, we can conclude that the performance advantage of F-SRNN in the multi-frame setting can be entirely attributed to the second cause. That is, under the factorized constraint in previous experiments, F-SRNN can still implicitly leverage the intra-step correlation, while F-RNN is prohibited from doing so. However, as we have discussed earlier in Section 4.2, this is essentially an unfair comparison. More importantly, the claimed superiority of SRNN over RNN may be misleading, as it is unclear whether the performance advantage of SRNN will persist when a non-factorized output distribution is employed to capture the intra-step correlation explicitly.

As far as we know, no previous work has carefully compared the performance of SRNN and RNN when a non-factorized output distribution is allowed. On the other hand, as shown in Table 3, by modeling the multivariate simultaneity in the simplest way, δ-RNN can achieve a dramatic performance improvement. Motivated by the huge potential as well as the lack of a systematic study, we will next include non-factorized output distributions in our consideration, and properly re-evaluate SRNN and RNN for multivariate sequence modeling.

5 Proper Multivariate Sequence Modeling with or without Latent Variables

5.1 Avoiding the Implicit Data Bias

In this section, we aim to eliminate any experimental bias and provide a proper evaluation of SRNN and RNN for multivariate sequence modeling. Apart from the “model bias” of employing fully factorized output distributions we have discussed, another possible source of bias is the experimental data itself. For example, as we discussed in Section 4.1, the multi-frame speech sequences are constructed by reshaping $L$ consecutive real-valued frames into $L$-dimensional vectors. Consequently, elements within each step are simply temporally correlated with a natural order, which would favor a model that recurrently processes each element from $x_{t,1}$ to $x_{t,L}$ with parameter sharing.

Thus, to avoid such “data bias”, besides speech sequences, we additionally consider three more types of multivariate sequences with different patterns of intra-step correlation:

  • The first type is the MIDI sound sequence introduced in [2]. Each step of the MIDI sound sequence is an 88-dimensional binary vector, representing the activated piano notes ranging from A0 to C8. Intuitively, to make the MIDI sound musically plausible, there must be some correlations among the notes within each step. However, different from the multi-frame speech data, the correlation structure is no longer temporal.

    To avoid the unnecessary complication due to overfitting, we utilize the two relatively larger datasets, namely the Muse (orchestral music) and Nottingham (folk tunes). Following earlier work [2], we report step-averaged log-likelihood for these two MIDI datasets.

  • The second one we consider is the widely used handwriting trajectory dataset, IAM-OnDB. Each step of the trajectory is represented by a 3-dimensional vector, where the first dimension is binary, indicating whether the pen is touching the paper or not, and the second and third dimensions are the coordinates of the pen given that it is on the paper. Different from the other datasets, the dimensionality of each step in IAM-OnDB is significantly lower. Hence, it is reasonable to believe that the intra-step structure is relatively simpler here. Following earlier work [4], we report sequence-averaged log-likelihood for the IAM-OnDB dataset.

  • The last type is a synthetic dataset we derive from TIMIT. Specifically, we maintain the multi-frame structure of the speech sequence, but permute the frames in each step with a pre-determined random order. Intuitively, this can be viewed as an extreme test of a model’s capability of discovering the underlying correlation between frames. Ideally, an optimal model should be able to discover the correct sequential order and recover the same performance as on the original TIMIT. For convenience, we will call this dataset Perm-TIMIT.
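The construction of the permuted dataset can be sketched as follows. The function name `permute_frames` and the seed are hypothetical, and the toy data stand in for TIMIT steps; the essential point is that one pre-determined order is drawn once and reused for every step.

```python
import numpy as np

def permute_frames(steps, seed=0):
    """Apply one fixed random permutation to the frames of every step."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(steps.shape[-1])    # drawn once, reused for all steps
    return steps[..., order], order

steps = np.arange(3 * 200, dtype=float).reshape(3, 200)   # 3 toy steps of 200 frames
permuted, order = permute_frames(steps)
restored = permuted[..., np.argsort(order)]                # undo the permutation
```

Because the permutation is invertible and shared across steps, no information is destroyed; only the convenient temporal ordering inside each step is hidden from the model.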

The detailed statistics of all datasets we will use are summarized in Supplementary B.

5.2 Modeling Simultaneity with Auto-Regressive Decomposition

With proper datasets, we now consider how to construct a family of non-factorized distributions that (1) can be easily integrated into RNN and SRNN as the output distribution, and (2) are reasonably expressive for modeling multivariate correlations. Among many possible choices, the most straightforward choice would be the auto-regressive parameterization. Compared to other options such as the normalizing flow or Markov Random Field (e.g. RBM), the auto-regressive structure is conceptually simpler and can be applied to both discrete and continuous data with full tractability. Moreover, various dedicated neural architectures have been developed to support the auto-regressive form. In light of these benefits, we choose to follow this simple idea, and decompose the output distribution of the RNN and SRNN, respectively, as

$$p_\theta(x_t \mid \mathbf{x}_{<t}) = \prod_{i=1}^{L} p_\theta(x_{t,i} \mid \mathbf{x}_{<t}, x_{t,<i}), \tag{13}$$

$$p_\theta(x_t \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}) = \prod_{i=1}^{L} p_\theta(x_{t,i} \mid \mathbf{z}_{\le t}, \mathbf{x}_{<t}, x_{t,<i}). \tag{14}$$

Notice that although we use the natural decomposition order from the smallest index to the largest one, this particular order is generally not optimal for modeling multivariate distributions. A better choice could be adopting the orderless training previously explored in the literature [33]. But for simplicity, we will stick to this simple approach.

Given the auto-regressive decomposition, a natural neural instantiation would be a recurrent hierarchical model that utilizes a two-level architecture to process the sequence:

  • Firstly, a high-level RNN or SRNN is employed to encode the multivariate steps $x_1, \dots, x_T$ into a sequence of high-level hidden vectors $h_1, \dots, h_T$, following exactly the same computational procedure as in F-RNN and F-SRNN (see Eqn. 2 and 3). Recall that, in the case of SRNN, the computation of high-level vectors involves sampling the latent variables.

  • Based on the high-level representations, for each multivariate step $x_t$, another neural model will take both the elements $x_{t,1}, \dots, x_{t,L}$ and the high-level vector $h_t$ as input, and auto-regressively produce a sequence of low-level hidden vectors:

    Now, it is easy to verify that the low-level hidden vectors computationally satisfy the auto-regressive (causal) constraint and only depend on valid conditional factors for constructing the corresponding output distributions, i.e., $(\mathbf{x}_{<t}, x_{t,<i})$ for RNN and $(\mathbf{z}_{\le t}, \mathbf{x}_{<t}, x_{t,<i})$ for SRNN. Hence, the low-level hidden vectors can then be used to form the per-element output distributions in Eqn. (13) and (14).

In practice, the low-level model could simply be an RNN or a causally masked MLP [11], depending on our prior about the data. For instance, RNN is clearly not a suitable choice for the Perm-TIMIT dataset, since the elements after permutation do not possess any recurrent pattern. Therefore, for our evaluation, RNN is employed as the low-level neural architecture on all datasets except for the Perm-TIMIT, where we employ a causally masked MLP without parameter sharing, i.e.,

For convenience, we will refer to the hierarchical models as RNN-hier and SRNN-hier.
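The causal constraint of a masked low-level model can be sketched with a single masked layer. This is a simplified illustration, not the architecture used in the experiments: the sizes, the random weights `W_x` and `W_h`, and the use of one layer with a scalar feature per element are all assumptions. The strictly lower-triangular mask guarantees that the feature for element $i$ only reads elements with smaller indices, and each position has its own weights (no parameter sharing).

```python
import numpy as np

rng = np.random.default_rng(0)
L, H = 6, 4                          # step dimension and high-level vector size (toy)

# Strictly lower-triangular mask: output i may read inputs j < i only.
mask = np.tril(np.ones((L, L)), k=-1)
W_x = rng.normal(size=(L, L)) * mask     # separate weights per position, no sharing
W_h = rng.normal(size=(L, H))            # every position reads the high-level vector

def low_level_vectors(x_t, h_t):
    """One masked layer producing a scalar feature for each element of x_t."""
    return np.tanh(W_x @ x_t + W_h @ h_t)

x_t = rng.normal(size=L)
h_t = rng.normal(size=H)
out = low_level_vectors(x_t, h_t)

# Changing element 3 must leave the features of elements 0..3 untouched.
x_changed = x_t.copy()
x_changed[3] += 1.0
out_changed = low_level_vectors(x_changed, h_t)
```

Only the features of later elements react to the change, so each feature can safely parameterize the conditional distribution of its own element.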

Models           TIMIT    VCTK    Blizzard  Muse    Nottingham  IAM-OnDB  Perm-TIMIT
VRNN             28,982   -       9,392     -       -           1384      -
SRNN             60,550   -       11,991    -6.28   -2.94       -         -
Z-Forcing        68,903   -       14,435    -       -           -         -
Z-Forcing + aux  70,469   -       15,430    -       -           -         -
F-RNN            32,745   0.786   7,610     -6.991  -3.400      1397      25,679
F-SRNN           69,223   2.383   15,258    -6.438  -2.811      1402      67,613
δ-RNN-random     66,453   2.199   14,585    -6.252  -2.834      N/A       61,103
RNN-flat         117,721  3.2173  22,714    -5.251  -2.180      N/A       15,763
SRNN-flat        109,284  3.2062  22,290    -5.616  -2.324      N/A       14,278
RNN-hier         109,641  3.1822  21,950    -5.161  -2.028      1440      95,161
SRNN-hier        107,912  3.1423  21,845    -5.483  -2.065      1395      94,402

Table 4: Performance comparison on a diverse set of datasets. The results of VRNN, SRNN, Z-Forcing and Z-Forcing + aux are copied directly from previous publications. The best results on the three speech datasets (TIMIT, VCTK and Blizzard) are new state-of-the-art performances. N/A indicates that the model is not applicable to the dataset.

In some cases where all the elements within a step share the same statistical type, such as on the speech or MIDI dataset, one may alternatively consider a flat model. As the name suggests, the flat model will break the boundary between steps and flatten the data into a new uni-variate sequence, where each step is simply a single element. Then, the new uni-variate sequence can be directly fed into a standard RNN or SRNN model, producing each conditional factor in Eqn. (13) and (14) in an auto-regressive manner. Similarly, this class of RNN and SRNN will be referred to as RNN-flat and SRNN-flat, respectively.
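The flattening step can be sketched with a reshape on toy data (the array contents are placeholders): the elements are read step by step in row-major order, so the last element of one step is immediately followed by the first element of the next.

```python
import numpy as np

steps = np.arange(12).reshape(3, 4)      # 3 toy steps, 4 elements each
flat = steps.reshape(-1)                 # x_{1,1}, ..., x_{1,4}, x_{2,1}, ...
```

The flat model therefore treats a step boundary exactly like any other transition between neighboring elements.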

Compared to the hierarchical model, the flat variant implicitly assumes a sequential continuity between $x_{t,L}$ and $x_{t+1,1}$, since their computational dependency is the same as that between any two consecutive elements within the same step. Since this inductive bias matches the characteristics of multi-frame speech sequences, we expect the flat model to perform better in this case.

5.3 Experiment Results

Based on the seven datasets listed in Table 7, we compare the performance of factorized models, including F-RNN and F-SRNN, and the non-factorized models introduced above. To provide a random baseline, we include the -RNN with the random split scheme in the comparison. Moreover, previous results, where available, are also presented to provide additional information. For a fair comparison, we make sure all models share the same parameter size. For more implementation details, please refer to Appendix C as well as the source code. Finally, the results are summarized in Table 4, where we make several important observations.

Firstly, on the speech and MIDI datasets, models with auto-regressive (lower-half) output distributions obtain a dramatic advantage over models with fully factorized output distributions (upper-half), achieving new SOTA results on three speech datasets. This observation reminds us that, besides capturing the long-term temporal structure across steps, how to properly model the intra-step dependency is equally, if not more, crucial to the practical performance.

Secondly, when the auto-regressive output distribution is employed (lower-half), the non-stochastic recurrent models consistently outperform their stochastic counterparts across all datasets. In other words, the advantage of SRNN completely disappears once a powerful output distribution is used. Combined with the previous observation, it verifies our earlier concern that the so-called superiority of F-SRNN over F-RNN is merely a result of the biased experiment design in previous work.

In addition, as we expected, when the inductive bias of the flat model matches the characteristics of speech data, it achieves better performance than the hierarchical model. Conversely, when the prior does not match the data property on the other datasets, the hierarchical model is always better. In the extreme case of permuted TIMIT, the flat model even falls behind the factorized models, while the hierarchical model achieves a very decent performance that is even much better than what F-SRNN can achieve on the original TIMIT. This shows that the hierarchical model is usually more robust, especially when we do not have a good prior.

Overall, we do not find any advantage in employing stochastic latent variables for multivariate sequence modeling. Instead, relying on a fully auto-regressive solution yields better or even state-of-the-art performance. Combined with the observation that -RNN-random can often achieve performance competitive with F-SRNN, we believe that the theoretical advantage of latent-variable models in sequence modeling is still far from fulfilled, if it is achievable at all. In addition, we suggest that future work along this line compare against the simple but extremely robust baselines with an auto-regressive output distribution.

6 Conclusion and Discussion

In summary, our re-examination reveals a misleading impression of the benefits of latent variables in sequence modeling. From our empirical observations, the main effect of latent variables is only to provide a mechanism to leverage the intra-step correlation, which is, however, not as powerful as employing the straightforward auto-regressive decomposition. It remains unclear what leads to the significant gap between the theoretical potential of latent variables and their practical effectiveness, which we believe deserves more research attention. Meanwhile, given the large gain from modeling simultaneity, using sequential structures to better capture local patterns is another promising future direction in sequence modeling.


This work is supported in part by the National Science Foundation (NSF) under grant IIS-1546329 and by DOE-Office of Science under grant ASCR #KJ040201.


Appendix A Different Variants of Stochastic Recurrent Neural Networks

As stated in Section 3, we employ a simplified version of the stochastic recurrent neural network in our study to evaluate the effectiveness of latent variables in sequence modeling. This section details the connections and differences between our parameterization and the stochastic recurrent network models proposed in previous publications [1, 4, 8, 12].

For stochastic recurrent neural networks, the generic decomposition of generative distribution shared by previous methods has the form:

where each new step depends on the entire history of the observation and the latent variables . Similarly, for the approximate posterior distribution, all previous approaches can be unified under the form
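
Under the assumption that $\mathbf{x}_{1:T}$ denotes the observation sequence and $\mathbf{z}_{1:T}$ the latent variables (notation chosen here for illustration), the two generic forms described in this paragraph can be sketched as follows. This is a reading of the dependencies stated in the text: the generative model lets each step depend on the full history of observations and latent variables, and the approximate posterior lets each latent variable depend on the previous latent variables and on the whole observation sequence, including the future.

```latex
\begin{align}
p(\mathbf{x}_{1:T}, \mathbf{z}_{1:T})
  &= \prod_{t=1}^{T}
     p(\mathbf{x}_t \mid \mathbf{x}_{<t}, \mathbf{z}_{\le t}) \,
     p(\mathbf{z}_t \mid \mathbf{x}_{<t}, \mathbf{z}_{<t}), \\
q(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T})
  &= \prod_{t=1}^{T}
     q(\mathbf{z}_t \mid \mathbf{z}_{<t}, \mathbf{x}_{1:T}).
\end{align}
```

Each parameterization discussed in this appendix can then be read as dropping or restricting some of the conditioning terms in these factors.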

Given the generic forms, various parameterizations with different independence assumptions have been introduced:

  • STORN [1]: This parameterization makes two simplifications. Firstly, the prior distribution is assumed to be context independent, i.e.,

    Secondly, the posterior distribution is simplified as

    which drops the dependence on both the future information and the previous latent variables.

    Despite the simplification in the prior, STORN imposes no independence assumption on the output distribution . Specifically, an RNN is used to capture the two conditional factors :

    Notice that the RNN is capable of modeling the correlation among the latent variables and encoding this information into its hidden state.

  • VRNN [4]: This parameterization eliminates some independence assumptions in STORN. Firstly, the prior distribution becomes fully context dependent via a context RNN:

    Notice that the prior depends on all previous latent variables. Hence, no independence assumption is involved in the prior distribution. However, the computation of the context RNN cannot be parallelized, because it takes the latent variable as an input.

    Secondly, compared to STORN, the posterior in VRNN additionally depends on the previous latent variables:

    where the forward vector is the same one used to construct the prior distribution above. However, the posterior still does not depend on the future observations.

    Finally, the output distribution is simply constructed as

  • SRNN [8]: Compared to VRNN, SRNN (1) introduces a Markov assumption into the latent-to-latent dependence and (2) makes the posterior condition on the future observations.

    Specifically, SRNN employs two RNNs, one forward and the other backward, to consume the observation sequence from the two different directions:

    From the parametric form, notice that is always conditioned on the entire observation , while only has access to .

    Then, the prior and posterior are respectively formed by

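    where the dependence on only the previous latent variable reflects the aforementioned Markov assumption. In other words, given the sampled value of the previous latent variable, the current latent variable is independent of all earlier latent variables.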

    Finally, the output distribution of SRNN also involves the same simplification:

  • Z-Forcing SRNN [12]: By feeding the latent variable as an additional input into the forward RNN, an approach similar to the VRNN, this parameterization successfully removes the Markov assumption in SRNN.

    Specifically, the computation goes as follows:

    where the is sampled from either the prior or posterior:

    Notice that, since relies on in a deterministic manner, there is no Markov assumption anymore when is used to construct the prior and posterior.

    The same property also extends to the output distribution, which has the same parametric form as SRNN although the contains different information:

As explained above, as long as the construction of the prior or the posterior conditions on the values of previous latent variables as input, the computation is completely sequential and cannot be parallelized. When the number of steps reaches the thousands, as is the case for the SRNN-flat model introduced in Section 5.2, training becomes prohibitively slow.

To address this problem, we remove the dependency between the current latent variable and the previous latent variables, leading to the prior and posterior introduced in Section 3:

where the forward and backward vectors are both computed separately in a single pass:

However, this simplification entirely throws away the dependency among latent variables, which could be an oversimplification. To compensate, we employ an additional RNN to process the latent variables (Eqn. 3):

where the additional RNN can potentially capture the correlation among the latent variables. This is similar to the solution in STORN.

An advantage of this simplified solution is that we can directly utilize the cuDNN-accelerated RNN module instead of relying on a customized for-loop as used in VRNN, SRNN and Z-Forcing SRNN. However, the speed advantage is meaningless if the performance degrades significantly. To ensure this does not happen, we compare the simplified SRNN and our implementation of Z-Forcing SRNN on the TIMIT dataset in Table 5. Specifically, we consider two cases, where the output distribution either has a factorized form (F-SRNN) or uses a flat auto-regressive parameterization (SRNN-flat).

As we can see, given a similar number of parameters, the simplified version can match or even surpass the performance of Z-Forcing SRNN without the auxiliary cost. In terms of speed, the simplified version is consistently faster than Z-Forcing SRNN, especially in the flat auto-regressive case (about 3x faster: 20.62h versus 62.86h).

Model #Param LL Time
Z-F-SRNN 40 200 17.44M 69,296 1.38h
Sim-F-SRNN 40 200 17.45M 69,223 0.96h
Z-SRNN-flat 1,000 1 16.98M 97,194 62.86h
Sim-SRNN-flat 1,000 1 16.78M 99,102 20.62h
Table 5: Comparison between our simplified SRNN and Z-Forcing SRNN. The prefixes “Z-” and “Sim-” indicate Z-Forcing and Simplified respectively.

Additionally, in Table 6, we compare the simplified SRNN with the published performance of the stochastic recurrent networks on the speech, handwriting trajectory, and music datasets. The results demonstrate that the simplified SRNN is able to reproduce SOTA-level performance.

Model TIMIT Blizzard Muse Nottingham IAM-OnDB
VRNN 28,982 9,392 - - 1384
SRNN 60,550 11,991 -6.28 -2.94 -
Z-Forcing 68,903 14,435 - - -
Z-Forcing + aux 70,469 15,430 - - -
Sim-SRNN 69,223 15,258 -6.44 -2.81 1402
Table 6: Comparison between simplified SRNN and reported results in previous publications.

Due to the comparable performance and improved speed of the simplified SRNN, we choose it as the default parameterization of the stochastic recurrent network in this work.

Appendix B Data Statistics

Datasets Number of Steps Frames / Step
TIMIT 1.54M 200
VCTK 12.6M 200
Blizzard 90.5M 200
Perm-TIMIT 1.54M 200
Muse 36.1M 88
Nottingham 23.5M 88
IAM-OnDB 7.63M 3
Table 7: Statistics of the datasets in consideration.

The dataset statistics are summarized in Table 7. “Frames / Step” indicates the dimension of the vector at each time step. “Number of Steps” is the total length of the multivariate sequence.

Appendix C Experiment Details

Domains Speech MIDI Handwriting
F-RNN 17.41M 0.57M 0.93M
F-SRNN 17.53M 2.28M 1.17M
-RNN-random 18.57M 0.71M N/A
RNN-flat 16.86M 1.58M N/A
SRNN-flat 16.93M 2.24M N/A
RNN-hier 17.28M 1.87M 0.97M
SRNN-hier 17.25M 3.05M 1.02M
Table 8: The parameter numbers of all implemented methods.

In the following, we will provide more details about our implementation. Firstly, Table 8 reports the parameter size of all models compared in Table 4. For data domains with enough data (i.e., speech and handwriting), we ensure the parameter size is about the same. On the smaller MIDI dataset, we only make sure the RNN variants do not use more parameters than SRNNs do.

For all methods, we use the Adam algorithm [18] as the optimizer with learning rate 0.001. The cosine schedule [24] is used to anneal the learning rate from 0.001 to 0.000001 during the training process. The batch size is set to 32 for TIMIT, 128 for VCTK and Blizzard, 16 for Muse and Nottingham, and 32 for IAM-OnDB. The total number of training steps is 20k for Muse, Nottingham, and IAM-OnDB, 40k for TIMIT, 80k for VCTK, and 160k for Blizzard. For all SRNN variants, we follow previous work and employ the KL annealing strategy, where the coefficient on the KL term is increased from 0.2 to 1.0 by an increment of 0.00005 after each parameter update [12].
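
The two schedules described above can be sketched as plain functions of the update count. This is a minimal illustration using the constants quoted in this paragraph. The function names, the half-cosine formula, and the assumption that the learning rate is annealed once over the whole run without restarts are ours, so the accompanying source code remains the reference for the exact behavior.

```python
import math


def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps.

    Assumes a single annealing cycle with no warm restarts.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def kl_weight(num_updates, start=0.2, end=1.0, increment=5e-5):
    """Coefficient on the KL term after num_updates parameter updates."""
    return min(end, start + increment * num_updates)
```

With these constants, the KL coefficient reaches 1.0 after (1.0 - 0.2) / 0.00005 = 16,000 updates, which is before the end of training for every dataset listed above.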

For architectural details such as the number of layers and the hidden dimensions used in this study, we refer readers to the accompanying source code.