Variational Bi-LSTMs

by Samira Shabanian, et al.

Recurrent neural networks like long short-term memory (LSTM) are important architectures for sequential prediction tasks. LSTMs (and RNNs in general) model sequences along the forward time direction. Bidirectional LSTMs (Bi-LSTMs), on the other hand, model sequences along both the forward and backward directions and are generally known to perform better at such tasks because they capture a richer representation of the data. In the training of Bi-LSTMs, the forward and backward paths are learned independently. We propose a variant of the Bi-LSTM architecture, which we call Variational Bi-LSTM, that creates a channel between the two paths during training (the channel may be omitted during inference), thus optimizing the two paths jointly. We arrive at this joint objective by maximizing a variational lower bound of the joint likelihood of the data sequence. Our model acts as a regularizer and encourages the two networks to inform each other in making their respective predictions using distinct information. We perform ablation studies to better understand the different components of our model and evaluate the method on various benchmarks, showing state-of-the-art performance.



1 Introduction

Recurrent neural networks (RNNs) have become the standard models for sequential prediction tasks with state of the art performance in a number of applications like sequence prediction, language translation, machine comprehension, and speech synthesis (Arik et al., 2017; Wang et al., 2017; Mehri et al., 2016; Sotelo et al., 2017). RNNs model temporal data by encoding a given arbitrary-length input sequentially, at each time step combining a transformation of the current input with the encoding from the previous time step. This encoding, referred to as the RNN hidden state, summarizes all previous input tokens.

Viewed as “unrolled” feedforward networks, RNNs can become arbitrarily deep depending on the input sequence length, and use a repeating module to combine the input with the previous state at each time step. Consequently, they suffer from the vanishing/exploding gradient problem (Pascanu et al., 2012). This problem has been addressed through architectural variants like the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the gated recurrent unit (GRU) (Chung et al., 2014). These architectures add a linear path along the temporal sequence which allows gradients to flow more smoothly back through time.

Various regularization techniques have also been explored to improve RNN performance and generalization. Dropout (Srivastava et al., 2014) regularizes a network by randomly dropping hidden units during training. However, it has been observed that using dropout directly on RNNs is not as effective as in the case of feed-forward networks. To combat this, Zaremba et al. (2014) propose to instead apply dropout on the activations that are not involved in the recurrent connections (e.g., between the layers of a multi-layer RNN); Gal & Ghahramani (2016) propose to apply the same dropout mask throughout an input sequence during training. In a similar spirit to dropout, Zoneout (Krueger et al., 2016) proposes to choose randomly whether to use the previous RNN hidden state.
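To make the last two ideas concrete, the following is a minimal sketch in plain Python rather than the cited authors' implementations. The function names, the `keep_prob` and `zoneout_prob` parameters, and the per-unit granularity of the zoneout choice are assumptions made for illustration.

```python
import random

def bernoulli_mask(size, keep_prob, rng):
    """Draw a 0/1 mask; each unit is kept with probability keep_prob."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(size)]

def shared_mask_dropout(states, keep_prob, rng):
    """Apply ONE dropout mask to every time step of a sequence.

    states is a list of hidden vectors, one per time step; keep_prob must
    lie in (0, 1]. Kept units are rescaled by 1 / keep_prob.
    """
    mask = bernoulli_mask(len(states[0]), keep_prob, rng)
    return [[m * v / keep_prob for m, v in zip(mask, h)] for h in states]

def zoneout_step(h_prev, h_new, zoneout_prob, rng):
    """For each unit, keep the previous value with probability zoneout_prob,
    otherwise take the newly computed value."""
    return [p if rng.random() < zoneout_prob else n
            for p, n in zip(h_prev, h_new)]
```

Because the mask in `shared_mask_dropout` is drawn once per sequence, the same units are dropped at every time step, which is the property that distinguishes it from drawing a fresh mask per step.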

The aforementioned architectures model sequences along the forward direction of the input sequence. Bidirectional LSTM (Bi-LSTM), on the other hand, is a variant of LSTM that simultaneously models each sequence in both the forward and backward directions. This enables a richer representation of data, since each token’s encoding contains context information from the past and the future. It has been shown empirically that bidirectional architectures generally outperform unidirectional ones on many sequence-prediction tasks. However, the forward and backward paths in Bi-LSTMs are trained separately and the benefit usually comes from the combined hidden representation from both paths. In this paper, our main idea is to frame a joint objective for Bi-LSTMs by maximizing a variational lower bound of the joint likelihood of the training data sequence. This in effect implies using a variational auto-encoder (VAE; Kingma & Welling, 2014) that takes as input the hidden states from the two paths of the Bi-LSTM and maps them to a shared hidden representation of the VAE at each time step. The samples from the VAE’s hidden state are then used for reconstructing the hidden states of both the LSTMs. While the use of a shared hidden state acts as a regularizer during training, the dependence on the backward path can be ignored during inference by sampling from the VAE prior. Thus our model is applicable in domains where future information is not available during inference (e.g., language generation). We refer to our model as Variational Bi-LSTM. We note that recently proposed methods like TwinNet (Serdyuk et al., 2017) and Z-forcing (Sordoni et al., 2017) are similar in spirit to this idea. We discuss the differences between our approach and these models in Section 5.

Below, we describe Variational Bi-LSTMs in detail and then demonstrate empirically their ability to model complex sequential distributions. In experiments, we obtain state-of-the-art or competitive performance on the tasks of Penn Treebank, IMDB, TIMIT, Blizzard, and Sequential MNIST.

2 Variational Bi-LSTM

Bi-LSTM is a powerful architecture for sequential tasks because it models temporal data in both the forward and backward directions. For this, it uses two LSTMs that are generally learned independently of each other; the richer representation in Bi-LSTMs results from combining the hidden states of the two LSTMs, often by concatenation. The idea behind variational Bi-LSTMs is to create a channel of information exchange between the two LSTMs that helps the model learn better representations. We create this dependence by using the variational auto-encoder (VAE) framework. This enables us to take advantage of the fact that a VAE allows for sampling from the prior during inference. For sequence prediction tasks like language generation, while one can use Bi-LSTMs during training, there is no straightforward way to employ the full bidirectional model during inference because it would involve, e.g., generating a sentence starting at both its beginning and end. In such cases, the VAE framework allows us to sample from the prior at inference time to make up for the absence of the backward LSTM.

Now we describe our variational Bi-LSTM model formally. Let $X = \{x^{(n)}\}_{n=1}^{N}$ be a dataset consisting of $N$ i.i.d. sequential data samples of continuous or discrete variables. For notational convenience, we will henceforth drop the superscript $(n)$ indexing samples. For each sample sequence $x = (x_1, \ldots, x_T)$, the hidden state of the forward LSTM is given by:

$$h_t = \overrightarrow{f}(x_t, h_{t-1}, z_t, \tilde{b}_t) \qquad (1)$$

The hidden state of the backward LSTM is given by,

$$b_t = \overleftarrow{f}(x_t, b_{t+1}) \qquad (2)$$

In the forward LSTM, $\overrightarrow{f}$ represents the standard LSTM function modified to account for the additional arguments used in our model, using separate additional dense matrices for $z_t$ and $\tilde{b}_t$. The function $\overleftarrow{f}$ in the backward LSTM is defined as in a standard Bi-LSTM.
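The order of computation implied by the two recurrences can be sketched as follows. This is an illustration only: `cell` is a toy scalar stand-in for an LSTM update, not the modified LSTM function of the model, and `zs` and `b_tildes` stand for the two additional per-step inputs to the forward path (the latent variable and the reconstructed backward state). All names are hypothetical.

```python
import math

def cell(*inputs):
    """Toy stand-in for an LSTM update: squashes the sum of scalar inputs."""
    return math.tanh(sum(inputs))

def backward_states(xs):
    """Backward recurrence: the state at step t depends on x_t and the
    state at step t + 1, so we iterate from the last step to the first."""
    b_next, states = 0.0, []
    for x in reversed(xs):
        b_next = cell(x, b_next)
        states.append(b_next)
    return states[::-1]  # re-order so that states[t] belongs to step t

def forward_states(xs, zs, b_tildes):
    """Forward recurrence with two extra per-step inputs in addition to
    x_t and the previous forward state."""
    h_prev, states = 0.0, []
    for x, z, b_tilde in zip(xs, zs, b_tildes):
        h_prev = cell(x, h_prev, z, b_tilde)
        states.append(h_prev)
    return states
```

During training the backward pass has to be run first, since the extra inputs to the forward path are derived from the backward states.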

In the forward LSTM model, we introduce additional latent random variables, $z_t$ and $\tilde{b}_t$, where $z_t$ depends on $h_{t-1}$ and $b_t$ during training, and $\tilde{b}_t$ depends on $z_t$ (see Figure 1-left for a graphical representation). We also introduce the random variable $\tilde{h}_{t-1}$, which depends on $z_t$ and is used in an auxiliary cost which we will discuss later. Note that so far, $\tilde{b}_t$ and $\tilde{h}_{t-1}$ are simply latent vectors drawn from conditional distributions that depend on $z_t$, to be defined below. However, as explained in Section 2.1 (see also the dashed lines in Figure 1-left), we will encourage these to lie near the manifolds of backward and forward LSTM states respectively by adding auxiliary costs to our objective.

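By design, the joint conditional distribution $p(z_t, \tilde{b}_t \mid h_{t-1})$ over the latent variables $z_t$ and $\tilde{b}_t$, with each factor having its own parameters, factorizes as $p(z_t \mid h_{t-1})\, p(\tilde{b}_t \mid z_t)$. This factorization enables us to formulate several helpful auxiliary costs, as defined in the next subsection. Further, the conditional distribution of the next observation defines the generating model, which induces the distribution over the next observation given the previous states and the current input.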

Then the marginal likelihood of each individual sequential data sample $x$ can be written as


where $\theta$ is the set of all parameters of the model. Here, we assume that all conditional distributions belong to parametrized families of distributions which can be evaluated and sampled from efficiently.

Note that the joint distribution in equation (3) is intractable. Kingma & Welling (2014) demonstrated how to maximize a variational lower bound of the likelihood function. Here we derive a similar lower bound of the data log-likelihood, which is given by


where $q$ is the conditional inference model and $D_{\mathrm{KL}}$ is the Kullback-Leibler (KL) divergence between the approximate posterior and the conditional prior (see the appendix). Notice that the above function is a general lower bound that is not explicitly defined in terms of $h_{t-1}$ and $b_t$; rather, all the terms are conditional upon the previous predictions. How the model is defined in terms of $h_{t-1}$ and $b_t$ is a design choice which we will make more explicit in the next section.

(a) Training phase of variational Bi-LSTM
(b) Inference phase of variational Bi-LSTM
Figure 1: Graphical description of our proposed variational Bi-LSTM model during the training phase (left) and the inference phase (right). During training, each step is composed of an encoder which receives both the past and future summaries via $h_{t-1}$ and $b_t$ respectively, and a decoder that generates $\tilde{h}_{t-1}$ and $\tilde{b}_t$, which are forced to be close to $h_{t-1}$ and $b_t$ using two auxiliary reconstruction costs (dashed lines). This dependence between the backward and forward LSTMs through the latent random variable $z_t$ encourages the forward LSTM to learn a richer representation. During inference, the backward LSTM is removed. In this case, $z_t$ is sampled from the prior as in a typical VAE, which in our case is defined as a function of $h_{t-1}$.

2.1 Training and Inference

In the proposed variational Bi-LSTM, the latent variable $z_t$ is inferred as

$$z_t \sim q(z_t \mid h_{t-1}, b_t) = \mathcal{N}\big(\mu_{q,t}, \mathrm{diag}(\sigma_{q,t}^2)\big)$$

in which $[\mu_{q,t}, \sigma_{q,t}^2] = f_q(h_{t-1}, b_t)$, where $f_q$ is a multi-layered feed-forward network with Gaussian outputs. We assume that the prior over $z_t$ is a diagonal multivariate Gaussian distribution given by

$$p(z_t \mid h_{t-1}) = \mathcal{N}\big(\mu_{p,t}, \mathrm{diag}(\sigma_{p,t}^2)\big), \qquad [\mu_{p,t}, \sigma_{p,t}^2] = f_p(h_{t-1})$$

for a fully connected network $f_p$. This is important because, during generation (see Figure 1-right for a graphical representation), we will not have access to the backward LSTM. In this case, as in a VAE, we will sample $z_t$ from the prior during inference, which only depends on the forward path. Since we define the prior to be a function of $h_{t-1}$, the forward LSTM is encouraged during training to learn the dependence due to the backward hidden state $b_t$.
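The two operations this paragraph relies on, drawing a latent sample from a diagonal Gaussian and measuring how far the approximate posterior is from the prior, can be sketched as below. This is a generic illustration under stated assumptions, not the authors' code: vectors are plain Python lists, and the standard deviations are passed directly, whereas in the model the means and variances would be produced by the networks described above.

```python
import math
import random

def sample_diag_gaussian(mu, sigma, rng):
    """Reparameterized draw: z = mu + sigma * eps with eps ~ N(0, 1),
    so the sample is a deterministic function of (mu, sigma) given eps."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between two diagonal Gaussians,
    summed over dimensions. All sigmas must be positive."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total
```

At training time the sample would be drawn from the posterior parameters and the KL term added to the loss; at inference time the same sampling routine would be called with the prior parameters, which depend only on the forward path.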

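The latent variable $\tilde{b}_t$ is meant to model information coming from the future of the sequence. Its conditional distribution is given by

$$p(\tilde{b}_t \mid z_t) = \mathcal{N}\big(\mu_{b,t}, \mathrm{diag}(\sigma_{b,t}^2)\big)$$

where $[\mu_{b,t}, \sigma_{b,t}^2] = f_b(z_t)$ for a fully connected neural network $f_b$ (see Figure 1(a)). To encourage the encoding of future information in $z_t$, we maximize the probability of the true backward hidden state, $b_t$, under the distribution $p(\tilde{b}_t \mid z_t)$, as an auxiliary cost during training. In this way we treat $\tilde{b}_t$ as a predictor of $b_t$, similarly to what was done by Sordoni et al. (2017).

To capture information from the past in the latents, we similarly use $\tilde{h}_{t-1}$ as a predictor of $h_{t-1}$. This is accomplished by maximizing the probability of the latter under the conditional distribution of the former, $p(\tilde{h}_{t-1} \mid z_t)$, as another auxiliary cost, where

$$p(\tilde{h}_{t-1} \mid z_t) = \mathcal{N}\big(\mu_{h,t}, \mathrm{diag}(\sigma_{h,t}^2)\big)$$

Here, $[\mu_{h,t}, \sigma_{h,t}^2]$ is the output of a fully-connected neural network $f_h$ taking $z_t$ as input. The auxiliary costs arising from the distributions $p(\tilde{b}_t \mid z_t)$ and $p(\tilde{h}_{t-1} \mid z_t)$ teach the variational Bi-LSTM to encode past and future information into the latent space of $z_t$.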

We assume that the generating distribution is parameterized via a recurrent fully connected network, taking the form of either a Gaussian distribution output in the continuous case or categorical proportions output in the discrete (i.e., one-hot) prediction case.

Finally, we define the Variational Bi-LSTM objective we use in this paper by instantiating the conditioning variables used in Eq. 5 with the functions $h_{t-1}$ and $b_t$ as,


where $\alpha$ and $\beta$ are non-negative real numbers denoting the coefficients of the auxiliary costs arising from $p(\tilde{b}_t \mid z_t)$ and $p(\tilde{h}_{t-1} \mid z_t)$ respectively. These auxiliary costs ensure that $\tilde{b}_t$ and $\tilde{h}_{t-1}$ remain close to $b_t$ and $h_{t-1}$. All the model parameters are updated by backpropagation through time (Rumelhart et al., 1988) using the reparameterization trick (Kingma & Welling, 2014).
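As a rough illustration of how such an objective is assembled, the sketch below combines a per-step data log-likelihood, a KL penalty, and two weighted auxiliary log-likelihoods. The exact form of the paper's objective is not reproduced here; the additive structure, the function names, and the use of a diagonal Gaussian for the auxiliary terms are assumptions made for this example.

```python
import math

def diag_gaussian_log_prob(x, mu, sigma):
    """Log-density of x under a diagonal Gaussian with mean mu and
    standard deviation sigma (all given as lists of floats)."""
    return sum(-0.5 * math.log(2.0 * math.pi) - math.log(s)
               - 0.5 * ((v - m) / s) ** 2
               for v, m, s in zip(x, mu, sigma))

def step_objective(log_px, kl, log_p_backward, log_p_forward, aux_b, aux_h):
    """Assumed per-step quantity to maximize:
    data log-likelihood - KL + weighted auxiliary log-likelihoods.

    aux_b and aux_h are the non-negative coefficients of the two
    auxiliary costs; setting both to 0 recovers a plain variational bound.
    """
    return log_px - kl + aux_b * log_p_backward + aux_h * log_p_forward
```

Here `log_p_backward` would be `diag_gaussian_log_prob` evaluated at the true backward state under the predicted distribution, and `log_p_forward` the analogous term for the previous forward state. Summing `step_objective` over time steps and negating gives a loss to minimize.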

As a side note, we improve training convergence with a trick which we refer to as stochastic backprop, meant to ease learning of the latent variables. It is well known that autoregressive decoder models tend to ignore their stochastic variables (Bowman et al., 2015). Stochastic backprop is a technique to encourage relevant summaries of the past and the future to be encoded in the latent space. The idea is to stochastically stop the gradients of the auxiliary costs with respect to the recurrent units from backpropagating through time. To achieve this, at each time step, a mask is drawn from a Bernoulli distribution; it governs whether to skip the gradient or to backpropagate it for a given data point.
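The masking step can be sketched on precomputed gradients as follows. In an automatic-differentiation framework this would typically be realized with a stop-gradient style operation rather than by editing gradients directly; the function below, its name, and the `keep_prob` parameter are assumptions for illustration, since the source does not state the Bernoulli parameter.

```python
import random

def stochastic_backprop(main_grads, aux_grads, keep_prob, rng):
    """Combine per-time-step gradients reaching the recurrent state.

    main_grads[t] is the gradient from the main objective and aux_grads[t]
    the gradient from the auxiliary costs at step t. For each step a
    Bernoulli(keep_prob) mask decides whether the auxiliary gradient is
    propagated (mask = 1) or skipped (mask = 0).
    """
    combined = []
    for g_main, g_aux in zip(main_grads, aux_grads):
        keep = 1.0 if rng.random() < keep_prob else 0.0
        combined.append(g_main + keep * g_aux)
    return combined
```

With `keep_prob = 1.0` this reduces to ordinary backpropagation through the auxiliary costs, and with `keep_prob = 0.0` the auxiliary costs only train the networks that compute them, never the recurrent path.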

3 Experimental results

In this section we demonstrate the effectiveness of our proposed model on several tasks. We present experimental results obtained when training Variational Bi-LSTM on various sequential datasets: Penn Treebank (PTB), IMDB, TIMIT, Blizzard, and Sequential MNIST. Our main goal is to ensure that the model proposed in Section 2 can benefit from a generated relevant summary of the future that yields competitive results. In all experiments, we train all the models using the Adam optimizer (Kingma & Ba, 2014) and we set all MLPs in Section 2 to have one hidden layer with leaky-ReLU hidden activation. All the models are implemented using Theano (Theano Development Team, 2016) and the code is available at https://anonymous.url.

Blizzard: Blizzard is a speech modeling dataset with 300 hours of English, spoken by a single female speaker. We report the average log-likelihood for half-second sequences (Fraccaro et al., 2016). In our experimental setting, we use 1024 hidden units for MLPs, 1024 LSTM units and 512 latent variables. Our model is trained using a learning rate of 0.001 and minibatches of size 32 and we set

. A fully factorized multivariate Gaussian distribution is used as the output distribution. The final lower bound estimate on Blizzard can be found in Table 1.


| Model | Blizzard | TIMIT |
|---|---|---|
| RNN-Gauss | 3539 | -1900 |
| RNN-GMM | 7413 | 26643 |
| VRNN-I-Gauss | 8933 | 28340 |
| VRNN-Gauss | 9223 | 28805 |
| VRNN-GMM | 9392 | 28982 |
| SRNN (smooth+res) | 11991 | 60550 |
| Z-Forcing (Sordoni et al., 2017) | 14315 | 68852 |
| Variational Bi-LSTM | 17319 | 73976 |

Table 1: Average log-likelihood per sequence on the Blizzard and TIMIT test sets

TIMIT: Another speech modeling dataset is TIMIT, with 6300 English sentences read by 630 speakers. Like Fraccaro et al. (2016), we train our model on raw sequences of 200-dimensional frames. In our experiments, we use 1024 hidden units, 1024 LSTM units and 128 latent variables, and a batch size of 128. We train the model using a learning rate of 0.0001, and . The average log-likelihood for the sequences in the test set can be found in Table 1.

Sequential MNIST: We use the MNIST dataset which is binarized according to Murray & Salakhutdinov (2009) and we download it from Larochelle (2011). Our best model consists of 1024 hidden units, 1024 LSTM units and 256 latent variables. We train the model using a learning rate of 0.0001 and a batch size of 32. To reach the negative log-likelihood reported in Table 2, we set and .

| Model | Seq-MNIST |
|---|---|
| DBN 2hl (Germain et al., 2015) | 84.55 |
| NADE (Uria et al., 2016) | 88.33 |
| EoNADE-5 2hl (Raiko et al., 2014) | 84.68 |
| DLGM 8 (Salimans et al., 2014) | 85.51 |
| DARN 1hl (Gregor et al., 2015) | 84.13 |
| BiHM (Bornschein et al., 2015) | 84.23 |
| DRAW (Gregor et al., 2015) | 80.97 |
| PixelVAE (Gulrajani et al., 2016) | 79.02 |
| Prof. Forcing (Goyal et al., 2016) | 79.58 |
| PixelRNN (Oord et al., 2016) | 80.75 |
| PixelRNN (Oord et al., 2016) | 79.20 |
| Z-Forcing (Sordoni et al., 2017) | 80.09 |
| Variational Bi-LSTM | 79.78 |

Table 2: Average negative log-likelihood on sequential MNIST

IMDB: This dataset consists of 350000 movie reviews (Diao et al., 2014) in which each sentence has fewer than 16 words and the vocabulary size is fixed to 16000 words. In this experiment, we use 500 hidden units, 500 LSTM units and latent variables of size 64. The model is trained with a batch size of 32 and a learning rate of 0.001 and we set . The word perplexity on the validation and test sets is shown in Table 3.

| Model | Valid | Test |
|---|---|---|
| Gated Word-Char | 70.60 | 70.87 |
| Z-Forcing (Sordoni et al., 2017) | 56.48 | 65.68 |
| Variational Bi-LSTM | 51.43 | 51.60 |

Table 3: Word perplexity on the IMDB validation and test sets

PTB: Penn Treebank (Marcus et al., 1993) is a language modeling dataset consisting of 1 million words. We train our model with 1024 LSTM units, 1024 hidden units, and latent variables of size 128. We train the model using a standard Gaussian prior, a learning rate of 0.001 and a batch size of 50 and we set . The model is trained to predict the next character in a sequence and the final bits per character on the test and validation sets are shown in Table 4.

| Model | Valid | Test |
|---|---|---|
| Unregularized LSTM | 1.47 | 1.36 |
| Weight noise | 1.51 | 1.34 |
| Norm stabilizer | 1.46 | 1.35 |
| Stochastic depth | 1.43 | 1.34 |
| Recurrent dropout | 1.40 | 1.29 |
| Zoneout (Krueger et al., 2016) | 1.36 | 1.25 |
| RBN (Cooijmans et al., 2016) | - | 1.32 |
| H-LSTM + LN (Ha et al., 2016) | 1.28 | 1.25 |
| 3-HM-LSTM + LN (Chung et al., 2016) | - | 1.24 |
| 2-H-LSTM + LN (Ha et al., 2016) | 1.25 | 1.22 |
| Z-Forcing | 1.29 | 1.26 |
| Variational Bi-LSTM | 1.26 | 1.23 |

Table 4: Bits per character (BPC) on the PTB validation and test sets
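For readers comparing the metrics in Tables 3 and 4, both are simple transformations of the average negative log-likelihood. The helper below is a small sketch, assuming the summed negative log-likelihood is measured in nats; the function names are chosen here for illustration.

```python
import math

def bits_per_character(total_nll_nats, num_chars):
    """Average negative log-likelihood per character, converted from
    nats to bits by dividing by ln(2)."""
    return total_nll_nats / (num_chars * math.log(2.0))

def perplexity(total_nll_nats, num_tokens):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(total_nll_nats / num_tokens)
```

As a sanity check, a model that assigns probability 1/2 to every character scores exactly 1 bit per character, and a model that assigns probability 1/4 to every word has perplexity 4. Lower is better for both metrics.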

4 Ablation Studies

The goal of this section is to study the importance of the various components in our model and ensure that these components provide performance gains. The experiments are as follows:

| AR coefficient | 0.001 | 1 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Test perplexity | 56.07 | 60.74 | 69.97 | 77.24 | 86.72 |

Table 5: Perplexity on IMDB using different coefficients for activity regularization

1. Reconstruction loss on $h_t$ vs. activity regularization on $h_t$

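Merity et al. (2017) study the importance of activity regularization (AR) on the hidden states of LSTMs, given as

$$\mathcal{R}_{\mathrm{AR}} = \alpha \lVert h_t \rVert_2^2$$

Since our model’s reconstruction term on $h_t$ can be decomposed as

$$\lVert h_t - \tilde{h}_t \rVert_2^2 = \lVert h_t \rVert_2^2 + \lVert \tilde{h}_t \rVert_2^2 - 2\, h_t^{\top} \tilde{h}_t$$

we perform experiments to confirm that the gains in our approach are not due to the $\ell_2$ regularization alone, since our regularization encapsulates an $\ell_2$ term along with the dot product term.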

To do so, we replace the auxiliary reconstruction terms in our objective with activity regularization, varying its coefficient, and study the test perplexity. The results are shown in Table 5. We find that in all cases, performance using activity regularization is worse than that of our best model shown in Table 3.
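The argument rests on the fact that a squared reconstruction error contains a squared-norm penalty on the hidden state plus further terms. The sketch below checks this identity numerically; it assumes the reconstruction term is a squared Euclidean distance between a hidden state and its reconstruction, and the function names are hypothetical.

```python
def squared_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in v)

def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def reconstruction_terms(h, h_tilde):
    """Split ||h - h_tilde||^2 into the activity-regularization-like term
    ||h||^2, the norm of the reconstruction, and the cross term."""
    return squared_norm(h), squared_norm(h_tilde), -2.0 * dot(h, h_tilde)

def squared_distance(h, h_tilde):
    """Squared Euclidean distance between h and its reconstruction."""
    return squared_norm([a - b for a, b in zip(h, h_tilde)])
```

Only the first of the three terms matches activity regularization. The cross term rewards alignment between the state and its reconstruction, which is the part that the ablation removes.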

| Dataset | PTB | Seq-MNIST | IMDB | TIMIT | Blizzard |
|---|---|---|---|---|---|
| KL | 0.001 | 0.02 | 0.18 | 3204.71 | 3799.79 |

Table 6: KL divergence of the Variational Bi-LSTM
Figure 2: Evolution of validation set performance during training of Variational Bi-LSTMs with and without auxiliary reconstruction costs and stochastic backprop through auxiliary costs on PTB and Blizzard. We see that both the presence of the reconstruction loss and stochastic back-propagation through it help performance.
Figure 3: Evolution of bits per character (BPC) on the PTB validation set when sampling the latent variable $z_t$ from the prior during training vs. using its expected value. Performance drops when $z_t$ is not sampled. This shows that sampling $z_t$ randomly during training acts as a regularizer.

2. Use of parametric encoder prior vs. fixed Gaussian prior

In our variational Bi-LSTM model, we propose to have the encoder prior over $z_t$ as a function of the previous forward LSTM hidden state $h_{t-1}$. This is done to remove the need for the backward LSTM during inference, because it is unavailable in practical scenarios where predictions are made in the forward direction. However, to study whether the model learns to use this encoder or not, we record the KL divergence value of the best validation model for the various datasets. The results are reported in Table 6. We can see that the KL divergence values are large in the case of the IMDB, TIMIT and Blizzard datasets, but small in the case of Seq-MNIST and PTB. To explore further, we run experiments on these datasets with a fixed standard Gaussian prior, as in a traditional VAE. Interestingly, we find that the model with a fixed prior performs similarly in the case of PTB, but performs worse in the other cases, which can be explained by their large KL divergence values in the original experiments.

3. Effectiveness of auxiliary costs and stochastic back-propagation

In our model description, we propose to stochastically back-propagate gradients through the auxiliary (reconstruction) costs, i.e., randomly choose whether or not to pass the gradients of the auxiliary cost. Here we evaluate the importance of the auxiliary costs and stochastic back-propagation. Figure 2 shows the evolution of validation performance on the Blizzard and PTB datasets. In both cases we see that both the auxiliary costs and stochastic back-propagation help validation set performance.

4. Importance of sampling from VAE prior during training

We evaluate the effectiveness of sampling $z_t$ from the prior during training vs. using the mean of the Gaussian prior. The validation set performance during training is shown in Figure 3. It can be seen that sampling leads to better generalization. Further, with the model trained with $z_t$ sampled, we also evaluate whether using samples is necessary during inference. Interestingly, we find that during inference the performance is identical in both cases; thus the deterministic mean of the prior can be used during inference.

5 Related Work

Variational auto-encoders (Kingma & Welling, 2014) can be easily combined with many deep learning models. They have been applied in the feed-forward setting, but they have also found usage in RNNs to better capture variation in sequential data (Sordoni et al., 2017; Fraccaro et al., 2016; Chung et al., 2015; Bayer & Osendorfer, 2014). VAEs consist of several multi-layer neural networks as probabilistic encoders and decoders, and training is based on the gradient of a log-likelihood lower bound (as the likelihood is in general intractable) with respect to the model parameters, along with a reparametrization trick. The derived variational lower bound for an observed random variable $x$ is:

$$\log p(x) \geq \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)$$

where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence and $p(z)$ is the prior over a latent variable $z$. The KL divergence term can be expressed as the difference between the cross-entropy of the prior w.r.t. $q(z \mid x)$ and the entropy of $q(z \mid x)$, and fortunately, it can be analytically computed and differentiated for some distribution families like Gaussians. Although maximizing the log-likelihood corresponds to minimizing the KL divergence, we have to ensure that the resulting $q(z \mid x)$ remains far enough from an undesired equilibrium state where $q(z \mid x)$ is almost everywhere equal to the prior over latent variables. Combining recurrent neural networks with variational auto-encoders can lead to powerful generative models that are capable of capturing the variations in data; however, they suffer badly from this optimization issue, as discussed by Bowman et al. (2015).
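The two properties of the KL term mentioned above can be written out explicitly. The first line is the general decomposition into cross-entropy minus entropy; the second is the well-known closed form for a $d$-dimensional diagonal Gaussian posterior against a standard normal prior, given here as an illustrative special case rather than the prior used in this paper.

```latex
\begin{aligned}
D_{\mathrm{KL}}(q \,\|\, p)
  &= \int q(z) \log \frac{q(z)}{p(z)} \, dz
   = \underbrace{-\int q(z) \log p(z)\, dz}_{\text{cross-entropy } H(q,\,p)}
   \;-\; \underbrace{\Big(-\int q(z) \log q(z)\, dz\Big)}_{\text{entropy } H(q)} \\
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  &= \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)
\end{aligned}
```

The second expression is zero exactly when $\mu = 0$ and $\sigma = 1$ in every dimension, which is the degenerate state in which the posterior carries no information about the input.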

However, VAEs have successfully been applied to Bi-LSTMs by Sordoni et al. (2017) through a technique called Z-forcing. It is a powerful generative auto-regressive model which is trained using the following variational evidence lower-bound

plus an auxiliary cost as a regularizer which is defined as . It is shown that the auxiliary cost helps in improving the final performance; however, during inference the backward reconstructions are not used in their approach. In our ablation studies (Section 4), we show experimentally that this connection is important for improving the performance of Bi-LSTMs, as is the case in our model.

Twin networks, on the other hand, is a deterministic method which enforces the hidden states of the forward and backward paths of a Bi-LSTM to be similar. Specifically, this is done by adding, as a regularizer, the norm of the difference between the pair of hidden states at each time step. The intuition behind this regularization is to encourage the hidden representations to be compatible towards future predictions. A difference between twin networks and our approach is that twin networks force the hidden states of both LSTMs to be similar to each other, while our approach directly feeds a latent variable $\tilde{b}_t$ to the forward LSTM that is trained to be similar to the backward LSTM hidden state $b_t$. Hence, while twin networks discard information from the backward path during inference, our model encourages the use of this information.

6 Conclusion

We propose Variational Bi-LSTM as an auto-regressive generative model by framing a joint objective that effectively creates a channel for exchanging information between the forward and backward LSTM. We achieve this by deriving a variational lower bound of the joint likelihood of the temporal data sequence. We empirically show that our variational Bi-LSTM model acts as a regularizer for the forward LSTM and leads to performance improvement on different benchmark sequence generation problems. We also show through ablation studies the importance of the various components in our model.


The authors would like to thank Theano developers  (Theano Development Team, 2016) for their great work. DA was supported by IVADO, CIFAR and NSERC.



A: Derivation of the variational lower bound in equation (5) in more detail: