1 Introduction
Attention models have gained widespread traction through their successful use in tasks such as object recognition, machine translation, and speech recognition, where they are used to integrate information from different parts of the input before producing outputs. Soft attention does this by weighting and combining all input elements into a context vector, while hard attention selects specific inputs and discards others, leading to computational gains and greater interpretability. While soft attention models are differentiable end-to-end and thus easy to train, hard attention models introduce discrete latent variables that often require reinforcement-learning-style approaches.
Classic reinforcement learning methods such as REINFORCE [1] and Q-learning [2] have been used to train hard attention models, but these methods can provide high-variance gradient estimates, making training slow and yielding inferior solutions. An alternative to reinforcement learning is variational inference, which trains a second model, called the approximate posterior, to be close to the true posterior over the latent variables. The approximate posterior uses information about both the input and its labels to produce settings of the latent variables used to train the original model. This can provide lower-variance gradient estimates and better solutions.
In this paper, we leverage recent developments in variational inference to fit hard attention models in a sequential setting. We specialize these methods to sequences and develop a model for the approximate posterior. To address issues that arise when applying variational inference techniques to long sequences, we develop new variance control methods. Finally, we show experimentally that our approach improves performance and substantially reduces training time for speech recognition on the TIMIT dataset, as well as on a challenging noisy, multi-speaker version of TIMIT that we call Multi-TIMIT.
2 Methods
Figure 1: $s_t$ are the hidden states of the recurrent neural networks (RNNs) that parameterize the conditional distributions of the models. Square nodes are deterministic, round nodes are stochastic. A shaded node indicates that the model chose to consume an input and not emit an output, while an unshaded node means that the model chose to produce an output and not consume an input. For example, in (a) the node at timestep 1 is shaded, so the model did not produce an output on timestep 1 and instead consumes the input on the next timestep; the node at timestep 2 is unshaded, so on the second timestep the model produced an output.

2.1 Model
In this paper we use the online sequence-to-sequence model described in [3] to demonstrate our methods. We model $p(y_{1:N} \mid x_{1:S})$, where $y_{1:N}$ is a sequence of observed target tokens and $x_{1:S}$ is a sequence of observed inputs. The Bernoulli latent variables $b_{1:T}$ define when the model outputs tokens, i.e. $b_t = 1$ implies the model emitted a token at timestep $t$, and $b_t = 0$ implies the model did not emit a token at timestep $t$. If $b_t = 1$, the model is forced to dwell on the same input at the next time step, i.e. the observation fed in at timestep $t$ is fed in again at timestep $t+1$ when $b_t = 1$. Let $N$ be the number of target tokens, $S$ the number of inputs, and $T$ the number of steps the model is run for. Our model assumes $p(y_{1:N}, b_{1:T} \mid x_{1:S})$ factorizes as
$$p(y_{1:N}, b_{1:T} \mid x_{1:S}) = \prod_{t=1}^{T} p(y_{n_t} \mid y_{1:n_t-1}, b_{1:t}, x_{1:i_t})^{b_t} \, p(b_t \mid y_{1:n_{t-1}}, b_{1:t-1}, x_{1:i_t}) \qquad (1)$$
where $n_t = \sum_{t'=1}^{t} b_{t'}$ is the position in the output at time $t$ and $i_t = t - n_{t-1}$ is the input position at time $t$. Intuitively, this expression is the product over time of the probability assigned to the current ground truth given that the model emitted, multiplied by the probability that the model emitted. When $b_t = 0$ the model did not emit at time $t$, so no probability is assigned to the ground truth on that timestep. For brevity, we will use $y_t$ to implicitly mean $y_{n_t}$ (i.e., the target at step $t$). Similarly, we will refer to $x_{i_t}$ as $x_t$, and similarly for ranges over time for these variables.

2.2 Learning
To fit the model (1) with maximum likelihood we are concerned with maximizing the probability of the observed variables $y_{1:N}$. However, (1) is written in terms of the unobserved latents $b_{1:T}$, so we must marginalize over them. We maximize
$$\mathcal{L} = \mathbb{E}_{b \sim p(b \mid x)}\left[\log p(y \mid b, x)\right] = \mathbb{E}_{b \sim p(b \mid x)}\left[\sum_{t=1}^{T} b_t \log p(y_t \mid s_t)\right]$$
where $s_t$ is the state at time $t$ and the expectations are over $b \sim p(b \mid x)$. Note that this is a lower bound on the log probability of the observed $y$, so maximizing this bound will hopefully increase the likelihood of the observed data. Differentiating this objective gives
$$\nabla \mathcal{L} = \mathbb{E}_{b \sim p(b \mid x)}\left[\sum_{t=1}^{T} b_t \nabla \log p(y_t \mid s_t) + \sum_{t=1}^{T} R_t \nabla \log p(b_t \mid s_t)\right] \qquad (2)$$
where $R_t = \sum_{t' \geq t} b_{t'} \log p(y_{t'} \mid s_{t'})$ is the return at timestep $t$, understood intuitively as the log probability the model assigns to the observed data for a given series of emission decisions. The first gradient term can be estimated with a single Monte Carlo sample, but the second term exhibits high variance because it involves an unbounded log probability. To reduce variance, [3] subtracts a learned baseline from the return, which does not change the expectation as long as it is independent of $b_t$.
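To make the estimator concrete, the following toy sketch (our own construction; the function name and the scalar reward are illustrative, not part of the model) estimates the score-function gradient for a single Bernoulli emission decision, with a baseline subtracted from the return:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(theta, returns_fn, baseline, n_samples=1000):
    """Monte Carlo score-function gradient of E_b[R(b)] for
    b ~ Bernoulli(sigmoid(theta)). Subtracting a baseline leaves the
    expectation unchanged as long as it does not depend on the sampled b."""
    p = 1.0 / (1.0 + np.exp(-theta))   # emission probability
    b = rng.random(n_samples) < p      # sampled emission decisions
    score = b.astype(float) - p        # d/dtheta log Bernoulli(b; sigmoid(theta))
    returns = returns_fn(b)
    return np.mean((returns - baseline) * score)
```

For a reward of 1 per emission, the true gradient at $\theta = 0$ is $p(1-p) = 0.25$; subtracting the mean return as a baseline gives the same expectation with lower per-sample variance.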
Performing stochastic gradient ascent with this gradient estimator is the standard REINFORCE algorithm where the reward is the log-likelihood. Unfortunately, this requires sampling from $p(b \mid x)$ during training, which can lead to gradient estimates with high variance when settings of $b$ that assign high likelihood to $y$ are rare [4]. Variational inference is a family of techniques that use importance sampling to instead sample from a different model, called the approximate posterior $q(b \mid x, y)$, which approximates the true posterior over $b$, $p(b \mid x, y)$. We factorize the approximate posterior as
$$q(b_{1:T} \mid x, y) = \prod_{t=1}^{T} q(b_t \mid b_{1:t-1}, x, y) \qquad (3)$$
The approximate posterior has access to all past and future $x$ and $y$, as well as past $b$, and leverages this information to assign high probability to settings of $b$ that produce large values of $\log p(y \mid b, x)$. Intuitively, in speech recognition, knowing the token the model must emit is helpful in deciding when to emit.
Using $q$ and an importance sampling identity, we obtain a lower bound on the log-likelihood
$$\log p(y \mid x) \geq \mathbb{E}_{b \sim q(b \mid x, y)}\left[\log \frac{p(y, b \mid x)}{q(b \mid x, y)}\right] \qquad (4)$$
where we can simultaneously optimize $q$ and the parameters of the model to improve the lower bound. Optimizing this bound via stochastic gradient ascent can be thought of as training the model with maximum likelihood to reproduce the $b$s sampled from $q$. $q$ is then updated with REINFORCE-style gradients where the reward is the log probability the model assigns to $y$ given $b$, similar to (2); see [4] for details. Setting $q(b \mid x, y) = p(b \mid x)$ recovers the REINFORCE objective.
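The bound in (4) can be estimated by simple Monte Carlo. The following self-contained toy sketch (a single binary latent with hand-picked probabilities of our own choosing) illustrates that the estimate equals the true log marginal when $q$ is the exact posterior, and falls below it otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint over one binary latent b: p(y, b=1) = 0.3, p(y, b=0) = 0.1,
# so the marginal is p(y) = 0.4 and the true posterior is q*(b=1) = 0.75.
def log_joint(b):
    return np.where(b, np.log(0.3), np.log(0.1))

def elbo(q1, n_samples=10000):
    """Single-sample variational bound E_q[log p(y, b) - log q(b)] <= log p(y)."""
    b = rng.random(n_samples) < q1
    log_q = np.where(b, np.log(q1), np.log(1.0 - q1))
    return np.mean(log_joint(b) - log_q)
```

With `q1 = 0.75` every sampled term equals $\log 0.4$ exactly, so the bound is tight with zero variance; with a mismatched `q1` the bound is strictly looser.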
2.2.1 Multi-sample Objectives
Both the REINFORCE and the variational inference objectives admit multi-sample versions that give tighter bounds on the log-likelihood [5]. In particular, the multi-sample variational lower bound is
$$\log p(y \mid x) \geq \mathbb{E}_{b^{(1)}, \ldots, b^{(K)} \sim q}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p(y, b^{(k)} \mid x)}{q(b^{(k)} \mid x, y)}\right] \qquad (5)$$
where $K$ is the number of samples and $b^{(k)}$ denotes the $k$th sample of the latent variables. Setting $q(b \mid x, y) = p(b \mid x)$ recovers the multi-sample analogue of REINFORCE.
The gradient of (5) takes a similar form to (2), with one low-variance term and one REINFORCE-style term with high variance; for details see [4]. Similarly to the REINFORCE objective, we can use a baseline to reduce the variance of the gradient as long as it does not depend on $b^{(k)}$. Notably, the baseline for trajectory $k$ is allowed to depend on all timesteps of the other trajectories $b^{(j)}$, $j \neq k$.
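The multi-sample bound is computed stably in log space with a log-sum-exp over the $K$ importance weights. A minimal sketch, again on a toy binary latent of our own construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def multisample_bound(log_joint, log_q, samples):
    """log (1/K) sum_k p(y, b^k | x) / q(b^k | x, y), via a stable log-sum-exp."""
    log_w = log_joint(samples) - log_q(samples)   # log importance weights, shape (K,)
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

# Toy joint: p(y, b=1) = 0.3, p(y, b=0) = 0.1, so log p(y) = log 0.4.
log_joint = lambda b: np.where(b, np.log(0.3), np.log(0.1))

def draw_and_bound(q1, K):
    """Draw K samples from Bernoulli(q1) and evaluate the K-sample bound."""
    b = rng.random(K) < q1
    log_q = lambda s: np.where(s, np.log(q1), np.log(1.0 - q1))
    return multisample_bound(log_joint, log_q, b)
```

When `q1` equals the true posterior 0.75, every weight equals $p(y)$ and the bound is exact for any $K$; with a mismatched `q1`, the $K$-sample bound sits between the single-sample bound and $\log p(y)$ on average.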
2.3 Variance Reduction
Training these models is challenging due to high-variance gradient estimates. We can reduce the variance of the estimators by using information from multiple trajectories to construct baselines. In particular, for REINFORCE, we can write the gradient update as
$$\nabla \mathcal{L} = \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\sum_{t=1}^{T} b_t^{(k)} \nabla \log p(y_t \mid s_t^{(k)}) + \left(R_t^{(k)} - c_t^{(k)}\right) \nabla \log p(b_t^{(k)} \mid s_t^{(k)})\right]$$
where $c_t^{(k)}$ is a baseline for sample $k$ that is a function of the $k$th trajectory's state up to time $t$ as well as the returns produced by all other trajectories. The goal is to pick a $c_t^{(k)}$ that is a good estimate of the return, and a straightforward choice is the average return of the other samples,
$$c_t^{(k)} = \frac{1}{K-1} \sum_{j \neq k} R_1^{(j)}.$$
This ignores the fact that $R_t^{(k)} \neq R_1^{(k)}$, which can make this standard baseline unusable. For example, in our setting different trajectories may have emitted different numbers of tokens on a given timestep, resulting in substantial differences in return between trajectories that do not indicate the relative merit of those trajectories. Ideally, we would average over multiple trajectories starting from $s_t^{(k)}$, but this is computationally expensive. In [4] the authors propose the following baseline, which adds a residual term to address this. Let $r_t = b_t \log p(y_t \mid s_t)$ be the instantaneous reward at timestep $t$; then the baseline at timestep $t$ can be written
$$c_t^{(k)} = \frac{1}{K-1} \sum_{j \neq k} R_1^{(j)} - \sum_{t' < t} r_{t'}^{(k)} \qquad (6)$$
This baseline results in a learning signal that is the same across all timesteps, potentially increasing variance as all decisions in a trajectory are rewarded or punished together. We will call this the leave-one-out (LOO) baseline because the baseline for a given sample is constructed using an average of the returns of the other samples. Note that VIMCO optimizes the multi-sample variational lower bound in equation (5) with the leave-one-out baseline, and NVIL optimizes the single-sample variational lower bound in equation (4) with a baseline that can be learned or computed from averages [6].
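The leave-one-out construction itself is a one-liner: for each sample $k$, average the total returns of the remaining $K-1$ samples (a sketch; the function name is ours):

```python
import numpy as np

def loo_baselines(returns):
    """Leave-one-out baseline: for sample k, the mean return of the other
    K - 1 samples. It does not depend on sample k's own trajectory, so
    subtracting it leaves the gradient's expectation unchanged."""
    total = np.sum(returns)
    return (total - returns) / (len(returns) - 1)
```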
As the return strongly depends on the number of tokens emitted by time $t$, we can instead average the returns of the other samples from the point when they have emitted the same number of tokens as sample $k$ has at timestep $t$. In particular, let $\tau_j^{(k,t)}$ be the first timestep at which sample $j$ has emitted the same number of tokens as sample $k$ has at timestep $t$; then
$$c_t^{(k)} = \frac{1}{K-1} \sum_{j \neq k} R_{\tau_j^{(k,t)}}^{(j)} \qquad (7)$$
We call this new baseline the temporal leave-one-out baseline because it takes into account the temporal reward structure of our setting. This baseline can be combined with the parametric baseline, and is applicable to both variational inference and REINFORCE objectives in single- and multi-sample settings. We explore the performance of these baselines empirically in the experiments section.
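A sketch of the temporal leave-one-out baseline in (7). The array layout, the return-to-go convention, the choice of the first matching timestep, and the zero fallback when no other sample ever matches are our assumptions for illustration, not fixed by the text:

```python
import numpy as np

def temporal_loo_baseline(rewards, emits):
    """Temporal leave-one-out baseline (a sketch under assumed conventions).

    rewards: (K, T) instantaneous rewards r_t per sample and timestep.
    emits:   (K, T) binary emission decisions b_t.
    For sample k at time t, average the return-to-go of each other sample j
    from the first timestep where j has emitted as many tokens as k has at t.
    """
    K, T = rewards.shape
    # Return-to-go: R_t = sum of rewards from t to the end of the trajectory.
    returns_to_go = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]
    counts = np.cumsum(emits, axis=1)   # tokens emitted so far
    baselines = np.zeros((K, T))
    for k in range(K):
        for t in range(T):
            vals = []
            for j in range(K):
                if j == k:
                    continue
                # First timestep where sample j matches sample k's token count.
                match = np.nonzero(counts[j] == counts[k, t])[0]
                if match.size:
                    vals.append(returns_to_go[j, match[0]])
            baselines[k, t] = np.mean(vals) if vals else 0.0
    return baselines
```

A vectorized version would be preferable in practice; the loops here are kept for clarity.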
3 Related Work
In this section we first highlight the relationship between our model and other models for attention. Tang et al. [7] proposed visual attention within the context of generative models, while Mnih et al. [8] proposed using recurrent models of visual attention for discriminative tasks. Subsequently, visual attention was used in an image captioning model [9]. These forms of attention use discrete variables for attention location. Recently, 'soft-attention' models were proposed for neural machine translation and speech recognition [10, 11]. Unlike the earlier-mentioned hard-attention models, these models pay attention to the entire input and compute features by blending spatial features with an attention vector that is normalized over the entire input. Our paper is most similar to the hard-attention models in that features at discrete locations are used to compute predictions. However, it differs from the above models in the training method: while the hard-attention models use REINFORCE for training, we follow variational techniques. We also differ from the above models in the specific application: attention in our models is over temporal locations only, rather than visual and temporal locations. As a result, we additionally propose the temporal leave-one-out baseline.

Because the attention model we use is hard attention, it has parallels to prior work on online sequence-to-sequence models [12, 3]. The neural transducer model [12] can use either hard attention or a combination of hard attention with local soft attention. However, it explicitly splits the input sequences into chunks, and it is trained with an approximate maximum likelihood procedure that is similar to a policy search. The model of Luo et al. [3] is most similar to ours. Both models use the same architecture; however, while they use REINFORCE for training, we explore VIMCO for training the attention model. We also propose the novel temporal LOO baseline. A similar model with REINFORCE has also been used for training an online translation model [13] and for training Neural Turing Machines [14]. Our work would be equally valid for these domains, which we leave for future work.

There has also been work using reweighted wake-sleep to train sequential models. In [15], Ba et al. optimize a variational lower bound with the prior instead of using a variational posterior. In this work, we refer to this as REINFORCE to distinguish it from variational inference with an inference network. In [16] the authors revisit this topic, using reweighted wake-sleep to train similar models. Their algorithm makes use of an inference network but does not optimize a variational lower bound. Instead they optimize separate objectives for the model and the inference network that produce a biased estimate of the gradient of the log marginal likelihood.
4 Experiments
For our experiments we used the standard TIMIT phoneme recognition task. The TIMIT dataset has 3696 training utterances, 400 validation utterances, and 182 test utterances. The audio waveforms were processed into frames of log mel filterbank spectrograms every 25ms with a stride of 10ms. Each frame had 40 mel frequency channels and one energy channel; deltas and accelerations of the features were appended to each frame. As a result each frame was a 123-dimensional input. The targets for each utterance were the sequence of phonemes. We used the 61 phoneme labels provided with TIMIT for training and decoding. To compute the phone error rate (PER) we collapsed the 61 phonemes to 39, as is standard on this task [17].

To model $p(y, b \mid x)$ we used a 2-layer LSTM with 256 units in each layer. For the variational posterior we first processed the inputs with a 4-layer bidirectional LSTM and then fed the final layer's hidden state into a 2-layer unidirectional LSTM along with the current target $y_t$ and the previous emission decision $b_{t-1}$. Each layer had 256 units. Note that in this case the approximate posterior does not have access to future targets at timestep $t$; in practice we found that giving access to targets far in the future did not improve performance.
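The delta-and-acceleration feature construction described above can be sketched as follows (a simplification: we use `np.gradient` as the differencing operator, whereas standard speech pipelines typically use regression-based deltas):

```python
import numpy as np

def add_deltas(frames):
    """Append first and second temporal differences to each frame, turning
    41-dim features (40 mel channels + 1 energy) into 123-dim inputs."""
    delta = np.gradient(frames, axis=0)   # first difference over time
    accel = np.gradient(delta, axis=0)    # second difference over time
    return np.concatenate([frames, delta, accel], axis=1)
```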
We regularized the models with variational noise [18], performing a grid search over the standard deviation of the noise. We also used L2 regularization, with a grid search over the weight of the regularization.

Table 1: Phone error rate (PER, %) on TIMIT.

Method                                              | PER
--------------------------------------------------- | ----
REINFORCE with leave-one-out (LOO) baseline         | 20.5
NVIL with LOO baseline                              | 21.1
VIMCO with LOO baseline                             | 20.0
REINFORCE with temporal LOO baseline                | 20.0
NVIL with temporal LOO baseline                     | 21.4
VIMCO with temporal LOO baseline                    | 20.0
Online Alignment RNN (stacked LSTM) [3]             | 21.5
Neural Transducer with unsupervised alignments [12] | 20.8
Online Alignment RNN (grid LSTM) [3]                | 20.5
Monotonic Alignment Decoder [19]                    | 20.4
Neural Transducer with supervised alignments [12]   | 19.8
Connectionist Temporal Classification [20]          | 19.6
Table 2: PER (%) on Multi-TIMIT at each mixing proportion of the second speaker.

Method                                | 0.50 | 0.25 | 0.10
------------------------------------- | ---- | ---- | ----
Connectionist Temporal Classification | 43.8 | 33.3 | 27.3
RNN Transducer                        | 48.9 | 32.2 | 25.7
REINFORCE with LOO baseline           | 42.9 | 32.5 | 25.9
NVIL with LOO baseline                | 70.1 | 71.8 | 55.2
VIMCO with LOO baseline               | 41.7 | 30.7 | 25.4
REINFORCE with temporal LOO baseline  | 43.5 | 31.6 | 25.6
NVIL with temporal LOO baseline       | 74.3 | 71.9 | 54.9
VIMCO with temporal LOO baseline      | 41.7 | 30.75| 25.2
4.1 Multi-TIMIT
We generated a multi-speaker dataset by mixing male and female voices from TIMIT. Each utterance in the original TIMIT dataset was paired with an utterance from the opposite gender. The waveforms of both utterances were first scaled to lie within the same range, and then the second utterance was attenuated to a smaller volume before the two utterances were mixed. We used three different scales for the second utterance: 50%, 25%, and 10%. The new raw utterances were processed in the same manner as the original TIMIT utterances, resulting in a 123-dimensional input per frame. The transcript of the first speaker was used as the ground truth transcript for the new utterance. Multi-TIMIT has the same number of train, validation, and test utterances as the original TIMIT, as well as the same target phonemes.
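The mixing procedure can be sketched as follows (the function name and the truncation-based length alignment are our assumptions; the exact normalization details of the dataset construction may differ):

```python
import numpy as np

def mix_utterances(primary, secondary, scale=0.25):
    """Scale both waveforms to the same peak range, attenuate the second
    by `scale`, and add. `scale` in {0.5, 0.25, 0.1} corresponds to the
    three mixing proportions used for Multi-TIMIT."""
    n = min(len(primary), len(secondary))           # align lengths by truncation
    a = primary[:n] / np.max(np.abs(primary[:n]))   # peak-normalize speaker 1
    b = secondary[:n] / np.max(np.abs(secondary[:n]))
    return a + scale * b
```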
We trained models with the configuration described above at the different mixing scales, and also trained 2-layer unidirectional LSTM models with Connectionist Temporal Classification for comparison. The results are shown in Table 2.
5 Results
Figure 4 shows the training curves for the different training methods and datasets. The variational methods (VIMCO and NVIL) require many fewer training steps than REINFORCE on both datasets. All methods used the same batch size and number of samples, so training steps are comparable. NVIL performs well enough on a simple task like TIMIT, but struggles with Multi-TIMIT. The gap between REINFORCE and VIMCO widens on Multi-TIMIT (see also Table 2).
The right panel of Figure 4 shows that REINFORCE waits longer than VIMCO to emit outputs, presumably because it requires more information during learning. VIMCO, on the other hand, leverages the variational posterior, which can access future inputs and targets to find the optimal place to emit.
In our experiments the difference in performance between VIMCO and REINFORCE was larger for the more complicated Multi-TIMIT task than for the simpler TIMIT task. This can be explained by considering the samples the models learn from. On single-speaker TIMIT, the Monte Carlo samples generated by REINFORCE have very high likelihood under the model: only a small number of samples account for essentially all of the probability mass, and these are sampled easily by a left-to-right ancestral pass (in time) through the model. These are very similar to the samples generated by the approximate posterior in VIMCO, so both methods perform approximately the same. On Multi-TIMIT, however, the probabilities of individual emissions in the ancestral pass are much lower. The likelihood is thus less 'peaked', and a large diversity of samples is drawn, leading to higher variance and poor learning. VIMCO does not face this problem because it samples from the approximate posterior, which is close to the true posterior and therefore sharply peaked around the 'correct' samples.
6 Conclusion
In this paper we have shown how to adapt VIMCO to perform hard attention for temporal problems, and we have introduced a new variance-reducing baseline. Our method outperforms other methods of training online sequence-to-sequence models, and the improvements are greater for more difficult problems such as noisy mixed speech. In the future we will apply these techniques to other challenging domains, such as visual attention.
References
 [1] Ronald J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
 [2] Christopher John Cornish Hellaby Watkins, Learning from delayed rewards, Ph.D. thesis, King’s College, Cambridge, 1989.
 [3] Yuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, and Ilya Sutskever, “Learning online alignments with continuous rewards policy gradient,” CoRR, vol. abs/1608.01281, 2016.
 [4] Andriy Mnih and Danilo Jimenez Rezende, “Variational inference for Monte Carlo objectives,” CoRR, vol. abs/1602.06725, 2016.
 [5] Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov, “Importance weighted autoencoders,” CoRR, vol. abs/1509.00519, 2015.
 [6] Andriy Mnih and Karol Gregor, “Neural variational inference and learning in belief networks,” CoRR, vol. abs/1402.0030, 2014.
 [7] Yichuan Tang, Nitish Srivastava, and Ruslan R Salakhutdinov, “Learning generative models with visual attention,” in Advances in Neural Information Processing Systems, 2014, pp. 1808–1816.
 [8] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
 [9] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015, vol. 14, pp. 77–81.
 [10] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
 [11] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attentionbased models for speech recognition,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., pp. 577–585. Curran Associates, Inc., 2015.
 [12] Navdeep Jaitly, David Sussillo, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, and Samy Bengio, “An online sequence-to-sequence model using partial conditioning,” CoRR, vol. abs/1511.04868, 2015.
 [13] Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O. K. Li, “Learning to translate in real-time with neural machine translation,” CoRR, vol. abs/1610.00388, 2016.
 [14] Wojciech Zaremba and Ilya Sutskever, “Reinforcement learning neural Turing machines - revised,” arXiv preprint arXiv:1505.00521, 2015.
 [15] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
 [16] Jimmy Ba, Ruslan R. Salakhutdinov, Roger B. Grosse, and Brendan J. Frey, “Learning wake-sleep recurrent attention models,” in Advances in Neural Information Processing Systems, 2015, pp. 2593–2601.

 [17] K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
 [18] Alex Graves, “Practical variational inference for neural networks,” in Advances in Neural Information Processing Systems, 2011, pp. 2348–2356.
 [19] Colin Raffel, Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck, “Online and linear-time attention by enforcing monotonic alignments,” arXiv preprint arXiv:1704.00784, 2017.
 [20] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.