1 Introduction
Since its first use as a language model in 2010 [19], a recurrent neural network has become a de facto choice for implementing a language model [28, 25]. One of the appealing properties of this approach to language modelling, which we refer to as recurrent language modelling, is that a recurrent language model can generate a long, coherent sentence [26]. This is due to the ability of a recurrent neural network to capture long-term dependencies. This property has come under the spotlight in recent years, as the conditional version of a recurrent language model began to be used in many different problems that require generating a natural language description of a high-dimensional, complex input. These tasks include machine translation, speech recognition, image/video description generation and many more (see [9] and references therein).
Many of the recent advances in conditional recurrent language models have focused on network architectures (e.g., [1]), learning algorithms (e.g., [4, 22, 2]) or novel applications (see [9] and references therein). On the other hand, we notice that there has not been much research on decoding algorithms for conditional recurrent language models. In most work using recurrent language models, it is common practice to use either greedy or beam search to find the most likely natural language description given an input.
In this paper, we investigate whether it is possible to decode better from a conditional recurrent language model. More specifically, we propose a decoding strategy motivated by earlier observations that nonlinear hidden layers of a deep neural network stretch the data manifold such that a neighbourhood in the hidden state space corresponds to a set of semantically similar configurations in the input space [6]. This observation is exploited in the proposed strategy by injecting noise in the hidden transition function of a recurrent language model.
The proposed strategy, called noisy parallel approximate decoding (NPAD), is a meta-algorithm that runs in parallel many chains of the noisy version of an inner decoding algorithm, such as greedy or beam search. Once those parallel chains generate the candidates, the NPAD selects the one with the highest score. As there is effectively no communication overhead during decoding, the wall-clock performance of the proposed NPAD is comparable to a single run of an inner decoding algorithm in a distributed setting, while it improves the performance of the inner decoding algorithm. We empirically evaluate the proposed NPAD against greedy search, beam search, stochastic sampling and diverse decoding [16] in attention-based neural machine translation.
2 Conditional Recurrent Language Model
A language model aims at modelling a probabilistic distribution over natural language text. A recurrent language model is a language model implemented as a recurrent neural network [18].
Let us define a probability of a given natural language sentence,^{1} which we represent as a sequence of linguistic symbols $X = (x_1, x_2, \ldots, x_T)$, as
$$p(X) = p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}), \qquad (1)$$
where $x_{<t}$ is all the symbols preceding the $t$-th symbol in the sentence $X$. Note that this conditional dependency structure is not necessary but is preferred over other possible structures due to its naturalness as well as the fact that the length of a given sentence is often unknown in advance.
^{1} Although I use a “sentence” here, this is absolutely not necessary, and any level of text, such as a phrase, paragraph, chapter and document, can be used as a unit of language modelling. Furthermore, it does not have to be a natural language text but any sequence such as speech, video or actions.
In a neural language model [5], a neural network is used to compute each of the conditional probability terms in Eq. (1). A difficulty in doing so is that the input to the neural network is of variable size. A recurrent neural network cleverly addresses this difficulty by reading one symbol at a time while maintaining an internal memory state:
$$h_t = \phi\left(h_{t-1}, E[x_t]\right), \qquad (2)$$
where $h_t \in \mathbb{R}^d$ is the internal memory state at time $t$, and $E[x_t]$ is a vector representation of the $t$-th symbol in the input sentence. The internal memory state $h_t$ effectively summarizes all the symbols read up to the $t$-th time step.
The recurrent activation function $\phi$ in Eq. (2) can be as simple as an affine transformation followed by a pointwise nonlinearity (e.g., $\tanh$) or as complicated as long short-term memory (LSTM, [13]) or gated recurrent units (GRU, [10]). The latter two are often preferred, as they effectively avoid the issue of vanishing gradient [7].
Given the internal hidden state, the recurrent neural network computes the conditional distribution over the next symbol $x_{t+1}$. Assuming a fixed vocabulary $V$ of linguistic symbols, it is straightforward to make a parametric function that returns a probability of each symbol in the vocabulary:
$$p(x_{t+1} = j \mid x_{\le t}) = \frac{\exp(g_j(h_t))}{\sum_{j'=1}^{|V|} \exp(g_{j'}(h_t))}, \qquad (3)$$
where $g_j(h_t)$ is the $j$-th component of the output of the function $g: \mathbb{R}^d \to \mathbb{R}^{|V|}$. The formulation on the right-hand side of Eq. (3) is called a softmax function [8].
Given Eqs. (2)–(3), the recurrent neural network reads one symbol of a given sentence at a time from left to right and computes the conditional probability of each symbol until the end of the sequence is reached. The probability of the sentence is then given by a product of all those conditional probabilities. We call this recurrent neural network a recurrent language model.
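The left-to-right computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the tiny random model, the `tanh` transition, the sizes and the names `make_model`, `step`, `next_log_probs` and `sentence_log_prob` are all hypothetical and are not taken from the paper's implementation.

```python
import math
import random

def make_model(vocab_size=5, dim=4, seed=0):
    # Hypothetical toy parameters: embeddings E, recurrent weights W, output weights V.
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.gauss(0.0, 0.7) for _ in range(c)] for _ in range(r)]
    return {"E": mat(vocab_size, dim), "W": mat(dim, dim), "V": mat(vocab_size, dim)}

def step(model, h_prev, x):
    # Recurrent transition: h_t = tanh(W h_{t-1} + E[x_t]).
    return [math.tanh(sum(w * h for w, h in zip(row, h_prev)) + e)
            for row, e in zip(model["W"], model["E"][x])]

def next_log_probs(model, h):
    # Log-softmax over the vocabulary, computed stably.
    logits = [sum(v * hi for v, hi in zip(row, h)) for row in model["V"]]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def sentence_log_prob(model, sentence, bos=0):
    # Sum of conditional log-probabilities, reading one symbol at a time.
    h = step(model, [0.0] * len(model["W"]), bos)
    total = 0.0
    for x in sentence:
        total += next_log_probs(model, h)[x]
        h = step(model, h, x)
    return total
```

Because each conditional distribution is normalized, the probabilities of all sentences of a fixed length sum to one, which is a convenient sanity check for such a sketch.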
Conditional Recurrent Language Model
A recurrent language model is turned into a conditional recurrent language model when the distribution over sentences is conditioned on another modality, including another language. In other words, a conditional recurrent language model estimates
$$p(X \mid Y) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, Y). \qquad (4)$$
$Y$ in Eq. (4) can be anything from a sentence in another language (machine translation), an image (image caption generation) or a video clip (video description generation) to speech (speech recognition). In any of those cases, the previously described recurrent language model requires only the slightest tweak in order to take into account $Y$.
The tweak is to compute the internal hidden state of the recurrent language model based not only on $h_{t-1}$ and $E[x_t]$ (see Eq. (2)) but also on $Y$, such that
$$h_t = \phi\left(h_{t-1}, E[x_t], f(Y, t)\right), \qquad (5)$$
where $f$ is a time-dependent function that maps from $Y$ to a vector. Furthermore, we can make $g$ in Eq. (3) conditioned on the input $Y$ as well:
$$p(x_{t+1} = j \mid x_{\le t}, Y) = \frac{\exp(g_j(h_t, f(Y, t)))}{\sum_{j'=1}^{|V|} \exp(g_{j'}(h_t, f(Y, t)))}. \qquad (6)$$
Learning
Given a data set $D$ of pairs $(X, Y)$, the conditional recurrent language model is trained to maximize the log-likelihood function, which is defined as
$$\mathcal{L}(\theta, D) = \frac{1}{|D|} \sum_{n=1}^{|D|} \sum_{t=1}^{T^n} \log p(x_t^n \mid x_{<t}^n, Y^n),$$
where $\theta$ denotes the parameters of the model. This maximization is often done by stochastic gradient descent with the gradient computed by backpropagation [23]. Instead of a scalar learning rate, adaptive learning rate methods, such as Adadelta [27] and Adam [14], are often used.
3 Decoding
Decoding in a conditional recurrent language model corresponds to finding a target sequence $\hat{X}$ that maximizes the conditional probability from Eq. (4):
$$\hat{X} = \arg\max_{X} \log p(X \mid Y).$$
As is clear from the formulation in Eqs. (5)–(6), exact decoding is intractable, as the state space of $X$ grows exponentially with respect to the length of the sequence, i.e., $|\mathcal{X}| = O(|V|^T)$, without any trivial structure that can be exploited. Thus, we must resort to approximate decoding.
3.1 Greedy Decoding
Greedy decoding is perhaps the most naive way to approximately decode from the conditional recurrent language model. At each time step, it greedily selects the most likely symbol under the conditional probability:
$$\hat{x}_t = \arg\max_{j} \log p(x_t = j \mid \hat{x}_{<t}, Y). \qquad (7)$$
This continues until a special marker indicating the end of the sequence is selected.
This greedy approach is computationally efficient, but is likely too crude. Any early choice based on a high conditional probability can easily turn out to be an unlikely one due to low conditional probabilities later on. This issue is closely related to the garden path sentence problem (see Sec. 3.2.4 of [17]).
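The weakness of greedy decoding can be made concrete with a small sketch. The next-symbol table below is a hypothetical toy model (the conditioning input is assumed to be already absorbed into the table), chosen so that the locally best first symbol leads to a less likely complete sequence.

```python
import math

EOS = "</s>"
# Hypothetical next-symbol distributions, keyed by the previous symbol
# (None marks the start of the sequence).
TABLE = {
    None: {"a": 0.6, "b": 0.4},
    "a": {EOS: 0.4, "a": 0.3, "b": 0.3},
    "b": {EOS: 0.9, "a": 0.05, "b": 0.05},
}

def next_log_probs(prefix):
    prev = prefix[-1] if prefix else None
    return {s: math.log(p) for s, p in TABLE[prev].items()}

def greedy_decode(next_log_probs, max_len=20):
    prefix, score = [], 0.0
    while len(prefix) < max_len:
        log_probs = next_log_probs(prefix)
        symbol = max(log_probs, key=log_probs.get)  # locally most likely symbol
        prefix.append(symbol)
        score += log_probs[symbol]
        if symbol == EOS:                           # stop at the end-of-sequence marker
            break
    return prefix, score
```

In this toy model, greedy decoding commits to "a" (probability 0.6) and returns `["a", "</s>"]` with probability 0.6 × 0.4 = 0.24, although `["b", "</s>"]` has probability 0.4 × 0.9 = 0.36.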
3.2 Beam Search
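Beam search improves upon the greedy decoding strategy by maintaining $K$ hypotheses at each time step, instead of a single one. Let
$$\mathcal{H}_{t-1} = \left\{ (\tilde{x}_1^1, \ldots, \tilde{x}_{t-1}^1), (\tilde{x}_1^2, \ldots, \tilde{x}_{t-1}^2), \ldots, (\tilde{x}_1^K, \ldots, \tilde{x}_{t-1}^K) \right\}$$
be a set of current hypotheses at time $t$. Then, from each current hypothesis the following $|V|$ candidate hypotheses are generated:
$$\mathcal{H}_t^k = \left\{ (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_1), (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_2), \ldots, (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_{|V|}) \right\},$$
where $v_j$ denotes the $j$-th symbol in the vocabulary $V$.
The top-$K$ hypotheses from the union of all such hypothesis sets $\mathcal{H}_t^k$, $k = 1, \ldots, K$, are selected based on their scores. In other words,
$$\mathcal{H}_t = \bigcup_{k=1}^{K} \mathcal{B}_k,$$
where
$$\mathcal{B}_k = \arg\max_{\tilde{X} \in \mathcal{A}_k} \log p(\tilde{X} \mid Y), \quad \mathcal{A}_k = \mathcal{A}_{k-1} \setminus \mathcal{B}_{k-1}, \quad \mathcal{A}_1 = \bigcup_{k'=1}^{K} \mathcal{H}_t^{k'}.$$
Among the top-$K$ hypotheses, we consider the ones whose last symbols are the special marker for the end of sequence to be complete and stop expanding such hypotheses. All the other hypotheses continue to be expanded, however, with $K$ reduced by the number of complete hypotheses. When $K$ reaches $0$, the beam search ends, and the best one among all the complete hypotheses is returned.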
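A minimal sketch of this procedure, including the shrinking beam, is given below. The toy next-symbol table is hypothetical (the conditioning input is assumed to be absorbed into it), and the function names are illustrative rather than those of any particular toolkit.

```python
import math

EOS = "</s>"
# Hypothetical next-symbol distributions, keyed by the previous symbol.
TABLE = {
    None: {"a": 0.6, "b": 0.4},
    "a": {EOS: 0.4, "a": 0.3, "b": 0.3},
    "b": {EOS: 0.9, "a": 0.05, "b": 0.05},
}

def next_log_probs(prefix):
    prev = prefix[-1] if prefix else None
    return {s: math.log(p) for s, p in TABLE[prev].items()}

def beam_search(next_log_probs, beam_width=2, max_len=20):
    live = [([], 0.0)]   # hypotheses still being expanded: (prefix, log-probability)
    complete = []        # hypotheses that ended with EOS
    k = beam_width
    for _ in range(max_len):
        if k <= 0 or not live:
            break
        # Expand every live hypothesis by every symbol in the vocabulary.
        candidates = [(prefix + [s], score + lp)
                      for prefix, score in live
                      for s, lp in next_log_probs(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        live = []
        for prefix, score in candidates[:k]:
            (complete if prefix[-1] == EOS else live).append((prefix, score))
        k = beam_width - len(complete)   # shrink the beam by the finished hypotheses
    if not complete:                     # max_len reached without any EOS
        complete = live
    return max(complete, key=lambda c: c[1])
```

With a beam width of 2 the sketch keeps both "a" and "b" after the first step and returns `["b", "</s>"]` (probability 0.36); with a beam width of 1 it reduces to greedy decoding and returns `["a", "</s>"]`.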
4 NPAD: Noisy Parallel Approximate Decoding
In this section, we introduce a strategy that can be used in conjunction with the two decoding strategies discussed earlier. This new strategy is motivated by the fact that a deep neural network, including a recurrent neural network, learns to stretch the input manifold (on which only likely input examples lie) and fill the hidden state space with it. This implies that a neighbourhood in the hidden state space corresponds to a set of semantically similar configurations in the input space, regardless of whether those configurations are close to each other in the input space [6]. In other words, small perturbation in the hidden space corresponds to jumping from one plausible configuration to another.
In the case of a conditional recurrent language model, we can achieve this behaviour of efficient exploration across multiple modes by injecting noise into the transition function of the recurrent neural network. In other words, we replace Eq. (5) with
$$h_t = \phi\left(h_{t-1} + \epsilon_t, E[x_t], f(Y, t)\right), \qquad (8)$$
where
$$\epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I),$$
and $f(Y, t)$ is the conditioning on the input $Y$ that enters the transition function in Eq. (5).
The time-dependent standard deviation $\sigma_t$ should be selected to reflect the uncertainty dynamics in the conditional recurrent language model. As the recurrent network models a target sequence in one direction, uncertainty is often greatest when predicting earlier symbols and gradually decreases as more and more context becomes available for the conditional distribution $p(x_t \mid x_{<t}, Y)$. This naturally suggests a strategy where we start with a high level of noise (high $\sigma_t$) and anneal it ($\sigma_t \to 0$) as the decoding progresses. One such scheduling scheme is
$$\sigma_t = \frac{\sigma_0}{t},$$
where $\sigma_0$ is an initial noise level. Although there are many alternatives, we find this simple formulation to be effective in the experiments later.
We run $M$ such noisy decoding processes in parallel. This can be done easily and efficiently, as there is no communication between these parallel processes except at the end of the decoding process. Let us denote by $\tilde{X}^m$ a sequence decoded from the $m$-th decoding process. Among these $M$ hypotheses, we select the one with the highest probability assigned by the non-noisy model:
$$\hat{X} = \arg\max_{\tilde{X}^m \in \{\tilde{X}^1, \ldots, \tilde{X}^M\}} \log p(\tilde{X}^m \mid Y).$$
We call this decoding strategy, based on running multiple parallel approximate decoding processes with noise injected, noisy parallel approximate decoding (NPAD).
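The whole strategy, with greedy search as the inner decoder, can be sketched as follows. Everything here is an assumption made for illustration: the tiny random `tanh` network, its sizes, the absence of an explicit conditioning input, and all function names. The chains are run one after another for simplicity; since they do not communicate, they could equally be dispatched to separate cores or machines. One chain is run without noise, as discussed under Quality Guarantee.

```python
import math
import random

def make_model(vocab_size=6, dim=4, seed=0):
    # Hypothetical toy parameters, not the network used in the experiments.
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(r)]
    return {"E": mat(vocab_size, dim), "W": mat(dim, dim), "V": mat(vocab_size, dim)}

def step(model, h_prev, x, noise=None):
    # Noisy transition: epsilon_t is added to h_{t-1}; noise=None gives the noise-free model.
    if noise is not None:
        h_prev = [h + e for h, e in zip(h_prev, noise)]
    return [math.tanh(sum(w * h for w, h in zip(row, h_prev)) + e)
            for row, e in zip(model["W"], model["E"][x])]

def next_log_probs(model, h):
    logits = [sum(v * hi for v, hi in zip(row, h)) for row in model["V"]]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def score(model, seq, bos=0):
    # Log-probability of seq under the noise-free model.
    h = step(model, [0.0] * len(model["W"]), bos)
    total = 0.0
    for x in seq:
        total += next_log_probs(model, h)[x]
        h = step(model, h, x)
    return total

def noisy_greedy(model, sigma0, rng, eos=1, bos=0, max_len=10):
    dim = len(model["W"])
    h_prev, x, seq = [0.0] * dim, bos, []
    for t in range(1, max_len + 1):
        sigma_t = sigma0 / t                      # annealed noise level
        noise = [rng.gauss(0.0, sigma_t) for _ in range(dim)]
        h = step(model, h_prev, x, noise)
        lp = next_log_probs(model, h)
        x = max(range(len(lp)), key=lp.__getitem__)
        seq.append(x)
        if x == eos:
            break
        h_prev = h
    return seq

def npad(model, num_chains=8, sigma0=0.5, seed=0):
    rng = random.Random(seed)
    sigmas = [0.0] + [sigma0] * (num_chains - 1)  # first chain is noise-free
    candidates = [noisy_greedy(model, s, rng) for s in sigmas]
    return max(candidates, key=lambda seq: score(model, seq))
```

Because the candidates are rescored by the noise-free model and the noise-free chain is always among them, the returned sequence is never less likely than the plain greedy result.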
Computational Complexity
Clearly, the proposed decoding strategy is $M$ times more expensive, i.e., $O(MD)$, where $D$ is the computational complexity of either greedy or beam search (see Sec. 3). It is however important to note that the proposed NPAD is embarrassingly parallelizable, which is well suited for distributed and parallel environments of modern computing. By utilizing multi-core machines, the practical cost of computation reduces to simply running the greedy or beam search once (up to a constant multiplicative factor due to computing the non-noisy score and generating pseudo-random numbers). This is in contrast to, for instance, the comparison between beam search and greedy search, in which case the benefit from parallelization is limited due to the heavy communication cost at each step.
Quality Guarantee
A major issue with the proposed strategy is that the resulting sequence may be worse than that from a single run of the inner decoder, due to the stochasticity. This is however easily avoided by setting $\sigma_0$ to $0$ for one of the $M$ decoding processes. By doing so, even if all the other noisy decoding processes result in sequences whose probabilities are worse than that of the non-noisy process, the proposed strategy nevertheless returns a sequence that is as good as a single run of the inner decoding algorithm.
4.1 Why not Sampling?
The formulation of the conditional recurrent language model in Eq. (4) implies that we can generate exact samples from the model, as this is a directed acyclic graphical model. At each time step $t$, a sample from the categorical distribution given all the samples of the previous time steps (Eq. (6)) is generated. This procedure is repeated either for up to $T$ time steps or until another stopping criterion is met (e.g., the end-of-sequence symbol is sampled). Similarly to the proposed NPAD, we can run a set of these sampling procedures in parallel.
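The sampling baseline can be sketched as below. The next-symbol table is a hypothetical toy model (the conditioning input is assumed to be absorbed into it); several samples are drawn and the most likely one is kept, mirroring the way the parallel chains of the NPAD are combined.

```python
import math
import random

EOS = "</s>"
# Hypothetical next-symbol distributions, keyed by the previous symbol.
TABLE = {
    None: {"a": 0.6, "b": 0.4},
    "a": {EOS: 0.4, "a": 0.3, "b": 0.3},
    "b": {EOS: 0.9, "a": 0.05, "b": 0.05},
}

def sample_sequence(rng, max_len=20):
    prefix, score = [], 0.0
    while len(prefix) < max_len:
        dist = TABLE[prefix[-1] if prefix else None]
        symbols = list(dist)
        # Draw the next symbol from the categorical output distribution.
        symbol = rng.choices(symbols, weights=[dist[s] for s in symbols])[0]
        prefix.append(symbol)
        score += math.log(dist[symbol])
        if symbol == EOS:
            break
    return prefix, score

def best_of_samples(num_samples=50, seed=0):
    rng = random.Random(seed)
    samples = [sample_sequence(rng) for _ in range(num_samples)]
    return max(samples, key=lambda s: s[1])   # keep the most likely sample
```

Here all the randomness enters at the output layer, whereas the NPAD perturbs the hidden state and leaves the choice of the output symbol to the inner decoder.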
A major difference between this sampling-at-the-output and the proposed NPAD is that the NPAD exploits the hidden state space of a neural network in which the data manifold is highly linearized. In other words, training a neural network tends to fill up the hidden state space as much as possible with valid data points,^{2} and consequently any point in the neighbourhood of a valid hidden state ($h_t$ in Eq. (5)) should map to a plausible point in the output space. This is contrary to the actual output space, where only a fraction of the output space is plausible.
Later, we show empirically that it is indeed more efficient to sample in the hidden state space than in the output state space.
^{2} This behaviour can be further encouraged by regularizing the (approximate) posterior over the hidden state, for instance, as in variational autoencoders (see, e.g., [15, 11]).
4.2 Related Work
Perturb-and-MAP
Perturb-and-MAP [21] is an algorithm that reduces probabilistic inference, such as sampling, to energy minimization in a Markov random field (MRF) [20]. For instance, instead of Gibbs sampling, one can use the perturb-and-MAP algorithm to find multiple instances of configurations that minimize the perturbed energy function. Each instance of perturb-and-MAP works by first injecting noise into the energy function $E$ of the MRF, i.e., $\tilde{E}(x) = E(x) + \epsilon(x)$, followed by a maximum-a-posteriori (MAP) step, i.e., $\arg\min_x \tilde{E}(x)$.
The connection between perturb-and-MAP and the proposed NPAD is clear. Let us define the energy function of the conditional recurrent language model as its negative log-probability, i.e., $E(X) = -\log p(X \mid Y)$ (see Eq. (4)). Then, the noise injection to the hidden state in Eq. (8) is a process similar to injecting noise to the energy function. This connection arises from the fact that the NPAD and perturb-and-MAP share the same goal of “[giving] other low energy states the chance” [20].
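The perturb-then-minimize idea can be illustrated on a state space small enough to enumerate. The sketch below is not the algorithm of [21] as applied to MRFs; it assumes independent Gumbel noise on every state (the Gumbel-max trick), in which case the perturbed minimizer is an exact sample from the Gibbs distribution $p(x) \propto \exp(-E(x))$. All names are hypothetical.

```python
import math
import random

def gumbel(rng):
    # Standard Gumbel sample; redraw in the unlikely case u == 0 to avoid log(0).
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))

def perturb_and_map(energies, rng):
    # Perturb each state's energy, then take the minimum-energy state (the MAP step).
    perturbed = [e - gumbel(rng) for e in energies]
    return min(range(len(energies)), key=perturbed.__getitem__)

def empirical_distribution(energies, num_samples=20000, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(energies)
    for _ in range(num_samples):
        counts[perturb_and_map(energies, rng)] += 1
    return [c / num_samples for c in counts]
```

Low-energy states win most often, but every state has a chance of being selected, which is the behaviour the NPAD aims for when it perturbs the hidden state instead.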
Diverse Decoding
One can view the proposed NPAD as a way to generate a diverse set of likely solutions from a conditional recurrent language model. In [16], a variant of beam search was proposed, which modifies the scoring function at each time step of beam search to promote diverse decoding. This is done by penalizing low-ranked hypotheses that share a previous hypothesis. This approach is however only applicable to beam search and is not as parallelizable as the proposed NPAD. It should be noted that the NPAD and the diverse decoding can be used together.
Earlier, Batra et al. [3] proposed another approach that enables decoding multiple, diverse solutions from an MRF. This method decodes one solution at a time, while regularizing the energy function of an MRF with the diversity measure between the solution currently being decoded and all the previous solutions. Unlike perturb-and-MAP or the NPAD, this is a deterministic algorithm. A major downside to this approach is that it is inherently sequential. This makes it impractical, especially for neural machine translation, where the computational bottleneck in decoding is already the major obstacle to deployment.
5 Experiments: Attention-based Neural Machine Translation
5.1 Settings
In this paper, we evaluate the proposed noisy parallel approximate decoding (NPAD) strategy in attention-based neural machine translation. More specifically, we train an attention-based encoder-decoder network on the task of English-to-Czech translation and evaluate different decoding strategies.
The encoder is a single-layer bidirectional recurrent neural network with 1028 gated recurrent units (GRU, [10]).^{3} The decoder consists of an attention mechanism [1] and a recurrent neural network, again with 1028 GRUs. Both source and target words were projected to a 512-dimensional continuous space. We used the code from dl4mttutorial, available online,^{4} for training. Both source and target sentences were represented as sequences of BPE subword symbols [24].
We trained this model on a large parallel corpus of approximately 12M sentence pairs, available from WMT’15,^{5} for 2.5 weeks. During training, ADADELTA [27] was used to adaptively adjust the learning rate of each parameter, and the norm of the gradient was renormalized to a fixed threshold whenever it exceeded that threshold. The training run was early-stopped based on the validation perplexity using newstest2013 from WMT’15. The model is tested with two held-out sets, newstest2014 and newstest2015.^{6}
We closely followed the training and test strategies from [12], where more details can be found.
^{3} The number 1028 resulted from a typo; we originally intended to use 1024.
^{4} https://github.com/nyudl/dl4mttutorial/tree/master/session2
^{5} http://www.statmt.org/wmt15/translationtask.html
^{6} Due to the space constraint, we only report the result on newstest2014. We however observed the same trend on newstest2015 as on newstest2014.
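The gradient renormalization mentioned in the training setup can be sketched as follows. This is a generic illustration, not the code from the tutorial repository; the function name and the `threshold` argument are hypothetical, and the threshold is left as a free parameter.

```python
import math

def renormalize_gradient(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        # Keep the direction of the gradient, shrink its length to the threshold.
        return [g * threshold / norm for g in grad]
    return list(grad)
```

Gradients whose norm is already below the threshold pass through unchanged, so the rule only intervenes on unusually large updates.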
Evaluation Metric
The main evaluation metric is the negative conditional log-probability of a decoded sentence, where lower is better. Additionally, we use BLEU as a secondary evaluation metric. BLEU is a de facto standard metric for automatically measuring the translation quality of machine translation systems, for which higher is better.
5.2 Decoding Strategies
We evaluate five decoding strategies. We choose the strategies that have comparable computational complexity per core/machine, assuming multiple cores/machines are available. This selection left us with greedy search, beam search, stochastic sampling, diverse decoding and the proposed NPAD.
Greedy and Beam Search
Greedy and beam search are the most widely used decoding strategies in neural machine translation, as well as in conditional recurrent language models for other tasks. In the case of beam search, we test two beam widths, 5 and 10. We use the script made available at dl4mttutorial.
Stochastic Sampling
A naive baseline for injecting noise during decoding is to simply sample from the output distribution at each time step, instead of taking the top entries. We test three configurations, where 5, 10 or 50 such samplers are run in parallel.
Noisy Parallel Approximate Decoding (NPAD)
We extensively evaluate the NPAD by varying the number of parallel decoding processes (5, 10 or 50), the beam width (1, 5 or 10) and the initial noise level $\sigma_0$.
Diverse Decoding
We try the diverse decoding strategy from [16]. There is one hyperparameter, which we search over as suggested by the authors of [16], based on the validation set performance.^{7} Also, we vary the beam width (5 or 10). This is included as a deterministic counterpart to the NPAD.
^{7} Personal communication.
5.3 Results and Analysis
Effect of Noise Injection
First, we analyze the effect of noise injection by comparing the stochastic sampling and the proposed NPAD against the deterministic greedy decoding. In doing so, we used 50 parallel decoding processes for both stochastic sampling and NPAD. We varied the amount of initial noise as well.
In Table 1, we present both the average negative log-probability and BLEU for all the cases. As expected, the proposed NPAD improves upon the deterministic greedy decoding as well as the stochastic sampling strategy. It is important to notice that the improvement by the NPAD is significantly larger than that by the stochastic sampling, which confirms that it is more efficient and effective to inject noise in the hidden state of the neural network.
Effect of the Number of Parallel Chains
Next, we examine the effect of having more parallel decoding processes in the proposed NPAD. As we show in Table 2, the translation quality, in terms of both the average negative log-probability and BLEU, improves as more parallel decoding processes are used, and the NPAD does significantly better than the greedy strategy even with five chains. We observed this trend for all the other noise levels. This is an important observation, as it implies that the performance of decoding can easily be improved, without increasing the delay between receiving the input and returning the result, by simply adding more cores/machines.
NPAD with Beam Search
As described earlier, NPAD can be used with any other deterministic decoding strategy. Hence, we test it together with the beam search strategy. As in Table 3, we observe again that the proposed NPAD improves the deterministic strategy. However, as the beam search is already able to find a good solution, the improvement is much smaller than that against the greedy strategy.
In Table 3, we observe that the difference between the greedy and beam search strategies is much smaller when the NPAD is used as an outer loop. For instance, comparing greedy decoding and beam search with a beam width of 10, the differences without and with NPAD are 7.9617 vs. 0.7789 (NLL) and 1.66 vs. 0.43 (BLEU). This again confirms that the proposed NPAD has the potential to make neural machine translation more suitable for deployment in the real world.
NPAD vs Diverse Decoding
In Table 4, we present the result using the diverse decoding. The diverse decoding was proposed in [16] as a way to improve the translation quality, and accordingly, we present the best approaches based on the validation BLEU. Unlike what was reported in [16], we were not able to see any substantial improvement by the diverse decoding. This may be due to the fact that Li & Jurafsky [16] used additional translation/language models to rerank the hypotheses collected by diverse decoding. As those additional models are trained and selected for a specific application of machine translation, we find the proposed NPAD to be more generally applicable than the diverse decoding is. It is however worthwhile to note that the diverse decoding may also benefit from having the NPAD as an outer loop.
6 Conclusion and Future Work
In this paper, we have proposed a novel decoding strategy for conditional recurrent language models. The proposed strategy, called noisy, parallel approximate decoding (NPAD), exploits the hidden state space of a recurrent language model by injecting unstructured Gaussian noise at each transition. Multiple chains of this noisy decoding process are run in parallel without any communication overhead, which makes the NPAD appealing in practice.
We empirically evaluated the proposed NPAD against the widely used greedy and beam search as well as stochastic sampling and diverse decoding strategies. The empirical evaluation has confirmed that the NPAD indeed improves decoding, and this improvement is especially apparent when the inner decoding strategy, which can be any of the existing strategies, is more approximate. Using NPAD as an outer loop significantly closed the gap between fast, but more approximate greedy search and slow, but more accurate beam search, increasing the potential for deploying conditional recurrent language models, such as neural machine translation, in practice.
Future Work
We consider this work as a first step toward developing a better decoding strategy for recurrent language models. The success of this simple NPAD suggests a number of future research directions. First, thorough investigation into injecting noise during training should be done, not only in terms of learning and optimization (see, e.g., [4]), but also in the context of its influence on decoding. It is conceivable that there exists a noise injection mechanism during training that may fit better with the noise injection process during decoding (as in the NPAD). Second, we must study the relationship between different types and scheduling of noise in the NPAD, in addition to the white Gaussian noise with annealed variance investigated in this paper. Lastly, the NPAD should be validated on tasks other than neural machine translation, such as image/video caption generation and speech recognition (see, e.g., [9] and references therein).
Acknowledgments
KC thanks Facebook, Google (Google Faculty Award 2016) and NVidia (GPU Center of Excellence 2015-2016) for their support.
References
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR 2015, 2015.
[2] D. Bahdanau, D. Serdyuk, P. Brakel, N. R. Ke, J. Chorowski, A. Courville, and Y. Bengio. Task loss estimation for sequence prediction. arXiv preprint arXiv:1511.06456, 2015.
[3] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In Computer Vision–ECCV 2012, pages 1–16. Springer, 2012.
[4] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.
[5] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[6] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In Proceedings of The 30th International Conference on Machine Learning, pages 552–560, 2013.
[7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
[8] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.
[9] K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. Multimedia, IEEE Transactions on, 17(11):1875–1886, 2015.
[10] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
[11] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems (NIPS), 2015.
[12] O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL, 2016.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. The International Conference on Learning Representations (ICLR), 2015.
[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), number 2014, 2013.
[16] J. Li and D. Jurafsky. Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372, 2016.
[17] C. D. Manning and H. Schütze. Foundations of statistical natural language processing, volume 999. MIT Press, 1999.
[18] T. Mikolov. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[19] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. INTERSPEECH, 2:3, 2010.
[20] G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Reducing random sampling to optimization, with applications in computer vision. Advanced Structured Prediction, page 159, 2014.
[21] G. Papandreou and A. L. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 193–200. IEEE, 2011.
[22] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
[23] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[24] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[25] M. Sundermeyer, H. Ney, and R. Schlüter. From feedforward to recurrent LSTM neural networks for language modeling. TASLP, 23(3):517–529, 2015.
[26] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
[27] M. D. Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[28] B. Zoph, A. Vaswani, J. May, and K. Knight. Simple, fast noise-contrastive estimation for large RNN vocabularies. In NAACL, 2016.