In recent years, neural text generation using recurrent networks has witnessed rapid progress, quickly becoming the state-of-the-art paradigm in machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014), summarization (Rush et al., 2015; Ranzato et al., 2016), and image captioning (Vinyals et al., 2015; Xu et al., 2015). In the decoder of neural generation, beam search is widely employed to boost the output text quality, often leading to substantial improvement over greedy search (equivalent to beam size 1) in metrics such as BLEU or ROUGE; for example, Ranzato et al. (2016) reported +2.2 BLEU (on single reference) in translation and +3.5 ROUGE-2 in summarization, both using a beam of 10. Our own experiments on machine translation (see Sec. 5) show +4.2 BLEU (on four references) using a beam of 5.
However, unlike traditional beam search in phrase-based MT or shift-reduce parsing where all hypotheses finish in the same number of steps, here in neural generation, hypotheses can finish in vastly different numbers of steps. Once you find a completed hypothesis (by generating the </s> symbol), there are still other active hypotheses in the beam that can continue to grow, which might lead to better scores. Therefore when can you end the beam search? How (and when) can you guarantee that the returned hypothesis has the optimal score modulo beam size?
There have not been satisfying answers to these questions, and existing beam search strategies are heuristic methods that do not guarantee optimality. For example, the widely influential RNNsearch (Bahdanau et al., 2014) employs a “shrinking beam” method: once a completed hypothesis is found, the beam size shrinks by 1, and beam search finishes when the beam size shrinks to 0 or when the number of steps hits a hard limit. The best-scoring hypothesis among all completed ones encountered so far is returned. On the other hand, OpenNMT (Klein et al., 2017), whose PyTorch version will be the baseline in our experiments, uses a very different strategy: beam search terminates whenever the highest-ranking hypothesis in the current step is completed (which is also the one returned), without considering any other completed hypotheses. Neither of these two methods guarantees optimality of the returned hypothesis.
We therefore propose a novel and simple beam search variant that always returns the optimal-score completed hypothesis (modulo beam size), and finishes as soon as optimality is established. However, another well-known problem remains: the generated sentences are often too short compared to previous paradigms such as SMT (Shen et al., 2016). To alleviate this problem, previous efforts introduce length normalization (as a switch in RNNsearch) or length reward (He et al., 2016) borrowed from SMT (Koehn et al., 2007). Unfortunately these changes invalidate the optimality property of our proposed algorithm. So we introduce a bounded length reward mechanism which allows a modified version of our beam search algorithm to remain optimal. Experiments on neural machine translation demonstrate that our principled beam search algorithm leads to improvement in BLEU score over previously proposed alternatives.
2 Neural Generation and Beam Search
Here we briefly review neural text generation and then review existing beam search algorithms.
Assume the input sentence, document, or image is embedded into a vector $x$, from which we generate the output sentence $y$, which is a completed hypothesis (for simplicity we do not discuss bidirectional LSTMs and attentional mechanisms here, but our algorithms still work with those encoders; we have tested them):
$$y^* = \operatorname*{argmax}_{y:\,\mathit{comp}(y)} p(y \mid x) = \operatorname*{argmax}_{y:\,\mathit{comp}(y)} \prod_{i \le |y|} p(y_i \mid x, y_{<i})$$
where $y_{<i}$ is a popular shorthand notation for the prefix $y_0 y_1 \ldots y_{i-1}$. We say that a hypothesis $y$ is completed, notated $\mathit{comp}(y)$, if its last word is </s>, i.e.,
$$\mathit{comp}(y) \triangleq (y_{|y|} = \text{</s>})$$
in which case it will not be further expanded.
A crucial difference in RNN-based neural generation compared to previous paradigms such as phrase-based MT is that we no longer decompose $p(y \mid x)$ into the translation model, $p(x \mid y)$, and the language model, $p(y)$, and more importantly, we no longer approximate the latter by $n$-gram models. This ability to model arbitrarily long histories using RNNs is an important reason for NMT’s substantially improved fluency compared to SMT.
To (approximately) search for the best output $y^*$, we use beam search, where the beam $B_i$ at step $i$ is an ordered list of size (at most) $b$, and expands to the next beam $B_{i+1}$ of the same size:
$$B_0 = [\langle \text{<s>},\ p(\text{<s>} \mid x) \rangle]$$
$$B_i = \operatorname{top}^b \{ \langle y' \circ y_i,\ s \cdot p(y_i \mid x, y') \rangle \mid \langle y', s \rangle \in B_{i-1} \}$$
where the notation $\operatorname{top}^b S$ selects the top $b$ scoring items from the set $S$, and each item is a pair $\langle y, s \rangle$ where $y$ is the current prefix and $s$ is its accumulated score (i.e., product of probabilities).
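The expansion step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `expand_beam` and `toy_next_probs` are hypothetical names, the "model" is a fixed next-word distribution rather than an RNN, and scores are accumulated as log-probabilities (equivalent to the product of probabilities, but numerically safer).

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
Item = Tuple[Prefix, float]  # (prefix, accumulated log-probability)
EOS = "</s>"


def expand_beam(beam: List[Item],
                next_probs: Callable[[Prefix], Dict[str, float]],
                b: int) -> List[Item]:
    """One beam-search step: extend each unfinished prefix by one word, keep the top b."""
    candidates: List[Item] = []
    for prefix, score in beam:
        if prefix[-1] == EOS:
            continue  # a completed hypothesis is not expanded further
        for word, p in next_probs(prefix).items():
            # adding log-probabilities corresponds to multiplying probabilities
            candidates.append((prefix + (word,), score + math.log(p)))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:b]


def toy_next_probs(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical model: the same next-word distribution after every prefix."""
    return {"a": 0.6, "b": 0.3, EOS: 0.1}


beam_0 = [(("<s>",), 0.0)]  # log-probability of the start symbol is taken to be 0
beam_1 = expand_beam(beam_0, toy_next_probs, b=2)
beam_2 = expand_beam(beam_1, toy_next_probs, b=2)
```

With a beam of 2, the first step keeps the prefixes ending in `a` and `b` and discards the immediately completed hypothesis, which is exactly the kind of pruning that the phrase "modulo beam size" refers to.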
3 Optimal Beam Search (modulo beam size)
We propose a very simple method to optimally finish beam search, which guarantees the returned hypothesis is the highest-scoring completed hypothesis modulo beam size; in other words, we will finish as soon as an “optimality certificate” can be established that future hypotheses will never score better than the current best one.
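Let $\mathit{best}_{\le i}$ be the best completed hypothesis so far up to step $i$, i.e.,
$$\mathit{best}_{\le i} \triangleq \max \{ y \in \cup_{j \le i} B_j \mid \mathit{comp}(y) \}$$
We update it every time we find a completed hypothesis (if there is none yet, then it remains undefined). Now at any step $i$, if $\mathit{best}_{\le i}$ is defined, and the highest scoring item $B_{i,1}$ in the current beam $B_i$ scores worse than or equal to $\mathit{best}_{\le i}$, i.e., when
$$B_{i,1} \le \mathit{best}_{\le i}$$
we claim the optimality certificate is established, and terminate beam search, returning $\mathit{best}_{\le i}$ (here items are compared by score, and smaller means worse, since we aim for the highest-probability completed hypothesis).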
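A minimal sketch of this stopping rule follows. It is not the released implementation: `optimal_beam_search` and `TOY_MODEL` are hypothetical, the model is a hand-written table of next-word probabilities, and scores are log-probabilities. Prefixes absent from the table are treated as having no continuations.

```python
import math
from typing import Dict, List, Optional, Tuple

Prefix = Tuple[str, ...]
Item = Tuple[Prefix, float]  # (prefix, accumulated log-probability)
EOS = "</s>"

# Hypothetical toy model: next-word probabilities keyed by prefix.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"c": 0.5, "d": 0.5},
    ("<s>", "b"): {EOS: 0.7, "c": 0.3},
    ("<s>", "a", "c"): {EOS: 0.9, "e": 0.1},
    ("<s>", "a", "d"): {EOS: 0.5, "e": 0.5},
}


def optimal_beam_search(model: Dict[Prefix, Dict[str, float]],
                        b: int,
                        max_steps: int = 50) -> Tuple[Optional[Item], int]:
    """Return (best completed item, number of steps taken)."""
    beam: List[Item] = [(("<s>",), 0.0)]
    best: Optional[Item] = None  # best completed hypothesis found so far
    for step in range(1, max_steps + 1):
        candidates = [
            (prefix + (word,), score + math.log(p))
            for prefix, score in beam if prefix[-1] != EOS
            for word, p in model.get(prefix, {}).items()
        ]
        if not candidates:
            return best, step - 1  # nothing left to expand
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:b]
        # Update the best completed hypothesis with any that survived pruning.
        for item in beam:
            if item[0][-1] == EOS and (best is None or item[1] > best[1]):
                best = item
        # Optimality certificate: the top item is no better than the best
        # completed hypothesis, so no descendant can beat it either.
        if best is not None and beam[0][1] <= best[1]:
            return best, step
    return best, max_steps


result, steps = optimal_beam_search(TOY_MODEL, b=3)
```

On this toy table the completed hypothesis `<s> b </s>` (probability 0.28) is found at step 2 while the top item still scores 0.3, so the search continues for one more step before the certificate holds.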
Theorem 1 (optimality).
When our beam search algorithm terminates, the current best completed hypothesis (i.e., $\mathit{best}_{\le i}$) is the highest-probability completed hypothesis (modulo beam size).
Proof. If $B_{i,1} \le \mathit{best}_{\le i}$ then $B_{i,j} \le B_{i,1} \le \mathit{best}_{\le i}$ for all items $B_{i,j}$ in beam $B_i$. Future descendants grown from these items can only be no better, since every probability is at most 1, so all items in current and future steps are no better than $\mathit{best}_{\le i}$. ∎
Theorem 2 (early stopping).
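Our beam search algorithm terminates no later than OpenNMT’s termination criterion (when $B_{i,1}$ is completed).
Proof. When $B_{i,1}$ is itself completed, $\mathit{best}_{\le i} = \max\{B_{i,1}, \mathit{best}_{\le i-1}\} \ge B_{i,1}$, so our stopping criterion is also met. ∎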
Theorem 2 shows that our search stops as soon as the optimality certificate is established, exploring no more items than OpenNMT’s default search. Also note that the latter, even though it explores more items than ours, can still return suboptimal solutions, e.g., when the completed top item $B_{i,1}$ is worse than an earlier completed hypothesis (OpenNMT never stores the best completed hypothesis found so far). In practice, we noticed that our search finishes about 3–5 steps earlier than OpenNMT at a beam of 10, and this advantage widens as beam size increases, although the overall speedup is not too noticeable, since target sentences are much longer than these few steps. Also, our model scores (i.e., log-probabilities) are indeed better (see Fig. 1), and the advantage is more pronounced with larger beams (the OpenNMT baseline flattens out as the beam grows, while our optimal beam search still steadily improves). Combining the two theorems, it is interesting to note that our method is not just optimal but also faster.
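Both observations can be reproduced on a small hand-built example. The sketch below runs a single beam search and records when each stopping rule would fire. It rests on assumptions: `compare_stopping_rules` and `TOY_MODEL` are hypothetical, and the "top item completed" rule is a simplified reading of the OpenNMT criterion as described in the text, not OpenNMT's actual code.

```python
import math
from typing import Dict, List, Optional, Tuple

Prefix = Tuple[str, ...]
Item = Tuple[Prefix, float]  # (prefix, accumulated log-probability)
EOS = "</s>"

# Hypothetical toy model: next-word probabilities keyed by prefix.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"c": 0.9, "d": 0.1},
    ("<s>", "b"): {EOS: 1.0},
    ("<s>", "a", "c"): {"e": 0.5, "f": 0.5},
    ("<s>", "a", "c", "e"): {EOS: 0.9, "g": 0.1},
    ("<s>", "a", "c", "f"): {EOS: 0.8, "g": 0.2},
}


def compare_stopping_rules(model: Dict[Prefix, Dict[str, float]],
                           b: int,
                           max_steps: int = 50):
    """Run one beam search and record when each stopping rule would fire.

    Returns (certificate, top_completed); each is (step, item) or None.
    """
    beam: List[Item] = [(("<s>",), 0.0)]
    best: Optional[Item] = None
    certificate = None
    for step in range(1, max_steps + 1):
        candidates = [
            (prefix + (word,), score + math.log(p))
            for prefix, score in beam if prefix[-1] != EOS
            for word, p in model.get(prefix, {}).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:b]
        for item in beam:
            if item[0][-1] == EOS and (best is None or item[1] > best[1]):
                best = item
        top = beam[0]
        # Rule 1: optimality certificate (top item no better than best so far).
        if certificate is None and best is not None and top[1] <= best[1]:
            certificate = (step, best)
        # Rule 2: stop only when the top item itself is completed, and return it.
        if top[0][-1] == EOS:
            return certificate, (step, top)
    return certificate, None


certificate, top_completed = compare_stopping_rules(TOY_MODEL, b=2)
```

Here the certificate holds at step 3 and returns `<s> b </s>` (probability 0.4), whereas the top-completed rule runs until step 4 and returns `<s> a c e </s>` (probability 0.243), which is both later and lower-scoring.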
4 Optimal Beam Search for Bounded Length Reward
However, the optimal-score hypothesis, though satisfying in theory, is not ideal in practice, since neural models are notorious for producing very short sentences, as opposed to older paradigms such as SMT (Shen et al., 2016). To alleviate this problem, two methods have been proposed: (a) length normalization, used in RNNsearch as an option, where the score of a hypothesis is divided by its length, thus favoring longer sentences; and (b) explicit length reward (He et al., 2016) borrowed from SMT, rewarding each generated word by a constant $r$ tuned on the dev set.
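As a small illustration of how the two revisions change the ranking, the sketch below scores a short and a long hypothesis under each. The function names and the example numbers are made up for illustration; `r` stands for the per-word reward constant, and lengths are counted in generated words.

```python
def length_normalized(logprob: float, length: int) -> float:
    """Method (a): divide the log-probability by the hypothesis length."""
    return logprob / length


def length_rewarded(logprob: float, length: int, r: float) -> float:
    """Method (b): add a constant reward r for every generated word."""
    return logprob + r * length


# Hypothetical hypotheses as (log-probability, length).
short = (-4.0, 4)
long_ = (-6.0, 8)

raw_prefers_short = short[0] > long_[0]
norm_prefers_long = length_normalized(*long_) > length_normalized(*short)
reward_prefers_long = length_rewarded(*long_, r=1.0) > length_rewarded(*short, r=1.0)
```

Under the raw score the shorter hypothesis wins, while both revised scores prefer the longer one; a longer hypothesis that appears later in the search can therefore overtake an earlier completed one.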
Unfortunately, each of these methods breaks the optimality proof of our beam search algorithm in Section 3, since a future hypothesis, being longer, might end up with a higher (revised) score. We therefore devise a novel mechanism called “bounded length reward”, that is, we reward each word until the length of the hypothesis is longer than the “estimated optimal length”. In machine translation and summarization, this optimal length $L$ can be $gr \cdot |x|$, where $|x|$ is the source sentence length, and $gr$ is the average ratio of reference translation length over source sentence length on the dev set (in our Chinese-to-English NMT experiments, it is 1.27 as the English side is a bit longer). Note that we use the same $gr$ estimated from dev on test, assuming that the optimal length ratio for test (which we do not know) should be similar to that of dev. We denote $\tilde{sc}(y)$ to be the revised score of hypothesis $y$ with the bounded length reward, i.e.,
$$\tilde{sc}(y) \triangleq sc(y) + r \cdot \min\{L, |y|\},$$
where $sc(y)$ is the original score (log-probability) of $y$ and $r$ is the reward per generated word.
We also define $\widetilde{\mathit{best}}_{\le i}$ to be the revised version of $\mathit{best}_{\le i}$ that optimizes the revised instead of the original score, i.e.,
$$\widetilde{\mathit{best}}_{\le i} \triangleq \operatorname*{argmax}_{y \in \cup_{j \le i} B_j,\ \mathit{comp}(y)} \tilde{sc}(y)$$
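The bounded reward itself is a one-line change to the score. The sketch below is illustrative only: the function names are hypothetical, `gr` is the dev-set length ratio (1.27 in the experiments reported here), and the source length of 10 and reward of 1.2 are example inputs.

```python
def estimated_optimal_length(source_length: int, gr: float) -> float:
    """Estimated optimal target length: dev-set length ratio times source length."""
    return gr * source_length


def bounded_reward_score(logprob: float, length: int, r: float, L: float) -> float:
    """Log-probability plus a per-word reward that stops growing once length exceeds L."""
    return logprob + r * min(L, length)


L = estimated_optimal_length(source_length=10, gr=1.27)  # about 12.7 words

# The same log-probability at three lengths: the reward grows up to L, then is capped.
below = bounded_reward_score(-10.0, 8, r=1.2, L=L)
just_above = bounded_reward_score(-10.0, 13, r=1.2, L=L)
far_above = bounded_reward_score(-10.0, 20, r=1.2, L=L)
```

Because the reward is capped at `r * L`, the total reward any descendant can still collect is bounded, which is what makes a stopping certificate possible again.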
Now with bounded length reward, we can modify our beam search algorithm a little bit and still guarantee optimality. First we include in the revised cost a reward $r$ for each generated word, as long as the length is less than $L$, the estimated optimal length. Writing $sc(\cdot)$ for the original score and $\tilde{sc}(\cdot)$ for the revised score (i.e., including bounded length reward), if at step $i$ the revised score of the highest scoring item $B_{i,1}$, plus the heuristic “future” extra length reward of a descendant, $r \cdot \max\{L - i, 0\}$, is worse than (or equal to) the similarly revised version of $\mathit{best}_{\le i}$, i.e.,
$$\tilde{sc}(B_{i,1}) + r \cdot \max\{L - i, 0\} \le \tilde{sc}(\widetilde{\mathit{best}}_{\le i}),$$
then we claim the revised optimality certificate is established, terminate the beam search, and return $\widetilde{\mathit{best}}_{\le i}$.
Actually with some trivial math we can simplify the stopping criterion to
$$sc(B_{i,1}) + r \cdot L \le \tilde{sc}(\widetilde{\mathit{best}}_{\le i}).$$
This much simplified but still equivalent criterion can speed up decoding in practice, since it means we do not need to compute the revised score for every hypothesis in the beam; we only need to add the bounded length reward when a hypothesis is finished (i.e., when updating $\widetilde{\mathit{best}}_{\le i}$), and the simplified criterion only compares it with the original score of a hypothesis plus a constant reward $r \cdot L$.
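The simplification can be spelled out as follows, assuming (as in the beam definition) that every item in the beam at step $i$ has length $i$, so that the revised score of the top item is its original score plus $r \cdot \min\{L, i\}$:

```latex
\begin{align*}
\tilde{sc}(B_{i,1}) + r \cdot \max\{L - i, 0\}
  &= sc(B_{i,1}) + r \cdot \min\{L, i\} + r \cdot \max\{L - i, 0\} \\
  &= sc(B_{i,1}) + r \cdot \bigl(\min\{L, i\} + \max\{L - i, 0\}\bigr) \\
  &= sc(B_{i,1}) + r \cdot L
\end{align*}
```

The last step holds in both cases: if $i \le L$ the bracket is $i + (L - i) = L$, and if $i > L$ it is $L + 0 = L$.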
Theorem 3 (modified optimality).
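Our modified beam search returns the highest-scoring completed hypothesis where the score of an item is its log-probability plus a bounded length reward.
Proof. By admissibility of the heuristic: no descendant of an item at step $i$ can collect more than $r \cdot \max\{L - i, 0\}$ in additional length reward, and its log-probability can only decrease, so the left-hand side of the stopping criterion upper-bounds the revised score of every future hypothesis. ∎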
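Putting the pieces of this section together, a sketch of the modified search might look as follows. It is an illustration under assumptions, not the released code: `bounded_reward_beam_search` and `TOY_MODEL` are hypothetical, scores are log-probabilities, and length is counted as the step index, so all items in one beam have the same length and can be ranked by their original scores.

```python
import math
from typing import Dict, List, Optional, Tuple

Prefix = Tuple[str, ...]
Item = Tuple[Prefix, float]  # (prefix, accumulated log-probability)
EOS = "</s>"

# Hypothetical toy model: next-word probabilities keyed by prefix.
TOY_MODEL: Dict[Prefix, Dict[str, float]] = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"c": 0.9, "d": 0.1},
    ("<s>", "b"): {EOS: 1.0},
    ("<s>", "a", "c"): {"e": 0.5, "f": 0.5},
    ("<s>", "a", "c", "e"): {EOS: 0.9, "g": 0.1},
    ("<s>", "a", "c", "f"): {EOS: 0.8, "g": 0.2},
}


def bounded_reward_beam_search(model: Dict[Prefix, Dict[str, float]],
                               b: int, r: float, L: float,
                               max_steps: int = 50) -> Tuple[Optional[Item], float]:
    """Return (best completed item, its revised score) under bounded length reward."""
    beam: List[Item] = [(("<s>",), 0.0)]
    best: Optional[Item] = None
    best_revised = -math.inf
    for step in range(1, max_steps + 1):
        candidates = [
            (prefix + (word,), score + math.log(p))
            for prefix, score in beam if prefix[-1] != EOS
            for word, p in model.get(prefix, {}).items()
        ]
        if not candidates:
            break
        # Items in one step share the same length, so original scores rank them.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:b]
        for prefix, logprob in beam:
            if prefix[-1] == EOS:
                # The reward is only added when a hypothesis is finished.
                revised = logprob + r * min(L, step)
                if revised > best_revised:
                    best, best_revised = (prefix, logprob), revised
        # Simplified certificate: original top score plus the constant r * L.
        if best is not None and beam[0][1] + r * L <= best_revised:
            break
    return best, best_revised


with_reward, revised_score = bounded_reward_beam_search(TOY_MODEL, b=2, r=1.0, L=3)
without_reward, _ = bounded_reward_beam_search(TOY_MODEL, b=2, r=0.0, L=3)
```

With `r = 0` the search reduces to the plain optimal-ending search and returns the short hypothesis `<s> b </s>`; with `r = 1.0` and `L = 3` it keeps going and returns the longer `<s> a c e </s>`, whose revised score is higher.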
5 Experiments: Neural Translation
5.1 Data Preparation, Training, and Baselines
Figure: (a) BLEU vs. beam size; (b) length ratio vs. beam size.
Our experiments are based on OpenNMT-py, the PyTorch version of OpenNMT (Klein et al., 2017). We choose this library because PyTorch’s combination of Python with Torch’s dynamic computation graphs made it much easier to implement various search algorithms on it than on Theano-based implementations derived from RNNsearch (Bahdanau et al., 2014), such as the widely used GroundHog (https://github.com/lisa-groundhog/) and Laulysta (https://github.com/laulysta/nmt/) codebases, as well as the original LuaTorch version of OpenNMT. We use 1M Chinese/English sentence pairs for training (see Table 1 for statistics); we also trained on 2M sentence pairs and only saw a minor improvement, so below we report results from 1M training. To alleviate the vocabulary size issue we employ byte-pair encoding (BPE) (Sennrich et al., 2015), which reduces the source and target language vocabulary sizes to 18k and 10k, respectively; we found BPE to significantly improve BLEU scores (by at least +2 BLEU) and reduce training time. Following other papers on Chinese-English translation such as Shen et al. (2016), we use the NIST 06 newswire portion (616 sentences) for development and the NIST 08 newswire portion (691 sentences) for testing; we report case-insensitive 4-reference BLEU-4 scores (using original segmentation).
Following OpenNMT-py’s default settings, we train our NMT model for 20 epochs to minimize perplexity on the training set (excluding the 15% of sentences longer than 50 source tokens), with a batch size of 64, word embedding size of 500, and dropout rate of 0.3. The total number of parameters is 29M. Training takes about an hour per epoch on a GeForce 980 Ti GPU, and the model at epoch 15 reaches the lowest perplexity on the dev set (9.10), so it is chosen as the model for testing.
On the dev set, the default decoder of OpenNMT-py reaches 29.2 BLEU with beam size 1 (greedy) and 33.2 BLEU with the default beam size of 5. To put this in perspective, the most commonly used SMT toolkit Moses (Koehn et al., 2007) reaches 30.1 BLEU (with beam size 70) using the same 1M sentence training set (trigram language model trained on the target side). With 2.56M training sentence pairs, Shen et al. (2016) reported 32.7 BLEU on the same dev set using Moses and 30.7 BLEU using the baseline RNNsearch (GroundHog) with beam size 10 (without BPE, without length normalization or length reward). So our OpenNMT-py baseline is extremely competitive.
5.2 Beam Search & Bounded Length Reward
We compare the following beam search variants:
Notice that length reward has no effect on either method 1 or method 2(a) above. To tune the optimal length reward $r$ we run our modified optimal-ending beam search algorithm with all combinations of $r$ and beam size $b$ on the dev set, since different beam sizes might prefer different length rewards. We found $r=1.2$ to be the best among all length rewards (see Table 2), which is the value used in Figure 2, and $b=15$ is the best beam size for $r=1.2$.
We can observe from Figure 2 that (a) our optimal beam search with bounded length reward performs the best, and at $b=15$ it is +5 BLEU better than at $b=1$; (b) pure optimal beam search degrades after $b=4$ due to extremely short translations; (c) both the shrinking beam method with length normalization and OpenNMT-py’s default search alleviate the shortening problem, but still produce very short translations (length ratio 0.9); (d) the shrinking beam method with length reward works well, but is still 0.3 BLEU below our best method. These observations are confirmed on the test set (Tab. 3).
| decoder | $b$ | dev BLEU | test BLEU |
|---|---|---|---|
| shrinking, len. norm. | 17 | 33.71 | 30.11 |
| shrinking, reward $r=1.3$ | 15 | 34.42 | 30.37 |
| optimal beam search, $r=1.2$ | 15 | 34.70 | 30.61 |
We have presented a beam search algorithm for neural sentence generation that always returns optimal-score completed hypotheses. To counter neural generation’s natural tendency for shorter hypotheses, we introduced a bounded length reward mechanism which allows a modified version of our beam search algorithm to remain optimal. Experiments on top of strong baselines have confirmed that our principled search algorithms (together with our bounded length reward mechanism) outperform existing beam search methods in terms of BLEU scores. We will release our implementations (which will hopefully be merged into OpenNMT-py) when this paper is published. (While implementing our search algorithms we also found and fixed an obscure but serious bug in OpenNMT-py’s baseline beam search code, not related to discussions in this paper, which boosts BLEU scores by about +0.7 in all cases. We will release this fix as well.)
We thank the anonymous reviewers from both EMNLP and WMT for helpful comments. This work is supported in part by NSF IIS-1656051, DARPA N66001-17-2-4030 (XAI), a Google Faculty Research Award, and HP.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 .
- He et al. (2016) Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, pages 151–157.
- Kalchbrenner and Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In EMNLP. volume 3, page 413.
- Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints .
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, pages 177–180.
- Ranzato et al. (2016) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. ICLR.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685 .
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 .
- Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of ACL.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. pages 3104–3112.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In . pages 3156–3164.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 .
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML. volume 14, pages 77–81.