The simplest decoding strategy is to always choose the word with the highest probability at each time step. This greedy approach does not necessarily yield the most likely sequence and is prone to grammatical errors in the output sequence. Beam search (BS), on the other hand, maintains the top-scoring successors at each time step and then scores all expanded sequences to choose one to output. BS has shown good results in many sequence generation tasks and has been the most popular decoding strategy so far. Although beam search considers the whole sequence for scoring, it uses only the current node to decide which nodes to expand and does not consider the possible future of a node before expanding it. This incompleteness in the search leads to sub-optimal results.
When speaking or writing, we do not just consider the last word we generated to choose the next word; we also consider what we want to say or write in the future. Taking future output into account is crucial to improving sequence generation. For example, Monte-Carlo Tree Search (MCTS) focuses on the analysis of promising moves and has achieved great success in game play, e.g., Go. An MCTS-based strategy predicts the next action by carrying out several rollouts from the present time step and calculating the reward for each rollout using a trained value network. It then makes the decision at each time step by choosing the action that leads to the highest average future reward. However, MCTS requires training a separate value network and needs more run-time for the simulations. It is not practical to run MCTS to decode sequences with a large vocabulary and many time steps. Instead of applying MCTS to sequence decoding, we propose a new k-step look-ahead (k-LA) module that does not need an external value network and has a practical run-time. In particular, our proposed look-ahead module can be plugged into the decoding phase of any existing sequence model to improve the inference results.
Figure 1 illustrates how k-LA chooses a word at a given time step. Starting from that time step, we expand every word until the search tree reaches the look-ahead depth. For each word in the tree rooted at the current time step, we can compute the likelihood of extending that word from its parent using the pretrained sequence model. To select a word at the current time step, we choose the word whose sub-tree has the highest accumulated probability, i.e., the highest expected likelihood in the future.
We test the proposed k-step look-ahead module on three datasets of increasing difficulty: IM2LATEX OCR, WMT16 multimodal English-German translation, and WMT14 English-German machine translation. Our results show that the look-ahead module improves decoding on IM2LATEX and WMT16 but is only marginally better than the greedy search on WMT14. Our analysis suggests that the more difficult datasets (usually containing longer sequences, e.g., WMT14 and the subset of WMT16 whose target sequences are longer than 25 words) suffer from an overestimated end-of-sentence (EOS) probability. The overestimated EOS probability encourages the sequence decoder to favor short sequences. Even with the look-ahead module, the decoder cannot recover from that bias. To fix the EOS problem, we use an auxiliary EOS loss in training to obtain a more accurate EOS estimate. We show that training the model with the auxiliary EOS loss not only improves the performance of the look-ahead module but also makes the beam search more robust.
This work makes a number of contributions. We show how to incorporate future information to improve decoders using only pretrained sequence models. Our analysis with the proposed decoder also helps us pinpoint issues in the pretrained sequence model and then fix the model. We expect that looking into decoders and models together can provide a better picture of sequence generation results and help design more robust sequence models and training frameworks.
2 Related Work
Learning to search with look-ahead cues: Reinforcement learning (RL) techniques, especially value networks, are often used to incorporate hypothetical future information into predictions. One line of work trains policy and value networks by RL but allows the value network to also take the correct output as input so that the policy can optimize for BLEU scores directly. Another trains image captioning policy and value networks using actor-critic methods; the authors found that the global guidance introduced by the value network greatly improves performance over an approach that uses only a policy network. Self-play and MCTS have also been applied to train policy and value networks for Go, showing that MCTS is a powerful policy evaluation method.
Augmenting information in training sequence models: One approach uses an auxiliary reward to improve a decoder trained with maximum likelihood, defining the auxiliary reward as the negative edit distance between the predicted sentences and the ground truth labels. Another optimizes seq2seq models based on edit distance instead of maximizing the likelihood and shows improvements on a speech recognition dataset. A third focuses on improving the decoder by alleviating the mismatch between training and testing, introducing a search-based loss that directly optimizes the network for beam search decoding.
Sequence modeling errors: Prior work analyzes the machine translation decoder by enumerating all possible predicted sequences and choosing the sequence with the highest likelihood as the decoded output. The results demonstrate that the neural machine translation model assigns its best score to the empty sentence for over 50% of inference sentences. Other work argues that seq2seq models suffer from overestimated word probabilities in the training stage and proposes to solve the issue using label smoothing.
In the rest of the paper, we evaluate the proposed approaches on three datasets: the IM2LATEX-100K OCR dataset, the WMT16 Multimodal English-German (EN-DE) machine translation dataset, and the WMT14 English-German (EN-DE) machine translation dataset. In IM2LATEX-100K, the input is an image and the goal is to generate the corresponding LaTeX equation. The dataset is split into a training set (83,883 equations), a validation set (9,319 equations), and a test set (10,354 equations). The average length of the target LaTeX equations is 64.86 characters per equation. The WMT16 multimodal dataset consists of 29,000 EN-DE training pairs, 1,014 validation pairs, and 1,000 testing pairs. Each EN-DE pair describes an image. The average length of the testing target sentences is 12.39 words per sentence. In this paper, we do not use the image information. The WMT14 EN-DE machine translation dataset consists of 4,542,486 training pairs and 1,014 validation pairs. We train on the WMT14 data but evaluate the model on the newstest2017 dataset, which consists of 3,004 testing pairs. The average length of the target sequences in newstest2017 is 28.23 words per sentence, which is much longer than the average length in the WMT16 translation dataset. The longer target sequences make WMT14 a more difficult task than WMT16.
4 Look-ahead Prediction
We present a look-ahead prediction module to take advantage of future cues. The proposed look-ahead module is based on depth-first search (DFS) rather than Monte-Carlo Tree Search (MCTS). In the DFS-based look-ahead module, we are able to prune negligible paths and nodes whose probability is too small to lead to the largest probability. In contrast, an MCTS-based method requires many samples to estimate the expected probability of each node. To compare the actual execution time of the two look-ahead methods, we test both on the transformer model trained on the WMT14 dataset. We run the experiment on a Tesla V100 GPU with 500 input sentences and set the number of look-ahead steps to 3 for both search strategies. In the MCTS setting, we perform 20 rollouts at each time step, and the average execution time is 32.47 seconds per sentence. For the DFS-based method, the average execution time is 0.60 seconds per sentence. To make the look-ahead module more practical, we choose the DFS-based look-ahead module as our node expansion strategy.
Figure 1 illustrates our proposed DFS look-ahead module, and Algorithm 1 gives its pseudo-code. Given a pretrained sequence model and a vocabulary of size |V|, we can expand a tree from the current time step to k steps ahead in the k-step look-ahead setting. The height of the tree is k. For example, in the 2-step look-ahead setting, there are |V| nodes at height 1 and |V|^2 leaf nodes at height 2. At the current time step, we select the word that has the maximum sum of log-likelihoods along the path from height 1 to the leaf nodes. We repeat this operation at each time step until we predict the EOS token. Although the time complexity of the DFS is O(|V|^k), we can prune many insignificant paths in the tree. At line 9 in Algorithm 1, we stop the DFS early when the current cumulative log-probability is smaller than the maximum sum of log-probabilities encountered so far. Because we sort the log-probabilities before performing the DFS, we can prune many paths that cannot be the optimal path in the expanded tree. By using the information from future words in the prediction, we can select in advance the word that leads to the largest probability.
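The look-ahead selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the names `lookahead_step`, `toy_model`, and `TOY_TABLE`, the dictionary-based model interface, and the choice to treat a path that reaches EOS before depth k as a finished leaf are all hypothetical choices made for the example.

```python
import math
from typing import Callable, Dict, Optional, Tuple

# Assumed interface: maps a prefix (tuple of tokens) to next-token log-probabilities.
LogProbFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def lookahead_step(log_prob_fn: LogProbFn, prefix: Tuple[str, ...], k: int,
                   eos: str = "</s>") -> Optional[str]:
    """Pick the next token whose best k-step continuation has the highest
    cumulative log-probability, using DFS with pruning."""
    best_score = -math.inf
    best_first: Optional[str] = None

    def dfs(path: Tuple[str, ...], score: float, depth: int,
            first: Optional[str]) -> None:
        nonlocal best_score, best_first
        # A leaf is reached at depth k, or earlier if the path ended with EOS
        # (assumption of this sketch).
        if depth == k or (depth > 0 and path[-1] == eos):
            if score > best_score:
                best_score, best_first = score, first
            return
        # Visit children from most to least likely.
        children = sorted(log_prob_fn(path).items(),
                          key=lambda item: item[1], reverse=True)
        for token, log_p in children:
            new_score = score + log_p
            # Log-probabilities are <= 0, so the score can only decrease deeper
            # in the tree. Children are sorted, so the rest cannot win either.
            if new_score <= best_score:
                break
            dfs(path + (token,), new_score, depth + 1,
                token if first is None else first)

    dfs(tuple(prefix), 0.0, 0, None)
    return best_first


# Hypothetical toy distribution in which greedy and 2-step look-ahead disagree.
TOY_TABLE = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"x": math.log(0.5), "y": math.log(0.5)},
    ("b",): {"x": math.log(0.9), "y": math.log(0.1)},
}


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    return TOY_TABLE[tuple(prefix)]


print(lookahead_step(toy_model, (), k=1))  # greedy choice: a
print(lookahead_step(toy_model, (), k=2))  # look-ahead choice: b
```

In this toy distribution, 1-LA picks "a" because it has the higher immediate probability, while 2-LA picks "b" because its best two-step path has a joint probability of 0.36 against 0.30 for the best path through "a".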
We train and test the sequence models using OpenNMT. For the IM2LATEX-100K image-to-LaTeX OCR dataset, our CNN feature extractor follows prior work, and we pass the visual features at each time step to a bi-LSTM model with 512 hidden units. For the WMT16 EN-DE translation dataset, we train an LSTM model with 500 hidden units. For the WMT14 EN-DE translation dataset, we train a transformer model with 8 heads, 6 layers, 512 hidden units, and 2048 units in the feed-forward network. We report the BLEU scores of the greedy search and the look-ahead (LA) search with different numbers of look-ahead steps k on all three datasets. Under our definition of the look-ahead module, the 1-LA setting is equivalent to the greedy search since it uses only the information from the current time step. The look-ahead module is more directly comparable to the greedy search than to the beam search because both the greedy search and the look-ahead module have a beam size of 1. To give a better reference for the range of performance, we also report the scores of the beam search. Note that the beam search and the look-ahead method can be combined. For simplicity, we test our look-ahead module with the beam width set to 1.
We test the look-ahead module with five settings, from 1-LA (greedy) to 5-LA, and we evaluate the models with SacreBLEU scores, a commonly used machine translation metric. We report the results of the three models in Tables 1, 2, and 3. Our results show that the look-ahead module improves the models on the IM2LATEX-100K dataset and the WMT16 dataset. Figure 2 shows examples of using the look-ahead module with the model trained on the WMT16 dataset. However, the improvement becomes marginal on the WMT14 dataset, and the 5-LA setting even harms performance. We argue that the look-ahead module might be less effective on the more difficult datasets, i.e., those with longer target sequences. Table 2 shows that both the look-ahead module and the beam search harm the model on target sequences longer than 25 words. We do not discuss the IM2LATEX task here because its accuracy depends heavily on the recognition accuracy of the CNN model, which makes it a different setting from the two textual translation models. We argue that the ineffectiveness of the look-ahead module on WMT14 is caused by the overestimated end-of-sentence (EOS) probability. The overestimated EOS probability leads to shorter sentences and, at the same time, to wrong predictions.
To support our argument, we show the average length difference between the predicted sequences and the ground truth sequences. For each sentence, the difference is calculated as (Prediction Length - Ground Truth Length). Therefore, a positive number indicates that the model tends to predict longer sentences than the ground truth sentences, and a negative number indicates the opposite. We test the WMT16 LSTM model and the WMT14 transformer model with different search strategies. The trends shown in the two figures are the same: both models tend to predict shorter sequences as the number of look-ahead steps increases. However, in the greedy search setting, the WMT16 LSTM model tends to predict "overlong" sentences, while the WMT14 transformer model usually predicts "overshort" sentences. Because of these two properties, the look-ahead module substantially improves the WMT16 model but only marginally improves the WMT14 model. The results support our argument that the more difficult dataset suffers from the overestimated EOS problem.
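The length statistic above is simple to compute. The following sketch assumes whitespace tokenization and uses a hypothetical helper name, `average_length_difference`; the paper does not state how tokens are counted.

```python
from typing import Sequence


def average_length_difference(predictions: Sequence[str],
                              references: Sequence[str]) -> float:
    """Mean of (prediction length - ground truth length) over all sentences.

    Lengths are counted in whitespace-separated tokens (an assumption).
    A positive value means predictions are longer than the references.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same size")
    if not predictions:
        raise ValueError("at least one sentence pair is required")
    diffs = [len(p.split()) - len(r.split())
             for p, r in zip(predictions, references)]
    return sum(diffs) / len(diffs)


# Differences are [1, -2], so the average is -0.5 (predictions are too short).
print(average_length_difference(["a b c", "a"], ["a b", "a b c"]))
```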
Prior work enumerates all possible sequences and finds that the model assigns the highest probability to the empty sequence for over 50% of testing sentences. That result is consistent with our analysis: both demonstrate the EOS problem in different settings. However, that experimental setting is not practical because enumerating all possible sequences in an exponentially growing search space is time-consuming.
Table 1 (IM2LATEX-100K)
|Search Strategy|BLEU|
|Beam Search (B=10)|86.28|

Table 2 (WMT16)
|Search Strategy|BLEU|BLEU (Target len)|
|Beam Search (B=10)|33.83|22.45|

Table 3 (WMT14)
|Search Strategy|BLEU|
|Beam Search (B=10)|28.21|
5 Auxiliary EOS Loss
To tackle the EOS problem, we introduce an auxiliary EOS loss. We test the model trained with the proposed auxiliary EOS loss in our DFS-based look-ahead setting, which is more practical in the real world.
We ensure that the model does not ignore the EOS probability at time steps where the ground truth is a negative EOS example, i.e., where the ground truth word is not the EOS token. Given a batch of training data, the original sequence modeling loss can be represented as
where N is the batch size and c is the correct class of the data in the batch. The original loss focuses only on the loss of the correct classes. To incorporate the EOS token loss into training, we treat the auxiliary EOS task as a binary classification problem. Our auxiliary EOS loss can be written as
where gamma is a scalar indicating the portion of the EOS loss.
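As an illustration of how such a combined loss could be computed, the sketch below assumes the original loss is the standard cross-entropy averaged over the batch and the auxiliary term is a binary cross-entropy on the predicted EOS probability, weighted by gamma. The exact formulation in the paper may differ; the function names, the `eos_id` argument, and the clamping constant `eps` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_with_aux_eos(batch_logits: Sequence[Sequence[float]],
                      targets: Sequence[int],
                      eos_id: int,
                      gamma: float,
                      eps: float = 1e-12) -> float:
    """Cross-entropy plus gamma times a binary EOS loss, averaged over N items."""
    n = len(targets)
    ce_sum = 0.0
    eos_sum = 0.0
    for logits, c in zip(batch_logits, targets):
        probs = softmax(logits)
        # Original term: only the probability of the correct class c matters.
        ce_sum -= math.log(max(probs[c], eps))
        # Auxiliary term: is this position EOS (1) or not (0)?
        is_eos = 1.0 if c == eos_id else 0.0
        p_eos = min(max(probs[eos_id], eps), 1.0 - eps)
        eos_sum -= is_eos * math.log(p_eos) + (1.0 - is_eos) * math.log(1.0 - p_eos)
    return (ce_sum + gamma * eos_sum) / n


# Uniform logits over 3 classes, target is class 0, EOS is class 2.
uniform = [[0.0, 0.0, 0.0]]
print(loss_with_aux_eos(uniform, [0], eos_id=2, gamma=0.0))  # plain cross-entropy, log(3)
print(loss_with_aux_eos(uniform, [0], eos_id=2, gamma=1.0))  # log(3) + log(1.5) = log(4.5)
```

With gamma set to 0 the sketch reduces to the plain cross-entropy, which matches the description of the original model as the zero-weight case. For a non-EOS target, the auxiliary term penalizes any probability mass placed on EOS even though the cross-entropy term only looks at the correct class.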
We integrate the EOS loss into the training stage of the transformer model trained on the WMT14 machine translation dataset. We train the transformer model with auxiliary EOS loss weights ranging from 0.25 to 1.25, and we compare the models trained with the auxiliary loss against the original model (gamma = 0) under the greedy search and look-ahead search strategies. Moreover, we test the models with the beam search as the search strategy, since larger beam sizes are sometimes found to seriously harm performance. We suspect that the large beam size issue is also related to the EOS problem. To see the effectiveness of the EOS loss, we also show the average length difference of the model trained with the auxiliary EOS loss.
In this experiment, we add the auxiliary EOS loss to the transformer models. We set gamma in Equation 1 to 0.0 (the original model), 0.25, 0.5, 0.75, 1.0, and 1.25. The results are shown in Table 4. Surprisingly, the EOS loss consistently improves the models under the greedy search strategy. Moreover, the model trained with the auxiliary loss is more robust to longer look-ahead steps when the auxiliary weight is smaller than 1.25. In our setting, we get the best results when we set gamma to 1.0. Furthermore, we compare the auxiliary EOS loss model with the original model under the beam search strategy. The beam search results are shown in Figure 4. Our results demonstrate that the model trained with the auxiliary EOS loss surpasses the original model by a significant margin. Moreover, unlike the original model, the auxiliary EOS model is more robust to large beam widths. In addition, we plot the average length difference of the original model and the model with the auxiliary loss in Figure 5. The results show that training with the auxiliary EOS loss encourages the model to predict longer sequences than the original model.
6 Conclusion and Future Work
Working on the decoding strategy can help researchers pinpoint the problems of decoders and further improve the models by diagnosing their errors. Our work is an example. We investigate the decoders using our proposed look-ahead module and then fix the overestimated EOS problem. In the look-ahead experiments, we find that the look-ahead module improves results on some easier datasets but is less effective on a more difficult dataset, WMT14. Our analysis suggests that the overestimated EOS probability is one of the issues and that we can alleviate the problem by training the model with the auxiliary EOS loss. There are other feasible approaches to solving the EOS problem and integrating our proposed look-ahead module. One possibility is to build an external classification network that predicts the EOS at each time step instead of treating the EOS as one of the vocabulary tokens. Another is to incorporate the look-ahead module into the training stage and calculate the auxiliary loss using the information provided by the look-ahead module. It is also very promising to combine the look-ahead module with beam search. We hope this work can encourage other search strategies in the decoder and other methods for analyzing model errors in the future.
- VQA: visual question answering. 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2425–2433.
- (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- (2017) Towards better decoding and language model integration in sequence to sequence models. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pp. 523–527.
- Image-to-markup generation with coarse-to-fine attention. Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 980–989.
- (2016) Multi30K: multilingual English-German image descriptions. pp. 70–74.
- (2017) A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 123–135.
- Supervised sequence labelling with recurrent neural networks. Studies in Computational Intelligence, Vol. 385, Springer.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints.
- TVQA: localized, compositional video question answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1369–1379.
- (2017) Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 2157–2169.
- (2016) Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1723–1731.
- (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
- (2019) Optimal completion distillation for sequence learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
- (2017) Mastering the game of Go without human knowledge. Nature 550, pp. 354–359.
- (2019) On NMT search errors and model errors: cat got your tongue?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3354–3360.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998–6008.
- Show and tell: A neural image caption generator. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3156–3164.
- (2016) Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1296–1306.
- End-to-end dense video captioning with masked transformer. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8739–8748.