1 Introduction
Neural sequence models [7, 8, 17] have been widely applied to sequence generation tasks including machine translation [2], optical character recognition [4], image and video captioning [18, 20], visual question answering [1, 10], and dialogue generation [11]. Such neural architectures model the conditional probability $p(y \mid x)$ of an output sequence $y$ given an input $x$. With these models, sequence decoding can be performed by maximum a posteriori (MAP) estimation of the word sequence, given a trained neural sequence model and an observed input sequence. However, when the vocabulary is large and the predicted sequence is long, exact MAP inference is infeasible. For example, a vocabulary of size $V$ and a target sequence of length $T$ lead to $V^T$ possible sequences. Approximate inference strategies are therefore more commonly used to decode sequences than exact MAP inference.
The simplest decoding strategy is to always choose the word with the highest probability at each time step. This greedy approach is not guaranteed to yield the most likely sequence and is prone to grammatical errors in the output sequence. Beam search (BS), on the other hand, maintains the top-$B$ scoring successors at each time step and then scores all expanded sequences to choose one to output. BS has shown good results in many sequence generation tasks and has been the most popular decoding strategy so far. Although beam search considers the whole sequence for scoring, it uses only the current node to decide which nodes to expand and does not consider the possible future of a node. This incompleteness in the search leads to suboptimal results.
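The difference between greedy decoding and beam search can be illustrated with a minimal sketch. The toy model, its probabilities, and the names `toy_model` and `beam_search` are hypothetical and chosen only for illustration; a beam size of 1 reduces to greedy decoding.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sentence symbol
Model = Callable[[Tuple[str, ...]], Dict[str, float]]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities given a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    # Any prefix not listed above ends the sequence with probability 1.
    probs = table.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(model: Model, beam_size: int, max_len: int) -> Tuple[Tuple[str, ...], float]:
    """Keep the beam_size best partial sequences at every time step."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, logp in model(prefix).items():
                candidates.append((prefix + (tok,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Sequences that emitted EOS are set aside; the rest are expanded further.
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])


greedy_seq, greedy_score = beam_search(toy_model, beam_size=1, max_len=5)
beam_seq, beam_score = beam_search(toy_model, beam_size=2, max_len=5)
```

In this toy setting, greedy decoding commits to the locally best first word and ends with a sequence of probability 0.30, while a beam of size 2 keeps the second-best first word alive and finds a sequence of probability 0.36.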
When speaking or writing, we do not consider only the last word we generated when choosing the next word; we also consider what we want to say or write in the future. Taking future output into account is crucial for improving sequence generation. For example, Monte-Carlo Tree Search (MCTS) focuses on the analysis of promising moves and has achieved great success in game play, e.g., Go [15]. An MCTS-based strategy predicts the next action by carrying out several rollouts from the present time step and calculating the reward of each rollout with a trained value network. It then makes the decision at each time step by choosing the action that leads to the highest average future reward. However, MCTS requires training a separate value network and takes more runtime to run the simulations. It is not practical to run MCTS to decode sequences with a large vocabulary and many time steps. Instead of applying MCTS to sequence decoding, we propose a new $k$-step lookahead (LA) module that does not need an external value network and has a practical runtime. Our proposed lookahead module can be plugged into the decoding phase of any existing sequence model to improve the inference results.
Figure 1 illustrates how LA works in an example that chooses a word at time step $t$. At time step $t$, we expand every word until the search tree reaches the lookahead depth. For each word in the tree rooted at time step $t$, we compute the likelihood of extending that word from its parent using the pretrained sequence model. To select a word at time step $t$, we choose the word whose subtree has the highest accumulated probability, i.e., the highest expected likelihood in the future.
We test the proposed $k$-step lookahead module on three datasets of increasing difficulty: IM2LATEX OCR, WMT16 multimodal English-German translation, and WMT14 English-German machine translation. Our results show that the lookahead module improves decoding on IM2LATEX and WMT16 but is only marginally better than greedy search on WMT14. Our analysis suggests that the more difficult datasets (usually those containing longer sequences, e.g., WMT14 and the subset of WMT16 with sequence length > 25) suffer from an overestimated end-of-sentence (EOS) probability. The overestimated EOS probability encourages the sequence decoder to favor short sequences. Even with the lookahead module, the decoder cannot recover from that bias. To fix the EOS problem, we use an auxiliary EOS loss in training to obtain a more accurate EOS estimate. We show that training with the auxiliary EOS loss not only improves the performance of the lookahead module but also makes beam search more robust.
This work makes a number of contributions. We show how future information can be incorporated to improve decoders using only pretrained sequence models. Our analysis with the proposed decoder also helps us pinpoint the issues of the pretrained sequence model and fix the model accordingly. We expect that examining decoders and models together can provide a better picture of sequence generation results and help design more robust sequence models and training frameworks.
2 Related Work
Learning to search with lookahead cues: Reinforcement learning (RL) techniques, especially value networks, are often used to incorporate hypothetical future information into predictions. [2] train their policy and value networks by RL but allow the value network to also take the correct output as its input so that the policy can optimize for BLEU scores directly. [20] train image captioning policy and value networks using actor-critic methods. The authors found that the global guidance introduced by the value network greatly improves performance over an approach that uses only a policy network. [15] apply self-play and MCTS to train policy and value networks for Go and show that MCTS is a powerful policy evaluation method.
Augmenting information in training sequence models: [12] focus on using an auxiliary reward to improve the decoder trained with maximum likelihood. They define the auxiliary reward as the negative edit distance between the predicted sentences and the ground truth labels. [14] optimize seq2seq models based on edit distance instead of maximizing the likelihood and show improvements on a speech recognition dataset. [19] focus on improving the decoder by alleviating the mismatch between training and testing. They introduce a search-based loss that directly optimizes the network for beam search decoding.
Sequence modeling errors: [16] analyze the machine translation decoder by enumerating all possible predicted sequences and choosing the sequence with the highest likelihood as the decoded output. Their results demonstrate that the neural machine translation model assigns its best score to the empty sentence for over 50% of inference sentences. [3] argue that seq2seq models suffer from overestimated word probabilities in the training stage and propose to solve the issue with label smoothing.
3 Datasets
In the remainder of the paper, we evaluate the proposed approaches on three datasets: the IM2LATEX-100K OCR dataset [4], the WMT16 multimodal English-German (EN-DE) machine translation dataset [5], and the WMT14 English-German (EN-DE) machine translation dataset. In IM2LATEX-100K, the input is an image and the goal is to generate the corresponding LaTeX equation. The dataset is split into a training set (83,883 equations), a validation set (9,319 equations), and a test set (10,354 equations). The average length of the target LaTeX equations is 64.86 characters per equation. The WMT16 multimodal dataset consists of 29,000 EN-DE training pairs, 1,014 validation pairs, and 1,000 test pairs. Each EN-DE training pair describes an image. The average length of the test target sentences is 12.39 words per sentence. In this paper, we do not use the image information. The WMT14 EN-DE machine translation dataset consists of 4,542,486 training pairs and 1,014 validation pairs. We train on the WMT14 data but evaluate the model on the newstest2017 dataset, which consists of 3,004 test pairs. The average length of the target sequences in newstest2017 is 28.23 words per sentence, much longer than the average length in the WMT16 translation dataset. The longer target sequences make WMT14 a more difficult task than the WMT16 translation task.
4 Lookahead Prediction
We present a lookahead prediction module that takes advantage of future cues. The proposed lookahead module is based on depth-first search (DFS) rather than Monte-Carlo Tree Search (MCTS). In the DFS-based lookahead module, we can prune negligible paths and nodes whose probability is too small for them to lie on the highest-probability path. In contrast, an MCTS-based method requires many samples to estimate the expected probability of each node. To compare the actual execution time of the two lookahead methods, we test both on the transformer model trained on the WMT14 dataset. We run the experiment on a Tesla V100 GPU with 500 input sentences and set the number of lookahead time steps to 3 for both search strategies. In the MCTS setting, we perform 20 rollouts at each time step, and the average execution time is 32.47 seconds per sentence. For the DFS-based method, the average execution time is 0.60 seconds per sentence, roughly 54 times faster. To keep the lookahead module practical, we choose the DFS-based lookahead module as our node expansion strategy.
4.1 Method
Figure 1 illustrates our proposed DFS lookahead module, and Algorithm 1 gives its pseudocode. Given a pretrained sequence model and a vocabulary of size $V$, we expand a tree rooted at the current time step $t$ in the $k$-step lookahead setting. The height of the tree is $k$. For example, in the 2-step lookahead setting, there are $V$ nodes at height 1 and $V^2$ leaf nodes at height 2. At time step $t$, we select the word that has the maximum sum of log-likelihoods along a path from height 1 to the leaf nodes. We repeat this operation at each time step until we predict the EOS token. Although the time complexity of the DFS is $O(V^k)$, we can prune many insignificant paths in the tree. At line 9 of Algorithm 1, we stop the DFS early when the current cumulative log-probability is smaller than the maximum sum of log-probabilities encountered so far. Since we sort the log-probabilities before performing the DFS, we can prune many paths that cannot be the optimal path in the expanded tree. By using information about future words in the prediction, we can select in advance the word that leads to the largest probability.
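The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the toy model, the names `lookahead_step` and `lookahead_decode`, and the choice to treat a path that emits EOS before depth $k$ as a leaf are all hypothetical. The sketch follows the max-sum-of-log-probabilities criterion and the sorted early-stopping rule described in the text.

```python
import math
from typing import Callable, Dict, Optional, Tuple

EOS = "</s>"  # hypothetical end-of-sentence symbol
Model = Callable[[Tuple[str, ...]], Dict[str, float]]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities given a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    probs = table.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


def lookahead_step(model: Model, prefix: Tuple[str, ...], k: int) -> Optional[str]:
    """Return the next word whose subtree holds the best depth-k path (k >= 1)."""
    best_score, best_word = -math.inf, None

    def dfs(seq: Tuple[str, ...], score: float, depth: int, first: Optional[str]) -> None:
        nonlocal best_score, best_word
        # Assumption: a path that already produced EOS is treated as a leaf.
        if depth == k or (depth > 0 and seq[-1] == EOS):
            best_score, best_word = score, first  # only reached if score beats best_score
            return
        children = sorted(model(seq).items(), key=lambda kv: kv[1], reverse=True)
        for tok, logp in children:
            # Log-probabilities are <= 0, so a path can never recover once it
            # falls to or below the best score; sorted order lets us stop here.
            if score + logp <= best_score:
                break
            dfs(seq + (tok,), score + logp, depth + 1, tok if depth == 0 else first)

    dfs(prefix, 0.0, 0, None)
    return best_word


def lookahead_decode(model: Model, k: int, max_len: int) -> Tuple[str, ...]:
    """Repeat the k-step lookahead choice until EOS or max_len tokens."""
    seq: Tuple[str, ...] = ()
    while len(seq) < max_len:
        word = lookahead_step(model, seq, k)
        seq += (word,)
        if word == EOS:
            break
    return seq


one_step = lookahead_decode(toy_model, k=1, max_len=5)
two_step = lookahead_decode(toy_model, k=2, max_len=5)
```

With $k = 1$ the sketch reduces to greedy search, while with $k = 2$ it chooses the first word whose best continuation has the higher joint probability in this toy example.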
4.2 Experiments
We train and test the sequence models using OpenNMT [9]. For the IM2LATEX-100K image-to-LaTeX OCR dataset, our CNN feature extractor is based on [6], and we pass the visual features at each time step to a bi-LSTM model with 512 hidden units. For the WMT16 EN-DE translation dataset, we train an LSTM model with 500 hidden units. For the WMT14 EN-DE translation dataset, we train a transformer model with 8 heads, 6 layers, 512 hidden units, and 2048 units in the feed-forward network. We report the BLEU scores of greedy search and of lookahead (LA) search with different numbers of steps on all three datasets. In our definition of the lookahead module, the 1LA setting is equivalent to greedy search since it uses only information from the current time step. The lookahead module is more directly comparable to greedy search than to beam search because the beam size of both greedy search and the lookahead module is 1. To give a reference for the range of performance, we also report the scores of beam search. Note that beam search and the lookahead method can be combined. For simplicity, we test our lookahead module with beam width 1.
4.3 Results
We test the lookahead module in five settings, from 1LA (greedy) to 5LA, and evaluate the models with SacreBLEU scores [13], a commonly used machine translation metric. We report the results of the three models in Tables 1, 2, and 3. Our results show that the lookahead module improves the models on the IM2LATEX-100K dataset and the WMT16 dataset. Figure 2 shows examples of applying the lookahead module to the model trained on the WMT16 dataset. However, the improvement becomes marginal on the WMT14 dataset, and the 5LA setting even harms performance. We argue that the lookahead module might be less effective on more difficult datasets, i.e., those with longer target sequences. Table 2 shows that both the lookahead module and beam search harm the model on target sequences longer than 25 words. We do not analyze the IM2LATEX task further because its accuracy depends heavily on the recognition accuracy of the CNN, which makes it a different setting from the two textual translation models. We argue that the ineffectiveness of the lookahead module on WMT14 is caused by the overestimated end-of-sentence (EOS) probability. The overestimated EOS probability leads to shorter sentences and, at the same time, to wrong predictions.
To support our argument, we show the average length difference between the predicted sequences and the ground truth sequences. For each sentence, the difference is calculated as (Prediction Length − Ground Truth Length). Therefore, a positive number indicates that the model tends to predict longer sentences than the ground truth, and a negative number indicates the opposite. We test the WMT16 LSTM model and the WMT14 transformer model with different search strategies. The trends shown in the two figures are the same: both models tend to predict shorter sequences as the number of lookahead steps increases. However, in the greedy search setting, the WMT16 LSTM model tends to predict "overlong" sentences, while the WMT14 transformer model usually predicts "overshort" sentences. These two properties explain why the lookahead module substantially improves the WMT16 model but only marginally improves the WMT14 model. The results substantiate our argument about the overestimated EOS problem on the more difficult dataset.
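The length statistic defined above is simple to compute. The following sketch assumes tokenized predictions and references given as lists of tokens; the function name `mean_length_difference` and the example sentences are hypothetical.

```python
from typing import List, Sequence


def mean_length_difference(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Average of (prediction length - ground truth length) over all sentences."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("need the same non-zero number of predictions and references")
    diffs: List[int] = [len(p) - len(r) for p, r in zip(predictions, references)]
    return sum(diffs) / len(diffs)


# Hypothetical tokenized outputs: both predictions are one token too short.
preds = [["ein", "Hund"], ["hallo"]]
refs = [["ein", "Hund", "läuft"], ["hallo", "Welt"]]
avg_diff = mean_length_difference(preds, refs)
```

A negative value, as in this example, corresponds to the "overshort" behavior discussed in the text.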
[16] enumerate all possible sequences and find that the model assigns the highest probability to the empty sequence for over 50% of test sentences. Their result is consistent with our analysis; both demonstrate the EOS problem in different settings. However, their experimental setting is not practical because enumerating all possible sequences in an exponentially growing search space is time-consuming.
Table 1: BLEU scores on the IM2LATEX-100K dataset.
Search Strategy       BLEU
Greedy Search         86.24
2LA                   86.65
3LA                   86.71
4LA                   86.77
5LA                   86.79
Beam Search (B=10)    86.28
Table 2: BLEU scores on the WMT16 dataset.
Search Strategy       BLEU     BLEU (target length > 25)
Greedy Search         31.67    23.86
2LA                   32.07    21.50
3LA                   32.20    22.78
4LA                   32.42    22.45
5LA                   32.41    23.30
Beam Search (B=10)    33.83    22.45
Table 3: BLEU scores on the WMT14 dataset.
Search Strategy       BLEU
Greedy Search         27.50
2LA                   27.71
3LA                   27.62
4LA                   27.56
5LA                   27.35
Beam Search (B=10)    28.21
Table 4: BLEU scores on the WMT14 dataset for different auxiliary EOS loss weights $\gamma$.
Search Strategy    γ=0.0    γ=0.25   γ=0.50   γ=0.75   γ=1.0    γ=1.25
Greedy             27.50    27.81    27.74    27.75    27.90    27.71
2LA                27.71    28.05    27.95    27.99    28.20    27.85
3LA                27.89    27.82    27.87    27.82    28.10    27.68
4LA                27.56    27.81    27.87    27.74    27.84    27.68
5LA                27.35    27.71    27.74    27.63    27.87    27.55
5 Auxiliary EOS Loss
To tackle the EOS problem, we introduce an auxiliary EOS loss. We test the model trained with the proposed auxiliary EOS loss in our DFS-based lookahead setting, which is more practical in the real world.
5.1 Methods
We ensure that the model does not ignore the EOS probability at time steps where the ground truth is a negative EOS example, i.e., where the ground truth word is not the EOS token. Given a batch of training data, the original sequence modeling loss can be represented as
where $N$ is the batch size and $c$ is the correct class of each example in the batch. The original loss focuses only on the loss of the correct classes. To incorporate the EOS token loss into training, we treat the auxiliary EOS task as a binary classification problem. Our auxiliary EOS loss can be written as
(1)
where $\gamma$ is a scalar indicating the weight of the EOS loss.
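One form of the two losses that is consistent with the description above is sketched here. The symbols $p_{i,c}$ (model probability of the correct class for example $i$), $p_{i,\mathrm{eos}}$ (model probability of the EOS token), and $e_i \in \{0, 1\}$ (indicator that the ground truth token is EOS), as well as the $1/N$ normalization and the additive combination, are assumptions made for illustration and may differ from Equation (1).

```latex
% Assumed cross-entropy over the correct classes (original loss).
L_{\mathrm{orig}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i,c}

% Assumed auxiliary loss: original loss plus a gamma-weighted binary
% cross-entropy on the EOS probability at every position.
L_{\mathrm{EOS}} = L_{\mathrm{orig}}
  - \frac{\gamma}{N} \sum_{i=1}^{N}
    \left[ e_i \log p_{i,\mathrm{eos}} + (1 - e_i) \log\left(1 - p_{i,\mathrm{eos}}\right) \right]
```

Under this form, setting $\gamma = 0$ recovers the original loss, and the second term penalizes a high EOS probability at every position whose ground truth token is not EOS.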
5.2 Experiments
We integrate the EOS loss into the training stage of the transformer model trained on the WMT14 machine translation dataset. We train the transformer model with auxiliary EOS loss weights ranging from $\gamma = 0.25$ to $\gamma = 1.25$ and compare these models with the original model ($\gamma = 0$) under the greedy search and lookahead search strategies. Moreover, we test the models with beam search as the search strategy, since larger beam sizes are sometimes found to seriously harm performance. We suspect that this large-beam issue is also related to the EOS problem. To assess the effectiveness of the EOS loss, we also show the average length difference of the model trained with the auxiliary EOS loss.
5.3 Results
In this experiment, we add the auxiliary EOS loss to the transformer models. We set $\gamma$ in Equation (1) to 0.0 (the original model), 0.25, 0.5, 0.75, 1.0, and 1.25. The results are shown in Table 4. Surprisingly, the EOS loss consistently improves the models under the greedy search strategy. Moreover, with auxiliary weights smaller than 1.25, the model trained with the auxiliary loss is more robust to longer lookahead steps. In our setting, we obtain the best results with $\gamma = 1.0$. Furthermore, we compare the auxiliary EOS loss model with the original model under the beam search strategy. The beam search results are shown in Figure 4. Our results demonstrate that the model trained with the auxiliary EOS loss surpasses the original model by a significant margin. Moreover, unlike the original model, the auxiliary EOS model is more robust to large beam widths. In addition, we plot the average length difference of the original model and the model with the auxiliary loss in Figure 5. The results show that training with the auxiliary EOS loss encourages the model to predict longer sequences than the original model.
6 Conclusion and Future Work
Working on the decoding strategy can help researchers pinpoint the problems of decoders and further improve models by diagnosing their errors. Our work is an example. We investigate decoders using our proposed lookahead module and then fix the overestimated EOS problem. In the lookahead experiments, we find that the lookahead module improves results on some easier datasets but is less effective on a more difficult dataset, WMT14. Our analysis suggests that the overestimated EOS probability is one of the issues and that we can alleviate the problem by training the model with the auxiliary EOS loss. There are other feasible approaches to solving the EOS problem and integrating our proposed lookahead module. One possibility is to build an external classification network that predicts the EOS at each time step instead of treating the EOS as one of the vocabulary tokens. Another is to incorporate the lookahead module into the training stage and calculate the auxiliary loss using the information provided by the lookahead module. Combining the lookahead module with beam search is also very promising. We hope this work encourages further search strategies for decoders and further methods for analyzing model errors.
References
[1] (2015) VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2425–2433.
[2] (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
[3] (2017) Towards better decoding and language model integration in sequence to sequence models. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pp. 523–527.
[4] (2017) Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 980–989.
[5] (2016) Multi30K: multilingual English-German image descriptions. pp. 70–74.
[6] (2017) A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 123–135.
[7] (2012) Supervised sequence labelling with recurrent neural networks. Studies in Computational Intelligence, Vol. 385, Springer. ISBN 9783642247965.
[8] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[9] OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints. arXiv:1701.02810.
[10] (2018) TVQA: localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1369–1379.
[11] (2017) Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 2157–2169.
[12] (2016) Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1723–1731.
[13] (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
[14] (2019) Optimal completion distillation for sequence learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
[15] (2017) Mastering the game of Go without human knowledge. Nature 550, pp. 354–359.
[16] (2019) On NMT search errors and model errors: cat got your tongue?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3354–3360.
[17] (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998–6008.
[18] (2015) Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3156–3164.
[19] (2016) Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1296–1306.
[20] (2018) End-to-end dense video captioning with masked transformer. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8739–8748.