Investigating the Decoders of Maximum Likelihood Sequence Models: A Look-ahead Approach

03/08/2020 ∙ by Yu-Siang Wang, et al. ∙ University of Toronto ∙ MIT

We demonstrate how we can practically incorporate multi-step future information into a decoder of maximum likelihood sequence models. We propose a "k-step look-ahead" module to consider the likelihood information of a rollout up to k steps. Unlike other approaches that need to train another value network to evaluate the rollouts, we can directly apply this look-ahead module to improve the decoding of any sequence model trained in a maximum likelihood framework. We evaluate our look-ahead module on three datasets of varying difficulties: IM2LATEX-100k OCR image to LaTeX, WMT16 multimodal machine translation, and WMT14 machine translation. Our look-ahead module improves the performance of the simpler datasets such as IM2LATEX-100k and WMT16 multimodal machine translation. However, the improvement of the more difficult dataset (e.g., containing longer sequences), WMT14 machine translation, becomes marginal. Our further investigation using the k-step look-ahead suggests that the more difficult tasks suffer from the overestimated EOS (end-of-sentence) probability. We argue that the overestimated EOS probability also causes the decreased performance of beam search when increasing its beam width. We tackle the EOS problem by integrating an auxiliary EOS loss into the training to estimate if the model should emit EOS or other words. Our experiments show that improving EOS estimation not only increases the performance of our proposed look-ahead module but also the robustness of the beam search.




1 Introduction

Figure 1: A synthetic example illustrating a 2-step look-ahead inference module for the decoder. Each node represents a word token and a probability value from the decoder. The model is predicting the word at time step t. The vocabulary in this example consists of three tokens {Token #0, Token #1, EOS (end-of-sentence)}. We do not expand the tree from an EOS node because that node implies the end of the sentence. The depth of the expanded tree is 2 in the 2-step look-ahead scenario. To predict the word at time step t, we compute the sum of the log probabilities from each node at time step t down to the leaves of the tree, and select the word whose path has the maximum total log probability as our prediction at time step t. In the example, Token #0 has the maximum likelihood (0.8 × 0.6 = 0.48) among all paths from t to t+1, so we choose Token #0 as the prediction at time step t.

Neural sequence models [7, 8, 17] have been widely applied to various sequence generation tasks, including machine translation [2], optical character recognition [4], image captioning [18, 20], visual question answering [1, 10], and dialogue generation [11]. Such neural architectures model the conditional probability p(y|x) of an output sequence y given an input x. With these models, sequence decoding can be performed by maximum a posteriori (MAP) estimation of the word sequence, given a trained sequence model and an observed input. However, when the vocabulary is large and the predicted sequence is long, exact MAP inference is infeasible: a vocabulary of size V and a target sequence of length T yield V^T possible sequences. In practice, approximate inference strategies are therefore used to decode sequences instead of exact MAP inference.

The simplest decoding strategy is to choose the word with the highest probability at each time step. This greedy approach does not necessarily give the most likely sequence and is prone to grammatical errors in the output. Beam search (BS), on the other hand, maintains the B top-scoring successors at each time step and then scores all expanded sequences to choose the output. BS has shown good results on many sequence generation tasks and has been by far the most popular decoding strategy. However, although beam search scores the whole sequence, it uses only the current node to decide which nodes to expand and does not consider a node's possible future. This incompleteness in the search leads to sub-optimal results.
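The two baseline decoders can be sketched as follows. This is a minimal illustration in which `step_probs(prefix)` is a hypothetical stand-in for any trained model's next-token distribution, not the models used in this paper:

```python
import math

def greedy_decode(step_probs, eos, max_len):
    """Pick the single highest-probability token at each step."""
    seq = []
    for _ in range(max_len):
        probs = step_probs(seq)              # dict: token -> probability
        word = max(probs, key=probs.get)
        if word == eos:
            break
        seq.append(word)
    return seq

def beam_search(step_probs, eos, max_len, beam=3):
    """Keep the `beam` best prefixes by cumulative log-probability at each step."""
    beams = [([], 0.0)]                      # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in step_probs(prefix).items():
                new_score = score + math.log(p)
                if word == eos:
                    finished.append((prefix, new_score))
                else:
                    candidates.append((prefix + [word], new_score))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```

Note that both decoders expand a prefix based only on scores accumulated so far; neither looks ahead at what the expansion enables, which is the gap the look-ahead module targets.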

When speaking or writing, we do not consider only the last word we generated when choosing the next word; we also consider what we want to say or write in the future. Accounting for future output is crucial for improving sequence generation. For example, Monte-Carlo Tree Search (MCTS) focuses its analysis on promising moves and has achieved great success in game play, e.g., Go [15]. An MCTS-based strategy predicts the next action by carrying out several rollouts from the present time step and calculating the reward of each rollout with a trained value network; it then chooses, at each time step, the action that leads to the highest average future reward. However, MCTS requires training an additional value network and takes considerable runtime to run the simulations, which makes it impractical for decoding sequences with a large vocabulary over many time steps. Instead of applying MCTS to sequence decoding, we propose a k-step look-ahead (k-LA) module that needs no external value network and has a practical runtime. Our proposed look-ahead module can be plugged into the decoding phase of any existing sequence model trained in a maximum likelihood framework to improve the inference results.

Figure 1 illustrates how k-LA works in a 2-step example to choose a word at time step t. At time step t, we expand every word until the search tree reaches time step t+1. For each word in the tree rooted at time t, we compute the likelihood of extending that word from its parent using the pretrained sequence model. To select a word at t, we choose the word whose sub-tree has the highest accumulated probability, i.e., the highest expected likelihood in the future.
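The selection rule can be made concrete with a brute-force k-step look-ahead over a toy next-token distribution. The numbers below are illustrative, not Figure 1's exact values, and `step_probs` is a hypothetical model interface:

```python
import math

def look_ahead_choice(step_probs, prefix, eos, k):
    """Score each candidate next word by the best cumulative log-probability
    of any rollout of depth <= k that starts with it; return the argmax."""
    def best_path(pfx, depth, acc):
        # Stop expanding at depth k, and never expand an EOS node.
        if depth == k or pfx[-1] == eos:
            return acc
        return max(best_path(pfx + [w], depth + 1, acc + math.log(p))
                   for w, p in step_probs(pfx).items())
    probs = step_probs(prefix)
    return max(probs, key=lambda w: best_path(prefix + [w], 1, math.log(probs[w])))
```

With a distribution where the second-best immediate token leads to a much more confident continuation, greedy decoding (k = 1) and 2-step look-ahead disagree, which is exactly the situation Figure 1 depicts.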

We test the proposed k-step look-ahead module on three datasets of increasing difficulty: IM2LATEX OCR, WMT16 multimodal English-German translation, and WMT14 English-German machine translation. Our results show that the look-ahead module improves decoding on IM2LATEX and WMT16 but is only marginally better than greedy search on WMT14. Our analysis suggests that the more difficult datasets (usually those with longer sequences, e.g., WMT14 and the subset of WMT16 with target length over 25 words) suffer from an overestimated end-of-sentence (EOS) probability. The overestimated EOS probability encourages the sequence decoder to favor short sequences, and even with the look-ahead module the decoder cannot recover from that bias. To fix the EOS problem, we use an auxiliary EOS loss during training to obtain a more accurate EOS estimate. We show that the model trained with the auxiliary EOS loss not only improves the performance of the look-ahead module but also makes beam search more robust.

This work makes a number of contributions. We show how to incorporate future information to improve decoders using only pretrained sequence models. Our analysis with the proposed decoder also helps us pinpoint issues in the pretrained sequence model and fix them. We expect that looking at decoders and models together can provide a better picture of sequence generation results and help design more robust sequence models and training frameworks.

Figure 2: Two examples from the WMT16 dataset. The input English sentence of the first row is "Two black dogs , a black puppy , and a white dog in the snow", and that of the second row is "A young female artists paints an image of a woman on a wall". We exhibit the translation results under different strategies. The first row illustrates a successful example of the look-ahead module, and the second row a failure case; success is defined in terms of BLEU scores.

2 Related Work

Learning to search with look-ahead cues: Reinforcement learning (RL) techniques, especially value networks, are often used to incorporate hypothetical future information into predictions. [2] train their policy and value networks by RL but allow the value network to also take the correct output as its input, so that the policy can optimize for BLEU scores directly. [20] train image-captioning policy and value networks using actor-critic methods; they found that the global guidance introduced by the value network greatly improves performance over using a policy network alone. [15] apply self-play and MCTS to train policy and value networks for Go, showing that MCTS is a powerful policy evaluation method.
Augmenting information in training sequence models: [12] focus on using an auxiliary reward to improve maximum-likelihood training of the decoder; they define the auxiliary reward as the negative edit distance between the predicted sentences and the ground-truth labels. [14] optimize seq2seq models for edit distance instead of maximizing likelihood and show improvements on a speech recognition dataset. [19] focus on improving the decoder by alleviating the mismatch between training and testing, introducing a search-based loss that directly optimizes the network for beam-search decoding.
Sequence modeling errors: [16] analyze a machine translation decoder by enumerating all possible predicted sequences and choosing the one with the highest likelihood. Their results demonstrate that the neural machine translation model assigns its best score to the empty sentence for over 50% of inference sentences. [3] argue that seq2seq models suffer from overestimated word probabilities in the training stage and propose to mitigate the issue with the label smoothing technique.

3 Datasets

In the rest of the paper, we evaluate the proposed approaches on three datasets: the IM2LATEX-100K OCR dataset [4], the WMT16 multimodal English-German (EN-DE) machine translation dataset [5], and the WMT14 EN-DE machine translation dataset. In IM2LATEX-100K, the input is an image and the goal is to generate the corresponding LaTeX equation. The dataset is split into a training set (83,883 equations), a validation set (9,319 equations), and a test set (10,354 equations); the average length of the target LaTeX equations is 64.86 characters. The WMT16 multimodal dataset consists of 29,000 EN-DE training pairs, 1,014 validation pairs, and 1,000 test pairs; each pair describes an image, and the average length of the test target sentences is 12.39 words. In this paper we do not use the image information. The WMT14 EN-DE dataset consists of 4,542,486 training pairs and 1,014 validation pairs. We train on WMT14 but evaluate on newstest2017, which consists of 3,004 test pairs with an average target length of 28.23 words per sentence, much longer than in the WMT16 translation data. The longer target sequences make WMT14 a more difficult task than the WMT16 translation task.

1  Input: pretrained sequence model p_θ, max time step T, look-ahead step k
2  Initialize predicted sequence S to the empty sequence
3  Initialize the first input as the BOS (begin-of-sentence) token
4  Initialize max_prob to -INF
5  Function DFSLookAhead(words, probs, depth, cum_prob, head):
6        for prob, word in (probs, words) do
7              cum_prob += prob
8              if cum_prob ≤ max_prob then
9                    break          // siblings are sorted: none of the rest can do better
10             if depth == 1 then head = word
11             if depth == k or word == EOS then
12                   max_prob = cum_prob; w_best = head
13             else
14                   probs', words' = Sorted(p_θ(· | S, word))
15                   DFSLookAhead(words', probs', depth + 1, cum_prob, head)
16             cum_prob -= prob
17       return w_best
18 for t = 1 to T do
19       Initialize depth to 1, cum_prob to 0, max_prob to -INF, head and w_best to None
20       probs, words = Sorted(p_θ(· | S))
21       DFSLookAhead(words, probs, depth, cum_prob, head)
22       S.append(w_best)
Algorithm 1 DFS Look-Ahead for Prediction

4 Look-ahead Prediction

We present a look-ahead prediction module that takes advantage of future cues. The proposed look-ahead module is based on depth-first search (DFS) rather than a Monte-Carlo Tree Search (MCTS) method. In the DFS-based module, we can prune negligible paths and nodes whose probability is too small to lead to the largest total probability; in contrast, an MCTS-based method requires many samples to estimate the nodes' expected probability. To compare the actual execution of the two look-ahead methods, we test both on the transformer model trained on WMT14. We run the experiment on a Tesla V100 GPU with 500 input sentences and set the look-ahead step to k = 3 for both search strategies. In the MCTS setting, we perform 20 rollouts at each time step, and the average execution time is 32.47 seconds per sentence; for the DFS-based method, it is 0.60 seconds per sentence. To make the look-ahead module practical, we therefore choose the DFS-based module as our node-expansion strategy.

4.1 Method

Figure 1 illustrates our proposed DFS look-ahead module, and Algorithm 1 gives its pseudo-code. Given a pretrained sequence model and a vocabulary of size V, we expand a tree from the current time step t in the k-step look-ahead setting; the height of the tree is k. For example, in the 2-step look-ahead setting, there are V nodes at height 1 and up to V² leaf nodes at height 2. At t, we select the word with the maximum sum of log-likelihoods along its path from height 1 to the leaves, and repeat this operation at each time step until we predict the EOS token. Although the time complexity of the DFS is O(V^k), we can prune many insignificant paths in the tree: at line 9 in Algorithm 1, we stop the DFS early when the current cumulative log-probability is smaller than the maximum cumulative log-probability encountered so far. Since we sort the log probabilities before performing the DFS, we can prune many paths that cannot be optimal in the expanded tree. By using this foresight in the prediction, we can select in advance the word that leads to the largest probability.
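A compact Python sketch of the pruned DFS follows. It assumes a `log_probs(prefix)` helper returning candidate (word, log-probability) pairs sorted in descending order; the helper and names are illustrative stand-ins, not the paper's OpenNMT code:

```python
import math

def dfs_look_ahead(log_probs, prefix, eos, k):
    """Return the next word whose best depth-k rollout has the highest
    cumulative log-probability, pruning branches that cannot win."""
    best = {"score": -math.inf, "word": None}

    def dfs(pfx, depth, cum, head):
        for word, lp in log_probs(pfx):       # sorted descending
            total = cum + lp
            if total <= best["score"]:
                break                          # prune this and all remaining siblings
            h = word if depth == 1 else head   # remember the height-1 ancestor
            if depth == k or word == eos:      # leaf: a complete rollout, and it wins
                best["score"], best["word"] = total, h
            else:
                dfs(pfx + [word], depth + 1, total, h)

    dfs(prefix, 1, 0.0, None)
    return best["word"]
```

Because the children are visited in descending log-probability order, the `break` is safe: once one sibling cannot beat the incumbent, none of the remaining siblings can either, which is the pruning step the runtime comparison above relies on.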

4.2 Experiments

We train and test the sequence models using OpenNMT [9]. For the IM2LATEX-100K image-to-LaTeX OCR dataset, our CNN feature extractor is based on [6], and we pass the visual features at each time step to a bi-LSTM with 512 hidden units. For the WMT16 EN-DE translation dataset, we train an LSTM with 500 hidden units. For the WMT14 EN-DE translation dataset, we train a transformer with 8 heads, 6 layers, 512 hidden units, and 2,048 units in the feed-forward network. We report the BLEU scores of greedy search and look-ahead (LA) search with different k on all three datasets. By our definition, the 1-LA setting is equivalent to greedy search, since it uses only current-time-step information. The look-ahead module is more directly comparable to greedy search than to beam search, because both greedy search and the look-ahead module use beam size 1. To give a reference for the range of performance, we also report beam-search scores. Note that the beam search and look-ahead methods could be combined; for simplicity, we test our look-ahead module with beam width 1.

4.3 Results

We test the look-ahead module with five settings, 1-LA (greedy) through 5-LA, and evaluate the models with SacreBLEU [13], a commonly used machine translation metric. Tables 1, 2, and 3 present the results for the three models. The look-ahead module improves the models on the IM2LATEX-100K and WMT16 datasets; Figure 2 shows examples of applying it to the model trained on WMT16. However, the improvement becomes marginal on WMT14 and even harms performance in the 5-LA setting. We argue that the look-ahead module may be less effective on more difficult datasets, i.e., those with longer target sequences: Table 2 shows that both the look-ahead module and beam search hurt the model on target sequences longer than 25 words. We do not include IM2LATEX in this discussion because its accuracy depends heavily on the recognition accuracy of the CNN, which makes it a different regime from the two textual translation models. We argue that the ineffectiveness of the look-ahead module on WMT14 is caused by an overestimated end-of-sentence (EOS) probability, which leads to shorter sentences and, in turn, wrong predictions.

To support this argument, we measure the average length difference between predicted and ground-truth sequences. For each sentence, the difference is calculated as (prediction length − ground-truth length), so a positive number indicates that the model tends to predict sentences longer than the ground truth, and vice versa. We test the WMT16 LSTM model and the WMT14 transformer model with different search strategies. The trends for the two models are the same: both tend to predict shorter sequences as the number of look-ahead steps increases. However, under greedy search the WMT16 LSTM tends to predict "overlong" sentences, while the WMT14 transformer usually predicts "overshort" ones. These two properties explain why the look-ahead module substantially improves the WMT16 model but only marginally improves the WMT14 model, and they substantiate our argument that the more difficult dataset suffers from an overestimated EOS probability.
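The length statistic above is straightforward to compute; this is a generic sketch over tokenized sentences, not the paper's evaluation code:

```python
def avg_length_difference(predictions, references):
    """Mean of (prediction length - reference length) over a corpus.
    Positive values mean the model tends to over-generate; negative
    values mean it tends to stop too early (the EOS symptom)."""
    diffs = [len(p) - len(r) for p, r in zip(predictions, references)]
    return sum(diffs) / len(diffs)
```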

[16] enumerate all possible sequences and find that the model assigns the highest probability to the empty sequence for over 50% of test sentences. Their result is consistent with our analysis; both expose the EOS problem, in different settings. However, their experimental setting is not practical, because enumerating all possible sequences in an exponentially growing search space is time-consuming.

Search Strategy BLEU


Greedy Search 86.24
2-LA 86.65
3-LA 86.71
4-LA 86.77
5-LA 86.79


Beam Search (B=10) 86.28
Table 1: Performance of the IM2LATEX-100K bi-LSTM model. The look-ahead module improves over greedy search; note that LA is more directly comparable to greedy search because they share the same beam size. We also show beam-search scores for reference.
Search Strategy BLEU BLEU (Target len > 25)


Greedy Search 31.67 23.86
2-LA 32.07 21.50
3-LA 32.20 22.78
4-LA 32.42 22.45
5-LA 32.41 23.30


Beam Search (B=10) 33.83 22.45
Table 2: Performance of the LSTM model trained on the WMT16 multimodal translation dataset with different LA steps. The look-ahead module improves the model on the full test set; however, both the LA module and beam search hurt the model when the target sentences are longer than 25 words.
Search Strategy BLEU


Greedy Search 27.50
2-LA 27.71
3-LA 27.62
4-LA 27.56
5-LA 27.35


Beam Search (B=10) 28.21
Table 3: Results of applying the LA module to the transformer model trained on the WMT14 dataset. The LA module slightly improves the original model but hurts performance when the LA step is 5. We suggest these results are partly caused by the EOS problem.
Figure 3: Average length differences between the predicted and ground-truth sequences. A positive number means the model tends to predict sentences longer than the ground truth, and vice versa. As the number of look-ahead steps increases, both models tend to predict shorter sequences.
Search Strategy γ=0.0 γ=0.25 γ=0.50 γ=0.75 γ=1.0 γ=1.25


Greedy 27.50 27.81 27.74 27.75 27.90 27.71
2-LA 27.71 28.05 27.95 27.99 28.20 27.85
3-LA 27.89 27.82 27.87 27.82 28.10 27.68
4-LA 27.56 27.81 27.87 27.74 27.84 27.68
5-LA 27.35 27.71 27.74 27.63 27.87 27.55
Table 4: Results of integrating the auxiliary EOS loss into the training stage, where γ is the weight of the auxiliary EOS loss. The EOS loss not only boosts performance under greedy search; with reasonable weights, the model is also more robust to larger look-ahead steps.

5 Auxiliary EOS Loss

To tackle the EOS problem, we introduce an auxiliary EOS loss. We test the model trained with the proposed auxiliary EOS loss under our DFS-based look-ahead setting, which is the more practical setting in the real world.

5.1 Methods

We ensure that the model does not ignore the EOS probability at time steps whose ground-truth token is not EOS (the negative EOS tokens). Given a batch of training data, the original sequence-modeling loss can be written as

L_MLE = -(1/N) Σ_{i=1}^{N} log p_θ(c_i),

where N is the batch size and c_i is the correct class of the i-th example in the batch. The original loss focuses only on the correct classes. To incorporate the EOS token into training, we treat the auxiliary EOS task as a binary classification problem, so the auxiliary EOS loss can be written as

L_EOS = -(1/N) Σ_{i=1}^{N} [ 1(c_i = EOS) log p_θ(EOS)_i + 1(c_i ≠ EOS) log(1 − p_θ(EOS)_i) ],

and the total loss is

L = L_MLE + γ L_EOS,    (1)

where γ is a scalar controlling the weight of the EOS loss.

5.2 Experiments

We integrate the EOS loss into the training stage of the transformer model on the WMT14 machine translation dataset. We train the transformer with auxiliary EOS loss weights ranging from 0.0 to 1.25 and compare the resulting models with the original model (γ = 0) under the greedy-search and look-ahead strategies. We also test the models with beam search, since larger beam sizes are sometimes found to seriously harm performance; we suspect this large-beam issue is also related to the EOS problem. To see the effectiveness of the EOS loss, we also report the average length difference of the model trained with the auxiliary EOS loss.

5.3 Results

In this experiment, we add the auxiliary EOS loss to the transformer models, setting γ in Eq. (1) to 0.0 (the original model), 0.25, 0.5, 0.75, 1.0, and 1.25. The results are shown in Table 4. The EOS loss consistently enhances the models under greedy search. Moreover, the models trained with the auxiliary loss are more robust to longer look-ahead steps for weights below 1.25; we obtain the best results with γ = 1.0. We further compare the auxiliary EOS loss model (γ = 1.0) with the original model under beam search, shown in Figure 4: the model trained with the auxiliary EOS loss surpasses the original model by a significant margin and, unlike the original model, is robust to large beam widths. In addition, Figure 5 plots the average length differences of the original model and the model with the auxiliary loss; training with the auxiliary EOS loss (γ = 1.0) encourages the model to predict longer sequences than the original model.

Figure 4: Results of the original model and the model with auxiliary EOS loss (γ = 1.0) at different beam sizes. The model trained with the auxiliary EOS loss is more robust to varying beam sizes than the original model.
Figure 5: Average length differences of the original model and the model with auxiliary EOS loss (γ = 1.0) at different LA steps. The model trained with the auxiliary EOS loss predicts longer sentences than the original model.

6 Conclusion and Future Work

Working on decoding strategies can help researchers pinpoint the problems of decoders and further improve the models by diagnosing the errors; our work is an example. We investigate decoders using our proposed look-ahead module and then fix the overestimated-EOS problem. In the look-ahead experiments, we find that the module improves results on the easier datasets but is less effective on a more difficult dataset, WMT14. Our analysis suggests that the overestimated EOS probability is one of the issues, and that it can be alleviated by training the model with the auxiliary EOS loss. There are other feasible approaches to solving the EOS problem and integrating our proposed look-ahead module. One possibility is an external classification network that predicts EOS at each time step instead of treating EOS as one of the vocabulary tokens. Another is incorporating the look-ahead module into the training stage and computing the auxiliary loss using the information it provides. Combining the look-ahead module with beam search is also promising. We hope this work encourages further research on decoder search strategies and on methods for analyzing model errors.


  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2425–2433. External Links: Link, Document Cited by: §1.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §1, §2.
  • [3] J. Chorowski and N. Jaitly (2017) Towards better decoding and language model integration in sequence to sequence models. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, pp. 523–527. External Links: Link Cited by: §2.
  • [4] Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush (2017) Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 980–989. External Links: Link Cited by: §1, §3.
  • [5] D. Elliott, S. Frank, K. Sima’an, and L. Specia (2016) Multi30K: multilingual English-German image descriptions. pp. 70–74. Cited by: §3.
  • [6] J. Gehring, M. Auli, D. Grangier, and Y. N. Dauphin (2017) A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 123–135. External Links: Link, Document Cited by: §4.2.
  • [7] A. Graves (2012) Supervised sequence labelling with recurrent neural networks. Studies in Computational Intelligence, Vol. 385, Springer. External Links: Link, Document, ISBN 978-3-642-24796-5 Cited by: §1.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Link, Document Cited by: §1.
  • [9] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. ArXiv e-prints. External Links: 1701.02810 Cited by: §4.2.
  • [10] J. Lei, L. Yu, M. Bansal, and T. L. Berg (2018) TVQA: localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1369–1379. External Links: Link Cited by: §1.
  • [11] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 2157–2169. External Links: Link Cited by: §1.
  • [12] M. Norouzi, S. Bengio, z. Chen, N. Jaitly, M. Schuster, Y. Wu, and D. Schuurmans (2016) Reward augmented maximum likelihood for neural structured prediction. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1723–1731. External Links: Link Cited by: §2.
  • [13] M. Post (2018-10) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191. External Links: Link, Document Cited by: §4.3.
  • [14] S. Sabour, W. Chan, and M. Norouzi (2019) Optimal completion distillation for sequence learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §2.
  • [15] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Driessche, T. Graepel, and D. Hassabis (2017-10) Mastering the game of go without human knowledge. Nature 550, pp. 354–359. External Links: Document Cited by: §1, §2.
  • [16] F. Stahlberg and B. Byrne (2019-11) On NMT search errors and model errors: cat got your tongue?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3354–3360. External Links: Link, Document Cited by: §2, §4.3.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 5998–6008. External Links: Link Cited by: §1.
  • [18] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3156–3164. External Links: Link, Document Cited by: §1.
  • [19] S. Wiseman and A. M. Rush (2016) Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1296–1306. External Links: Link Cited by: §2.
  • [20] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong (2018) End-to-end dense video captioning with masked transformer. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8739–8748. External Links: Link, Document Cited by: §1, §2.