1 Introduction
With the advent of deep learning, many applications of machine learning have converged on a similar set of methods and models. For example, the Transformer
(Vaswani et al., 2017) sequence-to-sequence architecture is ubiquitous in various fields of natural language processing (NLP) such as machine translation (MT), grammatical error correction (GEC), and speech recognition (Karita et al., 2019), and has also been applied successfully to other tasks such as computer vision (Dosovitskiy et al., 2021). Recent large pretrained NLP models such as BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019) are all based on the Transformer, with relatively minor changes to the architecture itself.

We show that despite this architectural uniformity, the learned distribution over sequences has strikingly different characteristics for different NLP tasks. Inspired by Ott et al. (2018), we identify intrinsic uncertainty – the nature of some NLP tasks to allow multiple viable outputs for a given input, sometimes referred to as aleatoric uncertainty in the literature (Der Kiureghian and Ditlevsen, 2009) – as a major factor that shapes the search space of Transformer models and determines its tractability. In MT – a task known to have high intrinsic uncertainty (Padó et al., 2009; Dreyer and Marcu, 2012; Ott et al., 2018) – Transformer models suffer from a high number of beam search errors (Stahlberg and Byrne, 2019), an inadequacy of the mode (Eikema and Aziz, 2020), and translation performance degradation with large beam sizes (Koehn and Knowles, 2017), also known as the “beam search curse”. In contrast, for the correction of writing errors in text (grammatical error correction – GEC) (Brockett et al., 2006), a task with a lower level of uncertainty (Bryant and Ng, 2015), none of these pathologies are evident. This pattern holds even at the sequence level: input sentences with high uncertainty tend to result in more search errors and a less tractable search space.

To study the influence of uncertainty on sequences around the mode, we propose an exact n-best search algorithm for neural sequence models. We show that the probability mass covered by the n best candidates differs markedly between certain and uncertain tasks and sentences, which shows that intrinsic uncertainty also affects the spread of probability mass and thus the model uncertainty. We confirm recent work showing that beam search has drawbacks as a decoding scheme for MT. Nevertheless, it is effective for GEC, a problem where modes are adequate, search errors are rare, and n-best lists cover a large fraction of the probability mass.
2 Measuring Intrinsic Uncertainty
Intrinsic uncertainty refers to the inherent nature of some NLP tasks to allow for more than one feasible output for a given input. For example, intrinsic uncertainty in MT stems from the fact that there are often several semantically equivalent translations for the same source sentence, or that the translation into a highly inflected language is sometimes underspecified (Ott et al., 2018). Studies have shown that even for tasks like GEC, annotators do not always agree (Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010; Bryant and Ng, 2015), but the level of intrinsic uncertainty is arguably lower than for MT because there is a limited number of ways to correct an ungrammatical sentence.
We propose a simple way to measure sentence-level output uncertainty by making use of multi-reference test sets. For an N-way annotated sentence with references y_1, …, y_N we define the uncertainty u as the average relative edit distance between two references:

u(y_1, \dots, y_N) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{d(y_i, y_j)}{\max(|y_i|, |y_j|)}   (1)

where d denotes the Levenshtein distance. Fig. 1 presents this uncertainty score for one MT test set and two GEC test sets. MT-ende is the official WMT19 English-German test set (Barrault et al., 2019) paired with the additional human-annotated “newstest2019 AR” references provided by Freitag et al. (2020); the AR references are created from scratch, unlike the other, paraphrasing references of Freitag et al. (2020). GEC-conll14 uses the 10 references published by Bryant and Ng (2015) for the CoNLL-2014 shared task on GEC (Ng et al., 2014), and GEC-jfleg is a 4-reference GEC test set that represents “a broad range of language proficiency levels” (Napoles et al., 2017). Our uncertainty measure reflects our intuition that MT is a significantly more uncertain task than GEC: the mean u value differs significantly between GEC and MT in each length bucket. For both tasks the uncertainty increases with the sentence length, as longer sentences typically have more feasible mappings than shorter ones. We use the edit distance rather than task-specific metrics like BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) since those are designed to be robust against uncertainty effects such as reordering or semantically equivalent references, precisely the kinds of effects we aim to capture with u. We follow Bryant and Ng (2015) in not using inter-annotator agreement statistics like Cohen’s kappa (Cohen, 1960) since they are more appropriate for classification into single, well-defined categories.
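As an illustration, the uncertainty measure can be sketched in code as follows. This is a hypothetical implementation: the exact normalization of the relative edit distance is an assumption here (each pairwise Levenshtein distance is divided by the length of the longer reference).

```python
from itertools import combinations

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def uncertainty(references):
    """Average relative edit distance over all pairs of references."""
    pairs = list(combinations(references, 2))
    return sum(levenshtein(a, b) / max(len(a), len(b))
               for a, b in pairs) / len(pairs)
```

With this definition, identical references yield u = 0 and completely disjoint references of equal length yield u = 1.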
3 Mode-seeking Search
Neural sequence-to-sequence models define a probability distribution P(y|x) over target sequences y given a source sequence x:

P(y \mid x) = \prod_{j=1}^{|y|} P(y_j \mid y_{<j}, x)   (2)

Sequences are typically computed over a subword (Sennrich et al., 2016; Kudo and Richardson, 2018) vocabulary \Sigma and end with a special end-of-sentence symbol </s>:

y \in \Sigma^* \text{</s>}   (3)

where \Sigma^* is the Kleene closure over \Sigma which includes the empty sequence \epsilon. Since sequence models are usually trained to maximize the probability of the sequences in the training set, a common strategy to use such a model for inference is to search for the most likely output sequence y^*, also known as the mode of the model distribution (in a Bayesian framework this is often referred to as maximum a posteriori (MAP) inference):

y^* = \arg\max_{y} P(y \mid x)   (4)
Eq. 4 is usually approximated using beam search. For analysis purposes, Stahlberg and Byrne (2019) proposed an exact depth-first search (DFS) algorithm that is guaranteed to find the mode.
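Such an exact DFS can be sketched as follows. The interface is hypothetical: score_fn(prefix) stands in for one forward pass through the model, returning the log-probabilities of all continuation tokens.

```python
import math

def dfs_mode(score_fn, eos, max_len):
    """Exact depth-first search for the highest-probability sequence.
    Relies on monotonicity: extending a prefix never increases its score,
    so any prefix scoring below the best complete hypothesis can be pruned."""
    best = {"seq": None, "logp": -math.inf}

    def explore(prefix, logp):
        if logp <= best["logp"]:          # prune: this subtree cannot win
            return
        # Expand highest-probability continuations first.
        for tok, tok_logp in sorted(score_fn(prefix).items(),
                                    key=lambda kv: -kv[1]):
            new_logp = logp + tok_logp
            if tok == eos:
                if new_logp > best["logp"]:
                    best["seq"], best["logp"] = prefix, new_logp
            elif len(prefix) < max_len:
                explore(prefix + [tok], new_logp)

    explore([], 0.0)
    return best["seq"], best["logp"]
```

Expanding the most probable token first tends to establish a strong pruning bound early, which is what makes exact search feasible on low-uncertainty search spaces.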
4 n-best Search
In addition to our investigations into the mode we also examine the cumulative probability mass that is covered by the n best hypotheses. If a hypothesis set covers a large fraction of the entire probability mass, it approximates the full model distribution well. Approximating the full model distribution is useful for various methods such as minimum risk training (Shen et al., 2016; Williams, 1992; Ranzato et al., 2015), minimum Bayes risk decoding (Kumar and Byrne, 2004; Stahlberg et al., 2017; Eikema and Aziz, 2020), etc. Ott et al. (2018) argued that the fraction of probability mass covered by a fixed number of candidates reflects the model uncertainty on the sequence level. We show that this model uncertainty is in line with our notion of intrinsic uncertainty that we measure with u (Sec. 2). To that end, we propose a generalization of the exact search algorithm of Stahlberg and Byrne (2019) that is able to find the n global best hypotheses rather than the single best one. Similarly to the single-best algorithm, we use the monotonicity of neural sequence model scores – extending a prefix can never increase its probability:

P(y_{1:j} \mid x) \leq P(y_{1:j-1} \mid x)   (5)

Stahlberg and Byrne (2019) keep track of the best complete (i.e. ending with the end-of-sentence symbol </s>) hypothesis score during search, and use it to safely prune entire subspaces using Eq. 5. In contrast, we keep track of the n-th best complete hypothesis score by keeping the n best complete hypotheses in a priority queue. Our exact n-best search algorithm is listed in Algorithm 1. Note that we recover the DFS scheme of Stahlberg and Byrne (2019) with n = 1.
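The generalization can be sketched as follows; this is an illustrative sketch rather than the paper's Algorithm 1 verbatim, assuming a hypothetical score_fn(prefix) that returns the log-probabilities of all continuation tokens. The single-best pruning bound is replaced by the n-th best complete score, maintained in a min-heap of size n.

```python
import heapq
import math

def exact_nbest(score_fn, eos, n, max_len):
    """Exact n-best search: prune a prefix only when it cannot beat
    the current n-th best complete hypothesis."""
    best = []                 # min-heap of (logp, seq); best[0] is n-th best
    stats = {"explored": 0}   # number of expanded states (model forward passes)

    def bound():
        return best[0][0] if len(best) == n else -math.inf

    def explore(prefix, logp):
        if logp <= bound():   # Eq. 5: no extension of prefix can beat the bound
            return
        stats["explored"] += 1
        for tok, tok_logp in score_fn(prefix).items():
            new_logp = logp + tok_logp
            if tok == eos:
                if new_logp > bound():
                    heapq.heappush(best, (new_logp, prefix))
                    if len(best) > n:
                        heapq.heappop(best)   # drop the (n+1)-th best
            elif len(prefix) < max_len:
                explore(prefix + [tok], new_logp)

    explore([], 0.0)
    return sorted(best, reverse=True), stats["explored"]
```

With n = 1 the heap holds a single hypothesis and the bound reduces to the single-best score, recovering plain exact DFS.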
5 Experimental Setup
Table 1: Model hyperparameters.
Parameter  Value 
Attention dropout rate  0.1 
Attention layer size  512 
Batch size  256 
Dropout rate  0.1 
Embedding size  512 
MLP dimension  2,048 
Number of attention heads  8 
Number of layers  6 
Total number of parameters  121M 
Table 2: Number of training sentence pairs per language pair, before and after filtering.
Language pair  Unfiltered  Filtered 
German-English  39M  33M 
Finnish-English  6.6M  5.5M 
Lithuanian-English  2.3M  2.0M 
We trained four Transformer neural machine translation (NMT) models (Table 1) – English-German (MT-ende), German-English (MT-deen), Finnish-English (MT-fien), and Lithuanian-English (MT-lten) – on the WMT19 (Barrault et al., 2019) training sets as provided by TensorFlow Datasets (https://www.tensorflow.org/datasets/catalog/wmt19_translate). We selected these language pairs to experiment with different training set sizes (Table 2). The MT training sets were filtered using language ID and simple length-based heuristics, and split into subwords using joint 32K SentencePiece (Kudo and Richardson, 2018) models. For training our GEC model we used the hyperparameters from Table 1 and followed the three-stage training recipe of Stahlberg and Kumar (2021) using the 32K SentencePiece model from Raffel et al. (2020). All our models were trained until convergence on the development set using the LAMB (You et al., 2020) optimizer in JAX (Bradbury et al., 2018) by minimizing cross-entropy without label smoothing. Our NMT models are evaluated on the WMT19 test sets (Barrault et al., 2019) using SacreBLEU (Post, 2018). Our GEC model is evaluated on the CoNLL-14 test set (Ng et al., 2014, GEC-conll14) using F-scores computed with the M2 scorer (Dahlmeier and Ng, 2012) and on the JFLEG test set (Napoles et al., 2017, GEC-jfleg) using GLEU (Napoles et al., 2015).

6 Results
Table 3: Test set BLEU of our NMT baselines compared to the best systems in the literature.
System  ende  deen  fien  lten 
Xia et al. (2019)  44.9  42.8  31.9  35.6 
Our baselines  39.6  39.7  27.7  26.9 
Table 4: Our GEC baseline compared to the best systems in the literature.
System  conll14 (F)  jfleg (GLEU) 
Lichtarge et al. (2020)  66.8  64.9 
Rothe et al. (2021)  68.9  – 
Our baseline  60.0  62.1 
In this work our focus is to analyze the impact of intrinsic uncertainty on search. Thus we keep our setup simple, reproducible, and computationally economical rather than aiming for new state-of-the-art results. Nevertheless, Tables 3 and 4 show that our baselines are not unreasonably far from the best results in the literature, given that the systems we compare with are often highly engineered and use many more parameters. Xia et al. (2019) used various techniques like back-translation, ensembling, dual learning, MASS pre-training, architecture search, larger models, etc. to improve their systems, and Rothe et al. (2021) used an 11B-parameter T5 (Raffel et al., 2020) model.
6.1 Finding the Most Likely Hypothesis
Even though alternative decision rules like MBR have recently received some attention in the NMT literature (Eikema and Aziz, 2020; Müller and Sennrich, 2021), mode-seeking decoding schemes such as beam search or nucleus sampling (Holtzman et al., 2020) are by far the most common choices. In this section we explore how uncertainty changes the mode and the ability of beam search to find it.
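For reference, length-unnormalized beam search can be sketched as follows; this is an illustrative sketch rather than the implementation used in the experiments, again assuming a hypothetical score_fn(prefix) that returns the log-probabilities of all continuation tokens.

```python
import math

def beam_search(score_fn, eos, beam_size, max_len):
    """Standard beam search over log-probabilities (no length normalization).
    Assumes at least one hypothesis ends in eos within max_len steps."""
    beams = [([], 0.0)]       # (prefix, log P(prefix | x))
    complete = []             # finished hypotheses, eos included in the prefix
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for tok, tok_logp in score_fn(prefix).items():
                hyp = (prefix + [tok], logp + tok_logp)
                (complete if tok == eos else candidates).append(hyp)
        # Keep only the beam_size best unfinished prefixes.
        beams = sorted(candidates, key=lambda h: -h[1])[:beam_size]
        if not beams:
            break
    return max(complete, key=lambda h: h[1])
```

Because pruning is local to each time step, beam search can discard a prefix of the true mode early, which is exactly the kind of search error analyzed below.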
A well-known pathology of NMT models is the “beam search curse” (Koehn and Knowles, 2017): increasing the beam size improves the predictive log-probabilities of the hypotheses, but it leads to worse translation quality due to the NMT model error of preferring short translations. We replicate this result in Fig. 2: BLEU scores for MT initially improve over greedy search at smaller beam sizes, but after reaching a peak at a beam size of 4 we observe a dramatic drop in BLEU. The trajectory of the blue curves (GEC) is markedly different: the performance does not drop for large beams but saturates instead. The beam search curse affects tasks with high intrinsic uncertainty like MT but spares more certain tasks like GEC, although both tasks use the same Transformer architecture.
To determine why the beam size affects NMT and GEC so differently, we ran the exact decoding algorithm of Stahlberg and Byrne (2019) to find the global best hypotheses and counted search errors, i.e. the number of sentences in the test set for which beam search does not find the global best sequence. Our results confirm the findings of Stahlberg and Byrne (2019) that increasing the beam size leads to fewer NMT search errors (Fig. 3). Among our MT language pairs, English-German (MT-ende) suffers the most from the beam search curse and has the highest proportion of search errors in the test set, possibly because translating from English to German typically results in a longer output sequence and thus more uncertainty. GEC differs significantly from NMT in the total number of search errors. For MT, even with a very large beam size of 500, beam search does not find the mode for more than 20% of the sentences in any language pair. In contrast, for GEC we do not observe any search errors for beam sizes larger than 10. This suggests that task uncertainty determines the tractability of the search space and particularly of the search for the mode.
Uncertainty also determines the computational costs of exact search. To abstract away from hardware and implementation details, we measure the time complexity of exact search by counting the number of explored states, i.e. the number of forward passes through the model, which is identical to the number of recursive calls of Algorithm 1. (For comparison, the number of explored states in standard beam search is the beam size times the target sequence length.)
[Figure 5: (a) Greedy search errors (GEC-conll14); (b) Greedy search errors (GEC-jfleg); (c) Number of explored DFS states (GEC-conll14); (d) Number of explored DFS states (GEC-jfleg)]
[Figure 6: (a) Greedy search errors (MT-ende); (b) Number of explored DFS states (MT-ende)]
Fig. 4 plots the fraction of sentences in the test set for which exact search terminates within a given maximum number of explored states. For example, exact search returned the mode for around 50% of the MT sentences after exploring no more than 1,000 states. With the same computational budget, however, it was able to find the mode for nearly 100% of the GEC sentences (blue curves). For some of the MT sentences, exact search needed to explore around 100K states, or even more in the case of Lithuanian-English (orange curve).
Sentence-level uncertainty
In the previous paragraph we showed that MT, a task with high intrinsic uncertainty, suffers from more beam search errors and a less tractable search space than GEC, a task with relatively low intrinsic uncertainty. Figs. 5 and 6 demonstrate that this pattern is not only present at the task level but also at the sentence level. First, the bar charts show a general trend towards more search errors and more explored states for longer sentences. Longer input sentences often result in higher-entropy distributions (i.e. more uncertainty) since there are usually more ways to map a long sentence than a short one. We also see a pattern within each group, i.e. within a reference length interval: sentences with higher uncertainty u result in more search errors and a longer exact search runtime even when compared to other sentences of similar length. Table 5 lists the test set level correlation coefficients.
6.2 The Spread of Probability Mass
We argued in Sec. 4 that the ability to approximate the entire search space with a fixed set of candidates can be useful in training (Shen et al., 2016; Williams, 1992; Ranzato et al., 2015) and decoding (Kumar and Byrne, 2004; Eikema and Aziz, 2020), and proposed an exact n-best search algorithm. However, finding the exact n best hypotheses is computationally much more expensive than finding the single best hypothesis (mode). Therefore, to keep the runtime under control, we stopped n-best decoding after 1M explored states. Fig. 7 shows that the 1M threshold is not reached for n = 1 for any sentence: it was always possible to find and verify the mode. We can guarantee that the n best candidates returned by our algorithm are indeed the global best ones for around 90% of the MT-deen sentences (right end of the green curve in Fig. 7). The blue curves in Fig. 7 suggest that, as before, the GEC search space is much more tractable, given that our exact n-best search algorithm was able to find the 100 global best hypotheses for all GEC sentences before reaching 1M explored states. Indeed, Fig. 8 shows that exact 100-best search terminated with fewer than 10K explored states for almost all GEC sentences, while the pruning criterion in Eq. 5 is much less effective for the NMT search space (green curves in Fig. 8).
The cumulative probability mass of the set returned by exact n-best search is an upper bound on the cumulative probability mass of any hypothesis set of cardinality n. Despite the high number of search errors (Fig. 3), the probability mass covered by the n best beam search hypotheses is very close to this upper bound: Fig. 9 shows that the difference is less than 0.001 for all setups except MT-fien. Since the difference in probability mass is negligible, we ran our subsequent investigations of probability mass with beam search instead of exact search to save computational costs.
Fig. 10 visualizes the difference between NMT and GEC in terms of the probability mass covered by the beam search hypotheses. We confirm the finding of Ott et al. (2018) and Eikema and Aziz (2020) that the NMT distribution is rather flat: even 1,000 MT candidates cover only 20% of the probability mass on average. In GEC, however, the model assigns twice as much probability (40%) to the single best hypothesis on average (left end of the blue curves in Fig. 10). Fig. 11 provides even more insight: a beam size of 1,000 covers 40% of the probability mass for nearly all sentences in the GEC test sets. Even the more practical beam size of 10 covers more than half of the probability mass for around 75% of the GEC-conll14 sentences. The same plot looks very different for MT (Fig. 12): covering half the probability mass is only possible for a tiny fraction of the MT sentences.
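The coverage statistic is straightforward to compute from hypothesis log-probabilities: since the model distribution sums to one over all complete sequences, the mass covered by an n-best list is the sum of the hypotheses' exponentiated scores. A minimal sketch (a hypothetical helper, not from the paper):

```python
import math

def coverage_curve(logprobs):
    """Cumulative probability mass covered by the top-n hypotheses,
    for n = 1 .. len(logprobs), given hypothesis log-probabilities."""
    probs = sorted((math.exp(lp) for lp in logprobs), reverse=True)
    curve = []
    total = 0.0
    for p in probs:
        total += p            # running mass of the best n hypotheses
        curve.append(total)
    return curve
```

For example, hypotheses with probabilities 0.4, 0.2, and 0.1 yield a coverage curve of 0.4, 0.6, 0.7 for n = 1, 2, 3.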
[Figure: (a) GEC-conll14; (b) GEC-jfleg]
Table 5: Test set level correlation coefficients between u and search statistics.
between u and…  GEC conll14  GEC jfleg  MT ende 
Greedy search errors  0.18  0.19  0.24 
#Explored DFS states  0.20  0.18  0.19 
Cumul. prob. mass  0.23  0.51  0.53 
Sentence-level uncertainty
In Sec. 6.1 we reported that the effects of intrinsic uncertainty on the ability to find the mode are visible at both the task and the sentence level. Similarly, our observations about how uncertainty determines the probability mass of n-best lists can be tracked down to the sentence level. Fig. 13 shows that the cumulative probability mass in the n-best list decreases for longer sentences, as the mappings of long sentences are more uncertain. Again, the trend within each group in Fig. 13 suggests that even among sentences of similar length, n-best lists for uncertain sentences (higher u) accumulate less probability mass. We make analogous observations for NMT (Fig. 14), although the total n-best probability mass is much smaller than for GEC.
7 Related Work
Ambiguity is one of the core challenges in MT, a fact that is supported (inter alia) by the long history of designing evaluation metrics that are robust against it (Papineni et al., 2002; Banerjee and Lavie, 2005; Sellam et al., 2020). In this work we examine the impact of ambiguity on the NMT search space, and show how it is related to various well-known issues of NMT models like the beam search curse (Koehn and Knowles, 2017), a pathology that has also been linked to the local normalization in sequence models (Sountsov and Sarawagi, 2016; Murray and Chiang, 2018) and to poor model calibration (Kumar and Sarawagi, 2019).

Our work is heavily inspired by Ott et al. (2018), who analyzed different kinds of uncertainty in NMT. In particular, they found that NMT spreads out the probability mass over a large number of candidates, and connected the beam search curse with uncertainty. We confirm their results and extend their line of research along the following directions: First, we introduce a measure for uncertainty in multi-reference test sets, and show that the negative effects of uncertainty are visible even at the sentence level. Second, we propose an exact n-best search algorithm and demonstrate how it can be used to analyze the spread of probability mass. Third, we focus not only on MT but also on GEC.
Stahlberg and Byrne (2019) showed that beam search errors often obscure the length deficiency of the NMT modes, and that reducing search errors by using large beams exposes this model error. In this work, we found that these mechanics are limited to NMT: GEC does not suffer from the beam search curse since search errors are rare and modes are not too short. Eikema and Aziz (2020) suggested that picking a hypothesis based solely on probability is erratic because NMT spreads out the probability mass over a large set of hypotheses with similar probabilities. Therefore, alternative approaches that incorporate MT-specific metrics such as BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) in addition to the probabilities have recently been a focus of research, including minimum Bayes risk decoding (Eikema and Aziz, 2020, 2021; Müller and Sennrich, 2021), Monte-Carlo tree search (Leblond et al., 2021), and energy-based (Bhattacharyya et al., 2021) or discriminatively trained (Lee et al., 2021) rerankers. Our work on how uncertainty determines the spread of probability mass is relevant to those approaches.
8 Conclusion
We identified a major culprit behind various inference-related issues in sequence-to-sequence models, such as the intractability of the search space, degenerate outputs under large-beam or exact search, and the large spread of probability mass over the output space. This factor is intrinsic uncertainty – the existence of multiple ways to correctly map an input sequence. We measured the intrinsic uncertainty of input sentences as the degree of disagreement between multiple references and showed that ambiguous sentences typically result in a higher number of beam search errors and an exceedingly flat output distribution. We also found that known NMT pathologies such as the beam search curse or inadequate modes do not extend to less ambiguous tasks like GEC, despite the same neural architecture being used.
References
 Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
 Barrault et al. (2019) Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
 Bhattacharyya et al. (2021) Sumanta Bhattacharyya, Amirmohammad Rooshenas, Subhajit Naskar, Simeng Sun, Mohit Iyyer, and Andrew McCallum. 2021. Energy-based reranking: Improving neural machine translation using energy-based models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4528–4537, Online. Association for Computational Linguistics.
 Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. 2018. JAX: Composable transformations of Python+NumPy programs.
 Brockett et al. (2006) Chris Brockett, William B. Dolan, and Michael Gamon. 2006. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 249–256, Sydney, Australia. Association for Computational Linguistics.
 Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
 Bryant and Ng (2015) Christopher Bryant and Hwee Tou Ng. 2015. How far are we from fully automatic high quality grammatical error correction? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 697–707, Beijing, China. Association for Computational Linguistics.
 Cohen (1960) Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46.
 Dahlmeier and Ng (2012) Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572, Montréal, Canada. Association for Computational Linguistics.
 Der Kiureghian and Ditlevsen (2009) Armen Der Kiureghian and Ove Ditlevsen. 2009. Aleatory or epistemic? Does it matter? Structural safety, 31(2):105–112.
 Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
 Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
 Dreyer and Marcu (2012) Markus Dreyer and Daniel Marcu. 2012. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171, Montréal, Canada. Association for Computational Linguistics.
 Eikema and Aziz (2020) Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506–4520, Barcelona, Spain (Online). International Committee on Computational Linguistics.
 Eikema and Aziz (2021) Bryan Eikema and Wilker Aziz. 2021. Sampling-based minimum Bayes risk decoding for neural machine translation. arXiv preprint arXiv:2108.04718.
 Freitag et al. (2020) Markus Freitag, George Foster, David Grangier, and Colin Cherry. 2020. Human-paraphrased references improve neural machine translation. In Proceedings of the Fifth Conference on Machine Translation, pages 1183–1192, Online. Association for Computational Linguistics.
 Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.
 Karita et al. (2019) Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al. 2019. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 449–456. IEEE.
 Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.
 Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
 Kumar and Sarawagi (2019) Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.
 Kumar and Byrne (2004) Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA. Association for Computational Linguistics.
 Leblond et al. (2021) Rémi Leblond, Jean-Baptiste Alayrac, Laurent Sifre, Miruna Pislar, Jean-Baptiste Lespiau, Ioannis Antonoglou, Karen Simonyan, and Oriol Vinyals. 2021. Machine translation decoding beyond beam search. arXiv preprint arXiv:2104.05336.
 Lee et al. (2021) Ann Lee, Michael Auli, and Marc’Aurelio Ranzato. 2021. Discriminative reranking for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7250–7264, Online. Association for Computational Linguistics.
 Lichtarge et al. (2020) Jared Lichtarge, Chris Alberti, and Shankar Kumar. 2020. Data weighted training strategies for grammatical error correction. Transactions of the Association for Computational Linguistics, 8:634–646.
 Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
 Müller and Sennrich (2021) Mathias Müller and Rico Sennrich. 2021. Understanding the properties of minimum Bayes risk decoding in neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 259–272, Online. Association for Computational Linguistics.
 Murray and Chiang (2018) Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223, Brussels, Belgium. Association for Computational Linguistics.
 Napoles et al. (2015) Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China. Association for Computational Linguistics.
 Napoles et al. (2017) Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229–234, Valencia, Spain. Association for Computational Linguistics.
 Ng et al. (2014) Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14, Baltimore, Maryland. Association for Computational Linguistics.
 Ott et al. (2018) Myle Ott, Michael Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pages 3956–3965. PMLR.
 Padó et al. (2009) Sebastian Padó, Daniel Cer, Michel Galley, Dan Jurafsky, and Christopher D. Manning. 2009. Measuring machine translation quality as semantic equivalence: A metric based on entailment features. Machine Translation, 23(2-3):181–193.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
 Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
 Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research, 21(140):1–67.
 Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.
 Rothe et al. (2021) Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2021. A simple recipe for multilingual grammatical error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 702–707, Online. Association for Computational Linguistics.
 Rozovskaya and Roth (2010) Alla Rozovskaya and Dan Roth. 2010. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 28–36, Los Angeles, California. Association for Computational Linguistics.
 Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
 Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692, Berlin, Germany. Association for Computational Linguistics.
 Sountsov and Sarawagi (2016) Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1516–1525, Austin, Texas. Association for Computational Linguistics.
 Stahlberg and Byrne (2019) Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362, Hong Kong, China. Association for Computational Linguistics.
 Stahlberg et al. (2017) Felix Stahlberg, Adrià de Gispert, Eva Hasler, and Bill Byrne. 2017. Neural machine translation by minimising the Bayes-risk with respect to syntactic translation lattices. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 362–368, Valencia, Spain. Association for Computational Linguistics.
 Stahlberg and Kumar (2021) Felix Stahlberg and Shankar Kumar. 2021. Synthetic data generation for grammatical error correction with tagged corruption models. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 37–47, Online. Association for Computational Linguistics.
 Tetreault and Chodorow (2008) Joel Tetreault and Martin Chodorow. 2008. Native judgments of non-native usage: Experiments in preposition error detection. In Coling 2008: Proceedings of the workshop on Human Judgements in Computational Linguistics, pages 24–32, Manchester, UK. Coling 2008 Organizing Committee.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
 Williams (1992) Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256.
 Xia et al. (2019) Yingce Xia, Xu Tan, Fei Tian, Fei Gao, Di He, Weicong Chen, Yang Fan, Linyuan Gong, Yichong Leng, Renqian Luo, Yiren Wang, Lijun Wu, Jinhua Zhu, Tao Qin, and Tie-Yan Liu. 2019. Microsoft Research Asia’s systems for WMT19. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 424–433, Florence, Italy. Association for Computational Linguistics.
 Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
 You et al. (2020) Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and ChoJui Hsieh. 2020. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations.