
Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models

In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models, we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence and the task level, intrinsic uncertainty has major implications for various aspects of search, such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with a high level of ambiguity such as MT, but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty, as the model tends to overly spread out the probability mass for uncertain tasks and sentences.





1 Introduction

With the advent of deep learning, many applications of machine learning have converged on a similar set of methods and models. For example, the Transformer (Vaswani et al., 2017) sequence-to-sequence architecture is ubiquitous in various fields of natural language processing (NLP) such as machine translation (MT), grammatical error correction (GEC), and speech recognition (Karita et al., 2019), and has also been applied successfully to other tasks such as computer vision (Dosovitskiy et al., 2021). Recent large pre-trained NLP models such as BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019) are all based on the Transformer, with relatively minor changes to the architecture itself.

We show that despite this architectural uniformity, the learned distribution over sequences has strikingly different characteristics for different NLP tasks. Inspired by Ott et al. (2018), we identify intrinsic uncertainty – the nature of some NLP tasks to allow multiple viable outputs for a given input, sometimes referred to as aleatoric uncertainty (Der Kiureghian and Ditlevsen, 2009) – as a major factor that shapes the search space of Transformer models and determines its tractability. In machine translation (MT) – a task known to have high intrinsic uncertainty (Padó et al., 2009; Dreyer and Marcu, 2012; Ott et al., 2018) – Transformer models suffer from a high number of beam search errors (Stahlberg and Byrne, 2019), an inadequacy of the mode (Eikema and Aziz, 2020), and translation performance degradation with large beam sizes (Koehn and Knowles, 2017), also known as the “beam search curse”. In contrast, for the correction of writing errors in text (grammatical error correction – GEC) (Brockett et al., 2006), a task with a lower level of uncertainty (Bryant and Ng, 2015), none of these pathologies are evident. This pattern holds even at the sentence level: input sentences with high uncertainty tend to result in more search errors and a less tractable search space. To study the influence of uncertainty on sequences around the mode, we propose an exact n-best search algorithm for neural sequence models. We show that the probability mass covered by the n-best candidates differs markedly between certain and uncertain tasks and sentences, which shows that intrinsic uncertainty also affects the spread of probability mass and thus the model uncertainty. We confirm recent work showing that beam search has drawbacks as a decoding scheme for MT. Nevertheless, it is effective for GEC, a problem where modes are adequate, search errors are rare, and n-best lists cover a large fraction of the probability mass.

2 Measuring Intrinsic Uncertainty

Intrinsic uncertainty refers to the inherent nature of some NLP tasks to allow for more than one feasible output for a given input. For example, intrinsic uncertainty in MT stems from the fact that there are often several semantically equivalent translations for the same source sentence, or that the translation into a highly inflected language is sometimes under-specified (Ott et al., 2018). Studies have shown that even for tasks like GEC, annotators do not always agree (Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010; Bryant and Ng, 2015), but the level of intrinsic uncertainty is arguably lower than for MT because there is a limited number of ways to correct an ungrammatical sentence.

Figure 1: Average uncertainty u for GEC (blue) and English-German MT (purple), grouped by sentence length. The error bars show the standard error of the mean (SEM).

We propose a simple way to measure sentence-level output uncertainty by making use of multi-reference test sets. For an n-way annotated sentence with references y^(1), ..., y^(n) we define the uncertainty u as the average relative edit distance between two references, normalizing each pairwise distance by the length of the longer reference:

    u(y^(1), ..., y^(n)) = 2 / (n(n-1)) ∑_{i=1}^{n} ∑_{j=i+1}^{n} d(y^(i), y^(j)) / max(|y^(i)|, |y^(j)|)    (1)
where d(·, ·) denotes the Levenshtein distance. Fig. 1 presents this uncertainty score for one MT test set and two GEC test sets. MT-ende is the official WMT19 English-German test set (Barrault et al., 2019) paired with the additional human-annotated “newstest2019 AR” references provided by Freitag et al. (2020). (The AR references are created from scratch, unlike the other, paraphrasing references by Freitag et al. (2020).) GEC-conll14 uses the 10 references published by Bryant and Ng (2015) for the CoNLL-2014 shared task on GEC (Ng et al., 2014), and GEC-jfleg is a 4-reference GEC test set that represents “a broad range of language proficiency levels” (Napoles et al., 2017). Our uncertainty measure reflects our intuition that MT is a significantly more uncertain task than GEC. (The mean u value differs significantly between GEC and MT in each length bucket under a two-sample t-test.) For both tasks the uncertainty increases with the sentence length, as longer sentences typically have more feasible mappings than shorter ones. We use the edit distance rather than task-specific metrics like BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) since those are designed to be robust against uncertainty effects such as reordering or semantically equivalent references, precisely the kinds of effects we aim to capture with u. We follow Bryant and Ng (2015) in not using inter-annotator agreement statistics like Cohen’s kappa (Cohen, 1960) since they are more appropriate for classification into single, well-defined categories.
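The uncertainty measure can be computed in a few lines of Python. This is an illustrative sketch, not the paper's implementation: whitespace tokenization and normalizing each pairwise distance by the longer reference are our assumptions.

```python
from itertools import combinations

def levenshtein(a, b):
    """Edit distance between two token sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ta != tb)))    # substitution
        prev = cur
    return prev[-1]

def uncertainty(references):
    """Average pairwise edit distance, normalized by the longer reference."""
    tokenized = [r.split() for r in references]
    pairs = list(combinations(tokenized, 2))
    return sum(levenshtein(a, b) / max(len(a), len(b), 1)
               for a, b in pairs) / len(pairs)
```

Identical references yield u = 0, while completely disjoint references approach u = 1.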

3 Mode-seeking Search

Neural sequence-to-sequence models define a probability distribution P(y|x) over target sequences y given a source sequence x:

    P(y|x) = ∏_{j=1}^{|y|} P(y_j | y_{<j}, x)    (2)
Sequences are typically computed over a subword (Sennrich et al., 2016; Kudo and Richardson, 2018) vocabulary V and end with a special end-of-sentence symbol </s>:

    y ∈ {y' </s> : y' ∈ V*}    (3)

where V* is the Kleene closure over V, which includes the empty sequence ε. Since sequence models are usually trained to maximize the probability of the sequences in the training set, a common strategy to use such a model for inference is to search for the most likely output sequence y*, also known as the mode of the model distribution (in a Bayesian framework this is often referred to as maximum a posteriori (MAP) inference):

    y* = argmax_y P(y|x)    (4)
Eq. 4 is usually approximated using beam search. For analysis purposes, Stahlberg and Byrne (2019) proposed an exact depth-first search (DFS) algorithm that is guaranteed to find the mode.
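Since beam search plays a central role in what follows, here is a minimal sketch of how it approximates the mode. The toy next-token distribution `toy_log_p` is a hypothetical stand-in for a trained Transformer, not part of the paper's setup:

```python
import math

def beam_search(log_p, vocab, eos, beam_size=4, max_len=10):
    """Approximate the mode y* = argmax_y P(y|x) with standard beam search."""
    beams = [((), 0.0)]                  # (prefix, cumulative log-prob)
    finished = []                        # complete hypotheses ending in </s>
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in vocab + [eos]:
                cand = (prefix + (tok,), score + log_p(prefix, tok))
                (finished if tok == eos else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_size]
        if not beams:
            break
    return max(finished, key=lambda c: c[1]) if finished else beams[0]

# Toy next-token distribution standing in for a trained model
# (hypothetical, for illustration only).
def toy_log_p(prefix, tok):
    table = ({"a": 0.6, "b": 0.2, "</s>": 0.2} if not prefix
             else {"a": 0.1, "b": 0.1, "</s>": 0.8})
    return math.log(table[tok])

mode, mode_score = beam_search(toy_log_p, ["a", "b"], "</s>")
```

Because only `beam_size` prefixes survive each step, beam search can commit to a locally strong prefix and miss the true mode, which is exactly the kind of search error analyzed below.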

4 n-best Search

0:  NbestDFS(x, y, p, γ, Q): search for the global n best sequences
      x: source sequence
      y: target prefix (default: ε)
      p: log P(y|x) (default: 0)
      γ: lower bound, i.e. the score of the n-th best complete hypothesis found so far (default: -∞)
      Q: priority queue holding the n best complete hypotheses found so far (default: ∅)
1:  if y ends with </s> then
2:     Q.push((p, y)); if |Q| > n then remove the lowest-scoring entry from Q
3:     if |Q| = n then
4:        γ ← min_{(p', y') ∈ Q} p'
5:     end if
6:     return (γ, Q)
7:  end if
8:  for all t ∈ V ∪ {</s>} do
9:     p_t ← p + log P(t | x, y)
10:     if p_t > γ then
11:        (γ, Q) ← NbestDFS(x, y · t, p_t, γ, Q)
12:     end if
13:  end for
14:  return (γ, Q)
Algorithm 1 NbestDFS

In addition to our investigations into the mode we also examine the cumulative probability mass that is covered by the n best hypotheses. If a hypothesis set covers a large fraction of the entire probability mass, it approximates the full model distribution well. Approximating the full model distribution is useful for various methods such as minimum risk training (Shen et al., 2016), reinforcement learning (Williams, 1992; Ranzato et al., 2015), minimum Bayes risk decoding (Kumar and Byrne, 2004; Stahlberg et al., 2017; Eikema and Aziz, 2020), etc. Ott et al. (2018) argued that the fraction of probability mass which is covered by a fixed number of candidates reflects the model uncertainty on the sequence level. We show that this model uncertainty is in line with our notion of intrinsic uncertainty that we measure with u (Sec. 2). To that end, we propose a generalization of the exact search algorithm of Stahlberg and Byrne (2019) that is able to find the global n best hypotheses rather than the single best one. Similarly to the single-best algorithm, we use the monotonicity of neural sequence model scores:

    log P(y_{1:j} | x) ≤ log P(y_{1:j-1} | x)  for all j    (5)
Stahlberg and Byrne (2019) keep track of the best complete (i.e. ending with the end-of-sentence symbol </s>) hypothesis score during search, and use it to safely prune entire subspaces using Eq. 5. In contrast, we keep track of the n-th best complete hypothesis score by keeping the n best complete hypotheses in a priority queue. Our exact n-best search algorithm is listed in Algorithm 1. Note that we recover the DFS scheme of Stahlberg and Byrne (2019) with n = 1.
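A compact Python sketch of this n-best DFS, with the priority queue realized as a min-heap whose smallest score serves as the pruning lower bound. The toy scorer `toy_log_p` and the `max_len` cap are illustrative assumptions, not part of the algorithm itself:

```python
import heapq
import math

def nbest_dfs(log_p, vocab, eos, n=2, max_len=10,
              prefix=(), score=0.0, queue=None):
    """Exact n-best depth-first search in the spirit of Algorithm 1.

    `queue` is a min-heap of the n best complete hypotheses found so far;
    its smallest score is the lower bound used for pruning (Eq. 5).
    """
    if queue is None:
        queue = []
    # Prune: scores only decrease when a prefix is extended, so any prefix
    # scoring no better than the current n-th best hypothesis is hopeless.
    if len(queue) == n and score <= queue[0][0]:
        return queue
    if prefix and prefix[-1] == eos:         # complete hypothesis
        heapq.heappush(queue, (score, prefix))
        if len(queue) > n:
            heapq.heappop(queue)             # keep only the n best
        return queue
    if len(prefix) >= max_len:               # length cap for the toy setup
        return queue
    for tok in vocab + [eos]:
        queue = nbest_dfs(log_p, vocab, eos, n, max_len,
                          prefix + (tok,), score + log_p(prefix, tok), queue)
    return queue

# Toy next-token distribution (hypothetical, for illustration only).
def toy_log_p(prefix, tok):
    table = ({"a": 0.6, "b": 0.2, "</s>": 0.2} if not prefix
             else {"a": 0.1, "b": 0.1, "</s>": 0.8})
    return math.log(table[tok])

top2 = sorted(nbest_dfs(toy_log_p, ["a", "b"], "</s>", n=2), reverse=True)
```

With n = 1 this reduces to the exact DFS of Stahlberg and Byrne (2019), up to the illustrative length cap.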

5 Experimental Setup

Parameter                    Value
Attention dropout rate       0.1
Attention layer size         512
Batch size                   256
Dropout rate                 0.1
Embedding size               512
MLP dimension                2,048
Number of attention heads    8
Number of layers             6
Total number of parameters   121M
Table 1: Transformer hyper-parameters.
Language pair        Unfiltered   Filtered
German-English       39M          33M
Finnish-English      6.6M         5.5M
Lithuanian-English   2.3M         2.0M
Table 2: MT training set sizes (number of training sentence pairs before and after filtering).

We trained four Transformer neural machine translation (NMT) models (Table 1) – English-German (MT-ende), German-English (MT-deen), Finnish-English (MT-fien), and Lithuanian-English (MT-lten) – on the WMT19 (Barrault et al., 2019) training sets as provided by TensorFlow Datasets. (We selected these language pairs to experiment with different training set sizes; see Table 2.) The MT training sets were filtered using language ID and simple length-based heuristics, and split into subwords using joint 32K SentencePiece (Kudo and Richardson, 2018) models. For training our GEC model we used the hyper-parameters from Table 1 and followed the three-stage training recipe of Stahlberg and Kumar (2021) using the 32K SentencePiece model from Raffel et al. (2020). All our models were trained until convergence on the development set using the LAMB (You et al., 2020) optimizer in JAX (Bradbury et al., 2018) by minimizing cross-entropy without label smoothing. Our NMT models are evaluated on the WMT19 test sets (Barrault et al., 2019) using SacreBLEU (Post, 2018). Our GEC model is evaluated on the CoNLL14 test set (Ng et al., 2014, GEC-conll14) using F0.5 scores computed with the M2 scorer (Dahlmeier and Ng, 2012) and on the JFLEG test set (Napoles et al., 2017, GEC-jfleg) using GLEU (Napoles et al., 2015).

6 Results

System               ende   deen   fien   lten
Xia et al. (2019)    44.9   42.8   31.9   35.6
Our baselines        39.6   39.7   27.7   26.9
Table 3: BLEU scores of our NMT baselines and one of the best systems in the WMT19 evaluation campaign – MSRA.MADL (Xia et al., 2019).
System                    conll14 (F0.5)   jfleg (GLEU)
Lichtarge et al. (2020)   66.8             64.9
Rothe et al. (2021)       68.9             -
Our baseline              60.0             62.1
Table 4: Comparison of our GEC baseline with the best results reported in the literature.

In this work our focus is to analyze the impact of intrinsic uncertainty on search. Thus we keep our setup simple, reproducible, and computationally economical rather than aim for new state-of-the-art results. Nevertheless, Tables 3 and 4 show that our baselines are reasonably close to the best results in the literature, given that the systems we compare with are often highly engineered and use many more parameters. Xia et al. (2019) used various techniques like back-translation, ensembling, dual learning, MASS pre-training, architecture search, larger models, etc. to improve their systems, and Rothe et al. (2021) used an 11B-parameter T5 (Raffel et al., 2020) model.

6.1 Finding the Most Likely Hypothesis

Even though alternative decision rules like MBR have recently received some attention in the NMT literature (Eikema and Aziz, 2020; Müller and Sennrich, 2021), mode-seeking decoding schemes such as beam search or Nucleus sampling (Holtzman et al., 2020) are by far the most common choices. In this section we explore how uncertainty changes the mode and the ability of beam search to find it.

Figure 2: Relative beam search improvements over greedy search. MT quality degrades with large beam sizes, but GEC saturates after a beam size of 10.

A well-known pathology of NMT models is the “beam search curse” (Koehn and Knowles, 2017): increasing the beam size improves the predictive log-probabilities of the hypotheses, but it leads to worse translation quality due to the NMT model error of preferring short translations. We replicate this result in Fig. 2: BLEU scores for MT initially improve over greedy search at smaller beam sizes, but after reaching a peak at a beam size of 4 we observe a dramatic drop in BLEU. The trajectory of the blue curves (GEC) is markedly different: the performance does not drop for large beams but saturates instead. The beam search curse affects tasks with high intrinsic uncertainty like MT but spares more certain tasks like GEC, although both tasks use the same Transformer architecture.

To determine why the beam size affects NMT and GEC so differently, we ran the exact decoding algorithm of Stahlberg and Byrne (2019) to find the global best hypotheses and counted search errors, i.e. the number of sentences in the test set for which beam search does not find the global best sequence. Our results confirm the findings of Stahlberg and Byrne (2019) that increasing the beam size leads to fewer NMT search errors (Fig. 3). Among our MT language pairs, English-German (MT-ende) suffers the most from the beam search curse and has the highest proportion of search errors in the test set, possibly because translation from English to German typically results in a longer sequence and thus more uncertainty. GEC differs significantly from NMT in the total number of search errors. For MT, even with a very large beam size of 500, beam search does not find the mode for more than 20% of the sentences in any language pair. In contrast, for GEC we do not observe any search errors for beam sizes larger than 10. This suggests that task uncertainty determines the tractability of the search space and in particular the search for the mode.

Uncertainty also determines the computational costs of exact search. To abstract away from hardware and implementation details, we measure the time complexity of exact search by counting the number of explored states, i.e. the number of forward passes through the model, which is identical to the number of recursive calls of Algorithm 1. (For comparison, the number of explored states in standard beam search is the beam size times the target sequence length.)

Figure 3: Number of beam search errors.
Figure 4: Number of states exact search needs to explore in order to find and verify the mode.
(a) Greedy search errors (GEC-conll14) (b) Greedy search errors (GEC-jfleg)
(c) Number of explored DFS states (GEC-conll14) (d) Number of explored DFS states (GEC-jfleg)
Figure 5: The impact of sentence length and uncertainty on the number of greedy search errors and the number of explored states by exact search for GEC. The error bars show the SEM.
(a) Greedy search errors (MT-ende)
(b) Number of explored DFS states (MT-ende)
Figure 6: The impact of sentence length and uncertainty on the number of greedy search errors and the number of explored states by exact search for MT. The error bars show the SEM.

Fig. 4 plots the fraction of sentences in the test set for which exact search terminates within a given maximum number of explored states. For example, exact search returned the mode for around 50% of the MT sentences after exploring no more than 1,000 states. With the same computational budget, however, it was able to find the mode for nearly 100% of the GEC sentences (blue curves). For some of the MT sentences, exact search needed to explore around 100K states, or even more in the case of Lithuanian-English (orange curve).

Sentence-level uncertainty

In the previous paragraph we showed that MT, a task with high intrinsic uncertainty, suffers from more beam search errors and a less tractable search space than GEC, a task with relatively low intrinsic uncertainty. Figs. 5 and 6 demonstrate that this pattern is not only present at the task-level but also at the sentence-level. First, the bar charts show that there is a general trend towards more search errors and more explored states for longer sentences. Longer input sentences often result in higher entropy distributions (i.e. more uncertainty) since there are usually more ways to map a long sentence than a short one. We also see a pattern within each group, i.e. within a reference length interval, that shows that sentences with higher uncertainty result in more search errors and a longer exact search runtime even when compared to other sentences with similar lengths. Table 5 lists the test set level correlation coefficients.
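The sentence-level relationships summarized in Table 5 are rank correlations. A self-contained sketch of Spearman's rank correlation (with average ranks for ties) is shown below; the paper presumably used a standard statistics library, so this is only illustrative:

```python
def spearman(xs, ys):
    """Spearman's rank correlation coefficient (ties receive average ranks)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1        # average 1-based rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Spearman's coefficient is Pearson's correlation applied to ranks, which makes it robust to the heavy-tailed distributions of search errors and explored states.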

6.2 The Spread of Probability Mass

Figure 7: Number of sentences for which exact n-best search did not terminate before 1M explored states.
Figure 8: Number of states exact n-best search needs to explore in order to terminate for GEC-jfleg and MT-deen.

We argued in Sec. 4 that the ability to approximate the entire search space with a fixed set of candidates can be useful in training (Shen et al., 2016; Williams, 1992; Ranzato et al., 2015) and decoding (Kumar and Byrne, 2004; Eikema and Aziz, 2020), and proposed an exact n-best search algorithm. However, finding the exact n best hypotheses is computationally much more expensive than finding the single best hypothesis (mode). Therefore, to keep the runtime under control, we stopped n-best decoding after 1M explored states. Fig. 7 shows that the 1M threshold is not reached for n = 1 for any sentence: it was always possible to find and verify the mode. We can guarantee that the 100-best candidates returned by our algorithm are indeed the global best ones for around 90% of the MT-deen sentences (right end of the green curve in Fig. 7). The blue curves in Fig. 7 suggest that, as before, the GEC search space is much more tractable, given that our exact n-best search algorithm was able to find the 100 global best hypotheses for all GEC sentences before reaching 1M explored states. Indeed, Fig. 8 shows that exact 100-best search terminated with fewer than 10K explored states for almost all GEC sentences, while the pruning criterion in Eq. 5 is much less effective for the NMT search space (green curves in Fig. 8).

Figure 9: Difference in cumulative probability mass between the global best hypothesis set returned by exact n-best search and the n-best list returned by beam search with different beam sizes.

The cumulative probability mass of the set returned by exact n-best search is an upper bound on the cumulative probability mass of any hypothesis set with a cardinality of n. Despite the high number of search errors (Fig. 3), the probability mass covered by the n-best beam search hypotheses is very close to this upper bound: Fig. 9 shows that the difference is less than 0.001 for all setups except MT-fien. Since the difference in probability mass is negligible, we ran our subsequent investigations of probability mass with beam search instead of exact search to save computational costs.

Figure 10: Average probability mass covered by the n-best list from beam search for beam sizes between 1 and 1000.

Fig. 10 visualizes the difference between NMT and GEC in terms of the probability mass covered by the beam search hypotheses. We confirm the finding of Ott et al. (2018) and Eikema and Aziz (2020) that the NMT distribution is rather flat: even 1000 MT candidates cover only 20% of the probability mass on average. In GEC, however, the model assigns twice as much probability (40%) to the single best hypothesis on average (left end of the blue curves in Fig. 10). Fig. 11 provides even more insight: a beam size of 1000 covers 40% of the probability mass for nearly all sentences in the GEC test sets. Even the more practical beam size of 10 covers more than half of the probability mass for around 75% of the GEC-conll14 sentences. The same plot looks very different for MT (Fig. 12): covering half the probability mass is only possible for a tiny fraction of the MT sentences.
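Since P(·|x) sums to at most one over all output sequences, the coverage statistic behind these plots is simply the cumulative exp-sum of the n-best log-probabilities. A minimal sketch (the toy two-hypothesis list is hypothetical, for illustration only):

```python
import math

def coverage(sorted_logps, beam_sizes=(1, 10, 100, 1000)):
    """Probability mass covered by the top-k hypotheses of an n-best list.

    `sorted_logps` are hypothesis log-probabilities in descending order;
    the returned values are fractions of the total mass 1.
    """
    mass, out = 0.0, {}
    for i, lp in enumerate(sorted_logps, 1):
        mass += math.exp(lp)
        if i in beam_sizes:
            out[i] = mass
    return out

# Example: a toy 2-best list whose hypotheses have probabilities 0.4 and 0.1.
toy = coverage([math.log(0.4), math.log(0.1)], beam_sizes=(1, 2))
```

Summing in probability space (rather than log space) is safe here because the terms are few and non-vanishing; for very long tails a log-sum-exp accumulation would be preferable.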

Figure 11: The number of sentences for which the cumulative probability mass in a beam search n-best list (beam sizes of 1, 10, 100, 1000) reaches a certain fraction of the total probability mass (GEC).
Figure 12: The number of sentences for which the cumulative probability mass in a beam search n-best list (beam sizes of 1, 10, 100, 1000) reaches a certain fraction of the total probability mass (MT).
(a) GEC-conll14 (b) GEC-jfleg
Figure 13: The impact of sentence length and uncertainty on the cumulative probability mass of the 100-best list from beam search for GEC. The error bars show the SEM.
Figure 14: The impact of sentence length and uncertainty on the cumulative probability mass of the 100-best list from beam search for MT. The error bars show the SEM.
Correlation of u with…   GEC conll14   GEC jfleg   MT ende
Greedy search errors       0.18          0.19        0.24
#Explored DFS states       0.20          0.18        0.19
Cumul. prob. mass         -0.23         -0.51       -0.53
Table 5: Spearman’s rank correlation coefficient between the uncertainty u and the number of greedy search errors, the number of explored DFS states, and the 100-best cumulative probability mass. All correlations are significant with a p-value of less than 0.00001.

Sentence-level uncertainty

In Sec. 6.1 we reported that the effects of intrinsic uncertainty on the ability to find the mode are visible at both the task and the sentence level. Similarly, we can track down our observations about how uncertainty determines the probability mass of n-best lists at the sentence level. Fig. 13 shows that the cumulative probability mass in the n-best list decreases for longer sentences, as the mappings of long sentences are more uncertain. Again, the trend within each group in Fig. 13 suggests that even among sentences with similar lengths, n-best lists for uncertain sentences (higher u) accumulate less probability mass. We make analogous observations for NMT (Fig. 14), although the total n-best probability mass is much smaller than for GEC.

7 Related Work

Ambiguity is one of the core challenges in MT, a fact that is supported (inter alia) by the long history of designing evaluation metrics that are robust against it (Papineni et al., 2002; Banerjee and Lavie, 2005; Sellam et al., 2020). In this work we examine the impact of ambiguity on the NMT search space, and show how it is related to various well-known issues of NMT models like the beam search curse (Koehn and Knowles, 2017), a pathology that has also been linked to the local normalization in sequence models (Sountsov and Sarawagi, 2016; Murray and Chiang, 2018) or poor model calibration (Kumar and Sarawagi, 2019).

Our work is heavily inspired by Ott et al. (2018), who analyzed different kinds of uncertainty in NMT. In particular, they found that NMT spreads out the probability mass over a large number of candidates, and connected the beam search curse with uncertainty. We confirm their results and extend their line of research along the following directions: First, we introduce a measure for uncertainty in multi-reference test sets, and show that the negative effects of uncertainty are visible even on the sentence level. Second, we propose an exact n-best search algorithm and demonstrate how it can be used to analyze the spread of probability mass. Third, we focus not only on MT but also on GEC.

Stahlberg and Byrne (2019) showed that beam search errors often obscure the length deficiency of NMT modes, and that reducing search errors by using large beams exposes this model error. In this work, we found that these mechanics are limited to NMT: GEC does not suffer from the beam search curse since search errors are rare and modes are not too short. Eikema and Aziz (2020) suggested that picking a hypothesis based solely on probability is erratic because NMT spreads out the probability mass over a large set of hypotheses with similar probabilities. Therefore, alternative approaches that incorporate MT-specific metrics such as BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) in addition to the probabilities have recently been a focus of research, including minimum Bayes risk decoding (Eikema and Aziz, 2020, 2021; Müller and Sennrich, 2021), Monte-Carlo tree search (Leblond et al., 2021), and energy-based (Bhattacharyya et al., 2021) or discriminatively trained (Lee et al., 2021) rerankers. Our work on how uncertainty determines the spread of probability mass is relevant to those approaches.

8 Conclusion

We identified a major culprit behind various inference-related issues in sequence-to-sequence models, such as the intractability of the search space, degenerate large-beam or exact search outputs, and the large spread of probability mass over the output space. This factor is intrinsic uncertainty – the existence of multiple ways to correctly map an input sequence. We measured the intrinsic uncertainty of input sentences as the degree of agreement between multiple references and showed that ambiguous sentences typically result in a higher number of beam search errors and an exceedingly flat output distribution. We also showed that known NMT pathologies such as the beam search curse or inadequate modes do not extend to less ambiguous tasks like GEC, despite using the same neural architecture.