The standard approach to natural language generation uses a search algorithm, guided by an autoregressive (conditional) language model, to search through the space of possible strings. Since this search space is immense, various pruning techniques have been introduced to facilitate tractable text generation. Beam search beamsearch is a deterministic algorithm that prunes the search space according to the relative rank of each prefix, keeping only the top $k$ prefixes at every step. Although rank-based pruning has no probabilistic justification – it is mainly motivated by its ability to limit memory consumption – beam search is an effective approach for conditional text generation tasks, such as machine translation and summarization. Nucleus sampling holtzman2019curious, on the other hand, is a stochastic algorithm, which prunes the bottom percentile of the model’s next-token distribution, thus eliminating bad candidates while retaining some degree of randomness, which is important for free-form text generation. What if we were to replace beam search’s rank-based pruning mechanism (top-$k$) with the probabilistic mechanism of nucleus sampling (top-$p$)?
We experiment with two variants of this hypothetical nucleus search. The first algorithm, $p$-exact search, locally prunes the search space by retaining only the top-$p$ of every next-token distribution that the underlying language model produces. It then performs an exact search over the remaining space, guaranteeing the most probable sequence under the local pruning assumption. The second algorithm, dynamic beam search, selects the top-$p$ beams at each step, according to their normalized probabilities (rather than top-$k$, by rank). This method can effectively shrink or expand the number of beams to match the current step’s low or high entropy, respectively.
We evaluate both algorithms on three different conditional generation benchmarks: subword-level translation (WMT’14 EN-FR), character-level translation (IWSLT’14 DE-EN), and summarization (XSUM with BART pretraining). While we observe that both nucleus search algorithms produce competitive results with standard beam search, we do not find any empirical advantage to our probabilistically-motivated approach.
We further analyze the algorithms by isolating the impact of dynamically expanding or shrinking the number of candidates. Experiments show that expanding the beam, even when entropy is high, tends to decrease performance. Pruning candidates, on the other hand, appears to have no adverse effects, and may even have a marginal positive effect in certain cases, which possibly cancels out with the negative effects of beam expansion.
Natural language generation can be defined as a search problem in the space of possible sequences over a token vocabulary $V$, where the goal is to find an optimal sequence according to some cost function. Typical search algorithms explore this infinite space via sequence prefixes, starting with the empty sequence, and incrementally appending one potential token at a time. Search terminates by returning a sequence (or a set of sequences) that ends with a special token that indicates the end of the sequence (EOS).
The cost function is based on an underlying language model that, given a prefix $y_{<t}$, induces a probability distribution over $V$, which we denote $P(y_t \mid y_{<t})$. (The underlying model is often a conditional language model $P(y_t \mid y_{<t}, x)$, which takes an additional sequence $x$ as part of its input. For brevity, we omit $x$ from our notation.) We can thus compute the probability of an entire sequence (or prefix) $y = (y_1, \ldots, y_n)$ as the product of token probabilities:
$$P(y) = \prod_{t=1}^{n} P(y_t \mid y_{<t}) \tag{1}$$
In practice, it is common to use the negative log probability instead:
$$-\log P(y) = \sum_{t=1}^{n} -\log P(y_t \mid y_{<t}) \tag{2}$$
This defines a monotonic additive cost function, where appending each token adds a positive cost to the total cost of the sequence.
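As a small illustration of the two formulations above, the following sketch computes a sequence's probability as a product and its cost as a sum of negative logs. The per-token probabilities are hypothetical values chosen for illustration, not outputs of any model used in this paper.

```python
import math

def sequence_cost(token_probs):
    """Sum of negative log-probabilities of the tokens in a sequence (or prefix)."""
    return sum(-math.log(p) for p in token_probs)

# Hypothetical conditional probabilities P(y_t | y_<t) for a three-token sequence.
token_probs = [0.5, 0.8, 0.9]

prob = math.prod(token_probs)      # probability as a product of token probabilities
cost = sequence_cost(token_probs)  # the same quantity as an additive cost

# The cost of a prefix never exceeds the cost of any of its extensions.
prefix_cost = sequence_cost(token_probs[:2])
```

Because each term is added rather than multiplied, the cost of a prefix is a lower bound on the cost of every sequence that extends it, which is the property that the search algorithms below rely on.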
2.1 Beam Search
In many natural language generation tasks, beam search beamsearch is the algorithm of choice. It extends the simple greedy algorithm by considering $k$ possible prefixes at each timestep. The beam size $k$ is constant throughout the search, guaranteeing a limit on memory consumption.
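At every step $t$, beam search ranks all the possible single-token extensions of the $k$ current prefixes, and then keeps only the $k$ best extensions according to their total cost:
$$B_t = \underset{y_{<t} \in B_{t-1},\; y_t \in V}{\operatorname{arg\,min}_k} \; -\log P(y_{<t}, y_t)$$
where $B_{t-1}$ is the set of prefixes kept at the previous step, $V$ is the token vocabulary, and $\operatorname{arg\,min}_k$ selects the $k$ lowest-cost candidates.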
Once a prefix is appended with an EOS token, it is considered a complete sequence, and remains fixed as long as its cost is among the $k$ best prefixes; if $k$ (or more) better prefixes are found, the sequence is discarded. The algorithm terminates when either the final token of all top $k$ sequences is EOS, or when $t$ exceeds the predefined maximum number of steps. In both cases, it returns all sequences in the beam that end with EOS. (Typically, a system will eventually select the top sequence in the set, or choose an alternative sequence via some reranking criterion.)
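The procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the Fairseq implementation used in the experiments: `toy_model`, its probabilities, and the function names are hypothetical stand-ins for a real language model.

```python
import math

EOS = "</s>"

def toy_model(prefix):
    """Hypothetical next-token distribution; a stand-in for a real language model."""
    if len(prefix) == 0:
        return {"a": 0.6, "b": 0.3, EOS: 0.1}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.5, EOS: 0.4}
    return {"a": 0.2, "b": 0.1, EOS: 0.7}

def beam_search(model, k, max_steps):
    beam = [((), 0.0)]  # (prefix, cumulative negative log-probability)
    for _ in range(max_steps):
        candidates = []
        for prefix, cost in beam:
            if prefix and prefix[-1] == EOS:
                # Complete sequences stay fixed but keep competing for a slot.
                candidates.append((prefix, cost))
                continue
            for token, prob in model(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + (token,), cost - math.log(prob)))
        # Rank-based pruning: keep the k lowest-cost candidates.
        beam = sorted(candidates, key=lambda c: c[1])[:k]
        if all(prefix[-1] == EOS for prefix, _ in beam):
            break
    return [(p, c) for p, c in beam if p and p[-1] == EOS]

greedy = beam_search(toy_model, k=1, max_steps=10)
wider = beam_search(toy_model, k=2, max_steps=10)
```

With this toy model, greedy search (k=1) commits to the prefix "a b" and ends with a sequence of probability 0.21, whereas k=2 keeps "a </s>" alive and returns it first with probability 0.24.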
Assuming the underlying models are well-calibrated, results should improve as the beam size increases. However, this assumption does not hold for contemporary models; in practice, text quality deteriorates when using large values of $k$ koehn-knowles-2017-six. Furthermore, decoding with exact search dijkstra1959note reveals that translation models often rank the empty string as the most probable sequence stahlberg-byrne-2019-nmt. Perhaps unintentionally, searching with small beam sizes mitigates this flaw, a phenomenon that has been referred to as the “blessing” of beam search meister-etal-2020-beam.
2.2 Nucleus Sampling
Deterministic search algorithms, such as beam search, try to generate the most probable sequence. This is a desirable property when we have many constraints regarding the target output, as in translation or question answering. However, tasks that require more creativity and diversity in language may benefit from stochastic algorithms.
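holtzman2019curious show that sampling directly from a language model’s raw distribution will eventually produce degenerate text, and instead suggest sampling only from the nucleus: the smallest set of tokens whose sum of probabilities is larger than some hyperparameter $p$. Specifically, nucleus sampling prunes the original distribution by assigning zero probability to every token outside the nucleus, and then renormalizes the probabilities to get a new distribution:
$$P_p(y_t \mid y_{<t}) = \begin{cases} \dfrac{P(y_t \mid y_{<t})}{\sum_{y' \in V_p} P(y' \mid y_{<t})} & \text{if } y_t \in V_p \\ 0 & \text{otherwise} \end{cases}$$
where $P(y_t \mid y_{<t})$ is the model’s next-token distribution and $V_p$ denotes the nucleus.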
Here, we refer to this mechanism as tail pruning. Sampling from this renormalized distribution results in less degenerate and more human-like text than both full-distribution sampling and top-$k$ sampling fan-etal-2018-hierarchical, which do not account for the distribution’s entropy.
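Tail pruning can be sketched as follows. The function name `tail_prune` and the example distribution are hypothetical; the sketch follows the wording above and keeps the smallest set of top-ranked tokens whose total probability is larger than p.

```python
def tail_prune(dist, p):
    """Keep the nucleus of a token distribution and renormalize it.

    dist maps tokens to probabilities. Tokens outside the nucleus are
    dropped from the result, which is equivalent to assigning them zero.
    """
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        mass += prob
        if mass > p:  # smallest prefix of the ranking whose mass exceeds p
            break
    return {token: prob / mass for token, prob in nucleus}

# Hypothetical next-token distribution.
dist = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
pruned = tail_prune(dist, p=0.7)
```

In this example the nucleus is {"the", "a"}, whose original mass of 0.8 is rescaled to 1. Unlike a fixed top-k cutoff, the size of the nucleus depends on how the probability mass is spread: a peaked distribution yields a small nucleus and a flat one yields a large nucleus.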
3 Nucleus Search
We combine the determinism of beam search with the probabilistic tail pruning of nucleus sampling, producing two variants of nucleus search: $p$-exact search and dynamic beam search.
3.1 $p$-Exact Search
stahlberg-byrne-2019-nmt show that exact search dijkstra1959note often produces extremely short and even empty sequences because the underlying language model assigns a non-zero probability to the EOS token at each step. We propose using tail pruning (Section 2.2) to round all near-zero probabilities (whether belonging to EOS or any other token) to an absolute zero. We apply exact search over the pruned space, guaranteeing the most probable sequence that contains only top-$p$ tokens at each step.
Given a hyperparameter $p$, we apply tail pruning to the model’s predicted token distribution $P(y_t \mid y_{<t})$. The pruned distribution assigns zero probability to all tokens in the bottom $1-p$ of the original distribution, while inflating the probability of the remaining tokens when renormalizing. For example, if the model’s distribution over the first token assigns the token George a probability larger than the hyperparameter $p$, then the renormalized distribution will assign all its probability mass to the token George. Conversely, if the model assigns George a low probability, and this event is not in the top-$p$ of the distribution, then the new distribution will assign it zero probability and effectively prune all sequences beginning with the token George from being generated. This same procedure also prunes the EOS token when it is unlikely, preventing empty sequences and reducing the brevity bias in general.
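A minimal sketch of this idea, assuming a uniform-cost (Dijkstra-style) search over tail-pruned, renormalized distributions. The toy model, its probabilities, and the `max_len` safeguard are hypothetical and chosen only to reproduce the empty-sequence behavior on a small scale; this is not the implementation used in the experiments.

```python
import heapq
import math

EOS = "</s>"

def tail_prune(dist, p):
    """Keep the smallest top-ranked token set whose mass exceeds p; renormalize."""
    ranked = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        mass += prob
        if mass > p:
            break
    return {token: prob / mass for token, prob in nucleus}

def toy_model(prefix):
    """Hypothetical model that gives EOS a small but non-zero probability early on."""
    if len(prefix) == 0:
        return {"the": 0.35, "a": 0.35, "cat": 0.2, EOS: 0.1}
    if len(prefix) == 1:
        return {"cat": 0.25, "dog": 0.25, "bird": 0.25, EOS: 0.25}
    return {EOS: 0.9, "the": 0.1}

def p_exact_search(model, p, max_len):
    frontier = [(0.0, ())]  # min-heap of (cumulative cost, prefix)
    while frontier:
        cost, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == EOS:
            # Costs are additive and never negative, so the first complete
            # sequence popped is the cheapest one in the pruned space.
            return prefix, cost
        if len(prefix) >= max_len:
            continue
        for token, prob in tail_prune(model(prefix), p).items():
            heapq.heappush(frontier, (cost - math.log(prob), prefix + (token,)))
    return None

unpruned = p_exact_search(toy_model, p=1.0, max_len=5)
pruned = p_exact_search(toy_model, p=0.5, max_len=5)
```

With p=1.0 nothing is pruned and the search returns the empty sequence (EOS alone), because every longer sequence has probability below 0.1. With p=0.5 the unlikely EOS is pruned at the first two steps, so the search is forced to return a three-token sequence.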
3.2 Dynamic Beam Search
Beam search keeps a fixed number ($k$) of prefixes according to their rank, regardless of their probability scores. In high-entropy situations, the difference between the $k$-th most probable prefix and the one ranked $k+1$ might be minuscule, and we may want the search algorithm to consider such candidate prefixes as well. Conversely, when entropy is low (which is the case for most timesteps), the best prefix dominates the alternatives, making them redundant.
Dynamic beam search provides a mechanism for increasing the beam size when entropy is high, and pruning the number of prefixes when entropy is low. Let $k_t$ be the number of viable prefixes at step $t$. The model predicts the next-token distribution for each prefix, creating $k_t \cdot |V|$ candidates, where $|V|$ is the vocabulary size. Each candidate is scored according to its cumulative probability (Equation 1). To determine the beam size, we first normalize the probability scores within the set of candidates $C_t$, and then apply tail pruning on the normalized probability:
$$\tilde{P}(y_{\le t}) = \frac{P(y_{\le t})}{\sum_{y'_{\le t} \in C_t} P(y'_{\le t})}$$
As in $p$-exact search (Section 3.1), we use a hyperparameter $p$ to determine the nucleus of $\tilde{P}$, and thus the size of the next step’s beam $k_{t+1}$. The normalized probability $\tilde{P}$ is only used to compute the dynamic beam; for computing each prefix’s cumulative score, we use the original probability $P$.
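One selection step of dynamic beam search might look like the sketch below. The function name, the candidate lists, and the `max_beam` argument (a stand-in for a memory cap on the number of prefixes) are assumptions made for illustration.

```python
def dynamic_beam_step(candidates, p, max_beam=320):
    """Select the next beam from (prefix, cumulative_probability) candidates.

    Probabilities are normalized across candidates only to decide how many
    prefixes to keep; the returned prefixes retain their original scores.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    total = sum(prob for _, prob in ranked)
    beam, mass = [], 0.0
    for prefix, prob in ranked[:max_beam]:
        beam.append((prefix, prob))
        mass += prob / total  # normalized probability of this candidate
        if mass > p:          # nucleus reached: stop growing the beam
            break
    return beam

# Hypothetical low-entropy step: one candidate dominates.
low = [(("a",), 0.90), (("b",), 0.05), (("c",), 0.03), (("d",), 0.02)]
# Hypothetical high-entropy step: candidates are nearly tied.
high = [(("a",), 0.26), (("b",), 0.25), (("c",), 0.25), (("d",), 0.24)]

small_beam = dynamic_beam_step(low, p=0.7)
large_beam = dynamic_beam_step(high, p=0.7)
```

With p=0.7, the low-entropy step keeps a single prefix while the high-entropy step keeps three, which is the shrinking and expanding behavior that a fixed-size beam cannot express.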
We compare our search algorithms to beam search on a variety of tasks. (We do not compare to stochastic algorithms such as nucleus sampling holtzman2019curious, since those are more suited for free-form language generation, while we focus on conditional text generation.) To control for the model, we use the same model across all search algorithms and hyperparameters, for each task.
We evaluate on the WMT’14 EN-FR dataset ws-2014-statistical, using the model of ott-etal-2018-scaling, a large Transformer NIPS2017_3f5ee243 with 6 encoder and decoder layers, trained on 36M bilingual sentences. The model uses BPE subword tokenization, with a joint vocabulary of 44k types. We evaluate the generated sequences using SacreBLEU post-2018-call, case-sensitive, with the 13a tokenizer.
Character-Level Machine Translation
To test the search algorithms’ behavior on longer sequences, we also compare their performance on character-tokenized machine translation. We train a model on the IWSLT’14 DE-EN dataset cettolo2014report, which contains approximately 172k bilingual sentences in its training set. We use the recommended settings and hyperparameters in Fairseq ott-etal-2019-fairseq to train a 6-layer encoder-decoder transformer. As with the subword-level dataset, performance is measured via SacreBLEU.
We evaluate on the XSUM dataset narayan-etal-2018-dont. To alleviate memory issues and improve data quality, we remove examples where the source document is longer than 800 tokens (1,663 examples), or when the target summarization is longer than one quarter of the source document (698 examples). Our cleaned version of the XSUM test set contains 8,972 document-summarization pairs. We use the large fine-tuned BART model lewis-etal-2020-bart. ROUGE scores lin-hovy-2003-automatic are computed via compare-mt neubig-etal-2019-compare.
We implement our algorithms in the Fairseq framework ott-etal-2019-fairseq. Theoretically, the number of candidate prefixes may grow exponentially in both $p$-exact and dynamic beam search algorithms (for example, if the model always predicts a uniform distribution). To approximate these unbounded algorithms while keeping the GPU memory constraints tractable for any value of $p$, we cap the number of candidate prefixes (beam size) by a large constant: 320 for WMT’14 and XSUM, and 160 for character-level translation.
We explore all values of $p$ in increments of 0.1 for both nucleus search algorithms. For beam search, we experiment with all beam sizes from 1 to 5, as well as exponentially increasing beam sizes from 5 to 320. To present a complete picture of the algorithms’ behaviors, we report results for all hyperparameter settings, rather than selecting the best configuration according to the validation set. This experiment design limits our ability to claim the superiority of one algorithm over another, but as we show in Section 5, the performance differences are so small that no such claim will be made.
| Algorithm | $k$ or $p$ | EN-FR | DE-EN (Char) |
Table 1 shows the performance of each search algorithm across the different tasks. (This table shows performance without reranking, i.e. length normalization, to study the core algorithm. Appendix A contains the results with reranking, showing similar trends.) In line with previously reported trends koehn-knowles-2017-six, we observe that increasing the beam size beyond can severely degrade performance, resulting in a drop of almost 30 BLEU on both translation tasks when . On the other hand, the probabilistic search algorithms appear to be more stable, with most hyperparameter settings achieving relatively high performance metrics until , where substantial performance degradation is evident.
Despite their increased stability, there appears to be no significant advantage to either $p$-exact search or dynamic beam search over the original beam search algorithm. In fact, the performance differences between the best settings of each algorithm are always under 0.2 BLEU/ROUGE, and often zero. We find this trend counter-intuitive, since we originally assumed that expanding and trimming the beam based on entropy would benefit language generation. We further test these assumptions individually.
We compare the performance of static beam search () and dynamic beam search () on two subsets of the translation task’s test set: (1) examples where dynamic beam search always selects from its top 5 prefixes, and (2) the complement, where every generated output contains at least one prefix that was ranked 6th or worse. (We select since it is the maximal value that achieved the top score on the WMT’14 EN-FR benchmark.) Table 2 shows that in those cases where dynamic beam search actually uses the expanded beam, i.e. it chooses prefixes that rank lower than 5, it performs worse than static top-5 beam search by 0.7 BLEU. This subset accounts for only 13% of examples – which are probably harder for the model, given the 10-point difference in BLEU – while the majority 87% of cases are always composed from the top 5 (or fewer) prefixes.
We isolate the effect of probabilistic trimming by applying a cap on the number of active beams, for both nucleus search variations. Table 3 shows that $p$-exact and dynamic beam trimming strategies have no negative effects, and may have a marginal positive effect.
| Algorithm | $k$ or $p$ | EN-FR | DE-EN (Char) |
6 Related Work
As the standard decoding strategy for many conditional generation tasks, there is a significant body of literature on beam search. Recently, there has been more focus on the empty string problem stahlberg-byrne-2019-nmt, and the fact that increasing the beam size beyond a small constant typically hurts performance. meister-etal-2020-beam show that beam search optimizes for sequences that distribute information uniformly, and therefore, using small beam sizes allows it to overcome the empty string problem. shi2020neural train models with multiple different EOS tokens based on their positions, instead of a single universal EOS token. peters-martins-2021-smoothing replace the softmax function with the sparse entmax transformation peters-etal-2019-sparse that can assign absolute zero probability to tokens. This method has a similar effect to our $p$-exact search, but requires training the model with entmax, while our contribution only modifies the search algorithm.
massarelli-etal-2020-decoding also propose a combination of beam search and sampling methods, but with a different method and a different goal. They focus on free-form text generation, addressing two problems – repetition and hallucination – by sampling the first few tokens, and then switching over to beam search.
Language models predict a distribution over their vocabulary, yet beam search only utilizes the rank of different candidates, not their actual probability scores. A natural assumption is that searching the space of prefixes with a constant number of options is not optimal. We hypothesize that using the probability scores to dynamically determine the number of candidates may benefit natural language generation. We test our hypothesis by introducing two nucleus search algorithms, which incorporate probabilistic tail pruning holtzman2019curious with beam search, but find that they perform on par with the baseline beam search algorithm when its beam is restricted to a small constant.
This work was supported by the Tel Aviv University Data Science Center, the Blavatnik Fund, the Alon Scholarship, and Intel Corporation. We would like to thank Ari Holtzman, Jonathan Berant, Ori Yoran, Lior Vassertail and Yuval Kirstain for their valuable feedback.
Appendix A Results with Reranking
When presenting our main results (Section 5), we follow related work peters-martins-2021-smoothing and focus on the outputs generated using the algorithms themselves, without reranking. For completeness, we also present the results of applying length normalization jean-etal-2015-montreal; murray-chiang-2018-correcting, i.e. reranking the set of sequences produced by beam search according to their average log-probability, rather than their cumulative log-probability (Equation 2):
$$\text{score}(y) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t})$$
where $|y|$ is the number of tokens in the sequence $y$ and $P(y_t \mid y_{<t})$ is the model’s next-token probability.
Table 4 shows that length normalization improves stability, and slightly increases performance overall. However, it does not increase the performance gap between the different algorithms, with respect to the results in Section 5 (without reranking); all three variants produce text that scores within 0.2 BLEU/ROUGE from the best performing setting in every task.
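Reranking by length normalization, as described in this appendix, can be sketched as follows. The function name and the hypotheses are hypothetical, and counting the EOS token toward the sequence length is an assumption of this sketch.

```python
def rerank_by_length_normalization(hypotheses):
    """Sort (tokens, cumulative_log_prob) pairs by average log-probability."""
    return sorted(hypotheses, key=lambda h: h[1] / len(h[0]), reverse=True)

EOS = "</s>"
short_tokens = ("ok", EOS)
long_tokens = ("this", "is", "a", "longer", "output", EOS)

# Hypothetical cumulative log-probabilities: the short output has the higher
# total score (-2.0 vs. -3.0), but the lower per-token average (-1.0 vs. -0.5).
hypotheses = [(short_tokens, -2.0), (long_tokens, -3.0)]

by_total = sorted(hypotheses, key=lambda h: h[1], reverse=True)
by_average = rerank_by_length_normalization(hypotheses)
```

Ranking by cumulative log-probability prefers the short hypothesis, while ranking by the per-token average prefers the longer one, which is how reranking counteracts the bias toward short outputs.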
| Algorithm | $k$ or $p$ | EN-FR | DE-EN (Char) |