## 1 Introduction

Although pre-trained text generation models (Lewis et al., 2020; Raffel et al., 2020) have achieved impressive results across a range of tasks, these models do not always deliver what system developers want. Machine-generated text may be non-factual (Kryściński et al., 2020; Maynez et al., 2020; Goyal and Durrett, 2021) or toxic (Gehman et al., 2020).
We might patch these problems by applying discriminators over the output (Holtzman et al., 2018; Yang and Klein, 2021) to enforce these properties post hoc; we could, for instance, apply a secondary model as a reranker over a small collection of outputs. However, if the generator returns a homogeneous set of outputs, we may fail to find *any* usable generation output.

What if generation models could return massive numbers of candidates rather than a few outputs with optimal score? With a large set of candidates, our secondary model could more easily find an acceptable one without having to take more extreme steps like re-training the initial generation model. Output diversity has separately been established as a useful goal for applications such as dialogue and story generation (Li et al., 2016; Fan et al., 2019; Cao and Wang, 2021).

Standard approaches including beam search (BS) and sampling methods fall short of our goal. Beam search uses significant computational resources to explore similar hypotheses, and much of the computation in the search process is invested into paths that could be acceptable generation outputs, but are ultimately pruned; we explore these issues in Section 3. Sampling approaches like nucleus sampling (Holtzman et al., 2020), although achieving better diversity than beam search, often re-discover seen hypotheses and can be harder to control for quality. A central problem with both methods is that they do not handle very similar hypotheses efficiently.

In this paper, we present a decoding framework with two key components. First, we argue that a modified best-first search (Bfs) is the right way to explore the search space. We augment standard best-first search with an ad-hoc path completion strategy: we eagerly expand each node until we reach an EOS token, thereby guaranteeing that each node is part of some completed path returned to the user. This generation strategy avoids exploring large numbers of states which end up being pruned. Bfs is also more flexible than static beam search and can prioritize exploration in more uncertain parts of the generation.

Second, our algorithm returns a massive number of generation options encoded in a lattice, with different hypotheses recombined in an approximate fashion. Recombination is a technique where similar decoder hypotheses are merged during search.
In Figure 1, we show an illustration of the lattice structure this recombination can form for document summarization.
“*A Cardiff recycling company has gone into*” and “*A Cardiff waste management company has gone into*” are preserved as different states in beam search, but actually have very similar distributions of following words under the model. If we can heuristically identify such states, we can merge them (Figure 2) and assume that any high-scoring continuation of one hypothesis can continue the other. We broaden a recombination method used previously in beam search for machine translation (Och et al., 2001; Zhang et al., 2018), enabling us to compactly encode large numbers of generation candidates (Buckman and Neubig, 2018) and achieve dense lattices as shown in Figure 1.

We show results for both document summarization and machine translation in three language pairs. For each setting, we show that our lattice encodes a large number of high-quality candidates, including good matches with annotated reference generations. We further show that a variant of our method can still achieve strong results with a lower number of nodes expanded than the baselines, suggesting that this can be a path toward saving computational resources. We believe that computing thousands of high-quality generation candidates within a single compact data structure can provide a powerful starting point for various downstream purposes: diversity, factuality, customizability, and more.

## 2 Problem & Setup

We define our algorithm in the context of conditional text generation (Sutskever et al., 2014; Bahdanau et al., 2014). Conditional text generation is formulated as sequence transformation from a source input $\mathbf{x}$ to an output sequence $\mathbf{y} = (y_1, \ldots, y_n)$ via a neural text generation model $P(\mathbf{y} \mid \mathbf{x})$. Each $y_i$ is a symbol in a vocabulary $V$. The probability of a decoded sequence is $P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} P(y_i \mid y_{<i}, \mathbf{x})$. Decoding text from a model can be framed as a search problem, where the search objective is to find the output sequence that maximizes the conditional probability under the model: $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log P(\mathbf{y} \mid \mathbf{x})$. Because $P(y_i \mid y_{<i}, \mathbf{x})$ depends on the entire generated sequence, this decoding problem is intractable to solve exactly.

While typically the goal of decoding is to find the hypothesis with the highest possible model score, we instead focus on finding a large set of “good enough” hypotheses. That is, finding a set $\mathcal{H}$:

$$\mathcal{H} = \{\mathbf{y} : \log P(\mathbf{y} \mid \mathbf{x}) > \tau\} \qquad (1)$$

for some threshold $\tau$. $\mathcal{H}$ emerges naturally by adjusting search hyperparameters to control the number of returned hypotheses. Our goal in this paper is to design an algorithm that can efficiently find $\mathcal{H}$.

### 2.1 Notation

We encode predicted generation candidates in a lattice. A lattice $G$ is a directed graph where each node represents a word token and paths defined by directed edges encode candidates. A path in $G$ from $v_{\text{sos}}$ to any node represents a (partially) decoded string, consisting of the words in that path. All completed paths start with a single start-of-sequence node $v_{\text{sos}}$ and end at (potentially different) end-of-sequence nodes $v_{\text{eos}}$. In beam search or sampling, $G$ is strictly a tree, where each node has exactly one parent. However, our constructed lattices are no longer trees due to the recombination mechanism, which we will discuss in Sec. 5.
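As an illustrative sketch (not the paper's implementation), the lattice can be represented with a node class and adjacency lists; candidates are then paths from the start-of-sequence node. All names here are our own:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A lattice node holding one word token (illustrative names)."""
    token: str
    edges: list = field(default_factory=list)  # outgoing directed edges

@dataclass
class Lattice:
    """Directed graph G; paths from the start-of-sequence node encode candidates."""
    sos: Node = field(default_factory=lambda: Node("<s>"))

    def add_edge(self, parent: Node, token: str) -> Node:
        """Attach a new child node for `token` under `parent`."""
        child = Node(token)
        parent.edges.append(child)
        return child

    def paths(self, node=None, prefix=()):
        """Enumerate all token sequences reachable from <s>."""
        node = node or self.sos
        prefix = prefix + (node.token,)
        if not node.edges:          # leaf (e.g., an EOS node) ends a path
            yield prefix
        else:
            for child in node.edges:
                yield from self.paths(child, prefix)
```

Because children are shared objects rather than copies, the same structure accommodates the recombined (non-tree) lattices of Sec. 5: a node reached via a merge simply appears in more than one adjacency list.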

#### Search Graph

$G$ is constructed iteratively through a search procedure. We maintain the closed graph with explored nodes and edges as well as a search frontier $F$, a set consisting of successors to nodes currently in the graph. For each node, there are $|V|$ possible successors.

We define the search budget as the number of nodes expanded from the search frontier. Our experiments will seek to compare different methods using the same search budget. We will define this more precisely in Sec. 7.

### 2.2 Baselines

We consider two categories of decoding methods as baselines: beam search-based and sampling-based. Beam search is widely used to find near-optimal solutions in NLP (Tillmann and Ney, 2003; Meister et al., 2020). We consider two variants: original beam search (BS) and diverse beam search (DBS), an improved version targeting diverse text generation (Vijayakumar et al., 2016). Some implementation details of beam search are described in Appendix A. Sampling-based decoding methods involve randomness and construct candidates by sampling from the next-token distribution rather than maximizing over it. Methods like temperature annealing (Temp) (Ficler and Goldberg, 2017), top-$k$ sampling (Fan et al., 2018), and nucleus sampling (Ncls) (Holtzman et al., 2020) are also widely used to find high-quality text from models. We compare with BS, DBS, Ncls, and Temp in our experiments.

## 3 Inadequacies of Beam Search

As we have alluded to, beam search is inadequate for our goal for several reasons.

#### Better Model Score ≠ Better Hypothesis

The most critical issue is that beam search is designed to efficiently approximate $\arg\max_{\mathbf{y}} \log P(\mathbf{y} \mid \mathbf{x})$, but the optimal model score is neither our goal nor a guarantee of a good hypothesis. In Figure 3, we compare the correlation of model score and ROUGE under beam search for text summarization. The Pearson correlation between these two variables is very weak. Beyond ROUGE score, the example in Fig. 1 shows that the main differences between these summaries may be minor differences in surface realization that have little effect on our qualitative judgments of summary quality. Finding the best model score does not substantially improve quality over a near-optimal model score. Allocating resources to eke out slight improvements over the greedy hypothesis, as beam search does, is a poor use of resources for most applications.

#### Lack of Diversity in (Diverse) Beam Search

Are the model outputs from BS and DBS diverse? We use Self-BLEU (sBl) (Zhu et al., 2018) to measure the BLEU score for randomly sampled pairs from each algorithm’s output. The lower the self-BLEU, the more dissimilar the pairs are. When decoding summaries on XSum, the sBl for BS/DBS is 87/79, while nucleus sampling can achieve 57/50 depending on the configuration. Although DBS slightly improves diversity compared to the original variant, the overlap among outputs from beam search-based methods is still very high, and diversity remains a challenge.

#### Poor Scaling Behavior

In spite of these shortcomings, perhaps beam search could still be viable with larger beam sizes if more computational resources are available. We experiment with a range of beam sizes and examine how diversity scales. In Figure 4, we find that an exponential increase in beam size does not yield a corresponding increase in the number of novel bigrams for beam search. In DBS, diversity does ramp up, but the quality of the generated text degrades quickly. For BS and DBS, increasing beam size is not an effective route to better diversity. We also show that increasing beam size does not scale well in terms of finding better hypotheses (Appendix B).

| | XSum | zh-en | fr-en | en-fr |
|---|---|---|---|---|
| BS | 71.3% | 63.3% | 54.0% | 59.2% |
| DBS | 71.2% | 56.1% | 50.4% | 55.7% |

#### Poor Efficiency from Pruning

One final issue with beam search is that most of its computation is not even useful in producing finished hypotheses; that is, the set of answers produced does not contain most of the nodes expanded in the typical course of operation. We conduct an empirical pruning study on a summarization dataset and three translation datasets and show the results in Table 1. For all studied cases, beam search and diverse beam search prune over half of the expanded nodes. Many pruned hypotheses are not truly ungrammatical or low quality, but are merely slightly less likely than other hypotheses. How we can preserve more of the explored lattice, and do so efficiently, is addressed in the next section by our use of best-first search.

## 4 Best-first Search

As established in the previous section, beam search prunes many paths that would potentially yield high-quality summaries and wastes computational resources expanding nodes that aren’t included in a final search graph.
We tackle this issue by switching from beam search to *best-first search* (Bfs) (Hart et al., 1968; Pearl, 1984). Bfs prioritizes searching over nodes according to a scoring function, giving us more flexibility in how we explore the space.
Our chief modification of the base algorithm is a heuristic called ad-hoc completion.

#### Ad-Hoc Path Completion

Neural text generation is a search problem with a large branching factor ($|V|$) and deep search depth (sequence length). As a result, applying Bfs with the scoring function being the model score of a state often leads to a broad search that rarely returns a valid path.
One solution to this problem is to incorporate a heuristic based on length. Model score is monotonically decreasing as a sequence grows in length, so prior work (Wu et al., 2016; Zhang et al., 2018; Meister et al., 2020) has used a length reward term to alleviate this issue (this can be considered a heuristic like in (weighted) A* search, but it is not necessarily admissible or consistent, so we do not describe it this way).
We found that, even with a length heuristic, Bfs will still have “dangling” nodes that are not part of any path to an EOS (goal) token, and in some cases it will return few or no valid hypotheses.

Recognizing our objective from Equation 1, we can take a simple step to ensure that every node ends up on some completed path: eagerly do a greedy “rollout” from each node until we reach an EOS token (we can discard the path if it exceeds the maximum length and still does not terminate, but this was not observed in practice). In Algorithm 1, we implement this by boosting the priority of the highest-scoring token (line 12), so it will be explored immediately after the current time step. In Figure 5, we show an illustrative example of ad-hoc completion.

#### Search Algorithm

We describe Bfs with ad-hoc completion in Algorithm 1. The algorithm is a slightly modified best-first search algorithm applied to text generation. $\textit{score}(\cdot)$ is a function to evaluate the value of a path; typically it is defined as the model score combined with the length heuristic. $B$ is the budget for total calls to the neural text generation model. Note that the recombination check and merge steps do not invoke the neural generation model, so they do not count toward the computation budget defined here. In practice, we only consider the top 5 expansions rather than the whole vocabulary for line 10.
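To make the procedure concrete, here is a minimal Python sketch of Bfs with ad-hoc completion. This is not the paper's implementation: `next_probs` is a hypothetical stand-in for the neural model, and the length-reward weight, top-5 expansion, and priority-boosting trick follow the description above in spirit only.

```python
import heapq
import math

def bfs_adhoc(next_probs, max_len=20, budget=50, top_k=5, length_reward=0.8):
    """Best-first search with ad-hoc path completion (illustrative sketch).

    next_probs(prefix) -> {token: prob}; a stand-in for the neural model.
    Path score: sum of log-probs plus length_reward * length (length heuristic).
    The best child of each expansion is re-queued with priority +inf so it is
    popped next, greedily rolling the current path out to EOS."""
    frontier = [(-0.0, 0, ("<s>",), 0.0)]  # (-priority, tiebreak, path, logprob)
    finished, counter, calls = [], 1, 0
    while frontier and calls < budget:
        _, _, path, lp = heapq.heappop(frontier)
        if path[-1] == "</s>":
            finished.append((path, lp))
            continue
        if len(path) >= max_len:
            continue  # discard over-length paths (rare in practice)
        calls += 1    # one model call per node expansion
        probs = sorted(next_probs(path).items(), key=lambda kv: -kv[1])[:top_k]
        for rank, (tok, p) in enumerate(probs):
            new_lp = lp + math.log(p)
            prio = new_lp + length_reward * len(path)
            if rank == 0:  # ad-hoc completion: explore the best child immediately
                prio = float("inf")
            heapq.heappush(frontier, (-prio, counter, path + (tok,), new_lp))
            counter += 1
    return finished
```

Because every expanded node immediately spawns a greedy rollout, each node ends up on some completed path, unlike plain best-first search with only a length heuristic.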

## 5 Path Recombination

Path recombination, also known as hypothesis recombination or state merging, was originally proposed and used in phrase-based machine translation och-etal-2001-efficient; koehn-etal-2003-statistical; zhang-etal-2018-exploring.
The idea of path recombination is to combine similar paths if what the model predicts for them in the future is the same, reflecting a similar dynamic programming principle as the Viterbi algorithm. (In phrase-based MT, this procedure merges decoder hypotheses with identical $n$-gram suffixes, among other conditions, which yield identical scores for all possible futures due to the use of an $n$-gram language model on the target side. However, because neural model scoring does not factor over $n$-grams, no pair of hypotheses will have this property.) We focus on finding hypotheses which approximately exhibit this property, and show that merging them can yield high-quality outputs.
Figure 2 shows an example of recombination. The two hypotheses being merged here roughly convey the same intent; a neural model could prefer different continuations for these two based on latent attributes of the text, but we will show that the shared suffix “*has gone into*” is a strong indicator that the model will not distinguish them strongly in the rest of the generation.

### 5.1 Prerequisites of Recombination

In the strictest form, recombining two hypotheses assumes the following equivalence between them:

###### Definition 5.1 (Strong equivalence).

Let $s_a$ and $s_b$ be two prefix strings starting with $v_{\text{sos}}$. $s_a$ and $s_b$ are strongly equivalent if $P(w \mid s_a, \mathbf{x}) = P(w \mid s_b, \mathbf{x})$ holds for all continuations $w$.

Merging such states in the search tree is valid with no loss of information, as any expanded node will receive the same score under both prefixes. However, this assumption is not realistic, since Transformer models condition on the entire sequence so far, and any tiny perturbation changes the predicted distribution. To relax the assumption, we propose a weaker alternative.

###### Definition 5.2 (Weak equivalence).

Let $s_a$ and $s_b$ be two prefix strings starting with $v_{\text{sos}}$. $s_a$ and $s_b$ are weakly equivalent if greedy completions of these two strings are the same: $\text{greedy}(s_a) = \text{greedy}(s_b)$.

We can actually check this criterion empirically. However, it is still not practical to check during search itself, as it requires expanding a number of additional nodes. We will define even weaker criteria and show that these can be good proxies for identifying weakly equivalent nodes.
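Definition 5.2 can be sketched directly in code. The `next_token` argument below is a hypothetical stand-in for greedy decoding under the model; the sketch also makes clear why checking weak equivalence at search time is expensive (each check costs two full greedy rollouts).

```python
def greedy_completion(next_token, prefix, max_len=30):
    """Greedily extend a prefix until EOS; next_token is a model stand-in."""
    path = list(prefix)
    while path[-1] != "</s>" and len(path) < max_len:
        path.append(next_token(tuple(path)))
    return tuple(path[len(prefix):])  # only the continuation

def weakly_equivalent(next_token, s_a, s_b, max_len=30):
    """Def. 5.2 (sketch): two prefixes are weakly equivalent if their greedy
    completions under the model coincide."""
    return (greedy_completion(next_token, s_a, max_len)
            == greedy_completion(next_token, s_b, max_len))
```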

### 5.2 Canonical Paths

After recombination, a single node may represent multiple different possible sentence prefixes. We define the notion of a canonical path, which represents the single path used to score candidate expansions.

#### Type of Edge & Path

If an edge is created by extending the search graph via the model’s prediction, we call it a Gen edge. Otherwise, the edge is created by path recombination, and we call it a Mrg edge. In Figure 6, the edges in orange are Mrg edges.

###### Definition 5.3 (Canonical Path).

Let $v$ be a node. The canonical path to $v$ is defined as the unique path from $v_{\text{sos}}$ to $v$ consisting only of Gen edges.

###### Theorem 5.1.

For any node in the graph except $v_{\text{sos}}$, there exists exactly one canonical path.

We present the proof in Appendix C.

In the rest of the paper, unless specified, the path of a node $v$ returns the sequence of words corresponding to the canonical path of that node. Expanding $v$ computes the next-token distribution conditioned on that canonical path under the neural model.

### 5.3 Merging Criteria

Strong and weak equivalence are too expensive to establish during inference. We instead rely on even simpler functions to approximate them.
We define a similarity function to determine whether an expanded node $v$ should be merged with an existing expanded node $v'$.
Note that to tractably implement the *isRecomb* step of Algorithm 1, this check must be very efficient, as it needs to compare a node against the entire lattice built so far.

A similar recombination idea was explored in Zhang et al. (2018). Following their work, we explore a family of rule-based heuristics for merging. There are two rules: (1) the two strings share a common $n$-gram suffix; (2) the length difference of the two strings is less than $\delta$. Assume that the canonical paths for $v_a$ and $v_b$ have lengths $l_a$ and $l_b$. Then

$$\textit{merge}(v_a, v_b) = \mathbb{1}[\text{suffix}_n(v_a) = \text{suffix}_n(v_b)] \wedge \mathbb{1}[\,|l_a - l_b| < \delta\,] \qquad (2)$$

where $n$ and $\delta$ are hyper-parameters. (In Zhang et al. (2018), there is one extra constraint for Eq. 2 requiring that the path being recombined has a lower model score than the existing path. However, we found that model score is not always a good indicator for merging, as suggested in Fig. 3, partially because it is challenging to calibrate scores across different sequence lengths, so we disregard this constraint.)
For a large enough value of $n$, note that the shared suffixes encourage hypotheses like those in Figure 6 that already share large parts of their structure.
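The rule-based criterion above amounts to two cheap checks over canonical paths. A minimal sketch, with `n` and `delta` standing in for the hyper-parameters of Eq. 2:

```python
def should_merge(path_a, path_b, n=3, delta=5):
    """Rule-based merge check (sketch of Eq. 2): merge two canonical paths iff
    they share an n-gram suffix and their lengths differ by less than delta."""
    if len(path_a) < n or len(path_b) < n:
        return False  # too short to have an n-gram suffix
    same_suffix = tuple(path_a[-n:]) == tuple(path_b[-n:])
    close_length = abs(len(path_a) - len(path_b)) < delta
    return same_suffix and close_length
```

For the Figure 2 example, the two hypotheses share the suffix “has gone into” and differ in length by one token, so they pass the check for `n=3` and any `delta > 1`.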

### 5.4 Prior Work: BSZBeam

Zhang et al. (2018) use their merging criterion in the context of beam search for neural machine translation. Their work focuses on beam search, so the matching candidates are restricted to hypotheses of the latest and second-latest time steps. If the merging criteria hold, the new hypothesis will be recombined with the existing one; however, it will not be considered as a future merging candidate. We call this merging strategy ZBeam. We implement this model together with its merging criteria and denote it as BSZBeam. This strategy is tailored to beam search and, as we discuss later, explores a more limited set of merges than one might want to consider.

## 6 Implementing Recombination

In this work, we present two path recombination strategies, Rcb and Zip. Rcb is a natural generalization of ZBeam, and Zip is a more aggressive recombination strategy which produces a denser lattice at the cost of potentially introducing more noise. We present an illustration of them in Figure 6.

#### Rcb: Generalization of ZBeam

ZBeam has a major limitation: a limited set of merging candidates.
The potential merge candidates in ZBeam are only nodes in the current beam hypotheses and their previous steps, so the method cannot merge with nodes from earlier timesteps. For example, “*A waste plant has gone into*” cannot be merged with the hypothesis ending in node 4 shown in Figure 6. Our proposed generalization, Rcb, addresses this limitation.
We index all of the nodes in the lattice across all timesteps by their $n$-grams using a hash table, making it $O(1)$ time to look up an $n$-gram pattern and retrieve potential merge candidates if they exist.
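The hash-table index can be sketched as follows; the class and method names are our own, not the paper's code:

```python
from collections import defaultdict

class SuffixIndex:
    """Hash table from n-gram suffixes to lattice node ids, enabling O(1)
    lookup of merge candidates anywhere in the lattice (Rcb sketch)."""
    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(list)

    def add(self, node_id, canonical_path):
        """Register a node under the n-gram suffix of its canonical path."""
        if len(canonical_path) >= self.n:
            self.table[tuple(canonical_path[-self.n:])].append(node_id)

    def candidates(self, canonical_path):
        """Return all nodes, from any timestep, sharing this path's suffix."""
        if len(canonical_path) < self.n:
            return []
        return self.table.get(tuple(canonical_path[-self.n:]), [])
```

Unlike ZBeam's restriction to the last two timesteps, every node ever added is retrievable, so merges with much earlier parts of the lattice become possible.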

#### Zip: Recombining More

If we take a closer look at Rcb in Figure 6, we see that even in the merged structure, nodes 3 and 7 and nodes 2 and 6 are preserved as separate. These pairs do not pass the recombination criterion themselves, but they are part of the suffix-matched strings, still correspond to the same words, and have the same directly generated next word. There is reason to believe that these might be equivalent as well.

Hence, we explore a variant called Zip that propagates the merge backwards through the lattice. This change relaxes the merging criterion: potentially up to $n$ pairs of nodes are combined when a merge is identified, leading to a more compact lattice. We describe some of the implementation details in Appendix D.

#### Methods with Path Recombination

We implement path recombination on top of two of our baseline methods: beam search and nucleus sampling. For beam search, we implement BSZBeam following Zhang et al. (2018). Due to the flaws of beam search discussed earlier and the inherent complexity, we do not integrate Rcb and Zip with beam search. We integrate Rcb with nucleus sampling (one variant per nucleus sampling configuration), denoted as NclsRcb. Finally, we use both merging variants in best-first search, yielding BfsRcb and BfsZip, respectively.

## 7 Evaluation

To evaluate the proposed methods, we conduct experiments on two conditional text generation tasks: abstractive text summarization and machine translation. Our evaluation focuses on two questions: (1) how large and diverse are our lattices; (2) are the candidates encoded in the lattices high quality and grammatical?

### 7.1 Datasets & Base Models

We obtain all the models and certain baseline decoding methods from the Transformers library (Wolf et al., 2020). Since our methods are inference techniques with rule-based heuristics, we do not re-train any models.

#### Summarization

We use XSum (Narayan et al., 2018), a popular English news summarization dataset. We sample 100 examples from the validation set. The base model we use is BART-large-XSum (Lewis et al., 2020). We set the max length to 35 tokens according to the reference summaries.

#### Machine Translation

We study our methods on the English-French (en-fr, fr-en) pairs from WMT 2014 (Bojar et al., 2014) and the Chinese-to-English (zh-en) pair from WMT 2019 (Barrault et al., 2019). We use mBART (Liu et al., 2020), a state-of-the-art neural machine translation model. We set the max decoding length to be twice the input length, so it varies per example.

### 7.2 Evaluation Details

#### Search Budget

We want to compare our different methods under a similar search budget, i.e., number of nodes expanded. Each node expansion requires running the neural generation model to get a probability distribution over the vocabulary, which is the dominant factor in runtime; we incur negligible overhead from rule-based matching in the merging step, as well as from computing the diversity term in DBS and modifying sampling distributions in sampling methods.

We define the number of nodes expanded in terms of a quantity we call equivalent beam size $k$. Recall that $l$ denotes the maximum length of decoded hypotheses for beam search. Let the total computation budget be $k \times l$. For beam search-based methods, $k$ is the original beam size, and the total budget covers running beam search once with beam size $k$ and maximum decoding length $l$. For sampling-based methods, $k$ is the number of independent samples we draw from the model. Each sample is a completed path starting with $v_{\text{sos}}$ and ending with $v_{\text{eos}}$. After obtaining $k$ paths, we merge them into one graph (i.e., a trie) to compactly represent duplicate prefix strings. For best-first search methods, $k \times l$ is the total number of nodes explored by the algorithm.

#### Enforcing a uniform search budget

Since hypotheses may reach EOS before the maximum length, there is empirically a gap between effective length (the average generated hypothesis length) and max length for both beam search and sampling. Beam search can also exit early, before reaching the maximum hypothesis length, if a sufficient number of finished paths are found, but this amounts to dynamically changing the budget for each instance. Running our method with a budget derived from beam search is artificial and not realistic in practice.

Instead, we apply a corpus-level correction factor so that the different methods expand the same number of nodes. We increase the beam size by 50% for translation and 25% for summarization for our baseline methods: BS, DBS, Ncls, Temp, and BSZBeam. This correction balances the number of nodes expanded between our method and the baselines.

### 7.3 Search Algorithms

#### Greedy

is the deterministic greedy decoding method that always selects the highest probability token as prediction. The equivalent beam size for this approach is 1 since we only run one pass.

#### Bs & Dbs

stand for beam search and its variant, diverse beam search (Vijayakumar et al., 2016). In our configuration, we use Hamming distance as the diversity function and set the diversity strength to 1.5, following Vijayakumar et al. (2016).

#### Ncls

is the nucleus sampling method proposed in Holtzman et al. (2020), which encourages quality by truncating the distribution over the vocabulary with a parameter $p$ before sampling. We experiment with two values of $p$.

#### Temp

changes the temperature of the softmax function to reshape the prediction distribution (Ficler and Goldberg, 2017). We set the temperature parameter so that the prediction picks more low-scored tokens than the nucleus sampling configurations.

#### Bfs

is the standard best-first search method without path recombination. We use our ad-hoc path completion technique to ensure that finished hypotheses are produced. Our preliminary study shows that a regular best-first search method cannot always yield at least one valid hypothesis, even with a length-aware model score function.

#### Our Methods with Path Recombination

BSZBeam is our implementation of Zhang et al. (2018). We integrate Rcb with nucleus sampling and best-first search as NclsRcb and BfsRcb. We also test Bfs with the Zip strategy. 🍂BfsZip is a resource-efficient version of BfsZip where only 25% of the search budget is used, exploring what this method can achieve with a lower budget given its more aggressive merges.

| Model | path | N1 | N2 | sBl | ED | Or R2 | Sp R2 | Grm Err |
|---|---|---|---|---|---|---|---|---|
| Greedy | 1 | 22 | 23 | 100 | 0 | 17.3 | 17.3 | 0.5% |
| BS | 20 | 42 | 61 | 87 | 31 | 26.3 | 17.7 | 0.3% |
| DBS | 19 | 59 | 91 | 79 | 53 | 25.5 | 15.9 | 0.5% |
| Ncls | 20 | 124 | 237 | 57 | 72 | 30.2 | 14.5 | 0.5% |
| Ncls | 20 | 143 | 273 | 50 | 76 | 28.1 | 13.3 | 0.8% |
| Temp | 20 | 170 | 319 | 51 | 82 | 26.6 | 11.6 | 1.4% |
| Bfs | 30 | 88 | 167 | 68 | 60 | 30.1 | 15.6 | 0.4% |
| *+ Path Recombination* | | | | | | | | |
| BSZBeam | 4,701 | 66 | 118 | 75 | 51 | 33.0 | 16.0 | 0.7% |
| NclsRcb | 52 | 167 | 308 | 53 | 79 | 28.8 | 13.0 | 1.0% |
| NclsRcb | 36 | 207 | 363 | 50 | 87 | 25.9 | 11.0 | 1.7% |
| BfsRcb | 7,758 | 111 | 239 | 65 | 64 | 35.8 | 15.2 | 0.8% |
| BfsZip | 95,744 | 124 | 274 | 53 | 77 | 36.8 | 13.2 | 1.4% |
| 🍂BfsZip | 297 | 58 | 92 | 80 | 49 | 29.2 | 15.2 | 0.8% |

Table 2: Results on the XSum dev set. Diversity metrics: path, N1, N2, sBl, ED; Or and Sp are oracle and sample R2; Grm Err is grammar errors per token.

### 7.4 Evaluation Metrics

We describe our metrics to evaluate both quality and diversity. Several of our metrics build on ROUGE (Lin, 2004; implementation: https://github.com/google-research/google-research/tree/master/rouge) and BLEU (Papineni et al., 2002; Post, 2018) for evaluating the generated text against reference summaries or translations.

#### Diversity-oriented Metrics

We evaluate the diversity of generated texts with the following metrics.
(1) path is the average number of unique paths in the produced lattice.^{7}^{7}7 Due to the exponentially growing number of paths in some of our models, we cap the number of incoming paths each node could hold to a constant . Incoming paths are the paths starting with and ending with the current node. (2) Number of unique -grams encoded in the lattice; this captures a different type of diversity than the number of paths, since there could be many paths reusing the same words. N1 and N2 are average number of novel unigrams and bigrams in the graph.
(3) sBl is the average self-BLEU among samples zhu2018texygen. The samples are drawn from a uniform random walk from . The range of sBl is .
(4) ED is the average edit-distance among samples. We set in our experiment.
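The uniform random walk used to draw samples for sBl and ED can be sketched as follows. For simplicity this sketch keys the adjacency map by node id strings rather than full node objects; the function name and format are our own.

```python
import random

def random_walk(edges, sos="<s>", eos="</s>", max_len=50, seed=0):
    """Sample one path from a lattice by a uniform random walk from <s>.

    edges: dict mapping a node id to the list of its successor node ids
    (a simplified stand-in for the lattice's adjacency structure)."""
    rng = random.Random(seed)
    path = [sos]
    while path[-1] != eos and len(path) < max_len:
        succs = edges.get(path[-1], [])
        if not succs:  # dead end without EOS; stop the walk
            break
        path.append(rng.choice(succs))  # uniform over outgoing edges
    return path
```

Repeating this walk with different seeds yields the sample pairs over which self-BLEU and edit distance are averaged.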

#### Quality: Grammaticality

One concern about our method is that by recombining different hypotheses, we could introduce grammatical errors, e.g., if two hypotheses have different parses despite a shared $n$-gram suffix. Following past work (Xu and Durrett, 2019; Xu et al., 2020), we opt for an automated approach to evaluating grammaticality. We adopt GECToR (https://github.com/grammarly/gector), a state-of-the-art neural grammatical error correction model (Omelianchuk et al., 2020), to automatically assess the grammaticality of generated texts.
We use the RoBERTa version of the model, which achieves an $F_{0.5}$ of 71.8 on the test set of BEA-2019 (Bryant et al., 2019).
We report GrmErr (%), the average number of grammar errors per token, for all English-output experiments.

#### Quality: Oracle Reference Match

Given the reference summary or translation $\mathbf{r}$, we find the path with the highest ROUGE or BLEU over all found paths. Oracle ROUGE is defined as $\max_{\mathbf{y} \in \mathcal{H}} \text{ROUGE}(\mathbf{y}, \mathbf{r})$. For ROUGE metrics, we maximize over R2 and present R1, R2, and RL. This metric captures both quality and diversity: the algorithm needs to find something close to the reference, but a diverse lattice will have a higher chance of exhibiting a good candidate, all else being equal. We denote this metric as Or.

Results on zh-en and fr-en (Diversity metrics, oracle/sample BLEU, and grammar errors):

**zh-en**

| Model | path | N1 | N2 | sBl | ED | Or Bl | Sp Bl | Grm Err |
|---|---|---|---|---|---|---|---|---|
| Greedy | 1 | 35 | 40 | 100 | 0 | 24.7 | 24.7 | 0.5% |
| BS | 12 | 45 | 63 | 95 | 20 | 32.2 | 25.0 | 0.2% |
| DBS | 11 | 55 | 84 | 89 | 59 | 29.7 | 20.5 | 0.5% |
| Ncls | 12 | 94 | 188 | 72 | 82 | 31.5 | 17.5 | 0.7% |
| Ncls | 12 | 110 | 226 | 67 | 94 | 30.4 | 15.8 | 0.9% |
| Temp | 12 | 140 | 280 | 62 | 105 | 27.0 | 12.7 | 1.3% |
| Bfs | 18 | 60 | 104 | 86 | 54 | 32.7 | 20.7 | 0.5% |
| *+ Path Recombination* | | | | | | | | |
| BSZBeam | 18,336 | 64 | 117 | 77 | 65 | 40.1 | 19.1 | 0.8% |
| NclsRcb | 81 | 138 | 263 | 67 | 91 | 26.8 | 13.9 | 1.1% |
| NclsRcb | 38 | 188 | 343 | 58 | 114 | 23.9 | 10.6 | 1.7% |
| BfsRcb | 17,535 | 81 | 171 | 76 | 72 | 42.1 | 19.4 | 0.9% |
| BfsZip | 59,020 | 94 | 205 | 66 | 88 | 42.4 | 15.5 | 1.4% |
| 🍂BfsZip | 511 | 50 | 75 | 89 | 38 | 33.0 | 21.2 | 0.7% |

**fr-en**

| Model | path | N1 | N2 | sBl | ED | Or Bl | Sp Bl | Grm Err |
|---|---|---|---|---|---|---|---|---|
| Greedy | 1 | 28 | 31 | 100 | 0 | 40.0 | 40.0 | 0.9% |
| BS | 12 | 37 | 50 | 93 | 13 | 52.6 | 38.1 | 1.0% |
| DBS | 11 | 45 | 67 | 88 | 37 | 46.4 | 30.5 | 1.1% |
| Ncls | 11 | 62 | 107 | 80 | 46 | 51.0 | 31.2 | 1.0% |
| Ncls | 12 | 75 | 134 | 77 | 57 | 48.3 | 27.4 | 1.2% |
| Temp | 12 | 102 | 184 | 69 | 71 | 43.7 | 21.6 | 1.6% |
| Bfs | 27 | 59 | 102 | 84 | 37 | 53.2 | 33.7 | 1.1% |
| *+ Path Recombination* | | | | | | | | |
| BSZBeam | 16,729 | 59 | 107 | 77 | 43 | 61.2 | 28.2 | 1.3% |
| NclsRcb | 344 | 140 | 246 | 64 | 67 | 48.2 | 26.6 | 1.2% |
| NclsRcb | 123 | 205 | 352 | 55 | 92 | 41.1 | 20.2 | 2.1% |
| BfsRcb | 47,577 | 85 | 193 | 68 | 52 | 64.6 | 25.3 | 1.6% |
| BfsZip | 146,163 | 111 | 259 | 56 | 63 | 56.8 | 16.9 | 2.5% |
| 🍂BfsZip | 4,531 | 50 | 81 | 82 | 35 | 59.5 | 29.4 | 1.4% |

Results on en-fr (Diversity metrics and oracle/sample BLEU):

| Model | path | N1 | N2 | sBl | ED | Or Bl | Sp Bl |
|---|---|---|---|---|---|---|---|
| Greedy | 1 | 32 | 35 | 100 | 0 | 28.5 | 28.5 |
| BS | 12 | 42 | 57 | 93 | 13 | 37.8 | 27.5 |
| DBS | 10 | 51 | 73 | 89 | 38 | 33.1 | 22.7 |
| Ncls | 12 | 95 | 171 | 72 | 56 | 35.4 | 20.4 |
| Ncls | 12 | 116 | 214 | 66 | 73 | 33.4 | 17.6 |
| Temp | 12 | 150 | 274 | 61 | 89 | 28.4 | 13.1 |
| Bfs | 17 | 62 | 98 | 85 | 35 | 38.8 | 25.0 |
| *+ Path Recombination* | | | | | | | |
| BSZBeam | 17,508 | 67 | 117 | 78 | 40 | 46.4 | 21.2 |
| NclsRcb | 59 | 151 | 261 | 67 | 78 | 29.3 | 16.3 |
| NclsRcb | 32 | 190 | 317 | 53 | 101 | 26.9 | 12.6 |
| BfsRcb | 18,663 | 90 | 180 | 74 | 42 | 46.6 | 20.8 |
| BfsZip | 49,507 | 104 | 213 | 65 | 53 | 45.9 | 16.7 |
| 🍂BfsZip | 386 | 49 | 70 | 88 | 25 | 39.5 | 25.7 |

#### Quality: Average Reference Match

Although our method focuses on deriving diverse text summaries or translations, we aim to guarantee that the generated text is highly relevant to the generation target and is of high quality in general. We sample 1,000 paths from the lattice with replacement and evaluate the average ROUGE or BLEU compared to the reference. We denote this metric as Sp.

## 8 Results

### 8.1 Text Summarization

We present experimental results on the XSum dev set in Table 2; full results are in Table 5 for reference. The top half of the table shows non-recombination methods. Among these, BS and DBS are the least diverse. Sampling-based methods, including Temp, are generally more diverse, but their oracle ROUGE is lower than that of Bfs. Given the sacrifice in text quality (lower sample ROUGE and more grammar errors) of sampling-based methods, we argue that best-first search is a strong decoding strategy even without path recombination: it achieves a good balance of diversity and quality, and is more likely to find a candidate close to the reference under the same computation budget.

The bottom half shows methods with path recombination. Recombination significantly improves the diversity of generated outputs, yielding far more paths, and the self-BLEU of the recombination variants is lower than that of their non-recombination counterparts.

In terms of search quality, the proposed BfsRcb and BfsZip methods obtain significantly higher oracle ROUGE than all other methods. We show these results later in Figure 9: our approach finds much better oracle solutions, even compared with beam search given four times the computation budget. The design of the oracle ROUGE metric is also motivated by a real use case: if a user wants a specific summary (e.g., one covering a particular entity or topic), does it exist in the search graph? Higher oracle ROUGE indicates a closer match, meaning a strategy using some kind of reranking model could help surface the user's preferred summary.

NclsRcb generates more novel tokens than other methods, but the number of paths is very limited, indicating that its outputs are largely disjoint paths that cannot be recombined. Moreover, its oracle and sample ROUGE are low, showing the lower quality of these outputs.

#### Comparison: Rcb & Zip

The Zip method yields even more diverse output at the cost of text quality, for two reasons: (1) recombining more nodes makes the lattice denser, increasing the number of paths but also the potential for errors; (2) eliminating unexplored children of the merged branch reduces wasted exploration, so Zip can explore more varied hypotheses than Rcb. With the same computational budget, Zip explores a larger search space while Rcb explores a smaller collection more reliably.

🍂BfsZip exploits the efficiency of Zip to achieve high diversity, and by searching through fewer states, it achieves higher quality as well.

### 8.2 Machine Translation

We show results on machine translation in Tables 3 and 4. As in summarization, the translation results show consistent diversity gains from path recombination. Table 3 covers two translation tasks where the target language is English. BfsRcb works better than BfsZip because it avoids some aggressive, low-quality merges that lead to exploring bad hypotheses. Compared to summarization, we found the search space in MT to be more constrained, so there was less room for aggressive merging and exploration to improve over Rcb.

Our lower-resource method, 🍂BfsZip, performs quite well on most metrics with only 25% of the search budget: it is more diverse than any non-recombination method and achieves higher quality than most recombination methods. Bfs with path recombination, as in BfsRcb and BfsZip, is thus promising for finding a better cost-diversity tradeoff in MT.

### 8.3 Validating the Merging Criterion

Our merging criterion is fundamentally an approximation of the equivalence criteria described in Section 5. As noted there, no pair of generation candidates truly satisfies the strong equivalence assumption, but we can explicitly check the weak equivalence assumption. Our question is: what fraction of nodes merged by our criterion satisfy the weak equivalence assumption? We conduct an experiment to verify this. We consider all merges made by BfsRcb on XSum. For each merged pair, we compute the greedy completion for a fixed number of timesteps and check whether the continuations of the two base candidates are identical.

In Figure 7, we show the fraction of merged pairs for which the generations match exactly under three settings of the recombination criterion. For BfsRcb, the greedy continuation over 4 timesteps is identical 71.2% of the time; for BfsZip, 62.5% of the time. Matching under the weak equivalence criterion is a strong indication that these hypotheses admit many of the same continuations. Rcb respects the recombination assumption more reliably than Zip, but both methods show moderate adherence to the equivalence criterion.
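The check above can be sketched as follows. The `toy_model` next-token function and the example prefix pairs are hypothetical stand-ins; the actual experiment queries the trained seq2seq model for greedy continuations of the merged candidates.

```python
def greedy_continuation(prefix, next_token, k=4):
    """Greedily extend a prefix for k steps using a next-token function."""
    tokens = list(prefix)
    for _ in range(k):
        tokens.append(next_token(tuple(tokens)))
    return tokens[len(prefix):]

def weak_equivalence_rate(merged_pairs, next_token, k=4):
    """Fraction of merged (a, b) prefix pairs whose greedy k-step
    continuations match exactly (the check behind Figure 7)."""
    matches = sum(
        greedy_continuation(a, next_token, k) == greedy_continuation(b, next_token, k)
        for a, b in merged_pairs
    )
    return matches / len(merged_pairs)

# Toy next-token model keyed on the last token only, so prefixes that end
# alike continue alike; a real check would call the generation model.
TABLE = {"cat": "sat", "dog": "sat", "sat": "down", "down": "now", "now": "."}
toy_model = lambda prefix: TABLE.get(prefix[-1], ".")

pairs = [(("the", "cat"), ("a", "cat")), (("the", "cat"), ("a", "dog"))]
print(weak_equivalence_rate(pairs, toy_model))  # both pairs match -> 1.0
```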

### 8.4 Error Analysis & Visualization

In Figure 8, we present two examples on XSum from 🍂BfsZip. The upper example shows more word-level recombination and paraphrasing during generation, while the bottom one shows more ways of ending and more diverse content coverage (e.g., Albania, 5 June, Leicester City, 1982). We show more examples on both summarization and translation in Appendix E.

We manually examined the output and found a few common types of errors introduced by our algorithm.
(1) Factual errors at high-entropy nodes. Our approach assumes that high-scoring candidates under the model are of good quality, but this assumption breaks down in certain cases, such as when the model attempts to hallucinate information. For example, the prefix "*The company, founded in*" causes the model to guess years like "*1989*" or "*1999*"; encoding all of these in the lattice is incorrect. This can still happen in BS but is less likely due to pruning. We did not see significant factual errors introduced by merging specifically. (2) Aggressive bad merges. In the upper example of Figure 8, the cluster of "*GPs*", "*nurses*", and "*paramedics*" is an example: the lattice encodes paths like "*GPs, nurses and nurses should …*". This could be fixed with heuristics or rules, but we leave that for future work.

## 9 Related Work

The techniques used in this work build on several lines of literature: understanding the behavior of text generation models xu-etal-2020-understanding-neural; xu-durrett-2021-dissecting; zhong-etal-2021-adapting-language, investigations into beam search stahlberg-byrne-2019-nmt; meister-etal-2020-beam, and studies of diversity in generation.

#### Search Strategies in Neural Text Generation

Best-first beam search meister-etal-2020-best integrates best-first search with beam search, and other search variants have been studied in previous work meister-etal-2021-determinantal; meister-etal-2021-conditional. Beam search has been critically examined in recent work huang-etal-2017-finish; stahlberg-byrne-2019-nmt, but that work has largely focused on machine translation and MT-specific challenges.

#### Diversity

Neural text degeneration has been observed and discussed in radford2019language; Holtzman2020The; Welleck2020Neural, which led to an interest in diverse generation models. Diverse text generation has been studied in previous work yu2017seqgan, including in dialogue li-etal-2016-diversity, story generation fan-etal-2019-strategies, and particularly paraphrasing iyyer-etal-2018-adversarial; goyal-durrett-2020-neural. Many of the diverse options we observe in summarization correspond to text compressions xu-durrett-2019-neural; desai-etal-2020-compressive, but our method can also diversify content coverage gehrmann-etal-2018-bottom and word choice cao-wang-2021-inference.

## 10 Discussion & Conclusion

We presented an algorithm for decoding in text generation with two main components: best-first search to more efficiently structure exploration of the space and hypothesis recombination to encode summaries in a lattice structure. We showed that across summarization and machine translation, these lattices successfully encode large numbers of high-quality generation options.

There are a few limitations of our method. First, we currently benchmark these techniques by number of nodes expanded, not wall-clock time. There are strategies for parallelizing the Bfs expansion shu-nakayama-2018-improving, but it remains to be seen how this parallelism compares to that of beam search; regardless, the dramatically larger number of hypotheses we return outweighs efficiency differences for now. Second, we focus on auto-regressive models in this paper, but we believe our framework could also be adapted to non-auto-regressive generation models song-etal-2021-new.

## Acknowledgments

We would like to thank Sid J Reddy, Zhisong Zhang, Eunsol Choi, Yasumasa Onoe, Shuyang Cao, and Jonathan Kummerfeld for input and feedback on this work. This work was partially supported by a gift from Amazon, NSF Grant IIS-1814522, and a gift from Salesforce Inc.

## References

## Appendix A Implementation Details: Beam Search

In our beam search implementation, the size of the search frontier is at most the beam size. When a path is completed, we move it from the search frontier to a completed set to free up the beam for exploring unfinished hypotheses; finished hypotheses can therefore be of variable length. After reaching the maximum generation step, we sort all hypotheses in the completed set by model score. Following common practice in libraries such as Transformers wolf-etal-2020-transformers, we return a number of completed hypotheses equal to the beam size.
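A minimal sketch of this implementation follows. The `step_fn(prefix) -> [(logprob, token), ...]` interface is a hypothetical stand-in for the model's scored next tokens, not the paper's actual code.

```python
def beam_search(step_fn, bos, eos, beam_size=4, max_steps=20):
    """Beam search where finished hypotheses move from the frontier to a
    completed set, freeing beam slots; at the end, completed hypotheses
    are sorted by score and the top beam_size are returned."""
    frontier = [(0.0, (bos,))]   # (cumulative log-prob, token prefix)
    completed = []               # finished hypotheses, variable length
    for _ in range(max_steps):
        candidates = [
            (score + lp, prefix + (tok,))
            for score, prefix in frontier
            for lp, tok in step_fn(prefix)
        ]
        candidates.sort(reverse=True)
        frontier = []
        for score, prefix in candidates:
            if len(frontier) == beam_size:
                break                              # beam is refilled
            if prefix[-1] == eos:
                completed.append((score, prefix))  # frees a beam slot
            else:
                frontier.append((score, prefix))
        if not frontier:
            break
    completed.extend(frontier)   # unfinished leftovers at max_steps
    completed.sort(reverse=True)
    return completed[:beam_size]

# Toy model: EOS is cheap, so short hypotheses finish early.
step = lambda prefix: [(-0.1, "</s>"), (-0.5, "a")]
best = beam_search(step, "<s>", "</s>", beam_size=2, max_steps=3)
print(best[0])  # -> (-0.1, ('<s>', '</s>'))
```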

| Model | path | N1 | N2 | sBl | ED | Oracle R1 | Oracle R2 | Oracle RL | Sample R1 | Sample R2 | Sample RL | Grm Err |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Greedy | 1 | 22 | 23 | 100 | 0 | 41.4 | 17.3 | 33.5 | 41.4 | 17.3 | 33.5 | 0.5% |
| BS | 20 | 42 | 61 | 87 | 31 | 47.6 | 26.3 | 40.3 | 41.5 | 17.7 | 33.6 | 0.3% |
| DBS | 19 | 59 | 91 | 79 | 53 | 47.0 | 25.5 | 39.1 | 38.5 | 15.9 | 30.3 | 0.5% |
| Ncls | 20 | 124 | 237 | 57 | 72 | 50.4 | 30.2 | 44.2 | 37.4 | 14.5 | 29.5 | 0.5% |
| Ncls | 20 | 143 | 273 | 50 | 76 | 48.0 | 28.1 | 42.2 | 36.1 | 13.3 | 28.5 | 0.8% |
| Temp | 20 | 170 | 319 | 51 | 82 | 45.0 | 26.6 | 38.5 | 34.1 | 11.6 | 26.3 | 1.4% |
| Bfs | 30 | 88 | 167 | 68 | 60 | 50.8 | 30.1 | 44.0 | 39.0 | 15.6 | 30.8 | 0.4% |
| + Path Recombination | | | | | | | | | | | | |
| BSZBeam | 4,701 | 66 | 118 | 75 | 51 | 52.2 | 33.0 | 45.7 | 40.0 | 16.0 | 32.3 | 0.7% |
| NclsRcb | 52 | 167 | 308 | 53 | 79 | 49.0 | 28.8 | 41.8 | 35.0 | 13.0 | 27.8 | 1.0% |
| NclsRcb | 36 | 207 | 363 | 50 | 87 | 44.6 | 25.9 | 38.7 | 32.1 | 11.0 | 25.1 | 1.7% |
| BfsRcb | 7,758 | 111 | 239 | 65 | 64 | 55.2 | 35.8 | 49.3 | 38.5 | 15.2 | 30.8 | 0.8% |
| BfsZip | 95,744 | 124 | 274 | 53 | 77 | 55.6 | 36.8 | 48.8 | 36.8 | 13.2 | 28.7 | 1.4% |
| 🍂BfsZip | 297 | 58 | 92 | 80 | 49 | 49.6 | 29.2 | 42.8 | 38.8 | 15.2 | 31.0 | 0.8% |

## Appendix B Pool Scaling Behavior: Optimality

As a search algorithm, how well do BS and DBS with larger beam sizes find solutions close to the reference? We compare the oracle R2 of BS/DBS across beam sizes in Figure 9. Oracle R2 increases slowly as the beam size doubles, but our BfsRcb achieves 35.8, much higher than all BS/DBS configurations.

| Method | Algos | Cand | Len | Dedup |
|---|---|---|---|---|
| BSZBeam | BS | last step | 1 | N |
| Rcb | any | all | 1 | N |
| Zip | any | all | | Y |

## Appendix C Proof of Theorem 5.1

Proof by induction. Base case: the lattice initially contains only the root node, which has exactly one canonical path, consisting of itself.

Inductive case: assume every node in the lattice has exactly one canonical path. We have to consider two cases when expanding a node in the lattice:

(1) Expanding a node as normal. In this case, the node is on the search frontier because its parent was expanded, which establishes a Gen edge from the parent to the node. Since the parent has exactly one canonical path, the node has exactly one canonical path: the parent's canonical path extended by the node.

(2) Applying recombination. This operation only adds Mrg edges and deletes nodes, neither of which has any impact on the canonical paths.

## Appendix D Implementation Details: Zip

We summarize the key differences among ZBeam, Rcb, and Zip in Table 6. In Zip, nodes that have already been expanded may be removed from the lattice due to recombination; for example, in Figure 6, nodes 6 and 7 are removed in this fashion. We handle this by re-mapping each eliminated node to its surviving counterpart: any reference to node 7 is routed to node 3, or to whatever node 3 is in turn mapped to. This procedure is implemented with a union-find data structure.
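The re-mapping can be sketched as a small union-find structure; the node ids below follow the Figure 6 example, and the class name and method signatures are illustrative, not the released code.

```python
class NodeRemap:
    """Union-find used for Zip-style recombination: when an already-expanded
    node is eliminated, route every reference to it to its surviving
    counterpart (e.g. node 7 -> node 3), with path compression."""
    def __init__(self):
        self.parent = {}   # eliminated node -> node it survives as

    def find(self, node):
        # Walk to the current surviving representative.
        root = node
        while self.parent.get(root, root) != root:
            root = self.parent[root]
        # Path compression: point everything on the walk at the root.
        while self.parent.get(node, node) != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def merge(self, eliminated, survivor):
        """Eliminate one node in favor of the other's representative."""
        self.parent[self.find(eliminated)] = self.find(survivor)

remap = NodeRemap()
remap.merge(7, 3)      # node 7 removed, survives as node 3
remap.merge(3, 1)      # later, node 3 itself is merged into node 1
print(remap.find(7))   # -> 1
```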

#### Deduplication of Unexplored Successors

After the Zip procedure, we also remove the unexplored successors of the merged nodes, like nodes 6, 7, and 8 in Figure 6. We show a more detailed example in Figure 10: Zip merges node 3 and node 6. Looking at the successors of these two nodes, their distributions should be similar, and in fact are expected to be if the equivalence criteria hold. We therefore remove the unexplored direct successors of the merged node as part of the merging process; the surviving node (node 3) captures these continuations with similar probabilities regardless.

## Appendix E Examples

We show three examples with visualizations in Figures 11, 12, and 13. We use PyVis (https://github.com/WestHealth/pyvis) as the visualization tool.
More examples are available at https://github.com/jiacheng-xu/lattice-generation.