Massive-scale Decoding for Text Generation using Lattices

12/14/2021
by   Jiacheng Xu, et al.
The University of Texas at Austin

Neural text generation models like those used for summarization and translation generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options. We present a search algorithm to construct lattices encoding a massive number of generation options. First, we restructure decoding as a best-first search, which explores the space differently than beam search and improves efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis recombination: we can identify pairs of similar generation candidates during search and merge them as an approximation. On both document summarization and machine translation, we show that our algorithm encodes hundreds to thousands of diverse options that remain grammatical and high-quality into one linear-sized lattice. This algorithm provides a foundation for building downstream generation applications on top of massive-scale diverse outputs.


1 Introduction

Although pre-trained text generation models lewis-etal-2020-bart; RaffelEtAl2020 have achieved impressive results across a range of tasks, these models do not always deliver what system developers want. Machine generated text may be non-factual kryscinski-etal-2020-evaluating; maynez-etal-2020-faithfulness; goyal-durrett-2021-annotating or toxic gehman-etal-2020-realtoxicityprompts. We might patch these problems by applying discriminators over the output holtzman-etal-2018-learning; yang-klein-2021-fudge to enforce these properties post-hoc; we could, for instance, apply a secondary model as a reranker over a small collection of outputs. However, if the generator returns a homogeneous set of outputs, we may fail to find any usable generation output.

What if generation models could return massive numbers of candidates rather than a few outputs with optimal score? With a large set of candidates, our secondary model could more easily find an acceptable one without having to take more extreme steps like re-training the initial generation model. Output diversity has separately been established as a useful goal for applications such as dialogue and story generation li-etal-2016-diversity; fan-etal-2019-strategies; cao-wang-2021-inference.

Standard approaches including beam search (BS) and sampling methods fall short of our goal. Beam search uses significant computational resources to explore similar hypotheses, and much of the computation in the search process is invested into paths that could be acceptable generation outputs, but are ultimately pruned; we explore these issues in Section 3. Sampling approaches like nucleus sampling Holtzman2020The, although achieving better diversity than beam search, often re-discover seen hypotheses and can be harder to control for quality. A central problem with both methods is that they do not handle very similar hypotheses efficiently.

Figure 1: A lattice of outputs yielded by our path recombination method is a more efficient way to represent and explore related generation outputs compared to beam search.

In this paper, we present a decoding framework with two key components. First, we argue that a modified best-first search (Bfs) is the right way to explore the search space. We augment standard best-first search with an ad-hoc path completion strategy: we eagerly expand each node until we reach an EOS token, thereby guaranteeing that each node is part of some completed path returned to the user. This generation strategy avoids exploring large numbers of states which end up being pruned. Bfs is also more flexible than static beam search and can prioritize exploration in more uncertain parts of the generation.

Second, our algorithm returns a massive number of generation options encoded in a lattice, with different hypotheses recombined in an approximate fashion. Recombination is a technique where similar decoder hypotheses are merged in the search. In Figure 1, we show an illustration of the lattice structure this recombination can form for document summarization. “A Cardiff recycling company has gone into” and “A Cardiff waste management company has gone into” are preserved as different states in beam search, but actually have very similar distributions of following words under the model. If we can heuristically identify such states, we can merge them (Figure 2) and assume that any high-scoring continuation of one hypothesis can continue the other. We broaden a recombination method used previously in beam search for machine translation och-etal-2001-efficient; zhang-etal-2018-exploring, enabling us to compactly encode a large number of generation candidates buckman-neubig-2018-neural and achieve dense lattices as shown in Figure 1.

We show results for both document summarization and machine translation in three language pairs. For each setting, we show that our lattice encodes a large number of high-quality candidates, including good matches with annotated reference generations. We further show that a variant of our method can still achieve strong results with a lower number of nodes expanded than the baselines, suggesting that this can be a path towards saving computational resources. We believe that computing thousands of high-quality generation candidates within a single compact data structure can provide a powerful starting point for various downstream purposes: diversity, factuality, customizability, and more.

Figure 2: Our search algorithm. At each step, the algorithm pops a node from the search frontier, checks for possible recombinations with existing nodes, and merges the nodes if a match is found. “waste company” and “recycling plant” are interchangeable paraphrases which do not affect the continuation from the model’s perspective.

2 Problem & Setup

We define our algorithm in the context of conditional text generation sutskever2014sequence; bahdanau2014neural. Conditional text generation is formulated as sequence transformation from a source input $\mathbf{x}$ to a target output $\mathbf{y} = (y_1, \ldots, y_n)$ via a neural text generation model $P(\mathbf{y} \mid \mathbf{x})$. Each $y_i$ is a symbol in a vocabulary $V$. The probability of a decoded sequence is $P(\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} P(y_i \mid \mathbf{y}_{<i}, \mathbf{x})$. Decoding text from a model can be framed as a search problem, where the search objective is to find the output sequence that maximizes the conditional probability under the model: $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})$. Because $P(y_i \mid \mathbf{y}_{<i}, \mathbf{x})$ depends on the entire generated sequence so far, this decoding problem is intractable to solve exactly.

While typically the goal of decoding is to find the hypothesis with the highest possible model score, we instead focus on finding a large set of “good enough” hypotheses. That is, finding a set $\mathcal{Y}$:

$\mathcal{Y} = \{ \mathbf{y} : \log P(\mathbf{y} \mid \mathbf{x}) > \tau \}$   (1)

for some threshold $\tau$; $\tau$ emerges naturally by adjusting search hyperparameters to control the number of returned hypotheses. Our goal in this paper is to design an algorithm that can efficiently find $\mathcal{Y}$.

2.1 Notation

We encode predicted generation candidates in a lattice. A lattice $G$ is a directed graph where each node represents a word token and paths defined by directed edges encode candidates. A path in $G$ from $v_{\mathrm{BOS}}$ to any node $v$ represents a (partially) decoded string, consisting of the words along that path. All completed paths start with a single start-of-sequence node $v_{\mathrm{BOS}}$ and end at (potentially different) end-of-sequence nodes $v_{\mathrm{EOS}}$. In beam search or sampling, $G$ is strictly a tree, where each node has exactly one parent. However, our constructed lattices are no longer trees due to the recombination mechanism, which we will discuss in Sec. 5.
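To make the data structure concrete, the sketch below shows one minimal way such a lattice could be represented in code. The class and field names (LatticeNode, gen_parent, mrg_parents) are our own illustration and not the authors' released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LatticeNode:
    """One token in the lattice (illustrative sketch; field names are ours)."""
    uid: int
    token: str
    gen_parent: Optional["LatticeNode"] = None                      # unique parent via a Gen edge (None for BOS)
    mrg_parents: List["LatticeNode"] = field(default_factory=list)  # extra parents added by recombination
    children: List["LatticeNode"] = field(default_factory=list)

    def add_child(self, child: "LatticeNode") -> None:
        child.gen_parent = child.gen_parent or self
        self.children.append(child)

# A three-node fragment: <s> -> "A" -> "Cardiff"
bos = LatticeNode(0, "<s>")
a = LatticeNode(1, "A"); bos.add_child(a)
cardiff = LatticeNode(2, "Cardiff"); a.add_child(cardiff)
```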

Search Graph

The search graph $G$ is constructed iteratively through a search procedure. We maintain the closed graph with explored nodes and edges as well as a search frontier $O$, a set consisting of successors to nodes currently in the graph. For each node, there are $|V|$ possible successors.

We define the search budget $B$ as the number of nodes expanded from the search frontier. Our experiments will seek to compare different methods using the same search budget. We will define this more precisely in Sec. 7.

2.2 Baselines

We consider two categories of decoding methods as baselines: beam search based and sampling based. Beam search is widely used to find near-optimal solutions in NLP tillmann-ney-2003-word; meister-etal-2020-beam. We consider two variants, original beam search (BS) and diverse beam search (DBS), an improved version targeting diverse text generation vijayakumar2016diverse. Some implementation details of beam search are described in Appendix A. Sampling-based decoding methods involve randomness and construct candidates by sampling from the next-token distribution rather than taking the argmax. Methods like temperature annealing (Temp) ficler-goldberg-2017-controlling, top-$k$ sampling fan-etal-2018-hierarchical, and nucleus sampling (Ncls) Holtzman2020The are also widely used to find high-quality text from models. We compare with BS, DBS, Ncls, and Temp in our experiments.

3 Inadequacies of Beam Search

As we have alluded to, beam search is inadequate for our goal for several reasons.

Figure 3: Correlation of ROUGE-2 and model score in beam search. For each example, we compare the hypothesis with the highest model score with all other hypotheses. The x- and y-axes show the gaps in R2 and model score. The Pearson's r is 0.092, which suggests very low correlation between R2 and model score.

Better Model Score ≠ Better Hypothesis

The most critical issue is that beam search is designed to efficiently approximate $\arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})$, but the optimal model score is neither our goal nor a guarantee of a good hypothesis. In Figure 3, we compare the correlation of model score and ROUGE under beam search for text summarization. The Pearson correlation between these two variables is very weak. Beyond ROUGE score, the example in Fig. 1 shows that the main differences between these summaries may be minor differences in surface realization that have little effect on our qualitative judgments of summary quality. Finding the best model score does not substantially improve quality over a near-optimal model score. Allocating resources to eke out slight improvements over the greedy hypothesis, as beam search does, is a poor use of resources for most applications.

Figure 4: Results of BS/DBS on XSum with larger beam sizes, compared to a proposed model introduced later (blue star) with an equivalent beam size. We consider sample ROUGE-2 lower than 13 to indicate low relevance/quality generations. Diversity of BS does not scale well with beam size, and DBS generations become low quality.

Lack of Diversity in (Diverse) Beam Search

Are the model outputs from BS and DBS diverse? We use Self-BLEU (sBl) zhu2018texygen to measure the BLEU score for randomly sampled pairs from each algorithm's output. The lower the self-BLEU, the more dissimilar the pairs are. When decoding summaries on XSum, the sBl for BS/DBS is 87/79, while nucleus sampling can achieve 57/50 depending on the configuration. Although DBS slightly improves diversity compared to the original variant, the overlap among outputs from beam search based methods is still very high, and diversity remains a challenge.

Poor Scaling Behavior

In spite of these shortcomings, perhaps beam search could still be viable with larger beam sizes if more computational resources are available. We experiment with increasing beam sizes and see how diversity scales. In Figure 4, we find that exponentially increasing the beam size does not yield a corresponding increase in the number of novel bigrams for beam search. In DBS, diversity does ramp up, but the quality of the generated text degrades quickly. For BS and DBS, increasing the beam size is not an effective route to better diversity. We also show that increasing beam size does not scale well in terms of finding better hypotheses, shown in Appendix B.

        XSum    zh-en   fr-en   en-fr
BS      71.3%   63.3%   54.0%   59.2%
DBS     71.2%   56.1%   50.4%   55.7%
Table 1: Pruning ratio of BS and DBS on different tasks and datasets, with beam size 16 for XSum and 8 for the MT datasets. We report the average percentage of explored nodes that get pruned and do not appear in any finished hypothesis.

Poor Efficiency from Pruning

One final issue with beam search is that most of its computation is not even useful in producing finished hypotheses; that is, the set of answers produced does not contain most of the nodes expanded in the typical course of operation. We conduct an empirical pruning study on a summarization dataset and three translation datasets and show the results in Table 1. For all studied cases, beam search and diverse beam search prune over half of the expanded nodes. Many pruned hypotheses are not truly ungrammatical or low quality, but are merely slightly less likely than other nodes. How we can preserve more of the explored lattice and do so efficiently is addressed in the next section by our use of best-first search.

4 Best-first Search

As established in the previous section, beam search prunes many paths that would potentially yield high-quality summaries and wastes computational resources expanding nodes that aren’t included in a final search graph. We tackle this issue by changing from beam search to best-first search (Bfs) hart1968formal; pearl1984heuristics. Bfs prioritizes searching over nodes according to a scoring function, giving us more flexibility in how we explore the space. Our chief modification of the base algorithm is a heuristic called ad-hoc completion.

Ad-Hoc Path Completion

Neural text generation is a search problem with a large branching factor ($|V|$) and deep search depth (sequence length). As a result, applying Bfs with the scoring function being the model score of a state often leads to a broad search that rarely returns a valid path. One solution to this problem is to incorporate a heuristic based on length. Model score is monotonically decreasing as a sequence grows in length, so prior work wu2016google; zhang-etal-2018-exploring; meister-etal-2020-best has used a length reward term to alleviate this issue. (This can be considered a heuristic as in (weighted) A* search, but it is not necessarily admissible or consistent, so we do not describe it this way.) We found that, even with a length heuristic, Bfs will still have “dangling” nodes that are not part of any path to an EOS (goal) token, and in some cases it will return few or no valid hypotheses.

Recognizing our objective from Equation 1, we can take a simple step to ensure that every node ends up on some completed path: eagerly do a greedy “rollout” from each node until we reach EOS. (We can discard the path if it exceeds the maximum length and still does not terminate, but this was not observed in practice.) In Algorithm 1, we implement this by modifying the priority of the highest-scored token to $+\infty$ (line 12), so it will be explored immediately after the current time step. In Figure 5, we show an illustrative example of ad-hoc completion.

Require: Generation model $P$ with vocabulary $V$, search budget $B$; $O$ and $C$ denote the open set (max priority queue) and the closed set; isRecomb and doRecomb are functions checking and running path recombination.
Ensure: All completed paths
1:  $O \leftarrow \{(\mathrm{score}(v_{\mathrm{BOS}}), v_{\mathrm{BOS}})\}$, $C \leftarrow \emptyset$, $b \leftarrow 0$
2:  while $O \neq \emptyset$ and $b < B$ do
3:      $v \leftarrow O$.pop()
4:      if isRecomb($v$, $C$) then
5:          doRecomb($v$, $C$)
6:          continue
7:      end if
8:      if $v \neq$ EOS then
9:          $b \leftarrow b + 1$
10:         for each candidate expansion $y$ of $v$ under $P$ do
11:             if $y$ is the highest-scored expansion then
12:                 priority $\leftarrow +\infty$ // ad hoc completion
13:             else
14:                 priority $\leftarrow$ score(path($v$) $\cdot$ $y$)
15:             end if
16:             $O$.push((priority, $y$))
17:         end for
18:     end if
19:     $C \leftarrow C \cup \{v\}$
20: end while
Algorithm 1 Best-first search with ad-hoc completion and path recombination

Search Algorithm

We describe Bfs with ad-hoc completion in Algorithm 1. The algorithm is a slightly modified best-first search algorithm applied to text generation. score($\cdot$) is a function to evaluate the value of a path; typically it is defined as the model score (the sum of token log probabilities) plus a length reward. $B$ is the budget for total calls to the neural text generation model. Note that isRecomb and doRecomb do not invoke the neural generation model, so they do not count towards the computation budget we defined here. In practice, we only consider the top 5 expansions rather than the whole vocabulary for line 10.
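For readers who prefer runnable code to pseudocode, here is a compact Python sketch of the loop in Algorithm 1 with the recombination hooks omitted. It assumes a hypothetical step_fn(prefix) wrapper that returns the model's top continuations as (token, log-probability) pairs sorted by score, and it uses a simple length-rewarded score; it illustrates the ad-hoc completion trick rather than reproducing the authors' implementation.

```python
import heapq

def best_first_decode(step_fn, budget, eos="</s>", top_k=5, len_reward=0.8):
    """Best-first search with ad-hoc completion (sketch; step_fn and parameters are assumptions)."""
    # heap items: (priority, cumulative logprob, token list); heapq is a min-heap, so priorities are negated scores
    frontier = [(0.0, 0.0, ["<s>"])]
    completed, expansions = [], 0
    while frontier and expansions < budget:
        _, logp, prefix = heapq.heappop(frontier)
        if prefix[-1] == eos:
            completed.append((logp, prefix))       # finished path; does not consume budget
            continue
        expansions += 1                            # one call to the generation model
        for rank, (tok, lp) in enumerate(step_fn(prefix)[:top_k]):
            new_logp = logp + lp
            score = new_logp + len_reward * len(prefix)   # length-rewarded model score
            if rank == 0:
                prio = float("-inf")               # ad-hoc completion: expand the greedy child next
            else:
                prio = -score
            heapq.heappush(frontier, (prio, new_logp, prefix + [tok]))
    return completed
```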

5 Path Recombination

Path recombination, also known as hypothesis recombination or state merging, was originally proposed and used in phrase-based machine translation och-etal-2001-efficient; koehn-etal-2003-statistical; zhang-etal-2018-exploring. The idea of path recombination is to combine similar paths if what the model predicts for them in the future is the same, reflecting a similar dynamic programming principle as the Viterbi algorithm. (In phrase-based MT, this procedure merges decoder hypotheses with identical n-gram suffixes, among other conditions, that yield identical scores for all possible futures due to the use of an n-gram language model on the target side. However, because neural model scoring does not factor over n-grams, no pair of hypotheses will have this property.) We focus on finding hypotheses which approximately exhibit this property, and show that merging them can yield high-quality outputs. Figure 2 shows an example of recombination. The two hypotheses being merged here roughly convey the same intent; a neural model could prefer different continuations for these two based on latent attributes of the text, but we will show that the shared suffix “has gone into” is a strong indicator that the model will not distinguish them strongly in the rest of the generation.

Figure 5: Ad-hoc completion. The red node is the current node being expanded. We greedily roll out a sequence of nodes (in blue) to get a completed path.

5.1 Prerequisites of Recombination

In the strictest form, recombining two hypotheses assumes the following equivalence between them:

Definition 5.1 (Strong equivalence).

Let $s_a$ and $s_b$ be two prefix strings starting with BOS. $s_a$ and $s_b$ are strongly equivalent if $P(w \mid s_a, \mathbf{x}) = P(w \mid s_b, \mathbf{x})$ holds for all continuations $w$.

Merging such states in the search tree is valid with no loss of information, as any expanded node will receive the same score under both prefixes. However, this assumption is not realistic since Transformer models condition on the entire sequence so far, and any tiny perturbation changes the predicted distribution. To relax the assumption, we then propose the weak alternative.

Definition 5.2 (Weak equivalence).

Let $s_a$ and $s_b$ be two prefix strings starting with BOS. $s_a$ and $s_b$ are weakly equivalent if the greedy completions of these two strings are the same: $\mathrm{greedy}(s_a) = \mathrm{greedy}(s_b)$.

This criterion we can actually check empirically. However, it is still not practical to check during search itself, as it requires expanding a number of additional nodes per comparison. We will define even weaker criteria and show that these can be good proxies for identifying weakly equivalent nodes.

5.2 Canonical Paths

After recombination, a single node may represent multiple different possible sentence prefixes. We define the notion of a canonical path, which represents the single path used to score candidate expansions.

Figure 6: Illustration of two path recombination strategies. We show the input and the recombination output from Rcb and Zip. Orange lines are the merging edges (Mrg) built by recombination. Dotted lines and circles are components removed after recombination. The key difference between Rcb and Zip is how far the recombination propagates: one step (Rcb) or up to $n$ steps (Zip).

Type of Edge & Path

If an edge is created by extending the search graph via the model's prediction, we call it a Gen edge. Otherwise, the edge is created by path recombination, and we call it a Mrg edge. In Fig. 6, the edges in orange are Mrg edges.

Definition 5.3 (Canonical Path).

Let $v$ be a node. The canonical path to $v$ is defined as the unique path from $v_{\mathrm{BOS}}$ to $v$ consisting only of Gen edges.

Theorem 5.1.

For any node in the graph except $v_{\mathrm{BOS}}$, there exists exactly one canonical path.

We present the proof in Appendix C.

In the rest of the paper, unless specified, the path of a node $v$, path($v$), returns the sequence of words corresponding to the canonical path of that node. Expanding $v$ computes $P(\cdot \mid \mathrm{path}(v), \mathbf{x})$ under the neural model.
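Because canonical paths use only Gen edges, they can be read off by following each node's single Gen-edge parent back to BOS. The sketch below does exactly that, assuming nodes carry token and gen_parent fields as in the earlier LatticeNode illustration; it is our own rendering, not the paper's code.

```python
def canonical_path(node):
    """Read off the canonical path of a node by following Gen-edge parents back to BOS."""
    tokens = []
    cur = node
    while cur is not None:
        tokens.append(cur.token)
        cur = cur.gen_parent      # Mrg edges are never followed, so this path is unique
    return list(reversed(tokens))

# With the earlier fragment: canonical_path(cardiff) == ["<s>", "A", "Cardiff"]
```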

5.3 Merging Criteria

Strong and weak equivalence are too expensive to establish during inference. We instead rely on even simpler functions to approximate this further. We define a similarity function to determine if an expanded node $v$ should be merged with an existing expanded node $v'$. Note that to tractably implement the isRecomb step of Algorithm 1, we need this check to be very efficient, as it needs to compare a node against the entire lattice built thus far.

A similar recombination idea was explored in zhang-etal-2018-exploring. Following their work, we explore a family of rule-based heuristics for merging. There are two rules: (1) the two strings share a common $n$-gram suffix; (2) the length difference of the two strings is less than a threshold $\delta$. Assume that the canonical paths for $v_a$ and $v_b$ have lengths $l_a$ and $l_b$. Then

$\mathrm{merge}(v_a, v_b) = \mathbb{1}[\mathrm{suffix}_n(v_a) = \mathrm{suffix}_n(v_b)] \wedge \mathbb{1}[\,|l_a - l_b| < \delta\,]$   (2)

where $n$ and $\delta$ are hyper-parameters. (In zhang-etal-2018-exploring, there is one extra constraint for Eq. 2 requiring that the path getting recombined has a lower model score than the existing path. However, we found that model score is not always a good indicator for merging, as suggested in Fig. 3, partially because it is challenging to calibrate scores across different sequence lengths, so we disregard this constraint.) For a large enough value of $n$, the shared suffix requirement favors hypotheses like those in Figure 6 that already share large parts of their structure.
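Read as code, the criterion is just a suffix check plus a length check. The sketch below uses our own function name and stand-in default values for the two hyper-parameters; it is illustrative only.

```python
def should_merge(path_a, path_b, n=3, max_len_diff=2):
    """Rule-based merge check in the spirit of Eq. (2): shared n-gram suffix, similar lengths.
    n and max_len_diff stand in for the paper's hyper-parameters."""
    if len(path_a) < n or len(path_b) < n:
        return False
    same_suffix = path_a[-n:] == path_b[-n:]
    close_length = abs(len(path_a) - len(path_b)) < max_len_diff
    return same_suffix and close_length

# should_merge("a cardiff recycling company has gone into".split(),
#              "a cardiff waste management company has gone into".split())  # -> True
```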

5.4 Prior Work: BSZBeam

zhang-etal-2018-exploring use their merging criterion in the context of beam search for neural machine translation. Their work focuses on beam search, so the matching candidates are restricted to hypotheses of the latest and second latest time steps. If the merging criteria hold, the newly expanded hypothesis $v_a$ will be recombined with the existing hypothesis $v_b$. However, $v_a$ will not be considered as a future merging candidate. We call this merging strategy ZBeam. We implement this model together with its merging criteria and denote it as BSZBeam. This strategy is tailored to beam search and, as we discuss later, explores a more limited set of merges than one might want to consider.

6 Implementing Recombination

In this work, we present two path recombination strategies, Rcb and Zip. Rcb is a natural generalization of ZBeam, and Zip is a more aggressive recombination strategy which produces a denser lattice at the cost of potentially introducing more noise. We present an illustration of them in Figure 6.

Rcb: Generalization of ZBeam

ZBeam has a major limitation: a limited set of merging candidates. The potential merge candidates in ZBeam are only nodes in the current beam hypotheses and their previous steps, so the method cannot merge with nodes from earlier timesteps. For example, “A waste plant has gone into” cannot be merged with the hypothesis ending in node 4 shown in Figure 6. The proposed generalization, Rcb, addresses this limitation. We index all of the nodes in the lattice across all timesteps by their $n$-grams using a hash table, making it $O(1)$ time to look up an $n$-gram pattern and retrieve potential merge candidates if they exist.
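One way to realize this constant-time lookup is to key a hash table by the suffix n-gram of each node's canonical path, as in the sketch below. The class and method names are of our choosing, not the released implementation.

```python
from collections import defaultdict

class SuffixIndex:
    """Hash-table index from n-gram suffixes to lattice nodes (illustrative sketch)."""
    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(list)

    def add(self, node, canonical_tokens):
        """Register a node under the last n tokens of its canonical path."""
        if len(canonical_tokens) >= self.n:
            self.table[tuple(canonical_tokens[-self.n:])].append(node)

    def candidates(self, canonical_tokens):
        """Average O(1) retrieval of nodes sharing the same n-gram suffix."""
        if len(canonical_tokens) < self.n:
            return []
        return self.table.get(tuple(canonical_tokens[-self.n:]), [])
```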

Zip: Recombining More

If we take a closer look at Rcb in Figure 6, we see that even in the merged structure, nodes 3 and 7 and nodes 2 and 6 are preserved as separate. They do not pass the recombination criterion themselves, but these nodes are part of the suffix matched strings, still correspond to the same words, and have the same directly generated next word. There is reason to believe that these might be equivalent as well.

Hence, we explore a variant called Zip that propagates the merge backwards through the lattice. This change relaxes the merging criterion: potentially up to $n$ pairs of nodes are combined when a merge is identified, leading to a more compact lattice. We describe some of the implementation details in Appendix D.

Methods with Path Recombination

We implement path recombination on two of our baseline methods: beam search and nucleus sampling. For beam search, we implement BSZBeam following zhang-etal-2018-exploring. Due to the flaws of beam search discussed earlier and the inherent complexity, we do not integrate Rcb and Zip with beam search. We integrate Rcb with nucleus sampling, denoted as NclsRcb (evaluated under the same two nucleus settings as Ncls). Finally, we use both merging variants in best-first search, yielding BfsRcb and BfsZip, respectively.

7 Evaluation

To evaluate the proposed methods, we conduct experiments on two conditional text generation tasks: abstractive text summarization and machine translation. Our evaluation focuses on two questions: (1) how large and diverse are our lattices; (2) are the candidates encoded in the lattices high quality and grammatical?

7.1 Datasets & Base Models

We obtain all the models and certain baseline decoding methods from the Transformers library wolf-etal-2020-transformers. Since our methods are inference techniques with rule based heuristics, we do not re-train any models.

Summarization

We use XSum narayan-etal-2018-dont, a popular English news summarization dataset. We sample 100 examples from the validation set. The base model we use is BART-large-XSum lewis-etal-2020-bart. We set the max length to 35 tokens according to the reference summaries.

Machine Translation

We study our models on the English-French (en-fr) pairs from WMT 2014 bojar-etal-2014-findings and Chinese-to-English (zh-en) pair from WMT 2019 barrault-etal-2019-findings. We use mBART liu-etal-2020-multilingual-denoising, a state-of-the-art neural machine translation model. We set the max decoding length to be twice the input length, so it varies per example.

7.2 Evaluation Details

Search Budget

We want to compare our different methods under a similar search budget, or number of nodes expanded. Each node expansion requires running the neural generation model to get a probability distribution over the vocabulary, which is the dominant factor in runtime; we incur negligible overhead from rule-based matching in the merging step, as well as from computing the diversity term in DBS and modifying sampling distributions in sampling methods.

We define the number of nodes expanded in terms of a quantity we call equivalent beam size $k$. Let $T$ denote the maximum length of decoded hypotheses for beam search, and let the total computation budget be $B = k \times T$. For beam search based methods, $k$ is the original beam size and the total budget covers running beam search once with beam size $k$ and maximum decoding length $T$. For sampling based methods, $k$ is the number of independent samples we draw from the model. Each sample is a completed path starting with BOS and ending with EOS. After obtaining $k$ paths, we merge them into one graph (i.e., a trie) to compactly represent duplicate prefix strings. For best-first search methods, $B$ is the total number of nodes explored by the algorithm.
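For the sampling baselines, the post-hoc trie construction mentioned above can be sketched as follows; the function name and dict-of-dicts representation are our own simplification rather than the paper's code.

```python
def paths_to_trie(paths):
    """Merge independently sampled token sequences into a prefix trie (nested dicts)."""
    root = {}
    for path in paths:
        node = root
        for tok in path:
            node = node.setdefault(tok, {})
    return root

# paths_to_trie([["<s>", "a", "plant"], ["<s>", "a", "company"]])
# -> {'<s>': {'a': {'plant': {}, 'company': {}}}}
```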

Enforcing a uniform search budget

Since hypotheses may terminate (reach EOS) before the maximum length, empirically there is a gap between the effective length (the average generated hypothesis length) and the max length for both beam search and sampling. Beam search can also exit before reaching the maximum hypothesis length if a sufficient number of finished paths are found, but this amounts to dynamically changing the budget for each instance. Running our method with a budget derived from beam search is artificial and not realistic in practice.

Instead, we apply a corpus-level correction factor so that the different methods expand the same number of nodes. We increase the beam size by 50% for translation and 25% for summarization for our baseline methods: BS, DBS, Ncls, Temp, and BSZBeam. This correction balances the number of nodes expanded between our method and the baselines.

7.3 Search Algorithms

Greedy

is the deterministic greedy decoding method that always selects the highest probability token as prediction. The equivalent beam size for this approach is 1 since we only run one pass.

Bs & Dbs

stand for beam search and its variant diverse beam search vijayakumar2016diverse. In our configuration, we use Hamming distance as the diversity function and set the diversity strength to 1.5, following vijayakumar2016diverse.

Ncls

is the nucleus sampling method proposed in Holtzman2020The, which encourages quality by truncating the distribution over the vocabulary with a parameter $p$ before sampling. We experiment with two values of $p$.

Temp

changes the temperature of the softmax function to reshape the prediction distribution ficler-goldberg-2017-controlling. We set the temperature parameter so that sampling picks up more low-scored tokens.

Bfs

is the standard best-first search method without path recombination. We use our ad-hoc path completion technique to ensure that finished hypotheses are produced. Our preliminary study shows that a regular best-first search method cannot always yield even one valid hypothesis, even with a length-aware model score function.

Our Methods with Path Recombination

BSZBeam is our implementation of zhang-etal-2018-exploring. We integrate Rcb with nucleus sampling and best-first search as NclsRcb and BfsRcb. We also test Bfs with the Zip strategy, yielding BfsZip. 🍂BfsZip is a resource-efficient version of BfsZip where only 25% of the search budget is used, exploring what this method can achieve with a lower budget given its more aggressive merges.

Model  path  N1  N2  sBl  ED  Or-R2  Sp-R2  Grm-Err
Greedy 1 22 23 100 0 17.3 17.3 0.5%
BS 20 42 61 87 31 26.3 17.7 0.3%
DBS 19 59 91 79 53 25.5 15.9 0.5%
Ncls 20 124 237 57 72 30.2 14.5 0.5%
Ncls 20 143 273 50 76 28.1 13.3 0.8%
Temp 20 170 319 51 82 26.6 11.6 1.4%
Bfs 30 88 167 68 60 30.1 15.6 0.4%
+ Path Recombination
BSZBeam 4,701 66 118 75 51 33.0 16.0 0.7%
NclsRcb 52 167 308 53 79 28.8 13.0 1.0%
NclsRcb 36 207 363 50 87 25.9 11.0 1.7%
BfsRcb 7,758 111 239 65 64 35.8 15.2 0.8%
BfsZip 95,744 124 274 53 77 36.8 13.2 1.4%
🍂BfsZip 297 58 92 80 49 29.2 15.2 0.8%
Table 2: Main results for all methods decoding text summaries on XSum. Diversity metrics are rounded to integers due to space limits. For each metric, the desired trend is either higher-is-better, lower-is-better, or simply passing a quality threshold. Ncls and NclsRcb each appear under two nucleus settings. Among the methods with path recombination, excluding 🍂BfsZip, we denote the best, second and third best, and the worst one in color.

7.4 Evaluation Metrics

We describe our metrics to evaluate both quality and diversity. Several of our metrics build on ROUGE (https://github.com/google-research/google-research/tree/master/rouge) lin-2004-rouge and BLEU papineni-etal-2002-bleu; post-2018-call for evaluating the generated text compared to reference summaries or translations.

Diversity-oriented Metrics

We evaluate the diversity of generated texts with the following metrics. (1) path is the average number of unique paths in the produced lattice. (Due to the exponentially growing number of paths in some of our models, we cap the number of incoming paths each node can hold to a constant; incoming paths are the paths starting with BOS and ending at the current node.) (2) Number of unique $n$-grams encoded in the lattice; this captures a different type of diversity than the number of paths, since there could be many paths reusing the same words. N1 and N2 are the average numbers of novel unigrams and bigrams in the graph. (3) sBl is the average self-BLEU among sampled pairs zhu2018texygen; the samples are drawn by a uniform random walk from BOS, and lower self-BLEU indicates more diverse output. (4) ED is the average edit distance among samples. We fix the number of samples per example in our experiments.
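The self-BLEU and edit-distance numbers are computed over paths drawn by a uniform random walk from BOS. A sketch of such a walk, assuming nodes with token and children fields as in the earlier LatticeNode illustration:

```python
import random

def random_walk(bos_node, eos="</s>", max_len=100):
    """Sample one candidate by walking uniformly over child nodes until EOS or a dead end."""
    node, tokens = bos_node, []
    for _ in range(max_len):
        tokens.append(node.token)
        if node.token == eos or not node.children:
            break
        node = random.choice(node.children)   # uniform choice at every branch point
    return tokens

# Pairs of random_walk(bos) outputs feed the self-BLEU and edit-distance metrics.
```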

Quality: Grammaticality

One concern about our method is that by recombining different hypotheses, we could introduce grammatical errors, e.g., if two hypotheses have different parses despite a shared $n$-gram suffix. Following past work xu-durrett-2019-neural; xu-etal-2020-discourse, we opt for an automated approach to evaluating grammaticality. We adopt GECToR (https://github.com/grammarly/gector), a state-of-the-art neural grammatical error correction model omelianchuk-etal-2020-gector, to automatically assess the grammaticality of generated texts. We use the RoBERTa version of the model, which achieves an F$_{0.5}$ of 71.8 on the test set of BEA-2019 bryant-etal-2019-bea. We report Grm Err (%), the average number of grammar errors per token, for all English-output experiments.

Quality: Oracle Reference Match

Given the reference summary or translation, we find the path with the highest ROUGE or BLEU over all found paths. Oracle ROUGE is defined as $\max_{\mathbf{y} \in G} \mathrm{ROUGE}(\mathbf{y}, \mathbf{y}^{\mathrm{ref}})$, maximizing over the paths encoded in the lattice $G$. For ROUGE metrics, we maximize over R2 and present R1, R2, and RL. This metric captures both quality and diversity: the algorithm needs to find something close to the reference, but a diverse lattice will have a higher chance of exhibiting a good candidate, all else being equal. We denote this metric as Or.
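Concretely, the oracle metric is a max over per-path scores. The sketch below assumes an enumerated (or capped) set of lattice paths and a generic score_fn such as ROUGE-2 or BLEU supplied by the caller; names are ours.

```python
def oracle_match(paths, reference, score_fn):
    """Oracle reference match: return the candidate closest to the reference under score_fn."""
    texts = [" ".join(p) for p in paths]
    best = max(texts, key=lambda t: score_fn(t, reference))
    return best, score_fn(best, reference)
```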

                zh-en                                              fr-en
Model  path  N1  N2  sBl  ED  Or-Bl  Sp-Bl  Grm-Err    path  N1  N2  sBl  ED  Or-Bl  Sp-Bl  Grm-Err
Greedy 1 35 40 100 0 24.7 24.7 0.5% 1 28 31 100 0 40.0 40.0 0.9%
BS 12 45 63 95 20 32.2 25.0 0.2% 12 37 50 93 13 52.6 38.1 1.0%
DBS 11 55 84 89 59 29.7 20.5 0.5% 11 45 67 88 37 46.4 30.5 1.1%
Ncls 12 94 188 72 82 31.5 17.5 0.7% 11 62 107 80 46 51.0 31.2 1.0%
Ncls 12 110 226 67 94 30.4 15.8 0.9% 12 75 134 77 57 48.3 27.4 1.2%
Temp 12 140 280 62 105 27.0 12.7 1.3% 12 102 184 69 71 43.7 21.6 1.6%
Bfs 18 60 104 86 54 32.7 20.7 0.5% 27 59 102 84 37 53.2 33.7 1.1%
+ Path Recombination
BSZBeam 18,336 64 117 77 65 40.1 19.1 0.8% 16,729 59 107 77 43 61.2 28.2 1.3%
NclsRcb 81 138 263 67 91 26.8 13.9 1.1% 344 140 246 64 67 48.2 26.6 1.2%
NclsRcb 38 188 343 58 114 23.9 10.6 1.7% 123 205 352 55 92 41.1 20.2 2.1%
BfsRcb 17,535 81 171 76 72 42.1 19.4 0.9% 47,577 85 193 68 52 64.6 25.3 1.6%
BfsZip 59,020 94 205 66 88 42.4 15.5 1.4% 146,163 111 259 56 63 56.8 16.9 2.5%
🍂BfsZip 511 50 75 89 38 33.0 21.2 0.7% 4,531 50 81 82 35 59.5 29.4 1.4%
Table 3: Results on WMT14 Fr-En and WMT19 Zh-En. Columns are the same as for summarization, although BLEU is used instead of ROUGE. Trends in the system performance are roughly similar, with BfsRcb providing high diversity at good quality and 🍂BfsZip offering a strong tradeoff between computational resources and diversity.
Model  path  N1  N2  sBl  ED  Or-Bl  Sp-Bl
Greedy 1 32 35 100 0 28.5 28.5
BS 12 42 57 93 13 37.8 27.5
DBS 10 51 73 89 38 33.1 22.7
Ncls 12 95 171 72 56 35.4 20.4
Ncls 12 116 214 66 73 33.4 17.6
Temp 12 150 274 61 89 28.4 13.1
Bfs 17 62 98 85 35 38.8 25.0
+ Path Recombination
BSZBeam 17,508 67 117 78 40 46.4 21.2
NclsRcb 59 151 261 67 78 29.3 16.3
NclsRcb 32 190 317 53 101 26.9 12.6
BfsRcb 18,663 90 180 74 42 46.6 20.8
BfsZip 49,507 104 213 65 53 45.9 16.7
🍂BfsZip 386 49 70 88 25 39.5 25.7
Table 4: Results on machine translation WMT14 English to French. BfsRcb and BfsZip are strong in both diversity and quality.

Quality: Average Reference Match

Although our method focuses on deriving diverse text summaries or translations, we aim to guarantee that the generated text is highly relevant to the generation target and is of high quality in general. We sample 1,000 paths from the lattice with replacement and evaluate the average ROUGE or BLEU compared to the reference. We denote this metric as Sp.

8 Results

8.1 Text Summarization

We present the experimental results on the dev set of XSum in Table 2. Full results are given in Table 5 for reference. The top half of the table shows the results of non-recombination methods. Among these, BS and DBS are the least diverse. Sampling based methods including Temp are generally more diverse, but their oracle ROUGE is lower than that of Bfs. Given the sacrificed text quality (lower sample ROUGE and more grammar errors) of sampling based methods, we argue that best-first search is an ideal decoding strategy in itself, even without path recombination. It achieves a good balance of diversity and quality, and is more likely to find a candidate close to the reference under the same amount of computational resources.

The bottom half shows all methods with path recombination techniques. Recombination significantly improves the diversity of generated outputs, with a much higher number of paths. The self-BLEU of the recombination variants is lower than that of their non-recombination counterparts.

In terms of search quality, the proposed BfsRcb and BfsZip methods obtain significantly higher oracle ROUGE compared to all other methods. We show these results later in Figure 9: our approach can find much better oracle solutions, even compared with beam search given quadruple the computational resources. The design of the oracle ROUGE metric is also motivated by a real use case: if you want a specific summary (e.g., a summary covering a specific entity or topic), does it exist in the search graph? Higher oracle ROUGE indicates a closer match, meaning a strategy using some kind of reranking model could help find the user's preferred summary.

NclsRcb generates more novel tokens than other methods, but the number of paths is very limited, indicating that these are largely disjoint paths that cannot be recombined. Moreover, the oracle and sample ROUGE are low, showing the lower quality of these outputs.

Comparison: Rcb & Zip

The Zip method yields even more diverse output at the cost of text quality. There are a few reasons for this: (1) recombining more nodes makes the lattice denser, increasing the number of paths but also the potential for errors; (2) eliminating unexplored children from the merged branch reduces wasted exploration, which means Zip can explore more varied hypotheses than Rcb. With the same amount of computational resources, Zip explores a larger search space while Rcb explores a smaller collection more reliably.

🍂BfsZip exploits the efficiency of Zip to achieve high diversity, and by searching through fewer states, it manages to achieve higher quality as well.

8.2 Machine Translation

We show the results on machine translation in Tables 3 and 4. As in summarization, the results on translation tasks show consistent gains in diversity from path recombination models. In Table 3, we show two translation tasks where the target language is English. BfsRcb works better than BfsZip because it avoids some aggressive, bad merges that lead to exploring poor hypotheses. Compared to summarization, we found the search space in MT to be more constrained, so there was less room for aggressive merging and exploration to improve over Rcb.

Our lower-resource method, 🍂BfsZip, performs quite well on most metrics with only 25% of the search budget. It has better diversity than any non-recombination method, and its quality is better than that of most recombination methods. Bfs with path recombination, as in BfsRcb and BfsZip, is therefore promising for finding a better cost-diversity tradeoff in MT.

Figure 7: Empirical verification of merging criteria. We experiment with several values of $n$ for $n$-gram suffix matching. We sample 1,000 recombinations from BfsRcb and BfsZip respectively, and run greedy rollouts based on the merged hypotheses. We use Exact Match (EM) to measure whether the rollouts of two merged hypotheses are the same (1) or not (0) on average. L is the length of the rollout, in tokens, starting from the merge token.
Figure 8: Two examples on XSum by 🍂BfsZip. The start of the sentence is denoted in a dark color, and all the endings are in gray. We combine tokens into phrases when possible for visualization purposes. More examples are presented in Appendix E.

8.3 Validating the Merging Criterion

Our merging criterion is fundamentally an approximation of the equivalence criteria described in Section 5. As described before, no pairs of generation candidates truly satisfy the strong equivalence assumption, but we can explicitly check the weak equivalence assumption. Our question is: what fraction of nodes merged by our merging criterion satisfy the weak equivalence assumption? We conduct an experiment to verify this. We consider the merges made by BfsRcb and BfsZip on XSum. For each merged pair, we compute the greedy completion for L timesteps and check whether the continuations of the base candidates are the same.
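A sketch of this check, assuming a hypothetical greedy_continue(prefix, steps) helper that rolls the model out greedily for a fixed number of timesteps and returns the generated tokens:

```python
def weak_equivalence_rate(merged_pairs, greedy_continue, steps=4):
    """Fraction of merged prefix pairs whose greedy continuations match exactly (sketch)."""
    matches = 0
    for prefix_a, prefix_b in merged_pairs:
        matches += int(greedy_continue(prefix_a, steps) == greedy_continue(prefix_b, steps))
    return matches / max(len(merged_pairs), 1)
```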

In Figure 7, we show the fraction of merged pairs for which the generations match exactly under three values of the $n$-gram recombination criterion. For BfsRcb, the greedy continuation over 4 timesteps is the same 71.2% of the time. For BfsZip, the greedy continuation over 4 timesteps is the same 62.5% of the time. Satisfying the weak equivalence criterion is a strong indication that these hypotheses can admit many of the same continuations. Rcb is more reliable than Zip with respect to the recombination assumption, but both methods show moderate adherence to the equivalence criterion.

8.4 Error Analysis & Visualization

In Figure 8, we present two examples on XSum generated by 🍂BfsZip. The upper example has more word-level recombination and paraphrasing during generation, while the bottom one has more ways of ending and more diverse content coverage (e.g., Albania, 5 June, Leicester City, 1982, etc.). We show more examples on both summarization and translation in Appendix E.

We manually examined the output and found a few common types of errors introduced by our algorithm. (1) Factual errors at high-entropy nodes. Our approach assumes that high-scoring candidates under the model are of good quality, but this assumption is violated in certain cases, like when the model attempts to hallucinate information. For example, the prefix “The company, founded in” will cause the model to guess answers like “1989” or “1999”. Encoding all of these in the lattice is incorrect. This can still happen in BS but is less likely due to pruning. We did not see significant factual errors introduced by merging specifically. (2) Aggressive bad merges. In the upper example in Figure 8, the cluster of “GPs”, “nurses”, “paramedics” is an example case. The lattice encodes paths like “GPs, nurses and nurses should …”. This could be fixed by heuristics or rules, but we leave it for future work.

9 Related Work

The techniques used in this work partially reflect an outgrowth of a few lines of literature: understanding the behavior of text generation models xu-etal-2020-understanding-neural; xu-durrett-2021-dissecting; zhong-etal-2021-adapting-language, investigations into beam search stahlberg-byrne-2019-nmt; meister-etal-2020-beam, and studies of diversity in generation.

Search Strategies in Neural Text Generation

Best-first beam search meister-etal-2020-best is a method integrating best-first search with beam search. Some other variants of search have also been studied in previous work meister-etal-2021-determinantal; meister-etal-2021-conditional. Beam search has been critically examined in recent work huang-etal-2017-finish; stahlberg-byrne-2019-nmt, but that work has largely focused on machine translation and its specific challenges.

Diversity

Neural text degeneration has been observed and discussed in radford2019language; Holtzman2020The; Welleck2020Neural, which led to an interest in diverse generation models. Diverse text generation has been studied in previous work yu2017seqgan, including in dialogue li-etal-2016-diversity, story generation fan-etal-2019-strategies, and particularly paraphrasing iyyer-etal-2018-adversarial; goyal-durrett-2020-neural. Many of the diverse options we observe in summarization correspond to text compressions xu-durrett-2019-neural; desai-etal-2020-compressive, but our method can also diversify content coverage gehrmann-etal-2018-bottom and word choice cao-wang-2021-inference.

10 Discussion & Conclusion

We presented an algorithm for decoding in text generation with two main components: best-first search to more efficiently structure exploration of the space and hypothesis recombination to encode summaries in a lattice structure. We showed that across summarization and machine translation, these lattices successfully encode large numbers of high-quality generation options.

There are a few limitations of our method. First, we currently benchmark these techniques using the number of nodes expanded, not wall-clock time. There are strategies for parallelizing the Bfs expansion shu-nakayama-2018-improving, but it remains to be seen how this parallelism compares to the parallelism achieved by beam search. Regardless, the dramatically larger number of hypotheses we return outweighs efficiency differences for now. Second, we focus on auto-regressive methods in this paper. However, we believe our framework could also be applied and adapted to non-auto-regressive generation models song-etal-2021-new.

Acknowledgments

We would like to thank Sid J Reddy, Zhisong Zhang, Eunsol Choi, Yasumasa Onoe, Shuyang Cao, and Jonathan Kummerfeld for input and feedback on this work. This work was partially supported by a gift from Amazon, NSF Grant IIS-1814522, and a gift from Salesforce Inc.

References

Appendix A Implementation Details: Beam Search

In our beam search implementation, the size of the search frontier is up to the beam size. When a path is completed, we move it from the search frontier to a completed set to free up the beam for exploring unfinished hypotheses. Naturally, the finished hypotheses can be of variable length. After reaching the max generation step, we sort all hypotheses in the completed set according to model score. Following common practice in libraries such as Transformers wolf-etal-2020-transformers, we return a number of completed hypotheses equal to the beam size.

Model  path  N1  N2  sBl  ED  Oracle-R1  Oracle-R2  Oracle-RL  Sample-R1  Sample-R2  Sample-RL  Grm-Err
Greedy 1 22 23 100 0 41.4 17.3 33.5 41.4 17.3 33.5 0.5%
BS 20 42 61 87 31 47.6 26.3 40.3 41.5 17.7 33.6 0.3%
DBS 19 59 91 79 53 47.0 25.5 39.1 38.5 15.9 30.3 0.5%
Ncls 20 124 237 57 72 50.4 30.2 44.2 37.4 14.5 29.5 0.5%
Ncls 20 143 273 50 76 48.0 28.1 42.2 36.1 13.3 28.5 0.8%
Temp 20 170 319 51 82 45.0 26.6 38.5 34.1 11.6 26.3 1.4%
Bfs 30 88 167 68 60 50.8 30.1 44.0 39.0 15.6 30.8 0.4%
+ Path Recombination
BSZBeam 4,701 66 118 75 51 52.2 33.0 45.7 40.0 16.0 32.3 0.7%
NclsRcb 52 167 308 53 79 49.0 28.8 41.8 35.0 13.0 27.8 1.0%
NclsRcb 36 207 363 50 87 44.6 25.9 38.7 32.1 11.0 25.1 1.7%
BfsRcb 7,758 111 239 65 64 55.2 35.8 49.3 38.5 15.2 30.8 0.8%
BfsZip 95,744 124 274 53 77 55.6 36.8 48.8 36.8 13.2 28.7 1.4%
🍂BfsZip 297 58 92 80 49 49.6 29.2 42.8 38.8 15.2 31.0 0.8%
Table 5: Full results for all methods decoding text summaries on XSum.
Figure 9: Oracle R2 of BS/DBS with larger beam sizes. The blue star represents BfsRcb with the equivalent beam size.

Appendix B Poor Scaling Behavior: Optimality

As a search algorithm, how do BS and DBS with larger beam sizes perform at finding solutions close to the reference? We compare the oracle R2 of BS/DBS with larger beam sizes in Figure 9. The oracle R2 increases slowly as the beam size doubles, but our model BfsRcb with an equivalent beam size achieves 35.8, much higher than all BS/DBS cases.

Method    Algos  Cand       Len      Dedup
BSZBeam   BS     last step  1        N
Rcb       any    all        1        N
Zip       any    all        up to n  Y
Table 6: Key differences in path recombination methods. BSZBeam is our implementation of the recombination method used in zhang-etal-2018-exploring. Algos shows which search or decoding methods the recombination is used with. Cand is where the merge candidates come from in the lattice. Len reflects how many nodes are recombined per operation. Dedup denotes whether duplicates on the merged branch are removed from the heap.
Figure 10: An illustration of removing unexplored hypotheses from search frontier in Zip.

Appendix C Proof of Theorem 5.1

Proof by induction. Base case: we begin with just $v_{\mathrm{BOS}}$ in the lattice, which has exactly one canonical path consisting of itself.

Inductive case: assume every node in the lattice has exactly one canonical path. We have to consider two cases when expanding a node in the lattice:

(1) Expanding a node $v$ as normal. In this case, $v$ is on the search frontier due to its parent node $u$ being expanded, which establishes a Gen edge from $u$ to $v$. Since $u$ has exactly one canonical path, $v$ has exactly one canonical path, consisting of the canonical path to $u$ extended by the edge to $v$.

(2) Applying recombination. This operation only adds Mrg edges and deletes nodes, neither of which have any impact on the canonical paths.

Appendix D Implementation Details: Zip

We summarize the key differences of ZBeam, Rcb and Zip in Table 6. In Zip, nodes that are already expanded might be removed from the lattice due to recombination. For example, in Figure 6, nodes 6 and 7 are removed in this fashion. In general, we handle this by re-mapping the eliminated node to its surviving counterpart. Any reference to node 7 is routed to node 3, or whatever node 3 is mapped to. This procedure is defined and implemented as a union-find data structure.
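A minimal sketch of such a union-find re-mapping, with class and method names of our own choosing:

```python
class NodeUnionFind:
    """Union-find over node ids: maps eliminated nodes to their surviving counterparts."""
    def __init__(self):
        self.parent = {}

    def find(self, node_id):
        """Return the surviving representative for node_id."""
        root = node_id
        while self.parent.get(root, root) != root:
            root = self.parent[root]
        # path compression: point every node on the chain directly at the root
        cur = node_id
        while cur != root:
            nxt = self.parent[cur]
            self.parent[cur] = root
            cur = nxt
        return root

    def merge_into(self, removed_id, surviving_id):
        """Route all future references to removed_id to surviving_id (e.g., node 7 -> node 3)."""
        self.parent[self.find(removed_id)] = self.find(surviving_id)
```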

Deduplication of Unexplored Successors

After the Zip procedure, we also remove the unexplored successors of the merged nodes, like nodes 6, 7, and 8 in Fig. 6. We show a more detailed example in Figure 10. In Zip, we merge node 3 and node 6. If we take a closer look at the successors of these two nodes, their next-token distributions could be similar, and in fact are expected to be if the equivalence criteria hold. We therefore remove the unexplored direct successors of the merged node as part of the merging process; the surviving node (node 3) captures these continuations with similar probabilities regardless.

Appendix E Examples

Figure 11: Visualization of one example output for beam search on XSum. The BOS node is labeled. Each color represents one unique ending.
Figure 12: Visualization of one example output for BfsRcb on XSum.
Figure 13: Visualization of one example output for BfsZip on XSum.

We show three examples with visualizations in Figures 11, 12, and 13. We use PyVis (https://github.com/WestHealth/pyvis) as the visualization tool. More examples are available at https://github.com/jiacheng-xu/lattice-generation.