A Simple, Fast Diverse Decoding Algorithm for Neural Generation

11/25/2016 ∙ by Jiwei Li, et al. ∙ Stanford University

In this paper, we propose a simple, fast decoding algorithm that fosters diversity in neural generation. The algorithm modifies the standard beam search algorithm by adding an inter-sibling ranking penalty, favoring choosing hypotheses from diverse parents. We evaluate the proposed model on the tasks of dialogue response generation, abstractive summarization and machine translation. We find that diverse decoding helps across all tasks, especially those for which reranking is needed. We further propose a variation that is capable of automatically adjusting its diversity decoding rates for different inputs using reinforcement learning (RL). We observe a further performance boost from this RL technique. This paper includes material from the unpublished script "Mutual Information and Diverse Decoding Improve Neural Machine Translation" (Li and Jurafsky, 2016).




1 Introduction

Neural generation models [Sutskever et al.2014, Bahdanau et al.2015, Cho et al.2014, Kalchbrenner and Blunsom2013] are of growing interest for various applications such as machine translation [Sennrich et al.2015b, Gulcehre et al.2015], conversational response generation [Vinyals and Le2015, Sordoni et al.2015, Luan et al.2016], abstractive summarization [Nallapati et al.2016, Rush et al.2015, Chopra et al.2016], and image caption generation [Chen et al.2015]. Such models are trained by learning to predict an output sequence, and then at test time, the model chooses the best sequence given the input, usually using beam search.

One long-recognized issue with beam search is lack of diversity in the beam: candidates often differ only by punctuation or minor morphological variations, with most of the words overlapping [Macherey et al.2008, Tromble et al.2008, Kumar and Byrne2004]. Lack of diversity can hinder sequence generation quality. For tasks like conversational response generation or image caption generation, there is no one correct answer; the decoder thus needs to explore different paths to various sequences to avoid local minima [Vijayakumar et al.2016].

Lack of diversity causes particular problems in two-stage re-ranking approaches, in which an N-best list or lattice of candidates is generated using beam search and then re-ranked using features too global or expensive to include in the first beam decoding pass (e.g., position bias or bilingual attention symmetry in MT [Cohn et al.2016], or global discourse features in summarization). In neural response generation, a re-ranking step helps avoid generating dull or generic responses [Li et al.2015a, Sordoni et al.2015, Shao et al.2016]. Lack of diversity in the N-best list significantly decreases the impact of reranking. Shao et al. (2016), for example, find that re-ranking heuristics in conversational response generation work for shorter responses but not for long responses, since the beam N-best lists for long responses are mostly identical even with a large beam.

Figure 1: An illustration of standard beam search and the proposed diversity-promoting beam search. $\gamma$ denotes the hyperparameter for penalizing intra-sibling ranking. Scores are made up for illustration purposes.

In this paper, we propose a simple, fast, diversity-fostering beam search model for neural decoding; the model can be obtained by changing just one line of beam search code in MATLAB. The algorithm uses standard beam search as its backbone but adds an additional term penalizing siblings (expansions of the same parent node in the search), thus favoring hypotheses from diverse parents (as demonstrated in Figure 1).

The proposed model supports batched decoding on GPUs, significantly speeding up decoding compared to other diversity-fostering models for phrase-based MT systems [Macherey et al.2008, Tromble et al.2008, Kumar and Byrne2004, Devlin and Matsoukas2012]. To show the generality of the model, we evaluate it on three neural generation tasks (conversational response generation, abstractive summarization, and machine translation), demonstrating that the algorithm generates better outputs by considering more diverse sequences. We discuss which properties of these various tasks make them more or less likely to be helped by diverse decoding. We also propose a more sophisticated variant that uses reinforcement learning to automatically adjust the diversity rate for different inputs, yielding an additional performance boost.

2 Related Work

Diverse decoding has been extensively explored in phrase-based MT [Huang2008, Finkel et al.2006], including the use of compact representations like lattices and hypergraphs [Macherey et al.2008, Tromble et al.2008, Kumar and Byrne2004], "traits" like translation length [Devlin and Matsoukas2012], bagging/boosting [Xiao et al.2013], blending multiple systems [Cer et al.2013], and sampling translations proportional to their probability [Chatterjee and Cancedda2010]. The most relevant is work from Gimpel et al. (2013) and Batra et al. (2012) that produces diverse N-best lists by adding a dissimilarity function based on N-gram overlaps, distancing the current translation from already-generated ones by choosing translations that have high scores but are distinct from previous ones. While we draw on these intuitions, these existing diversity-promoting algorithms are tailored to phrase-based translation frameworks and are not easily transplanted to neural MT decoding, which requires batched computation.

Some recent work has looked at decoding for neural generation. Cho (2016) proposed a meta-algorithm that runs in parallel many chains of a noisy version of an inner decoding algorithm. Vijayakumar et al. (2016) proposed a diversity-augmented objective for image caption generation, akin to a neural version of Gimpel et al. (2013). Shao et al. (2016) used a stochastic search algorithm that reranks the hypotheses segment by segment, which injects diversity earlier in the decoding process.

The proposed RL based algorithm is inspired by a variety of recent reinforcement learning approaches in NLP for tasks such as dialogue [Dhingra et al.2016], word compositions [Yogatama et al.2016], machine translation [Ranzato et al.2015], neural model visualization [Lei et al.2016], and coreference [Clark and Manning2016].

3 Diverse Beam Decoding

In this section, we introduce the proposed algorithm. We first review the vanilla beam search method and then detail the proposed algorithm, which fosters diversity during decoding.

3.1 Basics

Let $x$ denote the source input, which is the input dialogue history for conversational response generation or a source sentence for machine translation. The input $x$ is mapped to a vector representation, which is used as the initial input to the decoder. Each $x$ is paired with a target sequence $Y$, which corresponds to a dialogue utterance in response generation or a target sentence in machine translation. $Y = \{y_1, y_2, \dots, y_N\}$ consists of a sequence of $N$ words. A neural generation model defines a distribution over outputs and sequentially predicts tokens using a softmax function:

$$p(Y \mid x) = \prod_{t=1}^{N} p(y_t \mid x, y_1, y_2, \dots, y_{t-1})$$

At test time, the goal is to find the sequence $Y^*$ that maximizes the probability given input $x$:

$$Y^* = \arg\max_{Y} p(Y \mid x)$$

3.2 Standard Beam Search for N-best lists

N-best lists are standardly generated from a model of $p(Y|x)$ using a beam search decoder. As illustrated in Figure 1, at time step $t-1$ in decoding, the decoder keeps track of $K$ hypotheses, where $K$ denotes the beam size, and their scores $S(Y_{t-1} \mid x) = \log p(y_1, \dots, y_{t-1} \mid x)$. As it moves on to time step $t$, it expands each of the $K$ hypotheses (denoted $Y^{k}_{t-1} = \{y^{k}_{1}, \dots, y^{k}_{t-1}\}$, $k \in [1, K]$) by selecting the top $K$ candidate expansions, each expansion denoted $y^{k,k'}_{t}$, $k' \in [1, K]$, leading to the construction of $K \times K$ new hypotheses:

$$Y^{k,k'}_{t} = \{y^{k}_{1}, \dots, y^{k}_{t-1}, y^{k,k'}_{t}\}, \quad k, k' \in [1, K]$$

The score for each of the $K \times K$ hypotheses is computed as follows:

$$S(Y^{k,k'}_{t} \mid x) = S(Y^{k}_{t-1} \mid x) + \log p(y^{k,k'}_{t} \mid x, Y^{k}_{t-1})$$

In a standard beam search model, the top $K$ hypotheses are selected (from the $K \times K$ hypotheses computed in the last step) based on the score $S(Y^{k,k'}_{t} \mid x)$. The remaining hypotheses are ignored as the algorithm proceeds to the next time step.
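As a concrete sketch of one such expansion step (the function and variable names are ours, not the paper's, and the real decoder operates over batched GPU tensors rather than Python lists):

```python
import math

def beam_step(hyps, expand_fn, K):
    """One step of standard beam search.

    hyps: list of (sequence, score) pairs, where score = log p(sequence | x).
    expand_fn: maps a sequence to a list of (token, log_prob) candidates.
    Returns the top-K extended hypotheses by total score.
    """
    candidates = []
    for seq, score in hyps:
        # Expand each hypothesis with its top-K token candidates.
        top_tokens = sorted(expand_fn(seq), key=lambda t: -t[1])[:K]
        for tok, logp in top_tokens:
            candidates.append((seq + [tok], score + logp))
    # Keep the K best of the K*K candidates; the rest are discarded.
    candidates.sort(key=lambda c: -c[1])
    return candidates[:K]
```

Starting from an empty hypothesis and iterating `beam_step` until an end-of-sequence token appears yields the usual N-best list.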

3.3 Generating a Diverse N-best List

Unfortunately, the N-best lists output by standard beam search are a poor surrogate for the entire search space [Finkel et al.2006, Huang2008]. The beam search algorithm can only keep a small proportion of candidates in the search space, and most of the generated translations in the N-best list are similar. Our proposal is to increase diversity by changing the way $S(Y^{k,k'}_{t} \mid x)$ is computed, as shown in Figure 1. For each of the $K$ hypotheses (he and it), we generate the top $K$ translations $y^{k,k'}_{t}$, $k' \in [1, K]$, as in the standard beam search model. Next, we rank the translated tokens generated from the same parental hypothesis based on $p(y^{k,k'}_{t} \mid x, Y^{k}_{t-1})$ in descending order: he is ranks first among he is and he has, and he has ranks second; similarly for it is and it has.

We then rewrite the score for $Y^{k,k'}_{t}$ by adding an additional term $\gamma k'$, where $k'$ denotes the rank of the current hypothesis among its siblings (1 for he is and it is, 2 for he has and it has):

$$\hat{S}(Y^{k,k'}_{t} \mid x) = S(Y^{k,k'}_{t} \mid x) - \gamma k'$$

We call $\gamma$ the diversity rate; it indicates the degree of diversity one wants to integrate into the beam search model.

The top $K$ hypotheses are selected based on $\hat{S}(Y^{k,k'}_{t} \mid x)$ as we move on to the next time step. By adding the additional term $-\gamma k'$, the model punishes lower-ranked hypotheses among siblings (hypotheses descended from the same parent). When comparing newly generated hypotheses descended from different ancestors, the model gives more credit to the top hypotheses from each ancestor. For instance, even though the original score for it is is lower than that of he has, the model favors the former, as the latter is more severely punished by the intra-sibling ranking term $\gamma k'$. The model thus generally favors hypotheses from diverse parents, leading to a more diverse N-best list. The proposed model is straightforwardly implemented with a minor adjustment to standard beam search.
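The modification can be sketched as follows (a minimal, self-contained illustration with invented names; the paper's implementation is batched on GPU). Within each parent, candidates are ranked by probability and the $k'$-th sibling is penalized by $\gamma k'$ before the global top-$K$ selection:

```python
def diverse_beam_step(hyps, expand_fn, K, gamma):
    """One step of diversity-promoting beam search.

    Identical to standard beam search except that the k'-th ranked
    expansion of each parent is penalized by gamma * k'.
    """
    candidates = []
    for seq, score in hyps:
        # Rank the siblings (expansions of the same parent) by probability.
        top_tokens = sorted(expand_fn(seq), key=lambda t: -t[1])[:K]
        for rank, (tok, logp) in enumerate(top_tokens, start=1):
            # Intra-sibling penalty: the only change vs. standard beam search.
            candidates.append((seq + [tok], score + logp - gamma * rank))
    candidates.sort(key=lambda c: -c[1])
    return candidates[:K]
```

With `gamma=0` this reduces exactly to standard beam search; with a sufficiently large `gamma`, the top-ranked child of each parent beats lower-ranked children of other parents, mirroring the he is / it is example above.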

4 Automatically Learning Diversity Rate

One disadvantage of the algorithm described above is that a fixed diversity rate $\gamma$ is applied to all examples. Yet the optimal diversity rate could vary from instance to instance, and too high a diversity rate could even be detrimental if it pushes the decoding model too far from the beam search scores. Indeed, Shao et al. (2016) find that in response generation, standard beam search works well for short responses but deteriorates as the sequence gets longer, while Vijayakumar et al. (2016) argue in image caption generation that diverse decoding is beneficial for images with many objects, but not for images with few objects.

A good diverse decoding algorithm should be able to automatically adjust its diversity rate for different inputs, for example, using small diversity rates for images with fewer objects but larger rates for those with more objects. We propose a reinforcement learning-based algorithm called diverseRL that is capable of learning a different diversity rate $\gamma$ for each input.

4.1 Model

We first define a list $L$ that contains the values that $\gamma$ can take. For example, $L$ might consist of the 21 values in the range $[0, 1]$ at regularly spaced intervals 0.05 apart. (This is just for illustration; one can define any set of diversity rate values.) Our main idea is to use reinforcement learning (policy gradient methods) to discover the best diversity rate $\gamma$ for a given input $x$ with respect to the final evaluation metric. For each input $x$, we parameterize the action of choosing an associated diversity rate $\gamma$ by a policy network $\pi(\gamma = L_i \mid x)$, a distribution over the $|L|$ classes. (This means the probabilities of $\gamma$ taking on two similar values, e.g., 0.05 and 0.1, are independent. An alternative is to make $\gamma$ continuous. However, we find that using discrete values is good enough because of the large amount of training data, and discrete values are easier to implement.) We first map the input $x$ to a vector representation $h_x$ using a recurrent net (this recurrent net shares parameters with the standard generation model), and then map $h_x$ to a policy distribution over the different values of $\gamma$ using a softmax function:

$$\pi(\gamma = L_i \mid x) = \mathrm{softmax}(W h_x)_i$$

Given an action, namely a choice of $\gamma$ for input $x$, we decode using the proposed diverse decoding algorithm and obtain an N-best list. Then we pick the best output: the output with the largest reranking score, or the output with the largest probability if no reranking is needed. Using the selected output, we compute the evaluation score (e.g., BLEU), denoted $R(\gamma, x)$, and this score is used as the reward for the action of choosing diversity rate $\gamma$. (This idea is inspired by recent work [Ranzato et al.2015] that uses BLEU score as a reward in reinforcement learning for machine translation. Our focus is different, since we are only interested in learning the policy that chooses diversity rates $\gamma$.)

We use the REINFORCE algorithm [Williams1992], a kind of policy gradient method, to find the optimal diversity rate policy by maximizing the expectation of the final reward:

$$J(\theta) = \mathbb{E}_{\pi(\gamma \mid x)}\left[R(\gamma, x)\right]$$

The expectation is approximated by sampling $\gamma$ from $\pi(\gamma \mid x)$, and the gradient is computed based on the likelihood ratio [Glynn1987, VM et al.1968]:

$$\nabla J(\theta) \approx \big(R(\gamma, x) - b\big)\,\nabla \log \pi(\gamma \mid x)$$

where $b$ denotes the baseline value. (The baseline is estimated by another neural model that takes $x$ as input and outputs a scalar estimate of the reward; it is trained by minimizing the mean squared error between the estimated reward $b$ and the actual reward $R$, $\|R - b\|^2$. We refer readers to [Ranzato et al.2015, Zaremba and Sutskever2015] for more details. The baseline estimator is independent of the policy model, and its error is not backpropagated to the policy.)

The model is trained to take actions (choosing a diversity rate $\gamma$) that lead to the highest final reward.
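A minimal sketch of the REINFORCE update for the diversity-rate policy follows. The names and the toy gradient-ascent loop are ours; the paper trains a softmax layer on top of the shared encoder representation $h_x$, which we stand in for with a plain logit vector:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, gammas, reward_fn, baseline, lr=0.1):
    """One REINFORCE step for the diversity-rate policy.

    logits: unnormalized scores over candidate gamma values (a stand-in
            for the softmax layer applied to the encoder state h_x).
    reward_fn: maps a sampled gamma to a scalar reward (e.g., BLEU of the
               best output decoded with that gamma).
    Returns updated logits after an ascent step on (R - b) * grad log pi.
    """
    probs = softmax(logits)
    # Sample an action (a gamma value) from the policy.
    i = random.choices(range(len(gammas)), weights=probs)[0]
    advantage = reward_fn(gammas[i]) - baseline
    # Gradient of log pi(i) w.r.t. the logits is (one_hot(i) - probs).
    return [l + lr * advantage * ((1.0 if j == i else 0.0) - p)
            for j, (l, p) in enumerate(zip(logits, probs))]
```

Repeated over training inputs, the policy concentrates mass on the $\gamma$ values that earn the highest rewards, which is exactly the behavior the diverseRL model relies on.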

4.2 Training and Testing

To learn the policy $\pi(\gamma \mid x)$, we take a pretrained encoder-decoder model and run additional epochs over the training set, keeping the encoder-decoder parameters fixed and performing diverse beam search with $\gamma$ sampled from the policy distribution. Because decoding is needed for every training sample, training is extremely time-consuming. Luckily, since the sentence composition model involved in $\pi(\gamma \mid x)$ shares parameters with the pre-trained encoder-decoder model, the only parameters to be learned are those of the softmax function in Eq. 4, the number of which is relatively small. We therefore use only a small fraction of the training examples (around 100,000 instances).

Special attention is needed for tasks in which feature-based reranking is used to pick the final output: feature weights change as $\gamma$ changes, because those weights are tuned on the decoded N-best list for the dev set, usually using MERT [Och2003]. Different $\gamma$ values lead to different dev-set N-best lists and consequently different feature weights. We thus re-tune feature weights on the dev set after every 10,000 training instances. Training takes roughly one day.

5 Experiments

We conduct experiments on three different sequence generation tasks: conversational response generation, abstractive summarization and machine translation, the details of which are described below.

5.1 Conversational Response Generation

5.1.1 Dataset and Training

We used the OpenSubtitles (OSDb) dataset [Tiedemann2009], an open-domain movie script dataset containing roughly 60M-70M scripted lines spoken by movie characters. Our models are trained to predict the current turn given the preceding ones. We trained a two-layer encoder-decoder model with attention [Bahdanau et al.2015, Luong et al.2015], with 512 units in each layer. We treat the two preceding dialogue utterances as dialogue history, simply concatenating them to form the source input.

To better illustrate in which scenarios the proposed algorithm offers the most help, we construct three different dev-test splits based on reference length, specifically:


  • Natural: Instances randomly sampled from the dataset.

  • Short: Instances with target reference length no greater than 6.

  • Long: Instances with target reference length greater than 16.

Each set contains roughly 2,000 instances.

5.1.2 Decoding and Reranking

We consider the following two settings:

No reranking: we simply pick the output with the highest probability using standard or diverse beam search.

Reranking: following Li et al. (2015a), we first generate an N-best list using vanilla or diverse beam search and rerank the generated responses by combining likelihood $\log p(Y \mid x)$, backward likelihood $\log p(x \mid Y)$ (the backward model $p(x \mid Y)$ is trained like a standard Seq2Seq model with sources and targets swapped), sequence length $|Y|$, and language model likelihood $\log p(Y)$. The linear combination of $\log p(Y \mid x)$ and $\log p(x \mid Y)$ is a generalization of the mutual information between the source and the target, which dramatically decreases the rate of dull and generic responses. Feature weights are optimized using MERT [Och2003] on N-best lists of response candidates. (We set the minimum and maximum length of a decoded target to 0.75 and 1.5 times the source length; beam size $K$ is set to 10.) We then rerank the N-best list and pick the target with the largest final ranking score.
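The reranking step amounts to a weighted linear combination of per-candidate features. A minimal sketch (the feature names and weight values are invented for illustration; real weights come from MERT tuning on the dev set):

```python
def rerank_score(features, weights):
    """Linear reranking score for one candidate response Y given source x.

    features: dict of feature values for the candidate, e.g. forward
              log-likelihood, backward log-likelihood, length, LM score.
    weights: per-feature weights (in practice tuned with MERT).
    """
    return sum(weights[name] * value for name, value in features.items())

def rerank(nbest, weights):
    # Pick the candidate with the largest combined score.
    return max(nbest, key=lambda cand: rerank_score(cand[1], weights))
```

A candidate with a mediocre forward likelihood but a strong backward likelihood (i.e., one that is informative about the source) can thus outrank a dull, generic response.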

5.1.3 Evaluation

For automatic evaluation, we report (1) BLEU [Papineni et al.2002], which has been widely used in response generation evaluation; and (2) diversity, the number of distinct unigrams and bigrams in the generated responses scaled by the total number of generated tokens. This scaling avoids favoring long sentences, as described in Li et al. (2016a).
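The diversity metric can be sketched as follows (our own minimal implementation of distinct n-grams scaled by the total token count):

```python
def distinct_n(responses, n):
    """Diversity metric: number of distinct n-grams across all generated
    responses, divided by the total number of generated tokens."""
    ngrams = set()
    total_tokens = 0
    for resp in responses:
        tokens = resp.split()
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0
```

A system that repeats the same generic response gets heavily penalized, since repeated n-grams are counted only once while every emitted token still adds to the denominator.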

            Natural  Short  Long
No reranking:
vanilla      1.30    1.59   0.89
diverse      1.42    1.63   1.06
diverseRL    1.49    1.67   1.03
With reranking:
vanilla      1.88    2.52   1.17
diverse      2.21    2.75   1.68
diverseRL    2.32    2.79   1.70
Table 1: Response generation: BLEU scores from the vanilla beam search model, the proposed diversity-promoting beam search model, and the diverse beam search + reinforcement learning (diverseRL) model on the various datasets. The top block reports results without reranking; the bottom block, with reranking.
            Natural  Short  Long
No reranking:
vanilla      0.032   0.079  0.009
diverse      0.049   0.101  0.049
With reranking:
vanilla      0.068   0.252  0.025
diverse      0.095   0.282  0.070
Table 2: Response generation: Diversity scores from the standard beam search model and the proposed diversity-promoting beam search model on the various datasets. The top block reports results without reranking; the bottom block, with reranking.
Diverse-Win Diverse-Lose Tie
62% 16% 22%
Table 3: Response generation: The gains over the standard beam search model from diverseRL based on pairwise human judgments.

We also conduct a human evaluation, as suggested by Liu et al. (2016). We employ crowdsourced judges to evaluate a random sample of 200 items. Each output pair was ranked by 3 judges, who were asked to decide which of the two outputs was better. They were instructed to prefer outputs that were more specific (relevant) to the context. Ties were permitted. Identical strings were automatically assigned the same score.

Input As the joint chief of the eight clans, you should do something.
N-best List If you want to fight, you must go to the temple.
(Diverse Beam) If you want to fight , you can do it yourself .
No, i’m not going to do anything.
I’m not going to let you go, i’m not going to let you do that.
But i don ’t want to be a monk
N-best List If you want to fight, you must go to the temple . . . .
(Vanilla Beam) if you want to fight , you must go to the temple to fight .
If you want to fight , you must go to the temple . . .
If you want to fight , you must go to the police station .
If you want to fight , you must go to the palace .
Input Well , why can ’t you just believe us?
N-best List I mean, it ’s not like we ’re going to be together forever .
(Diverse Beam) I mean, you ’re the one who ’s been stealing my stuff .
Because i ’m not a good person .
Why can ’t we just go to the police and get the hell out of here ?

This is a very important moment for us .

N-best List I mean , it ’s not like we ’re going to be together forever .
(Vanilla Beam) I mean , it ’s not like we ’re gonna be together forever .
I mean , it ’s not like we ’re going to be together any more .
I mean , it ’s not like we ’re going to be together any more , is it ?
I mean , it ’s not like we ’re going to be friends forever .
Input Senator , why didn ’t you come to me with this information ?
N-best List I didn ’t want you to think i was crazy .
(Diverse Beam) We could have been killed .
Because you ’re not a threat to me .
You know , i was thinking about the way you handled the situation .
If you ’re not gonna tell me, i’m not gonna tell you .
N-best List I didn ’t want you to think i was crazy .
(Vanilla Beam) I didn ’t want to upset you .
I didn ’t want you to worry .
I didn ’t want you to find out.
I didn ’t want you to worry about me .
Table 4: Response generation: Sample responses using the diversity-promoting beam search and vanilla beam search.

5.1.4 Results

Results for BLEU scores and diversity scores are presented in Tables 1 and 2. Diverse decoding boosts performance across all settings, but more so in the reranking settings than the non-reranking ones. For reranking, the improvement is much larger for longer responses than shorter ones. We suspect that this is because short responses have a smaller search space, so vanilla beam search already maintains a sufficiently diverse beam. For longer responses, the generator may get trapped in a local decoding minimum and, as Shao et al. (2016) point out, can generate incoherent or even contradictory responses like "I like fish but I don't like fish". Table 4 shows that for these longer responses, the hypotheses generated by standard beam search are nearly identical, dramatically decreasing the impact of the reranking process. (Sampling could be another way of generating diverse responses. However, responses from sampling are usually incoherent, especially for long sequences, because error accumulates as decoding proceeds. A similar phenomenon was also observed by Shao et al. (2016).)

On the Natural set, which has the same distribution of length as the full data, we see a performance boost from the diverseRL model’s ability to dynamically adjust diversity. For the short and long sets, the improvement from the diverseRL model is less significant, presumably because the dataset has been pre-processed in such a way that examples are more similar to each other in terms of length.

In terms of human evaluation, we find that the diverseRL model produces responses with better general quality, winning 62 percent of the time when compared to standard beam search.

5.2 Abstractive Summarization

5.2.1 Training and Dataset

We consider two settings for abstractive summarization. For the first setting (denoted single), we follow the protocols described in Rush et al. (2015), in which the source input is the first sentence of the document to summarize. We train a word-level attention model for this setting. Our training dataset consists of 800K pairs.

A good summarization system should be able to summarize a large chunk of text, separating the wheat from the chaff. We thus consider another setting in which each input consists of multiple sentences (denoted multi). (We consider at most 10 sentences per input; sentence position is treated as a word feature with a learned embedding, as suggested in Nallapati et al. (2016).) We train a hierarchical model with sentence-level attention [Li et al.2015b] for generation, which has been shown to yield better results than word-level encoder-decoder models for multi-sentence summarization [Nallapati et al.2016].

For reranking, we employ various global features taken from or inspired by prior non-neural work in summarization [Daumé III and Marcu2006, Nenkova and Vanderwende2005, McKeown et al.1999]. Features we consider include (1) the average tf-idf score of the constituent words in the generated output; (2) KLSum [Haghighi and Vanderwende2009], which reflects the topic distribution overlap between the entire input document and the generated output (specifically, the KL divergence between the topic distributions assigned by a variant of the LDA model [Blei et al.2003] that identifies general, document-specific and topic-specific word clusters); and (3) the backward probability $p(x \mid Y)$, i.e., the probability of generating the entire document given the summary. For evaluation, we report ROUGE-2 scores [Lin2004] in Table 5.

           Single  Multi
No reranking:
vanilla     11.2    8.1
diverse     12.1    9.0
With reranking:
vanilla     12.4    9.3
diverse     14.0   11.5
Table 5: ROUGE-2 scores from the vanilla beam search model and the proposed diversity-promoting beam search model for abstractive summarization. The top block reports results without reranking; the bottom block, with reranking.

The proposed diverse reranking algorithm helps in both cases, but more significantly in the reranking setting than the non-reranking one. This is because the mechanisms by which diverse decoding helps in the two settings are different: for the non-reranking setting, if the conditional entropy in the target distribution is low enough, the decoded string from standard beam search will be close to the global optimum, and thus there won’t be much space for improvement. For the reranking setting, however, the story is different: it is not just about finding the global optimal output based on the target distribution, but also about incorporating different criteria to make up for facets that are missed by the encoder-decoder model. This requires the N-best list to be diverse for the reranking models to make a significant difference.

We also find an improvement from the reranking model using document-level features over the non-reranking model. We did not observe a big performance boost from the diverseRL model over the standard diverse model (around 0.3 ROUGE score boost) on this task. This is probably because the diverseRL model helps most when inputs are different and thus need different diverse decoding rates. In this task, the input documents are all news articles and share similar properties, so a unified diversity rate tuned on the dev set might already be good enough. Results for diverseRL are thus omitted for brevity.

Interestingly, we find that results for the multi setting are significantly worse than for the single setting, as also observed by Nallapati et al. (2016): adding more sentences leads to worse results. This illustrates the current inability of neural generation models to summarize long documents. The proposed diverse decoding model produces a large performance boost in the multi setting.

5.3 Machine Translation

5.3.1 Dataset and Training

The models are trained on the WMT'14 training dataset containing 4.5 million pairs for English-German. We limit our vocabulary to the top 50K most frequent words for both languages. Words not in the vocabulary are replaced by a universal unknown token. We use newstest2013 (3000 sentence pairs) as the development set and report translation performance in BLEU [Papineni et al.2002] on newstest2014 (2737 sentences). We trained neural Seq2Seq models [Sutskever et al.2014] with attention [Luong et al.2015, Cho et al.2014]. Unknown words are replaced using methods similar to those of Luong et al. (2015).

5.3.2 Decoding and Reranking

Again we consider both reranking and non-reranking settings; for non-reranking we select the best output using standard beam search. For reranking, we first generate a large N-best list using the beam search algorithm. (We use the same beam size $K$ for standard beam search and diverse beam search.) At each time step of decoding, we are presented with $K \times K$ word candidates. We first add all hypotheses with an EOS token generated at the current time step to the N-best list. Next, we preserve the top $K$ unfinished hypotheses and move to the next time step. We therefore maintain a constant batch size as hypotheses are completed and removed, by adding in more unfinished hypotheses. This allows the size of the final N-best list for each input to be much larger than the beam size. We then rerank the hypotheses using features that have been shown useful in neural machine translation [Sennrich et al.2015a, Gulcehre et al.2015, Cohn et al.2016, Cheng et al.2015], including (1) backward probability $p(x \mid Y)$; (2) language model probability $p(Y)$ trained from monolingual data [Gulcehre et al.2015] ($p(Y)$ is trained with a single-layer LSTM; as additional training data we use the News Crawl corpora from WMT13 (http://www.statmt.org/wmt13/translation-task.html), a subset of which contains roughly 60 million sentences); (3) bilingual symmetry, i.e., the agreement between attentions in German-English and English-German; and (4) target length. Again, feature weights are optimized using MERT [Och2003] on N-best lists of candidates in the dev set.
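This N-best collection strategy, in which finished hypotheses leave the beam and the beam is refilled with unfinished ones, can be sketched as follows. The names are ours, and the sketch omits batching and the diversity penalty:

```python
def collect_nbest(init_hyps, expand_fn, K, max_len, eos="</s>"):
    """Beam search that moves finished hypotheses to the N-best list and
    keeps the beam filled with the top-K unfinished ones, so the final
    list can grow much larger than the beam size K."""
    nbest = []
    hyps = init_hyps
    for _ in range(max_len):
        candidates = []
        for seq, score in hyps:
            for tok, logp in sorted(expand_fn(seq), key=lambda t: -t[1])[:K]:
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: -c[1])
        # Completed hypotheses go straight to the N-best list ...
        nbest.extend(c for c in candidates if c[0][-1] == eos)
        # ... while the beam keeps the top-K unfinished ones.
        hyps = [c for c in candidates if c[0][-1] != eos][:K]
        if not hyps:
            break
    return nbest
```

Because the beam stays full after every completion, each time step can emit up to $K$ finished hypotheses, which is what lets the final N-best list exceed the beam size.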

No reranking:
vanilla     19.8
diverse     19.8 (+0.04)
diverseRL   20.2 (+0.25)
With reranking:
vanilla     21.5
diverse     22.1 (+0.6)
diverseRL   22.4 (+0.9)
Table 6: BLEU scores from the vanilla beam search model and the proposed diversity-promoting beam search model on WMT'14 English-German translation. The top block reports results without reranking; the bottom block, with reranking.

5.3.3 Results

Experimental results are shown in Table 6. First, no significant improvement is observed for the proposed diverse beam search model in the non-reranking setting. The explanation is similar to the abstractive summarization task: the vanilla beam search algorithm with a large beam size is already good enough at finding the global optimum, and the small benefit from diverse decoding on some examples may even be canceled out by others where the diversity rate is too large. The diverseRL model addresses the second issue, leading to a performance boost of +0.25. For the reranking setting, the performance boost is more significant: +0.6 for diverse beam search and +0.9 for diverseRL.

Overall, diverse decoding doesn't seem to improve machine translation as much as it does summarization and response generation. We suspect this is due to the very low entropy of the target distribution (perplexity less than 6), so standard beam search is already fairly strong. (The difference between beam size 2 and beam size 12 in decoding for French-English translation is only around 0.3 BLEU [Sutskever et al.2014], which also suggests that the standard decoding algorithm is sufficiently powerful for MT.)

6 Discussion

In this paper, we introduce a general diversity-promoting decoding algorithm for neural generation. The model adds an intra-sibling ranking term to the standard beam search algorithm, favoring choosing hypotheses from diverse parents. The proposed model is a general, simple and fast algorithm that will bring a performance boost to all neural generation tasks for which a diverse N-best list is needed.

On top of this approach, we build a more sophisticated algorithm that is capable of automatically adjusting diversity rates for different inputs using reinforcement learning. We find that, at the expense of model complexity and training time (compared with the basic RL-free diverse decoding algorithm), the model is able to adjust its diversity rate to better values, yielding generation quality better than either standard beam search or the basic diverse decoding approach.

Diverse decoding doesn't help all tasks equally, contributing the most in two kinds of tasks: (1) those with a very diverse space of ground-truth outputs (e.g., response generation), rather than those in which the conditional entropy of the target distribution is already low (e.g., machine translation); and (2) tasks for which reranking is needed to incorporate features not included in the first decoder pass, such as document-level abstractive summarization.


We would especially like to thank Thang Luong Minh for insightful discussions and releasing relevant code, as well as Sida Wang, Ziang Xie, Chris Manning and other members of the Stanford NLP group for helpful comments and suggestions. We also thank Michel Galley, Bill Dolan, Chris Brockett, Jianfeng Gao and other members of the NLP group at Microsoft Research for helpful discussions. Jiwei Li is supported by a Facebook Fellowship, which we gratefully acknowledge. This work is partially supported by the DARPA Communicating with Computers (CwC) program under ARO prime contract no. W911NF- 15-1-0462 and the NSF via IIS-1514268. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the NSF, or Facebook.