Investigating Label Bias in Beam Search for Open-ended Text Generation

05/22/2020 ∙ by Liang Wang, et al. ∙ Fenbi Technology

Beam search is an effective and widely used decoding algorithm in many sequence-to-sequence (seq2seq) text generation tasks. However, in open-ended text generation, beam search often produces repetitive and generic texts, and sampling-based decoding algorithms such as top-k sampling and nucleus sampling are preferred instead. Standard seq2seq models suffer from label bias due to their locally normalized probability formulation. This paper provides empirical evidence that label bias is a major reason for such degenerate behaviors of beam search. By combining locally normalized maximum likelihood estimation with globally normalized sequence-level training, label bias can be reduced with almost no sacrifice in perplexity. To quantitatively measure label bias, we test the model's ability to discriminate the groundtruth text from a set of context-agnostic distractors. We conduct experiments on large-scale response generation datasets. Results show that beam search can produce more diverse and meaningful texts with our approach, in terms of both automatic and human evaluation metrics. Our analysis also suggests several future research directions towards the grand challenge of open-ended text generation.


1 Introduction

Neural text generation usually involves transforming some input into a text output. In directed generation tasks (Holtzman et al., 2019) such as machine translation, summarization, and data-to-text, the output space is highly constrained by the given input. Beam search (unless explicitly specified, we use length normalization by default) is the de facto sequence decoding algorithm and provides good performance empirically. By contrast, in more challenging open-ended text generation scenarios, such as response generation and story generation, there exist many plausible outputs for a given input. The outputs of beam search are often generic, repetitive, and meaningless, so top-k sampling (Radford et al., 2019) and nucleus sampling (also referred to as top-p sampling) (Holtzman et al., 2019) are much more widely adopted.

Figure 1: An illustrative example of label bias. Even though all hypotheses are plausible, the right one will be preferred because of the larger local probability.

Label bias (Hannun, 2020; Lafferty et al., 2001) refers to the phenomenon that locally normalized models for structured prediction often prefer output states with fewer outgoing transitions. From the perspective of information theory, such models favor output states whose next-state distributions have low conditional entropy. As shown in Figure 1, "Mr" can be followed by many plausible words; since the probability is locally normalized, each word can receive only a small proportion of the probability mass. In the extreme case, if the state transitions are deterministic, the inputs will be completely ignored. Label bias also makes it difficult to correct past mistakes given new observations (Lafferty et al., 2001).
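As a toy numerical illustration of this effect (the numbers below are hypothetical, not taken from the paper), consider scoring two two-token paths with the usual length-normalized log-probability: the path through a high-entropy state splits its probability mass among many successors and loses to the generic path, even though both are equally plausible a priori.

```python
import math

# Hypothetical numbers for illustration (not from the paper). Suppose that
# after "Mr" ten surnames are equally plausible, while after "Hello" only
# two continuations are. Local normalization forces each branch to split
# the probability mass among its successors.
specific_path = [0.5, 1.0 / 10]  # p("Mr"), then p("Smith" | "Mr")
generic_path = [0.5, 1.0 / 2]    # p("Hello"), then p("there" | "Hello")

def avg_logprob(probs):
    """Length-normalized log-probability, the usual beam search score."""
    return sum(math.log(p) for p in probs) / len(probs)

print(avg_logprob(specific_path))  # ~ -1.50
print(avg_logprob(generic_path))   # ~ -0.69: the low-entropy path wins
```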

Seq2seq models trained with MLE factorize the probability of a sequence into a product of locally normalized probabilities. As a result, they also suffer from label bias. Nonetheless, it has been unclear whether label bias has any connection with the degenerate behaviors of beam search. Previous works (Li et al., 2016; Xu et al., 2019) propose heuristic methods to mitigate this issue. In this paper, we evaluate the likelihood distribution of human texts and generated texts, and show that beam search outputs are heavily biased towards the low-perplexity region.

To reduce label bias, one can replace local normalization with globally normalized training, also called sequence-level training in the seq2seq literature. There have been some successes in applying global normalization to part-of-speech tagging (Andor et al., 2016), machine translation (Edunov et al., 2018), etc. Existing methods use average log-probabilities as the unnormalized score, which are still based on local probabilities and often severely hurt perplexity on held-out datasets. In this paper, we use unnormalized logits instead. By combining a token-level likelihood loss with a sequence-level loss, the logits can be calibrated while keeping the local probabilities unchanged.

Evaluating open-ended text generation systems is non-trivial (Liu et al., 2016). To verify the effectiveness of the proposed method, it is important to be able to measure label bias. We propose a heuristic ranking-based metric: first, a set of low-perplexity texts is selected based on a pre-trained language model; then these distractors and the groundtruth are ranked by the predicted model scores. A model with less label bias should rank the groundtruth above the context-agnostic distractors.

We conduct experiments on the large-scale response generation dataset ConvAI2 (Dinan et al., 2019). In terms of automatic evaluation metrics, our method produces significantly more diverse texts than standard token-level MLE training, with little negative impact on perplexity. Human evaluation gives higher specificity and sensibleness scores (Adiwardana et al., 2020) to our model's outputs. Yet there is also evidence that the label bias issue is still far from solved. We discuss several limitations of our work and provide possible future directions towards better understanding beam search in open-ended text generation.

2 Likelihood Distribution Evaluation

To better understand the degenerate behaviors of beam search, we analyze the perplexity distributions of both human texts and generated texts. Our analysis is based on the validation set of ConvAI2 (Dinan et al., 2019). An off-the-shelf GPT2-117M model (https://github.com/openai/gpt-2) is used to evaluate unconditional language model perplexity. We train a Transformer-based seq2seq model with standard MLE to evaluate the perplexity of responses. See Section 4.1 for more details about the dataset and our model.
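As a rough sketch of this evaluation protocol (using the Hugging Face transformers library rather than the authors' code, with hypothetical example texts), per-text perplexity under GPT2-117M can be computed as follows:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" on the Hugging Face hub is the 117M-parameter model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token-level
        # negative log-likelihood; exponentiating gives perplexity.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

print(perplexity("What do you do for a living?"))      # generic, low ppl
print(perplexity("I just got off work and am tired.")) # more specific, higher ppl
```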

2.1 Beam Search Outputs are Heavily Biased

Standard seq2seq models are trained under the principle of maximum likelihood: maximize the probability of human texts given input contexts. Naturally, one would expect that the texts generated by a well-trained model should share similar characteristics with human texts.

Figure 2: Perplexity distribution of human texts and generated texts under the pre-trained GPT2-117M language model.

In Figure  2, we show the perplexity distribution under GPT2-117M. The texts generated by beam search are heavily biased towards the low-perplexity region. In contrast, the distribution of human texts is flat and has a long tail in the high-perplexity region. Perplexity can be intuitively interpreted as the expected number of plausible next tokens. Lower perplexity means beam search favors output states with fewer outgoing transitions, which is a typical symptom of label bias.

2.2 Search Errors or Model Errors?

Figure 3: Perplexity distribution of human texts and generated texts under our trained response generation model. With access to input contexts, the perplexity is lower than the counterpart in Figure  2.

There are two types of errors in seq2seq models with a beam search decoder: search errors and model errors. A search error occurs when the trained model assigns a higher score to the groundtruth text but beam search fails to find it; a model error occurs when beam search outputs indeed have higher scores under the model distribution. With MLE training, we use length-normalized log-probability as the score.

In Figure 3, the average perplexity of generated texts is extremely low. We conclude empirically that the degenerate behaviors of beam search are mainly attributable to model errors rather than search errors; designing better search algorithms to find hypotheses closer to the global optimum is unlikely to help. The state-of-the-art conversational model Meena (Adiwardana et al., 2020) uses sample-then-rerank as its decoding algorithm: it first samples several candidates from the model distribution and then reranks them by perplexity. Our findings imply that sample-then-rerank may risk producing degenerate outputs.

3 Method

3.1 Token-Level Training and Inference

Given input $x$ and target output $y = (y_1, y_2, \dots, y_{|y|})$, seq2seq models aim to maximize the conditional probability $p_\theta(y \mid x)$, where $\theta$ denotes the model parameters. Standard token-level MLE training with teacher forcing factorizes this objective in an auto-regressive way:

$p_\theta(y \mid x) = \prod_{t=1}^{|y|} p_\theta(y_t \mid y_{<t}, x)$   (1)

We omit $\theta$ from now on to make the equations less cluttered. Thus, the token-level cross entropy loss for an input-output pair can be defined as follows:

$\mathcal{L}_{\text{token}} = -\sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x)$   (2)
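In a framework like PyTorch, Equation 2 corresponds to a standard cross-entropy loss over the decoder logits. The sketch below uses random tensors as stand-ins for actual model outputs:

```python
import torch
import torch.nn.functional as F

T, V = 8, 30522             # output length, vocabulary size (BERT vocab)
logits = torch.randn(T, V)  # o_t(w): unnormalized scores at each timestep
target = torch.randint(V, (T,))  # groundtruth tokens y_1..y_T

# F.cross_entropy applies log-softmax internally, i.e. the local
# normalization over the vocabulary that gives rise to label bias.
loss_token = F.cross_entropy(logits, target, reduction="sum")
```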

At inference stage, given an input $x$, the decoding algorithm attempts to find the output $\hat{y}$ with the highest average log-probability score:

$\hat{y} = \arg\max_{y} \frac{1}{|y|} \sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x)$   (3)

Since the search space grows exponentially with the output length, heuristic algorithms like beam search are often used. Beam search with beam size $k$ keeps the $k$-best hypotheses at each time step. The search procedure stops when $k$ complete hypotheses are available and it is impossible to obtain better hypotheses by expanding the beams.
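A minimal sketch of the procedure (not the fairseq implementation the authors use; `step_fn` is a hypothetical interface to the model's next-token distribution):

```python
def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    """Minimal length-normalized beam search sketch.

    step_fn(prefix) -> list of (token, log_prob) pairs for the next token.
    """
    beams = [([bos], 0.0)]  # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:      # set aside complete hypotheses
                finished.append((prefix, score))
                continue
            for tok, lp in step_fn(prefix):
                candidates.append((prefix + [tok], score + lp))
        if not candidates:             # every beam has finished
            break
        # keep the k best partial hypotheses at each time step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == eos)
    # rank final hypotheses by length-normalized score (Equation 3)
    pool = finished or beams
    return max(pool, key=lambda c: c[1] / len(c[0]))
```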

3.2 Sequence-Level Training

Token-level training suffers from label bias, since the next-token probability is locally normalized over the vocabulary $V$: $\sum_{w \in V} p(w \mid y_{<t}, x) = 1$. For a specific timestep, assume there are $n$ equally plausible tokens; due to the constraint of local normalization, each token will receive probability mass $1/n$. Outputs with smaller $n$ have lower next-token entropy and will receive higher scores. In open-ended text generation, $n$ varies over a large range, so beam search favors generic texts, which have smaller $n$ than human texts.

Sequence-level training explicitly maximizes the global score of the groundtruth $y$. There are several possible formulations, such as empirical risk minimization and margin-based sequence losses (Edunov et al., 2018). Most formulations require an automatic metric such as BLEU to score a hypothesis, which is not available in open-ended generation scenarios. In this paper, we cast sequence-level training as multi-class classification: given a score function $s(y, x)$, the sequence cross entropy loss is defined as

$\mathcal{L}_{\text{seq}} = -\log \frac{\exp(s(y, x))}{\sum_{y'} \exp(s(y', x))}$   (4)

In Equation 4, the denominator, a sum over all possible output sequences, is often called the "partition function" in the literature on Markov Random Fields (Koller and Friedman, 2009). The score function $s$ is learned by the seq2seq model. One common choice is the average log-probability shown in Equation 3. However, this score function still builds upon the locally normalized probabilities and often results in much worse perplexity.

Let $o_t(w)$ denote the unnormalized logit for token $w$ at timestep $t$. We propose to use the average of the logits as the score:

$s(y, x) = \frac{1}{|y|} \sum_{t=1}^{|y|} o_t(y_t)$   (5)

One advantage is that logits can vary over a larger range than log-probabilities. Also, for any real number $c$,

$\frac{\exp(o_t(y_t))}{\sum_{w \in V} \exp(o_t(w))} = \frac{\exp(o_t(y_t) + c)}{\sum_{w \in V} \exp(o_t(w) + c)}$   (6)

which indicates that the logits can be calibrated flexibly during learning without affecting the local probabilities $p(y_t \mid y_{<t}, x)$.
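Equation 6 is simply the shift invariance of the softmax, which can be checked numerically:

```python
import torch

# Shifting all logits at a timestep by any constant c leaves the locally
# normalized probabilities unchanged, so the sequence-level loss can
# rescale the logits without hurting perplexity.
logits = torch.tensor([2.0, 0.5, -1.0])
for c in (0.0, 3.7, -10.0):
    print(torch.softmax(logits + c, dim=-1))  # identical for every c
```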

The final loss function $\mathcal{L}$ is a linear combination of the token-level loss $\mathcal{L}_{\text{token}}$ and the sequence-level loss $\mathcal{L}_{\text{seq}}$:

$\mathcal{L} = \alpha \mathcal{L}_{\text{token}} + \beta \mathcal{L}_{\text{seq}}$   (7)

Empirically, we set the coefficients $\alpha$ and $\beta$ to fixed constants.

3.3 Partition Function Estimation

Computing the partition function in Equation 4 requires summing over all possible output sequences, which is practically impossible. Given a model, we can first apply a decoding algorithm to obtain $n$ high-score hypotheses $\mathcal{Y}' = \{y^{(1)}, \dots, y^{(n)}\}$, then approximate the partition function with $\mathcal{Y}' \cup \{y\}$:

$\mathcal{L}_{\text{seq}} \approx -\log \frac{\exp(s(y, x))}{\exp(s(y, x)) + \sum_{y' \in \mathcal{Y}'} \exp(s(y', x))}$   (8)

This is equivalent to an $(n+1)$-class classification problem. We explore two decoding algorithms to obtain the hypothesis set $\mathcal{Y}'$: standard beam search and diverse beam search (Vijayakumar et al., 2016). Standard beam search is widely used but often produces highly similar hypotheses; diverse beam search explores the search space more effectively and produces more diverse hypotheses.
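A sketch of this approximation in PyTorch (the scores below are illustrative stand-ins; in practice they would be the average-logit scores of Equation 5 for the groundtruth and the decoded hypotheses):

```python
import torch
import torch.nn.functional as F

def sequence_loss(gt_score: torch.Tensor, hyp_scores: torch.Tensor):
    """Equation 8 as an (n+1)-way classification problem."""
    # class 0 is the groundtruth; classes 1..n are the decoded hypotheses
    all_scores = torch.cat([gt_score.view(1), hyp_scores]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(all_scores, target)

gt_score = torch.tensor(1.3)                # s(y, x) for the groundtruth
hyp_scores = torch.tensor([2.1, 1.8, 0.4])  # s(y', x) for beam hypotheses
print(sequence_loss(gt_score, hyp_scores))
```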

The aforementioned estimate of the partition function is biased, in the sense that it is a strict lower bound of the actual value, and likely a very loose one. Noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010; Deng et al., 2020) provides an unbiased gradient estimator, but its variance is expected to be high. We leave further investigation of NCE as future work.

3.4 Quantify Label Bias

In this section, we present a simple ranking-based automatic evaluation metric. Intuitively, a model with less label bias should give a higher score to the groundtruth text and lower scores to generic texts, where a piece of text is "generic" if it has low perplexity. We evaluate the perplexity of all sentences from the ConvAI2 training set with GPT2 and use the lowest-perplexity sentences as distractors. Some examples are shown in Table 1.

Text ppl
What do you want to be when you grow up? 9.94
Can you tell me a little about yourself? 11.98
What do you do for a living? 12.22
I do not know what to say to you. 12.24
What do you do in your spare time? 13.95
Table 1: Example sentences with low perplexity under pre-trained GPT2-117M language model.

Given a trained model, the groundtruth is ranked together with these distractors by model-predicted scores in descending order, and we report the mean rank of the groundtruth as a measure of label bias. Since the distractors are selected without any knowledge of the input context, they are unlikely to be appropriate outputs. In the following sections, experiments will show that seq2seq models completely fail at this simple ranking task.
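A sketch of the metric, with `model_score` as a placeholder for whichever scoring function the model under evaluation uses (average log-probability or average logits):

```python
def mean_rank(examples, distractors, model_score):
    """Average rank of the groundtruth among context-agnostic distractors.

    examples: iterable of (context, groundtruth) pairs.
    model_score(context, text) -> float, higher is better.
    """
    ranks = []
    for context, groundtruth in examples:
        gt = model_score(context, groundtruth)
        # rank = 1 + number of distractors the model prefers to the groundtruth
        rank = 1 + sum(model_score(context, d) > gt for d in distractors)
        ranks.append(rank)
    return sum(ranks) / len(ranks)
```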

4 Experiments

Method Decoding ppl BLEU distinct-1(%) distinct-2(%) distinct-3(%)
MLE BS 19.69 3.91 1.20 5.54 10.62
MLE diverse BS 19.69 3.97 1.20 5.87 11.86
LogProb Avg BS 21.73 3.12 1.89 10.84 22.32
LogProb Avg diverse BS 21.06 3.16 1.87 11.03 23.37
Logits Avg BS 20.72 2.43 1.90 12.27 26.85
Logits Avg diverse BS 20.16 2.43 1.91 12.48 27.98
Human - - - 3.66 28.89 60.67
Table 2: Automatic evaluation results. “BS” is short for “beam search”. “MLE” uses token-level training as stated in Section  3.1. “LogProb Avg” uses average log-probability as the score for sequence-level training, while “Logits Avg” uses average logits as the score (Equation  5).

4.1 Setup

Datasets We use the Reddit dataset (Dziri et al., 2018) (available at https://github.com/nouhadziri/THRED) to pre-train our models. It consists of millions of dialogue turns, but is noisy and contains some offensive language. The ConvAI2 dataset (Dinan et al., 2019) (http://convai.io/) is of high quality and is used for fine-tuning. To make the task more open-ended, we discard the persona information in ConvAI2 and adopt the official train/validation split.

Model Configuration Our seq2seq network uses bert-base-uncased (Devlin et al., 2019) as the encoder; the decoder is a stack of Transformer blocks. We tie the parameters of the encoder word embeddings, decoder word embeddings, and the output softmax layer. The Adam optimizer is used, with the learning rate linearly warmed up at the beginning of training. The vocabulary is the same as BERT's. Dropout is applied to self-attention layers, feedforward layers, and input embedding layers, and gradients are clipped by their L2 norm. The same beam size is used both for hypothesis generation in sequence-level training and for inference, and diverse beam search uses multiple groups. The input context is the concatenation of the most recent dialogue turns. When pre-training on Reddit, all parameters are updated; when fine-tuning on ConvAI2, only decoder parameters are updated to reduce over-fitting. Our implementation is based on fairseq (https://github.com/pytorch/fairseq).

Evaluation We use both automatic metrics and human evaluation to get a comprehensive view. Automatic metrics include perplexity (ppl), BLEU-4 (Papineni et al., 2001), and distinct-n (n=1,2,3) (Li et al., 2016). Distinct-n is a measure of diversity: it computes the number of distinct n-grams, normalized by the total number of n-grams. The mean rank among pre-defined distractors is also reported, as described in Section 3.4. For human evaluation, we use the Specificity and Sensibleness Average (SSA) proposed for Meena (Adiwardana et al., 2020). Two annotators score each response for randomly chosen dialogue turns from the validation set on a scale of 0 to 2 (0 = bad, 1 = ok, 2 = good). The SSA score is the arithmetic mean of the specificity and sensibleness scores; it measures the quality of generated texts along two complementary dimensions.
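A sketch of distinct-n as described above (whitespace tokenization is an assumption for illustration):

```python
def distinct_n(texts, n):
    """Number of distinct n-grams divided by the total n-gram count."""
    total, distinct = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / max(total, 1)

outputs = ["what do you do for a living", "i like to read books"]
print(distinct_n(outputs, 2))  # corpus-level distinct-2
```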

4.2 Results

Input A: Hello what are you doing today?
MLE I’m doing well. How are you?
Logits Avg Hi! I’m doing some shopping and enjoying a good steak! you?
Human I am good, I just got off work and tired. I have two jobs.
Input
A: I have a turtle his name is speedy, kitties are nice too, tho!
B: That is an adorable name for a turtle. I have 2 cats.
MLE What do you do for a living?
Logits Avg They’re so cute! Do you have a favorite band?
Human What are your kitties names?
Input
A: That’s okay we all have our differences in likes.
B: That is very true
MLE What do you do for a living?
Logits Avg What do you do for a living?
Human I write poetry and then make them into a song.
Input
A: I do speak multiple languages. Which ones do you speak?
B: I speak, English, Spanish and French.
MLE Do you speak any other languages?
Logits Avg What language do you speak?
Human Those happen to all be languages I speak. I want to visit France sometime.
Table 3: Examples from the ConvAI2 validation dataset. Input lists only the most recent dialogue turns. Both “MLE” and “Logits Avg” use diverse beam search decoding. Outputs from “LogProb Avg” are omitted due to space limitations. “A” and “B” denote different speakers.

Table 2 shows the main results for automatic evaluation. The token-level cross entropy loss used by “MLE” directly optimizes perplexity, so it is not surprising that “MLE” achieves the lowest perplexity of 19.69. “MLE” also has a higher BLEU score (3.91) than the sequence-level training methods. One possible explanation is that there are many more ways to be specific than to be generic: producing a generic output is more likely to match the groundtruth and thus get a higher BLEU score. Though BLEU is widely used in evaluating machine translation systems, previous work (Liu et al., 2016) suggests that BLEU has only a weak correlation with human judgments for response generation.

Distinct-n metric measures the diversity of generated texts. Based on distinct-n (n=1,2,3) in Table  2, both sequence-level training methods “LogProb Avg” and “Logits Avg” produce significantly more diverse results than baseline “MLE” methods. In terms of decoding algorithm, diverse beam search shows consistent improvements across nearly all automatic metrics.

“LogProb Avg” uses the average log-probability as the score, so the token-level and sequence-level losses may compete for the same probability mass: perplexity increases from 19.69 to 21.73. “Logits Avg” can calibrate the logits while keeping the local probabilities relatively unchanged, as shown in Equation 6: perplexity increases only slightly, from 19.69 to 20.72.

MLE LogProb Avg Logits Avg
Mean Rank 44.7 43.9 35.1
Table 4: Mean rank of groundtruth among context-agnostic distractors. Lower mean rank indicates the model has less label bias. See Section  3.4 for more details.

In Table 4, “MLE” fails miserably at discriminating the groundtruth from the pre-defined distractors, with a mean rank of 44.7. “Logits Avg” performs best among the three methods with a mean rank of 35.1. However, a naive baseline that randomly shuffles all the candidates would achieve a far better expected mean rank than our best model. This is evidence that our proposed method only reduces label bias to some degree rather than eliminating it.

Method Specificity Sensibleness SSA
MLE 0.54 0.88 0.71
LogProb Avg 0.68 1.00 0.84
Logits Avg 1.06 1.24 1.15
Human 1.60 1.47 1.53
Table 5: Human evaluation results. The scores are averaged over two annotators and the evaluated dialogue turns, and are in the range of 0 to 2. All methods adopt diverse beam search as the decoding algorithm since it shows slightly better performance on automatic metrics. “SSA” is the arithmetic mean of the specificity and sensibleness scores.

We conduct a human evaluation on randomly chosen dialogue turns from the validation dataset; results are in Table 5. Models tend to generate sensible but not very specific texts: the sensibleness score of every model is higher than its specificity score, while human texts are much more specific. “MLE” produces generic texts with a very low specificity score of 0.54. Both “LogProb Avg” (SSA 0.84) and “Logits Avg” (SSA 1.15) improve over the “MLE” baseline (SSA 0.71), showing that sequence-level training can indeed lead to more specific and sensible outputs, and that using unnormalized logits as the score is more effective than using log-probabilities. Also, sequence-level training has a larger impact on specificity (a 96% relative increase, from 0.54 to 1.06) than on sensibleness (a 41% relative increase, from 0.88 to 1.24).

4.3 Analysis

Some typical examples are given in Table 3. The first two examples showcase that “MLE” often generates generic texts such as “I’m doing well” and “What do you do for a living?”; many previous works report similar findings (Dziri et al., 2018; Adiwardana et al., 2020). Our proposed method “Logits Avg” generates meaningful and specific phrases such as “enjoying a good steak” and “favorite band”. The examples also illustrate why BLEU may not be a good metric for evaluating open-ended text generation systems: though the outputs of “Logits Avg” are of high quality, they share few overlapping words with the groundtruth.

The last two examples in Table 3 show some remaining limitations and difficulties of open-ended generation. In the third example, both “MLE” and “Logits Avg” produce the same generic response, further evidence that sequence-level training does not completely solve the label bias problem in seq2seq networks. Beam search is not guaranteed to find the optimal output sequence, though this may actually help promote response diversity. In the fourth example, “Logits Avg” asks a question that has already been answered in previous dialogue turns; generating semantically consistent responses remains an open problem.

Figure 4: Perplexity distribution of texts generated by different models under GPT2-117M. Better viewed in color.

In Figure 4, we additionally show the perplexity distributions of texts from the two sequence-level training models, “LogProb Avg” and “Logits Avg”. Both distributions are flatter than that of the “MLE” baseline, and their peaks move to the right. The perplexity distribution of “Logits Avg” is slightly closer to the human distribution than that of “LogProb Avg”. Though our proposed methods are less biased, they still prefer low-perplexity texts compared to humans.

4.4 Discussion

Label bias arises when different output states have very different numbers of outgoing transitions. In directed generation tasks such as machine translation and abstractive summarization, there is a nearly one-to-one mapping between input and output; the transitions between output states are almost deterministic, so label bias exists but is not a serious issue. Previous work (Andor et al., 2016; Edunov et al., 2018) observes moderate improvements with globally normalized training. It remains to be seen how state-of-the-art text generation models based on BERT and GPT are affected by label bias.

In a linear-chain CRF (Lafferty et al., 2001), the partition function can be computed exactly and efficiently with the forward algorithm based on dynamic programming. In seq2seq networks, however, the outputs at different timesteps are coupled with one another and do not fit into this dynamic programming framework. In this paper, we use beam search results to estimate the partition function. This inaccuracy may be a major reason why our proposed model still favors generic texts to a large degree.

In token-level MLE training, each update requires one forward pass and one backward pass. Sequence-level training additionally requires a decoding step; auto-regressive decoding is sequential and therefore slow, preventing us from fully exploiting the computational power of modern GPUs and the inherent parallelizability of Transformers. A common practice (Edunov et al., 2018) is to first pre-train the network with token-level MLE and then finetune with the sequence-level loss.

5 Related Work

Neural Text Generation with seq2seq models has been a popular paradigm for many generation tasks in recent years, such as neural machine translation (Wu et al., 2016), abstractive summarization (See et al., 2017), and grammatical error correction (Zhao et al., 2019). Most existing models use token-level maximum likelihood estimation as the optimization objective and beam search as the sequence decoding algorithm. Backbone architectures include LSTMs, CNNs (Gehring et al., 2017), and Transformers (Vaswani et al., 2017). Since Transformers are highly parallelizable and can model long-term dependencies, they have become a core component of many state-of-the-art models (Radford, 2018). Exposure bias (Bengio et al., 2015; Zhang et al., 2019) is widely studied in seq2seq models trained with teacher forcing. With the emergence of powerful pre-trained models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), there is growing interest in improving text generation with language model pre-training (Song et al., 2019; Wang et al., 2019).

Beam Search with length normalization is a widely used heuristic decoding algorithm for many structured prediction models (Wu et al., 2016; Bahdanau et al., 2015). It has several known deficiencies, including length bias (Yang et al., 2018; Huang et al., 2017), lack of diversity within beams (Vijayakumar et al., 2016), and performance degradation with larger beams (Cohen and Beck, 2019; Stahlberg and Byrne, 2019). In open-ended text generation, such as story generation (Fan et al., 2018) and conditional language modeling (Holtzman et al., 2019), standard beam search often produces degenerate outputs and is therefore rarely used. For sampling-based decoding algorithms, tricks like adjusting the temperature and explicitly blocking duplicate n-grams work well (Fan et al., 2018). Several heuristic methods have been proposed to promote the diversity of beam search outputs: Xu et al. (2019) incorporate additional meta-words into the context, Gao et al. (2019) jointly optimize diversity and relevance with variational auto-encoders, and Li et al. (2016) rerank beam search outputs based on Maximum Mutual Information (MMI).

Label Bias is usually associated with locally normalized models for structured prediction, such as Maximum Entropy Markov Models (MEMM) (McCallum et al., 2000). Label bias (Hannun, 2020) makes a model prefer states with fewer outgoing transitions and makes it difficult to correct past mistakes. Conditional random fields (CRF) (Lafferty et al., 2001; Koller and Friedman, 2009) eliminate label bias with global normalization. More generally, undirected graphical models (Koller and Friedman, 2009) do not suffer from label bias the way most directed graphical models do; however, computing the partition function can be difficult without strong conditional independence assumptions. Sequence-level training approximates the partition function with decoded hypotheses (Andor et al., 2016; Collobert et al., 2019) and has proven effective in neural machine translation (Edunov et al., 2018), part-of-speech tagging (Andor et al., 2016; Le et al., 2013), speech recognition (Collobert et al., 2019), and summarization (Wiseman and Rush, 2016). Deng et al. (2020) adopt noise contrastive estimation to train residual energy-based models for text generation. Yet little attention has been paid to the effect of label bias on seq2seq models in open-ended text generation scenarios.

6 Conclusion and Future Work

The degenerate behaviors of beam search in open-ended generation have long been recognized. This paper empirically investigates the effects of label bias on beam search in the response generation task. Likelihood distribution evaluation shows that beam search outputs are biased towards low-perplexity generic texts, and that this phenomenon is mostly attributable to model errors. Globally normalized sequence-level training helps reduce label bias, and using logits as scores is more effective than using log-probabilities. We also propose a simple ranking-based metric to measure label bias. Experiments show that beam search produces more diverse outputs with our proposed method. Due to the difficulty of estimating the partition function, more research effort is still needed to eliminate label bias.

For future work, we would like to investigate label bias in other open-ended generation tasks such as conditional language modeling and story generation. Another important direction is to explore more effective and efficient methods for globally normalized training.

References

  • D. D. F. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, and Q. V. Le (2020) Towards a human-like open-domain chatbot. ArXiv abs/2001.09977. Cited by: §1, §2.2, §4.1, §4.3.
  • D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins (2016) Globally normalized transition-based neural networks. ArXiv abs/1603.06042. Cited by: §1, §4.4, §5.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. Cited by: §5.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. ArXiv abs/1506.03099. Cited by: §5.
  • E. Cohen and J. C. Beck (2019) Empirical analysis of beam search performance degradation in neural sequence models. In ICML, Cited by: §5.
  • R. Collobert, A. Hannun, and G. Synnaeve (2019) A fully differentiable beam search decoder. In ICML, Cited by: §5.
  • Y. Deng, A. Bakhtin, M. Ott, and A. Szlam (2020) Residual energy-based models for text generation. In ICLR 2020, Cited by: §3.3, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv abs/1810.04805. Cited by: §4.1, §5.
  • E. Dinan, V. Logacheva, V. Malykh, A. H. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, S. Prabhumoye, A. W. Black, A. I. Rudnicky, J. Williams, J. Pineau, M. Burtsev, and J. Weston (2019) The second conversational intelligence challenge (convai2). ArXiv abs/1902.00098. Cited by: §1, §2, §4.1.
  • N. Dziri, E. Kamalloo, K. W. Mathewson, and O. R. Zaiane (2018) Augmenting neural response generation with context-aware topical attention. ArXiv abs/1811.01063. Cited by: §4.1, §4.3.
  • S. Edunov, M. Ott, M. Auli, D. Grangier, and M. Ranzato (2018) Classical structured prediction losses for sequence to sequence learning. ArXiv abs/1711.04956. Cited by: §1, §3.2, §4.4, §4.4, §5.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In ACL, Cited by: §5.
  • X. Gao, S. Lee, Y. Zhang, C. Brockett, M. Galley, J. Gao, and W. B. Dolan (2019) Jointly optimizing diversity and relevance in neural response generation. In NAACL-HLT, Cited by: §5.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. Dauphin (2017) Convolutional sequence to sequence learning. ArXiv abs/1705.03122. Cited by: §5.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In AISTATS, Cited by: §3.3.
  • A. Hannun (2020) The label bias problem. Cited by: §1, §5.
  • A. Holtzman, J. Buys, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. ArXiv abs/1904.09751. Cited by: §1, §5.
  • L. Huang, K. Zhao, and M. Ma (2017) When to finish? optimal beam search for neural text generation (modulo beam size). In EMNLP, Cited by: §5.
  • D. Koller and N. Friedman (2009) Probabilistic graphical models - principles and techniques. Cited by: §3.2, §5.
  • J. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, Cited by: §1, §4.4, §5.
  • H. P. Le, X. Phan, and T. Tran (2013) On the effect of the label bias problem in part-of-speech tagging. The 2013 RIVF International Conference on Computing and Communication Technologies - Research, Innovation, and Vision for Future (RIVF), pp. 103–108. Cited by: §5.
  • J. Li, M. Galley, C. Brockett, J. Gao, and W. B. Dolan (2016) A diversity-promoting objective function for neural conversation models. ArXiv abs/1510.03055. Cited by: §1, §4.1, §5.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How not to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, Cited by: §1, §4.2.
  • A. McCallum, D. Freitag, and F. C. Pereira (2000) Maximum entropy markov models for information extraction and segmentation. In ICML, Cited by: §5.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2001) Bleu: a method for automatic evaluation of machine translation. In ACL, Cited by: §4.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §1, §5.
  • A. Radford (2018) Improving language understanding by generative pre-training. Cited by: §5.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. ArXiv abs/1704.04368. Cited by: §5.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In ICML, Cited by: §5.
  • F. Stahlberg and B. Byrne (2019) On nmt search errors and model errors: cat got your tongue?. In EMNLP/IJCNLP, Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. ArXiv abs/1706.03762. Cited by: §5.
  • A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. H. Sun, S. Lee, D. J. Crandall, and D. Batra (2016) Diverse beam search: decoding diverse solutions from neural sequence models. ArXiv abs/1610.02424. Cited by: §3.3, §5.
  • L. Wang, W. Zhao, R. Jia, S. Li, and J. Liu (2019) Denoising based sequence-to-sequence pre-training for text generation. In EMNLP/IJCNLP, Cited by: §5.
  • S. Wiseman and A. M. Rush (2016) Sequence-to-sequence learning as beam-search optimization. In EMNLP, Cited by: §5.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. S. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. ArXiv abs/1609.08144. Cited by: §5, §5.
  • C. Xu, W. Wu, C. Tao, H. Hu, M. Schuerman, and Y. Wang (2019) Neural response generation with meta-words. In ACL, Cited by: §1, §5.
  • Y. Yang, L. Huang, and M. Ma (2018) Breaking the beam search curse: a study of (re-)scoring methods and stopping criteria for neural machine translation. In EMNLP, Cited by: §5.
  • W. Zhang, Y. Feng, F. Meng, D. You, and Q. Liu (2019) Bridging the gap between training and inference for neural machine translation. In ACL, Cited by: §5.
  • W. Zhao, L. Wang, K. Shen, R. Jia, and J. Liu (2019) Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In NAACL-HLT, Cited by: §5.