Neural text generation usually involves transforming some input into text output. In directed generation tasks Holtzman et al. (2019), including machine translation, summarization, and data-to-text, the output space is highly constrained by the given input. Beam search (unless explicitly specified, we use length normalization by default) is the de facto sequence decoding algorithm and provides good performance empirically. By contrast, in more challenging open-ended text generation scenarios, such as response generation and story generation, there exist many plausible outputs for a given input. The outputs of beam search are often generic, repetitive, and meaningless, so top-k sampling Radford et al. (2019) and nucleus sampling (also referred to as top-p sampling) Holtzman et al. (2019) are much more widely adopted.
Label bias Hannun (2020); Lafferty et al. (2001) refers to the phenomenon that locally normalized models for structured prediction often prefer output states with fewer outgoing transitions. From the perspective of information theory, such models favor output states whose next-state distributions have low conditional entropy. As shown in Figure 1, “Mr” can be followed by many plausible words; since the probability is locally normalized, each word can receive only a small proportion of the probability mass. In the extreme case where the state transitions are deterministic, the inputs are completely ignored. Label bias also makes it difficult to correct past mistakes given new observations Lafferty et al. (2001).
Seq2seq models trained with MLE factorize the probability of a sequence into a product of locally normalized probabilities. As a result, they also suffer from label bias. Nonetheless, it has been unclear whether label bias has any connection with the degenerate behaviors of beam search. Previous works Li et al. (2016); Xu et al. (2019) propose heuristic methods to mitigate this issue. In this paper, we evaluate the likelihood distribution of human texts and generated texts, and show that beam search outputs are heavily biased towards the low-perplexity region.
To reduce label bias, one can replace local normalization with globally normalized training, also called sequence-level training in the seq2seq literature. It has been successfully applied to part-of-speech tagging Andor et al. (2016), machine translation Edunov et al. (2018), etc. Existing methods use average log-probabilities as the unnormalized score; these are still built on local probabilities and often severely hurt perplexity on held-out datasets. In this paper, we use unnormalized logits instead. By combining a token-level likelihood loss and a sequence-level loss, the logits can be calibrated while keeping the local probabilities unchanged.
Evaluating open-ended text generation systems is non-trivial Liu et al. (2016). To verify the effectiveness of the proposed method, it is important to be able to measure label bias. We propose a heuristic ranking-based metric: first, a set of low-perplexity texts is selected with a pre-trained language model; then these distractors and the groundtruth are ranked by the predicted model scores. A model with less label bias should rank the groundtruth above the context-agnostic distractors.
We conduct experiments on the large-scale response generation dataset ConvAI2 Dinan et al. (2019). In terms of automatic evaluation metrics, our method produces significantly more diverse texts than standard token-level MLE training and has less negative impact on perplexity. Human evaluation gives higher specificity and sensibleness scores Adiwardana et al. (2020) to our model’s outputs. Yet there is also evidence that the label bias issue is far from solved. We discuss several limitations of our work and suggest future directions to better understand beam search in open-ended text generation.
2 Likelihood Distribution Evaluation
To better understand the degenerate behaviors of beam search, we analyze the perplexity distributions of both human texts and generated texts. Our analysis is based on the validation set of ConvAI2 Dinan et al. (2019). An off-the-shelf GPT2-117M model (https://github.com/openai/gpt-2) is used to evaluate unconditional language model perplexity. We train a Transformer-based seq2seq model with standard MLE to evaluate the perplexity of responses. Please refer to Section 4.1 for more details about the dataset and our model.
2.1 Beam Search Outputs are Heavily Biased
Standard seq2seq models are trained under the principle of maximum likelihood: maximize the probability of human texts given input contexts. Naturally, one would expect that the texts generated by a well-trained model should share similar characteristics with human texts.
In Figure 2, we show the perplexity distribution under GPT2-117M. The texts generated by beam search are heavily biased towards the low-perplexity region. In contrast, the distribution of human texts is flat and has a long tail in the high-perplexity region. Perplexity can be intuitively interpreted as the expected number of plausible next tokens. Lower perplexity means beam search favors output states with fewer outgoing transitions, which is a typical symptom of label bias.
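This interpretation of perplexity can be made concrete with a small sketch (illustrative only; the `perplexity` helper and the toy log-probabilities below are not from the paper's implementation):

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean token log-probability: intuitively,
    the effective number of equally plausible next tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A "generic" continuation: each token is one of ~2 plausible choices.
generic = [math.log(0.5)] * 8
# A "human-like" continuation: each token is one of ~20 plausible choices.
specific = [math.log(0.05)] * 8
```

Here `perplexity(generic)` is 2.0 and `perplexity(specific)` is 20.0, matching the intuition that outputs concentrated in the low-perplexity region traverse states with few outgoing transitions.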
2.2 Search Errors or Model Errors?
There are two types of errors in seq2seq models with a beam search decoder: search errors and model errors. In a search error, the trained model assigns a higher score to the groundtruth text, but beam search fails to find it. In a model error, the beam search output indeed has a higher score under the model distribution. With MLE training, we use the length-normalized log-probability as the score.
In Figure 3, the average perplexity of generated texts is extremely low. We conclude empirically that the degenerate behaviors of beam search are mainly attributable to model errors rather than search errors; designing better search algorithms to find hypotheses closer to the global optimum is unlikely to help. The state-of-the-art conversational model Meena Adiwardana et al. (2020) uses sample-then-rerank as its decoding algorithm: it first samples several candidates from the model distribution and then reranks them by perplexity. Our findings imply that sample-then-rerank may also risk producing degenerate outputs.
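The distinction between the two error types can be sketched numerically (the `error_type` helper and the toy token probabilities are hypothetical illustrations, not the paper's code):

```python
import math

def avg_logprob(token_logprobs):
    """Length-normalized log-probability, the MLE model score."""
    return sum(token_logprobs) / len(token_logprobs)

def error_type(gt_logprobs, beam_logprobs):
    """Search error: the model prefers the groundtruth, but beam search
    failed to find it. Model error: the beam output genuinely outscores
    the groundtruth under the model distribution."""
    if avg_logprob(gt_logprobs) > avg_logprob(beam_logprobs):
        return "search error"
    return "model error"

# Toy token probabilities: the generic beam output is uniformly "safe",
# so it outscores the groundtruth under the model itself.
gt = [math.log(p) for p in (0.6, 0.05, 0.3, 0.2)]
beam = [math.log(p) for p in (0.5, 0.4, 0.5, 0.45)]
```

In this toy case `error_type(gt, beam)` returns `"model error"`: no better search procedure would recover the groundtruth.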
3.1 Token-Level Training and Inference
Given an input $x$ and a target output $y$, seq2seq models aim to maximize the conditional probability $p(y|x;\theta)$, where $\theta$ denotes the model parameters. Standard token-level MLE training with teacher forcing factorizes this objective in an auto-regressive way:

$p(y|x;\theta) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x; \theta)$  (1)
We omit $\theta$ to make the equations less cluttered. Thus, the token-level cross-entropy loss for an input-output pair can be defined as follows:

$\mathcal{L}_{token} = -\sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x)$  (2)
At the inference stage, given an input $x$, the decoding algorithm attempts to find the output $\hat{y}$ with the highest average log-probability score:

$\hat{y} = \arg\max_{y} \frac{1}{|y|} \sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x)$  (3)
Since the search space grows exponentially with the output length, heuristic algorithms like beam search are often used. Beam search with beam size $k$ keeps the $k$ best hypotheses at each time step. The search procedure stops when $k$ complete hypotheses are available and it is impossible to obtain better hypotheses by further expanding the beams.
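A minimal sketch of length-normalized beam search over a toy locally normalized model (all names are illustrative; a real implementation also handles end-of-sequence tokens and the stopping criterion above):

```python
import math

def beam_search(next_probs, beam_size, length):
    """Minimal length-normalized beam search.
    `next_probs(prefix)` returns {token: probability} for the next step."""
    beams = [((), 0.0)]  # (token tuple, sum of token log-probs)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, p in next_probs(prefix).items():
                candidates.append((prefix + (tok,), score + math.log(p)))
        # keep the k best hypotheses, ranked by length-normalized score
        candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

def toy_model(prefix):
    # One "generic" token holds 0.4 mass; ten "specific" tokens share 0.6.
    dist = {"generic": 0.4}
    for i in range(10):
        dist[f"specific_{i}"] = 0.06
    return dist
```

With this toy model, `beam_search(toy_model, 2, 3)` returns the all-generic sequence: the low-entropy branch wins at every step.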
3.2 Sequence-Level Training
Token-level training suffers from label bias, since the next-token probability is locally normalized over the vocabulary $V$: $\sum_{w \in V} p(w \mid y_{<t}, x) = 1$. For a specific timestep, assume there are $n$ equally plausible tokens; due to the constraint of local normalization, each token receives probability mass $1/n$. Outputs with smaller $n$ have lower-entropy next-token distributions, and will receive higher scores. In open-ended text generation, $n$ often varies over a large range. Thus, beam search favors generic texts whose $n$ is smaller than that of human texts.
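The $1/n$ argument can be checked numerically (an illustrative sketch; the branching factors are made up):

```python
import math

def avg_logprob_for_branching(ns):
    """Average log-probability of a text whose t-th token is one of
    ns[t] equally plausible tokens, each receiving mass 1/ns[t]."""
    return sum(math.log(1.0 / n) for n in ns) / len(ns)

# A generic text keeps n small at every step;
# a human text lets n vary over a large range.
generic_text = avg_logprob_for_branching([2, 2, 2, 2])
human_text = avg_logprob_for_branching([2, 50, 10, 30])
```

The generic text receives the higher score even though both are "equally plausible" step by step, which is exactly the bias described above.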
Sequence-level training explicitly maximizes the global score of the groundtruth $y$. There are several possible formulations, such as empirical risk minimization and margin-based sequence losses Edunov et al. (2018). Most formulations require an automatic metric such as BLEU to score a hypothesis, which is not available in open-ended generation scenarios. In this paper, we cast sequence-level training as multi-class classification: given a score function $s(x, y)$, the sequence-level cross-entropy loss is defined as:

$\mathcal{L}_{seq} = -\log \frac{\exp(s(x, y))}{\sum_{y'} \exp(s(x, y'))}$  (4)
In Equation 4, the denominator, often called the “partition function” in the literature on Markov random fields Koller and Friedman (2009), is a sum over all possible output sequences. The score function is learned by the seq2seq model. One common choice is the average log-probability shown in Equation 3. However, this score function still builds upon the locally normalized probabilities and often results in much worse perplexity.
Assume $z_{t,w}$ is the unnormalized logit for token $w$ at timestep $t$. We propose to use the averaged logits as scores:

$s(x, y) = \frac{1}{|y|} \sum_{t=1}^{|y|} z_{t, y_t}$  (5)
One advantage is that $s(x, y)$ can vary within a larger range than the log-probability. Also, for any real number $c$,

$\mathrm{softmax}(z_t) = \mathrm{softmax}(z_t + c)$  (6)

which indicates that it is possible to calibrate the logits flexibly through learning, without affecting the local probabilities $p(y_t \mid y_{<t}, x)$.
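The shift invariance of the softmax, together with the fact that the logit-based score is not invariant, can be verified numerically (illustrative sketch; the logit values are made up):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
shifted = [z + 3.0 for z in logits]  # add the same constant c = 3 everywhere

# The local probabilities are unchanged by the shift...
probs_equal = all(abs(a - b) < 1e-12
                  for a, b in zip(softmax(logits), softmax(shifted)))

# ...but the per-step contribution to the logit-based score moves by
# exactly c, so the sequence score can be calibrated freely.
score_before = sum(logits) / len(logits)
score_after = sum(shifted) / len(shifted)
```

This is the degree of freedom the sequence-level loss exploits: it can push logits up or down globally without disturbing the token-level cross entropy.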
The final loss function is a linear combination of the token-level loss $\mathcal{L}_{token}$ and the sequence-level loss $\mathcal{L}_{seq}$: $\mathcal{L} = \lambda_1 \mathcal{L}_{token} + \lambda_2 \mathcal{L}_{seq}$, where the weights $\lambda_1$ and $\lambda_2$ are set empirically.
3.3 Partition Function Estimation
Computing the partition function in Equation 4 requires summing over all possible output sequences, which is practically infeasible. Given a model, we can first apply a decoding algorithm to obtain $k$ high-scoring hypotheses $\{\hat{y}^{(1)}, \dots, \hat{y}^{(k)}\}$, and then approximate the partition function with the groundtruth and these hypotheses:

$\sum_{y'} \exp(s(x, y')) \approx \exp(s(x, y)) + \sum_{i=1}^{k} \exp(s(x, \hat{y}^{(i)}))$
This is equivalent to a $(k+1)$-class classification problem. We explore two decoding algorithms to obtain the hypothesis set: standard beam search and diverse beam search Vijayakumar et al. (2016). Standard beam search is widely used but often produces highly similar hypotheses; diverse beam search explores the search space more effectively and produces more diverse hypotheses.
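The resulting $(k+1)$-class objective can be sketched as follows (a hypothetical, numerically stable helper operating on precomputed sequence scores, not the actual training code):

```python
import math

def sequence_ce_loss(gt_score, hypothesis_scores):
    """Sequence-level cross entropy with the partition function
    approximated by the groundtruth plus k decoded hypotheses,
    i.e. a (k+1)-class classification (log-sum-exp for stability)."""
    scores = [gt_score] + list(hypothesis_scores)
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - gt_score  # equals -log softmax(groundtruth)
```

The loss shrinks as the groundtruth score grows relative to the decoded hypotheses, which is exactly the pressure that pushes probability mass away from generic outputs.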
The aforementioned estimation of the partition function is biased, in the sense that it is a strict lower bound of the actual value, and likely a very loose one. Noise contrastive estimation (NCE) Gutmann and Hyvärinen (2010); Deng et al. (2020) provides an unbiased gradient estimator, but its variance is expected to be high. We leave further investigation of NCE as future work.
3.4 Quantifying Label Bias
In this section, we present a simple ranking-based automatic evaluation metric. Intuitively, a model with less label bias should give a higher score to the groundtruth text and lower scores to generic texts; a piece of text is “generic” if it has low perplexity. We evaluate the perplexity of all sentences from the ConvAI2 training set with GPT2 and use the lowest-perplexity sentences as distractors. Some examples are shown in Table 1.
| Distractor | GPT2 perplexity |
|---|---|
| What do you want to be when you grow up? | 9.94 |
| Can you tell me a little about yourself? | 11.98 |
| What do you do for a living? | 12.22 |
| I do not know what to say to you. | 12.24 |
| What do you do in your spare time? | 13.95 |
Given a trained model, the groundtruth together with these distractors is ranked by the model-predicted scores in descending order, and we use the mean rank of the groundtruth as a measure of label bias. Since the distractors are selected without any prior knowledge of the input context, they are unlikely to be appropriate outputs. In the following sections, experiments show that seq2seq models completely fail at this simple ranking task.
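The ranking metric can be sketched as follows (`mean_rank` and the toy score function below are hypothetical illustrations; in the paper the score is the model's sequence score):

```python
def mean_rank(score_fn, groundtruth, distractors):
    """Rank of the groundtruth among the distractors when all candidates
    are sorted by score in descending order (rank 1 is best). The reported
    metric averages this rank over the evaluation set."""
    candidates = [groundtruth] + distractors
    candidates.sort(key=score_fn, reverse=True)
    return 1 + candidates.index(groundtruth)
```

For instance, with string length as a stand-in score, `mean_rank(len, "abcd", ["ab", "abcdef"])` is 2: one distractor outscores the groundtruth.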
| Method | Decoding | ppl | BLEU-4 | distinct-1 | distinct-2 | distinct-3 |
|---|---|---|---|---|---|---|
| LogProb Avg | diverse BS | 21.06 | 3.16 | 1.87 | 11.03 | 23.37 |
| Logits Avg | diverse BS | 20.16 | 2.43 | 1.91 | 12.48 | 27.98 |
Datasets We use the Reddit dataset Dziri et al. (2018) (available at https://github.com/nouhadziri/THRED) to pre-train our models. It consists of more than a million dialogue turns, but it is noisy and contains some offensive language. The ConvAI2 dataset Dinan et al. (2019) (http://convai.io/) is of high quality and is used for fine-tuning. To make the task more open-ended, we discard the persona information in ConvAI2. The official train/validation split is adopted.
Model Configuration Our seq2seq network uses bert-base-uncased Devlin et al. (2019) as the encoder; the decoder consists of several layers of Transformer blocks. We tie the parameters of the encoder word embeddings, the decoder word embeddings, and the output softmax layer. The Adam optimizer is used, and learning rates are linearly warmed up at the beginning of training. The vocabulary is the same as BERT's. Dropout is applied to the self-attention layers, feed-forward layers, and input embedding layers, and gradients are clipped by their L2 norm. The same beam size is used both for hypothesis generation in sequence-level training and for model inference, and diverse beam search partitions the beam into several groups. The input context is the concatenation of the most recent dialogue turns. When pre-training on Reddit, all parameters are updated; when fine-tuning on ConvAI2, only decoder parameters are updated to reduce over-fitting. Our implementation is based on fairseq (https://github.com/pytorch/fairseq).
Evaluation We use both automatic metrics and human evaluation to get a comprehensive view. Automatic metrics include perplexity (ppl), BLEU-4 Papineni et al. (2001), and distinct-n (n = 1, 2, 3) Li et al. (2016). Distinct-n is a measure of diversity: it counts the number of distinct n-grams, normalized by the total number of n-grams. The mean rank among pre-defined distractors is also reported, as described in Section 3.4. For human evaluation, we use the Specificity and Sensibleness Average (SSA) proposed for Meena Adiwardana et al. (2020). Two annotators score each response for randomly chosen dialogue turns from the validation set on a three-point scale (bad / ok / good). The SSA score is the arithmetic mean of the specificity and sensibleness scores; it measures the quality of generated texts along two complementary dimensions.
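Distinct-n as just described can be sketched as follows (an illustrative helper, assuming already-tokenized responses):

```python
def distinct_n(texts, n):
    """distinct-n: number of unique n-grams divided by the total number
    of n-grams, computed over a corpus of tokenized responses."""
    total, unique = 0, set()
    for tokens in texts:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

A corpus that repeats the same generic n-grams across responses scores low, so higher distinct-n indicates more diverse outputs.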
| Speaker | Utterance |
|---|---|
| Input | A: Hello what are you doing today? |
| MLE | I’m doing well. How are you? |
| Logits Avg | Hi! I’m doing some shopping and enjoying a good steak! you? |
| Human | I am good, I just got off work and tired. I have two jobs. |
| MLE | What do you do for a living? |
| Logits Avg | They’re so cute! Do you have a favorite band? |
| Human | What are your kitties names? |
| MLE | What do you do for a living? |
| Logits Avg | What do you do for a living? |
| Human | I write poetry and then make them into a song. |
| MLE | Do you speak any other languages? |
| Logits Avg | What language do you speak? |
| Human | Those happen to all be languages I speak. I want to visit France sometime. |
Table 2 shows the main results for automatic evaluation. The token-level cross-entropy loss used by “MLE” directly optimizes perplexity; not surprisingly, “MLE” achieves the lowest perplexity. “MLE” also has a higher BLEU score than the sequence-level training methods. One possible explanation is that there are many more ways to be specific than to be generic: producing a generic output is more likely to match the groundtruth and thus obtain a higher BLEU score. Though BLEU is widely used to evaluate machine translation systems, previous work Liu et al. (2016) suggests that BLEU has only a weak correlation with human evaluation results for response generation.
Distinct-n measures the diversity of generated texts. Based on distinct-n (n = 1, 2, 3) in Table 2, both sequence-level training methods “LogProb Avg” and “Logits Avg” produce significantly more diverse results than the “MLE” baseline. In terms of decoding algorithms, diverse beam search shows consistent improvements across nearly all automatic metrics.
“LogProb Avg” uses the average log-probability as the score, so the token-level and sequence-level losses may compete for the same probability mass, and perplexity increases. “Logits Avg” calibrates the logits while keeping the local probabilities relatively unchanged, as shown in Equation 6, so its perplexity increases only slightly.
In Table 4, “MLE” fails miserably at discriminating the groundtruth from the pre-defined distractors. “Logits Avg” achieves the best mean rank among the three methods. However, a naive baseline that randomly shuffles all candidates would achieve an expected mean rank in the middle of the candidate list, which is still far better than our best model. This is evidence that our proposed method only reduces label bias to some degree rather than eliminating it.
We conduct a human evaluation on randomly chosen dialogue turns from the validation dataset; results are shown in Table 5. Models tend to generate sensible but not very specific texts: the sensibleness scores of all models are higher than the corresponding specificity scores, while human texts are much more specific. “MLE” produces generic texts with a very low specificity score. Both “LogProb Avg” and “Logits Avg” improve over the “MLE” baseline in SSA, showing that sequence-level training can indeed lead to more specific and sensible outputs, and that using unnormalized logits as the score is more effective than using log-probabilities. Moreover, sequence-level training has a larger relative impact on specificity than on sensibleness.
Some typical examples are given in Table 3. The first two examples show that “MLE” often generates generic texts such as “I’m doing well” and “What do you do for a living?”; many previous works report similar findings Dziri et al. (2018); Adiwardana et al. (2020). Our proposed method “Logits Avg” can generate meaningful and specific phrases such as “enjoying a good steak” and “favorite band”. Table 3 also illustrates why BLEU may not be a good metric for evaluating open-ended text generation systems: though the outputs of “Logits Avg” are of high quality, they share few words with the groundtruth.
The last two examples in Table 3 show some remaining limitations and difficulties of open-ended generation. In the third example, both “MLE” and “Logits Avg” produce the same generic response, further evidence that sequence-level training does not completely solve the label bias problem in seq2seq networks. Beam search is not guaranteed to find the optimal output sequence, but this may even be desirable for promoting response diversity. In the fourth example, “Logits Avg” asks a question that has already been answered in previous dialogue turns; generating semantically consistent responses remains an open problem.
In Figure 4, we additionally show the perplexity distribution of texts from two sequence-level training models “LogProb Avg” and “Logits Avg”. The distributions of both models are flatter than “MLE” baseline, and the peaks move to the right. The perplexity distribution of “Logits Avg” is slightly closer to humans than “LogProb Avg”. Though our proposed methods are less biased, they still prefer low-perplexity texts compared to humans.
Label bias arises when different output states have very different numbers of outgoing transitions. In directed generation tasks such as machine translation and abstractive summarization, there is a nearly one-to-one mapping between the input and the output; the transitions between output states are almost deterministic, so label bias exists but is not a serious issue. Previous work Andor et al. (2016); Edunov et al. (2018) observes some moderate improvements with globally normalized training. It remains to be seen how state-of-the-art text generation models based on BERT and GPT are affected by label bias.
In linear-chain CRFs Lafferty et al. (2001), the partition function can be computed exactly and efficiently with the forward algorithm based on dynamic programming. However, in seq2seq networks, the outputs at different timesteps are coupled with each other and do not fit into this dynamic programming framework. In this paper, we use beam search results to estimate the partition function. This inaccuracy may be one major reason why our proposed model still favors generic texts to a large degree.
In token-level MLE training, each update requires one forward pass and one backward pass. Sequence-level training additionally requires a decoding step. Auto-regressive decoding is a sequential process and therefore quite slow; it prevents us from fully exploiting the computational power of modern GPUs and the inherent parallelizability of Transformers. A common practice Edunov et al. (2018) is to first pre-train the network with token-level MLE, and then fine-tune with the sequence-level loss.
5 Related Work
Neural Text Generation with seq2seq models has been a popular paradigm for many generation tasks in recent years, such as neural machine translation Wu et al. (2016), abstractive summarization See et al. (2017), and grammatical error correction Zhao et al. (2019). Most existing models use token-level maximum likelihood estimation as the optimization objective and beam search as the decoding algorithm. Backbone architectures include LSTMs, CNNs Gehring et al. (2017), and Transformers Vaswani et al. (2017). Since Transformers are highly parallelizable and can model long-term dependencies, they have become a core component of many state-of-the-art models Radford (2018). Exposure bias Bengio et al. (2015); Zhang et al. (2019) is widely studied in seq2seq models trained with teacher forcing. With the emergence of powerful pre-trained models like BERT Devlin et al. (2019) and GPT-2 Radford et al. (2019), there is growing interest in improving text generation with language model pre-training Song et al. (2019); Wang et al. (2019).
Beam Search with length normalization is a widely used heuristic sequence decoding algorithm for many structured prediction models Wu et al. (2016); Bahdanau et al. (2015). It has several known deficiencies, including length bias Yang et al. (2018); Huang et al. (2017), lack of diversity within beams Vijayakumar et al. (2016), and performance degradation with larger beams Cohen and Beck (2019); Stahlberg and Byrne (2019). In open-ended text generation tasks such as story generation Fan et al. (2018) and conditional language modeling Holtzman et al. (2019), standard beam search often produces degenerate outputs and is therefore rarely used. In sampling-based decoding algorithms, tricks like adjusting the temperature and explicitly blocking duplicate n-grams work well Fan et al. (2018). Some heuristic methods have been proposed to promote the diversity of beam search outputs: Xu et al. incorporate additional meta-words into the context, Gao et al. jointly optimize diversity and relevance with variational auto-encoders, and Li et al. rerank beam search outputs based on Maximum Mutual Information (MMI).
Label Bias is usually associated with locally normalized models for structured prediction, such as Maximum Entropy Markov Models (MEMM) McCallum et al. (2000). Label bias Hannun (2020) makes the model prefer states with fewer outgoing transitions and makes it difficult to correct past mistakes. Conditional random fields (CRF) Lafferty et al. (2001); Koller and Friedman (2009) eliminate label bias with global normalization. More generally, undirected graphical models Koller and Friedman (2009) do not suffer from label bias the way most directed graphical models do; however, computing the partition function can be difficult without strong conditional independence assumptions. Sequence-level training approximates the partition function with decoded hypotheses Andor et al. (2016); Collobert et al. (2019), and has proven effective in neural machine translation Edunov et al. (2018), part-of-speech tagging Andor et al. (2016); Le et al. (2013), speech recognition Collobert et al. (2019), and summarization Wiseman and Rush (2016). Deng et al. adopt noise contrastive estimation to train residual energy models for text generation. Yet little attention has been paid to the effect of label bias on seq2seq models in open-ended text generation scenarios.
6 Conclusion and Future Work
The degenerate behaviors of beam search in open-ended generation have long been recognized. This paper empirically investigates the effects of label bias on beam search through the response generation task. Likelihood distribution evaluation shows that beam search outputs are biased towards low-perplexity generic texts, and that this phenomenon is mostly attributable to model errors. Globally normalized sequence-level training helps reduce label bias, and using logits as scores is more effective than using log-probabilities. We also propose a simple ranking-based metric to measure label bias. Experiments show that beam search produces more diverse outputs with our proposed method. Due to the difficulty of estimating the partition function, more research effort is still needed to eliminate label bias.
For future work, we would like to investigate label bias in other open-ended generation tasks such as conditional language modeling and story generation. Another important research direction is to explore more effective and efficient methods for globally normalized training.
References

- Towards a human-like open-domain chatbot. ArXiv abs/2001.09977.
- Globally normalized transition-based neural networks. ArXiv abs/1603.06042.
- Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- Scheduled sampling for sequence prediction with recurrent neural networks. ArXiv abs/1506.03099.
- Empirical analysis of beam search performance degradation in neural sequence models. In ICML.
- A fully differentiable beam search decoder. In ICML.
- Residual energy-based models for text generation. In ICLR 2020.
- BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv abs/1810.04805.
- The second conversational intelligence challenge (ConvAI2). ArXiv abs/1902.00098.
- Augmenting neural response generation with context-aware topical attention. ArXiv abs/1811.01063.
- Classical structured prediction losses for sequence to sequence learning. ArXiv abs/1711.04956.
- Hierarchical neural story generation. In ACL.
- Jointly optimizing diversity and relevance in neural response generation. In NAACL-HLT.
- Convolutional sequence to sequence learning. ArXiv abs/1705.03122.
- Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In AISTATS.
- The label bias problem.
- The curious case of neural text degeneration. ArXiv abs/1904.09751.
- When to finish? Optimal beam search for neural text generation (modulo beam size). In EMNLP.
- Probabilistic graphical models: principles and techniques.
- Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning.
- On the effect of the label bias problem in part-of-speech tagging. In RIVF 2013, pp. 103–108.
- A diversity-promoting objective function for neural conversation models. ArXiv abs/1510.03055.
- How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP.
- Maximum entropy Markov models for information extraction and segmentation. In ICML.
- BLEU: a method for automatic evaluation of machine translation. In ACL.
- Language models are unsupervised multitask learners.
- Improving language understanding by generative pre-training.
- Get to the point: summarization with pointer-generator networks. ArXiv abs/1704.04368.
- MASS: masked sequence to sequence pre-training for language generation. In ICML.
- On NMT search errors and model errors: cat got your tongue? In EMNLP-IJCNLP.
- Attention is all you need. ArXiv abs/1706.03762.
- Diverse beam search: decoding diverse solutions from neural sequence models. ArXiv abs/1610.02424.
- Denoising based sequence-to-sequence pre-training for text generation. In EMNLP-IJCNLP.
- Sequence-to-sequence learning as beam-search optimization. In EMNLP.
- Google's neural machine translation system: bridging the gap between human and machine translation. ArXiv abs/1609.08144.
- Neural response generation with meta-words. In ACL.
- Breaking the beam search curse: a study of (re-)scoring methods and stopping criteria for neural machine translation. In EMNLP.
- Bridging the gap between training and inference for neural machine translation. In ACL.
- Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In NAACL-HLT.