
Improved Natural Language Generation via Loss Truncation

04/30/2020
by Daniel Kang, et al.

Neural language models are usually trained to match the distributional properties of a large-scale corpus by minimizing the log loss. While straightforward to optimize, this approach forces the model to reproduce all variations in the dataset, including noisy and invalid references (e.g., misannotation and hallucinated facts). Worse, the commonly used log loss is overly sensitive to such phenomena and even a small fraction of noisy data can degrade performance. In this work, we show that the distinguishability of the models and reference serves as a principled and robust alternative for handling invalid references. To optimize distinguishability, we propose loss truncation, which adaptively removes high loss examples during training. We show this is as easy to optimize as log loss and tightly bounds distinguishability under noise. Empirically, we demonstrate that loss truncation outperforms existing baselines on distinguishability on a summarization task, and show that samples generated by the loss truncation model have factual accuracy ratings that exceed those of baselines and match human references.


1 Introduction

Learning to generate text is a core part of many NLP tasks, including summarization nallapati2016abstractive, image captioning lin2014microsoft, and story generation roemmele2016writing. A common challenge to all these tasks is that references from the training distribution are not unique and contain substantial variations in phrasing and content wiseman2017challenges; dhingra2019handling. Learning to generate under a set of diverse and noisy references is challenging as some variations ought to be learned (e.g., paraphrasing) while others should not (e.g., hallucinated facts, ignoring prompts).

Existing training procedures for models seek to match the underlying distribution, leading to models that replicate and sometimes even amplify unwanted behaviors such as hallucination during generation. For example, neural language models often produce fluent text that is unfaithful to the source tian2019sticking; wiseman2017challenges; lee2018hallucinations. Existing work fan2018hierarchical; holtzman2019curious has primarily addressed these issues by constructing decoders that implicitly remove unwanted variation when generating (see §6 for a detailed discussion of task-specific losses).

In this work, we argue that this phenomenon is not model specific, but is due to the widely-used log loss: we demonstrate that log loss is not robust to noisy and invalid references (§2). In particular, log loss requires that models assign probabilities to all potential test reference sequences. As a result, log loss is sensitive to outliers: invalid or noisy references with small probability mass can cause large changes in model behavior. We show that the brittleness of log loss, together with the noise in existing generation datasets, leads to low-quality and unfaithful generated text.

Instead of optimizing log loss, which has little correlation with model output quality theis2016note; hashimoto2019unifying; gamon2005sentence, recent work on diverse generation models has proposed optimizing for the distinguishability of samples from the model and the reference. Distinguishability provides a natural and appealing guarantee: samples that are indistinguishable from human generated text will be as high quality as human generated text. Furthermore, we show that optimizing for distinguishability is robust in the face of noisy and even invalid data. Despite its appeal, distinguishability has not been widely used due to statistical and computational challenges. For example, existing methods that directly optimize for distinguishability have yet to match even naive log loss based baselines caccia2018language.

We propose a modification to the log loss, loss truncation, that has the benefits of distinguishability while being efficient to train. Loss truncation is as efficient to train as log loss, nearly as robust as distinguishability, and provides distinguishability guarantees via an upper bound. It achieves these properties by modifying the standard log loss to adaptively remove examples with high log loss. We additionally extend loss truncation with a sequence-level rejection sampling scheme that generates higher quality sequences by restricting the outputs to be high probability sequences.

We show that loss truncation with direct and rejection sampling outperforms standard log loss based generation methods (beam search, full sampling, top-$k$ sampling, and top-$p$ sampling) on distinguishability, as measured by the HUSE score hashimoto2019unifying. We additionally study the factual accuracy of a summarization system trained with loss truncation and show that our proposed approach produces summaries that improve upon all baselines (including beam searched models) and match references on factual accuracy.

2 Motivation and Problem Statement

Task and Background. We consider a natural language generation task with a conditional language model, where we are given a context $x$ drawn from $p(x)$ and our probabilistic model $p_\theta(y \mid x)$ produces an output $y$ by approximating a (usually human) reference distribution $p_{\text{ref}}(y \mid x)$.

In order to achieve this, many existing models are trained to minimize the Kullback-Leibler (KL) divergence,

$$\mathrm{KL}(p_{\text{ref}} \,\|\, p_\theta) = \underbrace{\mathbb{E}_{x, y \sim p_{\text{ref}}}\big[-\log p_\theta(y \mid x)\big]}_{\text{log loss}} - \underbrace{\mathbb{E}_{x, y \sim p_{\text{ref}}}\big[-\log p_{\text{ref}}(y \mid x)\big]}_{\text{entropy}} \qquad (1)$$

We refer to the first term of this divergence as the log loss of a model. The second term is commonly ignored as it is a constant with respect to the model. Minimizing the log loss has several practical benefits: 1) it is written as an expected loss (and is thus straightforward to optimize via stochastic gradient descent), 2) it factorizes across tokens in autoregressive modeling, and 3) it provides a guarantee on a model's goodness of fit (Eq (1)).
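To make benefit 2 concrete, the per-sequence log loss decomposes into a sum of per-token losses under the standard autoregressive factorization:

$$-\log p_\theta(y \mid x) = -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid y_{<t}, x)$$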

Unfortunately, log loss also suffers from several drawbacks. It is known to have little correlation with a model’s sample quality and it can be brittle to invalid references in the training data.

Log loss is not robust to noise. The KL divergence has intuitively correct behavior when each input $x$ has a single correct reference $y$: it will maximize the probability of that single correct reference. However, log loss can be problematic when there are multiple correct references, of which some are invalid or difficult to model.

Figure 1: Fitting a mixture of Gaussians with a single Gaussian using distinguishability (TV) and log loss (KL). As shown, log loss is extremely sensitive to outliers, resulting in poor estimation.

In particular, log loss is sensitive to invalid or noisy data because it requires that the model assign high probabilities to all potential references. Log loss is unbounded above: a model assigning zero probability to even a single reference incurs an infinite overall loss.

We show a well-known example of this behavior with synthetic data. Consider fitting a single Gaussian to a mixture of two Gaussians in Figure 1. The reference distribution (blue) has a valid set of references at zero as well as variation that the model does not expect (e.g., invalid or noisy references) on the right. Minimizing the log loss results in a suboptimal model that is forced to span both groups. Furthermore, post-hoc processing of the model does not help, as even the most likely output under the log loss trained model (≈3) has low probability under the reference distribution.
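The following minimal numpy sketch (our illustration, not the paper's code; the mixture weights and locations are arbitrary) reproduces the qualitative behavior in Figure 1: the log loss (MLE) fit is dragged toward the outliers, while refitting after dropping the highest-loss fraction recovers the valid component.

import numpy as np

rng = np.random.default_rng(0)

# Reference distribution: 90% "valid" references near 0,
# 10% "invalid" references near 10.
data = np.concatenate([rng.normal(0.0, 1.0, 900),
                       rng.normal(10.0, 1.0, 100)])

# Log loss (KL) fit of a fixed-variance Gaussian: the MLE of the mean
# is the sample mean, which lands between the two groups.
mle_mean = data.mean()                   # roughly 1.0

# Truncated fit: drop the 10% of examples with the highest log loss
# under the current fit, then refit on the remainder.
losses = 0.5 * (data - mle_mean) ** 2    # per-example NLL up to constants
keep = losses <= np.quantile(losses, 0.9)
truncated_mean = data[keep].mean()       # roughly 0.0

print(mle_mean, truncated_mean)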

In natural language generation, training sets can contain invalid or poor quality references. As such, these types of problems manifest themselves in tasks such as summarization (hallucinating facts), story generation (ignoring prompts and constraints), and captioning (ignoring parts of the image).

Much of the existing literature on faithful generation has focused on designing better models for valid references (via copying or attention constraints), but the example in Figure 1 shows that this alone may not be sufficient. The Gaussian 'model' in this case perfectly fits the mixture component at zero but is still brittle because it cannot simultaneously fit the other group of (invalid) samples. Resolving this will require either a model which is designed explicitly to capture invalid references or a loss function that can ignore them.

Case Study: Hallucination in Summarization

We show that low-probability reference sequences (e.g., Figure 1) are pervasive by examining the Gigaword summarization dataset rush2017neural. We manually classified 300 titles into two categories: 1) requires hallucinating new facts and 2) directly entailed from the context. We show an example of a reference that requires hallucination in Figure 2. In this example, a model that assigns high probability to the new fact (Thursday) must also frequently hallucinate dates on other examples.

Context: For the first time in five years, Microsoft corp. is finally unveiling a new system for operating personal computers.
Title: Microsoft Makes Long-Awaited Software Upgrade Available to Businesses Thursday.

Figure 2: Example of an article title from the Gigaword dataset that requires hallucinating new facts such as ‘Thursday’ (colored red).

We show the fraction of examples in each category in Table 1. As shown, 35% of titles require hallucinating new facts. Others have found this phenomenon to be pervasive in other datasets kryscinski2019neural, including the CNN/DM dataset see2017get.

Studying the log loss of these examples (computed from a standard language model; see §5 for details), we note that the average log loss of titles that require new facts is over 1.7× the average loss of titles that are directly entailed (Table 1), and the high-loss examples are clearly dominated by examples which require hallucination (Figure 3). In fact, we find that over 80% of examples with log loss greater than 40 require some form of hallucination.

These statistics are similar to the toy example we presented earlier in Figure 1. A small but nontrivial fraction of invalid and unexpected data force the model to incur high losses. Much like in the earlier example, we can see that a model which aims to have low log loss on this dataset must spend a substantial amount of effort learning to hallucinate.

                New facts   Directly entailed
Percent            35%            65%
Avg. log loss      34.3           20.5
Table 1: Fraction of the data and average log loss for titles that require hallucinating new facts (left column) and titles that are entailed from the context (right column). As shown, 35% of titles require hallucinating new facts, and the average log loss of titles requiring new facts is over 1.7× that of the directly entailed titles.
Figure 3: Normalized histogram of log losses for titles that require hallucinating new facts compared to those that can be directly entailed. As shown, titles requiring new facts incur significantly higher loss, and more than 80% of examples with log loss greater than 40 require hallucinating new facts.

Distinguishability. Given that large-scale data will inevitably contain annotation errors and noise, we might ask whether there are effective alternatives to the KL divergence for training models. The distinguishability of samples from a model compared to the reference is one such objective. Distinguishability has recently gained attention as a way to learn and evaluate models based on both sample quality and diversity hashimoto2019unifying; zhou2019hype; zellers2019defending; gehrmann2019gltr. We show that this objective also serves as a naturally robust alternative to the KL divergence for learning language models. Unfortunately, directly optimizing for distinguishability (e.g., via generative adversarial networks) is challenging caccia2018language and we show this works poorly in practice (§5).

Distinguishability is defined as the error rate of an optimal classifier which seeks to distinguish samples from both the model and reference, and we will formally define this via the mixture

$$y \mid x \sim \begin{cases} p_{\text{ref}}(y \mid x) & \text{if } z = 1 \\ p_\theta(y \mid x) & \text{if } z = 0 \end{cases}$$

where $z \sim \mathrm{Bernoulli}(\tfrac{1}{2})$. We can now define $L^*$ to be twice the optimal error in identifying samples from the model,

$$L^* := 2 \inf_f \Pr[f(x, y) \neq z]. \qquad (2)$$

Our measure of distinguishability, the total variation (TV) distance, is a linear function of this error,

$$|p_\theta - p_{\text{ref}}|_{TV} = 1 - L^*,$$

where $p_\theta$ and $p_{\text{ref}}$ refer to the joint distributions $p_\theta(y \mid x)p(x)$ and $p_{\text{ref}}(y \mid x)p(x)$ for brevity. Note that distinguishability is inherently robust to the addition of any small fraction of noisy data donoho1988automatic. Unlike the log loss, the model's loss on any single example under TV is bounded above (Eq (2)). We show an example of TV's robustness in Figure 1, where a small amount of noise does not substantially affect the learned distribution.

Log loss as a surrogate for distinguishability. Distinguishability is both robust and provides sample quality guarantees, but is challenging to optimize caccia2018language. One approach to optimizing for distinguishability is to find an appropriate surrogate loss which serves as an upper bound. This is analogous to the use of logistic or hinge losses as a way to optimize for classification accuracy. For log loss, Pinsker's inequality csiszar2011information relates the KL divergence and distinguishability as

$$|p_\theta - p_{\text{ref}}|_{TV} \leq \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p_{\text{ref}} \,\|\, p_\theta)}. \qquad (3)$$

This explains the empirical success of log loss in low-uncertainty situations, where the KL divergence is sufficiently small and this bound becomes tight.

Our approach will be to modify the log loss slightly by truncating the distribution. This truncated loss will be as easy to optimize as log loss, while being more robust and providing a tighter variant of Pinsker’s inequality.

3 Loss Truncation

Intuition. We would like the model to ignore data that would force it to unnecessarily hallucinate at test time. Concretely, recall the toy example (Figure 1); there is a set of invalid references that force the model to be degenerate. If we could remove these invalid references by truncating the distribution, the resulting model would be high quality. We can show that this intuition is theoretically justified, and that truncating (i.e., removing) an appropriate $c$-fraction of the data provides tighter bounds on the distinguishability of the model.

Improved log losses for distinguishability. We will demonstrate that log loss with an appropriate $c$-fraction of the data removed provides guarantees on distinguishability. We define the set of truncated distributions $\mathcal{P}_c$ as the set of distributions obtained from $p_{\text{ref}}$ by removing any $c$-fraction of the data and renormalizing,

$$\mathcal{P}_c := \left\{\, p_{\text{ref}}\big((x, y) \mid (x, y) \in A\big) \;:\; \Pr_{p_{\text{ref}}}\big[(x, y) \in A\big] \geq 1 - c \,\right\}.$$

A simple lemma shows that all elements of $\mathcal{P}_c$ are $c$-close to $p_{\text{ref}}$ in TV (Appendix B).

Now we state our main result.

Proposition 1.

For any truncated distribution $p_t \in \mathcal{P}_c$ and model $p_\theta$,

$$|p_\theta - p_{\text{ref}}|_{TV} \leq \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p_t \,\|\, p_\theta)} + c.$$

See Appendix B for the proof. Namely, distinguishability is bounded by the log loss with respect to the truncated distribution plus a small constant. Furthermore, this upper bound is valid for any $c$, although different choices of $c$ will change the tightness of the bound and produce different models.

This truncated bound can be substantially tighter than Pinsker's inequality. Consider for example a model that can perfectly capture a $(1-c)$-fraction of the data, while the remaining $c$-fraction of the reference outputs cannot be generated by the model and receive probability zero. In this case, the distinguishability (TV) is $c$, the KL divergence is infinite, while our truncated bound is $c$. This suggests that appropriately truncating high-loss examples makes log loss robust and allows us to use log loss as a surrogate for distinguishability, even in the presence of invalid and noisy references.
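Worked out under the assumptions above (the model $p_\theta$ exactly equals some $p_t \in \mathcal{P}_c$, and the remaining $c$-fraction of references receives probability zero), the three quantities are

$$|p_\theta - p_{\text{ref}}|_{TV} = c, \qquad \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p_{\text{ref}} \,\|\, p_\theta)} = \infty, \qquad \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p_t \,\|\, p_\theta)} + c = 0 + c = c,$$

so Pinsker's bound (Eq (3)) is vacuous while the truncated bound is tight.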

Loss truncation. Given that the log loss on any $c$-fraction of the data is a surrogate loss for distinguishability (Eq (6)), a key parameter to optimize is the truncated distribution $p_t$. An oracle solution would exhaustively search over both $c$ and which data to drop. However, exhaustively searching through $\mathcal{P}_c$ is a combinatorial optimization problem and infeasible. Our approach will be to optimize $p_t$ with a heuristic. The truncated objective takes the form of a log loss and a negative entropy term,

$$\mathrm{KL}(p_t \,\|\, p_\theta) = \mathbb{E}_{x, y \sim p_t}\big[-\log p_\theta(y \mid x)\big] - H(p_t),$$

and we will select $p_t$ by dropping the examples with the highest log loss, treating the negative entropy term as upper bounded by zero.

This heuristic is straightforward to compute, provides an upper bound on distinguishability, and matches our earlier observation that high-loss examples are correlated with invalid examples we would like the model to ignore (see Table 1).

Figure 4: Pinsker's inequality, our truncated bound, and the squared total variation as a function of the estimated Gaussian mean. As shown, loss truncation can significantly improve bounds over Pinsker's inequality and, in this case, has a nearly identical minimizer to directly minimizing the total variation.

As an example of how our heuristic can improve estimation and tighten bounds, consider the earlier toy example in Figure 1. In this example, we find the optimal mean for a single Gaussian with fixed variance which fits a mixture of two Gaussians. Figure 4 shows the objective function value implied by the TV loss, the log loss (via Pinsker's bound), and our $c$-truncated bound as a function of the Gaussian mean. We find that log loss provides an upper bound on distinguishability (via Pinsker's inequality) but is loose and results in a low quality estimate. In contrast, $c$-truncation results in a nearly identical minimizer to directly minimizing TV.

4 Implementing Truncation

4.1 Training

Our algorithm has three components at training time. First, it trains a model on all the data using standard hyperparameters, which we refer to as "hotstarting" the model. Second, it tracks a running estimate of the $(1-c)$ quantile of the losses during training. Third, it performs gradient updates only on examples whose losses are below the current quantile estimate. We present the pseudocode in Algorithm 1 and describe each step in detail below. (Our code is available at https://github.com/ddkang/loss_dropper.)

Hotstarting. First, our algorithm hotstarts the model (hotstart() in Alg. 1) by training with the standard log loss. Hotstarting addresses two challenges in optimizing the truncated loss. First, losses are uninformative at the start of training, so truncating examples based on these losses would result in dropping valid examples. We have empirically found that truncating after hotstarting primarily drops invalid references, which avoids this problem. Second, hotstarting allows the model to transfer information from the entire dataset to the clean fraction of the data. Examples that cause a model to hallucinate may still contain valid information about the fluency of a sentence, which hotstarting can capture. This is effectively pretraining the model on the entire dataset before learning to generate on the clean subset. We have found this procedure to be effective in practice.

Quantile estimation. Second, our algorithm keeps track of the $(1-c)$ quantile of the distribution of losses. For each new minibatch $B$, we update an online estimate of this quantile (estimateQuantile() in Alg. 1). To estimate it, our algorithm constructs a histogram over the last 10,000 examples seen during training and recomputes the empirical quantile every 10,000 examples. (For datasets with fewer than 10,000 examples, we can perform this procedure over the entire dataset.)
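A minimal sketch of such a tracker (our illustration; the paper describes a histogram updated every 10,000 examples, which we approximate here with a sliding buffer and a direct empirical quantile):

from collections import deque

import numpy as np

class RunningQuantile:
    """Track the (1 - c) quantile of recent per-example losses."""

    def __init__(self, c, window=10_000):
        self.c = c
        self.losses = deque(maxlen=window)
        self.threshold = float("inf")  # keep everything until warmed up

    def update(self, batch_losses):
        self.losses.extend(batch_losses)
        if len(self.losses) == self.losses.maxlen:
            # Examples with loss above this value fall in the top c-fraction.
            self.threshold = float(np.quantile(self.losses, 1.0 - self.c))
        return self.threshold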

Loss dropping. Third, our algorithm performs minibatch stochastic gradient descent while excluding examples whose losses exceed the current quantile estimate (truncatedUpdate() in Alg. 1). Dropping can be accomplished in automatic differentiation packages (e.g., TensorFlow and PyTorch) by setting the loss on a dropped example to zero.
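In PyTorch, for example, the truncated update amounts to multiplying per-example losses by a binary mask (a sketch under an assumed encoder-decoder interface; model, batch.src, batch.tgt_in, and batch.tgt_out are placeholders, not the paper's API):

import torch
import torch.nn.functional as F

def truncated_update(model, optimizer, batch, threshold):
    """One truncatedUpdate step: drop examples whose sequence-level
    log loss exceeds the current (1 - c)-quantile estimate."""
    logits = model(batch.src, batch.tgt_in)              # (B, T, V)
    token_nll = F.cross_entropy(logits.transpose(1, 2),  # (B, T)
                                batch.tgt_out, reduction="none")
    seq_loss = token_nll.sum(dim=1)                      # per-example log loss
    mask = (seq_loss <= threshold).float()               # 0 = dropped example
    loss = (seq_loss * mask).sum() / mask.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return seq_loss.detach()  # can be fed to the quantile tracker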

4.2 Generating High-Probability Samples

Thus far, our goal has been to robustly learn the underlying distribution. However, in some cases, a user may wish to only generate high confidence sequences, which will ideally correspond to high quality sequences.

To generate such samples, we propose sequence-level rejection sampling.

Recall that our truncation heuristic selects for the $(1-c)$ quantile of the distribution. For a user-defined rejection level $\alpha$, our rejection sampling scheme will aim to generate samples from above the $(1-\alpha)$ quantile of the model's sequence probabilities.

To perform rejection sampling, given a model and a user-defined rejection level $\alpha$, we first sample $n$ sequences (e.g., titles in a summarization task). Then, we return a sequence drawn uniformly at random from the $\lceil \alpha n \rceil$ samples with the smallest log loss. Ideally, this procedure returns a sample from above the $(1-\alpha)$ quantile of $p_\theta$.
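A sketch of this procedure (our illustration; the sample and log_loss methods are assumed model interfaces, and the defaults for n and alpha are arbitrary):

import math
import random

def rejection_sample(model, context, n=100, alpha=0.1):
    """Sample n sequences, keep the ceil(alpha * n) with the lowest
    log loss (highest model probability), and return one of the kept
    sequences uniformly at random."""
    samples = [model.sample(context) for _ in range(n)]
    samples.sort(key=lambda y: model.log_loss(context, y))
    kept = samples[:math.ceil(alpha * n)]
    return random.choice(kept)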

We show that rejection sampling can outperform baselines in generating factual summaries (§5). We further show examples of selected and rejected samples in Appendix A.

Data: model $p_\theta$, fraction of data to drop $c$, number of iterations $T$
$p_\theta \leftarrow$ hotstart($p_\theta$);
for $t = 1$ to $T$ do
       $B \leftarrow$ sampleMinibatch();
       $q \leftarrow$ estimateQuantile($B$);
       $p_\theta \leftarrow$ truncatedUpdate($p_\theta$, $B$, $q$);
end for
Algorithm 1: The proposed loss truncation procedure with three components (see main text for details of each component).
         Loss trunc.  Trunc+reject  Full samp.  Beam  top-$k$  top-$p$  GAN
HUSE        0.58          0.04         0.55     0.04    0.32     0.32   0.003
HUSE-D      0.88          0.12         0.98     0.18    0.59     0.65   0.25
HUSE-Q      0.70          0.92         0.58     0.86    0.73     0.67   0.75
Table 2: HUSE, HUSE-D, and HUSE-Q scores for loss truncation and baselines. As shown, loss truncation outperforms all baselines on HUSE score.

5 Evaluation

5.1 Experimental Setup

Dataset and Task. We primarily evaluate loss truncation on abstractive summarization in the form of generating news headlines from an article. We selected this task to highlight that loss truncation can improve sample quality and factual accuracy, while also achieving the secondary goal of diversity for abstractive systems see2017get; kryscinski2019neural.

We evaluated on the Gigaword summarization task rush2017neural as in gehrmann2018bottom. While there are other summarization datasets, we chose Gigaword for the following reasons. First, it is large enough that sample quality defects are not caused by a lack of data. Second, the dataset is structured so that neither model capacity nor computation is the bottleneck in performance: standard sequence-to-sequence models are competitive on the Gigaword dataset. Third, while the Gigaword dataset is known to contain noise, this noise is consistent with the annotation errors beigman2009learning; klebanov2010some and annotator uncertainty kryscinski2019neural documented in other datasets.

To show that loss truncation is applicable beyond summarization, we also performed a preliminary evaluation of our approach on the E2E NLG task. In E2E, the goal is to generate restaurant reviews from meaning representations dusek2019e2e.

Model and Baselines. We used a standard LSTM architecture with global attention for summarization that has been used for the Gigaword summarization task in the past gehrmann2018bottom. The learning rate and hyperparameters are given in Appendix C. For the E2E task, we use a standard model with the exact settings as in puzikov2018e2e.

For loss truncation on Gigaword, we used a fixed truncation fraction $c$. We matched the total number of training steps (including the hotstart) between loss truncation and standard log loss training. We sampled from the full model distribution for loss truncated models except when rejection sampling.

As baselines on Gigaword, we generate from the log loss trained language model using several decoders that have been reported to mitigate low-quality outputs: beam search, top-$k$ sampling fan2018hierarchical, and top-$p$ sampling holtzman2019curious. We also evaluate direct sampling from the probabilistic model in order to estimate overall distinguishability and understand the diversity-quality trade-offs of each model.

Finally, on Gigaword, we also compared against a recent generative adversarial network (GAN) model with a publicly available implementation wang2018learning.

Human-evaluation metrics. We evaluate whether loss truncation improves model distinguishability on summarization by measuring the HUSE estimator for TV hashimoto2019unifying. HUSE measures distinguishability by learning a classifier over the log-probabilities and human evaluation scores of both samples from the model and references. We also use HUSE to evaluate the quality-diversity tradeoffs of the models by estimating both HUSE-Q (which measures quality via human judgment) and HUSE-D (which measures diversity via statistical evaluation).
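As a rough sketch of how the HUSE estimator works (our reading of hashimoto2019unifying; the neighbor count and unscaled features here are assumptions, not the paper's settings): each sample is featurized by its model log-probability and its mean human typicality score, and HUSE is twice the leave-one-out error of a nearest-neighbor classifier separating model samples from references.

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def huse_estimate(model_feats, ref_feats, k=15):
    """model_feats, ref_feats: arrays of shape (n, 2) holding
    (log-probability, mean human score) per sample."""
    X = np.vstack([model_feats, ref_feats])
    y = np.concatenate([np.zeros(len(model_feats)), np.ones(len(ref_feats))])
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X, y, cv=LeaveOneOut()).mean()
    return 2.0 * (1.0 - acc)  # twice the optimal classification error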

In order to assess whether this leads to improvements in the faithfulness of samples, we measure whether loss truncation reduces the number of factually inaccurate outputs from the model via a crowdsourced survey. We designed our prompt based on earlier factual accuracy human evaluations novikova2017we and measured whether the original article contained all of the information given in the generated title. We describe the crowd worker setup in Appendix D.

Automated metrics. While human evaluation is our primary metric of evaluation as it is considered the gold standard, we additionally evaluate on automated metrics to contextualize our human evaluation results. We measure ROUGE-L lin2003automatic for summarization and BLEU papineni2002bleu for E2E.

5.2 Loss Truncation Outperforms Baselines on HUSE

Using the HUSE score to measure the TV distance, we assessed whether loss truncation successfully improved our model in terms of distinguishability compared to log loss. As shown in Table 2, loss truncation outperforms all baselines on HUSE score (including the original log loss model, Full samp.), suggesting the truncated model is a better language model than the log loss model as measured by distinguishability.

We find that loss truncation improves over the log loss by increasing generation quality (HUSE-Q) by 12% without substantially lowering diversity (e.g., by memorizing examples from the training set). These results affirmatively answer an open question posed by hashimoto2019unifying on whether it is possible to obtain models that improve quality while maintaining overall distinguishability compared to log loss trained models. Post-hoc modification of the log loss model's distribution by removing unlikely words using either top-$k$ or top-$p$ sampling results in substantial losses in HUSE due to losses in diversity.

We further considered matching the entropy of the loss truncation model with top-$k$ and top-$p$ sampling (Appendix C). At a fixed entropy, loss truncation can outperform on HUSE by up to 26%.

Comparing models with high sample quality, loss truncation with rejection sampling improves upon all baselines (including beam search) in terms of raw human quality evaluation (HUSE-Q), and we see that the Pareto frontier of truncation and rejection sampling (which can be achieved via ensembling) dominates the baselines on both quality and diversity (Figure 5). Rejection sampling decreases overall HUSE score because it is designed to only return high quality samples (i.e., high HUSE-Q): this comes at the cost of reduced diversity, so overall HUSE score suffers.

The results amongst our baselines recapitulate known results for the quality-diversity tradeoffs of existing methods. Beam search has high sample quality but low diversity; top-$k$ and top-$p$ samplers provide diversity gains over beam search; and GANs generally underperform well-tuned log loss based models on both diversity and quality.

5.3 Loss Truncation with Rejection Sampling Produces High Quality Outputs

We now ask whether improvements in distinguishability (as measured by HUSE) for the loss truncation model translate to practical improvements in sample quality, such as the factual accuracy of generated outputs in summarization. We evaluate this through a crowdsourced study on factual accuracy.

Since we are interested in studying whether our model can produce high quality samples, we used rejection sampling with rejection level $\alpha$ to obtain high-quality samples from the model. We compare this to the log loss model with baseline decoders. For the top-$k$ and top-$p$ sampling decoders that have quality-diversity tradeoffs, we select $k$ and $p$ such that the entropy of the sampling distribution matches our rejection sampling approach (see Appendix C for details).

Figure 5: HUSE-D vs HUSE-Q for loss truncation, truncation + rejection sampling, and baselines. The red line shows the best achievable frontier via ensembling. Truncation and rejection outperform all baselines.
Condition                   Mean score
Human                       3.63 ± 0.05
Truncation + Rejection (α)  3.79 ± 0.06
Beam                        3.51 ± 0.05
top-$k$                     3.42 ± 0.05
top-$p$                     3.29 ± 0.05
Sampling                    2.96 ± 0.05
Table 3: Mean scores and standard errors of factuality in generated news titles given articles. As shown, rejection sampling outperforms all baselines and matches the human reference score.

To measure factual accuracy, we asked crowd workers how much information in the generated titles was contained in the article in a similar fashion to novikova2017we. Table 3 shows the average factual accuracy rating for each model. We find that rejection sampling outperforms all baselines, including the current gold standard of beam search, and matches the human reference level of factual accuracy.

Although it may seem surprising that loss truncation and rejection sampling together can achieve the same factual accuracy score as humans, recall that 35% of the dataset consists of titles with facts that are not contained in the article (Table 1). The loss truncation approach biases the model towards learning only the easily predicted (and likely factually accurate) titles.

5.4 Loss Truncation Produces Diverse Outputs

Method           Example
Context          at least ## people have been killed and more than ##,### made homeless by floods that swept across southern africa in the past week , striking a region already grappling with severe food shortages .
Gold             floods kill ## in famine-hit southern africa
Loss truncation  at least ## people killed ##,### evacuated in floods in southern african region
                 floods that sweep parts of africa kill at least ##
Beam             flooding hits southern africa as deaths rise
Full sampling    child farming stalls in southern africa
                 earthquake kills ## in southern africa
top-$p$          torrential rains prompt warnings in southern africa
                 toll nears ## in southern africa
top-$k$          at least ## killed ##,### homeless in southern africa floods
                 at least ## dead ##,### homeless as floods hit southern africa
Table 4: Examples of generations for various baselines and loss truncation (two replicates shown for sampled outputs). As shown, loss truncation achieves diverse and high quality outputs. In contrast, the baselines are either not diverse (beam, top-$k$) or of poor quality (full sampling, top-$p$). We color incorrect facts in red.

Finally, one of the benefits of optimizing for distinguishability is that it naturally optimizes for both diversity and quality. Manually examining outputs from the models, we find that directly sampling from the loss truncated model often produces high quality and diverse outputs. We show examples of generated outputs for baselines and loss truncation in Table 4. Loss truncation uses different phrasings ('at least # killed' and 'floods sweep'), while top-$k$ follows a nearly templated pattern with a few changes to the words that appear. Top-$p$ and direct sampling both have diverse phrasings but also hallucinate facts ('earthquake' in direct sampling and 'torrential rains' in top-$p$ sampling).

5.5 Loss Truncation can Outperform on Automated Metrics

While our primary evaluation metrics are human evaluations (HUSE and factuality), we additionally investigate automated metrics to further contextualize our results. For summarization we used ROUGE-L, and for E2E we used BLEU.

For summarization, the ROUGE-L scores for loss truncation and entropy-matched top-$k$ and top-$p$ decoding were 23.2, 22.8, and 22.8, respectively. While loss truncation does not substantially improve ROUGE-L, it still outperforms the baselines. We do not expect reference-based evaluations to fully capture the benefits of loss truncation, as these metrics encourage models to fully imitate the data distribution, including invalid and hallucinated references.

For E2E, the BLEU scores for loss truncation and the baseline were 0.72 and 0.64 respectively. We confirmed that the baseline model for the E2E task achieves a similar score as reported by balakrishnan2019constrained. Perhaps surprisingly, improving BLEU score to 0.72 almost closes the gap to using complex tree-structured semantic representations, which achieves a BLEU score of 0.74 balakrishnan2019constrained.

We further show that loss truncation is not sensitive to the hyperparameter $c$ on automated metrics in Appendix E.1 and provide a preliminary investigation of combining loss truncation with alternative decoders in Appendix E.2.

6 Related Work

Decoder-based diversity. Researchers have proposed a variety of models for text generation radford2019language; keskar2019ctrl; sutskever2014sequence. These models generate text using decoding methods such as beam search. While beam search is generally thought of as the gold standard tillmann2003word, it can produce generic and repetitive outputs holtzman2019curious. To achieve diversity, top-$k$ fan2018hierarchical and top-$p$ holtzman2019curious sampling stochastically decode the output after restricting the output space to avoid low-quality outputs.

While these techniques can improve generation quality, they rely on models trained via log loss, which we show can result in undesired behavior that cannot be fixed post-hoc. Our work is complementary to existing work on decoders by proposing a loss that can improve the probabilistic models which these decoders operate on.

Loss modifications. Prior work has identified specific issues in generative models, such as repetitiveness, and proposed loss modifications to address these specific issues in the context of long text generation welleck2019neural; holtzman2018learning. In contrast, we identify an issue with the widely used log loss, and propose loss truncation, which does not require a task- and issue-specific modification. Many of the penalties and decoding techniques proposed in these earlier works can be combined with truncated log loss to obtain models that are more robust to noisy references.

Contemporaneous with our work, tian2019sticking propose an attention weight approach to improving generation faithfulness via decoder and loss modifications. Our work complements this by providing a conceptual basis for improving faithfulness by ignoring examples (i.e., optimizing distinguishability), and providing a simple and general loss. We consider complex, model dependent loss truncation methods for optimizing distinguishability to be exciting future work.

Other generation methods optimize for task-specific losses och2003minimum; shen2015minimum. Task specific losses are not known in many cases and thus we require an effective task-agnostic loss, e.g., log loss or TV. We show that TV acts as a useful task-agnostic goodness of fit measure, and we provide an improved alternative to log loss.

GANs. GANs have been proposed as a way to learn models that minimize distinguishability li2017adversarial; rajeswar2017adversarial; dai2017towards. While GANs have been successful in generating images goodfellow2014generative; brock2018large, they remain challenging to optimize for text due to its discrete nature. Our findings match earlier reports that GANs underperform log loss trained sequence-to-sequence models caccia2018language. In this work, we show that better training methods for distinguishability can arise from modifying the standard log loss via truncation.

Robust learning. Robust learning is the study of learning in the face of outliers tukey1960survey; donoho1982breakdown; huber1992robust. Our work is related to the $\epsilon$-contamination model, in which an $\epsilon$-fraction of the data has been modified, potentially by an adversary diakonikolas2018sever. Our work shows that robust learning under the log loss can result in improved empirical performance and bounds on distinguishability.

While there are a number of effective approaches to robust learning diakonikolas2018sever; fischler1981random, we focus on a simple truncation procedure as it is one of the only procedures scalable enough to apply to large-scale generation datasets. Our work suggests that developing more effective, scalable robust learning procedures may further improve natural language generation methods.

7 Conclusion

In this work, we show that log loss is not robust to noise, which can in turn cause undesired behavior, such as hallucinating facts in summarization. In response, we propose loss truncation, a robust training method that optimizes for distinguishability of generated samples. We additionally propose a sequence-level rejection sampling scheme to generate high quality sequences. We show that loss truncation outperforms a range of baselines (including beam search, top-$k$ sampling, top-$p$ sampling, and full sampling) on distinguishability. We additionally show that rejection sampling outperforms all baselines, including beam search, on generating factual summaries. These results suggest that robust learning in the form of truncating the log loss can complement model-based approaches to faithful generation by ignoring invalid and undesired references.


Appendix A Examples of Titles and Generations

Examples of ground truth titles. We present examples of titles in Figure 6: one that requires factual hallucination and one that can be directly entailed from the context.

Context: Donna Shalala is sporting a mustache to promote public health.
Title: Milk on Her Lip Shalala Raises Eyebrows

(a) Example of a title that requires hallucinating new facts, e.g., “Milk on Her Lip” and “raises eyebrows”.

Context: Southwest China’s Sichuan province has decided to build an inter-city high-tech industrial belt to serve development of Western China.
Title: Sichuan to Build High-Tech Industrial Belt

(b) Example of a title that can be directly generated from the context.
Figure 6: Examples of titles that require hallucinating new facts and titles that are directly entailed from context.

Examples of generated titles. We present examples of titles that were selected and titles that were rejected during rejection sampling in Figure 7. As shown, rejected titles tend to be of lower quality.

Context: At least two people have tested positive for the bird flu virus in Eastern Turkey, health minister Recep Akdag told a news conference Wednesday.
Ground truth: Two test positive for bird flu virus in Turkey
Selected sample: Two reported positive for bird flu in Eastern Turkey
Rejected sample: Two officials fail to get good for bird flu in Eastern Turkey

(a) Example 1.

Context: British investment fund Fidelity has increased its stake in Puma, the German maker of sportswear and equipment, to just over five percent, Puma said on Thursday.
Ground truth: Private equity firm Fidelity raises stake in Puma to over five pct
Selected sample: Fidelity increases stake in Puma
Rejected sample: Boost higher first-half stake in Puma says Puma

(b) Example 2.
Figure 7: Examples of sampled titles that were selected and rejected in rejection sampling at rejection level $\alpha$.

Appendix B Proof of Lemma and Proposition

Lemma. We prove the lemma that all elements of $\mathcal{P}_c$ are close to $p_{\text{ref}}$ in total variation.

Lemma 1. For all $p_t \in \mathcal{P}_c$, $|p_t - p_{\text{ref}}|_{TV} \leq c$.

Proof.

By definition of $\mathcal{P}_c$, for any $p_t$ there exists a set $A$ with $\Pr_{p_{\text{ref}}}[(x, y) \in A] = 1 - c'$ for some $c' \leq c$ such that $p_t = p_{\text{ref}}(\cdot \mid A)$, so

$$|p_t - p_{\text{ref}}|_{TV} = \frac{1}{2} \sum_{(x,y) \in A} p_{\text{ref}}(x, y)\left(\frac{1}{1-c'} - 1\right) + \frac{1}{2} \sum_{(x,y) \notin A} p_{\text{ref}}(x, y) = \frac{c'}{2} + \frac{c'}{2} = c' \leq c. \qquad \blacksquare$$

Proposition. We prove that the truncated log loss bounds the total variation.

Proof.

$$|p_\theta - p_{\text{ref}}|_{TV} \leq |p_\theta - p_t|_{TV} + |p_t - p_{\text{ref}}|_{TV} \qquad (4)$$
$$\leq \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p_t \,\|\, p_\theta)} + |p_t - p_{\text{ref}}|_{TV} \qquad (5)$$
$$\leq \sqrt{\tfrac{1}{2}\,\mathrm{KL}(p_t \,\|\, p_\theta)} + c \qquad (6)$$

which follows from the triangle inequality (4), Pinsker's inequality (5), and Lemma 1 to bound the remaining term by $c$ (6). ∎

Appendix C Hyperparameters

Summarization model hyperparameters. We used a standard OpenNMT-py model with global attention for all sequence-to-sequence experiments klein2017opennmt. It has a single LSTM layer in the encoder and two in the decoder.

For the baseline model, we train for 200,000 steps with SGD and a fixed initial learning rate. For the loss truncated model, we hotstart with 100,000 minibatch updates and subsequently train for 100,000 minibatch updates with the truncated loss.

$k$ and $p$ selection. Key parameters in top-$k$ and top-$p$ sampling are $k$ and $p$, respectively. These parameters trade off between diversity and quality. To select these values, we chose $k$ and $p$ so that the entropy of the sampling distribution matched that of our model trained with loss truncation.

Specifically, one pair of $k$ and $p$ values was matched in entropy to the loss truncation model for summarization, and another pair was matched in entropy to rejection sampling for summarization.
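One way to perform this matching (our sketch; the sample and log_prob methods are assumed decoder interfaces, and for top-$k$ or top-$p$ the log-probability must be computed under the decoder's own truncated, renormalized distribution):

def decoder_entropy(decoder, contexts, samples_per_context=10):
    """Monte Carlo entropy estimate: the entropy of a decoder equals
    the expected negative log-probability of its own samples. Tune k
    (or p) until this matches the loss-truncated model's entropy."""
    total, count = 0.0, 0
    for ctx in contexts:
        for _ in range(samples_per_context):
            y = decoder.sample(ctx)
            total -= decoder.log_prob(ctx, y)
            count += 1
    return total / count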

Appendix D Crowd Worker Setup and Prompts

Crowdsourcing setup. For all human evaluations, we used Amazon Mechanical Turk (all prompts shown below). We sampled 312 context/title pairs to measure HUSE. For each generated title, we asked 9 crowd workers to measure the typicality of the generated title, as in hashimoto2019unifying. Each crowd worker responded to 24 generated titles.

For measuring factuality, we sampled 312 examples and for each example, we asked two crowd workers how much information in the generated title was present in the article.

(a) Prompt for measuring HUSE.
(b) Prompt for measuring factuality.
Figure 8: Prompts for measuring HUSE and factuality.

Prompts. We show crowd worker prompts for measuring HUSE and factuality in Figure 8. The HUSE prompt was directly taken from hashimoto2019unifying with an extra control.

Appendix E Further experiments

E.1 Sensitivity to $c$

Condition            ROUGE-L
Truncation ($c_1$)   24.3
Truncation ($c_2$)   24.9
Truncation ($c_3$)   24.0
Truncation ($c_4$)   23.2
top-$k$              22.8
top-$p$              22.8
Table 5: ROUGE-L scores for loss truncation at various settings of $c$ and entropy-matched top-$k$ and top-$p$ decoding for summarization. As shown, loss truncation outperforms on ROUGE-L for a range of $c$.
Condition            BLEU
Truncation ($c_1$)   0.72
Truncation ($c_2$)   0.71
Truncation ($c_3$)   0.70
Truncation ($c_4$)   0.69
Truncation ($c_5$)   0.69
Baseline             0.64
Table 6: BLEU scores for loss truncation at various settings of $c$ and the baseline model on the E2E task. As shown, loss truncation outperforms the baseline on BLEU score at a range of hyperparameters.

We investigate the sensitivity of loss truncation to the hyperparameter $c$. To do so, we vary $c$ and measure ROUGE-L and BLEU scores for summarization and E2E, respectively.

We show results for summarization in Table 5 and for E2E in Table 6, along with baselines. As shown, truncation outperforms the baselines on automated metrics for a variety of hyperparameter settings. We leave a full investigation of sensitivity to $c$ as future work.

E.2 Combining Loss Truncation and Decoders

Condition                    ROUGE-L
Log loss, beam               41.4
Log loss, full sampling      27.9
Truncation, top-$k$ ($k_1$)  33.4
Truncation, top-$k$ ($k_2$)  38.9
Truncation, top-$p$ ($p_1$)  35.1
Truncation, top-$p$ ($p_2$)  40.9
Table 7: Loss truncation combined with top-$k$ and top-$p$ decoding.

As loss truncation is a training method, it can be combined with alternative decoding methods at inference time. As such, we perform a preliminary investigation of using top-$k$ and top-$p$ decoding with loss truncation.

We show ROUGE-L scores of loss truncation combined with various decoders, along with baselines, for summarization in Table 7. As shown, top-$k$ and top-$p$ decoding work with loss truncation and can improve sample quality.