## 1 Introduction

Learning to generate text is a core part of many NLP tasks, including summarization nallapati2016abstractive, image captioning lin2014microsoft, and story generation roemmele2016writing. A common challenge across these tasks is that references from the training distribution are not unique and contain substantial variation in phrasing and content wiseman2017challenges; dhingra2019handling. Learning to generate under a set of diverse and noisy references is challenging, as some variations ought to be learned (e.g., paraphrasing) while others should not (e.g., hallucinated facts, ignoring prompts). Existing training procedures seek to match the underlying distribution, leading to models that replicate and sometimes even amplify unwanted behaviors such as hallucination during generation. For example, neural language models often produce fluent text that is unfaithful to the source tian2019sticking; wiseman2017challenges; lee2018hallucinations. Existing work fan2018hierarchical; holtzman2019curious has primarily addressed these issues by constructing decoders that implicitly remove unwanted variation when generating (see §6 for a detailed discussion of task-specific losses).

In this work, we argue that this phenomenon is not model specific, but is due to the widely used log loss: we demonstrate that log loss is not robust to noisy and invalid references (§2). In particular, log loss requires that models assign probabilities to *all* potential test reference sequences. As a result, log loss is sensitive to outliers: invalid or noisy references with small probability mass can cause large changes in model behavior. We show that the brittleness of log loss, together with the noise in existing generation datasets, leads to low-quality and unfaithful generated text.

Instead of optimizing log loss, which has little correlation
with model output quality theis2016note; hashimoto2019unifying; gamon2005sentence, recent work on diverse generation models has proposed optimizing for the *distinguishability* of samples from the model and the reference.
Distinguishability provides a natural and appealing
guarantee: samples that are indistinguishable from human generated text will be as
high quality as human generated text. Furthermore, we show that optimizing for distinguishability
is robust in the face of noisy and even invalid data. Despite its appeal, distinguishability
has not been widely used due to statistical and computational challenges. For
example, existing methods that directly optimize for distinguishability have yet to match even naive log loss based baselines caccia2018language.

We propose a modification to the log loss, *loss truncation*, that has
the benefits of distinguishability while being efficient to train. Loss truncation is as efficient
to train as log loss, nearly as robust as distinguishability, and provides
distinguishability guarantees via an upper bound. It achieves these properties
by modifying the standard log loss to adaptively remove examples with high log loss. We additionally extend loss
truncation with a *sequence-level* rejection sampling scheme that generates
higher quality sequences by restricting the outputs to be high probability
sequences.

We show that loss truncation with direct and rejection sampling outperforms standard log loss based generation methods (beam search, full sampling, top-$k$ sampling, and top-$p$ sampling) on distinguishability, as measured by the HUSE score hashimoto2019unifying. We additionally study the factual accuracy of a summarization system trained with loss truncation and show that our proposed approach produces summaries that improve upon all baselines (including beam-searched models) and match references on factual accuracy.

## 2 Motivation and Problem Statement

Task and Background.
We consider a natural language generation task with a *conditional language model*, where we are given a context $x$ drawn from $p(x)$, and our probabilistic model $p_\theta(y \mid x)$ produces an output $y$ by approximating a (usually human) reference distribution $p_{\mathrm{ref}}(y \mid x)$.

In order to achieve this, many existing models are trained to minimize the Kullback-Leibler (KL) divergence,

$$\mathrm{KL}(p_{\mathrm{ref}} \parallel p_\theta) = \mathbb{E}_{x,y \sim p_{\mathrm{ref}}}\left[-\log p_\theta(y \mid x)\right] - \mathbb{E}_{x,y \sim p_{\mathrm{ref}}}\left[-\log p_{\mathrm{ref}}(y \mid x)\right]. \quad (1)$$

We refer to the first term of this divergence as the *log loss* of a model. The second term is commonly ignored, as it is a constant with respect to the model. Minimizing the log loss has several practical benefits: 1) it is written as an expected loss (and is thus straightforward to optimize via stochastic gradient descent), 2) it factorizes across tokens in autoregressive modeling, and 3) it provides a guarantee on a model's goodness of fit (Eq (1)).

Unfortunately, log loss also suffers from several drawbacks. It is known to have little correlation with a model's sample quality, and it can be brittle to invalid references in the training data.

Log loss is not robust to noise. The KL divergence has intuitively correct behavior when each input $x$ has a single correct reference $y$: it will maximize the probability of that reference. However, log loss can be problematic when there are multiple correct references, of which some are invalid or difficult to model.

In particular, log loss is sensitive to invalid or noisy data because it requires
that the model assign high probability to *all* potential references. Log loss is
unbounded above: assigning even a single reference zero probability incurs an
infinite overall loss.

We show a well-known example of this behavior with synthetic data, fitting a single Gaussian to a mixture of two Gaussians in Figure 1. The reference distribution (blue) has a set of valid references at zero, as well as variation that the model does not expect (e.g., invalid or noisy references) on the right. Minimizing the log loss results in a suboptimal model that is forced to span both groups. Furthermore, post-hoc processing of the model does not help: even the most likely output under the log loss trained model (~3) has low probability under the reference distribution.
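This failure mode can be reproduced in a few lines. The sketch below fits a single Gaussian by maximum likelihood (equivalently, minimizing log loss) to a two-component mixture; the mixture weights and component means are illustrative, not taken from the paper's figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference distribution: a mixture of two Gaussians. The component at 0
# plays the role of "valid" references; the component at 10 plays the role
# of unexpected/invalid references (20% mixture weight is an assumption).
valid = rng.normal(loc=0.0, scale=1.0, size=8000)
noise = rng.normal(loc=10.0, scale=1.0, size=2000)
data = np.concatenate([valid, noise])

# The MLE mean of a single Gaussian is the sample mean, which is pulled
# between the two modes rather than landing on either of them.
mle_mean = data.mean()
print(f"MLE mean: {mle_mean:.2f}")  # near 0.8*0 + 0.2*10 = 2, far from both modes
```

Even post-hoc truncation of this fitted Gaussian cannot help, since its mode already sits in a low-probability region of the reference distribution.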

In natural language generation, training sets can contain invalid or poor quality references. As such, these types of problems manifest themselves in tasks such as summarization (hallucinating facts), story generation (ignoring prompts and constraints), and captioning (ignoring parts of the image).

Much of the existing literature on faithful generation has focused on designing better models for *valid* references (via copying or attention constraints), but the example in Figure 1 shows that this alone may not be sufficient. The Gaussian ‘model’ in this case perfectly fits the mixture component at zero but is still brittle because it cannot simultaneously fit the other group of (invalid) samples.
Resolving this will require either a model which is designed explicitly to capture *invalid*

references or a loss function that can ignore them.

#### Case Study: Hallucination in Summarization

We show that low-probability reference sequences (e.g., Figure 1) are pervasive by examining the Gigaword summarization dataset rush2017neural. We manually classified 300 titles into two categories: 1) those requiring hallucination of new facts and 2) those directly entailed by the context. We show an example of a reference that requires hallucination in Figure 2. In this example, a model that assigns high probability to the new fact (Thursday) must also frequently hallucinate dates on other examples.

We show the fraction of examples in each category in Table 1. As shown, *35%* of titles require hallucinating new facts. Others have found this phenomenon to be pervasive in other datasets kryscinski2019neural, including the CNN/DM dataset see2017get.

Studying the log loss of these examples (computed from a standard language model; see §5 for details), we note that the average log loss of titles that require new facts is over 1.7× the average loss of titles that are directly entailed (Table 1), and the high-loss examples are clearly dominated by examples that require hallucination (Figure 3). In fact, we find that over 80% of examples with log loss greater than 40 require some form of hallucination.

These statistics are similar to the toy example we presented earlier in Figure 1. A small but nontrivial fraction of invalid and unexpected data force the model to incur high losses. Much like in the earlier example, we can see that a model which aims to have low log loss on this dataset must spend a substantial amount of effort learning to hallucinate.

| | New facts | Directly entailed |
|---|---|---|
| Percent | 35% | 65% |
| Avg. log loss | 34.3 | 20.5 |

*35%* of titles require hallucinating new facts, and the average log loss of titles requiring new facts is over 1.7× the loss of the directly entailed titles.

Distinguishability. Given that large-scale data will inevitably contain annotation errors and noise, we might ask whether there are effective alternatives to the KL divergence for training models. The distinguishability of samples from a model compared to the reference is one such objective. Distinguishability has recently gained attention as a way to learn and evaluate models based on both sample quality and diversity hashimoto2019unifying; zhou2019hype; zellers2019defending; gehrmann2019gltr. We show that this objective also serves as a naturally robust alternative to the KL divergence for learning language models. Unfortunately, directly optimizing for distinguishability (e.g., via generative adversarial networks) is challenging caccia2018language and we show this works poorly in practice (§5).

Distinguishability is defined via the error rate of an optimal classifier that seeks to distinguish samples from the model and the reference. We formally define it via the mixture

$$m(x, y, z) = \begin{cases} p_{\mathrm{ref}}(x, y) & \text{if } z = 1 \\ p_\theta(x, y) & \text{if } z = 0, \end{cases}$$

where $z \sim \mathrm{Bernoulli}(1/2)$ indicates whether the sample $(x, y)$ was drawn from the reference or the model. We can now define $L^*$ to be twice the optimal error in identifying samples from the model,

$$L^* = 2 \inf_{f} \; \mathbb{P}_{(x,y,z) \sim m}\left[f(x, y) \neq z\right]. \quad (2)$$

Our measure of distinguishability, the *total variation (TV) distance*, is a linear function of this error,

$$\mathrm{TV}(p_{\mathrm{ref}}, p_\theta) = 1 - L^*,$$

where $p_{\mathrm{ref}}$ and $p_\theta$ refer to the joint distributions $p_{\mathrm{ref}}(y \mid x)p(x)$ and $p_\theta(y \mid x)p(x)$ for brevity. Note that distinguishability is inherently *robust* to the addition of *any* small fraction of noisy data donoho1988automatic. Unlike the log loss, the model's loss on any single example under TV is upper bounded (Eq 2). We show an example of TV's robustness in Figure 1, where a small amount of noise does not substantially affect the learned distribution.
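The linear relation between TV and optimal classification error can be checked numerically on a small discrete example (the distributions below are illustrative):

```python
import numpy as np

# Two discrete distributions over the same support (values are illustrative).
p_ref = np.array([0.5, 0.3, 0.2])
p_model = np.array([0.2, 0.3, 0.5])

# Total variation distance: half the L1 distance between the distributions.
tv = 0.5 * np.abs(p_ref - p_model).sum()

# For the 50/50 mixture, the optimal classifier predicts the source with the
# larger density at each outcome; its error is half the overlap of the two
# distributions.
opt_error = 0.5 * np.minimum(p_ref, p_model).sum()

# Distinguishability (twice the optimal error) relates linearly to TV.
print(tv, 1 - 2 * opt_error)  # the two quantities coincide
```

Because each outcome contributes at most its own probability mass to the overlap, the per-example contribution to TV is bounded, unlike the per-example log loss.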

Log loss as a surrogate for distinguishability.
Distinguishability is robust and provides sample quality guarantees, but it is
challenging to optimize caccia2018language. One approach to optimizing for
distinguishability is to find an appropriate *surrogate loss* that serves
as an upper bound. This is analogous to the use of logistic or hinge losses as a
way to optimize for classification accuracy. For log loss, *Pinsker's
inequality* csiszar2011information relates the KL divergence and
distinguishability as

$$\mathrm{TV}(p_{\mathrm{ref}}, p_\theta) \leq \sqrt{\tfrac{1}{2}\, \mathrm{KL}(p_{\mathrm{ref}} \parallel p_\theta)}. \quad (3)$$

This explains the empirical success of log loss in low-uncertainty situations, where the KL divergence is sufficiently small and this bound becomes tight.

Our approach will be to modify the log loss slightly by truncating the distribution. This truncated loss will be as easy to optimize as log loss, while being more robust and providing a tighter variant of Pinsker’s inequality.

## 3 Loss Truncation

Intuition. We would like the model to ignore data that would force it to unnecessarily hallucinate at test time. Concretely, recall the toy example (Figure 1): a set of invalid references forces the model to be degenerate. If we could remove these invalid references by truncating the distribution, the resulting model would be high quality. We show below that this intuition is theoretically justified, and that truncating (i.e., removing) an appropriate $c$-fraction of the data provides tighter bounds on the distinguishability of the model.

Improved log losses for distinguishability.
We will demonstrate that log loss with an appropriate $c$-fraction of the data removed provides guarantees on distinguishability.
We define the set of *truncated* distributions as the set of distributions with any $c$-fraction of the data removed,

$$\mathcal{P}_c := \left\{ p_t : p_{\mathrm{ref}} = (1 - c)\, p_t + c\, r \text{ for some distribution } r \right\}.$$

A simple lemma shows that all elements in $\mathcal{P}_c$ are $c$-close to $p_{\mathrm{ref}}$ in TV (Appendix B).

Now we state our main result,

###### Proposition 1.

For any model $p_\theta$ and truncated distribution $p_t \in \mathcal{P}_c$,

$$\mathrm{TV}(p_{\mathrm{ref}}, p_\theta) \leq \sqrt{\tfrac{1}{2}\, \mathrm{KL}(p_t \parallel p_\theta)} + c.$$

See Appendix B for the proof.
Namely, distinguishability is bounded by the log loss with respect to the
truncated distribution and a small constant. Furthermore, this upper bound is valid for
*any* , although different will change the tightness of the bound and
produce different models.

This truncated bound can be substantially tighter than Pinsker's inequality.
Consider, for example, a model that can perfectly capture a $(1-c)$-fraction of
the data, while the remaining $c$-fraction of the reference outputs cannot be generated by the
model and receive probability zero. In this case, the distinguishability
(TV) is $c$ and the KL divergence is *infinite*, while our
truncated bound is $c$. This suggests that appropriately
truncating high-loss examples makes log loss robust and allows us to use log loss as a surrogate for distinguishability, even in the presence of invalid and noisy references.
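This comparison can be made concrete on a small discrete example (the support and the value $c = 0.1$ below are illustrative):

```python
import numpy as np

c = 0.1  # fraction of "invalid" reference mass (illustrative)

# Reference puts mass (1 - c) on outcomes the model captures exactly, and
# mass c on an outcome the model assigns zero probability.
p_ref = np.array([0.45, 0.45, c])
p_model = np.array([0.5, 0.5, 0.0])

tv = 0.5 * np.abs(p_ref - p_model).sum()

with np.errstate(divide="ignore", invalid="ignore"):
    # KL(p_ref || p_model) is infinite due to the zero-probability outcome.
    kl = np.sum(np.where(p_ref > 0, p_ref * np.log(p_ref / p_model), 0.0))
    # Truncate: remove the c-fraction of unsupported mass and renormalize.
    p_t = np.array([0.45, 0.45, 0.0]) / (1 - c)
    kl_trunc = np.sum(np.where(p_t > 0, p_t * np.log(p_t / p_model), 0.0))

trunc_bound = np.sqrt(0.5 * kl_trunc) + c
print(tv, kl, trunc_bound)  # tv = c, kl = inf, truncated bound = c
```

Here the truncated bound is exactly tight (both TV and the bound equal $c$), while the untruncated KL, and hence Pinsker's bound, is vacuous.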

Loss truncation. Given that the log loss on any $c$-fraction of the data is a surrogate loss for distinguishability (Eq (6)), a key quantity to optimize is the truncated distribution $p_t$. An oracle solution would exhaustively search over $c$ and over which data to drop. However, exhaustively searching through $\mathcal{P}_c$ is a combinatorial optimization problem and infeasible. Our approach is to optimize $p_t$ with a heuristic. The truncated objective takes the form of a log loss and a negative entropy term,

$$\mathrm{KL}(p_t \parallel p_\theta) = \mathbb{E}_{x,y \sim p_t}\left[-\log p_\theta(y \mid x)\right] + \mathbb{E}_{x,y \sim p_t}\left[\log p_t(y \mid x)\right],$$

and we select $p_t$ by dropping the examples with the highest log loss, treating the negative entropy term (the second term, which is at most zero) as upper bounded by zero.

This heuristic is straightforward to compute, provides an upper bound on distinguishability, and matches our earlier observation that high-loss examples are correlated with invalid examples we would like the model to ignore (see Table 1).

As an example of how our heuristic can improve estimation and tightness in bounds, consider again the toy example in Figure 1, in which we find the optimal mean for a single fixed-variance Gaussian fit to a mixture of two Gaussians. Figure 4 shows the objective function value implied by the TV loss, the log loss (Pinsker's bound), and our $c$-truncated bound as a function of the Gaussian mean. We find that the log loss provides an upper bound on distinguishability (via Pinsker's inequality) but is loose and results in a low-quality estimate. In contrast, $c$-truncation results in a nearly identical minimizer as directly minimizing TV.

## 4 Implementing Truncation

### 4.1 Training

Our algorithm has three components at training time. First, it trains a model on all the data using standard hyperparameters, which we refer to as "hotstarting" the model. Second, it tracks a running estimate of the $c$-th quantile of the losses during training. Third, it performs gradient updates only on examples whose loss is below the current quantile estimate. We present the pseudocode in Algorithm 1 and describe each step in detail below. (Our code is available at https://github.com/ddkang/loss_dropper.)

Hotstarting. First, our algorithm hotstarts the model (hotstart() in Alg. 1) by training with the standard log loss. Hotstarting address two challenges in optimizing the truncated loss. First, losses are uninformative at the start of training so truncating examples based on these losses will result in dropping valid examples. We have empirically found that truncating after hotstarting primarily drops invalid references, which avoids this problem. Second, hotstarting allows the model to transfer information from the entire dataset to the clean fraction of the data. Examples that cause a model to hallucinate may still contain valid information about the fluency of a sentence, which hotstarting can capture. This is effectively pretraining our model on the entire data before learning to generate on the clean subset. We have found this procedure to be effective in practice.

Quantile estimation.
Second, our algorithm keeps track of the $c$-th quantile of the distribution of losses. For each new minibatch, we update an online estimate of this quantile (estimateQuantile() in Alg. 1). To estimate it, our algorithm constructs a histogram over the last 10,000 examples seen during training and estimates the empirical quantile every 10,000 examples. (For datasets with fewer than 10,000 examples, we can perform this procedure over the entire dataset.)

Loss dropping. Third, our algorithm performs minibatch stochastic gradient descent while excluding examples with losses above the current quantile estimate (truncatedUpdate() in Alg. 1). Dropping can be accomplished in automatic differentiation packages (e.g., TensorFlow and PyTorch) by setting the loss on a dropped example to zero.
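The quantile tracking and dropping steps above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the authors' released `loss_dropper` code; the class name, `drop_fraction` parameter, and loss values are invented for illustration:

```python
import numpy as np
from collections import deque

class LossTruncator:
    """Sketch: track a running quantile of per-example losses over a window
    of recent examples, and zero out losses above that quantile."""

    def __init__(self, drop_fraction=0.4, history=10_000):
        self.drop_fraction = drop_fraction
        self.recent = deque(maxlen=history)  # last `history` losses seen
        self.threshold = np.inf              # keep everything until estimated

    def update_quantile(self, losses):
        self.recent.extend(losses)
        # Empirical (1 - drop_fraction)-quantile of recent losses.
        self.threshold = np.quantile(list(self.recent), 1 - self.drop_fraction)

    def truncate(self, losses):
        # Setting a loss to zero removes its gradient contribution.
        losses = np.asarray(losses, dtype=float)
        return np.where(losses <= self.threshold, losses, 0.0)

truncator = LossTruncator(drop_fraction=0.4)
batch_losses = [1.0, 2.0, 3.0, 40.0, 50.0]  # two outlier "hallucination" losses
truncator.update_quantile(batch_losses)
print(truncator.truncate(batch_losses))     # losses above the quantile are zeroed
```

In a real training loop, the per-example losses would come from the autodiff graph so that zeroing them drops their gradients, and quantile updates would be refreshed periodically rather than every batch.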

### 4.2 Generating High-Probability Samples

Thus far, our goal has been to robustly learn the underlying distribution. However, in some cases, a user may wish to only generate high confidence sequences, which will ideally correspond to high quality sequences.

To generate such samples, we propose *sequence-level rejection sampling*.

Recall that our truncation heuristic selects for a quantile of the loss distribution. For a user-defined level $\alpha$, our rejection sampling scheme will aim to generate samples from the $\alpha$-quantile.

To perform rejection sampling, given a model and a user-defined rejection level $\alpha$, we first sample $N$ sequences (e.g., titles in a summarization task). Then, we return a sequence drawn uniformly at random from the $\alpha N$ samples with the smallest log loss. Ideally, this procedure returns a sample in the $\alpha$-quantile of $p_\theta$.
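A minimal sketch of this sampling scheme (the function name, candidate titles, and loss values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(sequences, log_losses, alpha, rng):
    """Sequence-level rejection sampling sketch: keep the alpha-fraction of
    candidates with the smallest log loss, then return one uniformly."""
    n_keep = max(1, int(np.ceil(alpha * len(sequences))))
    keep = np.argsort(log_losses)[:n_keep]  # indices of lowest-loss candidates
    return sequences[rng.choice(keep)]

# Hypothetical candidate titles and their model log losses (illustrative).
candidates = ["title a", "title b", "title c", "title d", "title e"]
losses = [12.0, 35.0, 8.0, 40.0, 9.5]
chosen = rejection_sample(candidates, losses, alpha=0.4, rng=rng)
print(chosen)  # one of the two lowest-loss candidates
```

Note the diversity-quality tradeoff: smaller $\alpha$ concentrates samples on high-confidence sequences at the cost of output diversity.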

We show that rejection sampling can outperform baselines in generating factual summaries (§5). We further show examples of selected and rejected samples in Appendix A.

| | Loss trunc. | Trunc.+reject | Full samp. | Beam | top-$k$ | top-$p$ | GAN |
|---|---|---|---|---|---|---|---|
| HUSE | 0.58 | 0.04 | 0.55 | 0.04 | 0.32 | 0.32 | 0.003 |
| HUSE-D | 0.88 | 0.12 | 0.98 | 0.18 | 0.59 | 0.65 | 0.25 |
| HUSE-Q | 0.70 | 0.92 | 0.58 | 0.86 | 0.73 | 0.67 | 0.75 |

## 5 Evaluation

### 5.1 Experimental Setup

Dataset and Task. We primarily evaluate loss truncation on abstractive summarization in the form of generating news headlines from an article. We selected this task to highlight that loss truncation can improve sample quality and factual accuracy, while also achieving the secondary goal of diversity for abstractive systems see2017get; kryscinski2019neural.

We evaluated on the Gigaword summarization task rush2017neural, as in gehrmann2018bottom. While there are other summarization datasets, we chose Gigaword for the following reasons. First, it is large enough that sample quality defects are not caused by a lack of data. Second, the dataset is structured so that neither model capacity nor computation is the bottleneck in performance: standard sequence-to-sequence models are competitive on Gigaword. Third, while the Gigaword dataset is known to have noise, this noise matches known annotation errors beigman2009learning; klebanov2010some and uncertainty kryscinski2019neural.

To show that loss truncation is applicable beyond summarization, we also performed a preliminary evaluation of our approach on the E2E NLG task. In E2E, the goal is to generate restaurant reviews from meaning representations dusek2019e2e.

Model and Baselines. We used a standard LSTM architecture with global attention for summarization that has been used for the Gigaword summarization task in the past gehrmann2018bottom. The learning rate and hyperparameters are given in Appendix C. For the E2E task, we use a standard model with the exact settings as in puzikov2018e2e.

For loss truncation on Gigaword, we used a fixed truncation fraction $c$. We matched the total number of training steps between loss truncation (including the hotstart) and standard log loss training. We sampled from the full model distribution for loss truncated models, except when rejection sampling.

As baselines on Gigaword, we generate from the log loss trained language model using several decoders that have been reported to mitigate low-quality outputs: beam search, top-$k$ sampling fan2018hierarchical, and top-$p$ sampling holtzman2019curious. We also evaluate direct sampling from the probabilistic model in order to estimate overall distinguishability and understand the diversity-quality trade-offs of each model.
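For reference, the two stochastic decoders can be sketched as filters over a next-token distribution (a simplified single-step version; real decoders apply this at every generation step):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize (top-k sampling)."""
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose total mass reaches p
    (top-p / nucleus sampling)."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1  # tokens needed to reach mass p
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
print(top_k_filter(probs, 2))    # mass only on the two likeliest tokens
print(top_p_filter(probs, 0.9))  # smallest nucleus covering 90% of the mass
```

Both filters trade diversity for quality post hoc, which is exactly the behavior compared against loss truncation in the experiments.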

Finally, on Gigaword, we also compared against a recent generative adversarial network (GAN) model with a publicly available implementation wang2018learning.

Human-evaluation metrics.
We evaluate whether loss truncation improves model distinguishability on summarization by measuring the HUSE estimator for TV hashimoto2019unifying. HUSE measures distinguishability by learning a classifier over the log-probabilities and human evaluation scores of both samples from the model and references. We also use HUSE to evaluate the quality-diversity tradeoffs of the models by estimating both HUSE-Q (which measures quality via human judgment) and HUSE-D (which measures diversity via statistical evaluation).

In order to assess whether this leads to improvements in the faithfulness of samples, we measure whether loss truncation reduces the number of *factually inaccurate* outputs from the model via a crowdsourced survey. We designed our prompt based on an earlier factual accuracy human evaluation novikova2017we and measured whether the original article contained all of the information given in the generated title. We describe the crowd worker setup in Appendix D.
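To illustrate the idea behind such an estimator, the sketch below computes a HUSE-style score as twice the leave-one-out k-NN classification error over two features per sample. This is a simplified sketch with hypothetical features, not the exact published estimator (see hashimoto2019unifying for that):

```python
import numpy as np

def huse_sketch(model_feats, ref_feats, k=3):
    """Simplified HUSE-style estimate: twice the leave-one-out k-NN error of
    a classifier separating model samples from references, where each point
    is featurized by, e.g., (log-probability, human quality score)."""
    X = np.vstack([model_feats, ref_feats])
    y = np.array([0] * len(model_feats) + [1] * len(ref_feats))
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                    # leave the point itself out
        nn = np.argsort(d)[:k]
        pred = int(y[nn].sum() > k / 2)  # majority vote among neighbors
        errors += pred != y[i]
    return 2 * errors / len(X)

rng = np.random.default_rng(0)
# Hypothetical features: indistinguishable populations give a score near 1,
# well-separated populations give a score near 0.
same = huse_sketch(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
apart = huse_sketch(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 5)
print(same, apart)
```

Intuitively, a model whose samples cannot be separated from references by either their probabilities or their human ratings attains a high score, matching the definition of distinguishability in §2.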

Automated metrics. While human evaluation is our primary metric of evaluation as it is considered gold-standard, we additionally evaluate on automated metrics to contextualize our human evaluation results. We measure ROUGE-L lin2003automatic for summarization and BLEU score papineni2002bleu for E2E.

### 5.2 Loss Truncation Outperforms Baselines on HUSE

Using the HUSE score to measure the TV distance, we assessed whether loss truncation successfully improved our model in terms of distinguishability compared to log loss. As shown in Table 2, loss truncation outperforms all baselines on HUSE score (including the original log loss model Full samp), suggesting the truncated model is a better language model than the log loss model as measured by distinguishability.

We find that loss truncation improves over the log loss by increasing generation quality (HUSE-Q) by 12% without substantially lowering diversity (e.g., without memorizing examples from the training set). These results affirmatively answer an open question posed by hashimoto2019unifying: whether it is possible to obtain models that improve quality while maintaining overall distinguishability compared to log loss trained models. Post-hoc modification of the log loss model's distribution by removing unlikely words using either top-$k$ or top-$p$ sampling results in substantial losses in HUSE due to losses in diversity.

We further considered matching the entropy of the loss truncation model with top-$k$ and top-$p$ sampling (Appendix C). At a fixed entropy, loss truncation can outperform on HUSE by up to 26%.

Comparing models with high sample quality, loss truncation with rejection
sampling improves upon all baselines (including beam search) in terms of raw
human quality evaluation (HUSE-Q), and we see that the Pareto frontier of
truncation and rejection sampling (which can be achieved via ensembling)
dominates the baselines on *both* quality and diversity
(Figure 5).
Rejection sampling decreases overall HUSE score because it is designed to only
return high quality samples (i.e., high HUSE-Q): this comes at the cost of
reduced diversity, so overall HUSE score suffers.

The results amongst our baselines recapitulate known results for the quality-diversity tradeoffs of existing methods. Beam search has high sample quality but low diversity; top-$k$ and top-$p$ samplers provide diversity gains over beam search; and GANs generally underperform well-tuned log loss based models on both diversity and quality.

### 5.3 Loss Truncation with Rejection Sampling Produces High Quality Outputs

We now ask whether improvements in distinguishability (as measured by HUSE) for the loss truncation model translate to practical improvements in sample quality, such as the factual accuracy of generated outputs in summarization. We evaluate this through a crowdsourced study on factual accuracy.

Since we are interested in studying whether our model can produce high quality samples, we used rejection sampling with a small rejection level $\alpha$ to obtain high-quality samples from the model. We compare this to the log loss model with baseline decoders. For the top-$k$ and top-$p$ sampling decoders that have quality-diversity tradeoffs, we select $k$ and $p$ such that the entropy of the sampling distribution matches our rejection sampling approach (see Appendix C for details).

| Condition | Mean score |
|---|---|
| Human | 3.63 ± 0.05 |
| Truncation + Rejection | 3.79 ± 0.06 |
| Beam | 3.51 ± 0.05 |
| top-$k$ | 3.42 ± 0.05 |
| top-$p$ | 3.29 ± 0.05 |
| Sampling | 2.96 ± 0.05 |

Mean scores and standard errors of factuality in generated news titles given articles. As shown, rejection sampling outperforms all baselines and matches the human reference score.

To measure factual accuracy, we asked crowd workers how much information in the
generated titles was contained in the article in a similar fashion to
novikova2017we. Table 3 shows the average factual
accuracy rating for each model. We find that rejection sampling outperforms
*all* baselines, including the current gold standard of beam search, and
matches the human reference level of factual accuracy.

Although it may seem surprising that loss truncation and rejection sampling together can achieve the same factual accuracy score as humans, recall that over 34% of the dataset consists of titles which have facts that are not contained in the article. The loss truncation approach biases the model towards learning only the easily predicted (and likely factually accurate) titles.

### 5.4 Loss Truncation Produces Diverse Outputs

| Method | Example |
|---|---|
| Context | at least ## people have been killed and more than ##,### made homeless by floods that swept across southern africa in the past week , striking a region already grappling with severe food shortages . |
| Gold | floods kill ## in famine-hit southern africa |
| Loss truncation | at least ## people killed ##,### evacuated in floods in southern african region<br>floods that sweep parts of africa kill at least ## |
| Beam | flooding hits southern africa as deaths rise |
| Full sampling | child farming stalls in southern africa<br>earthquake kills ## in southern africa |
| top-$k$ | torrential rains prompt warnings in southern africa<br>toll nears ## in southern africa |
| top-$p$ | at least ## killed ##,### homeless in southern africa floods<br>at least ## dead ##,### homeless as floods hit southern africa |

Finally, one of the benefits of optimizing for distinguishability is that it naturally optimizes for both diversity and quality. Manually examining outputs from the models, we find that directly sampling from the loss truncated model often produces high quality and diverse outputs. We show examples of generated outputs for baselines and loss truncation in Table 4. Loss truncation uses different phrasings ('at least # killed' and 'floods sweep'), while top-$p$ follows a nearly templated pattern with a few changes to the words that appear. Top-$k$ and direct sampling both have diverse phrasings, but also hallucinate facts ('earthquake' in sampling and 'torrential rains' in top-$k$ sampling).

### 5.5 Loss Truncation can Outperform on Automated Metrics

While our primary evaluation metrics are human evaluations (HUSE and factuality), we additionally investigate automated metrics to further contextualize our results. For summarization, we used ROUGE-L and for E2E we use BLEU score for the automated metrics.

For summarization, the ROUGE-L scores for loss truncation and entropy-matched top-$k$ and top-$p$ decoding were 23.2, 22.8, and 22.8, respectively. While loss truncation does not substantially improve ROUGE-L, we see that it still outperforms the baselines. We do not expect reference-based evaluations to fully capture the benefits of loss truncation, as these metrics encourage models to fully imitate the data distribution, including invalid and hallucinated examples.

For E2E, the BLEU scores for loss truncation and the baseline were 0.72 and 0.64 respectively. We confirmed that the baseline model for the E2E task achieves a similar score as reported by balakrishnan2019constrained. Perhaps surprisingly, improving BLEU score to 0.72 almost closes the gap to using complex tree-structured semantic representations, which achieves a BLEU score of 0.74 balakrishnan2019constrained.

## 6 Related Work

Decoder-based diversity. Researchers have proposed a variety of models for text generation radford2019language; keskar2019ctrl; sutskever2014sequence. These models generate text using decoding methods such as beam search. While beam search is generally thought of as the gold standard tillmann2003word, it can produce generic and repetitive outputs holtzman2019curious. To achieve diversity, top-$k$ fan2018hierarchical and top-$p$ holtzman2019curious sampling stochastically decode the output after restricting the output space to avoid low-quality outputs.

While these techniques can improve generation quality, they rely on models trained via log loss, which we show can result in undesired behavior that cannot be fixed post-hoc. Our work is complementary to existing work on decoders by proposing a loss that can improve the probabilistic models which these decoders operate on.

Loss modifications. Prior work has identified specific issues in generative models, such as repetitiveness, and proposed loss modifications to address these specific issues in the context of long text generation welleck2019neural; holtzman2018learning. In contrast, we identify an issue with the widely used log loss, and propose loss truncation, which does not require a task- and issue-specific modification. Many of the penalties and decoding techniques proposed in these earlier works can be combined with truncated log loss to obtain models that are more robust to noisy references.

Contemporaneous with our work, tian2019sticking propose an attention weight approach to improving generation faithfulness via decoder and loss modifications. Our work complements this by providing a conceptual basis for improving faithfulness by ignoring examples (i.e., optimizing distinguishability), and providing a simple and general loss. We consider complex, model dependent loss truncation methods for optimizing distinguishability to be exciting future work.

Other generation methods optimize for task-specific losses och2003minimum; shen2015minimum. Task specific losses are not known in many cases and thus we require an effective task-agnostic loss, e.g., log loss or TV. We show that TV acts as a useful task-agnostic goodness of fit measure, and we provide an improved alternative to log loss.

GANs. GANs have been proposed to learn models that minimize distinguishability li2017adversarial; rajeswar2017adversarial; dai2017towards. While GANs have been successful in generating images goodfellow2014generative; brock2018large, they remain challenging to optimize for text due to its discrete nature. Our findings match earlier reports that GANs underperform log loss trained sequence-to-sequence models caccia2018language. In this work, we show that better training methods for distinguishability can arise from modifying the standard log loss via truncation.

Robust learning. Robust learning is the study of learning in the face of outliers tukey1960survey; donoho1982breakdown; huber1992robust. Our work is related to the -contamination model, in which an fraction of the data has been modified, potentially by an adversary diakonikolas2018sever. Our work shows that robust learning under log loss can result in improved empirical performance and bounds on distinguishability.

While there are a number of effective approaches to robust learning diakonikolas2018sever; fischler1981random, we focus on a simple truncation procedure, as it is one of the few procedures scalable enough to apply to large-scale generation datasets. Our work shows that more effective, scalable robust learning procedures can help improve natural language generation methods.
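As a concrete illustration, the simple truncation procedure can be sketched in a few lines. This is a minimal numpy sketch assuming hard per-batch truncation of the highest-loss examples; the function name, the drop fraction `c`, and the tie-handling fallback are illustrative, not the exact training code:

```python
import numpy as np

def truncated_loss(per_example_losses, c=0.1):
    """Hard truncation sketch: drop the top-c fraction of per-example
    log losses in a batch so that outliers (e.g., noisy references)
    do not dominate the gradient. Illustrative, not the exact recipe."""
    losses = np.asarray(per_example_losses, dtype=float)
    k = int(np.ceil(c * len(losses)))      # number of examples to drop
    if k == 0:
        return losses.mean()
    cutoff = np.partition(losses, -k)[-k]  # k-th largest loss in the batch
    kept = losses[losses < cutoff]
    # If every loss ties at the cutoff, fall back to the full batch.
    return kept.mean() if len(kept) else losses.mean()

# A noisy reference with loss 50.0 no longer dominates the batch loss.
print(truncated_loss([2.1, 1.8, 50.0, 2.4], c=0.25))  # mean of [2.1, 1.8, 2.4]
```

In a training loop, the kept examples (rather than the full batch) would contribute to the gradient step.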

## 7 Conclusion

In this work, we show that log loss is not robust to noise, which can in turn cause undesired behaviors such as hallucinating facts in summarization. In response, we propose loss truncation, a robust training method that optimizes for the distinguishability of generated samples. We additionally propose a sequence-level rejection sampling scheme for generating high-quality sequences. We show that loss truncation outperforms a range of baselines (including beam search, top-k, top-p, and full sampling) on distinguishability. We additionally show that rejection sampling outperforms all baselines, including beam search, on generating factual summaries. These results suggest that robust learning in the form of truncating the log loss can complement model-based approaches to faithful generation by ignoring invalid and undesired references.

## References

## Appendix A Examples of Titles and Generations

Examples of ground truth titles. We present examples of titles in Figure 6, both ones that require factual hallucination and ones that can be directly entailed from the context.

Examples of generated titles. We present examples of titles that were selected by rejection sampling and ones that were rejected, in Figure 7. As shown, rejected titles tend to be of lower quality.
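For intuition, the selection step can be sketched as follows. This is a hypothetical implementation assuming candidates are scored by a length-normalized sequence loss and rejected above a quantile threshold; the scoring and threshold choices are assumptions, not the paper's exact procedure:

```python
import numpy as np

def rejection_sample(candidates, seq_losses, quantile=0.5):
    """Sequence-level rejection sampling sketch: reject candidates
    whose sequence loss exceeds a quantile threshold, then return
    the lowest-loss survivor. Illustrative assumptions throughout."""
    losses = np.asarray(seq_losses, dtype=float)
    threshold = np.quantile(losses, quantile)
    survivors = [(c, l) for c, l in zip(candidates, losses) if l <= threshold]
    best, _ = min(survivors, key=lambda pair: pair[1])
    return best

titles = ["title A", "title B", "title C", "title D"]
losses = [1.2, 0.7, 3.5, 2.0]            # hypothetical per-title losses
print(rejection_sample(titles, losses))  # the lowest-loss title survives
```

High-loss candidates (like "title C" above) are the ones that would be rejected and shown as lower-quality examples.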

## Appendix B Proof of Lemma and Proposition

Lemma. We prove the lemma that all elements of the truncated neighborhood are close to the reference distribution in total variation.

###### Lemma 1.

###### Proof.

By definition of , for any there exists a such that so,

∎

Proposition. We prove that the truncated log loss bounds total variation.

###### Proof.

$$\mathrm{TV}(p, p_\theta) \le \mathrm{TV}(p, \tilde{p}) + \mathrm{TV}(\tilde{p}, p_\theta) \qquad (4)$$

$$\le \mathrm{TV}(p, \tilde{p}) + \sqrt{\tfrac{1}{2}\,\mathrm{KL}(\tilde{p} \,\|\, p_\theta)} \qquad (5)$$

$$\le c + \sqrt{\tfrac{1}{2}\,\mathrm{KL}(\tilde{p} \,\|\, p_\theta)} \qquad (6)$$

where $\tilde{p}$ denotes the truncated reference distribution: (4) follows from the triangle inequality, (5) from Pinsker's inequality, and (6) from using Lemma 1 to bound the remaining term by $c$. ∎

## Appendix C Hyperparameters

Summarization model hyperparameters. We used a standard OpenNMT-py model with global attention for all sequence-to-sequence experiments klein2017opennmt. It has a single LSTM layer in the encoder and two in the decoder.

For the baseline model, we train for 200,000 steps with SGD and an initial learning rate of . For the loss-truncated model, we hotstart with 100,000 minibatch updates of the standard log loss and subsequently train for 100,000 minibatch updates with the truncated loss, with an initial learning rate of .

k and p selection. Key parameters in top-k and top-p sampling are k and p, respectively. These parameters trade off between diversity and quality. To select these values, we chose values of k and p that had entropies similar to that of our model trained with loss truncation.

Specifically, one pair of k and p values matched the entropy of loss truncation for summarization, and a second pair matched the entropy of rejection sampling for summarization.
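The entropy-matching selection described above can be sketched as follows. This is a minimal numpy sketch assuming natural-log entropy over a single next-token distribution; the function names are illustrative, and the actual selection would use the trained model's distributions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def top_k_dist(p, k):
    """Renormalized distribution over the k most probable tokens."""
    p = np.asarray(p, dtype=float)
    keep = np.argsort(p)[-k:]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def match_k_to_entropy(p, target_entropy):
    """Pick the smallest k whose truncated distribution reaches the
    target entropy (e.g., the entropy of a loss-truncated model)."""
    for k in range(1, len(p) + 1):
        if entropy(top_k_dist(p, k)) >= target_entropy:
            return k
    return len(p)

p = [0.5, 0.25, 0.15, 0.07, 0.03]        # hypothetical token distribution
print(match_k_to_entropy(p, target_entropy=1.0))
```

Larger k admits more of the tail and raises entropy, so scanning k upward until the target entropy is reached yields the matched value; the same scan works for p.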

## Appendix D Crowd Worker Setup and Prompts

Crowdsourcing setup. For all human evaluations, we used Amazon Mechanical Turk (all prompts shown below). We sampled 312 context/title pairs to measure HUSE. For each generated title, we asked 9 crowd workers to measure the typicality of the generated title, as in hashimoto2019unifying. Each crowd worker responded to 24 generated titles.

For measuring factuality, we sampled 312 examples and for each example, we asked two crowd workers how much information in the generated title was present in the article.

Prompts. We show crowd worker prompts for measuring HUSE and factuality in Figure 8. The HUSE prompt was directly taken from hashimoto2019unifying with an extra control.

## Appendix E Further experiments

### E.1 Sensitivity to c

| Condition | ROUGE-L |
| --- | --- |
| Truncation, | 24.3 |
| Truncation, | 24.9 |
| Truncation, | 24.0 |
| Truncation, | 23.2 |
| top-k | 22.8 |
| top-p | 22.8 |

| Condition | BLEU |
| --- | --- |
| Truncation, | 0.72 |
| Truncation, | 0.71 |
| Truncation, | 0.70 |
| Truncation, | 0.69 |
| Truncation, | 0.69 |
| Baseline | 0.64 |

We investigate the sensitivity of loss truncation to the truncation hyperparameter c. To do so, we vary c and measure ROUGE-L and BLEU scores for summarization and E2E, respectively, as shown in the tables above.

### E.2 Combining Loss Truncation and Decoders

| Condition | ROUGE-L |
| --- | --- |
| Log-loss, beam | 41.4 |
| Log-loss, full sampling | 27.9 |
| Truncation, top- | 33.4 |
| Truncation, top- | 38.9 |
| Truncation, top- | 35.1 |
| Truncation, top- | 40.9 |

As loss truncation is a training method, it can be combined with alternative decoding methods at inference time. As such, we perform a preliminary investigation of using top-k and top-p decoding with loss truncation.

We show the ROUGE-L of loss truncation combined with various decoders, along with baselines, for summarization in Table 7. As shown, top-k and top-p decoding are compatible with loss truncation and can improve sample quality.
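For reference, the top-p (nucleus) filtering step used by such decoders can be sketched as follows; a minimal numpy sketch over a single next-token distribution, illustrative only:

```python
import numpy as np

def top_p_filter(p, top_p=0.9):
    """Nucleus (top-p) filtering sketch: keep the smallest set of
    tokens whose cumulative probability reaches top_p, renormalize."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)[::-1]              # tokens, most probable first
    cum = np.cumsum(p[order])
    cutoff = np.searchsorted(cum, top_p) + 1  # number of tokens to keep
    q = np.zeros_like(p)
    q[order[:cutoff]] = p[order[:cutoff]]
    return q / q.sum()

probs = [0.05, 0.6, 0.25, 0.1]   # hypothetical next-token distribution
print(top_p_filter(probs, top_p=0.8))
```

With top_p = 0.8 only the two most probable tokens survive here; sampling then proceeds from the renormalized distribution, independent of how the model was trained, which is why the filter composes cleanly with loss truncation.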