Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation

by Ran Tian, et al.

Neural conditional text generation systems have achieved significant progress in recent years, showing the ability to produce highly fluent text. However, the inherent lack of controllability in these systems allows them to hallucinate factually incorrect phrases that are unfaithful to the source, which often makes them unsuitable for real-world systems that require high degrees of precision. In this work, we propose a novel confidence-oriented decoder that assigns a confidence score to each target position. This score is learned in training using a variational Bayes objective, and can be leveraged at inference time using a calibration technique to promote more faithful generation. Experiments on a structured data-to-text dataset, WikiBio, show that our approach is more faithful to the source than existing state-of-the-art approaches, according to both automatic metrics and human evaluation.



1 Introduction

Conditional text generation is the task of generating some target text $y$ conditioned on source content $x$. It is the essence of many natural language processing problems, such as text summarization (mani1999advances), where $x$ is a long document and $y$ is a more concise version, machine translation (koehn2009statistical), where $x$ and $y$ represent equivalent text in different languages, and data-to-text generation (kukich1983design; mckeown1992text), where $x$ is a structured table and $y$ is a textual description.

While traditionally done with template-based approaches (becker2002practical; foster2004techniques; gatt2009simplenlg; reiter2005choosing), conditional text generation is now commonly tackled with neural encoder-decoder approaches (sutskever2014sequence; cho2014learning; bahdanau2014neural). In this formulation, the source content is encoded with a neural architecture, and the decoder autoregressively produces a token at each output position based on its internal state and the source representation. By leveraging continuous representations with rich non-linearities, encoder-decoder approaches can generate highly fluent text (rush2015neural; radford2019language) without the need for cumbersome handcrafted rules and templates.

However, encoder-decoder architectures are inherently difficult to control, and have been shown to be prone to hallucination, i.e., generating text that is fluent but unfaithful to the source (vinyals2015neural; koehn2017six; wiseman2017challenges; lee2018hallucinations). This severe shortcoming can often limit the use of neural approaches in many real world systems, where it is not acceptable to produce output that is even occasionally unfaithful.

In this work, we focus on data-to-text generation since the structured form of source content makes it relatively easy to evaluate faithfulness using both human evaluation and domain-specific automatic metrics (dhingra2019handling). In particular, we focus on the WikiBio (lebret2016neural) dataset, where the task is to generate a sentence summarizing a tabular biography of a person. Figure 1 shows an example.

First, note that the reference contains information such as “bonanno crime family” and “informant” that is true but cannot be inferred from the source. This source-reference divergence exists in many large-scale generation datasets (wiseman2017challenges; dhingra2019handling). Second, most generation systems are agnostic to this divergence and are trained to maximize the log-likelihood of the reference. This can often encourage the models to output phrases that are unsupported by the source. For example, Figure 1 shows the output of a state-of-the-art generation baseline, the Pointer-Generator network (see2017get), which contains the phrase “criminal defense attorney” that is false (but loosely related to “FBI” in the table). Thus, hallucination can often result from the coupling of model shortcomings (e.g., lack of formal reasoning, learning false correlations) and noise/artifacts in the training data.

Figure 1: Example in the WikiBio dataset (lebret2016neural) showing the biography of Frank Lino. The baseline Pointer-Generator (see2017get) exhibits hallucination.

In this work, we propose a confidence-oriented approach which assigns a learned confidence score to each decoder position, and then uses the score in two ways to reduce hallucination: (1) At test time, it uses the confidence score to adjust the output probabilities with a calibration technique (braverman2019calibration). (2) In training, we employ a variational Bayes objective to jointly learn the confidence score while allowing the model to skip tokens with a low confidence score, so as to avoid training on reference phrases that are difficult to infer from the source. In Figure 1, our approach leads to a faithful generation that omits the occupation.

Empirically, when evaluated on the WikiBio dataset (lebret2016neural), we show that our approach is considerably more faithful to the source than existing state-of-the-art solutions, according to both PARENT precision (dhingra2019handling) and human evaluation.

2 Related Work

Improving the fidelity and accuracy of text generation systems is an important research topic that has spawned a variety of different approaches. Some focus on blending extractive and abstractive approaches, e.g., allowing the model to copy tokens directly from the source (gu2016incorporating; see2017get), separating content selection from generation (zhou-etal-2017-selective; gehrmann2018bottom) and utilizing topic information from the source to make informed generation (narayan-etal-2018-dont).

Other approaches have proposed generating more accurate text using semiparametric approaches (guu2018generating; pandey2018exemplar; peng2019text), reinforcement learning-based rewards (paulus2018deep; pasunuru2018multi), semi-Markov models to learn neural templates (wiseman2018learning), content planning (puduppully2019data), and constrained vocabulary decoding (wu-aaai18). While many leverage the structure of the source (liu2017table) or task-based insights (puduppully2019entity), our approach is complementary in that it uses general machine learning techniques to build a confidence-oriented decoder that is more faithful to the source and robust to divergence/noise in the training data. Furthermore, many previous works rely on automatic metrics such as BLEU, which can be poorly correlated with human judgment of faithfulness (wiseman2017challenges; dhingra2019handling). In contrast, we evaluate on PARENT precision (dhingra2019handling), a metric specifically designed to capture faithfulness in data-to-text generation, and conduct a rigorous human evaluation to assess hallucination in our models.

3 Preliminaries

Before describing our approach, we first review the existing encoder-decoder framework (sutskever2014sequence; bahdanau2014neural) with one stop-gradient (SG) tweak. Let $x = x_1 x_2 \ldots x_S$ be the source input of length $S$ and $y = y_1 y_2 \ldots y_T$ be the target sequence of length $T$. Each token $x_s, y_t$ takes one value from a vocabulary $V$. Our goal is to model the conditional distribution $P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)$, where $y_{<t} = y_1 \ldots y_{t-1}$ is the prefix of $y$ up to the $(t-1)$-th token. The source can be encoded by any neural network function $\mathrm{enc}$, such as a convolutional neural network (CNN, lecun1990handwritten), long short-term memory (LSTM, hochreiter1997long), or Transformer (vaswani2017attention). Let $s_1, \ldots, s_S = \mathrm{enc}(x_1, \ldots, x_S)$.

Define $e_v \in \mathbb{R}^d$ as the $d$-dimensional embedding of token $v \in V$. Then, the probability of each target token is computed as:

$$P(y_t \mid y_{<t}, x) = \frac{\exp(o_t^\top e_{y_t})}{\sum_{v \in V} \exp(o_t^\top e_v)} \qquad (1)$$

$$o_t = a_t + h_t \qquad (2)$$

The first term on the right-hand side of Eq. 2 represents a Luong-style attention (defined in Eq. 3, DBLP:conf/emnlp/LuongPM15), while the second term represents the hidden state at position $t$, which is modelled with an RNN (defined in Eq. 4) [1]:

$$a_t = \sum_{s=1}^{S} \alpha_{t,s}\, s_s, \qquad \alpha_{t,s} = \mathrm{softmax}_s\big(\mathrm{score}(h_t, s_s)\big) \qquad (3)$$

$$h_t = \mathrm{RNN}\big(h_{t-1}, [e_{y_{t-1}}; \mathrm{SG}(a_{t-1})]\big) \qquad (4)$$

[1] While it is possible our approach could extend to other types of decoders, our current formulation of the confidence score uses this specific form of attention.

We have made one change to the conventional input-feeding approach as defined by bahdanau2014neural, and apply a stop-gradient (SG) to the attention vector $a_{t-1}$ above. In this work, we use SG as a control on the information flow during training, making our model match the intended design. The above prevents information at the current step from being propagated to previous attentions, which is intended by our attention score defined in Section 4.1.

4 Model

Our approach is based on the idea of a learned confidence score at every decoding position that is a balance of two factors:

  • How much the model should rely on the source for this position.

  • How much the model actually relies on the source for this position.

Intuitively, it is reasonable for a system to depend mostly on language modeling to predict function words that make the generation fluent, but it should consult more of the source data to predict content words. For example, given a partial generation “Christian Campbell is __”, one could predict that the next token is most likely “a”, “an” or “the”, based on language modeling. However, if a model predicts “American” as the next token to “Christian Campbell is an __”, it should be based on a field such as “Nationality: U.S.” in the source, rather than the language tendency that “American” is likely to appear after the phrase “is an”. A typical neural network can make predictions based on both reasons with little controllability; this, we contend, is a cause of hallucination.

In order to measure how much the model should rely on the source, we compare the encoder-decoder model with an unconditioned language model. The unconditioned language model does not have any access to the source data, so if it can predict a token as precisely as the encoder-decoder, that token is probably an element of a general linguistic construction that does not convey source information.

In order to measure how much the model actually relies on the source, we derive an attention score of the encoder-decoder from the attention mechanism. If a token is likely a content word (i.e., its generation probability under the encoder-decoder is much higher than under the unconditioned language model) but the attention score is low, then the token might not have been predicted based on the source, and could be a hallucination. Thus, we design the confidence score to be low in this case.

The confidence score is specified in Section 4.1. There are two ways we leverage the score:

  • At test time, we augment the generation probability with the confidence score, using a calibration technique (Section 4.2). This lets us place more weight on the confidence of generation, without sacrificing perplexity (braverman2019calibration).

  • In training, we allow the model to skip some tokens with low confidence score in order to avoid training on noisy references or scarce examples (Section 4.3). However, since the confidence score itself needs to be learned during training, we use a variational Bayes objective to formalize this symbiotic cycle (DBLP:journals/corr/KingmaW13).

Figure 2: Example of learned attention score, base-LM probability, and confidence score. For content words the base-LM probability is lower, and the confidence score depends more on the attention score.

4.1 Confidence Score

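We define the confidence score $C_t(y_t)$ as an interpolation between the encoder-decoder model $P(y_t \mid y_{<t}, x)$ and the unconditioned language model $P_B(y_t \mid y_{<t})$:

$$C_t(y_t) := A_t\, P(y_t \mid y_{<t}, x) + (1 - A_t)\, P_B(y_t \mid y_{<t}) \qquad (5)$$

where $A_t$ is the attention score measuring how much the encoder-decoder is paying attention to the source:

$$A_t := \frac{\|a_t\|}{\|a_t\| + \|h_t\|} \qquad (6)$$

Here, $a_t$ is the attention vector and $h_t$ is the decoder hidden state at position $t$ (Section 3), and $\|\cdot\|$ is the Euclidean norm.

For function words, we expect both $P(y_t \mid y_{<t}, x)$ and $P_B(y_t \mid y_{<t})$ to be high, so the confidence score defined in Eq. 5 will be high no matter what the attention score is. On the other hand, we expect $P(y_t \mid y_{<t}, x)$ to be higher than $P_B(y_t \mid y_{<t})$ for content words, so the confidence score will largely depend on the attention score in this case.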
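The behaviour described above can be checked numerically with a minimal sketch. It assumes the attention score is the share of the attention vector's Euclidean norm relative to the hidden state's, and that the confidence score interpolates the two token probabilities with that share as the weight. The function names and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def attention_score(attention_vec, hidden_vec):
    """Share of total norm carried by the attention vector (assumed form)."""
    a, h = l2_norm(attention_vec), l2_norm(hidden_vec)
    return a / (a + h) if (a + h) > 0 else 0.0


def confidence_score(attn_score, p_enc_dec, p_base_lm):
    """Interpolate encoder-decoder and base-LM probabilities of one token."""
    return attn_score * p_enc_dec + (1.0 - attn_score) * p_base_lm


# Function word: both models are confident, so attention barely matters.
function_word = confidence_score(0.1, p_enc_dec=0.90, p_base_lm=0.85)
# Content word with little attention on the source: confidence stays low.
unsupported = confidence_score(0.1, p_enc_dec=0.80, p_base_lm=0.01)
# Same content word, strongly attended: confidence is high.
supported = confidence_score(0.9, p_enc_dec=0.80, p_base_lm=0.01)

print(function_word, unsupported, supported)
```

With these made-up numbers, the function word keeps a high score even at low attention, while the content word's score moves with the attention score, which is the qualitative pattern the text describes.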

We refer to the unconditioned language model in the definition of the confidence score as the base-LM. In this work, we use an RNN for the base-LM, but modify the input-feeding of the RNN as follows:


Here, is the hidden-state of the base-LM, SG means stop-gradient, and the input embedding is weighted with a component of the previous confidence score. We found this weighting scheme to decrease dependence of the base-LM on content words, seemingly resulting in a model of soft templates (Figure 2).

Copy mechanism

In case the encoder-decoder is equipped with a copy mechanism, the generation probability is mixed with a probability of copying from the source (gu2016incorporating; see2017get):

$$\tilde{P}(y_t \mid y_{<t}, x) = p^{gen}_t\, P(y_t \mid y_{<t}, x) + (1 - p^{gen}_t) \sum_{s:\, x_s = y_t} b_{t,s} \qquad (8)$$

where $p^{gen}_t$ is the probability of doing generation instead of copying at step $t$, and $b_{t,s}$ is an attention weight that the copy mechanism is paying to position $s$ in the source. The sum is taken over all positions $s$ where the word $x_s$ is the same as $y_t$. When the copy mechanism is incorporated, we re-define the attention score as


and the confidence score is re-defined accordingly.
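As a rough illustration of the mixing step, the sketch below combines a vocabulary probability with copy attention summed over matching source positions, in the style of a pointer-generator. The function name, the dictionary-based vocabulary distribution, and all numbers are assumptions made for the example, not the authors' implementation.

```python
def mixed_probability(token, p_gen, p_vocab, copy_attention, source_tokens):
    """p_gen * P_vocab(token) + (1 - p_gen) * copy attention on matching positions."""
    copy_mass = sum(
        weight
        for weight, src in zip(copy_attention, source_tokens)
        if src == token
    )
    return p_gen * p_vocab.get(token, 0.0) + (1.0 - p_gen) * copy_mass


# Hypothetical source tokens and copy attention (sums to 1 over positions).
source = ["frank", "lino", "fbi", "frank"]
copy_attention = [0.4, 0.3, 0.1, 0.2]
# Hypothetical vocabulary probabilities for a few tokens.
p_vocab = {"frank": 0.05, "is": 0.6, "lino": 0.02}

p_frank = mixed_probability("frank", 0.3, p_vocab, copy_attention, source)
p_is = mixed_probability("is", 0.3, p_vocab, copy_attention, source)
print(p_frank, p_is)
```

A token that appears twice in the source ("frank" here) collects the copy attention from both positions, whereas a token absent from the source ("is") relies only on the generation term.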

4.2 Calibration

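In order to use the confidence score to promote faithful generation, we apply a calibration technique (braverman2019calibration) which augments the generation probability as follows:

$$\hat{P}^{\kappa}(y_t \mid y_{<t}, x) = \frac{\mathrm{SG}\big(P(y_t \mid y_{<t}, x)\big)\, \mathrm{SG}\big(C_t(y_t)\big)^{\kappa}}{\sum_{v \in V} \mathrm{SG}\big(P(v \mid y_{<t}, x)\big)\, \mathrm{SG}\big(C_t(v)\big)^{\kappa}} \qquad (10)$$

$\hat{P}^{\kappa}$ is a one-parameter family of probability distributions called the calibration of $P$ to $C_t$. Note that SG stops gradients so that only the parameter $\kappa$ is updated during training. In order to learn $\kappa$, we minimize the negative log-likelihood of $\hat{P}^{\kappa}$ jointly with $P$ and $P_B$:

$$L_{joint} = -\sum_{t=1}^{T} \Big[ \log \hat{P}^{\kappa}(y_t \mid y_{<t}, x) + \log P(y_t \mid y_{<t}, x) + \log P_B(y_t \mid y_{<t}) \Big] \qquad (11)$$

Since $P$ is a special case of $\hat{P}^{\kappa}$ (namely at $\kappa = 0$), the training perplexity of $\hat{P}^{\kappa}$ is at most that of $P$. In practice, $\kappa$ is initialized as $0$ and found to converge to a positive value (Section 5.4). Therefore, the calibration trick can improve confidence of generation without sacrificing perplexity.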
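The effect of calibration can be sketched as follows, assuming the calibrated distribution is proportional to the original token probability times the confidence score raised to a scalar power (called `kappa` here, a name chosen for this example). The token probabilities and confidences are hypothetical, and the sketch ignores gradient handling entirely.

```python
def calibrate(probs, confidences, kappa):
    """Return the distribution proportional to probs[v] * confidences[v] ** kappa."""
    weights = {v: probs[v] * confidences[v] ** kappa for v in probs}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


# Hypothetical next-token distribution and per-token confidence scores.
probs = {"american": 0.5, "actor": 0.3, "a": 0.2}
confidences = {"american": 0.1, "actor": 0.7, "a": 0.9}

unchanged = calibrate(probs, confidences, kappa=0.0)  # recovers probs
reweighted = calibrate(probs, confidences, kappa=1.0)  # favours confident tokens

print(unchanged)
print(reweighted)
```

Setting the exponent to zero returns the original distribution, which is why the calibrated family can never have worse training likelihood than the uncalibrated model at its optimum; a positive exponent shifts mass away from low-confidence tokens such as "american" in this toy case.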

4.3 Training with a Variational Bayes Objective

In practice, the training data of a conditional text generation task almost always contains noise and/or scarce examples. In order to reduce the impact of such outliers and train a confident generation model, we allow the model to skip some tokens in the training data when its confidence is low. For this purpose, we use the confidence score to sample a sub-sequence of each training target, and minimize the negative log-likelihood on that confident sub-sequence. However, since the confidence score itself needs to be trained, we use a variational Bayes objective to formalize the problem.
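The idea of training only on a confident sub-sequence can be sketched in a few lines. This is a deliberately simplified illustration: it assumes each token is kept independently with probability equal to its confidence score, and it treats per-token probabilities as fixed regardless of which earlier tokens were skipped. Neither assumption is taken from the paper, and all names and numbers are hypothetical.

```python
import math
import random


def sample_confident_subsequence(confidences, rng):
    """Keep position i with probability confidences[i] (an assumed sampling rule)."""
    return [i for i, c in enumerate(confidences) if rng.random() < c]


def subsequence_nll(token_probs, kept_indices):
    """Negative log-likelihood summed over the kept positions only."""
    return -sum(math.log(token_probs[i]) for i in kept_indices)


tokens = ["frank", "lino", "is", "an", "informant"]
confidences = [0.9, 0.9, 0.8, 0.8, 0.05]   # hypothetical confidence scores
token_probs = [0.7, 0.8, 0.9, 0.6, 0.01]   # hypothetical model probabilities

rng = random.Random(0)
kept = sample_confident_subsequence(confidences, rng)
kept_tokens = [tokens[i] for i in kept]
loss = subsequence_nll(token_probs, kept)
full_loss = subsequence_nll(token_probs, range(len(tokens)))

print(kept_tokens, loss, full_loss)
```

Because the hard-to-infer token ("informant" in this toy case) is rarely kept, its large loss term seldom contributes, so the model is not pushed to reproduce it.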

Figure 3: Example of sampling a sub-sequence according to the confidence score. Our variational Bayes objective combines the sampling probability $Q(z \mid y, x)$ and the generation probability $P(z \mid x)$.

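Specifically, for each target $y = y_1 \ldots y_T$, we define $z = y_{\iota(1)} \ldots y_{\iota(R)}$ as a latent sub-sequence of $y$, which consists of confident tokens of length $R \le T$. Here, $\iota: \{1, \ldots, R\} \to \{1, \ldots, T\}$ is an inclusion of indices. We assume that $z$ is generated by the calibrated probability $\hat{P}^{\kappa}$ (Section 4.2):

$$P(z \mid x) = \prod_{r=1}^{R} \hat{P}^{\kappa}(z_r \mid z_{<r}, x) \qquad (12)$$

Then, we connect $P(z \mid x)$ to the probability of the training target, $P(y \mid x)$. From Bayes' rule:

$$P(y \mid x) = \frac{P(y \mid z, x)\, P(z \mid x)}{P(z \mid y, x)} \qquad (13)$$

We assume that $P(y \mid z, x) = 1$ for every training example because the training data is uniquely given. Then, we regard $z$ as a sequential “keep/skip” labeling over $y$, and define a probability distribution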


to sample a sub-sequence according to the confidence score (Figure 3). Here, and are hyper-parameters. Now let


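and the idea is to use $Q(z \mid y, x)$ as an approximation to the unknown posterior $P(z \mid y, x)$ in Eq. 13. By taking $-\log$ of both sides in Eq. 13, using $P(y \mid z, x) = 1$, and trivially introducing $Q(z \mid y, x)$, we get

$$-\log P(y \mid x) = -\log P(z \mid x) + \log Q(z \mid y, x) + \log \frac{P(z \mid y, x)}{Q(z \mid y, x)} \qquad (16)$$

Then, we take the expectation $\mathbb{E}_{z \sim Q}$ of both sides and note that $\mathrm{KL}(Q \,\|\, P) = \mathbb{E}_{z \sim Q}\big[\log \frac{Q(z \mid y, x)}{P(z \mid y, x)}\big] \ge 0$, so

$$-\log P(y \mid x) \le \mathbb{E}_{z \sim Q}\big[ -\log P(z \mid x) + \log Q(z \mid y, x) \big] \qquad (17)$$

The variational Bayes objective is to minimize the upper bound on the right-hand side of Eq. 17. In practice, it is computationally expensive to explicitly calculate $\mathbb{E}_{z \sim Q}$ by enumerating all sub-sequences of $y$, because the number of sub-sequences is exponential in the length $T$. Thus, we apply a Monte Carlo method which calculates $\mathbb{E}_{z \sim Q}$ by sampling from $Q$. In order to back-propagate gradients through the expectation $\mathbb{E}_{z \sim Q}$ as well, the loss function is given as follows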



Here, the two hyper-parameters are the number of samples taken and a coefficient controlling how fast the gradients go through $Q$. However, since we define $P(z \mid x)$ in Eq. 12 by the calibrated probability, which only learns one parameter $\kappa$, we add joint learning terms into Eq. 18 to make the final objective:


In order to sample sub-sequences from , we apply the same Gumbel-max trick as in DBLP:conf/icml/KoolHW19. Although the random sampling is purely based on a learned probability distribution, without any constraints to make it fluent, surprisingly the model still learns to generate fluent text.
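For readers unfamiliar with the Gumbel-based trick, the sketch below shows its basic form on a flat categorical distribution: adding independent Gumbel noise to log-probabilities and taking the top-k indices yields k distinct samples without replacement. This is only the elementary building block, not the sequence-level procedure used for sub-sequences in this work; the function name and the example distribution are assumptions.

```python
import math
import random


def gumbel_top_k(log_probs, k, rng):
    """Sample k distinct indices without replacement via Gumbel perturbation."""
    perturbed = []
    for index, log_p in enumerate(log_probs):
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-300)
        gumbel_noise = -math.log(-math.log(u))  # standard Gumbel sample
        perturbed.append((log_p + gumbel_noise, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Hypothetical categorical distribution over three outcomes.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
rng = random.Random(0)

one_sample = gumbel_top_k(log_probs, 1, rng)    # a single categorical draw
two_distinct = gumbel_top_k(log_probs, 2, rng)  # two different indices

print(one_sample, two_distinct)
```

With k = 1 this reduces to the Gumbel-max trick, so repeated draws select each index with frequency close to its probability; with larger k the selected indices are guaranteed to be distinct.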

5 Experiments

Although our approach could apply to many conditional text generation tasks, in this work we consider data-to-text generation, in which the source is some structured data and the target is natural language text describing the data. Usually, the data have concise semantics and simple structure, which makes it easy to check the factual accuracy of the generation.

5.1 Dataset and Evaluation Metrics

The WikiBio dataset (lebret2016neural) contains 728,321 biographies paired with infoboxes, taken from the Sep-2015 dump of English Wikipedia, and split into train/valid/test sets in a ratio. The biography text is the first sentence of the Wikipedia page ( words on average). Infoboxes have non-empty fields on average.

For automatic metrics, we report BLEU (papineni2002bleu), as well as PARENT (dhingra2019handling), a metric designed to mitigate the shortcomings of BLEU on structured data-to-text generation.

For human evaluation, we obtain crowd-sourced annotations on 1000 examples randomly chosen from predictions on the WikiBio test set, the same 1000 for each model. We instruct raters to grade each of 3 criteria: faithfulness, coverage, and fluency, as below:

  • Faithfulness (precision) - We define a sentence to be faithful if all the information in the proposed sentence is supported by the table or the reference. A single hallucinated piece of information makes the sentence non-faithful.

  • Coverage (recall) - The number of table cells that contain information present in the sentence.

  • Fluency - A sentence is defined to be fluent if it is clear, natural, and grammatically correct. Raters choose among three options: Fluent, Mostly Fluent, Not Fluent.

An ideal system would always produce fluent and faithful text with high coverage.

5.2 Experiment Setting

Model Warmup Steps Learning Rate RNN Dropout RNN Dim. Beam Size
Confident BERT-to-RNN 40000 0.05 0.1 768 8 0.75 1/8 4 1/64
Confident Pointer-Generator None 0.0005 0.2 200 8 0.5 1/16 4 1/4
Table 1: Hyper-parameters. We did a hyper-parameter search for , and , within the range , and , respectively. Model selection is based on PARENT F1.

We compare the following systems:

  • BERT-to-BERT (rothe2019leveraging): A Transformer encoder-decoder model (vaswani2017attention) where the encoder and decoder are both initialized with BERT (devlin2018bert).

  • Structure Aware Seq2Seq (liu2017table): A state-of-the-art method on the WikiBio dataset in terms of BLEU.

  • Pointer-Generator (see2017get): A Seq2Seq with attention and copy mechanism (our implementation).

  • Confident BERT-to-RNN (This Work): A Transformer encoder initialized with BERT checkpoint, and a GRU (cho2014learning) decoder with our confident decoding method.

  • Confident Pointer-Generator (This Work): Pointer-Generator model with confident decoding.

We built our systems using TensorFlow (abadi2016tensorflow). Infoboxes are linearized into sequences, with field names and values separated by special tokens. For BERT-to-BERT and Confident BERT-to-RNN, we pre-trained a BERT checkpoint on the Books corpus (zhu2015aligning) only, since the original BERT was trained on Wikipedia, which overlaps with the test targets in WikiBio. For Pointer-Generator and Confident Pointer-Generator, we use GloVe (pennington2014glove) as the input word embedding, and the two models share the same hyper-parameter settings. For our confident decoding models, there are additional hyper-parameters, defined in Eq. 14 and Eq. 18. The hyper-parameters are given in Table 1. The optimizer is Adam (DBLP:journals/corr/KingmaB14).

5.3 Results

Table 2 shows the results. According to human evaluation, our approach gives a clear improvement in faithfulness over the baselines, with some drop in coverage. To further measure the validity of our confidence score, we postprocessed the output to remove words with confidence lower than 0.125 (the “+threshold=0.125” row in Table 2). This thresholding technique gives further gains in faithfulness, while sacrificing some fluency.
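The thresholding post-process can be sketched as a simple filter, assuming one confidence score is available per generated word. The threshold default of 0.125 is the value shown in the “+threshold=0.125” row of Table 2; the function name, tokens, and confidence values are hypothetical.

```python
def drop_low_confidence(tokens, confidences, threshold=0.125):
    """Keep only tokens whose confidence is at least the threshold."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [tok for tok, conf in zip(tokens, confidences) if conf >= threshold]


# Hypothetical generation with per-token confidence scores.
tokens = ["frank", "lino", "is", "an", "american", "attorney", "."]
confidences = [0.92, 0.90, 0.85, 0.80, 0.40, 0.05, 0.95]

filtered = drop_low_confidence(tokens, confidences)
print(" ".join(filtered))
```

Deleting words after the fact can leave an ungrammatical sentence, which is consistent with the reported trade-off between faithfulness and fluency.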

Among the automatic metrics, PARENT precision and recall seem correlated with faithfulness and coverage respectively, and our approach achieves the highest precision score. BLEU, perhaps because of its length penalty that rewards longer generations, seems more correlated with coverage than with faithfulness. Regarding the baselines, we see that BERT-to-BERT is the most fluent while Pointer-Generator is the most faithful, suggesting that pretraining might help fluency while the copy mechanism can be valuable for faithfulness.

                                          Automatic Metrics                              Human evaluation
Model                                     BLEU   PARENT (Prec. / Rec. / F1)  Avg Len.   Faithful %  Avg Cov.  Fluency %
BERT-to-BERT (rothe2019leveraging)        44.83  77.62 / 43.00 / 53.13       20.9       77.6        4.33      98.5 / 99.4
Structure-Aware Seq2Seq (liu2017table)    45.36  73.98 / 44.02 / 52.81       23.1       66.1        4.47      88.6 / 99.7
Pointer-Generator (see2017get)            41.07  77.59 / 42.12 / 52.10       19.1       80.3        4.24      93.1 / 96.0
Confident BERT-to-RNN (This Work)         33.30  77.98 / 37.21 / 47.90       16.6       85.2*       3.90      92.3 / 94.1
Confident Pointer-Generator (This Work)   38.10  79.52 / 40.60 / 51.38       17.0       86.8*       4.05      95.4 / 96.3
    +threshold=0.125                      36.62  80.15 / 39.59 / 50.50       16.4       90.7*       4.01      91.6 / 92.2
Table 2: Performance on WikiBio test set. Two Fluency measures differ in whether to include sentences graded as Mostly Fluent. Starred numbers are statistically significant against baselines (), by bootstrap test.

5.4 Ablations

Our confident decoding method has three novel components: (1) The use of a base-LM to define confidence score; (2) The calibration technique to adjust output probability; and (3) The variational Bayes objective to train a confident model. In this section, we assess the effects of each component by an ablative study. We start from the Confident Pointer-Generator, and in each test replace one component by a trivial alternative: (1) In order to assess the effects of using the base-LM in the confidence score, we instead use directly as confidence, and train models with the same hyper-parameter search. The results on WikiBio are shown in Table 3 as “No base-LM”. (2) We use instead of at test time (‘No calibration”), to assess the effects of calibration. The model is the same as Confident Pointer-Generator. The learned was . (3) Instead of the variational Bayes objective, we use the joint training loss in Eq. 11 without sampling sub-sequences from training targets (No variational).

As we can see from Table 3, all three components improve PARENT precision. While the improvement from calibration is the smallest, the technique also improves PARENT recall and BLEU score at the same time, making it an easy choice. The other techniques trade recall for precision, making them useful for tasks that require a high degree of faithfulness. When all three components are disabled, the model is exactly the same as our implementation of the Pointer-Generator, and every component improves PARENT precision over it as well. In particular, comparing Pointer-Generator with “No variational” shows again that joint training with calibration improves all metrics.

We also note that the average lengths of generations by our confident decoding models are shorter. While there exist heuristics such as length penalty (wu2016google) to encourage longer generation at inference time, shorter generation is not trivial. In the “Truncated” setting, we truncate predictions by the Pointer-Generator by two words each to match the average length of Confident Pointer-Generator. The PARENT precision of our confident decoding method is not trivially achieved by truncation.

Model                                     BLEU   PARENT (Prec. / Rec. / F1)  Avg Len.
Confident Pointer-Generator (This Work)   38.10  79.52 / 40.60 / 51.38       17.0
    No base-LM                            39.39  78.77 / 41.55 / 52.08       17.9
    No calibration                        37.89  79.47 / 40.47 / 51.26       16.9
    No variational                        41.29  78.25 / 42.40 / 52.52       18.9
Pointer-Generator                         41.07  77.59 / 42.12 / 52.10       19.1
    Truncated                             35.50  77.68 / 38.16 / 48.66       17.1
Table 3: Ablative tests on three components of our confident decoding method, and a truncation test.

6 Conclusion

In this work, we proposed a confidence-oriented decoder that achieved more faithful generation on the WikiBio dataset than existing state-of-the-art approaches. Our method is general in principle, so it could potentially be adapted to other forms of conditional text generation such as document summarization and machine translation. Another direction for future work could be enhancing the confidence score to move beyond shallow alignment and perform more complex logical inference (steedman2011combinatory; kamath2018survey).