1 Introduction
Conditional text generation is the task of generating some target text conditioned on source content. It is the essence of many natural language processing problems, such as text summarization (mani1999advances), where the source is a long document and the target is a more concise version; machine translation (koehn2009statistical), where source and target represent equivalent text in different languages; and data-to-text generation (kukich1983design; mckeown1992text), where the source is a structured table and the target is a textual description. While traditionally done with template-based approaches (becker2002practical; foster2004techniques; gatt2009simplenlg; reiter2005choosing), neural encoder-decoder approaches (sutskever2014sequence; cho2014learning; bahdanau2014neural) have recently become popular. In this formulation, the source content is encoded with a neural architecture, and the decoder autoregressively produces a token at each output position based on its internal state and the source representation. By leveraging continuous representations with rich non-linearities, encoder-decoder approaches can generate highly fluent text (rush2015neural; radford2019language) without the need for cumbersome hand-crafted rules and templates.
However, encoder-decoder architectures are inherently difficult to control, and have been shown to be prone to hallucination, i.e., generating text that is fluent but unfaithful to the source (vinyals2015neural; koehn2017six; wiseman2017challenges; lee2018hallucinations). This severe shortcoming often limits the use of neural approaches in real-world systems, where it is not acceptable to produce output that is even occasionally unfaithful.
In this work, we focus on data-to-text generation, since the structured form of the source content makes it relatively easy to evaluate faithfulness using both human evaluation and domain-specific automatic metrics (dhingra2019handling). In particular, we focus on the WikiBio (lebret2016neural) dataset, where the task is to generate a sentence summarizing a tabular biography of a person. Figure 1 shows an example.
First, note that the reference contains information, such as bonanno crime family and informant, that is true but cannot be inferred from the source. This source-reference divergence exists in many large-scale generation datasets (wiseman2017challenges; dhingra2019handling). Secondly, most generation systems are agnostic to this divergence and are trained to maximize the log-likelihood of the reference. This can encourage models to output phrases that are unsupported by the source. For example, Figure 1 shows the output of a state-of-the-art generation baseline, the Pointer-Generator network (see2017get), which contains the phrase criminal defense attorney that is false (but loosely related to FBI in the table). Thus, hallucination often results from the coupling of model shortcomings (e.g., lack of formal reasoning, learning false correlations) with noise/artifacts in the training data.
In this work, we propose a confidence-oriented approach which assigns a learned confidence score to each decoder position, and then uses the score in two ways to reduce hallucination: (1) At test time, it uses the confidence score to adjust the output probabilities via a calibration technique (braverman2019calibration). (2) In training, we employ a variational Bayes objective to jointly learn the confidence score while allowing the model to skip tokens with low confidence scores, to avoid training on reference phrases that are difficult to infer from the source. In Figure 1, our approach leads to a faithful generation that omits the occupation. Empirically, when evaluated on the WikiBio dataset (lebret2016neural), we show that our approach is considerably more faithful to the source than existing state-of-the-art solutions, according to both PARENT precision (dhingra2019handling) and human evaluation.
2 Related Work
Improving the fidelity and accuracy of text generation systems is an important research topic that has spawned a variety of approaches. Some focus on blending extractive and abstractive techniques, e.g., allowing the model to copy tokens directly from the source (gu2016incorporating; see2017get), separating content selection from generation (zhouetal2017selective; gehrmann2018bottom), and utilizing topic information from the source to inform generation (narayanetal2018dont).
Other approaches have proposed generating more accurate text using semi-parametric methods (guu2018generating; pandey2018exemplar; peng2019text), reinforcement-learning-based rewards (paulus2018deep; pasunuru2018multi), semi-Markov models that learn neural templates (wiseman2018learning), content planning (puduppully2019data), and constrained vocabulary decoding (wuaaai18). While many leverage the structure of the source (liu2017table) or task-based insights (puduppully2019entity), our approach is complementary in that it uses general machine learning techniques to build a confidence-oriented decoder that is more faithful to the source and robust to divergence/noise in the training data. Furthermore, many previous works rely on automatic metrics such as BLEU, which can be poorly correlated with human judgment of faithfulness (wiseman2017challenges; dhingra2019handling). In contrast, we evaluate on PARENT precision (dhingra2019handling), a metric specifically designed to capture faithfulness in data-to-text generation, and conduct a rigorous human evaluation to assess hallucination in our models.
3 Preliminaries
Before describing our approach, we first review the existing encoder-decoder framework (sutskever2014sequence; bahdanau2014neural), with one stop-gradient (SG) tweak. Let x = (x_1, …, x_n) be the source input of length n, and y = (y_1, …, y_m) be the target sequence of length m. Each token takes one value from a vocabulary V. Our goal is to model the conditional distribution P(y_t | x, y_{1:t-1}), where y_{1:t-1} is the prefix of y up to the (t-1)-th token. The source can be encoded by any neural network function enc, such as a convolutional neural network (CNN, lecun1990handwritten), a long short-term memory network (LSTM, hochreiter1997long), or a Transformer (vaswani2017attention). Let (h_1, …, h_n) = enc(x). Define e(w) as the d-dimensional embedding of token w. Then, the probability of each target token is computed as:
P(y_t = w | x, y_{1:t-1}) ∝ exp( e(w)ᵀ z_t )  (1)
where
z_t = a_t + s_t  (2)
The first term on the right-hand side of Eq. 2 represents a Luong-style attention (defined in Eq. 3, DBLP:conf/emnlp/LuongPM15), while the second term represents the hidden state at position t, modelled with an RNN (defined in Eq. 4). (Footnote 1: While it is possible our approach could extend to other types of decoders, our current formulation of the confidence score uses this specific form of attention.)
a_t = Σ_i softmax_i( s_tᵀ W h_i ) · h_i  (3)
s_t = RNN( s_{t-1}, [ e(y_{t-1}) ; SG(a_{t-1}) ] )  (4)
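To make the decoding step concrete, the following is a minimal forward-pass sketch in plain Python (toy dimensions; the bilinear matrix W is taken as the identity, the RNN state update of Eq. 4 is omitted, and stop-gradients are irrelevant since nothing is trained here). It illustrates the Luong-style attention and the additive combination of attention vector and hidden state, not the paper's actual TensorFlow implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(s_t, enc):
    # Luong-style attention (cf. Eq. 3): scores are dot products between
    # the decoder state and each encoder vector (W = identity here).
    weights = softmax([dot(s_t, h) for h in enc])
    d = len(enc[0])
    return [sum(w * h[k] for w, h in zip(weights, enc)) for k in range(d)]

def decoder_step(s_t, enc, emb):
    # z_t combines the attention vector and the hidden state (cf. Eq. 2),
    # and each token is scored by the dot product with its embedding (Eq. 1).
    a_t = attention(s_t, enc)
    z_t = [a + s for a, s in zip(a_t, s_t)]
    return softmax([dot(emb[w], z_t) for w in sorted(emb)])

# Toy example: 2-dimensional states, two source positions, 3-word vocabulary.
enc = [[1.0, 0.0], [0.0, 1.0]]
emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
probs = decoder_step([0.2, 0.8], enc, emb)
print(probs)  # a distribution over the vocabulary, summing to 1
```

Because z_t adds the attention vector to the hidden state, the relative magnitudes of the two summands carry information about how much the prediction draws on the source; the confidence score of Section 4.1 exploits exactly this.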
We have made one change to the conventional input-feeding approach as defined by bahdanau2014neural: we apply a stop-gradient (SG) to the attention vector above. In this work, we use SG as a control on the information flow during training, making the model match the intended design. The SG above prevents information at the current step from being propagated back into previous attentions, which is intended by our attention score defined in Section 4.1.
4 Model
Our approach is based on the idea of a learned confidence score at every decoding position that is a balance of two factors:

- How much the model should rely on the source for this position.
- How much the model actually relies on the source for this position.
Intuitively, it is reasonable for a system to depend mostly on language modeling to predict the function words that make the generation fluent, but it should consult the source data to predict content words. For example, given a partial generation “Christian Campbell is __”, one could predict that the next token is most likely “a”, “an” or “the”, based on language modeling alone. However, if a model predicts “American” as the next token in “Christian Campbell is an __”, the prediction should be based on a field such as “Nationality: U.S.” in the source, rather than on the linguistic tendency of “American” to appear after the phrase “is an”. A typical neural network can make predictions for both reasons with little controllability; this, we contend, is a cause of hallucination.
In order to measure how much the model should rely on the source, we compare the encoderdecoder model with an unconditioned language model. The unconditioned language model does not have any access to the source data, so if it can predict a token as precisely as the encoderdecoder, that token is probably an element of a general linguistic construction that does not convey source information.
In order to measure how much the model actually relies on the source, we derive an attention score for the encoder-decoder from its attention mechanism. If a token is likely a content word (i.e., its generation probability under the encoder-decoder is much higher than under the unconditioned language model), but the attention score is low, then the token might not be predicted based on the source, and could be a hallucination. Thus, we design the confidence score to be low in this case.
The confidence score is specified in Section 4.1. There are two ways we leverage the score:

- At test time, we augment the generation probability with the confidence score, using a calibration technique (Section 4.2). This allows us to put more weight on confident generations without sacrificing perplexity (braverman2019calibration).
- In training, we allow the model to skip some tokens with low confidence scores in order to avoid training on noisy references or scarce examples (Section 4.3). However, since the confidence score itself needs to be learned during training, we use a variational Bayes objective to formalize this symbiotic cycle (DBLP:journals/corr/KingmaW13).
4.1 Confidence Score
We define the confidence score C(y_t) as an interpolation between the encoder-decoder model P(y_t | x, y_{1:t-1}) and the unconditioned language model P_LM(y_t | y_{1:t-1}):
C(y_t) = A_t · P(y_t | x, y_{1:t-1}) + (1 − A_t) · P_LM(y_t | y_{1:t-1})  (5)
where A_t is the attention score measuring how much the encoder-decoder is paying attention to the source:
A_t = ‖a_t‖ / ( ‖a_t‖ + ‖s_t‖ )  (6)
Here, ‖·‖ is the Euclidean norm.
For function words, we expect both the encoder-decoder probability and the language-model probability to be high, so the confidence score defined in Eq. 5 will be high no matter what the attention score is. On the other hand, for content words we expect the encoder-decoder probability to be higher than the language-model probability, so the confidence score will largely depend on the attention score.
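As a small numerical sketch of the confidence computation described above (plain Python; the probabilities and vectors are made-up toy values, not model outputs):

```python
import math

def euclidean_norm(v):
    return math.sqrt(sum(x * x for x in v))

def attention_score(a_t, s_t):
    # Cf. Eq. 6: how much of the decoder's signal comes from the
    # source-side attention vector relative to the hidden state.
    na, ns = euclidean_norm(a_t), euclidean_norm(s_t)
    return na / (na + ns)

def confidence(p_encdec, p_base_lm, a_t, s_t):
    # Cf. Eq. 5: interpolate between the encoder-decoder probability and
    # the unconditioned base-LM probability, weighted by the attention score.
    A = attention_score(a_t, s_t)
    return A * p_encdec + (1.0 - A) * p_base_lm

# A content word predicted with high encoder-decoder probability but a weak
# attention vector gets a low confidence score: the suspected-hallucination case.
weak_attention = confidence(0.9, 0.1, a_t=[0.1, 0.1], s_t=[2.0, 2.0])
strong_attention = confidence(0.9, 0.1, a_t=[2.0, 2.0], s_t=[0.1, 0.1])
print(weak_attention, strong_attention)
```

With a weak attention vector, even a token the encoder-decoder strongly predicts receives low confidence unless the base-LM also predicts it, which is exactly the behavior motivated above.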
We refer to the unconditioned language model in the definition of the confidence score as the base-LM. In this work, we use an RNN for the base-LM, but modify its input-feeding as follows:
(7)
Here, the recurrence updates the hidden state of the base-LM, SG means stop-gradient, and the input embedding is weighted by a component of the previous confidence score. We found this weighting scheme to decrease the dependence of the base-LM on content words, seemingly resulting in a model of soft templates (Figure 2).
Copy mechanism
In case the encoder-decoder is equipped with a copy mechanism, the generation probability is mixed with the probability of copying from the source (gu2016incorporating; see2017get):
P(y_t | x, y_{1:t-1}) = p_t · P_gen(y_t | x, y_{1:t-1}) + (1 − p_t) · Σ_{i : x_i = y_t} α_{t,i}  (8)
where p_t is the probability of generating instead of copying at step t, and α_{t,i} is the attention weight that the copy mechanism is paying to position i in the source. The sum is taken over all source positions i where the word x_i is the same as y_t. When the copy mechanism is incorporated, we redefine the attention score as
(9)
and the confidence score is redefined accordingly.
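The copy-mechanism mixture in Eq. 8 can be sketched as follows (plain Python; the tokens, attention weights, and probabilities are hypothetical):

```python
def copy_mixture(p_gen, gen_probs, attn_weights, source_tokens, word):
    # Cf. Eq. 8: the final probability of `word` is a mixture of generating
    # it from the decoder's vocabulary distribution and copying it from any
    # source position holding the same word.
    copy_prob = sum(w for tok, w in zip(source_tokens, attn_weights) if tok == word)
    return p_gen * gen_probs.get(word, 0.0) + (1.0 - p_gen) * copy_prob

source = ["christian", "campbell", "actor"]
attn = [0.1, 0.2, 0.7]               # copy attention over source positions
gen = {"actor": 0.3, "singer": 0.5}  # decoder's generation distribution (partial)

# "actor" can be copied from the source, "singer" cannot.
p_actor = copy_mixture(0.4, gen, attn, source, "actor")
p_singer = copy_mixture(0.4, gen, attn, source, "singer")
print(p_actor, p_singer)
```

Words absent from both the generation distribution and the source receive probability zero, while source words remain reachable even when the decoder's vocabulary distribution misses them.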
4.2 Calibration
In order to use the confidence score to promote faithful generation, we apply a calibration technique (braverman2019calibration) which augments the generation probability as follows:
P^λ(y_t | x, y_{1:t-1}) ∝ SG( P(y_t | x, y_{1:t-1}) ) · C(y_t)^λ  (10)
Here, P^λ is a one-parameter family of probability distributions called the calibration of P to C. Note that SG stops gradients, so that only the parameter λ is updated by this term during training. In order to learn λ, we minimize the negative log-likelihood of P^λ jointly with those of P and C:
(11)
Since P is a special case of P^λ (namely at λ = 0), the training perplexity of P^λ is at most that of P. In practice, λ is initialized at 0 and found to converge to a positive value (Section 5.4). Therefore, the calibration trick can improve the confidence of generation without sacrificing perplexity.
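A minimal sketch of the calibration in Eq. 10 (plain Python; the vocabulary and scores are made up, and the stop-gradient only matters during training, so it is noted only in a comment):

```python
def calibrate(probs, confidences, lam):
    # Cf. Eq. 10: the calibrated probability is proportional to
    # P(w) * C(w)^lam, renormalized over the vocabulary. In training,
    # gradients would be stopped through P so only lam is updated here.
    scores = {w: probs[w] * (confidences[w] ** lam) for w in probs}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

probs = {"a": 0.5, "american": 0.3, "attorney": 0.2}
conf = {"a": 0.9, "american": 0.8, "attorney": 0.1}

unchanged = calibrate(probs, conf, lam=0.0)  # lambda = 0 recovers P exactly
sharpened = calibrate(probs, conf, lam=1.0)  # lambda > 0 favors confident words
print(unchanged)
print(sharpened)
```

At λ = 0 the calibrated distribution coincides with P, so training can only move λ away from 0 if doing so lowers the loss; this is why calibration can sharpen the output distribution without sacrificing perplexity.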
4.3 Training with a Variational Bayes Objective
In practice, the training data of a conditional text generation task almost always contains noise and/or scarce examples. In order to reduce the impact of such outliers and train a confident generation model, we allow the model to skip tokens in the training data on which it is not confident. For this purpose, we use the confidence score to sample a subsequence of each training target, and minimize the negative log-likelihood on that confident subsequence. However, since the confidence score itself needs to be trained, we use a variational Bayes objective to formalize the problem.
Specifically, for each target y, we define ŷ as a latent subsequence of y that consists of its confident tokens; the subsequence is given by an inclusion of indices into y. We assume that ŷ is generated by the calibrated probability P^λ:
(12)
Then, we connect ŷ to the probability of the training target y. From Bayes' rule:
P(y | x) = P(ŷ | x) · P(y | ŷ, x) / P(ŷ | y, x)  (13)
We assume that P(y | ŷ, x) = 1 for every training example, because the training data is uniquely given. Then, we regard ŷ as a sequential “keep/skip” labeling over y, and define a probability distribution Q(ŷ | y, x)
(14)
to sample a subsequence according to the confidence score (Figure 3). Here, the two scalars in Eq. 14 are hyperparameters (their values are given in Table 1). Now let
(15)
and the idea is to use Q as an approximation to the unknown posterior P(ŷ | y, x) in Eq. 13. By taking the log of both sides of Eq. 13 and trivially introducing Q, we get
(16)
Then, we take the expectation of both sides under Q and note that the KL divergence between Q and the true posterior is non-negative, so
(17)
The variational Bayes objective is to minimize the upper bound on the right-hand side of Eq. 17. In practice, it is computationally expensive to calculate this bound explicitly by enumerating all subsequences of y, because the number of subsequences is exponential in the length m. Thus, we apply a Monte Carlo method which estimates the bound by sampling from Q. In order to backpropagate gradients through the expectation as well, the loss function is given as follows (DBLP:conf/icml/PaisleyBJ12):
(18)
Here, the number of samples taken and a coefficient controlling how fast the gradients go through Q are hyperparameters. However, since we define P(ŷ | x) in Eq. 12 by the calibrated probability, which only learns the one parameter λ, we add the joint-learning terms of Eq. 11 into Eq. 18 to form the final objective:
(19)
In order to sample subsequences from Q, we apply the same Gumbel-max trick as DBLP:conf/icml/KoolHW19. Although the random sampling is purely based on a learned probability distribution, without any constraint to keep the result fluent, the model surprisingly still learns to generate fluent text.
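The keep/skip sampling and the Monte Carlo estimate can be sketched as follows. The keep distribution below is a simplified stand-in for Q in Eq. 14 (a single hypothetical `floor` hyperparameter instead of the paper's two, and independent Bernoulli draws instead of the Gumbel-max trick), and the gradient-related terms of Eq. 18 are omitted:

```python
import random

def sample_keep_mask(confidences, floor=0.5, rng=random):
    # Simplified stand-in for Q (cf. Eq. 14): each token is independently
    # kept with a probability that grows with its confidence score.
    # `floor` is a hypothetical hyperparameter, not one from the paper.
    return [rng.random() < floor + (1.0 - floor) * c for c in confidences]

def monte_carlo_nll(token_nlls, confidences, n_samples=4, rng=random):
    # Monte Carlo estimate of the expected NLL over confident subsequences
    # (cf. Eq. 18): average the NLL of sampled keep/skip subsequences.
    total = 0.0
    for _ in range(n_samples):
        mask = sample_keep_mask(confidences, rng=rng)
        total += sum(nll for nll, keep in zip(token_nlls, mask) if keep)
    return total / n_samples

rng = random.Random(0)
token_nlls = [0.1, 2.5, 0.2, 3.0]   # per-token negative log-likelihoods
confidences = [0.9, 0.2, 0.8, 0.1]  # low-confidence tokens are often skipped
loss = monte_carlo_nll(token_nlls, confidences, n_samples=8, rng=rng)
print(loss)
```

Hard-to-explain reference tokens (high NLL, low confidence) are frequently dropped from the sampled subsequences, so they contribute less to the loss than in ordinary maximum-likelihood training.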
5 Experiments
Although our approach could apply to many conditional text generation tasks, in this work we consider data-to-text generation, in which the source is structured data and the target is natural language text describing the data. Usually, the data have concise semantics and simple structure, which makes it easy to check the factual accuracy of the generation.
5.1 Dataset and Evaluation Metrics
The WikiBio dataset (lebret2016neural) contains 728,321 biographies paired with infoboxes, taken from the Sep-2015 dump of English Wikipedia, and split into train/valid/test sets in an 8:1:1 ratio. The biography text is the first sentence of the corresponding Wikipedia page, and each infobox has a number of non-empty fields.
For automatic metrics, we report BLEU (papineni2002bleu), as well as PARENT (dhingra2019handling), a metric designed to mitigate the shortcomings of BLEU on structured data-to-text generation.
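To build intuition for why a table-grounded precision differs from BLEU, here is a deliberately simplified illustration in plain Python. This is not the PARENT metric (which also uses the reference and an entailment model over n-grams); it merely counts generated content words that appear among the table values:

```python
def table_precision(generated_tokens, table,
                    stopwords=frozenset({"a", "an", "the", "is", "was", "in", "of"})):
    # Illustrative only: the fraction of non-stopword generated tokens
    # that occur somewhere in the table's values. NOT the PARENT metric.
    table_words = {w.lower() for value in table.values() for w in value.split()}
    content = [t.lower() for t in generated_tokens if t.lower() not in stopwords]
    if not content:
        return 1.0
    return sum(t in table_words for t in content) / len(content)

# Hypothetical infobox, echoing the running example from Section 4.
table = {"name": "christian campbell", "nationality": "u.s.", "occupation": "actor"}
faithful = "christian campbell is an actor".split()
hallucinated = "christian campbell is a criminal defense attorney".split()
print(table_precision(faithful, table), table_precision(hallucinated, table))
```

Under such a measure, a fluent sentence containing an unsupported phrase like criminal defense attorney scores strictly lower than a shorter, fully supported one, whereas BLEU against a divergent reference may not penalize it.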
For human evaluation, we obtain crowdsourced annotations on 1000 examples randomly chosen from predictions on the WikiBio test set, the same 1000 for each model. We instruct raters to grade each example on 3 criteria: faithfulness, coverage, and fluency, as follows:

- Faithfulness (precision): We define a sentence to be faithful if all the information in the proposed sentence is supported by the table or the reference. A single hallucinated piece of information makes the sentence non-faithful.
- Coverage (recall): The number of table cells that contain information present in the sentence.
- Fluency: A sentence is defined to be fluent if it is clear, natural, and grammatically correct. Raters choose among three options: Fluent, Mostly Fluent, Not Fluent.
An ideal system would always produce fluent and faithful text with high coverage.
5.2 Experiment Setting
Model  Warmup Steps  Learning Rate  RNN Dropout  RNN Dim.  Beam Size  Eq. 14 hyperparams  Eq. 18 hyperparams
Confident BERT-to-RNN  40000  0.05  0.1  768  8  0.75, 1/8  4, 1/64
Confident Pointer-Generator  None  0.0005  0.2  200  8  0.5, 1/16  4, 1/4
We compare the following systems:

- BERT-to-BERT (rothe2019leveraging): A Transformer encoder-decoder model (vaswani2017attention) where the encoder and decoder are both initialized with BERT (devlin2018bert).
- Structure-Aware Seq2Seq (liu2017table): A state-of-the-art method on the WikiBio dataset in terms of BLEU.
- Pointer-Generator (see2017get): A Seq2Seq model with attention and copy mechanism (our implementation).
- Confident BERT-to-RNN (This Work): A Transformer encoder initialized with a BERT checkpoint, and a GRU (cho2014learning) decoder with our confident decoding method.
- Confident Pointer-Generator (This Work): The Pointer-Generator model with confident decoding.
We built our systems using TensorFlow (abadi2016tensorflow). Infoboxes are linearized into sequences, with field names and values separated by special tokens. For BERT-to-BERT and Confident BERT-to-RNN, we pretrained a BERT checkpoint on the Books corpus (zhu2015aligning) only, since the original BERT was trained on Wikipedia, which overlaps with the test targets in WikiBio. For Pointer-Generator and Confident Pointer-Generator, we use GloVe (pennington2014glove) input word embeddings, and the two models share the same hyperparameter settings. For our confident decoding models, there are additional hyperparameters, defined in Eq. 14 and Eq. 18. The hyperparameters are given in Table 1. The optimizer is Adam (DBLP:journals/corr/KingmaB14).
5.3 Results
Table 2 shows the results. According to human evaluation, our approach gives a clear improvement in faithfulness over the baselines, with some drop in coverage. To further measure the validity of our confidence score, we post-processed the output to remove words with confidence lower than 0.125. This thresholding technique gives further gains in faithfulness, while sacrificing some fluency.
Among the automatic metrics, PARENT precision and recall seem correlated with faithfulness and coverage, respectively, and our approach achieves the highest precision score. BLEU, perhaps because of its length penalty that rewards longer generations, seems more correlated with coverage than with faithfulness. Regarding the baselines, we see that BERT-to-BERT is the most fluent while the Pointer-Generator is the most faithful, suggesting that pretraining might help fluency while the copy mechanism can be valuable for faithfulness.
Automatic Metrics  Human evaluation  
Model  BLEU  PARENT (Prec. / Rec. / F1)  Avg Len.  Faithful %  Avg Cov.  Fluency % 
BERT-to-BERT (rothe2019leveraging)  44.83  77.62 / 43.00 / 53.13  20.9  77.6  4.33  98.5 / 99.4
Structure-Aware Seq2Seq (liu2017table)  45.36  73.98 / 44.02 / 52.81  23.1  66.1  4.47  88.6 / 99.7
Pointer-Generator (see2017get)  41.07  77.59 / 42.12 / 52.10  19.1  80.3  4.24  93.1 / 96.0
Confident BERT-to-RNN (This Work)  33.30  77.98 / 37.21 / 47.90  16.6  85.2*  3.90  92.3 / 94.1
Confident Pointer-Generator (This Work)  38.10  79.52 / 40.60 / 51.38  17.0  86.8*  4.05  95.4 / 96.3
+ threshold = 0.125  36.62  80.15 / 39.59 / 50.50  16.4  90.7*  4.01  91.6 / 92.2
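The confidence-thresholding post-processing in Table 2 (“+threshold=0.125”) corresponds to a filter like the following sketch (plain Python; the tokens and scores are hypothetical):

```python
def apply_confidence_threshold(tokens, confidences, threshold=0.125):
    # Post-processing: drop output words whose confidence score falls
    # below the threshold. Gains in faithfulness can come at some cost
    # in fluency, since words are removed without regard to grammar.
    return [t for t, c in zip(tokens, confidences) if c >= threshold]

tokens = ["john", "smith", "is", "a", "famous", "actor"]
confs = [0.95, 0.90, 0.85, 0.80, 0.05, 0.70]
print(" ".join(apply_confidence_threshold(tokens, confs)))
```

In this example, dropping the low-confidence word "famous" leaves the ungrammatical fragment "a actor", mirroring the fluency drop reported for the thresholded model in Table 2.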
5.4 Ablations
Our confident decoding method has three novel components: (1) the use of a base-LM to define the confidence score; (2) the calibration technique to adjust the output probability; and (3) the variational Bayes objective to train a confident model. In this section, we assess the effect of each component with an ablation study. We start from the Confident Pointer-Generator, and in each test replace one component by a trivial alternative: (1) To assess the effect of using the base-LM in the confidence score, we instead use the attention score directly as the confidence, and train models with the same hyperparameter search; the results on WikiBio are shown in Table 3 as “No base-LM”. (2) We use the uncalibrated probability P instead of the calibrated P^λ at test time (“No calibration”), to assess the effect of calibration; the model is the same as the Confident Pointer-Generator, and the learned λ was positive. (3) Instead of the variational Bayes objective, we use the joint training loss in Eq. 11, without sampling subsequences from the training targets (“No variational”).
As we can see from Table 3, all three components improve PARENT precision. While the improvement from calibration is the smallest, the technique also improves PARENT recall and the BLEU score at the same time, making it an easy choice. The other techniques trade recall for precision, making them useful for tasks that require a high degree of faithfulness. When all three components are disabled, the model is exactly the same as our implementation of the Pointer-Generator, and every component improves PARENT precision over it as well. In particular, comparing the Pointer-Generator with “No variational” again shows that joint training with calibration improves all metrics.
We also note that the average lengths of the generations by our confident decoding models are shorter. While there exist heuristics such as a length penalty (wu2016google) to encourage longer generations at inference time, producing faithful shorter generations is not trivial. In the “Truncated” setting, we truncate each prediction of the Pointer-Generator by two words to match the average length of the Confident Pointer-Generator. The PARENT precision of our confident decoding method is not trivially achieved by truncation.
BLEU  PARENT (Prec. / Rec. / F1)  Avg Len.
Confident Pointer-Generator (This Work)  38.10  79.52 / 40.60 / 51.38  17.0
No base-LM  39.39  78.77 / 41.55 / 52.08  17.9
No calibration  37.89  79.47 / 40.47 / 51.26  16.9
No variational  41.29  78.25 / 42.40 / 52.52  18.9
Pointer-Generator  41.07  77.59 / 42.12 / 52.10  19.1
Truncated  35.50  77.68 / 38.16 / 48.66  17.1
6 Conclusion
In this work, we proposed a confidence-oriented decoder that achieves more faithful generation on the WikiBio dataset than existing state-of-the-art approaches. Our method is general in principle, so it could potentially be adapted to other forms of conditional text generation, such as document summarization and machine translation. Another avenue of future work is enhancing the confidence score to move beyond shallow alignment and perform more complex logical inference (steedman2011combinatory; kamath2018survey).