Conditional text generation is the task of generating some target text y conditioned on source content x. Prominent examples include document summarization (mani1999advances), where x is a long document and y is a more concise version; machine translation (koehn2009statistical), where x and y represent equivalent text in different languages; and data-to-text generation (kukich1983design; mckeown1992text), where x is a structured table and y is a textual description.
While traditionally addressed with template-based approaches (becker2002practical; foster2004techniques; gatt2009simplenlg; reiter2005choosing), neural encoder-decoder models (sutskever2014sequence; cho2014learning; bahdanau2014neural) have recently become a popular choice. In this formulation, the source content is encoded with a neural architecture, and the decoder autoregressively produces a token at each output position based on its internal state and the source representation. By leveraging continuous representations with rich non-linearities, encoder-decoder models can generate highly fluent text (rush2015neural; radford2019language) without the need for cumbersome handcrafted rules and templates.
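As a toy illustration of this autoregressive factorization, the sketch below decodes greedily from a stand-in `step_fn` that plays the role of the decoder's per-step distribution P(y_t | y_<t, x); all names and the toy vocabulary here are illustrative, not taken from the paper.

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=10):
    """Autoregressive decoding: each token is chosen from a distribution
    conditioned on the prefix generated so far (and, implicitly, the
    encoded source hidden inside step_fn)."""
    prefix = [bos_id]
    for _ in range(max_len):
        probs = step_fn(prefix)          # stand-in for P(y_t | y_<t, x)
        next_id = int(np.argmax(probs))  # greedy choice at each position
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]

# Toy "decoder": deterministically walks through a fixed target sequence
# over a vocabulary of size 5, where id 2 acts as the end-of-sequence token.
target = [3, 1, 4, 2]

def toy_step(prefix):
    probs = np.full(5, 0.01)
    probs[target[len(prefix) - 1]] = 0.96
    return probs / probs.sum()
```

Beam search replaces the single greedy `argmax` with a set of candidate prefixes, but the per-step conditioning is the same.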
However, encoder-decoder architectures are inherently difficult to control, and have been shown to be prone to hallucination, i.e., generating text that is fluent but unfaithful to the source (vinyals2015neural; koehn2017six; wiseman2017challenges; lee2018hallucinations). This severe shortcoming can often limit the use of neural approaches in many real world systems, where it is not acceptable to produce output that is even occasionally unfaithful.
In this work, we focus on data-to-text generation since the structured form of source content makes it relatively easy to evaluate faithfulness using both human evaluation and domain-specific automatic metrics (dhingra2019handling). In particular, we focus on the WikiBio (lebret2016neural) dataset, where the task is to generate a sentence summarizing a tabular biography of a person. Figure 1 shows an example.
First, note that the reference contains information such as bonanno crime family and informant that is true but cannot be inferred from the source. This source-reference divergence exists in many large-scale generation datasets (wiseman2017challenges; dhingra2019handling). Second, most generation systems are agnostic to this divergence and are trained to maximize the log-likelihood of the reference. This can encourage the models to output phrases that are unsupported by the source. For example, Figure 1 shows the output of a state-of-the-art generation baseline, the Pointer-Generator network (see2017get), which contains the phrase criminal defense attorney that is false (but loosely related to FBI in the table). Thus, hallucination often results from the coupling of model shortcomings (e.g. lack of formal reasoning, learning false correlations) with noise/artifacts in the training data.
In this work, we propose a confidence-oriented approach that assigns a learned confidence score to each decoder position, and then uses the score in two ways to reduce hallucination: (1) At test time, it uses the confidence score to adjust the output probabilities via a calibration technique (braverman2019calibration). (2) In training, we employ a variational Bayes objective to jointly learn the confidence score while allowing the model to skip tokens with low confidence, so as to avoid training on reference phrases that are difficult to infer from the source. In Figure 1, our approach leads to a faithful generation that omits the occupation.
Empirically, when evaluated on the WikiBio dataset (lebret2016neural), we show that our approach is considerably more faithful to the source than existing state-of-the-art solutions, according to both PARENT precision (dhingra2019handling) and human evaluation.
2 Related Work
Improving the fidelity and accuracy of text generation systems is an important research topic that has spawned a variety of approaches. Some focus on blending extractive and abstractive techniques, e.g., allowing the model to copy tokens directly from the source (gu2016incorporating; see2017get), separating content selection from generation (zhou-etal-2017-selective; gehrmann2018bottom), and utilizing topic information from the source to inform generation (narayan-etal-2018-dont).
Other approaches have proposed generating more accurate text using semi-parametric methods (guu2018generating; pandey2018exemplar; peng2019text), reinforcement-learning-based rewards (paulus2018deep; pasunuru2018multi), semi-Markov models that learn neural templates (wiseman2018learning), content planning (puduppully2019data), and constrained vocabulary decoding (wu-aaai18). While many of these leverage the structure of the source (liu2017table) or task-based insights (puduppully2019entity), our approach is complementary in that it uses general machine learning techniques to build a confidence-oriented decoder that is more faithful to the source and robust to divergence/noise in the training data. Furthermore, many previous works rely on automatic metrics such as BLEU, which can be poorly correlated with human judgment of faithfulness (wiseman2017challenges; dhingra2019handling). In contrast, we evaluate on PARENT precision (dhingra2019handling), a metric specifically designed to capture faithfulness in data-to-text generation, and conduct a rigorous human evaluation to assess hallucination in our models.
Before describing our approach, we first review the existing encoder-decoder framework (sutskever2014sequence; bahdanau2014neural) with one stop-gradient (SG) tweak. Let x = (x_1, ..., x_S) be the source input of length S and y = (y_1, ..., y_T) be the target sequence of length T. Each token y_t takes one value from a vocabulary V. Our goal is to model the conditional distribution P(y_t | y_{<t}, x), where y_{<t} is the prefix of y up to the (t-1)-th token. The source can be encoded by any neural network function enc, such as a convolutional neural network (CNN; lecun1990handwritten), a recurrent network such as an LSTM (hochreiter1997long), or a Transformer (vaswani2017attention). Let h = enc(x).
Define e(y) as the embedding of token y. Then, the probability of each target token is computed as:
The first term on the right-hand side of Eq. 2 represents Luong-style attention (defined in Eq. 3; DBLP:conf/emnlp/LuongPM15), while the second term represents the hidden state at position t, which is modelled with an RNN (defined in Eq. 4). (While it is possible that our approach could extend to other types of decoders, our current formulation of the confidence score uses this specific form of attention.)
We have made one change to the conventional input-feeding approach as defined by bahdanau2014neural: we apply a stop-gradient (SG) to the attention vector a_t above. We use SG as a control on the information flow during training, making the model match the intended design: the SG prevents information at the current step from being propagated back to previous attention computations, as required by the attention score defined in Section 4.1.
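A minimal numpy sketch of one such attention step follows; dot-product scoring is assumed for illustration, and the stop-gradient (identity in the forward pass, zero gradient in a real framework) is noted but not modelled, since numpy has no autodiff.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attention_step(h_src, d_t):
    """One Luong-style attention step: score each source state against
    the decoder state d_t, then form the attention (context) vector a_t.
    In training, a stop-gradient would be applied to a_t before it is
    fed into the next decoder step."""
    scores = h_src @ d_t     # dot-product scores, shape (S,)
    alpha = softmax(scores)  # attention weights over source positions
    a_t = alpha @ h_src      # context vector, shape (dim,)
    return a_t, alpha
```

With `h_src` the encoder outputs (one row per source position) and `d_t` the decoder state, `alpha` concentrates on whichever source row best matches `d_t`.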
4 Confident Decoding

Our approach is based on the idea of a learned confidence score at every decoding position that balances two factors:
How much the model should rely on the source for this position.
How much the model actually relies on the source for this position.
Intuitively, it is reasonable for a system to depend mostly on language modeling to predict function words that make the generation fluent, but it should consult more of the source data to predict content words. For example, given a partial generation “Christian Campbell is __”, one could predict that the next token is most likely “a”, “an” or “the”, based on language modeling. However, if a model predicts “American” as the token following “Christian Campbell is an __”, the prediction should be based on a field such as “Nationality: U.S.” in the source, rather than the linguistic tendency of “American” to appear after the phrase “is an”. A typical neural network can make predictions for either reason with little controllability; this, we contend, is a cause of hallucination.
In order to measure how much the model should rely on the source, we compare the encoder-decoder model with an unconditioned language model. The unconditioned language model does not have any access to the source data, so if it can predict a token as precisely as the encoder-decoder, that token is probably an element of a general linguistic construction that does not convey source information.
In order to measure how much the model actually relies on the source, we derive an attention score of the encoder-decoder from the attention mechanism. If a token is likely a content word (i.e. when its generation probability by the encoder-decoder is much higher than the unconditioned language model), but the attention score is low, then the token might not be predicted based on the source, and could be hallucination. Thus, we design the confidence score to be low in this case.
The confidence score is specified in Section 4.1. There are two ways we leverage the score:
At test time, we augment the generation probability with the confidence score, using a calibration technique (Section 4.2). This allows us to weight generation toward confident tokens without sacrificing perplexity (braverman2019calibration).
In training, we allow the model to skip tokens with low confidence scores in order to avoid training on noisy references or scarce examples (Section 4.3). However, since the confidence score itself needs to be learned during training, we use a variational Bayes objective to formalize this symbiotic cycle (DBLP:journals/corr/KingmaW13).
4.1 Confidence Score
We define the confidence score C_t as an interpolation between the encoder-decoder model P(y_t | y_{<t}, x) and the unconditioned language model P_LM(y_t | y_{<t}):

where A_t is the attention score, measuring how much the encoder-decoder is paying attention to the source:

Here, ||·|| denotes the Euclidean norm.
For function words, we expect both P(y_t | y_{<t}, x) and P_LM(y_t | y_{<t}) to be high, so the confidence score defined in Eq. 5 will be high no matter what the attention score is. On the other hand, we expect P(y_t | y_{<t}, x) to be higher than P_LM(y_t | y_{<t}) for content words, so the confidence score will largely depend on the attention score in this case.
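The interpolation described above can be sketched as follows. The exact form of the attention score did not survive extraction, so the norm ratio in `attention_score` is an illustrative assumption consistent with the text (derived from the attention vector via the Euclidean norm), not the paper's exact formula.

```python
import numpy as np

def attention_score(a_t, d_t):
    """Assumed form: relative Euclidean norm of the attention (context)
    vector a_t versus the decoder hidden state d_t. A large context
    vector relative to the decoder state reads as "relying on the
    source". Illustrative stand-in, not the paper's exact definition."""
    na, nd = np.linalg.norm(a_t), np.linalg.norm(d_t)
    return na / (na + nd + 1e-9)

def confidence(attn_score, p_encdec, p_lm):
    """Sketch of the confidence score: interpolate the encoder-decoder
    probability and the base-LM probability, weighted by the attention
    score, so that function words (both probabilities high) score high
    regardless of attention, while content words need high attention."""
    return attn_score * p_encdec + (1.0 - attn_score) * p_lm
```

For a function word, `confidence(0.1, 0.9, 0.85)` stays high even with low attention; for a content word, `confidence(0.1, 0.9, 0.05)` is low, flagging a likely hallucination.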
We refer to the unconditioned language model in the definition of the confidence score as the base-LM. In this work, we use an RNN for the base-LM, but modify its input-feeding as follows:

Here, s_t is the hidden state of the base-LM, SG denotes stop-gradient, and the input embedding is weighted by a component of the previous confidence score. We found this weighting scheme to decrease the base-LM's dependence on content words, seemingly resulting in a model of soft templates (Figure 2).
In case the encoder-decoder is equipped with a copy mechanism, the generation probability is mixed with a probability of copying from the source (gu2016incorporating; see2017get):

where p_gen,t is the probability of generating rather than copying at step t, and α_{t,s} is the attention weight the copy mechanism places on position s in the source. The sum is taken over all source positions s where the word x_s is the same as y_t. When the copy mechanism is incorporated, we re-define the attention score as
and the confidence score is re-defined accordingly.
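The generation/copy mixture described above can be sketched as a pointer-generator-style mixing step; the argument names and toy shapes are illustrative.

```python
import numpy as np

def copy_mixture(p_gen, vocab_probs, copy_weights, src_ids):
    """Mix the generation and copy distributions:
    P(y_t = w) = p_gen * P_vocab(w)
               + (1 - p_gen) * sum of copy-attention weights over the
                 source positions s whose word x_s equals w."""
    probs = p_gen * np.asarray(vocab_probs, dtype=float)
    for weight, token in zip(copy_weights, src_ids):
        # accumulate copy mass onto the vocabulary id of each source token
        probs[token] += (1.0 - p_gen) * weight
    return probs
```

Because both components are normalized distributions and the mixing weights sum to one, the result is itself a valid distribution over the vocabulary.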
4.2 Calibration

In order to use the confidence score to promote faithful generation, we apply a calibration technique (braverman2019calibration) which augments the generation probability as follows:
P_λ is a one-parameter family of probability distributions, called the calibration of P to C. Note that SG stops gradients so that only the parameter λ is updated during training. In order to learn λ, we minimize the negative log-likelihood of P_λ jointly with those of P and the base-LM:
Since P is a special case of P_λ (namely at λ = 0), the training perplexity of P_λ is at most that of P. In practice, λ is initialized at 0 and is found to converge to a positive value (Section 5.4). Therefore, the calibration trick can improve the confidence of generation without sacrificing perplexity.
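The exact one-parameter family used in the paper is given by an equation that did not survive extraction, so the sketch below substitutes a hypothetical family with the stated key property: at λ = 0 it reduces exactly to P, so the calibrated model's training perplexity is bounded by the base model's.

```python
import numpy as np

def calibrate(probs, conf, lam):
    """Illustrative one-parameter calibration: re-weight the generation
    distribution by the confidence score raised to the power lam, then
    renormalize. lam = 0 recovers the original distribution exactly.
    (In training, probs would sit behind a stop-gradient so that only
    lam is updated by this term; this is a stand-in, not the paper's
    exact functional form.)"""
    w = probs * np.power(conf, lam)
    return w / w.sum()
```

With λ > 0, probability mass shifts toward high-confidence tokens, which is the behavior the calibration is meant to induce at test time.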
4.3 Training with a Variational Bayes Objective
In practice, the training data of a conditional text generation task almost always contain noise and/or scarce examples. In order to reduce the impact of such outliers and train a confident generation model, we allow the model to skip tokens in the training data when it is unconfident. For this purpose, we use the confidence score to sample a sub-sequence of each training target, and minimize the negative log-likelihood on that confident sub-sequence. However, since the confidence score itself needs to be trained, we use a variational Bayes objective to formalize the problem.
Specifically, for each target y, we define z as a latent sub-sequence of y consisting of its confident tokens, given by an inclusion of indices i_1 < ... < i_K, where K is the length of z. We assume that z is generated by the calibrated probability P_λ:

Then, we connect z to the probability of the training target y. From Bayes' rule:
We assume that P(y | z, x) = 1 for every training example, because the training data is uniquely given. Then, we regard z as a sequential “keep/skip” labeling over y, and define a probability distribution Q(z | y, x)
to sample a sub-sequence according to the confidence score (Figure 3). Here, the distribution has two hyper-parameters. Now let
Then, we take the expectation of both sides under Q and apply Jensen's inequality, so
The variational Bayes objective is to minimize the upper bound on the right-hand side of Eq. 17. In practice, it is computationally expensive to calculate the expectation explicitly by enumerating all sub-sequences of y, because the number of sub-sequences is exponential in the length T. Thus, we apply a Monte Carlo method which estimates the expectation by sampling z from Q. In order to back-propagate gradients through the expectation over Q
as well, the loss function is given as follows (DBLP:conf/icml/PaisleyBJ12):
Here, N is the number of samples taken, and a hyper-parameter controls how fast the gradients go through Q. However, since we define P(z | x) in Eq. 12 by the calibrated probability, which only learns the single parameter λ, we add joint learning terms into Eq. 18 to form the final objective:
In order to sample sub-sequences from Q, we apply the same Gumbel-max trick as in DBLP:conf/icml/KoolHW19. Although the random sampling is purely based on a learned probability distribution, with no constraints to keep the sampled sub-sequences fluent, the model surprisingly still learns to generate fluent text.
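A rough sketch of the sampling and Monte Carlo estimate follows. The Bernoulli keep/skip form and the temperature knob are assumptions standing in for Q's hyper-parameters, and the score-function term that carries gradients through Q is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def sample_keep_mask(conf, temperature=1.0):
    """Sample a per-position 'keep/skip' label, keeping a token with
    probability increasing in its confidence score. (Assumed Bernoulli
    form; a Gumbel-max draw over {keep, skip} reduces to this.)"""
    keep_prob = np.asarray(conf, dtype=float) ** (1.0 / temperature)
    return rng.random(len(conf)) < keep_prob

def monte_carlo_loss(log_probs, conf, n_samples=4):
    """Monte Carlo estimate of the variational objective: average the
    negative log-likelihood over sampled confident sub-sequences, so
    low-confidence (noisy or hard-to-infer) tokens tend to be skipped."""
    losses = []
    for _ in range(n_samples):
        mask = sample_keep_mask(conf)
        if mask.any():  # guard against an empty sub-sequence
            losses.append(-log_probs[mask].mean())
    return float(np.mean(losses)) if losses else 0.0
```

When every position is fully confident, every token is kept and the loss reduces to the ordinary token-level negative log-likelihood.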
5 Experiments

Although our approach could apply to many conditional text generation tasks, in this work we consider data-to-text generation, in which the source is structured data and the target is natural-language text describing the data. The data usually have concise semantics and simple structure, which makes it easy to check the factual accuracy of the generation.
5.1 Dataset and Evaluation Metrics
The WikiBio dataset (lebret2016neural) contains 728,321 biographies paired with infoboxes, taken from the Sep-2015 dump of English Wikipedia, and split into train/valid/test sets. The biography text is the first sentence of the Wikipedia page, and each infobox contains a number of non-empty fields.
For automatic metrics, we report BLEU (papineni2002bleu), as well as PARENT (dhingra2019handling), a metric designed to mitigate the shortcomings of BLEU on structured data-to-text generation.
For human evaluation, we obtain crowd-sourced annotations on 1000 examples randomly chosen from predictions on the WikiBio test set (the same 1000 examples for each model). We instruct raters to grade each example on 3 criteria: faithfulness, coverage, and fluency, defined as follows:
Faithfulness (precision) - We define a sentence to be faithful if all the information in the proposed sentence is supported by the table or the reference. A single hallucinated piece of information makes the sentence non-faithful.
Coverage (recall) - The number of table cells that contain information present in the sentence.
Fluency - A sentence is defined to be fluent if it is clear, natural, and grammatically correct. Raters choose among three options: Fluent, Mostly Fluent, Not Fluent.
An ideal system would always produce fluent and faithful text with high coverage.
5.2 Experiment Setting
Table 1: Hyper-parameter settings per model: warmup steps, learning rate, RNN dropout, RNN dimension, and beam size.
We compare the following systems:
BERT-to-BERT (rothe2019leveraging): A Transformer encoder-decoder model (vaswani2017attention) where the encoder and decoder are both initialized with BERT (devlin2018bert).
Structure Aware Seq2Seq (liu2017table): A state-of-the-art method on the WikiBio dataset in terms of BLEU.
Pointer-Generator (see2017get): A Seq2Seq with attention and copy mechanism (our implementation).
Confident BERT-to-RNN (This Work): A Transformer encoder initialized with a BERT checkpoint, and a GRU (cho2014learning) decoder with our confident decoding method.
Confident Pointer-Generator (This Work): Pointer-Generator model with confident decoding.
We built our systems using TensorFlow (abadi2016tensorflow). Infoboxes are linearized into sequences, with field names and values separated by special tokens. For BERT-to-BERT and Confident BERT-to-RNN, we pre-trained a BERT checkpoint on the Books corpus (zhu2015aligning) only, since the original BERT was trained on Wikipedia, which overlaps with the test targets in WikiBio. For Pointer-Generator and Confident Pointer-Generator, we use GloVe (pennington2014glove) as the input word embedding, and the two models share the same hyper-parameter settings. For our confident decoding models, there are additional hyper-parameters, defined in Eq. 14 and Eq. 18. The hyper-parameters are given in Table 1. The optimizer is Adam (DBLP:journals/corr/KingmaB14).
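The linearization step might look like the following; the `<field>` / `<value>` marker tokens are hypothetical placeholders, since the text only states that special separator tokens are used.

```python
def linearize_infobox(infobox):
    """Linearize an infobox (field -> value mapping) into a flat token
    sequence, marking each field name and its value with separator
    tokens. The <field>/<value> tokens are illustrative choices."""
    tokens = []
    for field, value in infobox.items():
        tokens += ["<field>", field, "<value>"] + value.split()
    return tokens
```

The resulting flat sequence can then be consumed by any of the encoders above, exactly like ordinary source text.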
5.3 Results

Table 2 shows the results. According to human evaluation, our approach gives a clear improvement in faithfulness over the baselines, with some drop in coverage. To further measure the validity of our confidence score, we post-processed the output to remove words with confidence lower than a threshold of 0.125. This thresholding technique gives further gains in faithfulness, while sacrificing some fluency.
Among the automatic metrics, PARENT precision and recall appear correlated with faithfulness and coverage respectively, and our approach achieves the highest precision score. BLEU, perhaps because of its length penalty that rewards longer generations, seems more correlated with coverage than with faithfulness. Regarding the baselines, we see that BERT-to-BERT is the most fluent while Pointer-Generator is the most faithful, suggesting that pretraining might help fluency while the copy mechanism can be valuable for faithfulness.
Table 2: Automatic metrics and human evaluation on the WikiBio test set.

| Model | BLEU | PARENT (Prec. / Rec. / F1) | Avg Len. | Faithful % | Avg Cov. | Fluency % |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-to-BERT (rothe2019leveraging) | 44.83 | 77.62 / 43.00 / 53.13 | 20.9 | 77.6 | 4.33 | 98.5 / 99.4 |
| Structure-Aware Seq2Seq (liu2017table) | 45.36 | 73.98 / 44.02 / 52.81 | 23.1 | 66.1 | 4.47 | 88.6 / 99.7 |
| Pointer-Generator (see2017get) | 41.07 | 77.59 / 42.12 / 52.10 | 19.1 | 80.3 | 4.24 | 93.1 / 96.0 |
| Confident BERT-to-RNN (This Work) | 33.30 | 77.98 / 37.21 / 47.90 | 16.6 | 85.2* | 3.90 | 92.3 / 94.1 |
| Confident Pointer-Generator (This Work) | 38.10 | 79.52 / 40.60 / 51.38 | 17.0 | 86.8* | 4.05 | 95.4 / 96.3 |
| +threshold=0.125 | 36.62 | 80.15 / 39.59 / 50.50 | 16.4 | 90.7* | 4.01 | 91.6 / 92.2 |
5.4 Ablation Study

Our confident decoding method has three novel components: (1) the use of a base-LM to define the confidence score; (2) the calibration technique to adjust the output probability; and (3) the variational Bayes objective to train a confident model. In this section, we assess the effect of each component with an ablation study. We start from the Confident Pointer-Generator, and in each test replace one component with a trivial alternative: (1) To assess the effect of using the base-LM in the confidence score, we instead use the attention score directly as the confidence, and train models with the same hyper-parameter search; the results on WikiBio are shown in Table 3 as “No base-LM”. (2) We use the uncalibrated P instead of P_λ at test time (“No calibration”) to assess the effect of calibration; the model is otherwise the same as the Confident Pointer-Generator, and the learned λ was positive. (3) Instead of the variational Bayes objective, we use the joint training loss in Eq. 11 without sampling sub-sequences from training targets (“No variational”).
As we can see from Table 3, all three components improve PARENT precision. While the improvement by calibration is the smallest, the technique also improves PARENT recall and BLEU score at the same time, making it an easy choice. The other techniques trade recall for precision, making them useful for tasks that require a high degree of faithfulness. When all three components are disabled, the model is exactly the same as our implementation of the Pointer-Generator. Every component improves PARENT precision upon it as well. Especially, comparing Pointer-Generator with “No variational” shows again that joint training with calibration improves all metrics.
We also note that the average length of generations from our confident decoding models is shorter. While heuristics such as a length penalty (wu2016google) exist to encourage longer generation at inference time, appropriately shortening generation is not trivial. In the “Truncated” setting, we truncate each prediction of the Pointer-Generator by two words to match the average length of the Confident Pointer-Generator. The resulting scores show that the PARENT precision of our confident decoding method is not trivially achieved by truncation.
Table 3: Ablation results on the WikiBio test set.

| Model | BLEU | PARENT (Prec. / Rec. / F1) | Avg Len. |
| --- | --- | --- | --- |
| Confident Pointer-Generator (This Work) | 38.10 | 79.52 / 40.60 / 51.38 | 17.0 |
| No base-LM | 39.39 | 78.77 / 41.55 / 52.08 | 17.9 |
| No calibration | 37.89 | 79.47 / 40.47 / 51.26 | 16.9 |
| No variational | 41.29 | 78.25 / 42.40 / 52.52 | 18.9 |
| Pointer-Generator | 41.07 | 77.59 / 42.12 / 52.10 | 19.1 |
| Truncated | 35.50 | 77.68 / 38.16 / 48.66 | 17.1 |
6 Conclusion

In this work, we proposed a confidence-oriented decoder that achieves more faithful generation on the WikiBio dataset than existing state-of-the-art approaches. Our method is general in principle, so it could potentially be adapted to other forms of conditional text generation, such as document summarization and machine translation. Future work could also enhance the confidence score to move beyond shallow alignment toward more complex logical inference (steedman2011combinatory; kamath2018survey).