Controllable Models of Grounded Response Generation
We formalize the problem as follows: given a dialogue context X, lexical control phrases C and grounding sentences G, generate a response R whose semantic content is guided by C. Control phrases can be either directly provided by a user or automatically derived from a content planner. To differentiate derived control phrases from gold or user-provided ones, we denote the former as C'. This new framework, called Controllable Grounded Response Generation (CGRG), assumes a grounded conversational dataset, such as the grounded Reddit dataset of Qin2019CMR. We assume that each data instance consists of a dialogue context, grounding knowledge and a reference response. To analyze this framework, we define a control mechanism that provides one or more control phrases for each instance. To keep the focus on grounding, our user controls are lexical phrases that are relevant to both the target response and some part of the grounding knowledge. Since having humans annotate control phrases is costly and unscalable, we use lexical matching, defining control phrases to be informative n-grams that appear in both the grounding and the reference response. Details of our dataset and its processing are presented in Section Document.
Extensions to GPT-2
GPT-2 is a transformer-based language model trained on large-scale web data Radford2019gpt2 that uses self-attention in which each token attends to the tokens to its left. It is trained with the objective of predicting the next word given all previous words within a defined context window. To apply GPT-2 within CGRG, we concatenate X, C (or C') and G_C to form our input sequence, as shown in Figure Document (left). The model then predicts the next response word given the concatenated input sequence (denoted S) and the previous response tokens in R. G_C is the subset of G that is relevant to C; in this work, G_C consists of the grounding sentences that contain at least one phrase in C. To differentiate the input elements, we insert an end-of-text token at the end of each dialogue utterance in X, a special separator token at the end of each control phrase in C and another at the end of each sentence in G_C. We then concatenate the source sequence S = (w_1, ..., w_n), which is used to generate the target response R = (r_1, ..., r_m), into one long text. The conditional probability of R can be written as the product of conditional probabilities:

p(R | S) = ∏_{k=1}^{m+1} p(r_k | w_1, ..., w_n, r_1, ..., r_{k-1})

where r_{m+1} is an additional end-of-text token indicating the end of generation.
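As a concrete sketch of this input construction: one token sequence is built from the dialogue context, the control phrases and the control-relevant grounding sentences, with segment boundaries marked by separator tokens. The separator names below are illustrative placeholders, not the actual special tokens used in the model.

```python
# Illustrative sketch of assembling the concatenated source sequence.
# Separator strings are placeholders, not the model's real special tokens.
EOT, EOC, EOG = "<end_of_turn>", "<end_of_phrase>", "<end_of_sentence>"

def build_source(dialogue_turns, control_phrases, grounding_sents):
    tokens = []
    for turn in dialogue_turns:        # dialogue context utterances
        tokens += turn.split() + [EOT]
    for phrase in control_phrases:     # control phrases
        tokens += phrase.split() + [EOC]
    for sent in grounding_sents:       # control-relevant grounding sentences
        tokens += sent.split() + [EOG]
    return tokens

source = build_source(
    ["do you know sam"],
    ["university of toronto"],
    ["he finished his phd at university of toronto"],
)
```

The response tokens would then be appended after this source sequence for autoregressive training.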
GPT-2 with Inductive Attention
GPT-2 by default takes a consecutive text sequence as input in order to train a language model. In our problem setting, each input element of X, C and G_C comes in a segmented format. Simply concatenating all these elements as input to GPT-2 can introduce noise, since some segments may not be strongly relevant to one another, and we consider attention links between such segments uninformative. We therefore remove potentially uninformative attention links for each data example by injecting pre-established structural information between C and G_C. For example, in Figure Document (right), suppose C consists of c_1 and G_C consists of g_1 and g_2. If we know c_1 is found only in g_1, then we keep the attention link between c_1 and g_1 but not between c_1 and any other grounding sentence. Since G_C is a set of sentences segmented from G, we remove all cross-sentence links among G_C tokens. Similarly, we remove all links between tokens of non-identical control phrases. Thus, the attention links for each data example are pre-determined by the structural information between C and G_C. To implement this, in each transformer layer we apply an attention mask in which the removed attention links and links to future tokens have value 0 and all others have value 1. We refer to this pre-calculated attention as inductive attention. Each response token still attends to all input tokens and to the response tokens on its left. Denoting the start and end positions of a control phrase c_i in S as b_{c_i} and e_{c_i}, and those of a grounding sentence g_j as b_{g_j} and e_{g_j}, we calculate the attention mask M as follows:

M_{uv} = 1 if v ≤ u and the link (u, v) is kept by the rules above (e.g., for u ∈ [b_{c_i}, e_{c_i}] and v ∈ [b_{g_j}, e_{g_j}], the link is kept iff c_i appears in g_j); M_{uv} = 0 otherwise.
Then for each transformer head, we have the stacked matrices Q, K and V representing each example sequence (concatenated S and R) as in the standard transformer. We calculate the attention as follows (d is the model dimension):

Attention(Q, K, V) = softmax(M ∘ (QK^T / √d)) V
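The mask construction can be sketched as follows, assuming each input token is tagged with a segment id and we are given the set of segment pairs (e.g., a control phrase and the grounding sentence that contains it) whose attention links should be kept. This is a simplified stand-in for the position-based bookkeeping in the text and covers only the input segments, not the response tokens.

```python
def inductive_mask(segment_ids, kept_links):
    """Build an attention mask M (list of lists). M[u][v] = 1 iff token u may
    attend to token v: v must not be a future position, and u, v must lie in
    the same segment or in a pair of segments whose link is kept."""
    n = len(segment_ids)
    M = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1):  # causal: only positions v <= u
            su, sv = segment_ids[u], segment_ids[v]
            if su == sv or (su, sv) in kept_links or (sv, su) in kept_links:
                M[u][v] = 1.0
    return M
```

For instance, if segment 0 is a control phrase c_1 found only in grounding sentence g_1 (segment 1), then `kept_links = {(0, 1)}` keeps the c_1-g_1 link while leaving g_2 (segment 2) disconnected from c_1.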
We experiment with two content planners in order to assess the effectiveness of our models when gold constraints are not provided by users. The first is a simple retrieval-based pipeline: for each test dialogue context, we (i) rank the sentences in G by IDF-weighted word overlap with X; (ii) extract candidate phrases from the top 50 sentences; (iii) take the 2 candidate phrases that appear most frequently across those 50 sentences as C'. To reduce the search space, we use noun phrases only. As there is no need to train such an extraction pipeline, it is applied only at the inference stage. We also experiment with BERT QA as a content planner. We fine-tune a BERT QA model on our training examples, with X as the query, G as the document and C as the answers. We then use the fine-tuned model to predict answers on test examples, take the top 2 answers as predicted control phrases, and drop the second if it overlaps with the first as a string.
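A minimal sketch of the retrieval-based planner follows, with a toy IDF table and a pre-extracted noun-phrase candidate list standing in for the real corpus statistics and the syntactic parser.

```python
from collections import Counter

def idf_overlap(sentence, context, idf):
    """IDF-weighted word overlap between a grounding sentence and the context."""
    shared = set(sentence.lower().split()) & set(context.lower().split())
    return sum(idf.get(token, 0.0) for token in shared)

def plan_controls(context, grounding_sents, noun_phrases, idf, top_k=50):
    """Rank grounding sentences by IDF-weighted overlap with the context, then
    return the 2 candidate noun phrases most frequent in the top sentences."""
    ranked = sorted(grounding_sents,
                    key=lambda s: idf_overlap(s, context, idf),
                    reverse=True)[:top_k]
    counts = Counter(p for s in ranked for p in noun_phrases if p in s.lower())
    return [phrase for phrase, _ in counts.most_common(2)]
```

Because the pipeline needs no training, it can be run directly over test-time grounding documents.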
We use the grounded Reddit conversation dataset described in Qin2019CMR, which features Reddit conversations about web pages such as news stories and Wikipedia articles, and covers diverse topics (178 subreddit topics ranging from news/technology to literature/music) and writing styles. As a social media aggregator, Reddit is akin to multiple datasets. To make this dataset support controllable text generation, we apply the following pipeline to extract control phrases: we match each n-gram in the reference response against each grounding sentence. To ensure that control phrases are informative, we set an IDF threshold (8.5) for unigrams. When two n-grams are identical except for an added function word or punctuation mark, we keep only the shorter version. In addition, we remove matched n-grams that appear in the dialogue context, as we argue that new words are more informative. For each data instance, the remaining matched n-gram(s) serve as control phrases. We use crowdsourced workers to annotate whether the extracted control phrases are central to the reference response, given the dialogue context. For each response, 3 judges entered ratings on a 1-6 scale, and we calculate the average score. In 2000 annotated examples, the median score was 4.33 and 67.4% of examples had a score over 4. Inter-rater agreement was “fair”, with a Krippendorff’s alpha coefficient of 0.32. We keep only examples where at least one matched phrase can be found. Such strict lexical matching between target response and grounding ensures that the kept examples have a high ratio of grounding utilization, which fits one focus of this work: leveraging grounding in response generation. After this processing, the number of utterances in train, dev and test is reduced from 2.36M, 0.12M and 0.34M to 390K, 6.7K and 21K respectively, and the average length of reference responses increases from approximately 18.5 to 26.5 tokens.
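The n-gram matching heuristic can be sketched as below. The 8.5 unigram IDF threshold comes from the text; the maximum n-gram length, tokenization and the near-duplicate/punctuation handling are simplifying assumptions.

```python
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_controls(response, grounding_sents, context, idf,
                     max_n=4, unigram_idf_threshold=8.5):
    """Keep n-grams occurring in both the response and the grounding, filter
    low-IDF unigrams, and drop n-grams already in the dialogue context.
    max_n=4 is an assumption; the paper does not state the range here."""
    resp_tokens = response.lower().split()
    grounding = " " + " ".join(grounding_sents).lower() + " "
    context = " " + context.lower() + " "
    matched = set()
    for n in range(1, max_n + 1):
        for gram in ngrams(resp_tokens, n):
            padded = " " + gram + " "   # pad to match whole words only
            if padded in grounding and padded not in context:
                if n > 1 or idf.get(gram, 0.0) >= unigram_idf_threshold:
                    matched.add(gram)
    return matched
```

The shorter-of-near-duplicates rule and stop-word handling from the text are omitted for brevity.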
The average number of phrases in C for the train, dev and test sets is 1.32, 1.27 and 1.38 respectively, and the average number of sentences in G_C is 4.37, 4.32 and 4.25 respectively. We use up to 3 dialogue turns in all experiments.
Training and Inference Setup
In our GPT-2 baseline and Inductive Attention (GPT2IA) models, each input token has both a type and a positional embedding. We treat X, each sentence in G_C, each phrase in C and the response R as separate segments. We set the maximum number of sentences in G_C to 20 and the maximum number of phrases in C to 10, giving type id “0” for X, “1-20” for G_C sentences, “21-30” for C phrases and “31” for response tokens. For each segment, the position embedding of a token is its position within that segment. We use the small version of GPT-2 with 117M parameters and cap the length of the input or target response sequence at 512. We use BPE tokenization, following GPT-2. We train our model and all other GPT-2-based baselines on top of DialoGPT zhang2019dialogpt, a conversational response generation model trained on 147M Reddit comment chains on the basis of GPT-2. None of its Reddit training or validation examples overlap with our test examples. We use batch size 32; the learning rate and warmup steps are tuned on the validation set. We use greedy search as the decoding strategy for all GPT-2 and GPT2IA setups, except for a single experiment setting where grid beam search (GBS) hokamp2017lcd is applied for comparison with lexically constrained decoding. The goal of this comparison is to investigate whether it helps to encode the constraints into the hidden state during both training and inference, since GBS uses lexical constraints only during inference.
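The type-id assignment can be sketched as follows for a concatenated example laid out as context, then control phrases, then grounding sentences, then response (the concatenation order is an assumption; the ids themselves follow the text).

```python
def type_ids(n_context, phrase_lens, sent_lens, n_response,
             max_sents=20, max_phrases=10):
    """Assign a segment-type id to every token: 0 for the dialogue context,
    1-20 for grounding sentences, 21-30 for control phrases, 31 for the
    response. Lists beyond the segment caps are truncated."""
    ids = [0] * n_context
    for i, length in enumerate(phrase_lens[:max_phrases]):
        ids += [max_sents + 1 + i] * length            # 21, 22, ...
    for i, length in enumerate(sent_lens[:max_sents]):
        ids += [1 + i] * length                        # 1, 2, ...
    ids += [max_sents + max_phrases + 1] * n_response  # 31
    return ids
```

Per-segment position embeddings would restart from 0 at each segment boundary, as described above.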
We conduct experiments to draw insights from comparisons of different response generation models and input settings. We evaluate our models under the following settings:

X: The standard setting for non-controllable response generation, where only the dialogue context is given. We conduct experiments for the state-of-the-art generation model GPT-2.

X + G: The standard setting for grounded response generation. We compare two models: CMR Qin2019CMR and GPT-2. CMR is the state-of-the-art grounded response generation model, combining an MRC model with an LSTM decoder. GPT-2 for this setting concatenates X and G as its input. Note that since both models limit input sequence length, only a randomly chosen subset of grounding sentences is fed into each model.

X + C: The controllable response generation setting without grounding. We conduct experiments for GPT-2 by concatenating X and C.

X + G_C: This setting measures how grounding relevant to C can help with response generation, without explicitly providing C. We conduct experiments for GPT-2 by concatenating X and G_C as the input.

X + C + G_C: This setting measures how grounded control can help with response generation. We conduct experiments for GPT-2 and GPT2IA by concatenating X, C and G_C as the input.

X + C + G_C (GBS): This setting allows comparison against existing constrained generation methods such as grid beam search (GBS) hokamp2017lcd, where lexical control phrases are used in decoding only, without involving training. We conduct experiments for GPT-2 where X and G_C are the only encoded inputs and C is applied only during decoding with GBS.

To provide more insight into the experiment scores, we also evaluate human responses as a ‘system’. This is possible because we use a multi-reference test set with 3.3k unique test dialogue contexts. For each test dialogue context, we retain up to 6 references and set aside one of them for evaluation, so the “human response” can be evaluated against the remaining references in automatic evaluation. To ensure comparability, all systems are evaluated against the same 5 references. For each evaluation metric, we report the highest score among the 5 references.
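The max-over-references scoring above can be sketched as follows, with a toy word-overlap metric standing in for BLEU/NIST.

```python
def word_overlap(hyp, ref):
    """Toy similarity: Jaccard overlap of word sets (stand-in for BLEU/NIST)."""
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

def multi_ref_score(hyp, references, metric=word_overlap):
    """Report the best score a hypothesis achieves against any reference."""
    return max(metric(hyp, ref) for ref in references)
```

Taking the maximum rather than the mean credits a response that matches any one valid reference.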
We experiment with both user-controllable and automatic response generation, using gold and predicted control phrases from a content planner respectively. As different reference responses incorporate different gold control phrases, we use single-reference evaluation for the user-controllable setting. Predicted control phrases are independent of reference responses, so we can use multi-reference evaluation in the automatic generation setting. For automatic evaluation, we measure the overall relevance of the generated responses with metrics including BLEU-4 Papineni2002bleu and NIST-4 Doddington2002nist. NIST is a variant of BLEU that weights n-gram matches by their information gain, penalizing uninformative n-grams. We measure the diversity of n-grams in generated responses with the ratio between the number of distinct n-grams and the total number of n-grams. Previous work li2016persona, sun2019ssim has shown that automatic metrics for generation can sometimes be unreliable, and that response generation generally achieves low absolute metric scores. Accordingly, our main conclusions are based on human evaluations (Section Document). Nevertheless, we find that our automatic evaluation results comport well with our human evaluations. To give a sense of how control phrases help enforce the specificity level of generation, in the user-controllable setting we report the control phrase inclusion rate, namely the percentage of gold control phrases included in the generated responses. However, a lower inclusion rate does not necessarily indicate worse performance in satisfying the user’s control request, since we treat lexical control phrases as soft semantic guidance rather than hard constraints.
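The diversity and inclusion-rate metrics can be sketched as:

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to total n-grams over generated responses."""
    total, seen = 0, set()
    for response in responses:
        tokens = response.split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / max(total, 1)

def inclusion_rate(responses, controls_per_response):
    """Fraction of gold control phrases included in generated responses."""
    hits = total = 0
    for response, controls in zip(responses, controls_per_response):
        for phrase in controls:
            total += 1
            hits += phrase in response
    return hits / max(total, 1)
```

Both metrics are corpus-level: distinct-n pools n-grams across all responses, and inclusion rate pools control phrases across all examples.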
Human evaluation was conducted using crowdsourced workers. We measured relevance and appropriateness to the preceding dialog, and consistency with the background text (as a metric of factual correctness). Judges were presented with paired, randomized outputs from each system. The document title, a short snippet of the document and up to two conversational turns were provided as context. Judgments were entered on a five-point Likert scale, and ties were permitted. Three to four judges evaluated each pair, and quality controls were imposed to block poorly performing judges. Inter-rater agreement was “fair”, with a Krippendorff’s alpha coefficient of 0.32. Sample sizes vary: the number was reduced from an initial 1,000 when we automatically removed instances where egregiously offensive content rendered them inappropriate to display to judges.
Results and Analysis
Controllable Response Generation
We focus here on analyzing the user-controllable grounded response generation framework, using single-reference evaluation. In Table Document, lines 1-3 are non-controllable settings without control phrases as input, while lines 4-8 take control phrases as input, either explicitly or implicitly. The large performance gap between lines 1-3 and lines 4-8 demonstrates the value of adding control. More importantly, we can draw the following conclusions by comparing rows in Table Document: (i) 1 vs. 3: simply adding grounding to the model input improves performance only to a limited extent; (ii) 2 vs. 3: GPT-2 generally performs better than the state-of-the-art grounded model CMR, indicating that the combination of pre-training and a transformer-based decoder helps generation; (iii) 4 vs. 7-8: providing constraint-sensitive grounding boosts performance compared to providing all of the grounding; (iv) 5 vs. 7-8: providing control phrases explicitly is important; (v) 6 vs. 7-8: applying control in the hidden states helps the model generate better responses than applying control at decoding time only; (vi) 7 vs. 8: inductive attention helps reduce noise and improves the performance of GPT-2. Although the comparison of line 6 vs. lines 7-8 shows that applying control in the hidden states is more effective than strict constraints at decoding, controls at the training and decoding stages could be complementary; we leave methods of combining them for future research. Human evaluation results in Table Document show that X + C + G_C with GPT2IA outperforms the other systems, except on Consistency, where there is no statistical difference between X + C + G_C with GPT2IA and X + C + G_C with GPT-2, both grounded systems.
Human evaluation questions:
Relevance: Which response is more relevant and appropriate to the preceding dialog?
Consistency: Which response is more consistent with the grounding text?
(Judged preference split between the two compared systems: 28.1% vs. 27.6%, with 44.3% ties.)
Content-Planned Response Generation
In a fully automatic conversation scenario, we propose to have a content planner predict control phrases in order to leverage our framework for automatic response generation. Table Document compares settings where no control phrases and predicted control phrases (C') are provided to the model. We observe that both the retrieval-based and the BERT QA based content planners achieve good results in terms of NIST and Div-2. (These are the methods presented previously in Section Document.) We also provide evaluation results on the held-out human response in Table Document, which indicate an upper bound for this task. As described in Section Document, we conduct multi-reference evaluation for the predicted control phrase setting.
As an intermediate assessment of the content planners, we report the Precision, Recall and F1 of the tokens in C and C' with respect to reference responses (stop-words and punctuation tokens are excluded from the counts) in Table Document. For each test dialogue context, we calculate these values against the reference response that gives the highest F1 score, and report the average over all test examples for each metric. We find that the retrieval-based content planner predicts slightly better phrases than BERT QA, though both remain worse than the gold control phrases from the held-out human response.
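The token-level Precision/Recall/F1 between predicted control phrases and a reference response can be sketched as below (stop-word and punctuation removal omitted for brevity).

```python
def token_prf(predicted_phrases, reference_response):
    """Precision/recall/F1 of predicted control-phrase tokens against the
    tokens of a reference response."""
    pred = set(t for phrase in predicted_phrases for t in phrase.split())
    ref = set(reference_response.split())
    tp = len(pred & ref)
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(ref), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1
```

In the multi-reference setup described above, this would be computed against every reference and the best-F1 reference kept.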
To understand how grounding knowledge assists generation, we plot the token-level probability (Figure Document) for both the X + G_C and X + C + G_C systems. We intentionally select an example about an uncommon entity to eliminate the possibility that the knowledge was captured during pre-training. The figure shows the token-level probability of a potential response, given the dialogue context Do you know the education background of the new faculty, Sam?, control phrases University of Toronto and neural networks, and grounding sentences Sam got his bachelor degree in Physics at University of Science and Technology of China. He spent 6 months at University of Tokyo in Japan as a visiting student, when he was a master student in Computer Science at University of Hong Kong from 2010-2012. And he finished his PhD at University of Toronto in Canada with his research focused on interpretability of neural networks on text generation in 2017. The grounded model assigns higher probabilities to contextual words from the grounding, such as graduated and thesis, as well as to factually correct entity tokens like 2017, and lower probability to factually incorrect tokens such as economics. These observations suggest that grounding knowledge can help controllable generation in two ways: (i) contextualizing control phrases; (ii) distinguishing correct from incorrect facts.
Figure Document illustrates another example analyzing the functions of control and grounding in generation. We list the top 6 tokens following a partial response, given the same dialogue context and grounding and the control phrase Canada. The ungrounded, non-controllable model distributes probability roughly equally across commonly known American state names after University of. Adding grounding helps the model rank locations based on the background knowledge. Further adding control helps the model locate the correct or intended answer.
To quantify the observations in Figures Document and Document, we sample 100 test examples and randomly pick an entity from each reference response (restricted to entities not in the control phrases) to calculate that entity’s probability under each model. We then calculate the average probability ratios (X + G_C) / (X + C + G_C) and (X + C) / (X + C + G_C) to be 0.773 and 0.886 respectively. Both are smaller than 1.0, which indicates that having both grounding and control phrases assigns higher probability to correct entities than having grounding or control phrases alone. Explicit control phrases can also be leveraged to dissect the generation process: Table Document shows how controls may guide or perturb the GPT2IA model to produce responses with diverging semantics, and Table Document provides more sample outputs from the different systems.
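The probability-ratio computation can be sketched as:

```python
def avg_prob_ratio(ablated_probs, full_probs):
    """Average of p_ablated(entity) / p_full(entity) over sampled entities.
    A value below 1.0 means the full model (grounding + control) assigns
    higher probability to the correct entities on average."""
    ratios = [a / f for a, f in zip(ablated_probs, full_probs) if f > 0]
    return sum(ratios) / len(ratios)
```

Here `ablated_probs` holds each entity's probability under the model missing either grounding or control, and `full_probs` under the model with both.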
Grounded Response Generation
Although some relevant work draws on external knowledge sources, none incorporates user control. Ghazvininejad2018AKN develop a memory-network-based model that leverages grounding information from Foursquare tips. Moghe2018TowardsEB and zhou2018dataset collect movie discussion datasets via crowdsourcing; these are limited to specific domains. Dinan2018WizardOW crowdsource conversations in which each utterance is grounded in at most one selected Wikipedia sentence. We focus on a more realistic, scalable setting, in which a response may blend multiple pieces of grounding information rather than rephrase a single factual sentence. Other researchers propose a copy mechanism to import tokens from both dialogue context and grounding Yavuz2018DeepCopy, or leverage a reading comprehension model to co-encode dialogue context and grounding knowledge Qin2019CMR. Other work incorporates relational knowledge bases zhu2017flexible, liu2018diffusion or commonsense knowledge graphs young2018augment into conversational models. More recently, liu2019kgconv develop a graph-path-based method on knowledge graphs augmented with unstructured grounding. Our present work focuses on text-based grounding knowledge and does not require pre-constructed knowledge graphs.
Prior work on machine translation and language generation has sought to enforce user-specified constraints, primarily in the form of lexical constraints hokamp2017lcd, hu2019ParaBank, hu-etal-2019-improved, miao2019cgmh. These approaches exploit constraints at inference time only; in our case, constraints are applied during training, with the option of also being applied at inference. Applying constraints during training enables them to be incorporated into the latent space for better predictions. Other related work [14, 10, 15] explores non-lexical constraints but does not examine how these could facilitate the use of grounding and external knowledge. We see this line of research as complementary to ours; these papers also assume that (gold) constraints can always be given to the system, which limits their potential to demonstrate broader benefits. To address this concern, we also evaluate our models in settings where gold constraints are unavailable (e.g., based on predicted constraints produced by a content planner). Controllable text generation has also been employed in text style transfer and other tasks [3, 2, 5] to disentangle high-level style information from contextual information, such that the style information can be independently manipulated. Other work uses discrete latent actions to learn an interpretable representation for task-oriented dialogue systems. While these works use “style” labels (e.g., positive/negative, formal/informal) as the controlling signal, our framework controls generation with specific lexical constraints, allowing fine-grained semantic control.
The CGRG framework allows users to inject soft semantic control into the generation process. It incorporates grounding to contextualize users’ semantic intents and to boost information reliability. We introduce an inductive attention mechanism for self-attention-based generation models such as GPT-2 in order to boost their performance. We also demonstrate that this framework can benefit standard automatic response generation when integrated with a content planner. Interesting future directions include exploring other types of user-desired control and extending the controllable grounded generation concept to broader generation tasks such as document writing assistance.
We thank members of Microsoft Research and University of Washington’s NLP groups who provided feedback and insights to this work.
- (2019) Wizard of Wikipedia: knowledge-powered conversational agents. In ICLR.
- (2017) Learning to generate product reviews from attributes. In Proc. of EACL.
- (2017) Controlling linguistic style aspects in neural language generation. In Proc. of EMNLP.
- (2019) Jointly optimizing diversity and relevance in neural response generation. In Proc. of NAACL.
- (2019) Structuring latent spaces for stylized response generation. In Proc. of EMNLP.
- (2018) A knowledge-grounded neural conversation model. In Proc. of AAAI.
- (2017) Lexically constrained decoding for sequence generation using grid beam search. In Proc. of ACL.
- (2020) The curious case of neural text degeneration. In ICLR.
- (2017) Toward controlled generation of text. In Proc. of ICML.
- (2019) CTRL: a conditional transformer language model for controllable generation. arXiv:1909.05858.
- (2016) A diversity-promoting objective function for neural conversation models. In Proc. of NAACL, pp. 110–119.
- (2019) Conversing by reading: contentful neural conversation with on-demand machine reading. In Proc. of ACL.
- (2019) Language models are unsupervised multitask learners. OpenAI Blog.
- (2019) What makes a good conversation? How controllable attributes affect human judgments. In Proc. of NAACL.
- (2019) Target-guided open-domain conversation. In Proc. of ACL.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30.
- (2019) Defending against neural fake news. In NeurIPS.
- (2020) DialoGPT: large-scale generative pre-training for conversational response generation. In Proc. of ACL (demo).
- (2018) Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proc. of ACL.