A Controllable Model of Grounded Response Generation

by Zeqiu Wu, et al.

Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process. Such control is essential to ensure that users' semantic intents are satisfied and to impose a degree of specificity on generated outputs. Attempts to boost informativeness alone come at the expense of factual accuracy, as attested by GPT-2's propensity to "hallucinate" facts. While this may be mitigated by access to background knowledge, there is scant guarantee of relevance and informativeness in generated responses. We propose a framework that we call controllable grounded response generation (CGRG), in which lexical control phrases are either provided by a user or automatically extracted by a content planner from dialogue context and grounding knowledge. Quantitative and qualitative results show that, using this framework, a GPT-2 based model trained on a conversation-like Reddit dataset outperforms strong generation baselines.



Controllable Models of Grounded Response Generation

We formalize the problem as follows: given a dialogue context X, lexical control phrases C and grounding sentences G, generate a response R that contains semantic information guided by C. Control phrases can be either directly provided by a user or automatically derived from a content planner. To differentiate derived control phrases from gold or user-provided C, we denote the derived phrases as C'. This new framework, called Controllable Grounded Response Generation (CGRG), assumes a grounded conversational dataset, such as in [12]. We assume that each data instance consists of a dialogue context, grounding knowledge and a reference response. To analyze this framework, we define a control mechanism that yields one or more control phrases for each instance. To keep the focus on grounding, our user controls are lexical phrases that are relevant to both the target response and some part of the grounding knowledge. Since it is costly and unscalable to have humans annotate control phrases, we use lexical matching, defining control phrases to be informative n-grams that appear in both the grounding and the reference response. Details of our dataset and its processing are presented below.



Figure: GPT-2 considers all possible forward attentions, which can overwhelm the model when the input contains dialogue context (X), grounding (G), and constraints (C). Inductive attention, in contrast, keeps only the attention links that are relevant to the constraints. Dashed arrows indicate which token-level attentions are preserved. Sparsely connected attention is implemented with a mask on all hidden layers.

Extensions to GPT-2

GPT-2 is a transformer-based language model trained on large-scale web data Radford2019gpt2; it uses self-attention in which each token attends to the tokens to its left. It is trained with the objective of predicting the next word given all previous words within a defined context window. To apply GPT-2 within CGRG, we concatenate X, C (or C') and G_C to form our input sequence, as shown in the figure above (left). Here G_C is the subset of G that is relevant to C; in this work, we take G_C to be the grounding sentences that contain any phrase in C. We then have the model predict each response word given the concatenated input sequence and the previous response tokens. To differentiate the input elements, we insert an end-of-text token at the end of each dialogue utterance in X, a separator token at the end of each control phrase in C and another at the end of each sentence in G_C. We then concatenate the input sequence and the response sequence into one long text. We denote the source sequence as S = (w_1, …, w_n), which is used to generate the target response R = (r_1, …, r_m).

The conditional probability of R given S can be written as the product of conditional probabilities:

p(R | S) = ∏_{k=1}^{m+1} p(r_k | w_1, …, w_n, r_1, …, r_{k-1})

where r_{m+1} is an additional end-of-text token indicating the end of generation.
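The factorization above can be sketched in a few lines of code. This is an illustrative sketch, not the authors' implementation: the separator token names are assumptions, and `next_token_prob` stands in for any trained language model.

```python
import math

# Build the CGRG input sequence S = X + C + G_C and score the response R
# autoregressively, summing log-probs only over response positions,
# mirroring p(R|S) = prod_k p(r_k | S, r_1..r_{k-1}).
EOT, SEP = "<|endoftext|>", "<|sep|>"  # hypothetical separator names

def build_sequence(context_turns, control_phrases, grounding_sents):
    seq = []
    for turn in context_turns:
        seq += turn + [EOT]           # end-of-text after each dialogue turn
    for phrase in control_phrases:
        seq += phrase + [SEP]         # separator after each control phrase
    for sent in grounding_sents:
        seq += sent + [SEP]           # separator after each grounding sentence
    return seq

def response_log_prob(source, response, next_token_prob):
    """Sum log p(r_k | source, r_1..r_{k-1}); next_token_prob is any model."""
    total, prefix = 0.0, list(source)
    for tok in response + [EOT]:      # r_{m+1} is the end-of-text token
        total += math.log(next_token_prob(prefix, tok))
        prefix.append(tok)
    return total
```

With a uniform model over a 10-word vocabulary, a 3-token response scores 4·log(0.1), since the closing end-of-text token is also predicted.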

GPT-2 with Inductive Attention

GPT-2 by default takes a consecutive text sequence as its input in order to train a language model. In our problem setting, each input element of X, C and G_C comes in a segmented format. Simply concatenating all these input elements into a GPT-2 model can introduce noise, as some segments may not be strongly relevant to one another, and we consider attention links between such segments to be uninformative. We remove potentially uninformative attention links for each data example by injecting pre-established structural information between C and G_C. For example, in the figure above (right), say that C consists of C1, and G_C consists of G1 and G2. If we know C1 is only found in G1, then we want to keep the attention link between C1 and G1, and not between C1 and any other grounding sentence. Since G_C is a set of segmented sentences from G, we remove all cross-sentence links among G_C tokens. Similarly, we remove all links between non-identical phrases in C. Thus, the attention links for each data example are pre-determined by the structural information between C and G_C. To implement this, in each transformer layer we apply an attention mask in which the removed attention links and links to future tokens have value 0 and all others have value 1. We refer to this pre-calculated attention as inductive attention. Each response token still attends to all input tokens and all response tokens to its left. Denoting the start and end positions of a control phrase c in C as s_c and e_c, and those of a grounding sentence g in G_C as s_g and e_g, the attention mask M over input positions is: M_ij = 1 if position j is visible to position i (that is, j ≤ i, and i and j do not fall in two different grounding sentences, two different control phrases, or a phrase-sentence pair where the phrase does not occur in the sentence); M_ij = 0 otherwise.
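The mask construction can be sketched as follows. This is an assumed helper, not the authors' code: segment spans and the phrase-to-sentence mapping are passed in explicitly, and response positions are simply any positions not covered by an input segment.

```python
import numpy as np

# Build the inductive-attention mask: start from a causal (lower-triangular)
# mask, then zero out links between different grounding sentences, between
# different control phrases, and between a phrase and a sentence that does
# not contain it. Response tokens keep all left attentions.

def inductive_mask(n, context_span, phrase_spans, sent_spans, phrase_in_sent):
    """n: total length; spans are (start, end) with end exclusive;
    phrase_in_sent[p] is the set of sentence indices containing phrase p."""
    seg = {}  # position -> ("x",) / ("c", p) / ("g", s); others are response
    for i in range(*context_span):
        seg[i] = ("x",)
    for p, (s, e) in enumerate(phrase_spans):
        for i in range(s, e):
            seg[i] = ("c", p)
    for s_idx, (s, e) in enumerate(sent_spans):
        for i in range(s, e):
            seg[i] = ("g", s_idx)

    M = np.tril(np.ones((n, n)))          # causal mask: attend left only
    for i in range(n):
        for j in range(i):
            a, b = seg.get(i), seg.get(j)
            if a is None or b is None:    # response tokens: keep all links
                continue
            if a[0] == "g" and b[0] == "g" and a[1] != b[1]:
                M[i, j] = 0               # cross-sentence link removed
            if a[0] == "c" and b[0] == "c" and a[1] != b[1]:
                M[i, j] = 0               # link between different phrases
            if {a[0], b[0]} == {"c", "g"}:
                p, s = (a[1], b[1]) if a[0] == "c" else (b[1], a[1])
                if s not in phrase_in_sent[p]:
                    M[i, j] = 0           # phrase attends only to its sentence
    return M
```

For a 7-token example with a 2-token context, two one-token phrases, two one-token sentences (phrase 0 in sentence 0, phrase 1 in sentence 1) and one response token, the mask removes phrase-phrase, sentence-sentence and mismatched phrase-sentence links while the response token keeps full left attention.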

Then for each transformer head, we have the stacked matrices Q, K and V representing each example sequence (concatenated input and response) as in [16]. We calculate the attention as follows, where d is the model dimension and ∘ denotes elementwise application of the mask:

Attention(Q, K, V) = softmax(M ∘ (QK^T / √d)) V
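A minimal numpy sketch of this masked single-head attention follows. Note one practical point, which is an implementation detail rather than something stated in the text: the mask is realized by setting disallowed logits to a large negative value before the softmax, since multiplying a logit by 0 would wrongly turn it into a "neutral" score of 0 rather than removing it.

```python
import numpy as np

# Masked single-head attention realizing softmax(M o QK^T/sqrt(d)) V,
# with masked positions pushed to -1e9 before the softmax so they
# receive (numerically) zero attention weight.

def masked_attention(Q, K, V, M):
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)           # raw attention scores
    logits = np.where(M > 0, logits, -1e9)  # drop removed / future links
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

With identical queries and keys and a causal mask, row k of the output is simply the average of the first k+1 value rows, which makes the masking easy to verify by hand.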

Content Planner

We experiment with two content planners in order to assess the effectiveness of our models when gold constraints are not provided by users. The first is a simple retrieval-based pipeline: for each test dialogue context, we (i) rank the sentences in G by IDF-weighted word overlap with X; (ii) extract candidate phrases from the top 50 sentences; (iii) take the 2 phrases that appear most frequently in those 50 sentences as C'. To reduce the search space, we use noun phrases only. As this extraction pipeline requires no training, it is applied only at inference time. We also experiment with BERT QA as a content planner. We fine-tune a BERT QA model on our training examples, with X as the query, G as the document and C as answers. We then use the fine-tuned model to predict answers on test examples, taking the top 2 answers as predicted control phrases C' and dropping the second if it overlaps with the first as a string.
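The retrieval-based pipeline can be sketched as below. This is a hedged stand-in, not the paper's code: real noun-phrase extraction would use a chunker, whereas `candidate_phrases` here simply emits longer alphabetic unigrams.

```python
import math
from collections import Counter

# (i) rank grounding sentences by IDF-weighted word overlap with the
# dialogue context; (ii)-(iii) pick the most frequent candidate phrases
# among the top-ranked sentences as the predicted control phrases C'.

def idf_table(sentences):
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s.split()))
    return {w: math.log(n / df[w]) for w in df}

def plan_phrases(context, grounding, top_sents=50, top_phrases=2):
    idf = idf_table(grounding)
    ctx = set(context.split())
    ranked = sorted(grounding,
                    key=lambda s: sum(idf.get(w, 0.0)
                                      for w in set(s.split()) & ctx),
                    reverse=True)[:top_sents]
    counts = Counter(w for s in ranked for w in candidate_phrases(s))
    return [p for p, _ in counts.most_common(top_phrases)]

def candidate_phrases(sentence):
    # stand-in for noun-phrase chunking: keep longer alphabetic tokens
    return [w for w in sentence.split() if w.isalpha() and len(w) > 3]
```

Because the pipeline needs no training, it is naturally a test-time-only component, matching the description above.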


Dataset

We use the grounded Reddit conversation dataset described in Qin2019CMR, which features Reddit conversations about web pages such as news stories and Wikipedia articles, and covers diverse topics (178 subreddit topics ranging from news/technology to literature/music) and writing styles. As a social media aggregator, Reddit is akin to multiple datasets in one. To make this dataset support controllable text generation, we apply the following pipeline to extract control phrases: we match each n-gram in the reference response against each grounding sentence. To ensure a degree of informativeness in the control phrases, we set an IDF threshold (8.5) for unigrams. When two n-grams are identical except for an added function word or punctuation mark, we keep only the shorter version. In addition, we remove matched n-grams that appear in the dialogue context, on the grounds that new words are more informative. For each data instance, the remaining matched n-gram(s) serve as control phrases. We use crowdsourced workers to annotate whether the extracted control phrases are central to the reference response, given the dialogue context. For each response, three judges entered ratings on a 1-6 scale, and we computed the average score. Over 2,000 annotated examples, the median score was 4.33 and 67.4% of examples had a score over 4. Inter-rater agreement was "fair", with Krippendorff's alpha at 0.32. We keep only examples where at least one matched phrase can be found. This strict lexical matching between target response and grounding ensures that the kept examples have a high ratio of grounding utilization, which fits one focus of this work: leveraging grounding in response generation. After processing, the number of utterances in train, dev and test drops from 2.36M, 0.12M and 0.34M to 390K, 6.7K and 21K respectively, and the average length of reference responses increases from approximately 18.5 to 26.5 tokens. The average number of phrases in C for the train, dev and test sets is 1.32, 1.27 and 1.38 respectively; the average number of sentences in G_C is 4.37, 4.32 and 4.25. We use up to 3 dialogue turns in experiments.
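The extraction pipeline above can be sketched as follows. This is a hedged sketch, not the released preprocessing code: the IDF values, the 8.5 threshold and the function-word list are illustrative, and the "prefer the shorter version" rule is approximated with a token-set comparison.

```python
# Keep n-grams shared by the reference response and the grounding,
# apply an IDF threshold to unigrams, drop matches already seen in the
# dialogue context, and prefer the shorter of two matches that differ
# only by function words.

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "is", "in", "and"}

def ngrams(tokens, max_n=4):
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def extract_control_phrases(response, grounding_sents, context, idf,
                            unigram_idf_threshold=8.5):
    grounding = set()
    for sent in grounding_sents:                 # match against each sentence
        grounding |= ngrams(sent)
    matched = ngrams(response) & grounding
    context_grams = ngrams(context)
    kept = set()
    for g in matched:
        if len(g) == 1 and idf.get(g[0], 0.0) < unigram_idf_threshold:
            continue                             # uninformative unigram
        if g in context_grams:
            continue                             # already in dialogue context
        kept.add(g)
    # drop an n-gram if a kept sub-phrase differs from it only by function words
    return {g for g in kept
            if not any(h != g and set(h) <= set(g)
                       and set(g) - set(h) <= FUNCTION_WORDS
                       for h in kept)}
```

For a response "the periodic table is great" grounded in "the periodic table legacy", the sketch keeps "periodic table" but drops "the periodic table" (longer only by a function word) and the low-IDF unigram "the".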

Experimental Setup

Training and Inference Setup

In our GPT-2 baseline and GPT-2 with Inductive Attention (GPT2IA) models, each input token has both a type embedding and a positional embedding. We treat X, each sentence in G_C, each phrase in C and the response as separate segments. We set the maximum number of sentences in G_C to 20 and the maximum number of phrases in C to 10; the type ids are then "0" for X, "1-20" for G_C sentences, "21-30" for C phrases and "31" for response tokens. For each segment, the position embedding of each token is its position within that segment. We use the small version of GPT-2 with 117M parameters, with the maximum length of the input or target response sequence set to 512, and BPE tokenization following GPT-2. We train our model and all other GPT-2-based baselines on top of DialoGPT zhang2019dialogpt, a conversational response generation model trained on 147M Reddit comment chains on the basis of GPT-2; none of its Reddit training or validation examples overlap with our test examples. We use batch size 32; learning rate and warmup steps are tuned on the validation set. We use greedy search as the decoding strategy for all GPT-2 and GPT2IA setups, except for a single experiment setting where grid beam search (GBS) hokamp2017lcd is applied for comparison with lexically constrained decoding. The goal of this comparison is to investigate whether it helps to encode the constraints into the hidden state during both training and inference, as GBS uses lexical constraints only during inference.
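The segment id scheme can be sketched as a small helper (an assumption for illustration, not the released code): type 0 for the dialogue context X, types 21-30 for up to 10 control phrases in C, types 1-20 for up to 20 grounding sentences in G_C, type 31 for response tokens, with positions restarting inside each segment.

```python
MAX_SENTS, MAX_PHRASES = 20, 10

def segment_ids(context, phrases, sents, response):
    """Return per-token type ids and within-segment position ids."""
    type_ids, pos_ids = [], []

    def add(tokens, type_id):
        type_ids.extend([type_id] * len(tokens))
        pos_ids.extend(range(len(tokens)))       # position within the segment

    add(context, 0)                              # X -> type 0
    for i, p in enumerate(phrases[:MAX_PHRASES]):
        add(p, 1 + MAX_SENTS + i)                # C phrases -> types 21-30
    for i, s in enumerate(sents[:MAX_SENTS]):
        add(s, 1 + i)                            # G_C sentences -> types 1-20
    add(response, 1 + MAX_SENTS + MAX_PHRASES)   # response -> type 31
    return type_ids, pos_ids
```

The per-segment restart means two tokens in different grounding sentences can share a position id but are still distinguished by their type ids.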

Evaluated Systems

We conduct experiments to draw insights from comparing different response generation models and input settings. We evaluate our models under the following settings:

X: The standard setting for non-controllable response generation, where only the dialogue context is given. We conduct experiments with the state-of-the-art generation model GPT-2.

X+G: The standard setting for grounded response generation. We compare two models: CMR Qin2019CMR and GPT-2. CMR is the state-of-the-art grounded response generation model, combining an MRC model with an LSTM decoder. GPT-2 in this setting concatenates X and G as its input. Note that, as both models have an input-length limit, only a randomly chosen subset of grounding sentences is fed into each model.

X+C: The controllable response generation setting without grounding. We conduct experiments with GPT-2, concatenating X and C as the input.

X+G_C: This setting measures how grounding relevant to C alone can help response generation, without explicitly providing C. We conduct experiments with GPT-2, concatenating X and G_C as the input.

X+G_C+C: This setting measures how grounded control can help response generation. We conduct experiments with GPT-2 and GPT2IA, concatenating X, G_C and C as the input.

X+G_C+C (GBS): This setting enables comparison against existing constrained generation methods such as grid beam search (GBS) hokamp2017lcd, where lexical control phrases are added at decoding only, without involving training. We conduct experiments with GPT-2, where X and G_C are the only encoded inputs and C is applied only at decoding time with GBS.

To provide more insight into the experimental scores, we also evaluate human responses as a 'system'. This is possible because we use the multi-reference test set of [12] with 3.3k unique test dialogue contexts. For each test dialogue context, we retain up to 6 references and set aside one of them for evaluation, so the "human response" can be evaluated against the remaining references for automatic evaluation. To ensure comparability, all systems are evaluated against the same 5 references. For each evaluation metric, we report the highest score among the 5 references.
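The "highest score among references" protocol can be sketched as follows; `unigram_f1` is an illustrative stand-in for the actual metrics (BLEU, NIST), not something used in the paper.

```python
# Score a hypothesis against every reference and report the best score,
# as in multi-reference evaluation described above.

def unigram_f1(hyp, ref):
    hyp, ref = hyp.split(), ref.split()
    overlap = len(set(hyp) & set(ref))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(set(hyp)), overlap / len(set(ref))
    return 2 * p * r / (p + r)

def best_reference_score(hypothesis, references, metric=unigram_f1):
    return max(metric(hypothesis, ref) for ref in references)
```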

Automatic Evaluation

We experiment with both user-controllable and fully automatic response generation, using gold control phrases and control phrases predicted by a content planner, respectively. As different reference responses incorporate different gold control phrases, we use single-reference evaluation for the user-controllable setting. Predicted control phrases are independent of reference responses, so we can use multi-reference evaluation in the automatic generation setting. For automatic evaluation, we measure the overall relevance of the generated responses with BLEU-4 Papineni2002bleu and NIST-4 Doddington2002nist. NIST is a variant of BLEU that weights n-gram matches by their information gain, which penalizes uninformative n-grams. We measure the diversity of n-grams in generated responses as the ratio between the number of distinct n-grams and the total number of n-grams. Previous work li2016persona, sun2019ssim has shown that automatic metrics for generation can be unreliable, and response generation generally achieves low absolute metric scores; accordingly, our main conclusions are based on human evaluations. Nevertheless, we find that our automatic evaluation results comport well with our human evaluations. To provide a sense of how control phrases help enforce the specificity level of generation, in the user-controllable setting we report the control phrase inclusion rate, i.e., the percentage of gold control phrases included in the generated responses. Note that a lower inclusion rate does not necessarily indicate worse performance in satisfying the user's control request, as we treat the lexical control phrases as soft semantic guidance in generation rather than as hard constraints.
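The distinct-n diversity ratio described above is simple enough to state in code directly (a minimal sketch, computed over the whole set of generated responses):

```python
# Div-n: ratio of distinct n-grams to total n-grams in generated responses.

def distinct_n(responses, n=2):
    total, distinct = 0, set()
    for resp in responses:
        toks = resp.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        distinct.update(grams)
    return len(distinct) / total if total else 0.0
```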

Human Evaluation

Human evaluation was conducted using crowdsourced workers. We measured relevance and appropriateness to the preceding dialog, and consistency with the background text (as a proxy for factual correctness). Judges were presented with paired, randomized outputs from each system. The document title, a short snippet of the document and up to two conversational turns were provided as context. Judgments were entered on a five-point Likert scale, and ties were permitted. Three to four judges evaluated each pair, and quality controls were imposed to block poorly performing judges. Inter-rater agreement was "fair", with Krippendorff's alpha at 0.32. Sample sizes vary: the number of instances was reduced from an initial 1,000 after we automatically removed those where egregiously offensive content rendered them inappropriate to display to judges.

Results and Analysis

Controllable Response Generation

We focus here on analyzing the user-controllable grounded response generation framework, using single-reference evaluation. In the table below, lines 1-3 are non-controllable settings with no control phrases as input, while lines 4-8 take control phrases as input, either explicitly or implicitly. The large performance gap between lines 1-3 and lines 4-8 demonstrates the value of adding control. More importantly, we can draw the following conclusions by comparing rows: (i) 1 vs. 3: simply adding grounding to the model input improves performance only to a limited extent; (ii) 2 vs. 3: GPT-2 in general performs better than the state-of-the-art grounded model CMR, indicating that the combination of pre-training and a transformer-based decoder helps improve generation; (iii) 4 vs. 7-8: adding constraint-sensitive grounding boosts performance; (iv) 5 vs. 7-8: providing control phrases explicitly is important; (v) 6 vs. 7-8: applying control in the hidden states helps the model generate better-quality responses than applying control at decoding time only; (vi) 7 vs. 8: inductive attention helps reduce noise and improves the performance of GPT-2. Although the comparison of line 6 vs. lines 7-8 shows that applying control in the hidden states is more effective than strict constraints at decoding, it is possible that controls at the training and decoding stages are complementary; we leave investigation of methods for combining them to future research. Human evaluation results below show that X+G_C+C+GPT2IA outperforms the other systems, except in the case of consistency, where there is no statistical difference between X+G_C+C+GPT2IA and X+G_C+C+GPT2, both grounded systems.

Setting      | Model       | NIST | BLEU  | Div-2 | Avg-L | Incl.
1) X         | GPT-2       | 0.90 | 0.55% |  4.9% | 22.2  |   –
2) X+G       | CMR         | 0.34 | 0.17% | 11.3% | 15.1  |   –
3) X+G       | GPT-2       | 0.98 | 0.67% |  7.5% | 23.1  |   –
4) X+C       | GPT-2       | 1.67 | 2.65% | 10.7% | 28.7  | 69.4%
5) X+G_C     | GPT-2       | 1.34 | 1.58% | 11.1% | 26.6  | 34.8%
6) X+G_C+C   | GPT-2+GBS*  | 1.60 | 2.38% | 10.6% | 26.8  | 98.0%
7) X+G_C+C   | GPT-2       | 1.77 | 3.22% | 11.3% | 27.0  | 65.1%
8) X+G_C+C   | GPT2IA      | 1.80 | 3.26% | 11.6% | 25.9  | 63.5%

*X+G_C+C (GBS) takes only X+G_C as the encoder input; C is seen at decoding only.

Table: Controllable response generation automatic evaluation (with user constraints).
             | GPT2IA | Tied  | GPT-2 |
Relevance: Which response is more relevant and appropriate to the preceding dialog?
X+G_C+C      | 69.8%  | 14.1% | 16.1% | X+G_C+C (GBS)
X+G_C+C      | 42.1%  | 23.5% | 34.4% | X+C
X+G_C+C      | 38.1%  | 28.6% | 33.3% | X+G_C+C
Consistency: Which response is more consistent with the grounding text?
X+G_C+C      | 28.1%  | 44.3% | 27.6% | X+G_C+C
X+G_C+C      | 37.6%  | 31.4% | 31.0% | X+C

Table: Controllable response generation human evaluation for relevance and background consistency, showing preferences (%). A number in bold indicates the system is significantly better, computed using 10k bootstrap replications.

Content-Planned Response Generation

In a fully automatic conversation scenario, we propose to have a content planner predict control phrases, in order to leverage our framework for automatic response generation. The table below compares settings in which no control phrases versus predicted control phrases C' are provided to the model. We observe that both the retrieval-based and the BERT QA based content planner (the methods presented in the Content Planner section) achieve good results in terms of NIST and Div-2. We also provide evaluation results for the held-out human response, which indicate the upper bound for this task. As described above, we conduct multi-reference evaluation in the predicted control phrase setting.

Setting      | Model  | Content Planner  | NIST | BLEU  | Div-2
X            | GPT-2  | –                | 1.42 | 1.31% | 18.1%
X+C'         | GPT-2  | Retrieval-based  | 1.61 | 1.26% | 19.4%
X+G_C'+C'    | GPT2IA | Retrieval-based  | 1.67 | 1.23% | 20.2%
X+G_C'+C'    | GPT2IA | BertQA           | 1.67 | 1.26% | 19.6%
Human        | –      | –                | 2.04 | 2.56% | 62.8%

Table: Response generation automatic evaluation (multi-reference) using constraints from the content planner. Note that these results are not directly comparable with those of the user-constraint table above, as user constraints give away significant information about the intended response.

width=0.48 Content Planner C-P C-R C-F G-P G-R G-F  Retrieval-based 13.8% 5.6% 7.2% 5.5% 21.8% 7.7%  BertQA 14.7% 4.8% 6.5% 5.0% 21.3% 7.1%  Human 24.4% 6.1% 8.6% 6.6% 17.2% 8.0%

Table: Response coverage of control phrase and associated grounding tokens.

Dialogue Context: With "nihonium", Japanese scientists become first from an Asian country to name atomic element.
Control: periodic table
Grounding: … The periodic table is a great legacy in chemistry …
X+G_C+C+GPT2IA: I'm not sure if this is a good thing or not, but I'm pretty sure the periodic table is a great legacy in chemistry.
Control: artificially
Grounding: … The artificially synthesized element has 113 protons in its nucleus …
X+G_C+C+GPT2IA: I wonder if they will be able to name a chemical that artificially produces atomic elements.

Table: For the same dialogue context, GPT2IA generates varied responses given different control phrases.

As an intermediate assessment of the content planner, we report the precision, recall and F1 of tokens in C' and the associated grounding G_C' with respect to reference responses (stop-words and punctuation tokens are excluded from the counts) in the table above. For each test dialogue context, we calculate the values against the reference response that gives the highest F1 score, and report the average over all test examples for each metric. We note that the retrieval-based content planner predicts slightly better quality phrases than BERT QA, though both remain worse than the gold control phrases derived from the held-out human response.
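The token-level scoring can be sketched as follows. This is a hedged sketch of the evaluation described above, not the exact script; the stop-word list is illustrative.

```python
# Compare predicted phrase tokens against reference-response tokens,
# excluding stop-words, and keep the best-scoring reference per context.

STOPWORDS = {"the", "a", "an", "of", "to", "is", "and"}  # illustrative list

def token_prf(predicted_tokens, reference_tokens):
    pred = {t for t in predicted_tokens if t not in STOPWORDS}
    ref = {t for t in reference_tokens if t not in STOPWORDS}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return p, r, 2 * p * r / (p + r)

def best_prf(predicted_tokens, references):
    """Pick the reference with the highest F1, as done per dialogue context."""
    return max((token_prf(predicted_tokens, ref.split()) for ref in references),
               key=lambda prf: prf[2])
```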

Qualitative Analysis

To understand how grounding knowledge assists generation, we plot token-level probabilities (figure below) for both the X+C and X+G_C+C systems. We intentionally select an example about an uncommon entity, to eliminate the possibility that the knowledge was captured during pre-training. The figure shows the token-level probability of a potential response, given the dialogue context Do you know the education background of the new faculty, Sam?, control phrases University of Toronto and neural networks, and grounding sentences Sam got his bachelor degree in Physics at University of Science and Technology of China. He spent 6 months at University of Tokyo in Japan as a visiting student, when he was a master student in Computer Science at University of Hong Kong from 2010-2012. And he finished his PhD at University of Toronto in Canada with his research focused on interpretability of neural networks on text generation in 2017. The grounded model assigns higher probabilities to contextual words from the grounding, such as graduated and thesis, as well as to factually correct entity tokens like 2017, and assigns lower probability to factually incorrect tokens such as economics. These observations suggest that grounding knowledge can help controllable generation in two ways: (i) contextualizing control phrases; (ii) distinguishing correct from incorrect facts.


Figure: Sample showing that our grounded model (X+G_C+C+GPT2IA) offers better discrimination than an ungrounded model (X+C+GPT2), given a document about a person's education background (constraints: University of Toronto; neural networks).

The next figure illustrates another example, analyzing the roles of control and grounding in generation. We list the top 6 tokens following a partial response, given the same dialogue context and grounding, and the control phrase Canada. The ungrounded, non-controllable model assigns roughly uniform probabilities to commonly known American state names after University of. Adding grounding helps the model rank locations based on the background knowledge; further adding control helps the model locate the correct, intended answer.


Figure: The top 5 tokens (plus Canada) to be generated after the partial response Sam just graduated from University of. While the ungrounded model makes mostly generic predictions, the grounded model provides more topically relevant ones, and the constraint further positively influences the hidden state.

To quantify the observations in the two figures above, we sample 100 test examples and randomly pick an entity from each reference response (restricted to entities not in the control phrases), then compute that entity's probability under each model. The average probability ratios for X+G_C versus X+G_C+C and for X+C versus X+G_C+C are 0.773 and 0.886 respectively. Both are smaller than 1.0, indicating that having both grounding and control phrases assigns higher probability to correct entities than having grounding or control phrases alone. Explicit control phrases can also be leveraged to dissect the generation process: the table above shows how controls may guide or perturb the GPT2IA model to produce responses with diverging semantics, and we provide more sample outputs of the different systems in the table below.

Dialogue Context: More cops have died of illness linked to 9/11 than had perished in it on that day.
Control: people died; elevators
Grounding: … 1,402 people died at or above the floors of impact in the north tower. … many elevators did not plunge, but were destroyed due to the crash and subsequent fires, or were stranded in the shafts. …
Baseline outputs:
- I'm not sure if this is true, but I'm pretty sure it's true.
- I'm pretty sure the elevators were not meant to be elevated above the elevators.
- I'm surprised that the number of people died in the elevators is low.
+IA: I'm pretty sure the number of people died in the elevators is higher than the number of people died in the towers.
+IA: I'm pretty sure the 9/11 attacks were the most horrific of the world trade center attacks.

Dialogue Context: In 1997 a killer whale held a great white shark upside down for 15 minutes, causing it to suffocate, after which it ate only its liver.
Control: orca
Grounding: … The orca held the shark upside down to induce tonic immobility and kept the shark still for fifteen minutes, causing it to suffocate … The orca then proceeded to eat the dead shark's liver …
Baseline outputs:
- I think the article is about a great white shark.
- I'm not sure if this is the orca that killed the killer whale, but i think it's the killer whale that killed the killer whale.
- I think the orca is the one who's the bad guy.
+IA: I'm pretty sure the orca is the one who killed the shark.
+IA: I'm pretty sure that's a great white shark.

Dialogue Context: Millions of Puerto Ricans just lost power again after a line repaired by Whitefish Energy failed.
Control: restored
Grounding: … just 18% of the territory now has power, and officials did not have a clear timeline on when the power will be restored. …
Baseline outputs:
- I'm not sure if this is true, but I'm pretty sure it's true.
- I'm not surprised.
- I'm surprised that the government is still trying to get the rest of the population restored.
- I'm surprised they didn't just repair the line. I mean, it's not like they were going to be able to get back to the island they were supposed to be.
+IA: I'm surprised they didn't just restore power.
+IA: I'm from San Juan and I'm pretty sure I'm not the only one that noticed this.

Dialogue Context: 76% of all known serial killers in the 20th century were from the United States.
Control: law enforcement
Grounding: … and may include more stringent policies for military personnel in law enforcement or security … Should the cases cross multiple jurisdictions, the law enforcement system in the US is fragmented and thus not configured to detect multiple similar murders across a large geographic area …
Baseline outputs:
- I'm pretty sure the US had a police force of around 100,000 people.
- I'm pretty sure the USA had a large number of serial killers in the US.
- I'm sure the USA had a large number of law enforcement officers in the US.
- I'm not sure if this is true, but I'm pretty sure that the US has a lot of law enforcement officers that are from the US.
+IA: I'm not sure if this is true, but I'm pretty sure that the law enforcement in the US is not very good at detecting serial killers.
+IA: I'm pretty sure that the USA has a large population of female serial killers.

Table: Sample outputs of the systems, with baseline outputs for comparison.

Related Work

Grounded Response Generation

Although some relevant work draws on external knowledge sources, none incorporates user control. Ghazvininejad2018AKN develop a memory-network-based model that leverages grounding information from Foursquare tips. Moghe2018TowardsEB and zhou2018dataset collect movie discussion datasets via crowdsourcing; these are limited to specific domains. Dinan2018WizardOW crowdsource conversations where each utterance is grounded in at most one selected Wikipedia sentence. We focus on a more realistic, scalable setting in which a response may blend multiple pieces of grounding information, rather than rephrasing a single factual sentence. Other researchers propose a copy mechanism to import tokens from both dialogue context and grounding Yavuz2018DeepCopy, or leverage a reading comprehension model to co-encode dialogue context and grounding knowledge Qin2019CMR. Further work incorporates relational knowledge bases zhu2017flexible, liu2018diffusion or commonsense knowledge graphs young2018augment into conversational models. More recently, liu2019kgconv develop a graph-path-based method on knowledge graphs augmented with unstructured grounding. Our work focuses on text-based grounding knowledge and does not require pre-constructed knowledge graphs.

Controlled Generation

Prior work on machine translation and language generation has sought to enforce user-specified constraints, primarily in the form of lexical constraints hokamp2017lcd, hu2019ParaBank, hu-etal-2019-improved, miao2019cgmh. These approaches exploit constraints at inference time only; in our case, constraints are applied during training, with the option of also being applied at inference. Applying them during training enables the constraints to be incorporated into the latent space for better predictions. Other related work [14, 10, 15] has explored non-lexical constraints, but does not examine how these could facilitate the use of grounding and external knowledge; we see this line of research as complementary to ours. (These papers also assume that gold constraints can always be given to the system, which limits the potential to demonstrate broader benefits of the approaches. To address this concern, we also evaluate our models in settings where gold constraints are unavailable, e.g., using predicted constraints produced by a content planner.) Controllable text generation has also been employed in text style transfer [9] and other tasks [3, 2, 5] to disentangle high-level style information from contextual information, such that the style information can be independently manipulated. [19] use discrete latent actions to learn an interpretable representation for task-oriented dialogue systems. While these works use "style" labels (e.g. positive/negative, formal/informal) as the controlling signal, our framework controls generation with specific lexical constraints, allowing fine-grained semantic control.


Conclusion

The CGRG framework allows users to inject soft semantic control into the generation process. It incorporates grounding to contextualize users' semantic intents and to boost information reliability. We introduce an inductive attention mechanism for self-attention-based generation models such as GPT-2 to boost their performance. We also demonstrate that this framework can benefit standard automatic response generation when integrated with a content planner. Interesting future directions include exploring other types of user-desired control and extending the controllable grounded generation concept to broader generation tasks such as document writing assistance.


Acknowledgments

We thank members of Microsoft Research and the University of Washington's NLP groups who provided feedback and insights for this work.

References

  • [1] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2019) Wizard of Wikipedia: knowledge-powered conversational agents. In ICLR.
  • [2] L. Dong, S. Huang, F. Wei, M. Lapata, M. Zhou, and K. Xu (2017) Learning to generate product reviews from attributes. In Proc. of EACL.
  • [3] J. Ficler and Y. Goldberg (2017) Controlling linguistic style aspects in neural language generation. In Proc. of EMNLP.
  • [4] X. Gao, S. Lee, Y. Zhang, C. Brockett, M. Galley, J. Gao, and B. Dolan (2019) Jointly optimizing diversity and relevance in neural response generation. In Proc. of NAACL.
  • [5] X. Gao, Y. Zhang, S. Lee, M. Galley, C. Brockett, J. Gao, and B. Dolan (2019) Structuring latent spaces for stylized response generation. In Proc. of EMNLP.
  • [6] M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley (2018) A knowledge-grounded neural conversation model. In Proc. of AAAI.
  • [7] C. Hokamp and Q. Liu (2017) Lexically constrained decoding for sequence generation using grid beam search. In Proc. of ACL.
  • [8] A. Holtzman, J. Buys, M. Forbes, and Y. Choi (2020) The curious case of neural text degeneration. In ICLR.
  • [9] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proc. of ICML.
  • [10] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: a conditional transformer language model for controllable generation. arXiv:1909.05858.
  • [11] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proc. of NAACL, pp. 110-119.
  • [12] L. Qin, M. Galley, C. Brockett, X. Liu, X. Gao, B. Dolan, Y. Choi, and J. Gao (2019) Conversing by reading: contentful neural conversation with on-demand machine reading. In Proc. of ACL.
  • [13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog.
  • [14] A. See, S. Roller, D. Kiela, and J. Weston (2019) What makes a good conversation? How controllable attributes affect human judgments. In Proc. of NAACL.
  • [15] J. Tang, T. Zhao, C. Xiong, X. Liang, E. P. Xing, and Z. Hu (2019) Target-guided open-domain conversation. In Proc. of ACL.
  • [16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30.
  • [17] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. In NeurIPS.
  • [18] Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020) DialoGPT: large-scale generative pre-training for conversational response generation. In ACL demo paper.
  • [19] T. Zhao, K. Lee, and M. Eskenazi (2018) Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proc. of ACL.