
Policy-Driven Neural Response Generation for Knowledge-Grounded Dialogue Systems

by   Behnam Hedayatnia, et al.

Open-domain dialogue systems aim to generate relevant, informative and engaging responses. Seq2seq neural response generation approaches do not have explicit mechanisms to control the content or style of the generated response, and frequently result in uninformative utterances. In this paper, we propose using a dialogue policy to plan the content and style of target responses in the form of an action plan, which includes knowledge sentences related to the dialogue context, targeted dialogue acts, topic information, etc. The attributes within the action plan are obtained by automatically annotating the publicly released Topical-Chat dataset. We condition neural response generators on the action plan, which is then realized as target utterances at the turn and sentence levels. We also investigate different dialogue policy models to predict an action plan given the dialogue context. Through automated and human evaluation, we measure the appropriateness of the generated responses and check if the generation models indeed learn to realize the given action plans. We demonstrate that a basic dialogue policy that operates at the sentence level generates better responses in comparison to turn-level generation as well as baseline models with no action plan. Additionally, the basic dialogue policy has the added benefit of controllability.




1 Introduction

Open-domain dialogue systems have typically been modeled using end-to-end approaches, more specifically encoder-decoder architectures (Sordoni et al., 2015; Serban et al., 2017, 2016; Vinyals and Le, 2015). These seq2seq models are commonly trained on a maximum likelihood objective, which leads to repetitive and uninformative responses (Wei et al., 2017). As seen in Figure 1, candidate A is a typical generic response given the dialogue context. In order to deal with this problem, previous work proposed grounding generated responses on knowledge sentences related to the dialogue context (Ghazvininejad et al., 2018; Yavuz et al., 2019; Zhou et al., 2018; Dinan et al., 2018; Gopalakrishnan et al., 2019). To improve the diversity of generated responses, others proposed conditioning response generation on latent (Serban et al., 2016, 2017; Shen et al., 2017; Zhao et al., 2017) or discrete attributes (Sankar and Ravi, 2019; Li et al., 2016a; See et al., 2019; Serban et al., 2017). These discrete attributes are typically presented to the decoder at the turn level, and are not associated with a specific segment of the output.

Another issue with seq2seq approaches is that, due to the lack of explicit control mechanisms, the style of these responses does not always match with what would be suggested by user experience experts. For example, the generated response may not acknowledge what the user just said, or may jump to a new topic without first introducing it. Figure 1 shows examples of two response candidates with similar content: candidate C acknowledges Speaker 2’s previous statement and follows up with a question introducing a new topic and statement, in contrast with candidate B which abruptly transitions into the new topic.

According to Schegloff (2007), human conversations are sequentially organized units: turns, and the actions realized within them, are related to what came before and affect what comes next. Inspired by these studies, we propose a policy-driven neural response generation (PD-NRG) approach for open-domain, knowledge-grounded dialogue systems. Our motivation is to give open-domain conversational systems a mechanism, i.e., a dialogue policy, that enables such higher-level control of generated responses. The dialogue policy provides a sequential organization plan, or action plan, which specifies the order and relationship of sentences within a turn, targeting engaging responses to users throughout the interaction. This form of control is similar to dialogue management (DM) and natural language generation (NLG) in task-oriented systems, where a meaning representation determined by the DM is realized as a response during NLG. To further control the content and order of sentences within the generated response, previous work on task-oriented systems proposed explicit content and sentence planning Walker et al. (2007). Previous work on open-domain dialogue systems also follows a similar method for sentence planning, designing dialogue policies to predict a set of discrete attributes such as topic and dialogue acts (Fang et al., 2018; Cervone et al., 2017; Yu et al., 2019; Bowden et al., 2019; Fulda et al., 2018; Pichl, 2018; Ahmadvand et al., 2018). However, these studies rely on a set of templates for NLG, resulting in repetitive response structures.

We design a set of dialogue policy models that adapt to the dialogue context to appropriately control the responses at both the turn and sentence levels. We extend the end-to-end approach of Dinan et al. (2018); Gopalakrishnan et al. (2019): we take as input both the dialogue context and an action plan to predict the next response. We train our PD-NRG model by fine-tuning the Generative Pre-trained Transformer (GPT) Radford et al. (2018) model in a TransferTransfo fashion Wolf et al. (2019). The TransferTransfo model is a state-of-the-art neural open-domain dialogue system that won 1st place in automated evaluation and 2nd place in human evaluation at the NeurIPS ConvAI2 Conversational Intelligence Challenge Dinan et al. (2020). Our approach differs from previous works that condition on discrete attributes independently by conditioning on these attributes jointly.

Our contributions include:

  • an amended version of the Topical-Chat dataset with annotations on multiple attributes (knowledge, topic, dialogue act). These annotations were tagged automatically, which reduces the cost and time of manual annotation while still obtaining strong results. (We are looking into releasing this amended dataset publicly.)

  • the design of a basic dialogue policy to predict an action plan for controllable generation for neural response generators

  • a sentence-based generation approach that outperforms turn-level generation, and

  • investigation of simple hand-crafted policies as well as automatically learned policies that could be adapted to new applications.

2 Related Work

Controllability of the generated output has been studied for multiple language generation tasks (such as poetry generation and summarization). Previous work on controlling the style and content of the output of generation focused on two main approaches, conditional generation and weighted decoding. Conditional generation modifies the input to the model to condition the generation on control parameters. For example, for summarization, to control the size of the targeted summary, Kikuchi et al. (2016) and Fan et al. (2017) prepended the input to the encoder with the length bin of the target summary (or its embedding) in sequence-to-sequence models with attention. Similarly, previous works proposed conditioning response generators on latent (Serban et al., 2016, 2017; Shen et al., 2017; Zhao et al., 2017) or discrete attributes, including dialogue acts Sankar and Ravi (2019), sentiment Sankar and Ravi (2019), speaker identifiers Li et al. (2016a), lexical features See et al. (2019) or topics Serban et al. (2017).

Weighted decoding See et al. (2019) instead uses features that are controllable (Ghazvininejad et al., 2017; Baheti et al., 2018) and supplements the scores from the decoder model output with these features. Weighted decoding only allows control with token level attributes, and hence is not ideal for our approach. Instead our work focuses on conditional generation methods with sentence-level control, as described in more detail in Section 4. Moreover, the goal of these previous studies is to use these attributes to diversify responses for chit-chat generation; whereas our main focus is on knowledge grounded response generation, and how to appropriately include knowledge in a conversation by jointly conditioning our response on a set of control mechanisms.

There is also previous work on controlling attributes such as question asking at the dialogue level. See et al. (2019) initialized the generation of turns of a dialogue with a fixed distribution that specified what percentage of generated turns should include questions during the dialogue. However, this does not allow for flexible control, where the number of questions may need to vary depending on the course of the dialogue. A developer may not know the distribution of dialogue acts to use; therefore we focus on learning a dialogue policy model that automatically learns the style of the response based on the dialogue context.

Similar to previous work for response generation, we ground our generated responses on knowledge. Ghazvininejad et al. (2018) used end-to-end memory networks to represent knowledge, Yavuz et al. (2019) added a copy mechanism to attend over the knowledge, and Zhou et al. (2018) used a static graph attention mechanism for relevant knowledge graphs. Dinan et al. (2018) and Gopalakrishnan et al. (2019) used memory networks based on transformer architectures Vaswani et al. (2017) to encode knowledge sentences and dialogue history to decode a response. Roller et al. (2020) and Smith et al. (2020) extended the incorporation of knowledge by creating the Blended Skills Dataset (BSD), which incorporated multiple skills (personality, knowledge and empathy) across the conversation by combining utterances from existing datasets (PersonaChat Zhang et al. (2018), Wizard of Wikipedia Dinan et al. (2018), EmpatheticDialogues Rashkin et al. (2018)). A turn within BSD can convey a value such as sadness from empathy, or talk about a specific topic from its knowledge. Roller et al. (2020) trained a memory-network-based transformer architecture on this dataset. However, other than knowledge, their model does not explicitly predict the values for all skills at every turn. In this work, we integrate knowledge in responses by jointly conditioning on the attributes in the input action plan. The values for these attributes are predicted at every turn based on a dialogue policy.

Previous work also investigated content and sentence planning in open-domain dialogue systems. Fang et al. (2018) used a hierarchical dialogue policy where the top level decides the topic to talk about, followed by prediction of a set of speech acts to generate and combine phrases. Ahmadvand et al. (2018) extract multiple features such as topic, intent, entities and sentiment to send to a dialogue manager to select a response. Pichl (2018) used a Hybrid Code Network for their dialogue manager and used sentiment, dialogue act and topic to predict which response to select. Cervone et al. (2017) segmented an utterance into functional units where each unit is attributed to one of the ISO-standard dialogue acts Mezza et al. (2018). Yu et al. (2019) learned a segmentation model to break utterances into smaller segments and predict the topic and dialogue act of each segment. Bowden et al. (2019) proposed modeling discourse coherence between generated turns. Fulda et al. (2018) planned and composed a set of responses such as informative and follow-up questions. However, these previous works generated responses from a set of templates, which are usually repetitive for open-domain conversations. Our work focuses on neural generative models for response generation in open-domain dialogue systems.

The closest work to ours in terms of learning a dialogue policy for open-domain dialogue is Xu et al. (2018), who designed a policy network to predict dialogue acts and fed those acts into a response generation model to control responses. However, a key part of open-domain dialogue is to introduce knowledge into a conversation. We design a policy that integrates knowledge with dialogue acts at a sentence level. In contrast to Xu et al. (2018), who used a machine-learning-based approach, we show that a basic rule-based dialogue policy can result in strong performance.

3 Dialogue Policy

Our proposed PD-NRG approach has two parts: a dialogue policy that determines the action plan based on the dialogue context, and a  response generation model that takes the action plan and the dialogue context as input to generate a response. The dialogue policy has components that predict the individual elements of the action plan: knowledge selection and dialogue act planning. Knowledge selection determines the knowledge to be integrated in the response by finding sentences from a knowledge document corpus that are relevant to the dialogue context. Dialogue act (DA) planning determines the style of the response in the form of dialogue acts to be realized. We have two forms of dialogue act planning methods: Knowledge-dependent DA planning and Knowledge-independent DA planning. Figure 2 depicts the architecture of PD-NRG.

Figure 2: Policy-driven neural response generation.
Dialogue Act  Definition
Apology       apology
ChoiceQ       Or-question
Commissive    Offer, Commit
Directive     Open-Option, Suggest
Feedback      Acknowledge, Feedback
PropQ         Yes-no-question, Suggest
Salutation    bye, greet
SetQ          Wh-question
Statement     Inform
Thanking      thanking, your-welcome

Table 1: The subset of ISO-standard dialogue acts proposed by Mezza et al. (2018).

3.1 Action Plan (AP)

For the rest of this work, let D = t_1, ..., t_j denote a partial dialogue containing a sequence of turns, and let t_i represent a turn in the dialogue, where 1 ≤ i ≤ j. Each t_i contains a sequence of sentences s_1, ..., s_n.

Each t_i also has an action plan that consists of one frame for each sentence s_k. The frames, formed of attributes and values, may include:

  1. Dialogue acts (d_k) at a sentence level to help control the style of the generated response. Table 1 lists all the dialogue acts used in this work.

  2. Topics (p_k) at a turn level to generate topically coherent responses. The complete list of topics is: fashion, politics, books, sports, general entertainment, music, science & technology and movies.

  3. Knowledge (k_k) at a turn or sentence level to generate interesting and informative responses. The knowledge is represented as a sentence drawn from an unstructured knowledge corpus.

  4. Use-knowledge flag (f_k) that signals whether or not to use the knowledge attribute (k_k) at the turn or sentence level.

Each frame in the action plan corresponds to a sentence; the frame for sentence s_k is denoted as a tuple of these four attributes, F_k = (d_k, p_k, k_k, f_k). In this work, we focus on these attributes for action plans, as they are the most basic and critical ways to control knowledge-grounded response generation.
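As a concrete (hypothetical) rendering of a frame, with attribute names of our own choosing rather than the paper's notation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Frame:
    """One frame of the action plan, corresponding to one target sentence."""
    dialogue_acts: List[str] = field(default_factory=list)  # e.g. ["Feedback", "PropQ"]
    topic: Optional[str] = None        # turn-level topic, e.g. "music"
    knowledge: Optional[str] = None    # a knowledge sentence from the corpus
    use_knowledge: bool = False        # flag: realize `knowledge` in the output?

# An action plan for a two-sentence turn: acknowledge, then ask a grounded question.
action_plan = [
    Frame(dialogue_acts=["Feedback"]),
    Frame(dialogue_acts=["PropQ"], topic="music",
          knowledge="The Beatles released 13 studio albums.", use_knowledge=True),
]
```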

3.2 Knowledge Selection

For the knowledge selection component of our dialogue policy, referenced in Figure 2, we compute the following for each turn t_i at run time. Let h be defined as the dialogue history t_1, ..., t_{i-1}:

    k* = argmax_{k ∈ K} sim(h, k)    (1)

where k is a knowledge sentence from an unstructured knowledge corpus, K, in the Topical-Chat dataset Gopalakrishnan et al. (2019). We use the BM25 model Robertson et al. (2009) to rank knowledge sentences: we represent h and k as TF-IDF vectors, compute the cosine similarity sim between the vectors, and take the argmax over all k in our knowledge corpus. For h, we use only the most recent previous turn for selection. To decide whether or not to use knowledge, we manually set a threshold of 0.2 on the similarity score between the sentences: if the similarity score is above the threshold, we use the knowledge sentence as input; otherwise we do not.
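A rough, self-contained sketch of this selection step. The paper ranks with BM25 over TF-IDF vectors of the Topical-Chat corpus; the tiny TF-IDF implementation and toy corpus below are our own stand-ins, keeping only the argmax-plus-threshold logic (threshold 0.2, as in the text):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (dicts) for a small corpus of token lists."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_knowledge(history, corpus, threshold=0.2):
    """Return (best_sentence, use_flag): argmax similarity over the corpus,
    with the use-knowledge flag set only when the score clears the threshold."""
    docs = [history.lower().split()] + [k.lower().split() for k in corpus]
    vecs = tfidf_vectors(docs)
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    best = max(range(len(corpus)), key=lambda i: scores[i])
    return corpus[best], scores[best] >= threshold

corpus = ["The Beatles released 13 studio albums.",
          "Golf was first played in Scotland."]
sentence, use_flag = select_knowledge("do you play golf ?", corpus)
```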

3.3 Dialogue Act planning

For dialogue act planning, we define a set of dialogue act transitions from common examples in the Topical-Chat corpus. The set of dialogue acts for the next response is determined by both the dialogue acts and the knowledge sentence selected, based on the dialogue context. Figure 2 shows the output of knowledge selection being fed as input into the dialogue act planning component. We represent the transitions as a decision tree. Based on which set of dialogue acts is output, we decide whether or not to include the knowledge sentence; some dialogue acts, such as Feedback, do not need to include knowledge by definition.

3.3.1 Knowledge-dependent DA planning

We propose Knowledge-dependent DA planning (KD-DA-P), where there are two inputs to predict the dialogue acts for the current turn t_i:

  1. the last dialogue act, i.e., the one associated with the final sentence of the previous turn, and

  2. the output of knowledge selection.

The dialogue act model looks at the output of the knowledge selection model to see whether the knowledge selected is the same as, or different from, the knowledge sentence selected for the previous turn t_{i-1}. Based on this information, a certain subset of the transitions defined for dialogue act planning is used to predict the dialogue acts for the next response. We represent the KD-DA-P as a decision tree, and include it in the appendix.
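The transition logic might look like the following sketch. The published decision tree is in the paper's appendix, so the specific transitions below are illustrative assumptions only, not the paper's rules:

```python
def kd_da_plan(prev_act, selected_knowledge, prev_knowledge):
    """Pick dialogue acts for the next response from the last dialogue act and
    from whether knowledge selection returned new or repeated knowledge."""
    same_knowledge = selected_knowledge == prev_knowledge
    if prev_act in ("PropQ", "ChoiceQ", "SetQ"):
        # Answer a question first; add more content when new knowledge exists.
        return ["Statement"] if same_knowledge else ["Statement", "Statement"]
    if not same_knowledge:
        # New knowledge available: acknowledge, then introduce it.
        return ["Feedback", "Statement"]
    # Nothing new to say: hand the floor back with a yes-no question.
    return ["Feedback", "PropQ"]
```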

3.3.2 Knowledge-independent DA planning

The prediction of the dialogue acts is done independently of the selected knowledge in four ways:

  1. Simple DA planning: We define a set of transitions that determine the set of dialogue acts for the next response based solely on the previous dialogue acts. We represent the transitions as a decision tree, and present it in the appendix.

  2. Seq2Seq Model for DA planning: Using the OpenNMT library Klein et al. (2017), we train a sequence-to-sequence model based on bi-directional LSTMs with Luong attention Luong et al. (2015) to estimate the dialogue acts of the current turn given the dialogue context. During training, each dialogue act label is a separate token in the vocabulary and has its own embedding vector. Both the dialogue act and word embeddings are initialized randomly and learned during training.

  3. PropQ DA planning: For comparison to previous work we use the method in See et al. (2019) which initializes the distribution of questions to be asked at the beginning of the conversation. The work finds that the best model generates questions 65.7% of the time. At each time-step the PropQ dialogue act is picked 65.7% of the time thereby replicating this baseline. As shown in Table 1 PropQ corresponds to a Yes-No question which is the most represented question dialogue act in our dataset. We represent the transitions as a decision tree, and present it in the appendix.

  4. AllQ DA planning: We extend the PropQ DA planning baseline above by selecting the PropQ, ChoiceQ or SetQ questions each 21.9% of the time, summing to 65.7%. See et al. (2019) do not make a distinction as to what type of questions were asked. We represent the transitions as a decision tree, and present it in the appendix.
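The two question-asking baselines above can be sketched as simple stochastic policies. The fallback to Statement when no question is sampled is our assumption; See et al. (2019) only fix the question rate:

```python
import random

def propq_policy(rng):
    """PropQ baseline: emit a yes-no question 65.7% of the time."""
    return ["PropQ"] if rng.random() < 0.657 else ["Statement"]

def allq_policy(rng):
    """AllQ baseline: split the 65.7% question mass evenly over question types."""
    r = rng.random()
    if r < 0.219:
        return ["PropQ"]
    if r < 0.438:
        return ["ChoiceQ"]
    if r < 0.657:
        return ["SetQ"]
    return ["Statement"]

# With a fixed seed, the empirical PropQ rate stays close to 65.7%.
rng = random.Random(0)
acts = [propq_policy(rng)[0] for _ in range(10000)]
propq_rate = acts.count("PropQ") / len(acts)
```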

4 Policy-driven Response Generation

As shown in Figure 2, at a given turn in the dialogue context, the goal of the response generator is to realize the action plan output by the dialogue policy. Our proposed models generate the next turn based on the action plan at a sentence level, in a sequential manner, as opposed to at a turn level. As shown in Figure 3(a), when decoding/generating each sentence of the next turn, the dialogue context as well as the previous sentences generated for the next turn up to that iteration are used as input. Algorithm 1 shows the process for sentence-level generation. As seen in the algorithm, all the attributes within the AP are jointly taken as input. To jointly condition on the action plan, each attribute is concatenated to the dialogue history, as shown in Figure 3(c). In the training process, each dialogue act label is a separate token in the vocabulary and has its own embedding vector, which is initialized randomly and learned during training. To train our model, we represent the knowledge sentence and topic label with the pretrained embeddings from the GPT model, whose vocabulary is BPE-tokenized. Finally, the use-knowledge flag decides whether or not to include the knowledge embeddings as part of the input. In some of our experiments, we also include the dialogue acts for the past turns by concatenating each turn in the dialogue history with its respective acts.

response = []
ActionPlan = [F_1, ..., F_n]
for idx in range(len(ActionPlan)):
    F = ActionPlan[idx]
    s = Model(dialogue_context + response, F)
    # the step below ensures the newly generated sentence is
    # included as part of the dialogue context for the
    # generation of the following sentence
    response.append(s)

Algorithm 1: Sentence-level generation
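In runnable form, Algorithm 1 amounts to the following loop; the model argument here is a stand-in for the fine-tuned GPT generator:

```python
def generate_turn(model, dialogue_context, action_plan):
    """Sentence-level generation: realize one frame of the action plan at a
    time, feeding each generated sentence back into the context so that the
    next sentence is conditioned on it."""
    response = []
    for frame in action_plan:
        # The model conditions jointly on the dialogue context, the sentences
        # generated so far for this turn, and the current frame's attributes.
        sentence = model(dialogue_context + response, frame)
        response.append(sentence)
    return " ".join(response)

# Toy stand-in "model" that just echoes the frame's dialogue act.
toy_model = lambda ctx, frame: f"<{frame['da']}>"
out = generate_turn(toy_model, ["hi"], [{"da": "Feedback"}, {"da": "PropQ"}])
```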

4.1 Models for response generation

For all our models, we use the Generative Pre-trained Transformer (GPT) Radford et al. (2018) model, fine-tuned in a TransferTransfo fashion Wolf et al. (2019), as described in Section 1.

  • Model for turn-level generation: As depicted in Figure 3(b), our baseline Wolf et al. (2019) is given the dialogue context and knowledge sentence as input and predicts the response at the turn level.

  • Models for sentence-level generation: As depicted in Figure 3(c), the PD-NRG models are given the action plan (AP) and the dialogue context as input to perform sentence-level prediction. Table 2 lists the versions of PD-NRG models we experimented with, along with their corresponding APs. Baseline-Sent is similar to the Baseline-Turn model, except that it generates responses sentence by sentence. The model generates as many sentences as in the human response.

PD-NRG Models            Action Plan (AP)
w/ DA                    F_k = (d_k, k_k)
+knowledge flag          F_k = (d_k, k_k, f_k)
+knowledge flag +topic   F_k = (d_k, k_k, f_k, p_k)

Baseline Models
Baseline-Turn            {k_i}
Baseline-Sent            {k_i}

Table 2: Models and their input AP for every sentence s_k of the turn being generated.
(a) GPT Model Radford et al. (2018)
(b) Input into Baseline-Turn Model
(c) Input into PD-NRG Model
Figure 3: Input for the Baseline-Turn Model and PD-NRG model, respectively. Figure 3(a) shows the generation process where the input is fed into the GPT model. The output is then concatenated back to the input. This process repeats until generation is complete.

5 Experiments and Evaluation

5.1 Dataset

We use the publicly released Topical-Chat dataset, a large and diverse knowledge-grounded open-domain dialogue dataset where the underlying knowledge spans 8 broad topics including fashion, books, and so on Gopalakrishnan et al. (2019). Each dialogue contains 20+ turns alternating between two crowd workers. For each dialogue there is a reading set for each crowd worker. Each reading set has three entities and a set of corresponding knowledge sentences. When presenting the results, we use both test sets provided with the corpus, test frequent and test rare. Frequent and rare refer to the frequency of topics and entities being discussed in the training set.

5.2 Annotating attributes in Topical-Chat

The dataset does not have annotations for some attributes, such as dialogue acts or fine-grained associations between knowledge sentences and dialogue turns. Hence, we used out-of-the-box or simple models to automatically annotate our dataset with each attribute defined in Section 3.1. The out-of-the-box model we use is the SVM tagger released by Mezza et al. (2018). We assume these annotations are the ground-truth attributes for the ground-truth action plan and use them for testing controllability without degrading response appropriateness. By annotating automatically, we reduce the cost and time it takes to manually annotate our dataset while still obtaining strong results. We measure both controllability and appropriateness with human evaluation.

5.2.1 Annotating Knowledge Sentences

Each conversation in Topical-Chat has a pair of reading sets that were presented to crowd workers before the conversation, to enable a knowledgeable interaction. During their conversation, crowd workers are asked to annotate which topics/entities were attributed to their turns. However, there is no fine-grained annotation of which knowledge sentence or sentences were used for a turn, so we create ground-truth knowledge annotations as a corpus post-processing step. To obtain the knowledge annotation for each turn t_i, we compute the similarity between t_i and each knowledge sentence using Equation 1. To obtain the knowledge annotation for each sentence within a turn, we first tokenize the turn into individual sentences using the NLTK library Loper and Bird (2002), and then compute the same similarity for each sentence.

We decide whether or not the turn or sentences within a turn should be linked to a knowledge sentence by manually setting a threshold value on the similarity score between the knowledge and turn or sentences within a turn. We use the same threshold, 0.2, as described in Section 3.2.
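A minimal sketch of this linking step. The paper sentence-tokenizes with NLTK and scores with the TF-IDF similarity of Equation 1; the regex splitter and toy word-overlap similarity below are stand-ins, keeping the per-sentence argmax and the 0.2-style threshold:

```python
import re

def link_knowledge(turn, knowledge_corpus, sim, threshold=0.2):
    """Link each sentence of a turn to its best-matching knowledge sentence,
    keeping the link only when the similarity clears the threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", turn.strip()) if s]
    links = []
    for s in sentences:
        score, best = max((sim(s, k), k) for k in knowledge_corpus)
        links.append(best if score >= threshold else None)
    return sentences, links

# Toy similarity: Jaccard overlap of word sets.
def overlap(a, b):
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

sents, links = link_knowledge("I love golf. What a day!", ["golf is fun"], overlap)
```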

Figure 4: We calculate automated metrics with both a ground truth and an estimated action plan

5.2.2 Annotating Dialogue Acts

We obtain the dialogue acts for each sentence by running an off-the-shelf SVM dialogue act tagger Mezza et al. (2018), which takes as input the current sentence and predicts one of the 11 dialogue acts listed in Table 1. We also experimented with using past dialogue acts predicted by the tagger as additional input; however, this did not change the results. If the confidence score from the SVM tagger is not above a threshold of 0.5, the tagger outputs no dialogue act, which we denote with a special dialogue act token NoDialogueAct; 2.1% of sentences within the Topical-Chat dataset were labeled as NoDialogueAct. We assume these are the ground-truth dialogue acts in our dataset. To gauge the performance of the model, we asked two crowd workers to segment a small set of 100 turns into individual sentences and annotate each with its dialogue act. The dialogue act tagger obtained an F1 of 0.54, precision of 0.77 and recall of 0.59 on this consolidated test set.
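The confidence fallback can be sketched as follows. The tagger's per-act score format is an assumption; the 0.5 threshold and the NoDialogueAct token are from the text:

```python
def tag_with_fallback(tagger_scores, threshold=0.5):
    """Keep the tagger's top dialogue act only if its confidence clears the
    threshold; otherwise emit the special NoDialogueAct token."""
    act, score = max(tagger_scores.items(), key=lambda kv: kv[1])
    return act if score >= threshold else "NoDialogueAct"
```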

5.2.3 Annotating Topic Labels

For the topic label, we use the topic annotations by the Turkers from the original Topical-Chat data collection. For each turn there are multiple topic annotations; however, unlike the dialogue acts and knowledge sentence, topic annotations are at the turn level and are not linked to individual sentences.

Models                             Past DA  PPL            BLEU-1         ROUGE-L        Avg # words  Avg # sentences
Human                                       - / -          - / -          - / -          24.3 / 25.0  2.10 / 2.15
Baseline-Turn Wolf et al. (2019)            12.92 / 13.53  0.024 / 0.028  0.134 / 0.130  20.7 / 21.7  1.87 / 1.93
Baseline-Sent                               13.85 / 14.36  0.016 / 0.021  0.107 / 0.103  13.7 / 13.9  2.09 / 2.15
PD-NRG w/ DA                                12.72 / 13.01  0.024 / 0.027  0.121 / 0.118  18.5 / 19.3  2.05 / 2.10
PD-NRG w/ DA                       yes      12.39 / 12.80  0.021 / 0.021  0.115 / 0.111  16.0 / 15.8  1.77 / 1.77
+knowledge flag                             12.66 / 12.99  0.025 / 0.027  0.122 / 0.118  17.3 / 18.1  2.03 / 2.08
+knowledge flag                    yes      12.25 / 12.62  0.019 / 0.020  0.113 / 0.108  15.2 / 15.3  1.68 / 1.76
+knowledge flag +topic                      12.76 / 13.07  0.023 / 0.026  0.123 / 0.117  16.8 / 18.2  2.10 / 2.14
+knowledge flag +topic             yes      12.28 / 12.65  0.019 / 0.020  0.115 / 0.109  16.3 / 16.7  1.82 / 1.85

                                                                                         Corpus Diversity
Models                             Past DA  F1             Precision      Recall         n=1            n=2
Human                                       - / -          - / -          - / -          0.037 / 0.050  0.266 / 0.326
Baseline-Turn Wolf et al. (2019)            0.249 / 0.253  0.275 / 0.272  0.229 / 0.236  0.018 / 0.027  0.118 / 0.165
Baseline-Sent                               0.220 / 0.220  0.258 / 0.252  0.191 / 0.195  0.018 / 0.026  0.115 / 0.156
PD-NRG w/ DA                                0.241 / 0.240  0.281 / 0.279  0.210 / 0.212  0.018 / 0.027  0.123 / 0.165
PD-NRG w/ DA                       yes      0.230 / 0.227  0.291 / 0.287  0.185 / 0.185  0.021 / 0.032  0.133 / 0.180
+knowledge flag                             0.240 / 0.242  0.280 / 0.280  0.209 / 0.213  0.018 / 0.027  0.122 / 0.164
+knowledge flag                    yes      0.222 / 0.223  0.287 / 0.281  0.180 / 0.180  0.032 / 0.022  0.137 / 0.181
+knowledge flag +topic                      0.244 / 0.245  0.276 / 0.274  0.210 / 0.213  0.018 / 0.027  0.118 / 0.159
+knowledge flag +topic             yes      0.224 / 0.221  0.272 / 0.271  0.187 / 0.186  0.020 / 0.029  0.136 / 0.177

Table 3: Automated metrics with ground-truth Action Plan on test freq / rare. "Past DA: yes" marks models that also receive the dialogue acts of past turns as input.

5.3 Evaluation Measures

For automatic evaluation we compute a set of metrics between our generated and ground-truth responses: perplexity, BLEU-1, ROUGE-L, and unigram F1-score. We also compute n-gram diversity as defined in Ghazvininejad et al. (2018).
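The n-gram diversity measure is commonly computed as distinct-n: the ratio of unique to total n-grams over all generated responses. A sketch follows; whether this matches Ghazvininejad et al.'s exact normalization is an assumption:

```python
def distinct_n(responses, n):
    """Distinct-n diversity: unique n-grams divided by total n-grams,
    pooled over the whole corpus of generated responses."""
    total, unique = 0, set()
    for r in responses:
        toks = r.lower().split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

resps = ["i like golf", "i like music"]
d1 = distinct_n(resps, 1)  # 4 unique unigrams out of 6 total
d2 = distinct_n(resps, 2)  # 3 unique bigrams out of 4 total
```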

For human evaluation, we followed a similar setup as Li et al. (2016b) and generated 200 snippets, each containing a dialogue context of 5 turns. We generated responses from the 2 models being compared and asked a set of 3 crowd workers: "Which final response is more appropriate for the given conversation?". Our MTurk layout for evaluation is shown in the Appendix.

5.4 Results using the Ground-Truth Action Plan

We first check whether the PD-NRG approach results in better responses when we use the ground-truth action plan. As seen in Figure 4, instead of using a dialogue policy we form ground-truth action plans from the annotations described in Section 5.2, and then use them to generate a response for each turn. Table 3 presents automated evaluation results for Baseline-Turn, Baseline-Sent and variations of the PD-NRG models. As seen in the table, adding dialogue acts increases diversity for all the proposed models, which aligns with previous work showing that using dialogue acts leads to more diverse responses Sankar and Ravi (2019). The F1-scores of the PD-NRG w/ DA model are lower than those of the Baseline-Turn model because the PD-NRG model decodes shorter sentences, resulting in lower recall. The PD-NRG w/ DA model with the addition of previous dialogue acts as input results in the lowest perplexity for both the frequent and rare test sets.

5.4.1 Do the Models follow the Action Plan?

By jointly conditioning on the attributes in the action plan, we aim to control multiple aspects of the response, such as content and style. The dialogue acts determine whether the response should be a question or a statement, or should give feedback. The knowledge determines what content should be present in the response. To see if the model responses follow the action plan, we manually annotated whether the models' responses realize the dialogue acts and the knowledge sentence in their input action plans (focusing on the cases where the action plan included a knowledge sentence). Turns with no dialogue acts, i.e., marked as NoDialogueAct, were ignored. The results of this manual evaluation are presented in Table 4. The PD-NRG w/ DA + knowledge flag model has the highest accuracy in realizing the input action plan, achieving 80.6% accuracy on the dialogue acts of the generated responses and 52.1% accuracy in correctly integrating the provided knowledge sentences. Figure 5 presents an example from this model.

Models                            Past DA  % DA  % K
Baseline-Turn Wolf et al. (2019)           26.7  -
Baseline-Sent                              59.4  30.8
PD-NRG w/ DA                               69.1  47.0
PD-NRG w/ DA                      yes      69.7  39.3
+knowledge flag                            80.6  52.1
+knowledge flag                   yes      68.1  47.8
+knowledge flag +topic                     77.8  47.4
+knowledge flag +topic            yes      69.0  45.3

Table 4: % of Dialogue Acts (DA) and Knowledge (K) realized by the PD-NRG models, showcasing controllability.
Speaker 1: Free with you, they should have had Snoop Dogg make a theme song for the game like he did for his son’s high school football team LOL
Speaker 2: Interesting, do you play golf?
Baseline-Turn model: no, i don’t play golf, but i hear it has been a lot of years since the last time.
PD-NRG model:
  Statement → not really, i’m not a huge fan of golf.
  PropQ → have you ever played?
Figure 5: Baseline-Turn model versus PD-NRG model
Policy               F1            Avg # words   Avg # sentences
Ground truth         0.22 / 0.22   15.2 / 15.3   1.68 / 1.76
Baseline-Turn        0.18 / 0.17   19.8 / 19.7   1.86 / 1.87
KI-DA-P (Simple)     0.14 / 0.14   12.9 / 12.2   1.89 / 1.89
KD-DA-P              0.14 / 0.14   12.3 / 11.5   1.91 / 1.91
KI-DA-P (Seq2Seq)    0.14 / 0.17   13.1 / 13.4   1.46 / 1.56
Table 5: Automated metrics with the estimated action plan (frequent / rare test sets). Baseline-Turn from Wolf et al. (2019).
Dialogue policy %W %T %L IAA
KD-DA-P vs. B-Turn* 40.8 30.3 28.9 0.43
KI-DA-P(Seq2Seq) vs. B-Turn* 25.1 35.7 39.2 0.47
KD-DA-P vs. KI-DA-P(PropQ)** 54.2 5.5 40.2 0.46
KD-DA-P vs. KI-DA-P(AllQ)** 54.1 7.4 38.3 0.48
KD-DA-P vs. Human response** 16.7 35.3 48.0 0.53
Table 6: % of Wins (W), Ties (T) and Losses (L) on appropriateness for the PD-NRG dialogue policies vs the baseline models. The KD-DA-P policy is statistically significantly better than B-Turn (Baseline-Turn) Wolf et al. (2019) as well as the KI-DA-P (PropQ) and KI-DA-P (AllQ) baselines See et al. (2019). We compute Krippendorff’s alpha for inter-annotator agreement (IAA) and p-values using a two-tailed binomial test; * denotes a p-value < 0.05 and ** denotes a p-value < 0.01.

5.5 Results using an estimated Action Plan

Using our dialogue policy models, we estimate an action plan for each turn. Given the dialogue context and the estimated action plan, we then generate responses using the PD-NRG w/ DA + knowledge flag + Past DA model. We evaluate the responses using both automated and human evaluation. We present the automated metrics in Table 5. KD-DA-P and KI-DA-P (Simple) produce more Feedback and PropQ dialogue acts than the actual distribution of dialogue acts in the dataset, where most dialogue acts are Statements. We believe this shift in the distribution led our models to generate responses with fewer words and, as a result, lower F1-scores. Figure 6 shows the distribution of dialogue acts for the different dialogue policies. Multiple knowledge sentences in the unstructured knowledge corpus can be relevant to the dialogue context; when a knowledge sentence other than the ground truth knowledge is selected, the models may still generate an appropriate response, but n-gram overlap measures will fail to capture its appropriateness. We therefore limited the automated evaluation in this set-up to fewer measures.

For a more realistic comparison of our dialogue policy models to our baselines, we ran a human evaluation. We showed crowd workers the outputs of two models along with the dialogue context, and asked them “Which final response is more appropriate for the given conversation?”. Crowd workers were given 3 options: first response, second response, and not sure (limited to those cases where the two responses are equally good/bad). The exact setup given to crowd workers is shown in the Appendix. Table 6 presents the results of this evaluation. As seen, the KD-DA-P responses were chosen over the B-Turn model by a large margin. The same holds for KD-DA-P responses versus the KI-DA-P (PropQ/AllQ) responses, demonstrating that it is better to have a dialogue policy that adapts to the course of the dialogue than to predict the dialogue acts from a fixed distribution See et al. (2019). However, KI-DA-P (Seq2Seq) results in worse responses than the baseline. We believe this is because Statement dialogue acts make up a large portion of the dataset, making it harder for the model to learn the other acts. For future work, we will investigate machine learning approaches to learn better dialogue policy models. The proposed KD-DA-P produces responses that are better than or similar to human responses in 52% of the cases.
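The significance stars in Table 6 come from a two-tailed binomial test on the win/loss counts (ties excluded). A minimal stdlib sketch of that test, assuming the null hypothesis that each non-tie judgment is a fair coin flip:

```python
from math import comb

def binomial_two_tailed(wins: int, losses: int) -> float:
    """Exact two-tailed sign test on paired preference judgments.
    Under H0, each non-tie comparison favors either model with
    probability 0.5; ties are dropped before calling this."""
    n = wins + losses
    k = max(wins, losses)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for two-sidedness
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, a perfectly even split yields p = 1.0, while a unanimous 10-0 split is significant at the 0.01 level; `scipy.stats.binomtest` gives the same exact test.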

Figure 6: Distribution of dialogue acts for different dialogue policies

6 Conclusions

In this work, we propose a policy-driven neural response generation approach for knowledge-grounded open-domain dialogue systems. We estimate an action plan that consists of a set of attributes controlling the content and style of the generated responses at the turn and sentence levels. We investigate both manual and machine learning based policies. Through human evaluation, we empirically demonstrate that a basic dialogue policy performing sentence-level generation outperforms turn-level generation as well as knowledge-grounded response generation baselines. Furthermore, the generated responses realize their respective action plans. This gives dialogue system builders control over the model’s responses, allowing for more consistent user experiences. Our future work includes investigating better approaches for learning such dialogue policy models, along with adding other attributes such as sentiment.


  • A. Ahmadvand, I. J. Choi, H. Sahijwani, J. Schmidt, M. Sun, S. Volokhin, Z. Wang, and E. Agichtein (2018) Emory irisbot: an open-domain conversational bot for personalized information access. Alexa Prize Proceedings. Cited by: §1, §2.
  • A. Baheti, A. Ritter, J. Li, and B. Dolan (2018) Generating more interesting responses in neural conversation models with distributional constraints. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Cited by: §2.
  • K. K. Bowden, J. Wu, W. Cui, J. Juraska, V. Harrison, B. Schwarzmann, N. Santer, and M. Walker (2019) SlugBot: developing a computational model and framework of a novel dialogue genre. arXiv preprint arXiv:1907.10658. Cited by: §1, §2.
  • A. Cervone, G. Tortoreto, S. Mezza, E. Gambi, G. Riccardi, et al. (2017) Roving mind: a balancing act between open-domain and engaging dialogue systems. Cited by: §1, §2.
  • E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe, et al. (2020) The second conversational intelligence challenge (convai2). In The NeurIPS’18 Competition, pp. 187–208. Cited by: §1, §4.1.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2018) Wizard of wikipedia: knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241. Cited by: §1, §2.
  • A. Fan, D. Grangier, and M. Auli (2017) Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Cited by: §2.
  • H. Fang, H. Cheng, M. Sap, E. Clark, A. Holtzman, Y. Choi, N. A. Smith, and M. Ostendorf (2018) Sounding board: a user-centric and content-driven social chatbot. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pp. 96–100. Cited by: §1, §2.
  • N. Fulda, T. Etchart, W. Myers, D. Ricks, Z. Brown, J. Szendre, B. Murdoch, A. Carr, and D. Wingate (2018) Byu-eve: mixed initiative dialog via structured knowledge graph traversal and conversational scaffolding. Cited by: §1, §2.
  • M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley (2018) A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §2, §5.3.
  • M. Ghazvininejad, X. Shi, J. Priyadarshi, and K. Knight (2017) Hafez: an interactive poetry generation system. In Proceedings of ACL 2017, System Demonstrations, pp. 43–48. Cited by: §2.
  • K. Gopalakrishnan, B. Hedayatnia, Q. Chen, A. Gottardi, S. Kwatra, A. Venkatesh, R. Gabriel, and D. Hakkani-Tür (2019) Topical-chat: towards knowledge-grounded open-domain conversations. Proc. Interspeech 2019, pp. 1891–1895. Cited by: §1, §1, §2, §3.2, §5.1.
  • Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura (2016) Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: §2.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL. Cited by: item 2.
  • J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and B. Dolan (2016a) A persona-based neural conversation model. arXiv preprint arXiv:1603.06155. Cited by: §1, §2.
  • J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky (2016b) Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541. Cited by: §5.3.
  • E. Loper and S. Bird (2002) NLTK: the natural language toolkit. arXiv preprint cs/0205028. Cited by: §5.2.1.
  • M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: item 2.
  • S. Mezza, A. Cervone, G. Tortoreto, E. A. Stepanov, and G. Riccardi (2018) ISO-standard domain-independent dialogue act tagging for conversational agents. arXiv preprint arXiv:1806.04327. Cited by: §2, Table 1, §5.2.2, §5.2.
  • J. Pichl (2018) Alquist 2.0: alexa prize socialbot based on sub-dialogue models. Cited by: §1, §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf. Cited by: §1, 2(a), §4.1.
  • H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2018) Towards empathetic open-domain conversation models: a new benchmark and dataset. arXiv preprint arXiv:1811.00207. Cited by: §2.
  • S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389. Cited by: §3.2.
  • S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, et al. (2020) Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. Cited by: §2.
  • C. Sankar and S. Ravi (2019) Deep reinforcement learning for modeling chit-chat dialog with discrete attributes. arXiv preprint arXiv:1907.02848. Cited by: §1, §2, §5.4.
  • E. A. Schegloff (2007) Sequence organization in interaction: a primer in conversation analysis i. Vol. 1, Cambridge University Press. Cited by: §1.
  • A. See, S. Roller, D. Kiela, and J. Weston (2019) What makes a good conversation? how controllable attributes affect human judgments. arXiv preprint arXiv:1902.08654. Cited by: §1, §2, §2, §2, item 3, item 4, §5.5, Table 6.
  • I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §1, §2.
  • I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
  • X. Shen, H. Su, Y. Li, W. Li, S. Niu, Y. Zhao, A. Aizawa, and G. Long (2017) A conditional variational framework for dialog generation. arXiv preprint arXiv:1705.00316. Cited by: §1, §2.
  • E. M. Smith, M. Williamson, K. Shuster, J. Weston, and Y. Boureau (2020) Can you put it all together: evaluating conversational agents’ ability to blend skills. arXiv preprint arXiv:2004.08449. Cited by: §2.
  • A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §1.
  • M. A. Walker, A. Stent, F. Mairesse, and R. Prasad (2007) Individual and domain adaptation in sentence planning for dialogue. Journal of Artificial Intelligence Research 30, pp. 413–456. Cited by: §1.
  • B. Wei, S. Lu, L. Mou, H. Zhou, P. Poupart, G. Li, and Z. Jin (2017) Why do neural dialog systems generate short and meaningless replies? a comparison between dialog and translation. arXiv preprint arXiv:1712.02250. Cited by: §1.
  • T. Wolf, V. Sanh, J. Chaumond, and C. Delangue (2019) TransferTransfo: a transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149. Cited by: §1, 1st item, §4.1, Table 3, Table 4, Table 5, Table 6.
  • C. Xu, W. Wu, and Y. Wu (2018) Towards explainable and controllable open domain dialogue generation with dialogue acts. arXiv preprint arXiv:1807.07255. Cited by: §2.
  • S. Yavuz, A. Rastogi, G. Chao, and D. Hakkani-Tur (2019) Deepcopy: grounded response generation with hierarchical pointer networks. arXiv preprint arXiv:1908.10731. Cited by: §1, §2.
  • D. Yu, M. Cohn, Y. M. Yang, C. Y. Chen, W. Wen, J. Zhang, M. Zhou, K. Jesse, A. Chau, A. Bhowmick, et al. (2019) Gunrock: a social bot for complex and engaging long conversations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pp. 79–84. Cited by: §1, §2.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018) Personalizing dialogue agents: i have a dog, do you have pets too?. arXiv preprint arXiv:1801.07243. Cited by: §2.
  • T. Zhao, R. Zhao, and M. Eskenazi (2017) Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960. Cited by: §1, §2.
  • H. Zhou, T. Young, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018) Commonsense knowledge aware conversation generation with graph attention.. In IJCAI, pp. 4623–4629. Cited by: §1, §2.

Appendix A Appendices

a.1 PD-NRG hyperparameters

For training, we fine-tune the GPT double-heads model from the HuggingFace repository, initialized with the pre-trained weights. We trained for 3 epochs with a batch size of 2, and 4 distractor responses were used for the next-response classification head. We limit the dialogue history to 128 tokens and the knowledge sentence to 32 tokens. Both the language modeling and classification tasks had a weight of 1.0. For inference, we use top-k and top-p (nucleus) sampling with a k value of 0 and a p value of 0.9.
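With k = 0, top-k filtering is disabled and decoding reduces to nucleus (top-p) sampling over the model's next-token distribution. A minimal NumPy sketch of that procedure (the function name and interface are illustrative, not HuggingFace's implementation):

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, renormalize, and sample one."""
    rng = rng or np.random.default_rng()
    # softmax over the vocabulary (shifted for numerical stability)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # sort tokens by probability, descending
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # keep tokens up to and including the first one that crosses p
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))
```

With p = 0.9, low-probability tail tokens are pruned each step, which tends to reduce degenerate repetition compared to pure sampling while staying more diverse than greedy decoding.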

a.2 Policies used

Figure 8 presents the transitions of the Knowledge-Independent DA planning (Simple) policy. According to this policy, a set of candidate dialogue act sequences is specified based on the last dialogue act of the previous turn. One of these sequences is sampled (i.e., weighted_sample_from()) to be included in the action plan. Additionally, the policy determines whether the dialogue act should contain the selected knowledge (i.e., include_knowledge_in_acts()). Both methods are defined in Figures 11 and 12. Figure 13 presents the transitions for the Knowledge-Dependent DA planning. The transitions are similar to those of KI-DA-P (Simple); however, in KD-DA-P, the selected knowledge sentences are used in addition to the previous dialogue acts when determining the set of dialogue acts for the next turn. Figures 9 and 10 show two baseline policies that predict question dialogue acts 65.7% of the time.
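The mechanism described above can be sketched as follows. The transition table and weights here are illustrative placeholders, not the actual transitions of Figure 8; only the overall shape (last act → weighted choice among candidate DA sequences) follows the description:

```python
import random

# Hypothetical transition table: last dialogue act of the previous turn
# maps to weighted candidate DA sequences for the next turn. The entries
# and weights are made up for illustration.
TRANSITIONS = {
    "PropQ":     [(["Statement"], 0.7), (["Statement", "PropQ"], 0.3)],
    "Statement": [(["Feedback", "Statement"], 0.5), (["PropQ"], 0.5)],
    None:        [(["Statement", "PropQ"], 1.0)],  # start of dialogue
}

def weighted_sample_from(options):
    """Sample one DA sequence according to the given weights."""
    sequences, weights = zip(*options)
    return random.choices(sequences, weights=weights, k=1)[0]

def plan_next_acts(last_da):
    """Pick the next turn's dialogue-act sequence from the last act."""
    options = TRANSITIONS.get(last_da, TRANSITIONS[None])
    return weighted_sample_from(options)
```

A knowledge-dependent variant (KD-DA-P) would additionally key the transition choice on properties of the selected knowledge sentence, as described for Figure 13.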

a.3 Layouts for the Crowd Tasks

We performed manual evaluations with experienced annotators as well as crowd workers. Figure 7 shows the interface and instructions we used during these manual evaluations.

Figure 7: MTurk layout for Human evaluation
Figure 8: Knowledge-Independent DA planning (Simple)
Figure 9: Knowledge-Independent DA planning (PropQ)
Figure 10: Knowledge-Independent DA planning (AllQ)
Figure 11: Weighted Sample function
Figure 12: Include knowledge function
Figure 13: Knowledge-Dependent DA planning