
Chat More If You Like: Dynamic Cue Words Planning to Flow Longer Conversations

Building an open-domain multi-turn conversation system is one of the most interesting and challenging tasks in Artificial Intelligence. Many research efforts have been dedicated to building such dialogue systems, yet few shed light on modeling the conversation flow in an ongoing dialogue. People commonly talk about highly relevant aspects during a conversation, and topics stay coherent while drifting naturally, which demonstrates the necessity of dialogue flow modeling. To this end, we present a multi-turn cue-word-driven conversation system with reinforcement learning (RLCw), which strives to select an adaptive cue word with the greatest future credit and thereby improve the quality of generated responses. We introduce a new reward to measure the quality of cue words in terms of effectiveness and relevance. To further optimize the model for long-term conversations, we adopt a reinforcement learning approach. Experiments on a real-life dataset demonstrate that our model consistently outperforms a set of competitive baselines in terms of simulated turns, diversity, and human evaluation.



1 Introduction

Building a conversational system that enables natural human-computer interaction has become increasingly important. Previous efforts focus on task-oriented dialogue systems [Wen et al.2017, Eric and Manning2017, Liu et al.2018], which help people complete specific tasks in vertical domains. Recently, non-task-oriented dialogue systems [Higuchi, Rzepka, and Araki2008, Yu et al.2016] that converse with humans on open-domain topics have attracted increasing attention due to their various applications, such as chatbots, personal assistants, and interactive question answering.

There are two major categories of open-domain conversation systems: single-turn and multi-turn dialogue systems. For single-turn dialogue systems, previous research [Shang, Lu, and Li2015, Vinyals and Le2015, Dai and Le2015, Li et al.2016a, Li et al.2016b, Mou et al.2016, Xing et al.2017, Vougiouklis, Hare, and Simperl2016, Yao et al.2017] concentrated on generating a relevant and diverse response given a static context. One significant issue is that these systems often generate universal responses such as “I don’t know” and “Okay” [Li et al.2016a, Serban et al.2016, Mou et al.2016]. Besides, single-turn dialogue systems ignore the long-term dependency among generated responses, which is critical in natural conversation. Multi-turn dialogue systems are therefore currently the primary choice for building a natural and coherent conversation interface. To strengthen the long-term dependency modeled in multi-turn dialogue systems, reinforcement learning based dialogue generation methods [Li et al.2016c, Asghar et al.2017, Dhingra et al.2017] have been proposed. Nevertheless, the performance of existing conversation systems is still far from satisfactory.

| Cue word | Utterance |
| --- | --- |
| - | A: 去哪里 (Where are you going?) |
| 回家 (home) | B: 回家 (Home.) |
| 上班 (working) | A: 好吧,我还在上班 (Well, I am still working.) |
| 加班 (overtime) | B: 加班?你太辛苦了 (Working overtime? You work too hard.) |
| 委屈 (aggrieved) | A: 我也很委屈 (I feel aggrieved.) |

Table 1: An example of cue words and utterances in a conversation.

In human-human conversations, people tend to talk about highly relevant aspects and topics during a chat session. To make dialogues more interesting, they look for satisfying topics dynamically. They also easily recognize key signs of discomfort, which can be a juncture to seek a new topic. Moreover, such implicit information [Yao et al.2017] has proven effective for meaningful response generation.

However, it is difficult and challenging to launch such a human-computer conversation system. 1) Although topic-augmented neural response generation methods [Mou et al.2016, Yao et al.2017, Wenjie et al.2018] have shown impressive potential in single-turn conversation systems, they do not apply to ongoing dialogues because they ignore the long-term impact of the selected cue words. Besides, the selection of cue words is based on a specific measurement, such as Point-wise Mutual Information (PMI) [Mou et al.2016, Yao et al.2017], or on direct extraction of important words from context [Wenjie et al.2018], neither of which is trainable in ongoing dialogues. 2) It is complicated to model the practical flow of a real conversation. Usually, conversational topics are coherent and drift naturally. However, if a topic makes both sides of the conversation feel uncomfortable, they may try to change it.

To tackle these issues, we present a multi-turn cue words driven conversation system with reinforcement learning, named RLCw. Specifically, we aim to model the topic flow of an ongoing dialogue with cue words. In each turn, we strive to select an adaptive cue word with the greatest future credit based on the dialogue state (history context and cue words). Further, we take the cue words as the main gist of the upcoming utterances to guide the response generation. As shown in Table 1, the selected cue words dynamically drive the dialogue direction and help to generate an informative and interesting conversation. Main contributions of this paper include:

  • In a multi-turn dialogue, we adopt cue words to shape the conversation flow, and unify cue words prediction and responses generation in an end-to-end framework.

  • We propose to measure the quality of a cue word from two aspects: effectiveness and relevance. In this way, selected cue words with higher reward drive the dialogue to be more informative and lead to longer and more fluent conversations.

  • Extensive comparisons and analyses are conducted to draw insights into how our proposed RLCw model leads the conversation in a better direction.

The rest of this paper is organized as follows. We first review previous related work. Next, we present the overall framework and describe the proposed methods in detail, followed by the model training process. Finally, we elaborate on our experimental setup, results, and analysis, and draw our conclusion.

2 Related Work

Building an open-domain conversation system has been one of the most interesting and challenging topics in both artificial intelligence and natural language processing research in recent years. For single-turn conversation systems, prior studies strove to generate more meaningful and informative responses. There are three mainstream ways to address this issue. 1) Modifying the loss function or improving the beam search algorithm. Li et al. (2016a) proposed to use Maximum Mutual Information (MMI) as the objective function in neural models. Shao et al. (2017) introduced a stochastic beam-search algorithm with segment-by-segment re-ranking, which injects diversity earlier in the generation process. 2) Learning latent variables. Zhao et al. (2017) adopted an utterance-level latent variable to model the distribution of the next response so that the system could generate more diverse responses. 3) Fusing additional information. Mou et al. (2016) leveraged Pointwise Mutual Information (PMI) to predict a keyword and presented the seq2BF framework to generate a reply containing the given keyword. Yao et al. (2017) proposed an implicit content-introducing method to incorporate keyword information in a soft schema. Besides, topic information, which is regarded as prior knowledge, has been shown effective in conversation systems [Xing et al.2017]. Inspired by these studies, we also resort to enlightening cue words to improve the informativeness and meaningfulness of generated responses.

As for multi-turn conversation systems, Serban et al. (2016) presented a hierarchical recurrent encoder-decoder (HRED) approach that encodes each utterance and recurrently models the dialogue context to generate responses, which was further improved through a stochastic latent variable at each dialogue turn [Serban et al.2017]. These works focused on the static dialogue context. For ongoing dialogues, deep reinforcement learning methods have been used to improve the performance of response generation. Li et al. (2016) explored an online learning fashion in which the system learned from the feedback of the dialogue partner. Asghar et al. (2017) proposed an active learning approach that learns from explicit user feedback online and combines it with offline supervised learning for response generation of conversational agents. Dhingra et al. (2017) presented an end-to-end dialogue system for information acquisition from a knowledge base using reinforcement learning. Li et al. (2016c) attempted to model the future influence of generated responses, using a deep reinforcement learning approach to optimize the generation model. Based on this, Zhang et al. (2018) also added implicit feedback as a part of the reward.

Different from existing works, our goal is to model the future direction of ongoing conversations. To achieve this, we design our model to dynamically select cue words and integrate them into the decoder to generate proper replies in multi-turn conversations, thus improving user engagement.

3 Methods

Figure 1: The pipeline of our proposed RLCw system.

3.1 System Overview

In this paper, our goal is to dynamically shift conversations to better topics and make conversations more attractive. Figure 1 illustrates the pipeline of our proposed RLCw system. We treat cue word prediction as an action that is taken according to a policy. Given the source input, the policy model first samples a relevant cue word. Then a response is generated based on both the dialogue state and the selected cue word. After that, we estimate the expected reward in the following conversation and optimize the parameters of the policy network.

Formally, a dialogue session and history cue words sequence are given to the system, where is the -th turn’s utterance in the -th session, and denotes the cue word corresponding to . Based on the dialogue state, the system first selects an adaptive cue word using the policy model, and then generates a reply with the cue word augmented response generation framework. The details are as follows.

3.2 Cue Word Augmented Response Generation

We employ a neural sequence-to-sequence model for response generation. In this paper, we adopt the two-layer LSTM [Hochreiter and Schmidhuber1997] framework [Venugopalan et al.2015], which shares parameters between the encoder and decoder modules.

To generate a response, we maximize the generation probability conditioned on the input query and the selected cue word. Due to the computational complexity, the input query is the concatenation of the previous two utterances.

At time , the encoding hidden state of the first layer and second layer are defined as:


where is the input word embedding. The special symbol

denotes padding with zero.


are initialized with zero vectors.

During decoding, we initialize the decoder with the final states of the encoder. The decoder then generates reply words one by one. To incorporate the predicted cue words into the generation process, we introduce cue word information at every decoding step, inspired by Yao et al. (2017). At each generation time step, the hidden states of the two decoder layers are given by:


where is the output word embedding last time,

denotes the probability distribution of candidate words at generation time step


is implemented as a multi-layer perceptron (MLP) layer.

denotes cue words information, which is the linear transformation of cue word embedding:


where and are weight matrices and bias terms, respectively. In this way, we fuse cue words information into generation process so that the system is aware of dialogue direction.
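As a rough illustration of the cue word information described above, the sketch below applies a linear transformation to a cue word embedding and attaches the result to the decoder input at every step. The function names (`linear`, `decoder_inputs`) are hypothetical, and fusing by concatenation is an assumption; the text only states that the cue word information is a linear transformation of the cue word embedding and is introduced at every decoding step.

```python
def linear(weights, vector, bias):
    """Compute weights @ vector + bias for plain Python lists."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def decoder_inputs(prev_word_embeddings, cue_embedding, weights, bias):
    """Attach the same cue word information to every decoding step."""
    cue_info = linear(weights, cue_embedding, bias)
    # Assumption: fusion by concatenation. The source only says that cue word
    # information is introduced at every step of decoding.
    return [list(emb) + cue_info for emb in prev_word_embeddings]


# Toy usage with a 2-dimensional cue word embedding.
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.5, -1.0]
steps = decoder_inputs([[1.0], [2.0]], [2.0, 3.0], W, b)
```

Because the cue word vector is computed once and reused, every decoding step sees the same topical signal regardless of how far generation has progressed.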

Figure 2: The end-to-end framework for cue words selection and topic augmented response generation.

3.3 Policy Model: Cue Words Selection

The policy model is designed for cue words selection. Given the dialogue state, we choose a cue word based on current policy. Then, we estimate the expected reward and optimize the parameters of the policy network.

Inspired by prior work [Li et al.2016c, Lewis et al.2017], we simulate two virtual conversational agents talking with each other. During simulation, we firstly use a message sampled from training data to initialize the dialogue session. Then, two chatbots take turns to encode dialogue history and predict a cue word to express the main gist of the upcoming utterance. Based on the selected cue word, a response is generated until turns. Formally, a simulated conversation is the combination of dialogue session and cue words sequence:


where is the training instance index. Both dialogue session and cue words sequence consist of history and simulation information.

3.3.1 State

We explore context tracker and topic tracker to depict dialogue state. As for context tracker, we only focus on the previous two dialogue utterances due to the computational complexity on modeling the long-term dialogue history. To be specific, it is the encoded vector representation of the previous two utterances.


where denotes for sequence encoding, is final hidden state of the first layer in Eq. 1.

To model a natural and coherent conversation flow, we present topic tracker to represent the topic flow.


where is the hidden state of topic tracker. Note that we do not share the sequence encoding parameters in Eq. 5 and Eq. 6. Further, the dialogue state is given by:


In this way, we describe the dialogue state comprehensively.

3.3.2 Action

Given the current conversation state, an action is a cue word to select. Usually, the cue word is enlightening and drives the generated response in a specific direction. Unlike Li et al. (2016c), we aim to optimize cue word selection so that appropriate cue words shape the conversation flow and make the dialogue more informative and attractive.

3.3.3 Policy

In our reinforcement learning based conversation model, the policy is defined as the probability distribution over the action space. Specifically, based on the current dialogue state, we calculate the probability distribution over pre-defined cue words vocabulary. The one with the highest probability will be selected as current cue word.


where and are weight matrices and bias term. To speed up the training and avoid language divergence, we fix the encoder and decoder parameters and only optimize the policy model during reinforcement learning.
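The policy described above can be sketched as a softmax over the cue word vocabulary. This is a minimal sketch under assumptions: the state is a plain list of floats, the policy is a single linear layer followed by a softmax, and the names `cue_word_policy` and `select_cue_word` are hypothetical. Greedy selection follows the description in this subsection, and the `sample` flag covers the sampling used during conversation simulation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cue_word_policy(state, weights, bias):
    """Probability distribution over the cue word vocabulary given a state vector."""
    logits = [
        sum(w * s for w, s in zip(row, state)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def select_cue_word(probs, vocab, sample=False, rng=random):
    """Pick the most probable cue word, or sample one when sample=True."""
    if sample:
        return rng.choices(vocab, weights=probs, k=1)[0]
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best]


vocab = ["home", "working", "overtime"]
probs = cue_word_policy([1.0, 0.0], [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]], [0.0, 0.0, 0.0])
chosen = select_cue_word(probs, vocab)
```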

3.3.4 Reward

The reward indicates the contribution of an action to the success of a conversation. We aim to measure a good cue word from different aspects.

1) Effectiveness. A suitable cue word should be related to current dialogue state and reply generation, i.e., the generated reply should be semantically relevant to the predicted cue word, no matter whether the cue word explicitly appears in the reply or not. The reward is given by the log cosine similarity between them:


We adopt an embedding-based metric [Liu et al.2016] to measure the correlation between the predicted cue word and dialogue sentences (current dialogue history or generated reply). We do not compute sentence-level embeddings; instead, the cue word is greedily matched with each token in a dialogue sentence based on the cosine similarity of their word embeddings. The highest cosine score is regarded as the correlation between them.

2) Relevance. The basic requirement of a dialogue system is that the generated response should be related to dialogue context. The relevance reward is given as follows:


where is a pre-trained multi-turn conversation matching network [Wu et al.2017]. We adopt the matching score to measure the relevance between a response and dialogue context.
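The greedy matching behind the effectiveness reward can be sketched as follows. The sketch scores a cue word against a single sentence (either the dialogue history or the generated reply); how the two scores are combined is not reproduced here. Clipping the similarity at a small `eps` before taking the logarithm is an assumption added so that the logarithm is defined for non-positive similarities, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_match(cue_vec, token_vecs):
    """Highest cosine similarity between the cue word and any token of the sentence."""
    return max(cosine(cue_vec, t) for t in token_vecs)


def effectiveness_reward(cue_vec, token_vecs, eps=1e-6):
    """Log of the greedy-matching similarity, clipped at eps (an assumption)."""
    return math.log(max(greedy_match(cue_vec, token_vecs), eps))
```

A cue word that appears verbatim in the sentence reaches a similarity of 1 and hence a reward of 0, the maximum; a cue word unrelated to every token is pushed toward the clipped floor.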

To sum up, the reward for action is defined as:


where and we set . To estimate the expected reward, we take the future influence into consideration:


where is decay factor and we set . The learning process iteratively estimates and maximizes the expected future rewards.
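To make the reward aggregation concrete, the sketch below combines the two rewards and accumulates discounted future rewards over a simulated conversation. It assumes that the combined reward is a convex combination of the two terms and that the expected reward is a standard discounted sum; the weight `w_eff` and the decay factor `gamma` are placeholder values, not the settings used in the experiments.

```python
def combined_reward(r_effectiveness, r_relevance, w_eff=0.5):
    """Weighted sum of the two rewards; w_eff is a placeholder weight."""
    return w_eff * r_effectiveness + (1.0 - w_eff) * r_relevance


def discounted_returns(rewards, gamma=0.9):
    """For each turn t, compute the sum over k of gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next turn.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three simulated turns with equal rewards and a decay factor of 0.5.
example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

With these toy values the first turn receives 1 + 0.5 + 0.25 = 1.75, showing how an early cue word is credited for the turns it enables later.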

3.4 Model Training

To warm up for the policy model, we firstly fit our model to human-human conversational patterns with supervised learning. Then, through the simulated dialogues between two agents, we optimize the policy model with reinforcement learning.

Input: number of training instances; simulation turns; sampling times; number of utterances in each dialogue session.
Jointly pre-train policy model and cue word augmented response generation model with supervised learning.
for each training instance do
     for each utterance in the dialogue session do
         Initialize the dialogue state;
         for each sampling time do
              for each simulation turn do
                  Sample an action based on the current policy
                  Compute the reward using Eq. 11
                  Transform to the next state
              end for
              Estimate the expected reward using Eq. 12
         end for
         Compute average reward
         Update policy model using Eq. 16.
     end for
end for
Algorithm 1 Training Process

3.4.1 Single-turn Supervised Learning

For the first stage of training, we aim to learn the cue word augmented response generation model. Given an aligned (query, cue words, response) tuple, we sample a meaningful word (noun, verb, or adjective) from the reference response as the gold cue word. During training, we first select the cue word with the highest probability based on the dialogue context and topic flow. Then, we generate a reply. Formally, the cue word augmented generation model can be formulated as:


Naturally, the objective function is to minimize the cross entropy of cue words selection and responses generation:


where is the number of training instances, denotes the number of utterances in the -th dialogue session, indicates the length of reply words. is the one-hot representation of -th word in reply .

3.4.2 Conversation Simulation

To train the policy model, we firstly initialize it with the supervised model mentioned above. Then, we optimize it with conversation simulation.

The simulation process between two chatbots (sharing the same parameters) consists of the following steps: 1) An initial instance from the training data is fed to agent A as input. 2) Agent A samples a cue word based on the policy model, which computes the probability distribution over the cue word vocabulary. 3) Given the dialogue context and the selected cue word, agent A generates a response. 4) The dialogue state is updated using Eq. 7, and the new state is fed to agent B as input. 5) Steps 2) to 4) are repeated until the conversation reaches an end.
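The simulation steps above can be sketched as a simple loop. This is an illustrative skeleton only: `select_cue` and `generate` are hypothetical callables standing in for the policy model and the cue word augmented generator, the context is kept as the previous two utterances, and the loop stops after a fixed number of turns.

```python
def simulate(initial_context, select_cue, generate, max_turns):
    """Let two agents that share one model take turns for max_turns turns."""
    context = list(initial_context)[-2:]  # only the previous two utterances
    cue_history = []
    dialogue = []
    for turn in range(max_turns):
        speaker = "A" if turn % 2 == 0 else "B"
        cue = select_cue(context, cue_history)  # step 2: choose a cue word
        reply = generate(context, cue)          # step 3: generate a response
        dialogue.append((speaker, cue, reply))
        cue_history.append(cue)                 # step 4: update the state
        context = (context + [reply])[-2:]
    return dialogue


# Toy stand-ins for the policy model and the generator.
demo = simulate(
    ["hi"],
    select_cue=lambda ctx, hist: f"cue{len(hist)}",
    generate=lambda ctx, cue: f"{ctx[-1]}|{cue}",
    max_turns=3,
)
```

Since both agents share parameters, the same two callables serve agent A and agent B; only the speaker label alternates.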

We define the maximum simulation turns to terminate the simulation process. During simulation, we aim to optimize the policy parameters to improve the probability of an action with a greater expected reward. We adopt policy gradient [Williams1992] for optimization. The objective of learning is to maximize the expected future reward:

Number of sessions
Average number of turns
Vocabulary Size
Average length of utterances
Table 2: Statistics of the Weibo dataset after filtering.

The gradient of objective function is calculated by REINFORCE algorithm [Williams1992]:


where the baseline, the average reward of the different actions sampled in the same state, is a baseline estimator used to reduce variance:


where is the -th sampling action in the same state . Together with supervised learning, the whole training process is summarized in Algorithm 1.
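A minimal sketch of a REINFORCE update with an average-reward baseline is given below. It assumes a softmax policy and computes the gradient of the expected reward with respect to the policy logits, averaged over the sampled actions; the function name and the averaging are assumptions, not a reproduction of the exact update rule.

```python
def reinforce_logit_gradient(probs, sampled_actions, returns):
    """Ascent direction on the logits of a softmax policy.

    probs: action probabilities in the current state
    sampled_actions: indices of the actions sampled in that state
    returns: expected future reward obtained for each sampled action
    """
    baseline = sum(returns) / len(returns)  # average reward in the same state
    grad = [0.0] * len(probs)
    for action, ret in zip(sampled_actions, returns):
        advantage = ret - baseline
        for k, p in enumerate(probs):
            # d log pi(action) / d logit_k = 1[k == action] - pi(k)
            indicator = 1.0 if k == action else 0.0
            grad[k] += advantage * (indicator - p)
    return [g / len(returns) for g in grad]


grad = reinforce_logit_gradient([0.5, 0.5], [0, 1], [2.0, 0.0])
```

In the toy call, the action with the above-average return has its logit pushed up and the other is pushed down; when all sampled actions earn the same return, the baseline cancels the update entirely.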

4 Experiments

| Method | Turns | Intra-session Dist-1 | Intra-session Dist-2 | Intra-session Dist-3 | Inter-session Dist-1 | Inter-session Dist-2 | Inter-session Dist-3 | # U. | # B. | # T. | # Words |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S2S | 2.57 | 0.52 | 0.52 | 0.41 | 0.01 | 0.05 | 0.10 | 7.83 | 9.65 | 8.63 | 2,435 |
| S2S-Cw | 4.38 | 0.52 | 0.57 | 0.46 | 0.01 | 0.07 | 0.16 | 11.74 | 16.36 | 15.20 | 4,733 |
| RL-S2S | 5.45 | 0.58 | 0.66 | 0.54 | 0.01 | 0.05 | 0.11 | 18.91 | 24.31 | 20.67 | 4,219 |
| RLCw-E. | 5.93 | 0.50 | 0.61 | 0.53 | 0.01 | 0.07 | 0.18 | 16.78 | 24.95 | 23.86 | 4,889 |
| RLCw-R. | 6.30 | 0.53 | 0.65 | 0.56 | 0.01 | 0.07 | 0.19 | 19.94 | 29.69 | 28.30 | 5,726 |
| RLCw | 6.51 | 0.52 | 0.64 | 0.55 | 0.01 | 0.08 | 0.20 | 19.43 | 28.95 | 27.44 | 5,637 |

Table 3: Automatic evaluation results of our proposed model against baselines. Suffixes “-E.” and “-R.” denote the RLCw model with only the effectiveness or the relevance reward, respectively. Turns refers to the average number of simulated turns. # U., # B., and # T. are the average numbers of distinct unigrams, bigrams, and trigrams in a dialogue session. # Words denotes the number of distinct words in all simulated conversations.

In this section, we compare our method with three representative baselines based on a huge publicly available conversation resource. The objectives of our experiments are to 1) evaluate the effectiveness of our proposed RLCw model, and 2) explore how selected cue words affect the dialogue process.

4.1 Dataset

We conduct experiments on a public multi-turn Weibo dataset, which consists of conversation sessions divided into a training set and a testing set.

The dataset is collected from Sina Weibo, one of the most popular social media sites in China, used by over 30% of Internet users [Wenjie et al.2018] and covering rich real-world topics from daily life. To ensure higher data quality, we construct the experimental dataset in the following steps: 1) Keep the conversational sessions with more than two turns. 2) Remove repetitive training instances (for a dialogue session, we extract two consecutive utterances as a query and the following one as the reply; empty sentences are not allowed). 3) For instances with the same reply, we only use the ten with the most query words. 4) We build a vocabulary of nouns, verbs, and adjectives (we use Jieba as our segmentation and POS tagging toolkit). Then, we keep the 999 most frequent words and a special symbol as the cue word set. For each utterance, we match the longest word from the cue word set. If no word matches, the utterance is labeled with the special symbol, and we only keep 1000 training instances with this special label. 5) Words that occur fewer than 11 times in the training data are replaced with a special symbol. Table 2 presents the statistics of the experimental Weibo dataset after filtering. Further, we split it 8:1:1 into training, validation, and testing sets.
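Two of the preprocessing steps above, instance extraction and cue word labeling, can be sketched as follows. The sketch assumes that utterances are already segmented into tokens (for example by Jieba) and that `"<none>"` stands in for the special label; both function names and the label string are hypothetical.

```python
def extract_instances(session):
    """Two consecutive utterances form the query; the next utterance is the reply."""
    instances = []
    for i in range(len(session) - 2):
        query = (session[i], session[i + 1])
        reply = session[i + 2]
        # Empty sentences are not allowed anywhere in an instance.
        if all(u.strip() for u in (*query, reply)):
            instances.append((query, reply))
    return instances


def match_cue_word(tokens, cue_vocab, fallback="<none>"):
    """Return the longest token found in the cue word set, or the fallback label."""
    candidates = [t for t in tokens if t in cue_vocab]
    return max(candidates, key=len) if candidates else fallback


pairs = extract_instances(["a", "b", "c", "d"])
label = match_cue_word(["我", "加班", "辛苦了"], {"加班", "辛苦了"})
```

A session needs at least three utterances to yield an instance, which is consistent with keeping only sessions that have more than two turns.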

| Choice % | RLCw (vs S2S) | S2S | Tie | Kap. | RLCw (vs S2S-Cw) | S2S-Cw | Tie | Kap. | RLCw (vs RL-S2S) | RL-S2S | Tie | Kap. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fluency | 48.0 | 23.5 | 28.5 | 0.43 | 38.8 | 27.3 | 33.9 | 0.42 | 41.2 | 32.5 | 26.3 | 0.40 |
| Consistency | 48.2 | 25.7 | 26.1 | 0.39 | 38.2 | 30.8 | 31.0 | 0.42 | 39.0 | 32.0 | 29.0 | 0.43 |
| Relevance | 37.0 | 27.2 | 35.8 | 0.43 | 34.7 | 26.2 | 39.1 | 0.46 | 34.3 | 30.7 | 35.0 | 0.42 |
| Informativeness | 61.5 | 19.7 | 18.8 | 0.40 | 51.3 | 26.2 | 22.5 | 0.41 | 51.6 | 25.8 | 22.6 | 0.39 |
| Preference | 37.7 | 20.2 | 42.1 | 0.46 | 35.8 | 24.3 | 39.9 | 0.44 | 34.7 | 28.2 | 37.1 | 0.42 |

Table 4: Human evaluation results on five aspects: fluency, consistency, relevance, informativeness, and overall user preference. We conducted a significance test (t-test); ** and * indicate significance at the 0.01 and 0.05 levels, respectively. Kap. denotes the Kappa coefficient, which shows moderate agreement among evaluators.

4.2 Baselines

In this paper, we conduct experiments to compare our proposed method against three representative baselines.

S2S We implemented sequence to sequence generation model [Venugopalan et al.2015], which is treated as a preliminary baseline.

S2S-Cw In cue word augmented response generation method, we jointly model cue words selection and responses generation. Without future reward optimization, the supervised training process is based on the pre-trained S2S model.

RL-S2S Based on the pre-trained S2S model, RL-S2S [Li et al.2016c] further optimizes it with reinforcement learning. To do this, the model explores the action space (generating a dialogue utterance) and learns the policy, which takes the form of an LSTM encoder-decoder.

Different from the RL-S2S baseline, our proposed RLCw model focuses on adaptive cue word selection so as to lead the conversation direction naturally.

4.3 Experiment Setup

As all our baselines and the proposed method are RNN-based generative models, we use the same set of hyper-parameters for all of them. In our experiments, the batch size was 64. We limited the maximum length of a source input to 44 words (22 for output) and directly discarded the part that exceeded the length constraint. The numbers of simulated conversational turns were set to 3 during training and 10 during testing.

The dimension of word embeddings was set to 600. To speed up training, we pre-trained Word2Vec [Mikolov et al.2013] embeddings using the gensim library. Both LSTMs have 1000 hidden cells in each layer. We applied Adam [Kingma and Ba2015] to optimize the objective function, and the learning rate was initialized to 0.0001. These values were mostly chosen empirically.

Source Input 等你出道 I’m waiting for your official debut.
Cue words 机智 安逸 发型 兄弟 goddess witty comfort hair style brother attract
Dialogue A: 你的美女是谁 A: Who is your goddess?
B: 机智的你 B: It’s you! My cute girl.
A: 你最棒,我是最漂亮的 A: You are the best, and I am the most beautiful.
B: 你的头像是我吗 B: Do you use my photo as portrait?
A: 兄弟,我是你的小粉丝 A: Brother, I’m your fan.
B: 你是不是被我走了 B: Are you attracted by me?
Source Input 我能说我高中三年都是球头吗。这是学校规定 I have been with buzz cut for three years in high school. This is one of the school regulations.
Cue words 美女 重庆 beauty work visit Chongqing
Dialogue A: 哦哦,这样啊,美女,你今天回去了吗 A: Oh, I see. Beauty, did you go back today?
B: 明天回了,我在毕业设计 B: I will go back tomorrow. I’m working on my graduation thesis.
A: 我明天可以去 你嘛,你去哪了 A: May I visit you tomorrow? By the way, where have you been?
B: 我去南京 B: Nanjing.
Table 5: Case studies of the generated cue words and dialogues.

4.4 Evaluation Metrics

We evaluate different methods with both automatic metrics and human judgments.

4.4.1 Automatic metrics.

Inspired by the simulation strategy for training the policy model, we also use simulation during testing. The system needs high-quality initial input to start the conversation, since it is unclear how to continue a dialogue that begins with “me too”. Therefore, we manually build a set of dull sentences and use it to filter out test data with meaningless queries. In total, there are 17,832 input messages for testing.

The first automatic metric we use is the average number of turns of the simulated dialogue. We define the termination conditions as follows: 1) A dull sentence is generated. 2) The word overlap between two consecutive utterances, from the same or different agents, exceeds 80%. 3) The number of simulated turns reaches the maximum limit set for testing. This metric is employed to measure the conversational engagement of different methods.
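The three termination conditions can be sketched as a single check applied after each generated utterance. The overlap measure here (shared distinct words divided by the size of the smaller word set) is an assumption, since the exact definition is not given; utterances are token lists, `dull_sentences` is a set of token tuples, and the function names are hypothetical.

```python
def word_overlap(a, b):
    """Share of distinct words in common, relative to the smaller utterance."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def should_stop(utterances, dull_sentences, max_turns, threshold=0.8):
    """Check the three termination conditions on the latest utterance."""
    last = utterances[-1]
    if tuple(last) in dull_sentences:  # 1) a dull sentence was generated
        return True
    # 2) too much overlap with the previous utterance of either agent
    for previous in utterances[-3:-1]:
        if word_overlap(previous, last) > threshold:
            return True
    return len(utterances) >= max_turns  # 3) maximum number of turns reached
```

Comparing against the two preceding utterances covers both cases in condition 2: the other agent's last utterance and the same agent's previous one.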

The degree of diversity is another important measurement for conversation systems. We compute the ratio of distinct unigram, bigram, and trigram in the generated utterances, which are denoted as Dist-1, Dist-2, and Dist-3, respectively. To show the fine-grained difference, we report the diversity in intra- and inter-session level.
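The Dist-n computation can be sketched as below. The split into intra- and inter-session levels is an assumption about how the two levels are defined: the intra-session score averages Dist-n over individual sessions, while the inter-session score pools the utterances of all sessions. Function names are hypothetical.

```python
def distinct_n(utterances, n):
    """Ratio of distinct n-grams to all n-grams in a list of tokenized utterances."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in utterances
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def intra_session_distinct(sessions, n):
    """Average Dist-n computed separately within each session."""
    scores = [distinct_n(session, n) for session in sessions]
    return sum(scores) / len(scores) if scores else 0.0


def inter_session_distinct(sessions, n):
    """Dist-n over the utterances of all sessions pooled together."""
    pooled = [utterance for session in sessions for utterance in session]
    return distinct_n(pooled, n)


# Two sessions that each contain the same single utterance.
sessions = [[["a", "b"]], [["a", "b"]]]
```

In this toy case the intra-session Dist-1 is 1.0 while the inter-session Dist-1 is 0.5, which illustrates why scores pooled across many sessions are much lower than per-session scores.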

4.4.2 Subjective metrics.

We also conduct pairwise human evaluation to assess subjective quality of generated multi-turn conversations. Given two simulated dialogues, we compare them from five aspects: fluency (the generated sentences are fluent without grammatical errors), consistency (whether the conversation is logically consistent and coherent), relevance (whether the responses are semantically relevant to dialogue context), informativeness (whether the dialogue is informative and meaningful) and overall user preference (how do users like the dialogues).

4.5 Main results

4.5.1 Automatic evaluation.

The automatic results of our model against all baselines are listed in Table 3. Our proposed RLCw model clearly outperforms the baselines in simulated turns (6.51 versus 5.45 for RL-S2S, the strongest baseline), which demonstrates more active engagement. The RLCw model also generates more diverse outputs: compared with the baselines, it obtains the highest ratio of distinct trigrams at both the intra- and inter-session levels. However, it is slightly inferior to RL-S2S in the intra-session Dist-1 metric, mainly because of more simulation turns.

As for the baselines, the S2S model performs worst. S2S-Cw is slightly inferior to the RL-S2S model. However, it generates more distinct words across all simulated conversations than RL-S2S, as the augmented cue words provide it with a broader space for learning. Our proposed RLCw model builds on this advantage and produces longer and more diverse conversations.

To verify the effectiveness of the proposed rewards, we also conducted an ablation study. Table 3 shows that both RLCw-E. (effectiveness reward only) and RLCw-R. (relevance reward only) outperform the baselines. Compared with RLCw-E., RLCw-R. tends to produce longer and more diverse conversations, which demonstrates the importance of the quality constraint in response generation. With the two rewards combined, RLCw obtains comparable diversity and the longest simulated conversations, reflecting the highest user engagement.

4.5.2 Human evaluation.

We randomly sample 150 messages from the test data to conduct a pairwise comparison, i.e., given an input message, we group two simulated dialogues together (for fairness, the two simulated conversations are pooled and randomly permuted) and ask the evaluators to choose which is better. We invited four native speakers to offer a judgment. The results of human evaluation against all baseline methods are listed in Table 4. Like the automatic evaluation results, RLCw consistently outperforms the baselines, which demonstrates the effectiveness of our proposed method. In particular, our proposed method shows a prominent improvement in terms of informativeness.

4.6 Analysis

We have elaborated the overall performance of all methods in the last subsection. Next, we will look closer into how cue words affect the dialogue process.

4.6.1 Cue words analysis.

First, we measure the quality and impact of the generated cue word sequences. We estimate the quality as the average cosine similarity of each word pair in a cue word sequence. The result is 0.096 (the corresponding value for cue words extracted from reply sentences in the training data is 0.137), which shows their semantic compactness. Again, we use the embedding-based metric [Liu et al.2016] to estimate the correlation between a cue word and the corresponding generated response. The correlation score is 0.81. Since about 41% of the cue words appear in the simulated dialogues, the selected cue words have a great impact on response generation.
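The compactness measure used for a cue word sequence can be sketched as the mean cosine similarity over all unordered pairs of cue word embeddings. The embeddings are passed in as plain lists of floats, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Three toy cue word embeddings: two identical, one orthogonal to them.
score = mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```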

4.6.2 Case study.

We further present two representative examples of our generated dialogues in Table 5. In the first example, our system dynamically plans a fluent dialogue flow. Based on the selected cue words, our RLCw model generates coherent and interesting dialogues. In the second example, our system first responds to the given input message and then shifts the topic to “beauty” through cue word augmentation, which further affects the direction of the follow-up dialogue.

5 Conclusion

We study open-domain dialogue generation with cue word augmentation, which steers the direction of conversations. Specifically, we present the multi-turn cue-words driven conversation system with reinforcement learning (RLCw), which jointly models cue word prediction and response generation in an end-to-end framework. To select higher-quality cue words, we design a new reward that measures the effectiveness and relevance of cue words. We conduct experiments on a publicly available dataset to evaluate our model on dialogue duration, diversity, and human judgements, showing that the proposed method consistently outperforms a set of competitive baselines.

References


  • [Asghar et al.2017] Asghar, N.; Poupart, P.; Xin, J.; and Li, H. 2017. Online sequence-to-sequence reinforcement learning for open-domain conversational agents. In Joint Conference on Lexical and Computational Semantics.
  • [Dai and Le2015] Dai, A. M., and Le, Q. V. 2015. Semi-supervised sequence learning. In Conference on Neural Information Processing Systems.
  • [Dhingra et al.2017] Dhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.-N.; Ahmed, F.; and Deng, L. 2017. End-to-end reinforcement learning of dialogue agents for information access. In ACL.
  • [Eric and Manning2017] Eric, M., and Manning, C. D. 2017. A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. In EACL.
  • [Higuchi, Rzepka, and Araki2008] Higuchi, S.; Rzepka, R.; and Araki, K. 2008. A casual conversation system using modality and word associations retrieved from the web. In EMNLP.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation.
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • [Lewis et al.2017] Lewis, M.; Yarats, D.; Dauphin, Y. N.; Parikh, D.; and Batra, D. 2017. Deal or no deal? end-to-end learning for negotiation dialogues. In EMNLP.
  • [Li et al.2016a] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In NAACL.
  • [Li et al.2016b] Li, J.; Galley, M.; Brockett, C.; Spithourakis, G. P.; Gao, J.; and Dolan, B. 2016b. A persona-based neural conversation model. In ACL.
  • [Li et al.2016c] Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016c. Deep reinforcement learning for dialogue generation. In EMNLP.
  • [Li et al.2017] Li, J.; Miller, A. H.; Chopra, S.; Ranzato, M.; and Weston, J. 2017. Dialogue learning with human-in-the-loop. In International Conference on Learning Representations.
  • [Liu et al.2016] Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP.
  • [Liu et al.2018] Liu, B.; Tur, G.; Hakkani-Tur, D.; Shah, P.; and Heck, L. 2018. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. In NAACL, 67–73.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Conference on Neural Information Processing Systems.
  • [Mou et al.2016] Mou, L.; Song, Y.; Yan, R.; Li, G.; Zhang, L.; and Jin, Z. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In International Conference on Computational Linguistics.
  • [Serban et al.2016] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI.
  • [Serban et al.2017] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI.
  • [Shang, Lu, and Li2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL.
  • [Shao et al.2017] Shao, Y.; Gouws, S.; Britz, D.; Goldie, A.; Strope, B.; and Kurzweil, R. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In EMNLP.
  • [Venugopalan et al.2015] Venugopalan, S.; Rohrbach, M.; Donahue, J.; Mooney, R.; Darrell, T.; and Saenko, K. 2015. Sequence to sequence-video to text. In International Conference on Computer Vision.
  • [Vinyals and Le2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. Computer Science.
  • [Vougiouklis, Hare, and Simperl2016] Vougiouklis, P.; Hare, J.; and Simperl, E. 2016. A neural network approach for knowledge-driven response generation. In International Conference on Computational Linguistics.
  • [Wen et al.2017] Wen, T.-H.; Vandyke, D.; Mrksic, N.; Gasic, M.; Rojas-Barahona, L. M.; Su, P.-H.; Ultes, S.; and Young, S. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL.
  • [Wenjie et al.2018] Wenjie, W.; Minlie, H.; Xin-Shun, X.; Fumin, S.; and Liqiang, N. 2018. Chat more: Deepening and widening the chatting topic via a deep model. In International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • [Williams1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning.
  • [Wu et al.2017] Wu, Y.; Wu, W.; Xing, C.; Zhou, M.; and Li, Z. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In ACL.
  • [Xing et al.2017] Xing, C.; Wu, W.; Wu, Y.; Liu, J.; Huang, Y.; Zhou, M.; and Ma, W.-Y. 2017. Topic augmented neural response generation with a joint attention mechanism. In AAAI.
  • [Yao et al.2017] Yao, L.; Zhang, Y.; Feng, Y.; Zhao, D.; and Yan, R. 2017. Towards implicit content-introducing for generative short-text conversation systems. In EMNLP.
  • [Yu et al.2016] Yu, Z.; Xu, Z.; Black, A. W.; and Rudnicky, A. 2016. Strategy and policy learning for non-task-oriented conversational systems. In the annual conference of the joint ACL/ISCA Special Interest Group on Discourse and Dialogue.
  • [Zhang et al.2018] Zhang, W.; Li, L.; Cao, D.; and Liu, T. 2018. Exploring implicit feedback for open domain conversation generation. In AAAI.
  • [Zhao, Zhao, and Eskenazi2017] Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In ACL.