"Wait, I'm Still Talking!" Predicting the Dialogue Interaction Behavior Using Imagine-Then-Arbitrate Model

by   Zehao Lin, et al.
Zhejiang University

Producing natural and accurate responses like human beings is the ultimate goal of intelligent dialogue agents. So far, most past work concentrates on selecting or generating one pertinent and fluent response according to the current query and its context. These models work in a one-to-one setting, making one response to one utterance each round. However, in real human-human conversations, people often send several short messages in sequence for readability instead of one long message per turn. As a result, messages do not end with an explicit ending signal, which is crucial for agents to decide when to reply. So the first step for an intelligent dialogue agent is not replying but deciding whether it should reply at the moment. To address this issue, we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to help the agent decide whether to wait or to respond directly. Our method has two imaginator modules and an arbitrator module. The two imaginators learn the agent's and the user's speaking styles respectively and generate possible utterances, which are combined with the dialogue history as the input of the arbitrator. The arbitrator then decides whether to wait or to respond to the user directly. To verify the performance and effectiveness of our method, we prepared two dialogue datasets and compared our approach with several popular models. Experimental results show that our model performs well on the ending prediction problem and outperforms baseline models.



1 Introduction

Figure 1: A multi-turn dialogue fragment. In this case, the user sends split utterances within a turn, e.g. U1 is split into {U11, U12, U13}.

All species are unique, but language makes humans the most unique [premack2004language]. Dialogues, especially spoken and written dialogues, are fundamental communication mechanisms for human beings. In real life, a great deal of business and entertainment is conducted through dialogue. This makes it significant and valuable to build intelligent dialogue products. So far there are quite a few business applications of dialogue techniques, e.g. personal assistants, intelligent customer service and chitchat companions.

Response quality has always been the most important metric for dialogue agents, and most existing work targets it by searching for the best response. Some works incorporate knowledge [DBLP:conf/acl/FungWM18, lin-etal-2019-task] to improve the success rate of task-oriented dialogue models, while others [NIPS2015_5866] address the rare-word problem and make responses more fluent and informative.

Despite the heated competition among models, the pace of interaction is also important for human-computer dialogue agents and has drawn little or no attention. Figure 1 shows a typical dialogue fragment in an instant messaging program. A user is asking the service about the schedule of the theater. The user first says hello (U11), then describes the demand (U12), and then asks for a suggested arrangement (U13), each of which is sent as a separate message within one turn. The agent does not answer (A2) until the user has finished the description and asked the question. The user then makes a decision (U21) and asks a new question (U22), and the agent replies with (A3). It is quite normal and natural for the user to send several messages in one turn and for the agent to wait until the user has finished the last message; otherwise the pace of the conversation is disrupted. However, existing dialogue agents cannot handle this scenario well and reply immediately to every utterance they receive.

There are two issues when applying existing dialogue agents to real-life conversation. Firstly, when a user sends a short utterance at the start of a conversation, the agent has to decide whether to reply at all, to avoid generating bad responses based on a semantically incomplete utterance. Secondly, a dialogue agent that cuts into the conversation at an unreasonable time can confuse the user and disrupt the pace of the conversation, leading to nonsensical interactions.

To address these two issues, in this paper, we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to recognize whether it is the appropriate moment for the agent to reply when it receives a message from the user. Our method has two imaginator modules and an arbitrator module. The imaginators learn the agent’s and the user’s speaking styles respectively. The arbitrator uses the dialogue history and the imagined future utterances generated by the two imaginators to decide whether the agent should wait for the user or respond directly.

In summary, this paper makes the following contributions:

  • We are the first to address an interaction problem: whether the dialogue model should wait for the end of the user's utterance or respond directly, in order to simulate real-life conversation. We tried several popular baseline models on this problem.

  • We propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to solve this problem, based on both the historical conversation information and the predicted possible future utterances.

  • We modified two popular dialogue datasets to simulate real human dialogue interaction behavior.

  • Experimental results demonstrate that our model performs well on the ending prediction problem and that the proposed imaginator modules significantly help the arbitrator outperform baseline models.

2 Related Work

2.1 Dialogue System

Creating a perfect human-computer dialogue system has always been an ultimate goal of natural language processing. In recent years, deep learning has become a basic technique in dialogue systems. Much work has investigated applying neural networks to dialogue system components or end-to-end dialogue frameworks [YanDCZZL17, lipton2018bbq-networks]. The advantage of deep learning is its ability to leverage large amounts of data from the internet, sensors, etc. Large-scale conversation data and deep learning techniques like SEQ2SEQ [NIPS2014_5346] and the attention mechanism [DBLP:conf/emnlp/LuongPM15] help the model understand utterances, retrieve background knowledge and generate responses.

Figure 2: Model Overview. (a) Train the agent and user imaginators using the same dialogues but different samples. (b) During training and inference step, arbitrator uses the dialogue history and two trained imaginators’ predictions.

2.2 Classification in Dialogue

Though end-to-end methods play an increasingly important role in dialogue systems, text classification modules [jiang2018text, kowsari2017hdltex] remain very useful in many problems such as emotion recognition [song-etal-2019-generating], gender recognition [hoyle-etal-2019-unsupervised], verbal intelligence, etc. Several widely used text classification methods have been proposed, e.g. Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Typically an RNN is trained to recognize patterns across time, while a CNN learns to recognize patterns across space. [kim2014convolutional] proposed TextCNNs trained on top of pre-trained word vectors for sentence-level classification tasks, and achieved excellent results on multiple benchmarks.

Besides RNNs and CNNs, [vaswani2017attention] proposed a new network architecture called the Transformer, based solely on the attention mechanism, and obtained promising performance on many NLP tasks. To make the best use of unlabeled data, [devlin2018bert] introduced a new language representation model called BERT, based on the Transformer, and obtained state-of-the-art results.

2.3 Dialogue Generation

Unlike retrieval methods, Natural Language Generation (NLG) converts a communication goal, selected by the dialogue manager, into natural language. It reflects the naturalness of a dialogue system, and thus the user experience. The traditional template or rule-based approach mainly consists of a set of templates, rules, and hand-crafted heuristics designed by domain experts. This makes it labor-intensive yet rigid, motivating researchers to find more data-driven approaches [ghazvininejad2018knowledge, lin-etal-2019-task] that aim to optimize a generation module from corpora. One of these, the Semantically Controlled LSTM (SC-LSTM) [wen2015semantically], a variant of LSTM [hochreiter1997long], adds an extra component that gives semantic control over language generation.

3 Task Definition

In this section we describe the task through a scenario and then define it formally.

As shown in Figure 1, we have two participants in a conversation. One is the dialogue agent, and the other is a real human user. The agent’s behavior is similar to that of most chatbots, except that it doesn’t reply to every sentence received. Instead, the agent judges when the right time to reply is.

Our problem is formulated as follows. There is a conversation history represented as a sequence of utterances $X = \{x_1, x_2, \ldots, x_m\}$, where each utterance $x_i$ is itself a sequence of words $x_{i,1}, x_{i,2}, \ldots, x_{i,n}$. Besides, each utterance has some additional tags:

  • turn tags $t_0, t_1, \ldots, t_k$ to show which turn of the whole conversation this utterance is in.

  • speakers’ identification tags $agent$ or $user$ to show who sends this utterance.

  • subturn tags $st_0, st_1, \ldots, st_j$ for the user, to indicate which subturn an utterance is in. Note that an utterance is labelled as $st_0$ even if it has no subturns.

Now, given a dialogue history $X$ and tags $T$, the goal of the model is to predict a label $Y \in \{0, 1\}$, the action the agent will take, where $Y = 0$ means the agent will wait for the user's next message, and $Y = 1$ means the agent will reply immediately. Formally, we maximize the following probability:

$$Y' = \arg\max_{Y} P(Y \mid X, T) \tag{1}$$
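To make the task definition concrete, here is a minimal sketch of how tagged utterances and wait/reply labels could be represented. The `Utterance` class, the `action_labels` helper, the 0/1 label encoding and the turn numbering are illustrative assumptions, not the authors' released code. The sketch assumes the gold action after a user utterance is "wait" when the next utterance also comes from the user and "reply" when it comes from the agent.

```python
from dataclasses import dataclass
from typing import List, Tuple

WAIT, REPLY = 0, 1  # assumed label encoding for this sketch


@dataclass
class Utterance:
    words: List[str]
    turn: int      # turn tag
    speaker: str   # identity tag: "agent" or "user"
    subturn: int   # subturn tag; 0 when the utterance was not split


def action_labels(dialogue: List[Utterance]) -> List[Tuple[int, int]]:
    """Return (index, label) for every user utterance that has a successor.

    Assumption: the gold action after a user utterance is WAIT when the
    next utterance is also from the user, and REPLY when it is the agent's.
    """
    labels = []
    for i in range(len(dialogue) - 1):
        if dialogue[i].speaker != "user":
            continue
        next_speaker = dialogue[i + 1].speaker
        labels.append((i, WAIT if next_speaker == "user" else REPLY))
    return labels


# Speaker pattern of Figure 1: A1, U11, U12, U13, A2, U21, U22, A3.
# Word contents are placeholders; the turn numbers are an assumption.
pattern = [("agent", 1, 0), ("user", 1, 0), ("user", 1, 1), ("user", 1, 2),
           ("agent", 2, 0), ("user", 2, 0), ("user", 2, 1), ("agent", 3, 0)]
figure1 = [Utterance(["..."], turn, speaker, sub) for speaker, turn, sub in pattern]
figure1_labels = action_labels(figure1)
```

Under these assumptions, the agent waits after U11, U12 and U21 and replies after U13 and U22, which matches the behavior described for Figure 1.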


4 Proposed Framework

Basically, the task can be treated as a simple text classification problem. However, traditional classification models only use the dialogue history to predict the ground truth label, and the ground truth label by itself discards all the context information in the next utterance. To make the best use of the training data, we propose a novel Imagine-then-Arbitrate (ITA) model that takes the dialogue history $X$, the ground truth label, and the possible future utterances into consideration. In this section, we describe the architecture of our model and how it works in detail.

4.1 Imaginator

An imaginator is a natural language generator that generates the next sentence given the dialogue history. There are two imaginators in our method, the agent’s imaginator and the user’s imaginator. The goal of the two imaginators is to learn the agent’s and the user’s speaking styles respectively and to generate possible future utterances.

As shown in Figure 2 (a), an imaginator is itself a sequence generation model. We use one-hot embedding to convert all words and related tags, e.g. turn tags and placeholders, to one-hot vectors $w \in \{0,1\}^{V}$, where $V$ is the length of the vocabulary list. Then we extend each word in an utterance by concatenating the token itself with its turn tag, identity tag and subturn tag. We adopt SEQ2SEQ as the basic architecture and LSTMs as the encoder and decoder networks. The LSTMs encode each extended word $w_t$ as a continuous vector $h_t$ at each time step $t$. The process can be formulated as follows:

$$\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, e_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, e_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, e_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, e_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{2}$$

where $e_t$ is the embedding of the extended word $w_t$, and $W_f$, $b_f$, $W_i$, $b_i$, $W_o$, $b_o$, $W_c$ and $b_c$ are learnt parameters.

Though trained on the same dataset, the two imaginators learn different roles independently. So we split the same piece of dialogue into different samples for the different imaginators. For example, as shown in Figures 1 and 2 (a), we use the utterances (A1, U11, U12) as the dialogue history input and U13 as the ground truth to train the user imaginator, and use the utterances (A1, U11, U12, U13) as the dialogue history and A2 as the ground truth to train the agent imaginator.
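The sample split described above can be sketched as follows. The function name `split_imaginator_samples` and the `(speaker, id)` tuple representation are hypothetical; the sketch assumes every utterance after the first becomes a target whose history is everything before it, and that the target's speaker decides which imaginator receives the sample.

```python
from typing import List, Tuple

Utt = Tuple[str, str]  # (speaker, utterance id or text)


def split_imaginator_samples(dialogue: List[Utt]):
    """Split one dialogue into (history, target) samples per imaginator."""
    agent_samples, user_samples = [], []
    for i in range(1, len(dialogue)):
        history, target = dialogue[:i], dialogue[i]
        # The speaker of the target utterance selects the imaginator.
        if target[0] == "agent":
            agent_samples.append((history, target))
        else:
            user_samples.append((history, target))
    return agent_samples, user_samples


dialogue = [("agent", "A1"), ("user", "U11"), ("user", "U12"),
            ("user", "U13"), ("agent", "A2")]
agent_samples, user_samples = split_imaginator_samples(dialogue)
```

For this fragment, the last user sample is (A1, U11, U12) with target U13 and the only agent sample is (A1, U11, U12, U13) with target A2, the two cases named in the text.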

During training, the encoder runs as in equation (2), and the decoder is an LSTM with the same structure whose hidden state $h_t$ is fed to a Softmax layer with parameters $W_s$ and $b_s$, which produces a probability distribution $p_t$ over all words. Formally:

$$p_t = \mathrm{softmax}(W_s h_t + b_s) \tag{3}$$

At time step $t$ the decoder selects the word with the highest probability in $p_t$, and the imaginator’s loss is the sum of the negative log likelihood of the correct word at each step:

$$\mathcal{L}_{IG} = -\sum_{t=1}^{n} \log p_t(w_t^{*}) \tag{4}$$

where $w_t^{*}$ is the correct word at step $t$ and $n$ is the length of the generated sentence. During inference, we also apply beam search to improve the generation performance.

Finally, the trained agent imaginator and user imaginator are obtained.
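As a worked illustration of the imaginator loss, the following sketch computes the summed negative log likelihood of the correct words and the greedy word choice from per-step distributions. The probabilities are made-up toy values, and `greedy_word` and `imaginator_nll` are hypothetical helper names. A real imaginator would obtain each distribution from the decoder's softmax and would use beam search rather than greedy selection at inference time.

```python
import math
from typing import Dict, List


def greedy_word(p_t: Dict[str, float]) -> str:
    """Pick the word with the highest probability at one decoding step."""
    return max(p_t, key=p_t.get)


def imaginator_nll(step_probs: List[Dict[str, float]], target: List[str]) -> float:
    """Sum of negative log likelihoods of the correct word at each step."""
    return -sum(math.log(p_t[w]) for p_t, w in zip(step_probs, target))


# Toy per-step distributions over a tiny vocabulary (values are invented).
step_probs = [
    {"thanks": 0.5, "would": 0.25, "no": 0.25},
    {"for": 0.25, "you": 0.75},
]
target = ["thanks", "for"]

loss = imaginator_nll(step_probs, target)        # -(ln 0.5 + ln 0.25) = ln 8
greedy = [greedy_word(p) for p in step_probs]    # ["thanks", "you"]
```

The second step shows why the loss is non-zero even when the first word is right: the greedy choice "you" differs from the correct word "for", which only receives probability 0.25.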

4.2 Arbitrator

The arbitrator module is fundamentally a text classifier. However, in this task, we want the module to make maximal use of both the dialogue history and the semantic information of the ground truth. So we turn the problem of maximizing $P(Y \mid X, T)$ in equation (1) into:

$$\begin{aligned}
R_{agent} &= IG_{agent}(X, T) \\
R_{user} &= IG_{user}(X, T) \\
Y' &= \arg\max_{Y} P(Y \mid X, T, R_{agent}, R_{user})
\end{aligned} \tag{5}$$

where $IG_{agent}$ and $IG_{user}$ are the trained agent imaginator and user imaginator respectively, and $Y'$ is a selection indicator where $Y' = 1$ means selecting $R_{agent}$ whereas $Y' = 0$ means selecting $R_{user}$. Thus we (1) introduce the semantic information of the generation ground truth and of the predicted possible future utterances, and (2) turn the label prediction problem into a response selection problem.

We adopt several architectures like Bi-GRUs, TextCNNs and BERT as the basis of arbitrator module. We will show how to build an arbitrator by taking TextCNNs as an example.

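As shown in Figure 2, three CNNs with the same structure take the inferred responses $R_{agent}$ and $R_{user}$ and the dialogue history $X$ with tags $T$ as input. For each raw word sequence $w_1, w_2, \ldots, w_l$, we encode each word as a one-hot vector $w_i \in \{0,1\}^{V}$. By looking up a word embedding matrix $E \in \mathbb{R}^{V \times d}$, the input text is represented as an input matrix $Q \in \mathbb{R}^{l \times d}$, where $l$ is the length of the word sequence and $d$ is the dimension of the word embedding features. The matrix is then fed into a convolution layer where a filter $\mathbf{k} \in \mathbb{R}^{h \times d}$ is applied:

$$c_i = f(\mathbf{k} \cdot Q_{i:i+h-1} + b) \tag{6}$$

where $Q_{i:i+h-1}$ is a window of $h$ token representations, $f$ is a nonlinear activation function, and $\mathbf{k}$ and $b$ are learnt parameters. Applying this filter to every possible window $Q_{i:i+h-1}$ yields a feature map:

$$\mathbf{c} = [c_1, c_2, \ldots, c_{l-h+1}] \tag{7}$$

where $\mathbf{c} \in \mathbb{R}^{l-h+1}$. We use $j$ filters of different sizes in parallel in the same convolution layer, so $j$ window sizes are applied at the same time. Formally:

$$\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_j] \tag{8}$$

We then apply a max-over-time pooling operation to capture the most important feature of each feature map:

$$\hat{\mathbf{C}} = [\max(\mathbf{c}_1), \max(\mathbf{c}_2), \ldots, \max(\mathbf{c}_j)] \tag{9}$$

and thus we get the final feature map of the input sequence.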
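The convolution and max-over-time pooling steps can be illustrated with a small dependency-free sketch. The helper names `conv_feature_map` and `text_cnn_features`, the toy input matrix and the filter weights are all hypothetical, and ReLU is used only as an assumed example of the activation function.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


def relu(x: float) -> float:
    # Assumed activation for this sketch.
    return max(0.0, x)


def conv_feature_map(Q: Matrix, k: Matrix, b: float,
                     act: Callable[[float], float]) -> List[float]:
    """Slide a filter of h rows over the l x d matrix Q (one value per window)."""
    h = len(k)
    feature_map = []
    for i in range(len(Q) - h + 1):
        s = b
        for r in range(h):
            s += sum(kv * qv for kv, qv in zip(k[r], Q[i + r]))
        feature_map.append(act(s))
    return feature_map


def text_cnn_features(Q: Matrix, filters: List[Tuple[Matrix, float]],
                      act: Callable[[float], float] = relu) -> List[float]:
    """Max-over-time pooling: keep the largest activation of each feature map."""
    return [max(conv_feature_map(Q, k, b, act)) for k, b in filters]


# Toy input: l = 4 tokens with d = 2 embedding dimensions.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    ([[1.0, 1.0]], 0.0),               # window size h = 1
    ([[1.0, 0.0], [0.0, 1.0]], -1.0),  # window size h = 2
]
features = text_cnn_features(Q, filters)
```

Each filter contributes exactly one pooled value regardless of the sequence length, which is what lets the same CNN process the dialogue history and the two imagined responses even though their lengths differ.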

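We apply the same CNNs to get the feature maps of $X$, $R_{agent}$ and $R_{user}$:

$$\begin{aligned}
\hat{\mathbf{C}}_{X} &= \mathrm{TextCNNs}(X, T) \\
\hat{\mathbf{C}}_{agent} &= \mathrm{TextCNNs}(R_{agent}) \\
\hat{\mathbf{C}}_{user} &= \mathrm{TextCNNs}(R_{user})
\end{aligned} \tag{10}$$

where the function TextCNNs() follows equations (6) to (9). Then we have two possible dialogue paths, $X$ with $R_{agent}$ and $X$ with $R_{user}$, with representations $D_{agent}$ and $D_{user}$:

$$D_{agent} = [\hat{\mathbf{C}}_{X}; \hat{\mathbf{C}}_{agent}], \qquad D_{user} = [\hat{\mathbf{C}}_{X}; \hat{\mathbf{C}}_{user}] \tag{11}$$

The arbitrator then calculates the probability of the two possible dialogue paths:

$$P = \mathrm{softmax}(W_a [D_{agent}; D_{user}] + b_a) \tag{12}$$

Through the learnt parameters $W_a$ and $b_a$, we get a two-dimensional probability distribution $P$, in which the most reasonable response has the maximum probability. This also indicates whether the agent should wait or not.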

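The total loss function of the whole arbitrator module is the negative log likelihood of the probability of choosing the correct action:

$$\mathcal{L}_{arb} = -\sum_{i=1}^{N} \log P_i(y_i) \tag{13}$$

where $N$ is the number of samples, $P_i$ is the arbitrator’s predicted distribution for the $i$-th sample, and $y_i$ is the ground truth label of the $i$-th sample.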

The arbitrator modules based on Bi-GRU and BERT are implemented similarly to the TextCNNs-based one.
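The final decision step can be sketched independently of the encoder. The sketch below assumes the arbitrator has already produced one score per dialogue path, turns the two scores into a distribution with a softmax, picks the action, and computes the negative log likelihood of the correct actions. The function names, the scores and the index convention (0 for the user path, i.e. wait; 1 for the agent path, i.e. reply) are assumptions made for illustration.

```python
import math
from typing import List, Tuple

WAIT, REPLY = 0, 1  # assumed indices: user path = wait, agent path = reply


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def arbitrate(user_path_score: float, agent_path_score: float) -> Tuple[int, List[float]]:
    """Return the chosen action and the two-dimensional distribution."""
    p = softmax([user_path_score, agent_path_score])
    action = REPLY if p[REPLY] > p[WAIT] else WAIT
    return action, p


def arbitrator_nll(batch_probs: List[List[float]], labels: List[int]) -> float:
    """Negative log likelihood of choosing the correct action, summed over samples."""
    return -sum(math.log(p[y]) for p, y in zip(batch_probs, labels))


action, probs = arbitrate(user_path_score=2.0, agent_path_score=0.0)
loss = arbitrator_nll([[0.5, 0.5]], [REPLY])  # equals ln 2 for a uniform guess
```

Here the user path scores higher, so the sketch chooses to wait, which mirrors the arbitrator selecting the user imaginator's utterance as the more plausible continuation.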

Datasets                  MultiWoz                    DailyDialogue
                          Train     Valid    Test     Train     Valid    Test
Vocabulary Size           2443                        6219
Dialogues                 8423      1000     1000     1118      1000     1000
Avg. Turns/Dialogue       6.32      6.97     6.98     4.09      4.21     4.03
Avg. Split User Turns     1.89      1.92     1.94     2.09      2.12     2.12
Avg. Utterance Length     10.54     10.7     10.56    8.71      8.54     8.75
Avg. Agent’s Utterance    14.43     14.78    14.69    12.04     11.81    12.17
Avg. User’s Utterance     6.18      6.28     6.17     5.91      5.87     5.96
Agent Wait Samples        53249     6970     6983     41547     3846     3689
Agent Reply Samples       47341     6410     6573     49540     4717     4510
Table 1: Datasets Statistics. Note that the statistics are based on the modified dataset described in Section 5.2

Dataset                                                MultiWoz                      DailyDialogue
Model                       Imaginator                 Agent   User    Arbitrator    Agent   User    Arbitrator
Baseline: TextCNNs          (none)                     N/A     N/A     77.68         N/A     N/A     75.79
LSTM                        Agent Imaginator           11.77   0.80                  4.51    0.61
                            User Imaginator            0.3     8.87    80.04         0.15    8.70    76.37
LSTM+Attn.                  Agent Imaginator           12.47   0.72                  19.19   0.60
                            User Imaginator            0.24    9.71    80.75         0.26    24.52   79.02
LSTM (with GLOVE) + Attn.   Agent Imaginator           13.37   0.67                  19.01   0.67
                            User Imaginator            0.51    10.61   80.38         0.21    24.65   78.56
Table 2: Results of the different imaginators’ generation performance (Agent and User columns, in BLEU score) and the accuracy score on the same TextCNNs based arbitrator (Arbitrator column, one value per imaginator pair). Better results between imaginators are in BOLD and best results on datasets are in RED.

Dataset         MultiWoz    DailyDialogue
Random          51.51       55.00
Bi-GRU          79.12       75.23
ITA-GRU         82.03       77.80
TextCNNs        77.68       75.79
ITA-TextCNN     80.75       79.02
BERT            80.75       78.68
ITA-BERT        82.73       79.35
Table 3: Accuracy results on the two datasets. Better results between baselines and corresponding ITA models are in BOLD and best results on datasets are in RED. The Random result is the accuracy of a script that makes random decisions.

5 Experimental Setup

5.1 Datasets

As the proposed approach mainly concentrates on human-computer interaction, we select and modify two datasets of very different styles to test the performance of our method. One is the task-oriented dialogue dataset MultiWoz 2.0 (http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/) and the other is the chitchat dataset DailyDialogue (http://yanran.li/dailydialog.html). Both datasets are collected from human-to-human conversations. We evaluate and compare the results with the baseline methods in multiple dimensions. Table 1 shows the statistics of the datasets.

  • MultiWOZ 2.0 [budzianowski-etal-2018-multiwoz]. The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully-labeled collection of human-human written conversations. Compared with previous task-oriented dialogue datasets, e.g. DSTC 2 [henderson-etal-2014-second] and KVR [DBLP:conf/sigdial/EricKCM17], it is a much larger multi-turn conversational corpus: it is at least one order of magnitude larger than all previous annotated task-oriented corpora, with dialogues spanning several domains and topics.

  • DailyDialogue [li-etal-2017-dailydialog]. DailyDialogue is a high-quality multi-turn dialogue dataset that contains conversations about daily life. In this dataset, humans often first respond to the previous context and then propose their own questions and suggestions. In this way, people show their attention to others’ words and their willingness to continue the conversation. Compared to the task-oriented dialogue datasets, the speakers’ behavior is more unpredictable and complex for the arbitrator.

5.2 Datasets Modification

Dialogue History
    User: actually
    User: can you suggest [value_count] of them
    User: can i get their contact info as well
    Agent: sure , i would suggest the [restaurant_name] at [restaurant_address] . you can reach them at [restaurant_phone] . i could reserve it for you
    User: no
    User: that ‘s ok
    User: i can take it from here
Ground Truth
    User: thank,you for all your help
Imaginator Prediction
    Agent imaginator: would you like me to book it for you
    User imaginator: thanks for all your help
Arbitrator Selection
    User imaginator
Table 4: An example of the imaginators’ generation and the arbitrator’s selection.

Because the task we concentrate on is different from traditional ones, we modify the datasets with the following steps to make them fit our problem and real-life conversation:

  • Drop Slots and Values. For task-oriented dialogue, slot labels are important for guiding the system to complete a specific task. However, those labels and the exact values from the ontology files do not essentially benefit our task. So we replace all specific values with a slot placeholder in the preprocessing step.

  • Split Utterances. Existing datasets concentrate on the dialogue content and combine multiple sentences into one utterance per turn when gathering the data. In this step, we randomly split each combined utterance into multiple utterances at punctuation marks, using a fixed probability to decide whether the preprocessing program should split at a given sentence.

  • Add Turn Tag. We add turn tags, subturn tags and role tags to each split and original sentence to (1) label the speaker role and dialogue turns and (2) tag the ground truth for training and testing the supervised baselines and our model.

Finally, we obtain the modified datasets, which imitate real-life human chatting behavior as shown in Figure 1. Our datasets and code (https://github.com/mumeblossom/ITA) will be released to the public for further research in both academia and industry.
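The splitting and tagging steps of Section 5.2 could look roughly like the sketch below. It is not the released preprocessing code: the function names, the punctuation set, the regular expression and the dictionary layout of the tagged output are all assumptions, and slot replacement is omitted because it depends on the dataset's ontology files.

```python
import random
import re
from typing import Dict, List


def split_utterance(text: str, split_prob: float, rng: random.Random) -> List[str]:
    """Randomly split a combined utterance at punctuation marks.

    Each punctuation boundary becomes a split point with probability
    `split_prob`; otherwise the neighbouring pieces stay merged.
    """
    pieces = [p.strip() for p in re.findall(r"[^,.!?]+[,.!?]*", text) if p.strip()]
    if not pieces:
        return []
    result = [pieces[0]]
    for piece in pieces[1:]:
        if rng.random() < split_prob:
            result.append(piece)
        else:
            result[-1] = result[-1] + " " + piece
    return result


def tag_turn(turn: int, speaker: str, pieces: List[str]) -> List[Dict]:
    """Attach turn, role and subturn tags to every piece of one turn."""
    return [{"turn": turn, "speaker": speaker, "subturn": j, "text": p}
            for j, p in enumerate(pieces)]


rng = random.Random(0)
text = "no, that's ok. i can take it from here"
always = split_utterance(text, 1.0, rng)   # split at every boundary
never = split_utterance(text, 0.0, rng)    # keep the utterance whole
tagged = tag_turn(2, "user", always)
```

With a split probability of 1.0 the example yields three user messages in one turn, similar to the last user turn in Table 4; intermediate probabilities produce a mix of split and unsplit turns.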

5.3 Evaluation Method

To compare with the baselines in multiple dimensions and test the model’s performance, we use the overall Bilingual Evaluation Understudy (BLEU) score [DBLP:conf/acl/PapineniRWZ02] to evaluate the imaginators’ generation performance. For the arbitrator, we use classification accuracy, i.e. the ratio of correctly classified samples to all samples.
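The accuracy metric is simple enough to state in a few lines. The sketch below also computes the share of the larger class in the test splits, using the wait and reply sample counts from Table 1; the helper names are hypothetical. The resulting shares (about 51.5% and 55.0%) are close to the Random row of Table 3, which is offered here only as an observation about class balance, not as a description of how the authors' random script works.

```python
from typing import Dict, List


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Percentage of samples whose predicted action equals the gold action."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def majority_share(counts: Dict[str, int]) -> float:
    """Accuracy (in %) of always predicting the more frequent class."""
    return 100.0 * max(counts.values()) / sum(counts.values())


# Test-split sample counts copied from Table 1.
test_counts = {
    "MultiWoz": {"wait": 6983, "reply": 6573},
    "DailyDialogue": {"wait": 3689, "reply": 4510},
}
shares = {name: majority_share(c) for name, c in test_counts.items()}
toy_accuracy = accuracy([0, 1, 1, 0], [0, 1, 0, 0])  # 3 of 4 correct
```

Because both test sets are close to balanced, accuracy values around 50 to 55% carry little information, and the gap between those values and the 75 to 83% range of the trained models is the meaningful signal.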

5.4 Baselines and Training Setup

The hyper-parameter settings adopted in the baselines and our model are the best-practice settings for each training set. All models are tested with various hyper-parameter settings to get their best performance. The baseline models are Bidirectional Gated Recurrent Units (Bi-GRUs) [chung2014empirical], TextCNNs [kim2014convolutional] and BERT [devlin2018bert].

6 Experimental Results and Analysis

6.1 Results

In Table 2, we show the generation abilities of the different imaginators and their performance with the same TextCNNs-based arbitrator. Firstly, we gathered the generation results of the agent and user imaginators based on LSTM, LSTM-attention and LSTM-attention with GLOVE pretrained word embeddings. According to the BLEU metric, the latter two models achieve higher but similar results. Secondly, when the arbitrator is fixed to the TextCNNs model, the latter two also get similar accuracy and significantly outperform the others, including the TextCNNs baseline.

The performance of different arbitrators with the same LSTM-attention imaginators is shown in Table 3. From those results, we can directly compare each ITA model with its corresponding baseline. The imaginators with the BERT-based arbitrator achieve the best results on both datasets, while all ITA models beat their corresponding baseline models.

We also present an example of how our model runs in Table 4. The imaginators predict the agent’s and the user’s next utterances from the dialogue history (shown under Imaginator Prediction), and the arbitrator then selects the user imaginator’s prediction, which is more suitable given the dialogue history. It is worth noting that the agent imaginator also generates a high-quality sentence if only the generation quality is considered. However, given the dialogue history, it is not a good choice, since the agent already expressed the same meaning in its last turn.

6.2 Analysis

6.2.1 Imaginators Benefit the Performance

From Table 3, we can see that not only does our BERT-based model get the best results on both datasets, but the other two ITA models also clearly beat their corresponding baselines. Even the TextCNNs-based ITA model matches or beats all baselines on both datasets (it ties the BERT baseline at 80.75 on MultiWoz).

Table 2 reports the experimental results on the MultiWOZ dataset. The LSTM-based agent imaginator gets a BLEU score of 11.77 on agent samples, in which the ground truth is the agent’s utterance, and 0.80 on user samples. Meanwhile, the user imaginator gets a BLEU score of 0.3 on agent samples and 8.87 on user samples. Similar results are shown in the other imaginators’ experiments. Although these comparisons are unfair to some extent, since we do not have the agent’s and the user’s real utterances at the same time under the same dialogue history, these results show that the imaginators did learn the speaking styles of the agent and the user respectively. So the generation of the suitable imaginator will be more similar to the ground truth, as in the example shown in Table 4, which means this response is more semantically suitable given the dialogue history.

If we fix the agent and user imaginators’ model to the LSTM-attention model, the arbitrators achieve different performance with different models, as shown in Table 3. As expected, ITA models beat their base models by roughly 2 to 3 points of accuracy in most settings (the exception is BERT on DailyDialogue, with a gain of 0.67 points), and the ITA-BERT model beats all other ITA models.
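The size of the improvement can be checked directly from Table 3. The short calculation below subtracts each baseline's accuracy from that of its ITA counterpart; the dictionary layout is only a convenience for this sketch, while the numbers are copied from the table.

```python
# Accuracy values copied from Table 3 as (MultiWoz, DailyDialogue).
table3 = {
    "GRU":     {"base": (79.12, 75.23), "ita": (82.03, 77.80)},
    "TextCNN": {"base": (77.68, 75.79), "ita": (80.75, 79.02)},
    "BERT":    {"base": (80.75, 78.68), "ita": (82.73, 79.35)},
}

# Gain of each ITA model over its baseline, in accuracy points.
gains = {
    name: tuple(round(ita - base, 2) for base, ita in zip(row["base"], row["ita"]))
    for name, row in table3.items()
}

for name, (multiwoz_gain, dailydialogue_gain) in gains.items():
    print(f"{name}: +{multiwoz_gain:.2f} (MultiWoz), +{dailydialogue_gain:.2f} (DailyDialogue)")
```

The gains are 2.91 and 2.57 points for GRU, 3.07 and 3.23 for TextCNN, and 1.98 and 0.67 for BERT, so the benefit of the imaginators shrinks as the baseline arbitrator gets stronger, at least on these two datasets.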

So from all the results, we can conclude that the imaginators significantly help the arbitrator in predicting the dialogue interaction behavior by providing the semantic information of possible future agent and user responses.

6.2.2 Relation of Imaginators and Arbitrator’s Performance

As shown in the DailyDialogue part of Table 2, the attention mechanism helps in learning the generation task. The LSTMs-attention and LSTMs-attention-GLOVE based imaginators get BLEU scores above 19 (agent) and 24 (user) on their corresponding targets, while the LSTMs without attention get only 4.51 and 8.70. These results also affect the arbitrator results. The imaginators with the attention mechanism reach accuracy scores of 79.02 and 78.56, clearly better than the others. The same evidence exists in the results on MultiWoz: all imaginators get similar generation performance, so the arbitrators get similar accuracy scores.

From those results, we can conclude that there is a positive correlation between the performance of the imaginators and that of the arbitrator. However, some problems remain. It is not easy to evaluate dialogue generation performance. In the results on MultiWoz, we can see that the LSTMs-attention-GLOVE based ITA has slightly higher BLEU scores than the LSTMs-attention based ITA, but the arbitrator results are the opposite. This may indicate that (1) when the imaginators’ performance is high enough, the arbitrator’s performance becomes stable and (2) the BLEU score does not perfectly reflect the contribution to the arbitrator. We leave these hypotheses to future work.

7 Conclusion

We are the first to address an interaction problem: whether the dialogue model should wait for the end of the user's utterance or reply directly, in order to simulate the user’s real-life conversation behavior. We propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to deal with it. Our model introduces imagined semantic information about possible future utterances for prediction. We modified two popular dialogue datasets to fit the real situation. It is reasonable that this additional information helps the arbitrator, even though it is imagined.